Topic: Searching a 'large' file on SD?
UK
Jr. Member

Hello all!!

Has anyone previously had a requirement similar to what I am trying to do?

I have a file (roughly 27MB) which contains approximately 260,000 lines of variable-length, colon-separated fields.

It has a defined structure, as follows:

Int(6):Char(6):Char(60):Char(4):Int(2):Float(3.1):Int(2):Float(3.1):Int(7):Int(7):Char(1):Char(2):Char(20):Char(60):Char(3):Char(11):Char(1):Int(3):Int(3):Int(3)

Since the fields are variable length, the widths shown are maximums.

Here is a snippet of the file:

Code:
258316:SO0003:Ysgubor-wen Ho:SO00:51:43.2:3:26.4:203500:300500:W:RH:Rho Cyn Taf:Rhondda,Cynon,Taff:X:01-MAR-1993:I:170:0:0
258317:SN6895:Ysgubor-y-coed:SN68:52:32.5:3:56.3:295500:268500:W:CE:Cered:Ceredigion:X:01-MAR-1993:I:135:0:0
258318:SO0873:Ysgwd-ffordd:SO06:52:21.1:3:20.6:273500:308500:W:PW:Powys:Powys:X:01-MAR-1998:U:136:147:0
258319:SJ1930:Ysgwennant:SJ02:52:51.9:3:11.8:330500:319500:W:PW:Powys:Powys:X:01-MAR-1993:I:125:0:0
258320:SO0537:Ysgwydd Hwch:SO02:52:1.6:3:22.6:237500:305500:W:PW:Powys:Powys:H:21-MAY-2007:U:160:0:0
258321:SO1200:Ysgwydd-gwyn-isaf Fm:SO00:51:41.8:3:16:200500:312500:W:CF:Caer:Caerphilly:FM:01-MAR-1993:I:171:0:0
258322:SO3113:Ysgyrd Fach:SO20:51:48.9:2:59.6:213500:331500:W:MM:Monm:Monmouthshire:H:01-MAR-1993:I:161:0:0
258323:SO3317:Ysgyryd Fawr:SO20:51:51.1:2:57.9:217500:333500:W:MM:Monm:Monmouthshire:H:01-MAR-1993:I:161:0:0
258324:SS5598:Yspitty:SS48:51:40:4:5.4:198500:255500:W:CT:Carm:Carmarthenshire:O:01-MAR-1993:I:159:0:0
258325:SN4826:Yspitty Ifan:SN42:51:55:4:12.2:226500:248500:W:CT:Carm:Carmarthenshire:X:01-MAR-1993:I:146:0:0
258326:SM7923:Ystafelloedd:SM62:51:52:5:12.2:223500:179500:W:PB:Pemb:Pembrokeshire:X:01-MAR-1993:I:157:0:0
258327:SN7608:Ystalyfera:SN60:51:45.7:3:47.4:208500:276500:W:NP:Nth Pt Talb:Neath Port Talbot:O:01-MAR-1993:I:160:0:0

I need to search field 3 (the place name) as quickly as possible, possibly narrowing the results using fields 13 and/or 14.

Clearly, starting at the beginning and searching through to the end would be extremely time-consuming, especially if the search were to yield nothing.
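
Pulling a single field out of one of these colon-separated lines could look something like the sketch below. It is only an illustration: the name `get_field` and its signature are made up for this example, and it assumes a whole line has already been read into a NUL-terminated buffer.

```c
#include <stddef.h>
#include <string.h>

/* Copy the n-th (1-based) colon-separated field of `line` into `out`.
   Returns 1 on success, 0 if the line has fewer than n fields.
   The field is truncated if it does not fit in `out_size` bytes.
   (get_field is a hypothetical helper, not from any library.) */
int get_field(const char *line, int n, char *out, size_t out_size)
{
    const char *start = line;
    size_t len;

    if (n < 1 || out_size == 0) return 0;

    /* Skip n-1 separators to reach the start of field n. */
    for (int i = 1; i < n; i++) {
        start = strchr(start, ':');
        if (start == NULL) return 0;
        start++;                      /* step past the ':' */
    }

    /* The field ends at the next ':' or at the end of the line. */
    len = strcspn(start, ":\r\n");
    if (len >= out_size) len = out_size - 1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Calling `get_field(line, 3, buf, sizeof buf)` on one of the lines above would give the place name, and fields 13 and 14 can be fetched the same way for the sub-search.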

To complicate matters further, the file contains accented characters, which do not 'play well' with toupper() and tolower().
For example:

Code:
30:NC3249:A' Chèir Ghorm:NC24:58:24.1:4:52:949500:232500:W:HL:Highld:Highland:X:23-JUN-2008:U:9:0:0
31:NG2605:A' Chill:NG20:57:3.5:6:30.7:805500:126500:W:HL:Highld:Highland:O:01-MAR-1993:I:39:0:0
32:NC2105:A' Chìoch:NC20:58:.2:5:1.2:905500:221500:W:HL:Highld:Highland:X:01-MAR-1993:I:15:0:0
33:NC5729:A' Chioch:NC42:58:13.9:4:25.6:929500:257500:W:HL:Highld:Highland:X:01-FEB-1998:I:16:0:0
34:NG8144:A' Chioch:NG84:57:26.3:5:38.5:844500:181500:W:HL:Highld:Highland:H:01-FEB-1998:I:24:0:0
35:NH0509:A' Chioch:NH00:57:8.1:5:12.9:809500:205500:W:HL:Highld:Highland:X:01-AUG-1994:I:33:0:0
36:NH1115:A' Chìoch:NH00:57:11.5:5:7.2:815500:211500:W:HL:Highld:Highland:H:01-MAR-1993:I:34:0:0
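
One way around the toupper()/tolower() problem is to fold each character to an unaccented uppercase letter before comparing. The sketch below ASSUMES the file is stored in a single-byte ISO-8859-1 (Latin-1) encoding, which the thread does not confirm; a UTF-8 file would need different handling. The names `fold_char` and `folded_cmp` are invented for the example.

```c
#include <ctype.h>

/* Base letters for Latin-1 codes 0xC0..0xDE; the lowercase block
   0xE0..0xFE has the same layout. 0 means "leave unchanged".
   ASSUMPTION: the data is ISO-8859-1, one byte per character. */
static const char LATIN1_BASE[31] = {
    'A','A','A','A','A','A','A','C','E','E','E','E','I','I','I','I',
    'D','N','O','O','O','O','O', 0 ,'O','U','U','U','U','Y', 0
};

/* Fold one byte to an unaccented uppercase ASCII letter where possible. */
unsigned char fold_char(unsigned char c)
{
    if (c < 0x80) return (unsigned char)toupper(c);
    if (c == 0xFF) return 'Y';               /* y with diaeresis */
    if (c >= 0xC0 && c != 0xDF) {
        char base = LATIN1_BASE[(c - 0xC0) & 0x1F];
        if (base != 0) return (unsigned char)base;
    }
    return c;                                /* no sensible base letter */
}

/* Compare two strings ignoring case and accents.
   Returns <0, 0 or >0 in the same way as strcmp(). */
int folded_cmp(const char *a, const char *b)
{
    for (;;) {
        unsigned char ca = fold_char((unsigned char)*a++);
        unsigned char cb = fold_char((unsigned char)*b++);
        if (ca != cb) return (int)ca - (int)cb;
        if (ca == 0) return 0;
    }
}
```

With this folding, the accented and unaccented spellings of "A' Chioch" in the listing above compare as equal, which also appears to match the order the file is sorted in.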

The file is sorted numerically on field 1 (i.e. 1 to 258422). Field 3 is sorted alphabetically, so field 2 appears in effectively random order, as do all the other fields.

What I need is some sort of case-insensitive 'closest match' style search.

There is no possibility I can break down the file into 'A'  'B'  'C' on field 3 which was my first idea....

I have already spent a significant amount of time on this problem myself, and basically achieved sweet Fanny Adams! Any and all help would be most gratefully received and appreciated!!

Any ideas please?

This is one 'small' problem in a MUCH larger overall project I have brewing; further details to be announced once more progress has been made!

Regards and thanks,

Graham
« Last Edit: August 21, 2013, 12:06:00 pm by ghlawrence2000 »

Global Moderator
UK
Brattain Member

Speed safety cameras?
Would it be simpler to reorganise the data and have separate index files, based on place-name/lat-long/whatever?
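
As a rough illustration of the index idea, the sketch below scans the data file once and records the byte offset at which each initial letter of field 3 first appears. It relies on field 3 being sorted alphabetically, uses standard C stdio rather than the Arduino SD library, and assumes no line is longer than 511 bytes. The function names and the `NO_OFFSET` marker are made up for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NO_OFFSET (-1L)   /* hypothetical marker: letter never seen */

/* Return a pointer to the start of field 3 in `line`, or NULL. */
static const char *field3(const char *line)
{
    const char *p = strchr(line, ':');
    if (p != NULL) p = strchr(p + 1, ':');
    return p != NULL ? p + 1 : NULL;
}

/* Scan the data file once and record, for each initial letter A..Z of
   field 3, the byte offset of the first line starting with that letter.
   Letters that never occur keep NO_OFFSET.
   Returns the number of lines read.
   ASSUMPTION: the file is opened in binary mode and sorted on field 3. */
long build_letter_index(FILE *fp, long offsets[26])
{
    char line[512];   /* assumed longer than any record */
    long count = 0;

    for (int i = 0; i < 26; i++) offsets[i] = NO_OFFSET;
    rewind(fp);

    for (;;) {
        long pos = ftell(fp);         /* offset of the line about to be read */
        if (fgets(line, sizeof line, fp) == NULL) break;
        count++;
        const char *name = field3(line);
        if (name == NULL) continue;   /* malformed line: skip it */
        int c = toupper((unsigned char)name[0]);
        if (c >= 'A' && c <= 'Z' && offsets[c - 'A'] == NO_OFFSET)
            offsets[c - 'A'] = pos;
    }
    return count;
}
```

The 26 offsets could be written to a small index file once, so that a later search only has to seek to the start of the right letter and scan from there.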

UK
Jr. Member

Quote
Speed safety cameras?
Would it be simpler to reorganise the data and have separate index files, based on place-name/lat-long/whatever?

Quote
There is no possibility I can break down the file into 'A'  'B'  'C' on field 3 which was my first idea....

I did try that idea long ago...  EDIT: I am going to revisit that idea, since I do think it would help significantly with search speed!!

Thanks anyway.

Graham
« Last Edit: August 21, 2013, 12:39:49 pm by ghlawrence2000 »

Idaho, US
God Member

EDIT: It's a "binary" search. This is regular C code for searching an array, but it could easily be adapted to work with a file on an SD card on an Arduino:

http://www.c.happycodings.com/Sorting_Searching/code3.html
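
Because the lines here are of variable length, a binary search over the file has to work on byte offsets and then resynchronise to the start of a line after each seek. The following is a minimal sketch of that idea, not the code from the link. It uses standard C stdio; on an Arduino the SD library's file `seek()`, `position()` and `read()` calls would presumably take the place of `fseek()`, `ftell()` and `fgetc()`. It assumes the file is sorted on field 3 in the same order the comparison uses (plain ASCII, case ignored, which will not hold for the accented names), and that no line exceeds 511 bytes. All function names are invented for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Compare field 3 of `line` with `key`, ignoring ASCII case.
   Returns <0, 0 or >0 in the same way as strcmp(). */
static int compare_name(const char *line, const char *key)
{
    const char *p = strchr(line, ':');
    if (p != NULL) p = strchr(p + 1, ':');
    if (p == NULL) return -1;         /* malformed line: treat as "before" */
    for (p++;; p++, key++) {
        int a = (*p == ':' || *p == '\n' || *p == '\r')
                    ? 0 : toupper((unsigned char)*p);
        int b = toupper((unsigned char)*key);
        if (a != b) return a - b;
        if (a == 0) return 0;
    }
}

/* Return the byte offset of the first line whose field 3 sorts at or
   after `key`, or the file size if every name sorts before it.
   ASSUMPTION: binary-mode stream, sorted on field 3. */
long find_first_at_or_after(FILE *fp, const char *key)
{
    char line[512];                   /* assumed longer than any record */
    long lo = 0, hi;

    if (fseek(fp, 0L, SEEK_END) != 0) return -1L;
    hi = ftell(fp);

    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        long pos;
        int c;

        /* Move to the first line start at or after mid. */
        if (mid == 0) {
            fseek(fp, 0L, SEEK_SET);
        } else {
            fseek(fp, mid - 1, SEEK_SET);
            while ((c = fgetc(fp)) != EOF && c != '\n') { }
        }
        pos = ftell(fp);

        if (pos >= hi || fgets(line, sizeof line, fp) == NULL) {
            hi = mid;                 /* no line starts in [mid, hi) */
        } else if (compare_name(line, key) < 0) {
            lo = ftell(fp);           /* answer lies after this line */
        } else {
            hi = mid;                 /* this line or an earlier one */
        }
    }
    return lo;
}
```

Seeking to the returned offset and reading forward gives either an exact match or the nearest following entry, so a partial name such as "Ysg" would land on the first entry beginning with those letters. For a file of around 27MB this needs roughly 25 seeks rather than a scan of all 260,000 lines.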
« Last Edit: August 23, 2013, 02:42:30 pm by tylernt »

UK
Jr. Member

It was a journey and a half!!! However, brute force and ignorance, with a 'little' help from the Arduino, managed to spit out sections of the original file as A.txt, B.txt, C.txt etc.

A further round of processing on the Arduino then dumped all the unnecessary fields and spat out A1.txt, B1.txt etc. Job done.

Arduino is better than 32-bit MS Excel!!!

A further bonus of using the Arduino: it is MUCH easier to filter out awkward characters with accents, graves, umlauts etc. Excel had some strange ideas about suitable replacements!!

The original reason for my post was the need for a rapid method of searching a 30MB file. I now have 25 files, one for each letter of the alphabet excluding 'X', and the biggest file is now only 3MB. Searching that sequentially is acceptably rapid on a Due!
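
For illustration only, picking the right per-letter file and scanning it sequentially might look roughly like this. The layout of the reduced files is not given in the thread, so the sketch ASSUMES the place name is the first field on each line; the function names are invented, and standard C stdio stands in for the Arduino SD library.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build the per-letter file name for a query, e.g. "ystalyfera" -> "Y1.txt".
   Returns 0 if the query does not start with a letter A..Z. */
int letter_file_name(const char *query, char *out, size_t out_size)
{
    int c = toupper((unsigned char)query[0]);
    if (c < 'A' || c > 'Z' || out_size < 7) return 0;
    snprintf(out, out_size, "%c1.txt", c);
    return 1;
}

/* Return 1 if `text` starts with `prefix`, ignoring ASCII case. */
static int starts_with_nocase(const char *text, const char *prefix)
{
    while (*prefix != '\0') {
        if (toupper((unsigned char)*text) != toupper((unsigned char)*prefix))
            return 0;
        text++;
        prefix++;
    }
    return 1;
}

/* Scan an open per-letter file from the start and copy the first line whose
   leading field starts with `query` into `out`. Returns 1 if found, else 0.
   ASSUMPTION: the place name is the first field of the reduced file. */
int scan_for_name(FILE *fp, const char *query, char *out, size_t out_size)
{
    rewind(fp);
    while (fgets(out, (int)out_size, fp) != NULL) {
        if (starts_with_nocase(out, query)) {
            out[strcspn(out, "\r\n")] = '\0';   /* strip the line ending */
            return 1;
        }
    }
    return 0;
}
```

The worst case is still a full pass over the largest file (about 3MB), which matches the "acceptably rapid" sequential search described above.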

Thanks for the help anyway!!