
Topic: Searching a 'large' file on SD?

ghlawrence2000

Aug 21, 2013, 06:56 pm Last Edit: Aug 21, 2013, 07:06 pm by ghlawrence2000 Reason: 1
Hello all!!

I was wondering whether anyone has previously had a requirement similar to mine?

I have a file (27MB in round figures) which contains approx 260,000 lines of variable-length, colon-separated fields.

It has a defined structure as follows :-

Int(6):Char(6):Char(60):Char(4):Int(2):Float(3.1):Int(2):Float(3.1):Int(7):Int(7):Char(1):Char(2):Char(20):Char(60):Char(3):Char(11):Char(1):Int(3):Int(3):Int(3)

As previously mentioned these are maximums...

A snippet of the mentioned file here :-

Code:
258316:SO0003:Ysgubor-wen Ho:SO00:51:43.2:3:26.4:203500:300500:W:RH:Rho Cyn Taf:Rhondda,Cynon,Taff:X:01-MAR-1993:I:170:0:0
258317:SN6895:Ysgubor-y-coed:SN68:52:32.5:3:56.3:295500:268500:W:CE:Cered:Ceredigion:X:01-MAR-1993:I:135:0:0
258318:SO0873:Ysgwd-ffordd:SO06:52:21.1:3:20.6:273500:308500:W:PW:Powys:Powys:X:01-MAR-1998:U:136:147:0
258319:SJ1930:Ysgwennant:SJ02:52:51.9:3:11.8:330500:319500:W:PW:Powys:Powys:X:01-MAR-1993:I:125:0:0
258320:SO0537:Ysgwydd Hwch:SO02:52:1.6:3:22.6:237500:305500:W:PW:Powys:Powys:H:21-MAY-2007:U:160:0:0
258321:SO1200:Ysgwydd-gwyn-isaf Fm:SO00:51:41.8:3:16:200500:312500:W:CF:Caer:Caerphilly:FM:01-MAR-1993:I:171:0:0
258322:SO3113:Ysgyrd Fach:SO20:51:48.9:2:59.6:213500:331500:W:MM:Monm:Monmouthshire:H:01-MAR-1993:I:161:0:0
258323:SO3317:Ysgyryd Fawr:SO20:51:51.1:2:57.9:217500:333500:W:MM:Monm:Monmouthshire:H:01-MAR-1993:I:161:0:0
258324:SS5598:Yspitty:SS48:51:40:4:5.4:198500:255500:W:CT:Carm:Carmarthenshire:O:01-MAR-1993:I:159:0:0
258325:SN4826:Yspitty Ifan:SN42:51:55:4:12.2:226500:248500:W:CT:Carm:Carmarthenshire:X:01-MAR-1993:I:146:0:0
258326:SM7923:Ystafelloedd:SM62:51:52:5:12.2:223500:179500:W:PB:Pemb:Pembrokeshire:X:01-MAR-1993:I:157:0:0
258327:SN7608:Ystalyfera:SN60:51:45.7:3:47.4:208500:276500:W:NP:Nth Pt Talb:Neath Port Talbot:O:01-MAR-1993:I:160:0:0


I need to search field 3 as quickly as possible, possibly sub-searching on fields 14 and/or 13....
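
A minimal sketch of pulling one field out of a record, assuming the fields are numbered from 1 as in the post and separated by single colons. `getField` is a hypothetical helper name, and `std::string` is assumed to be available (it is on a typical ARM toolchain such as the Due's; a char-buffer version may suit smaller boards better).

```cpp
#include <cstddef>
#include <string>

// Return the n-th colon-separated field of `line` (1-based, matching the
// field numbering used in the post). Returns an empty string when the line
// has fewer than n fields. Hypothetical helper, not from the thread.
std::string getField(const std::string& line, int n) {
    std::size_t start = 0;
    // Step over the n-1 colons that precede the wanted field.
    for (int i = 1; i < n; ++i) {
        std::size_t colon = line.find(':', start);
        if (colon == std::string::npos) return "";
        start = colon + 1;
    }
    // The field ends at the next colon, or at the end of the line.
    std::size_t end = line.find(':', start);
    if (end == std::string::npos) end = line.size();
    return line.substr(start, end - start);
}
```

With the sample lines above, `getField(line, 3)` would give the place name, and fields 13 and 14 the short and long county names.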

Clearly, starting at the beginning and searching to the end would be an extremely time-consuming process.... Especially if the search were to yield nothing....

To complicate matters further, the file contains characters which do not 'play well' with toupper() and tolower()
For example :-

Code:
30:NC3249:A' Chèir Ghorm:NC24:58:24.1:4:52:949500:232500:W:HL:Highld:Highland:X:23-JUN-2008:U:9:0:0
31:NG2605:A' Chill:NG20:57:3.5:6:30.7:805500:126500:W:HL:Highld:Highland:O:01-MAR-1993:I:39:0:0
32:NC2105:A' Chìoch:NC20:58:.2:5:1.2:905500:221500:W:HL:Highld:Highland:X:01-MAR-1993:I:15:0:0
33:NC5729:A' Chioch:NC42:58:13.9:4:25.6:929500:257500:W:HL:Highld:Highland:X:01-FEB-1998:I:16:0:0
34:NG8144:A' Chioch:NG84:57:26.3:5:38.5:844500:181500:W:HL:Highld:Highland:H:01-FEB-1998:I:24:0:0
35:NH0509:A' Chioch:NH00:57:8.1:5:12.9:809500:205500:W:HL:Highld:Highland:X:01-AUG-1994:I:33:0:0
36:NH1115:A' Chìoch:NH00:57:11.5:5:7.2:815500:211500:W:HL:Highld:Highland:H:01-MAR-1993:I:34:0:0
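
One possible way around the toupper()/tolower() problem is to fold each name to a plain lowercase ASCII key before comparing. The sketch below assumes the file is UTF-8 encoded, which the thread does not confirm; if it is really single-byte Latin-1, the same table could be indexed with `byte - 0xC0` instead. `foldForSearch` and `kBase` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Fold a name to a lowercase, accent-free key for caseless matching.
// ASSUMPTION: input is UTF-8, so accented Latin-1 letters (U+00C0..U+00FF)
// arrive as the lead byte 0xC3 followed by a byte in 0x80..0xBF.
std::string foldForSearch(const std::string& s) {
    // Base letter for each code point U+00C0..U+00FF; '.' means "no simple
    // base letter" (e.g. the multiplication sign), so the bytes are kept.
    static const char kBase[] =
        "aaaaaa.ceeeeiiii.nooooo.ouuuuy.."
        "aaaaaa.ceeeeiiii.nooooo.ouuuuy.y";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c == 0xC3 && i + 1 < s.size()) {
            unsigned char d = static_cast<unsigned char>(s[i + 1]);
            if (d >= 0x80 && d <= 0xBF && kBase[d - 0x80] != '.') {
                out += kBase[d - 0x80];  // accented letter -> base letter
                ++i;                     // consume the continuation byte
                continue;
            }
        }
        if (c < 0x80) {
            out += static_cast<char>(std::tolower(c));  // plain ASCII
        } else {
            out += s[i];  // anything else is passed through unchanged
        }
    }
    return out;
}
```

Under that assumption, "A' Chìoch" and "A' Chioch" both fold to `a' chioch`, so the two spellings compare equal.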


The file is sorted numerically on field 1, i.e. 1 - 258422. Field 3 is sorted alphabetically, so field 2 is effectively in random order, as are all the other fields.

Some sort of caseless 'closest match' style search is what I need.

There is no possibility I can break down the file into 'A'  'B'  'C' on field 3 which was my first idea....  :smiley-eek-blue: :smiley-eek: :smiley-roll-blue: :smiley-roll-sweat:

I have already spent a significant amount of time on this problem myself, and basically achieved sweet Fanny Adams! Any and all help would most graciously be received and appreciated!!

Any ideas please?

This is one 'small' problem in a MUCH larger overall project I have brewing, further details to be announced once more progress has been made!  ;) :D

Regards and thanks,

Graham

AWOL

speed safety cameras?
Would it be simpler to reorganise the data and have separate index files, based on place-name/lat-long/ whatever?

ghlawrence2000

#2
Aug 21, 2013, 07:34 pm Last Edit: Aug 21, 2013, 07:39 pm by ghlawrence2000 Reason: 1

Quote from AWOL
speed safety cameras?
Would it be simpler to reorganise the data and have separate index files, based on place-name/lat-long/ whatever?


Quote
There is no possibility I can break down the file into 'A'  'B'  'C' on field 3 which was my first idea....


I did try that idea long ago...  EDIT: I am going to re-visit that idea since I do think it would help significantly with search speed!! ;)

Thanks anyway.

Graham

tylernt

#3
Aug 23, 2013, 09:07 pm Last Edit: Aug 23, 2013, 09:42 pm by tylernt Reason: 1
EDIT: It's a "binary" search. This is regular C code for searching an array, but could easily be adapted to work with a file on an SD card on an Arduino:

http://www.c.happycodings.com/Sorting_Searching/code3.html

ghlawrence2000

It was a journey and a half!!!  :smiley-roll: However, brute force and ignorance with a 'little' help from Arduino managed to spit out sections of the original file such as A.txt, B.txt, C.txt etc.....

Then a further set of processing on Arduino, dumping all unnecessary fields and spit out A1.txt, B1.txt etc..... job done... ;) :smiley-roll-blue:

Arduino better than 32bit MS Excel !!!  ;)

Further bonus of using Arduino, MUCH easier to filter stupid characters with accents, graves and umlauts etc!! Excel got some strange ideas about suitable replacements!!

The original reason for my post was needing a rapid method of searching a 30MB file..... I now have 25 files, one for each letter of the alphabet excluding 'X', and the biggest file is now only 3MB..... Searching that sequentially is acceptably rapid on a DUE!  :D

Thanks for the help anyway!!
