[Solved] Handling megabytes of data stored on an SD card

Hello,

I am new to Arduino but have a fair amount of experience in programming in general.
I have a Teensy 4.1 with an SD card. On that card I have a JSON file (1 to 16 MB) which is basically a map: a single level of key/value pairs.
I’d like to access any value instantly (<50ms at most, <10ms ideally) by its key.
The prototype I’ve coded uses ArduinoJson to deserialize the whole document while filtering by the key.

  auto error = deserializeJson(doc, file, DeserializationOption::Filter(filter));

This works fine but takes about 3 to 6 seconds to find the value since it probably needs to read the whole file.
What would be a better approach for this kind of task? I understand that JSON is not the most optimized storage format for that, but I also can’t load a map of a few MB in memory (because I do not have MBs of memory).
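For reference, the full filter setup in my prototype looks roughly like this (simplified; the file name, document sizes and key handling are just examples):

  #include <ArduinoJson.h>
  #include <SD.h>

  // Simplified version of the prototype: only the searched key survives the parse.
  const char* lookup(const char* key) {
    File file = SD.open("map.json");   // placeholder file name

    StaticJsonDocument<64> filter;     // tiny document describing what to keep
    filter[key] = true;                // keep only this key

    static StaticJsonDocument<256> doc;
    auto error = deserializeJson(doc, file, DeserializationOption::Filter(filter));
    file.close();

    return error ? nullptr : doc[key].as<const char*>();
  }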

Any hint on a better solution to that problem is welcome.
Don’t hesitate to ask any question for more details.

Thanks in advance.

Not knowing the format of the records.....
Could you prepare a kind of index file pointing into the data file?
Sometimes a "half interval" (binary search) method can cut the time if the records are sorted in ascending/descending order.
Another option is a hashing technique.

because I do not have MBs of memory

Well, the Teensy 4.1 can be fitted with an extra 16 MB of PSRAM.

Sounds like you're trying to use an SD card like you might in Linux/Windows. I thought that style of SD access required a licence for the routines which is not available for Arduinos. Maybe I am mistaken .................

hedgestock:
Any hint on a better solution to that problem is welcome.

Without knowing the typical content of a record, whether they are fixed or variable length records, the order in which they are stored or the search term you want to use I can't think of anything useful to say.

It sounds like a job for a PC or a RaspberryPi

...R

Robin2:
Without knowing the typical content of a record, whether they are fixed or variable length records, the order in which they are stored or the search term you want to use I can't think of anything useful to say.

They are variable-length records, but they probably won't exceed 100-ish characters, and I can order them when initializing, since the data would be read-only.

srnet:
Well, the Teensy 4.1 can be fitted with an extra 16 MB of PSRAM.

I wasn't aware of that, and I might look into it as a last resort solution, but I'd like to manage with the vanilla board.

Robin2:
It sounds like a job for a PC or a RaspberryPi

It does, but also not really: the project is a keyboard, and bare-metal programming on a Raspberry Pi is too much for a side project for me. Also, the Teensy seems much better suited to behave as a HID device.

Railroader:
Could You prepare a kind index file pointing into the data file?

Yes, that sounds like what I want, since my records are ordered, but I haven't yet found how to make something point into the middle of a file. From my understanding, files are just readable streams.

I've used serial reading of streams. To speed up the search, a parallel index file contains pointers into the data file, at the beginning of each record. The index file entry is then used to set a pointer into the data file and read that record. Is it possible to start reading "in the middle" of the data file?

Railroader:
I've used serial reading of streams. To speed up the search, a parallel index file contains pointers into the data file, at the beginning of each record. The index file entry is then used to set a pointer into the data file and read that record. Is it possible to start reading "in the middle" of the data file?

It's definitely acceptable to start reading in the middle of the file, since it looks something like this and is ordered:

{
  "key1": "data",
  "key2": "data",
  "key3": "data",
  "key4": "data",
  ...
  "key100000ish": "data",
}

Whether it is possible (doable) or not, I do not know; I haven't found anything yet.
Thanks for helping! ^^

Okay. Fine.
Is every key present? If so, knowing the value of the first and the last key, just use interpolation!

Something like (searchedKey - firstKeyNumber) * recordSize could be the pointer into the data file.
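Roughly like this, untested; the constants are placeholders:

  #include <SD.h>

  const uint32_t FIRST_KEY   = 1;     // number of the first key in the file
  const uint32_t RECORD_SIZE = 100;   // every record padded to this many bytes

  // Untested sketch: with consecutive numeric keys and fixed-size records,
  // the position of a record is pure arithmetic, no searching needed.
  bool readRecord(File &dataFile, uint32_t searchedKey, char *out) {
    uint32_t offset = (searchedKey - FIRST_KEY) * RECORD_SIZE;
    if (!dataFile.seek(offset)) return false;   // jump straight to the record
    int n = dataFile.read(out, RECORD_SIZE);    // out must hold RECORD_SIZE + 1 bytes
    if (n <= 0) return false;
    out[n] = '\0';
    return true;
  }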

hedgestock:
They are variable-length records, but they probably won't exceed 100-ish characters, and I can order them when initializing, since the data would be read-only.

You have not said what the search term will be?

And can you provide an example of a long and a short record?

Could you arrange for all the records to be the size of the largest record? Fixed length records are much easier to work with.

...R

Robin2:
You have not said what the search term will be?

And can you provide an example of a long and a short record?

Could you arrange for all the records to be the size of the largest record? Fixed length records are much easier to work with.

...R

Using a parallel index file containing pointers to each record, even variable-length records can be handled.

Railroader:
Using a parallel index file containing pointers to each record, even variable-length records can be handled.

Yes, but not as straightforward as using fixed length records with fixed length fields.

What happens if someone forgets to update the index?

...R

Robin2:
Yes, but not as straightforward as using fixed length records with fixed length fields.

What happens if someone forgets to update the index?

...R

Fixed-length records need no index file, only some one-line math.

The index is intimately connected with the main data file. Every insert or delete of a record involves an update of the index file. That's absolutely not separate work.
If updates of the basic data file are rare, it's no problem to read the data file record by record and create the index from the beginning.
The OP looks like they are using a quite stable data file.

So,

Thanks for all your answers, I think I begin to see a solution.
My file being already large, I don't really see an issue with making every record the length of the largest one, even if that multiplies the size of the data by 10. If I understood correctly, this means that I would be able to do

  dataFile.seek(lineNumber * LineLength);

and assuming the seek function is O(1), I'll be fast as hell, problem solved.
Sadly, I won't have the possibility to test until tomorrow.
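In the meantime, something like this is what I have in mind (completely untested; since my keys are strings, I'd binary search the sorted, fixed-length records, and the sizes below are placeholders):

  #include <SD.h>
  #include <string.h>

  const uint32_t RECORD_LEN  = 128;   // every record padded to this length
  const size_t   MAX_KEY_LEN = 64;    // keys assumed shorter than this

  // Untested: binary search over fixed-length, key-sorted records.
  // Each record is assumed to look like  "key": "value"  padded with spaces.
  // `out` must hold at least RECORD_LEN + 1 bytes.
  bool findRecord(File &f, const char *key, char *out) {
    uint32_t lo = 0;
    uint32_t hi = f.size() / RECORD_LEN;          // number of records
    while (lo < hi) {
      uint32_t mid = lo + (hi - lo) / 2;
      f.seek(mid * RECORD_LEN);                   // jump to the middle record
      int n = f.read(out, RECORD_LEN);
      if (n <= 0) return false;
      out[n] = '\0';

      // extract the key between the first pair of quotes
      char *start = strchr(out, '"');
      char *end   = start ? strchr(start + 1, '"') : nullptr;
      if (!start || !end) return false;
      char recKey[MAX_KEY_LEN];
      size_t len = min((size_t)(end - start - 1), MAX_KEY_LEN - 1);
      memcpy(recKey, start + 1, len);
      recKey[len] = '\0';

      int cmp = strcmp(recKey, key);
      if (cmp == 0) return true;                  // `out` now holds the record
      if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
  }

With ~100,000 records that should be about 17 seeks per lookup.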

However, the solution from Railroader seems to involve no modification of the data file, and I feel like it is cleaner. The sole problem is that I do not understand how the index file would work in practice (also, how would searching my index be faster than searching the file directly, since it is a flat dictionary?). As I said, I have programming experience and understand the concept fairly well, but I have no knowledge of the functions to call for doing so. Could you perhaps provide a small dummy sample or example for me to understand? My googling did not give good results for that...
And yes, the data is fairly stable, even read-only for the moment (it might be editable sometime in the future but that's not at all a mandatory feature), so creating an index file during initialisation is fine.
Once again, thanks for taking some time to help me. :slight_smile:

I would like to know some details about how/when you create the data file.
Another question is whether you use any delimiter at the end of each record, or a start character marking the beginning of each record.

Suppose you start with an empty data file, opened for writing. Then you are at byte zero in the data file. Write that offset (0) to the first record in the index file and advance the index pointer to the next index record.
Write as many bytes as you want for the first record to the data file. Write the resulting byte count (the offset of the next record) into the index file and advance the index pointer to its next record.
Write the next record's data bytes, keep incrementing the data byte counter, and write the new offset to the index file.
And so on.

The index file needs to use unsigned long to handle your 16 MB of data.

I'll try to post you some code, or pseudocode, but that will be some 12 hours later, I'm afraid.
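In the meantime, the rough idea as untested code; the file names and the one-record-per-line assumption are just examples:

  #include <SD.h>

  // Untested sketch: scan the data file once at startup and write the byte
  // offset of each record start into a parallel index file.
  // Assumes one record per line, terminated by '\n'; file names are examples.
  void buildIndex() {
    SD.remove("index.bin");                       // FILE_WRITE appends, so start fresh
    File data  = SD.open("data.txt", FILE_READ);
    File index = SD.open("index.bin", FILE_WRITE);

    uint32_t offset = 0;                          // unsigned long is enough for 16 MB
    index.write((uint8_t *)&offset, sizeof(offset));   // first record starts at byte 0

    while (data.available()) {
      char c = data.read();
      offset++;
      if (c == '\n' && data.available()) {
        // the next record starts right after the newline
        index.write((uint8_t *)&offset, sizeof(offset));
      }
    }
    data.close();
    index.close();
  }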

Railroader:
The index is intimately connected with the main data file. Every insert or delete of a record involves an update of the index file. That's absolutely not separate work.

That only happens if there is a database program that makes it happen. I'm not aware of any database programs that can run on an Arduino.

...R

Well, who knows? The OP might want to do some editing inside the project.

I have no experience with SD cards on Arduino, but I can imagine that deserializing and searching for a key in a large file using the method you are using would take a while.

Are you updating the file within your device or is it read-only? If it's the latter I'd do one of two things:

  1. create a separate sorted index file and store the key, length and offset of each element in it during the startup portion of your sketch. To access the data, binary search the index and load that record from the data file (rough sketch below).
  2. sort the entire file at startup and then separate it into several smaller files. Keep a table of the first key in each of the files in memory so that you know which file to search for your record in.
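For option 1, a rough, untested sketch (the field sizes, struct layout and file names are made up; the index file is assumed to have been written at startup with the same struct, sorted by key):

  #include <SD.h>
  #include <string.h>

  // Untested sketch of option 1: fixed-size index entries, so the index file
  // itself can be binary searched with seek(). Sizes are placeholders.
  struct IndexEntry {
    char     key[32];    // key, null-padded
    uint32_t offset;     // where the record starts in the data file
    uint16_t length;     // record length in bytes
  };

  bool lookup(File &index, File &data, const char *key, char *out, size_t outSize) {
    uint32_t lo = 0, hi = index.size() / sizeof(IndexEntry);
    IndexEntry e;
    while (lo < hi) {
      uint32_t mid = lo + (hi - lo) / 2;
      index.seek(mid * sizeof(IndexEntry));
      index.read((uint8_t *)&e, sizeof(e));
      int cmp = strcmp(e.key, key);
      if (cmp == 0) {                             // found: fetch the record itself
        data.seek(e.offset);
        size_t n = min((size_t)e.length, outSize - 1);
        data.read(out, n);
        out[n] = '\0';
        return true;
      }
      if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
  }

With roughly 100,000 keys, each lookup touches about 17 index entries plus one data record, which should sit comfortably inside your time budget.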