Extracting Data From a Large JSON File

Hello, I am working on a project that relies on data from a web service that returns a large JSON file, which I save to an SD card. I have an array of IDs, and I need to find the matching IDs in the JSON file and read out the parameters that correspond to those IDs. The problem I'm having is that the JSON file is around 6 MB, so it won't fit into RAM. I could filter the JSON, but the data I need is in one very large array of objects, so I can't reduce it much. I looked online, and it seems that I should read and parse the JSON in chunks, but I didn't find any examples of that.

The JSON looks something like this:

{
    "data": {
        "objects": [{
                "id": 239089060,
                "Data1": 97115,
                "Data2": 1239083487237,
                "Data3": 169874830,
                "Data4": 1643904238090,
                "neededData": 360
            }, {
                "id": 560213034,
                "Data1": 54987,
                "Data2": 4895203781411,
                "Data3": 154856257,
                "Data4": 5812348902040,
                "neededData": 125
            }, {
                "id": 432904509,
                "Data1": 76751,
                "Data2": 987206196125,
                "Data3": 465145478,
                "Data4": 5498347519757,
                "neededData": 99
            }
        ]
    }
}

If, for example, my ID array contains the values 239089060 and 432904509, the program should produce another array with the values 360 and 99.

In reality there are around 5000 objects in the "objects" array, and my ID array will contain about 50 values.

Any ideas on how to accomplish this would be helpful.

If there is any information I am missing, let me know. Thank you in advance for your help.

For any SD card, the largest chunk is a 512-byte sector; the smallest is one byte. Are you going to program this on an Arduino or on your PC?

I am not sure what you mean by this. I am looking for insights on how to parse the needed data from the JSON without filling up the RAM. The program will run on a Raspberry Pi Pico, programmed using the Arduino IDE.

I mean that data stored on an SD card is ALWAYS stored in 512-byte chunks, segments, sectors, whatever you want to call it. The actual data in the very last sector may not completely fill it, so that sector has to be handled differently.
Your data seems to be all variable-length fields, resulting in variable-length records, which will be really hard to process on a repetitive basis. Is there any way to make the fields fixed-length when the file is created?

I might have used the wrong wording in my post. I am not looking to read raw chunks from the SD card; I am looking to read the JSON file in parts so that I can parse the data and extract what's needed.

You'll need a parser that can work incrementally or in streaming mode, and store the parsing context so it can resume. Possibly squix78/json-streaming-parser on GitHub ("Arduino library for parsing potentially huge json streams on devices with scarce memory") may work; I haven't tried it.
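
If that library behaves as its README describes, the usage pattern would be roughly: feed the file to the parser one character at a time and implement a listener that reacts to keys and values. A minimal sketch of that idea, assuming the library's JsonListener/JsonStreamingParser interface as published and the field names from the JSON above; I haven't compiled this, so treat the exact signatures as an assumption and check the library's own examples:

#include <SD.h>
#include <JsonStreamingParser.h>
#include <JsonListener.h>

// Reacts to the "id" and "neededData" fields from the sample JSON.
// Values are stashed and only used once the object closes, since the
// field order within an object is not guaranteed.
class IdListener : public JsonListener {
  String lastKey;
  long id = -1;
  long needed = -1;
public:
  void key(String k) override { lastKey = k; }
  void value(String v) override {
    if (lastKey == "id") id = v.toInt();
    else if (lastKey == "neededData") needed = v.toInt();
  }
  void endObject() override {
    if (id >= 0 && needed >= 0) {
      Serial.print("id=");
      Serial.print(id);
      Serial.print(" neededData=");
      Serial.println(needed);
      id = needed = -1;  // reset for the next object in the array
    }
  }
  // Required by the interface, unused here.
  void whitespace(char c) override {}
  void startDocument() override {}
  void endDocument() override {}
  void startObject() override {}
  void startArray() override {}
  void endArray() override {}
};

void scanWithLibrary(File &f) {
  JsonStreamingParser parser;
  IdListener listener;
  parser.setListener(&listener);
  while (f.available()) {
    parser.parse(f.read());  // one character at a time; RAM use stays small
  }
}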


No, you used the right words. With a file that has variable-length fields, there is no other way than to read each byte. With fixed-length fields, you know the size of each record, can compute the location of the record you want, and can then compute where in the record to find that field.

Ignore the fact that it is JSON.
Read the file line by line into a small buffer.
Once you see the "objects" entry, start parsing the entries.
If you see an "id" field, look for it in your array of 50 items.
If it's there, grab the next "neededData" field.
If not, go back to looking for an "id" field.

Depending on what your data really looks like, you may need to check for when the array ends too. A sketch of this approach is below.
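
A minimal sketch of that idea, assuming the file is pretty-printed with one field per line as in the sample, and using the Arduino File API (the function and parameter names here are made up for illustration):

// Pair each "id" line with the following "neededData" line.
// Assumes one field per line, as in the pretty-printed sample JSON.
void scanLines(File &f, const int32_t ids[], int32_t results[], size_t n) {
  int32_t currentId = -1;
  while (f.available()) {
    String line = f.readStringUntil('\n');
    int p;
    if ((p = line.indexOf("\"id\":")) >= 0) {
      currentId = line.substring(p + 5).toInt();    // toInt() skips the space
    } else if ((p = line.indexOf("\"neededData\":")) >= 0 && currentId >= 0) {
      int32_t value = line.substring(p + 13).toInt();
      for (size_t i = 0; i < n; i++) {
        if (ids[i] == currentId) results[i] = value; // one of the wanted IDs
      }
      currentId = -1;                                // done with this object
    }
  }
}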

Reading line by line might help if the JSON is pretty-printed. But many APIs don't bother, and the file contains no line breaks (or spaces or tabs) at all. So the reliable way is to treat the file as a stream and go character by character.

Are the IDs 9-digit integers? These details matter more with C/C++: int32_t tops out at 2,147,483,647, so 10-digit integers could require going bigger than int32_t, and 13-digit values like Data2 in your sample would definitely need int64_t. What about the values you need to extract?

So let's say you've got the 50 IDs to find in a plain array. You could convert them one-for-one to an array of structs:

struct FindThese {
  int32_t id;     // ID to search for
  int32_t data;   // neededData value, filled in when found
  bool found;
};
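
For example (a sketch; idList, NUM_IDS, and initTargets are placeholder names for however you actually store the 50 IDs):

const size_t NUM_IDS = 50;
int32_t idList[NUM_IDS] = { 239089060, 432904509 /* , ... */ };

FindThese targets[NUM_IDS];

void initTargets() {
  for (size_t i = 0; i < NUM_IDS; i++) {
    targets[i] = { idList[i], 0, false };  // data is filled in during the scan
  }
}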

Then you've got the JSON stream. Whether you use a library or write your own state machine, the process looks like this:

  • got an object {: now at nest level 1
  • got a field named data
  • it's an object, nest level 2
  • got a field named objects
  • it's an array [: at nest level 3
  • for each element of {"data":{"objects":[
    • got an object, nest level 4
    • got a field named id
    • got the value (you know it's complete when the terminating , or } appears): it's a number; stash it
    • got a field named neededData
    • got the value: it's a number; stash it
    • object is finished } at level 4 (going back to level 3)
      • search the array for the id; if found
        • set the data
        • set found to true

Note that the order of fields within an object may vary; better to wait until the object is done.
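
A minimal hand-rolled version of that state machine, as a sketch: it assumes the flat layout and all-numeric values of the sample JSON (so every quoted string must be a key), skips the full nesting-level bookkeeping, and only matches against the targets array above once the closing } is seen, per the note above. scanFile is a name invented here:

// Character-by-character scan of the stream. Collects each object's
// "id" and "neededData", then matches against targets[] when the
// object closes. Assumes all values are numbers, as in the sample.
void scanFile(File &f, FindThese targets[], size_t n) {
  String token;            // current string or number being collected
  String lastKey;          // most recent key seen
  bool inString = false;
  int32_t id = -1, needed = -1;

  while (f.available()) {
    char c = f.read();
    if (inString) {
      if (c == '"') {      // string finished; here it can only be a key
        inString = false;
        lastKey = token;
        token = "";
      } else {
        token += c;
      }
    } else if (c == '"') {
      inString = true;
    } else if (isDigit(c) || c == '-') {
      token += c;          // accumulate a number one digit at a time
    } else if (token.length() > 0) {
      // Any delimiter (, } : whitespace) ends the number we collected.
      if (lastKey == "id") id = token.toInt();
      else if (lastKey == "neededData") needed = token.toInt();
      token = "";
    }

    if (c == '}' && id >= 0 && needed >= 0) {
      // Object complete: check whether this id is one we want.
      for (size_t i = 0; i < n; i++) {
        if (targets[i].id == id) {
          targets[i].data = needed;
          targets[i].found = true;
        }
      }
      id = needed = -1;    // reset for the next object in the array
    }
  }
}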

At the end, you can go through the array of structs to see how many you got. There are some possible optimizations; for example, keep a count of how many have been found, and if you got them all, stop reading. You also have to decide how much error checking you want: even if the JSON is well formed and has no syntax errors, what if there are duplicate IDs, for example? At the least, you should report whether the run was clean with no issues, or not.
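
That early-exit check could be as simple as this (a sketch; countFound is a name invented here, and a linear scan is fine for ~50 entries):

// Returns how many targets have been found so far. Call it after each
// object; once it reaches n, the rest of the file can be skipped.
size_t countFound(const FindThese targets[], size_t n) {
  size_t count = 0;
  for (size_t i = 0; i < n; i++) {
    if (targets[i].found) count++;
  }
  return count;
}

Cheaper still: increment a counter only at the moment found flips from false to true, which also sidesteps double-counting duplicates.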


Good point. If that is the case here, it might be worth looking for a parsing library or at least some parsing code to get started.