Parsing large strings

Hi,
I am writing a program that will use wifi to get a data string, parse it and set some external LEDs as a result. I'm currently working on the parsing section.

I am attempting to put in a sample string to get my parsing right. It is rather long (1185 chars), and this causes the IDE to crash when compiling. I've tried taking out the single quotes, since the FAQ says strange " ' \ combinations can cause this, but to no avail. The IDE also crashes when I try to auto-format the code.

SRAM is also limited (8 kB on the Mega 2560, ~2 kB on an Uno), so I've tried using PROGMEM, but it still crashes.

I could get around needing the full string by parsing as it is read from the network but I'm worried about buffer overflow if data is not read fast enough.

I'm using a WiFi shield, a Mega 2560 and V1.0.3 of the IDE (so it is compatible with the WiFi library).

If anyone has any suggestions or relevant experience I'd appreciate them.
The code:

#include <avr/pgmspace.h>
/*
  String substring()
 
 Examples of how to use substring in a String
 
 created 27 July 2010, 
 modified 2 Apr 2012
 by Zach Eveland
 
 http://arduino.cc/en/Tutorial/StringSubstring
 
 This example code is in the public domain.
 */

void setup() {
  // Open serial communications and wait for port to open:
  Serial.begin(9600);
  while (!Serial) {
    ; // wait for serial port to connect. Needed for Leonardo only
  }

  // send an intro:
  Serial.println("\n\nString  substring():");
  Serial.println();
}

void loop() {
  // Set up a String:
  // PROGMEM is a directive to put the string into flash memory
   PROGMEM String string = "<td >WIND %</td><td style='background-color: rgb(36, 184, 36)'>24%</td><td style='background-color: rgb(36, 184, 36)'>24%</td><td style='background-color: rgb(36, 184, 36)'>25%</td><td style='background-color: rgb(36, 184, 36)'>25%</td><td style='background-color: rgb(36, 184, 36)'>26%</td><td style='background-color: rgb(36, 184, 36)'>26%</td><td style='background-color: rgb(36, 184, 36)'>23%</td><td style='background-color: #FFCC32'>18%</td><td style='background-color: #FFCC32'>14%</td><td style='background-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>14%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #FFCC32'>14%</td><td style='background-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>12%</td><td style='background-color: #FFCC32'>11%</td><td style='background-color: #FFCC32'>11%</td><td style='background-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>12%</td>";
  Serial.println(string);
  int one=0,two=0;
  boolean runLoop=true;
  //read first to seed loop
  one=string.indexOf(">",string.indexOf("</td>")+5);//get the time's start (end tag of colour)
  if(one<0||two<0)
  {
    runLoop=false;
  }
  one=one+1;//move to start of number
  two=string.indexOf("</td>",one);//get the time's end tag
  if(one<0||two<0)
  {
    runLoop=false;
  }

  while(runLoop)
  {//while percents left
    String time=string.substring(one,two);
    Serial.println(time);
    one=string.indexOf(">",two+5);//get the time's start (end tag of colour)
    if(one<0||two<0)
    {
      runLoop=false;
    }
    one=one+1;//move to start of number
    two=string.indexOf("</td>",one);//get the time's end tag
    if(one<0||two<0)
    {
      runLoop=false;
    }  
    //Serial.print(one);
    //Serial.println(" = one");
    //Serial.print(two);
    //Serial.println(" = two");
  }
  // do nothing while true:
  while(true);
}

The error message wouldn't fit in the last post. Most of it is here; it's too verbose for the forum's character limits.

Exception in thread "Thread-6" java.lang.StackOverflowError
	at java.util.regex.Pattern$Loop.match(Pattern.java:4275)
	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
	at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
	at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
	at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)

I am attempting to put in a sample string to get my parsing right.

You LIED. You are using a String, which is NOT the same thing, AT ALL.

Putting the String object in PROGMEM will NOT put the string that it wraps in PROGMEM, so you are wasting your time.

Besides, when you start parsing live data, you can't store that in PROGMEM.

I am attempting to put in a sample string to get my parsing right.

You are not using a string, you are using a String. They are not the same thing at all, but judging by the error message that is not the problem here: it points at the Java code behind the IDE.

It looks like you are going to have to reduce the size of the string. For instance, does it need to be a full-blown HTML table, complete with background colours and other details that are irrelevant to the Arduino?

The immediate problem is that the code the Arduino IDE uses to parse your sketch is poorly written and not very robust. You can get around the bug by splitting that big string into several smaller strings. Just put a pair of double quotes in the middle of it, and the compiler will join the adjacent pieces back into one string, so that you get something like this:

  PROGMEM String string = "<td >WIND %</td><td style='background-color: rgb(36, 184, 36)'>24%</td>"
    "<td style='background-color: rgb(36, 184, 36)'>24%</td><td style='background-color: rgb(36, 184,"
    " 36)'>25%</td><td style='background-color: rgb(36, 184, 36)'>25%</td><td style='background-color:"
    " rgb(36, 184, 36)'>26%</td><td style='background-color: rgb(36, 184, 36)'>26%</td><td style='backg"
    "round-color: rgb(36, 184, 36)'>23%</td><td style='background-color: #FFCC32'>18%</td><td style='ba"
    "ckground-color: #FFCC32'>14%</td><td style='background-color: #FFCC32'>13%</td><td style='backgrou"
    "nd-color: #FFCC32'>13%</td><td style='background-color: #FFCC32'>13%</td><td style='background-col"
    "or: #FFCC32'>14%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #F"
    "FCC32'>15%</td><td style='background-color: #FFCC32'>15%</td><td style='background-color: #FFCC32'"
    ">15%</td><td style='background-color: #FFCC32'>14%</td><td style='background-color: #FFCC32'>13%</t"
    "d><td style='background-color: #FFCC32'>12%</td><td style='background-color: #FFCC32'>11%</td><td "
    "style='background-color: #FFCC32'>11%</td><td style='background-color: #FFCC32'>13%</td><td style="
    "'background-color: #FFCC32'>12%</td>";

All the other comments about ill-advised use of the String class and problems dealing with huge strings in RAM apply too, and I'm sure you will need to address those before your sketch will work correctly, but splitting the string up should get you past this silly IDE bug. Judging by the error message I get from 1.5.2, I suspect the limit is 1024 characters in a string, but I haven't confirmed that and it may not be the same for all IDE versions anyway.

PeterH's solution worked. Turning the one string into two contiguous strings fixed the IDE error.

I have no access to the server that generates the data, so I'm stuck with it in that form.

As for string vs String, my wifi data will be in a string (char []) rather than a String. The upper-case version was just for ease of manipulation, so yeah, not the same thing, just a lazy programmer.

Thanks all.
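
For reference, the same extraction can be done on a plain char array with the standard C string functions instead of String. This is only a sketch under assumptions: the function name parsePercents is made up, the data is assumed to be NUL-terminated and laid out like the sample row (a label cell first, then one number per cell), and none of it has been tried against the real server output.

```cpp
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Extract integer percentages from cells like "<td style='...'>24%</td>".
// Writes up to maxOut values into out and returns how many were found.
size_t parsePercents(const char *html, int *out, size_t maxOut) {
  size_t count = 0;

  // Skip the first cell, which holds the row label ("WIND %").
  const char *p = strstr(html, "</td>");
  if (p == NULL) {
    return 0;
  }
  p += 5;  // strlen("</td>")

  while (count < maxOut) {
    const char *start = strchr(p, '>');  // end of the opening <td ...> tag
    if (start == NULL) {
      break;
    }
    start++;  // first character of the cell text
    const char *end = strstr(start, "</td>");
    if (end == NULL) {
      break;
    }
    out[count++] = atoi(start);  // atoi stops at the '%' sign
    p = end + 5;                 // continue after this cell's closing tag
  }
  return count;
}
```

Unlike the String version, every "not found" result is checked before the pointer is used, and nothing is allocated on the heap.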

You have more than enough processing power to deal with the string on the fly, more so as you will just throw most of it away.

Mark

You shouldn't need to buffer anything you're not saving but if you do play buffer-then-match, you shouldn't have to buffer more than a word at a time.

Even at 115200 baud there is a serious gap between arriving serial chars.
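
A rough sketch of what parsing on the fly could look like for the sample row: a tiny state machine that is fed one character at a time and needs only a few bytes of state, no buffer for the page. The names PercentScanner, scannerReset and scannerFeed are hypothetical, and the logic assumes that no '>' appears inside an attribute value and that cell text holds at most one number.

```cpp
#include <stddef.h>

// Hypothetical streaming extractor: feed it one character at a time.
struct PercentScanner {
  bool inTag;      // true between '<' and '>'
  bool hasDigits;  // true once a digit was seen in the current cell text
  int value;       // number accumulated from the current cell text
};

void scannerReset(PercentScanner *s) {
  s->inTag = false;
  s->hasDigits = false;
  s->value = 0;
}

// Returns true when a complete number has just ended; *result then holds it.
bool scannerFeed(PercentScanner *s, char c, int *result) {
  if (c == '<') {
    // A tag is starting, so any cell text before it is complete.
    bool done = !s->inTag && s->hasDigits;
    if (done) {
      *result = s->value;
    }
    s->inTag = true;
    s->hasDigits = false;
    s->value = 0;
    return done;
  }
  if (c == '>') {
    s->inTag = false;
    return false;
  }
  // Digits inside a tag (e.g. rgb(36, 184, 36)) are ignored.
  if (!s->inTag && c >= '0' && c <= '9') {
    s->value = s->value * 10 + (c - '0');
    s->hasDigits = true;
  }
  return false;
}
```

In a sketch this would be fed from the network client, for example with each character returned by client.read() while client.available() is non-zero, setting an LED whenever scannerFeed returns true.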

I could get around needing the full string by parsing as it is read from the network but I'm worried about buffer overflow if data is not read fast enough.

TCP connections have flow control built in. You shouldn't have to worry about overflow.

Find out how many millis or micros you have between serial reads. Take the baud (bit) rate and divide by 10 (8 data bits + 1 start bit + 1 stop bit per char). At 115200 baud, chars can arrive at up to 11520/sec, which is one every 87 micros or a little less, or just under 1389 clock cycles on a 16 MHz Arduino. You can do a LOT in that many cycles.
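
The arithmetic written out, as a small sketch. The function names are made up for illustration, 8N1 framing is assumed, and 16 MHz is the clock of the Mega 2560. All results are rounded down by the integer division.

```cpp
#include <stdint.h>

// One serial character = 1 start bit + 8 data bits + 1 stop bit (8N1).
const uint32_t kBitsPerChar = 10;

// Maximum characters per second at a given baud rate.
uint32_t charsPerSecond(uint32_t baud) {
  return baud / kBitsPerChar;
}

// Whole microseconds between two back-to-back characters.
uint32_t microsPerChar(uint32_t baud) {
  return (1000000UL * kBitsPerChar) / baud;
}

// Whole CPU clock cycles between two back-to-back characters.
uint32_t cyclesPerChar(uint32_t baud, uint32_t cpuHz) {
  return (uint32_t)(((uint64_t)cpuHz * kBitsPerChar) / baud);
}
```

For 115200 baud and a 16 MHz clock these give 11520 chars/sec, 86 micros and 1388 cycles, matching the figures above after rounding down.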

OTOH you can buffer up whole lines of text, waste the time between chars and then get all the parse and lex done in one task and you'll still get the job done on time! But which way needs the extra big buffer?

You have an SD card slot on the shield. Why not use that to store the HTML?

You can also wrap long lines in C using backslash.
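
A small sketch of both ways to break a long literal, with made-up names kJoined, kSpliced and sameText. It only shows that the compiler sees the same text either way; whether the backslash form also avoids the IDE crash has not been confirmed in this thread.

```cpp
#include <string.h>

// Adjacent string literals are joined by the compiler into one literal.
const char kJoined[] =
    "<td >WIND %</td>"
    "<td style='background-color: #FFCC32'>18%</td>";

// A backslash at the very end of a line splices the next line onto it.
// The continuation has to start in column 0, otherwise its leading
// spaces become part of the string.
const char kSpliced[] = "<td >WIND %</td>\
<td style='background-color: #FFCC32'>18%</td>";

// True when both spellings produce identical text.
bool sameText() {
  return strcmp(kJoined, kSpliced) == 0;
}
```

The adjacent-literal form is usually easier to live with, because it can be indented freely.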

The IDE code that was falling over was simply trying to parse the code to colourize it, I think.

There is a library called TextFinder (TextFinder.h) that finds character strings in streaming data. There are some somewhat outdated examples in the IDE example code. Below is some more elaborate code that picks weather data out of a weather web page as it is returned.

// Include description files for other libraries used (if any)
#include <SPI.h>
#include <Ethernet.h>

// Define Constants
// Max string length may have to be adjusted depending on data to be extracted
#define MAX_STRING_LEN  20

// Setup vars
char tagStr[MAX_STRING_LEN] = "";
char dataStr[MAX_STRING_LEN] = "";
char tmpStr[MAX_STRING_LEN] = "";
char endTag[3] = {'<', '/', '\0'};
int len;

// Flags to differentiate XML tags from document elements (ie. data)
boolean tagFlag = false;
boolean dataFlag = false;

// Ethernet vars
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
byte ip[] = { 192, 168, 1, 102 };
byte server[] = { 140, 90, 113, 200 }; // www.weather.gov

// Start ethernet client
EthernetClient client;

void setup()
{
  Serial.begin(9600);
  Serial.println("Starting WebWx");
  Serial.println("connecting...");
  Ethernet.begin(mac, ip);
  delay(1000);

  if (client.connect(server, 80)) {
    Serial.println("connected");
    client.println("GET /xml/current_obs/KRDU.xml HTTP/1.0");    
    client.println();
    delay(2000);
  } else {
    Serial.println("connection failed");
  }  
}

void loop() {

  // Read serial data in from web:
  while (client.available()) {
    serialEvent();
  }

  if (!client.connected()) {
    Serial.println();
    Serial.println("Disconnected");
    Serial.println("==================================");
    Serial.println("");
    client.stop();

    // Time until next update
    //Serial.println("Waiting");
    for (int t = 1; t <= 15; t++) {
      delay(60000); // 1 minute
    }

    if (client.connect(server, 80)) {
      //Serial.println("Reconnected");
      client.println("GET /xml/current_obs/KRDU.xml HTTP/1.0");    
      client.println();
      delay(2000);
    } else {
      Serial.println("Reconnect failed");
    }      
  }
}

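// Process each char from web.
// Note: the Arduino core also calls a function named serialEvent() on its own
// whenever data is waiting on Serial, so a different name would be safer.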
void serialEvent() {

   // Read a char
   char inChar = client.read();
   //Serial.print(".");
  
   if (inChar == '<') {
      addChar(inChar, tmpStr);
      tagFlag = true;
      dataFlag = false;

   } else if (inChar == '>') {
      addChar(inChar, tmpStr);

      if (tagFlag) {      
         strncpy(tagStr, tmpStr, strlen(tmpStr)+1);
      }

      // Clear tmp
      clearStr(tmpStr);

      tagFlag = false;
      dataFlag = true;      
      
   } else if (inChar != 10) {
      if (tagFlag) {
         // Add tag char to string
         addChar(inChar, tmpStr);

         // Check for the start of a closing tag ("</") and ignore it
         if ( tagFlag && strcmp(tmpStr, endTag) == 0 ) {
            clearStr(tmpStr);
            tagFlag = false;
            dataFlag = false;
         }
      }
      
      if (dataFlag) {
         // Add data char to string
         addChar(inChar, dataStr);
      }
   }  
  
   // If a LF, process the line
   if (inChar == 10 ) {

/*
      Serial.print("tagStr: ");
      Serial.println(tagStr);
      Serial.print("dataStr: ");
      Serial.println(dataStr);
*/

      // Find specific tags and print data
      if (matchTag("<temp_f>")) {
         Serial.print("Temp: ");
         Serial.print(dataStr);
      }
      if (matchTag("<relative_humidity>")) {
         Serial.print(", Humidity: ");
         Serial.print(dataStr);
      }
      if (matchTag("<pressure_in>")) {
         Serial.print(", Pressure: ");
         Serial.print(dataStr);
         Serial.println("");
      }

      // Clear all strings
      clearStr(tmpStr);
      clearStr(tagStr);
      clearStr(dataStr);

      // Clear Flags
      tagFlag = false;
      dataFlag = false;
   }
}

/////////////////////
// Other Functions //
/////////////////////

// Function to clear a string
void clearStr (char* str) {
   int len = strlen(str);
   for (int c = 0; c < len; c++) {
      str[c] = 0;
   }
}

//Function to add a char to a string and check its length
void addChar (char ch, char* str) {
   // String literals are read-only, so point at them through const char*
   const char *tagMsg  = "<TRUNCATED_TAG>";
   const char *dataMsg = "-TRUNCATED_DATA-";

   // Check the max size of the string to make sure it doesn't grow too
   // big.  If string is beyond MAX_STRING_LEN assume it is unimportant
   // and replace it with a warning message.
   if (strlen(str) > MAX_STRING_LEN - 2) {
      if (tagFlag) {
         clearStr(tagStr);
         strcpy(tagStr,tagMsg);
      }
      if (dataFlag) {
         clearStr(dataStr);
         strcpy(dataStr,dataMsg);
      }

      // Clear the temp buffer and flags to stop current processing
      clearStr(tmpStr);
      tagFlag = false;
      dataFlag = false;

   } else {
      // Add char to string
      str[strlen(str)] = ch;
   }
}

// Function to check the current tag for a specific string
boolean matchTag (const char* searchTag) {
   // strcmp returns 0 when the two strings are identical
   return strcmp(tagStr, searchTag) == 0;
}
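
One thing to be aware of in addChar above: it finds the end of the string with strlen and never writes a terminator, so it relies on the rest of the buffer already being zero. A possible variant that terminates explicitly is sketched below. The name appendChar and the explicit capacity parameter are illustrative and not part of the sketch above.

```cpp
#include <stddef.h>
#include <string.h>

// Append ch to the NUL-terminated string in buf (capacity bytes in total).
// Returns false, leaving buf unchanged, when there is no room left.
bool appendChar(char *buf, size_t capacity, char ch) {
  size_t len = strlen(buf);
  if (len + 1 >= capacity) {
    return false;  // need room for ch plus the terminating NUL
  }
  buf[len] = ch;
  buf[len + 1] = '\0';
  return true;
}
```

The caller decides what to do on false, for example replacing the buffer contents with a truncation marker as the original does.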