Go Down

Topic: New regular expression library released (Read 25748 times) previous topic - next topic

nickgammon

Apr 30, 2011, 07:45 am Last Edit: May 01, 2011, 12:48 am by Nick Gammon Reason: 1
I am pleased to announce the release of a Regular Expression library for the Arduino. Preliminary inquiries appeared to indicate that there was no such thing currently readily available.

What are regular expressions?

If you haven't used them before, they are incredibly powerful ways of parsing text in strings. For example, you can look for a word starting with an upper-case letter, a hex string, a number with an optional leading minus sign, and so on.

They provide a somewhat easier and more powerful way of breaking up strings than using library functions like strtok, strcmp, strstr etc.


Download

The library can be downloaded from:

http://gammon.com.au/Arduino/Regexp.zip (41Kb).

Design

My design criteria for this library were:


  • Be powerful enough to be useful (ie. not a "toy")

  • Be fast enough that it could be used to do things like process input from GPS devices, RFID readers, web pages etc.

  • Be compact enough that it doesn't use most of the available memory on a microprocessor



I believe these have been met, as follows:

Power

The library processes regular expressions which are identical in syntax to those used by the Lua string.match, string.find and similar functions. This is because the code is adapted from the Lua code written by Roberto Ierusalimschy. It has been adapated enough to make it work outside the Lua structure basically.

Lua regular expressions are well-known and well-documented. Their power is such that (for example) very extensive add-ons for the World of Warcraft game are written in Lua, and use the regular expression matching to break up incoming data from the server.

My own documentation for Lua regular expressions is here:

http://www.gammon.com.au/scripts/doc.php?lua=string.find

You can not only match strings like "%d+" (for one or more digits) but you can specify "captures" which means each captured substring has its position returned, so you can easily extract it out from the original string.

Speed

The Lua regular expression matcher has been well-regarded for its speed, and this library performs well too. For example:

String to parse: "Testing: answer=42"
Regular expression: "(%a+)=(%d+)"

Time taken to match: around 2 milliseconds.

This test returned the matching text ("answer=42"), its length, plus the two captures ("answer" and "42").

Code: [Select]

match start was 9
match length was 9
Match text: 'answer=42'
Captures: 2
Capture number: 1
Text: 'answer'
Capture number: 2
Text: '42'


This shows how easily you can use regular expressions to parse incoming text (eg. GPS data in the form keyword=value).


Size

The library takes about 2392 bytes. For example a minimal test would be:

Code: [Select]

#include <Regexp.h>
void setup ()
{
 MatchState ms;
 ms.Target ("cat");  // what to search
 char result = ms.Match ("a", 0);  // look for "a"
} // end of setup

void loop () {}


This compiles to be 2842 bytes. However an "empty" sketch is 450 bytes, so the regular expression library has added 2392 bytes (2.33 Kb).

I believe this is an acceptable length. This is around 7% of the memory on a 32 Kb device. You can reduce the memory slightly by reducing the number of captures it supports (currently 32). Alternatively, if you need to do dozens of captures you can do that by changing one define, at the cost of 4 bytes per capture.


Error handling

The library "throws" exceptions by doing a non-local goto (longjmp), in exactly the same way Lua does. This keeps the code compact and simple. If there is a parsing problem then the library returns a negative number as the result of the regexp call. You can interpret those to tidy up your regular expressions to make them work properly.


Usage

You need to include the library:

Code: [Select]

#include <Regexp.h>


Since, unlike Lua, functions cannot return multiple results (eg. all the captures) the MatchState structure is used to communicate with the library. You start by setting up the string to be searched, and its length:

Code: [Select]

MatchState ms;
ms.Target ("Testing: answer=42");


You can supply either a zero-terminated string (like the above) or a char buffer and a length.

Then you call the Match method of the MatchState structure, supplying the regular expression string itself, and an zero-relative offset to commence searching from. The function returns 1 on a match, 0 on no match, and a negative number for a parsing error.

Code: [Select]

 char result = ms.Match ("(%a+)=(%d+)", 0);
 if (result == REGEXP_MATCHED)
 {
   // matching offsets in ms.capture
 }
 else if (result == REGEXP_NOMATCH)
 {
   // no match
 }
 else
 {
   // some sort of error
 }


The meanings of the various error codes are defined in Regexp.h.

If and only if you get a REGEXP_MATCHED result (that is, 1) then the captures array in the MatchState structure is set up to indicate what the address and length of each capture substring is. You can use that information to index into your supplied string and extract out the substrings, if required.

For example:

Code: [Select]

char buf [100];  // large enough to hold expected string

Serial.print ("Captures: ");
Serial.println (ms.level);

for (int j = 0; j < ms.level; j++)
 {
 Serial.print ("Capture number: ");
 Serial.println (j, DEC);
 Serial.print ("Text: '");
 Serial.print (ms.GetCapture (buf, j));
 Serial.println ("'");
 
 } // end of for each capture


Also the matching text itself (the whole match) is stored as an offset and length. You could index into your original string to extract out the matching text, if required. It may not be required. You may only need to know if a match occurred, or not.

Code: [Select]

char buf [100];  // large enough to hold expected string
Serial.print ("Matched on: ");
Serial.println (ms.GetMatch (buf));
       


The library does not attempt to "pre-extract" those strings for you on the grounds that to do so would take extra time and memory, which you, the user of the library, may not care to expend.

MatchState methods

Code: [Select]

void MatchState::Target (const char * s);
void MatchState::Target (const char * s, const unsigned int len);


These let you supply the string to be searched (the target string). It can be null-terminated, in which case the library finds the end by doing a strlen on it, or you supply the length. If you have built up a buffer from incoming text you may prefer to just supply the length.

Code: [Select]

char MatchState::Match (const char * pattern, unsigned int index = 0);


This performs the match based on the supplied null-terminated pattern, and starting at the supplied index into the target string. By modifying the index parameter you can re-match further and further through the same target string, perhaps to keep finding the same sort of string (eg. a word).

The result of the match will be > 0 if successful, 0 if no match, and < 0 if an error occurred (invalid regular expression).

Code: [Select]

char * MatchState::GetMatch (char * s);


After a successful match, this copies the matching string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on MatchLength which is calculated during the match. If no successful match was previously done, then an empty string is copied.

The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.

Code: [Select]

char * MatchState::GetCapture (char * s, const int n);


After a successful match, this copies the specified capture string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on capture [n].len which is calculated during the match. If no successful match was previously done, or this capture does not exist, then an empty string is copied.

The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.


(edit) Version 1.1 uploaded 1st May 2011. This provides the extra "helper" functions documented just above.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics


graynomad

Good one Nick, I think that will become very useful.

______
Rob
Rob Gray aka the GRAYnomad www.robgray.com

nickgammon

#3
Apr 30, 2011, 09:56 am Last Edit: May 01, 2011, 12:14 am by Nick Gammon Reason: 1
Thanks everyone! Now, for what prompted this (apart from wanting it, heh) ...

In this thread here:

http://arduino.cc/forum/index.php/topic,59813.0.html

johnnydanger wanted to parse a string of bus route information. Now this is where regexps come into their own.

The example code below shows how you can parse (by repeatedly calling regexp on the same string, and changing the start point) a whole batch of stuff into useable chunks:

Code: [Select]
#include <Regexp.h>

void setup ()
{
  Serial.begin (115200);
  Serial.println ();

  MatchState ms;
  char * str = "10s:984\n8w:1331\n43w:198\n43w:846\n43e:-293\n43e:907";

  ms.Target (str);

  unsigned int index = 0;
  char buf [100];
 
  while (true)
  {
    char result = ms.Match ("(%d+)(%a):(%-?%d+)", index);

    if (result == REGEXP_MATCHED)
    {
      Serial.println ("-----");
      Serial.print ("Matched on: ");
      Serial.println (ms.GetMatch (buf));
      Serial.println ("Captures:");
      for (int j = 0; j < ms.level; j++)
         Serial.println (ms.GetCapture (buf, j));
      // move past matching string
      index = ms.MatchStart + ms.MatchLength;
    }  // end of match
    else
      break;  // no match or regexp error

  } // end of while

}  // end of setup

void loop () {}


Results:

Code: [Select]
-----
Matched on: 10s:984
Captures:
10
s
984
-----
Matched on: 8w:1331
Captures:
8
w
1331
-----
Matched on: 43w:198
Captures:
43
w
198
-----
Matched on: 43w:846
Captures:
43
w
846
-----
Matched on: 43e:-293
Captures:
43
e
-293
-----
Matched on: 43e:907
Captures:
43
e
907


By incrementing the start index point, we can re-scan inside the same string until we run out of matches. And things like the optional minus sign are easy to implement.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

nickgammon

Further to my post yesterday, the regexp library has been enhanced (new version 1.1 available for download from the same place).

The enhanced version has more consistent data names, and a few "helper" functions to help extract out the matching string, and the captured substrings.

The problem with the substrings is that the library does not make copies of them, internally they are just a starting address and a length. Because of this they are not null-terminated. But if you want to display them, or put them somewhere else, you may prefer null-terminated strings. The helper functions, like GetMatch, simply copy the string from the original target string into a buffer you supply, and append the 0x00 byte. Your supplied buffer must therefore be large enough to hold what is being copied into it.

I have amended the original post above to reflect the amended usage.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

dias

Hi,

I've made some test with your library which is what I need ;)

This is my code :
Code: [Select]

#include <Regexp.h>

void setup ()
{
  Serial.begin (115200);
  Serial.println ();

  MatchState ms;
  char * str = "F4FFF>F5FFF,WIDE2* <UI>:!1234.31N/01234.63W#Test de recuperation de donnee";
  ms.Target (str);

  unsigned int index = 0;
  char buf [200];
 
  while (true)
  {
    char result = ms.Match ("(.+)>(.+) <UI>:(.)(%d+%p+%d+)(%u)(.)(%d+%p+%d+)(%u)(.)(.+)", index);
    if (result == REGEXP_MATCHED)
    {
      Serial.println ("-----");
      Serial.print ("Matched on : ");
      Serial.println (ms.GetMatch (buf));
      Serial.println ("Captures:");
      for (int j = 0; j < ms.level; j++)
        {
          Serial.println (ms.GetCapture (buf, j));
          Serial.println ("-----");
        }
     
char *call,*via;
call = ms.GetCapture (buf, 0);
Serial.println (call);
via = ms.GetCapture (buf, 1);
Serial.println (via);
Serial.println (call);
      index = ms.MatchStart + ms.MatchLength;
    }  // end of match
   
  } // end of while

}  // end of setup

void loop () {}



I Have to parse my data, and Get the result into char.
But the problem is when i write :
Code: [Select]

call = ms.GetCapture (buf, 0);
Serial.println (call); //this one show F4FFF
via = ms.GetCapture (buf, 1);
Serial.println (via); //this one show F5FFF,WIDE2*
Serial.println (call); //this one show F5FFF,WIDE2*


How can I get the correct "call" on the 2° call

Thanks

nickgammon

Code: [Select]
char buf [200];
...
  char *call,*via;
  call = ms.GetCapture (buf, 0);
  Serial.println (call);
  via = ms.GetCapture (buf, 1);
  Serial.println (via);
  Serial.println (call);


Both of your GetCaptures pass in the same buffer "buf". Thus call and via both point to buf (and will show the same thing). One way would be to use two buffers, eg.

Code: [Select]
char buf [200];
char buf2 [200];

...

  char *call,*via;
  call = ms.GetCapture (buf, 0);
  Serial.println (call);
  via = ms.GetCapture (buf2, 1);
  Serial.println (via);
  Serial.println (call);


That displays the right things. Or just use a single buffer and process the data before doing another GetCapture.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

alphabeta279

Ok this looks awesome, but doing a quick test is spitting out odd characters, eg:

Code: [Select]
#include <Regexp.h>

void setup ()
{
  Serial.begin (115200);
  Serial.println ();

  MatchState ms;
  char * str = "RGB:123";

  ms.Target (str);

  unsigned int index = 0;
  char buf [100];
 
  while (true)
  {
    char result = ms.Match ("%a", index);

    if (result == REGEXP_MATCHED)
    {
      Serial.println ("-----");
      Serial.print ("Matched on: ");
      Serial.println (ms.GetMatch (buf));
      Serial.println ("Captures:");
      for (int j = 0; j < ms.level; j++)
         Serial.println (ms.GetCapture (buf, j));
      // move past matching string
      index = ms.MatchStart + ms.MatchLength;
    }  // end of match
    else
      break;  // no match or regexp error

  } // end of while

}  // end of setup

void loop () {}


Will return
Code: [Select]
FmR¡}´

Any ideas?

nickgammon

Using your code unchanged I got:

Code: [Select]

-----
Matched on: R
Captures:
-----
Matched on: G
Captures:
-----
Matched on: B
Captures:
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

nickgammon

Check your baud rate setting, you should at least see "Matched on".
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

nickgammon

To get a capture put the thing you want captured in brackets, eg.

Code: [Select]

    char result = ms.Match ("(%a)", index);


Now I get:

Code: [Select]

-----
Matched on: R
Captures:
R
-----
Matched on: G
Captures:
G
-----
Matched on: B
Captures:
B
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

alphabeta279

Ah, sorry - coding too late obviously - spot on re Baud rate, was at 9600.

This is a seriously awesome piece of work, love the potential of RegEx being included in Arduino (indeed really should be in the core!).

Got it working as you describe below, ok so appreciate your thoughts on the best way to parse this data format:

100,100,150,40
255,154,246,124
etc

Essentially it's an LED control script I'll be pulling from a website with first three values being RGB brightnesses and the last being the time duration before moving onto the next row of values (ie to allow multi coloured lights)

Is it possible with your RegEx script to pull that into separate variables?

ie
int redLed = RegEX1
int greenLed = RegEX2
int blueLed = RegEX3
int timeLit = RegEX4

Might be barking up the wrong tree here?

guix

#12
Jan 08, 2014, 07:51 pm Last Edit: Jan 08, 2014, 07:57 pm by guix Reason: 1
alphabeta279, it looks like an easy job for sscanf ;)
Code: [Select]

uint8_t redLed, greenLed, blueLed;
int timeLit;

if (sscanf("255,154,246,124", "%hhu,%hhu,%hhu,%d", &redLed, &greenLed, &blueLed, &timeLit) == 4)
{
   ...
}


I used this library sometimes, but don't have access to an arduino now so I won't say a word about how to do it with regex (I don't know enough)

nickgammon

sscanf could be an easy solution (although that library is fairly large too).

A regexp would be simple enough:

Code: [Select]

(%d+),(%d+),(%d+),(%d+)


That's four captures with one or more digits separated by commas. If it matches just pull out all 4 captures and do atoi on them.

Guix's suggestion is probably the simplest for an easy case like that.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

alphabeta279

Great stuff, have tried code below, however is spitting out error of:
Code: [Select]
sketch_jan08a:14: error: cannot convert 'String' to 'const char*' for argument '1' to 'int sscanf(const char*, const char*, ...)'


Am trying to move the sscanf input string to an external variable as this'll make it simpler to handle when I start using web data feed.

Code: [Select]

#include <Regexp.h>

void setup ()
{
  Serial.begin (115200);
  Serial.println ();


String webLedFeed = "R:234,G:342,B:342,T:23,";

uint8_t redLed, greenLed, blueLed;
int timeLit;

if (sscanf(webLedFeed, "%hhu,%hhu,%hhu,%d", &redLed, &greenLed, &blueLed, &timeLit) == 4)
{
Serial.println("redLED= ");
Serial.println(redLed);
Serial.println("greenLED= ");
Serial.println(greenLed);
Serial.println("blueLED= ");
Serial.println(blueLed);
Serial.println("Time= ");
Serial.println(timeLit);
}


}
 
   
void loop () {}

Go Up