I am pleased to announce the release of a Regular Expression library for the Arduino. As far as I could tell from some preliminary inquiries, nothing like it was readily available.
What are regular expressions?
If you haven't used them before, they are incredibly powerful ways of parsing text in strings. For example, you can look for a word starting with an upper-case letter, a hex string, a number with an optional leading minus sign, and so on.
They provide a somewhat easier and more powerful way of breaking up strings than using library functions like strtok, strcmp, strstr etc.
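To give a feel for the difference, here is a rough sketch of keyword=value parsing done by hand with strchr and the ctype functions, which is the sort of job a pattern such as "(%a+)=(%d+)" does in a single call. The function name parseKeyValue and its parameters are made up for this illustration and are not part of the library.

```cpp
#include <ctype.h>
#include <string.h>

// Hand-written stand-in for the pattern "(%a+)=(%d+)" (illustration only):
// find an '=', then walk left over letters and right over digits.
// On success copies the two pieces (null-terminated) and returns true.
bool parseKeyValue (const char * text,
                    char * key, size_t keySize,
                    char * value, size_t valueSize)
{
  for (const char * eq = strchr (text, '='); eq != NULL; eq = strchr (eq + 1, '='))
  {
    const char * keyStart = eq;
    while (keyStart > text && isalpha ((unsigned char) keyStart [-1]))
      keyStart--;

    const char * valEnd = eq + 1;
    while (isdigit ((unsigned char) *valEnd))
      valEnd++;

    size_t keyLen = (size_t) (eq - keyStart);
    size_t valLen = (size_t) (valEnd - (eq + 1));

    if (keyLen == 0 || valLen == 0)
      continue;       // no letters before this '=' or no digits after it

    if (keyLen >= keySize || valLen >= valueSize)
      return false;   // caller's buffers are too small

    memcpy (key, keyStart, keyLen);
    key [keyLen] = '\0';
    memcpy (value, eq + 1, valLen);
    value [valLen] = '\0';
    return true;
  }
  return false;       // no usable keyword=value pair found
}
```

Every change to the input format means rewriting loops like these, whereas with a regular expression only the pattern string changes.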
Download
The library can be downloaded from:
http://gammon.com.au/Arduino/Regexp.zip (41Kb).
Design
My design criteria for this library were:
- Be powerful enough to be useful (ie. not a "toy")
- Be fast enough that it could be used to do things like process input from GPS devices, RFID readers, web pages etc.
- Be compact enough that it doesn't use most of the available memory on a microprocessor
I believe these have been met, as follows:
Power
The library processes regular expressions which are identical in syntax to those used by the Lua string.match, string.find and similar functions. This is because the code is adapted from the Lua code written by Roberto Ierusalimschy. Basically, it has been adapted just enough to make it work outside the Lua environment.
Lua regular expressions are well-known and well-documented. Their power is such that (for example) very extensive add-ons for the World of Warcraft game are written in Lua, and use the regular expression matching to break up incoming data from the server.
My own documentation for Lua regular expressions is here:
Not only can you match patterns like "%d+" (one or more digits), you can also specify "captures": each captured substring has its position returned, so you can easily extract it from the original string.
Speed
The Lua regular expression matcher has been well-regarded for its speed, and this library performs well too. For example:
String to parse: "Testing: answer=42"
Regular expression: "(%a+)=(%d+)"
Time taken to match: around 2 milliseconds.
This test returned the matching text ("answer=42"), its length, plus the two captures ("answer" and "42").
match start was 9
match length was 9
Match text: 'answer=42'
Captures: 2
Capture number: 1
Text: 'answer'
Capture number: 2
Text: '42'
This shows how easily you can use regular expressions to parse incoming text (eg. GPS data in the form keyword=value).
Size
The library takes about 2392 bytes. For example a minimal test would be:
#include <Regexp.h>
void setup ()
{
  MatchState ms;
  ms.Target ("cat");  // what to search
  char result = ms.Match ("a", 0);  // look for "a"
}  // end of setup
void loop () {}
This compiles to be 2842 bytes. However an "empty" sketch is 450 bytes, so the regular expression library has added 2392 bytes (2.33 Kb).
I believe this is an acceptable length. This is around 7% of the memory on a 32 Kb device. You can reduce the memory slightly by reducing the number of captures it supports (currently 32). Alternatively, if you need to do dozens of captures you can do that by changing one define, at the cost of 4 bytes per capture.
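The arithmetic behind those figures can be spelled out as a small sketch. The constants are the numbers quoted in this post; the function names are invented for the illustration, and the capture figure simply applies the stated 4 bytes per capture.

```cpp
// Figures quoted in this post.
const unsigned long SKETCH_WITH_LIBRARY = 2842;         // bytes, minimal test sketch
const unsigned long EMPTY_SKETCH        = 450;          // bytes, "empty" sketch
const unsigned long DEVICE_BYTES        = 32UL * 1024;  // a 32 Kb device
const unsigned long BYTES_PER_CAPTURE   = 4;            // stated cost per capture

// Size added by the library: the difference between the two sketches.
unsigned long libraryBytes ()
{
  return SKETCH_WITH_LIBRARY - EMPTY_SKETCH;
}

// That size as a percentage of the device's memory.
double libraryPercent ()
{
  return 100.0 * libraryBytes () / DEVICE_BYTES;
}

// Memory taken by the capture table for a given maximum number of captures.
unsigned long captureTableBytes (unsigned long maxCaptures)
{
  return maxCaptures * BYTES_PER_CAPTURE;
}
```

With these numbers, libraryBytes () is 2392, libraryPercent () is a little over 7, and the default 32 captures account for 128 bytes, so cutting the limit to 8 captures would save 96 bytes.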
Error handling
The library "throws" exceptions by doing a non-local goto (longjmp), in exactly the same way Lua does. This keeps the code compact and simple. If there is a parsing problem then the library returns a negative number as the result of the regexp call. You can interpret those to tidy up your regular expressions to make them work properly.
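For anyone who has not seen the setjmp/longjmp technique before, the following is a minimal generic sketch of it. It is not the library's actual code: the error codes, the function names and the toy "unbalanced brackets" check are all invented for the illustration (the real codes are in Regexp.h).

```cpp
#include <setjmp.h>

// Hypothetical result codes for this sketch only (not the values in Regexp.h).
const int SKETCH_OK             = 1;
const int SKETCH_ERR_UNBALANCED = -1;

static jmp_buf errorJump;        // where to resume after an error
static volatile int lastError;   // which error was "thrown"

// "Throw": record the error and jump straight back to the setjmp call.
static void throwError (const int code)
{
  lastError = code;
  longjmp (errorJump, 1);
}

// Toy check that might sit deep inside a parser: are the brackets balanced?
// A '%' escapes the character that follows it, as in Lua patterns.
static void checkBrackets (const char * pattern)
{
  int depth = 0;
  for (const char * p = pattern; *p; p++)
  {
    if (*p == '%' && p [1] != '\0')
    {
      p++;        // skip the escaped character
      continue;
    }
    if (*p == '(')
      depth++;
    else if (*p == ')' && --depth < 0)
      throwError (SKETCH_ERR_UNBALANCED);
  }
  if (depth != 0)
    throwError (SKETCH_ERR_UNBALANCED);
}

// "Catch": the only place that has to know an error can happen.
int validatePattern (const char * pattern)
{
  if (setjmp (errorJump) != 0)
    return lastError;   // we arrived here via longjmp
  checkBrackets (pattern);
  return SKETCH_OK;
}
```

The attraction is that none of the intermediate functions needs to test and pass back an error code; the price is that longjmp skips any clean-up between the throw and the catch, so it suits code like this that holds no resources.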
Usage
You need to include the library:
#include <Regexp.h>
Since C++ functions, unlike Lua functions, cannot return multiple results (eg. all the captures), the MatchState structure is used to communicate with the library. You start by setting up the string to be searched, and optionally its length:
MatchState ms;
ms.Target ("Testing: answer=42");
You can supply either a zero-terminated string (like the above) or a char buffer and a length.
Then you call the Match method of the MatchState structure, supplying the regular expression string itself, and a zero-relative offset to commence searching from. The function returns 1 on a match, 0 on no match, and a negative number for a parsing error.
char result = ms.Match ("(%a+)=(%d+)", 0);
if (result == REGEXP_MATCHED)
{
  // matching offsets in ms.capture
}
else if (result == REGEXP_NOMATCH)
{
  // no match
}
else
{
  // some sort of error
}
The meanings of the various error codes are defined in Regexp.h.
If and only if you get a REGEXP_MATCHED result (that is, 1) then the capture array in the MatchState structure is set up to hold the address and length of each captured substring. You can use that information to index into your supplied string and extract the substrings, if required.
For example:
char buf [100];  // large enough to hold expected string
Serial.print ("Captures: ");
Serial.println (ms.level);
for (int j = 0; j < ms.level; j++)
{
  Serial.print ("Capture number: ");
  Serial.println (j + 1, DEC);  // display captures as 1, 2, ... (the index itself is zero-relative)
  Serial.print ("Text: '");
  Serial.print (ms.GetCapture (buf, j));
  Serial.println ("'");
}  // end of for each capture
The matching text itself (the whole match) is also stored as an offset and length, so you can index into your original string to extract it, if required. Often it is not: you may only need to know whether a match occurred.
char buf [100]; // large enough to hold expected string
Serial.print ("Matched on: ");
Serial.println (ms.GetMatch (buf));
The library does not attempt to "pre-extract" those strings for you on the grounds that to do so would take extra time and memory, which you, the user of the library, may not care to expend.
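If you would rather do the extraction yourself, an offset and a length are all you need. The following is a minimal sketch of the idea, not the library's code: copySpan and its parameters are hypothetical names, and the offsets used in the comments are the ones from the "Testing: answer=42" example earlier.

```cpp
#include <string.h>

// Copy `len` bytes starting at offset `start` of `target` into `dest`,
// adding a null-terminator. At most destSize - 1 bytes are copied, so a
// buffer that is too small truncates the text rather than overflowing.
// Returns dest so the call can be passed straight to a print function.
char * copySpan (const char * target,
                 unsigned int start, unsigned int len,
                 char * dest, unsigned int destSize)
{
  if (destSize == 0)
    return dest;             // nowhere to put even the terminator

  if (len > destSize - 1)
    len = destSize - 1;      // leave room for the 0x00 byte

  memcpy (dest, target + start, len);
  dest [len] = '\0';
  return dest;
}

// Example: the whole match in "Testing: answer=42" starts at 9, length 9:
//   char buf [100];
//   copySpan ("Testing: answer=42", 9, 9, buf, sizeof buf);   // "answer=42"
```

Because nothing is copied until you ask for it, a sketch that only tests whether a match occurred pays no extra RAM for substrings.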
MatchState methods
void MatchState::Target (const char * s);
void MatchState::Target (const char * s, const unsigned int len);
These let you supply the string to be searched (the target string). It can be null-terminated, in which case the library finds the end by doing a strlen on it, or you supply the length. If you have built up a buffer from incoming text you may prefer to just supply the length.
char MatchState::Match (const char * pattern, unsigned int index = 0);
This performs the match based on the supplied null-terminated pattern, and starting at the supplied index into the target string. By modifying the index parameter you can re-match further and further through the same target string, perhaps to keep finding the same sort of string (eg. a word).
The result of the match will be > 0 if successful, 0 if no match, and < 0 if an error occurred (invalid regular expression).
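The "re-match further and further" loop looks roughly like the sketch below. So that it can be compiled on its own, it uses a stand-in function, findDigits, which plays the part of a match call with the pattern "%d+"; that function, countMatches and their parameters are hypothetical and not part of the library. The point is the last line of the loop, where the next search index becomes the previous match start plus its length.

```cpp
#include <ctype.h>

// Stand-in for a match call (hypothetical, NOT the library): look for a run
// of digits, as the pattern "%d+" would, starting at `index`.
// Returns 1 and sets start/length on a match, 0 if there is no match.
char findDigits (const char * target, unsigned int index,
                 unsigned int & start, unsigned int & length)
{
  unsigned int i = index;
  while (target [i] != '\0' && !isdigit ((unsigned char) target [i]))
    i++;
  if (target [i] == '\0')
    return 0;                 // ran off the end: no match

  start = i;
  while (isdigit ((unsigned char) target [i]))
    i++;
  length = i - start;
  return 1;
}

// Count every match by restarting the search just past the previous one.
unsigned int countMatches (const char * target)
{
  unsigned int count = 0;
  unsigned int index = 0;     // where the next search begins
  unsigned int start = 0;
  unsigned int length = 0;

  while (findDigits (target, index, start, length) > 0)
  {
    count++;
    index = start + length;   // resume after the text just matched
  }
  return count;
}
```

One thing to watch with real patterns: if a pattern can match an empty string, the length is zero, so advance the index by at least one each time or the loop will never finish.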
char * MatchState::GetMatch (char * s);
After a successful match, this copies the matching string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on MatchLength which is calculated during the match. If no successful match was previously done, then an empty string is copied.
The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.
char * MatchState::GetCapture (char * s, const int n);
After a successful match, this copies the specified capture string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the capture string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on capture [n].len which is calculated during the match. If no successful match was previously done, or this capture does not exist, then an empty string is copied.
The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.
(edit) Version 1.1 uploaded 1st May 2011. This provides the extra "helper" functions documented just above.