New text / CSV / TSV / etc. parser library

I would like to introduce a lightweight and versatile text / CSV / TSV / etc. parser library based on the ideas introduced in this thread.

Introduction
The central idea is to use the types of the output variables themselves to determine the parsing behaviour. E.g., if we associate an int with a field, an integer parsing procedure is performed.
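To illustrate the idea, here is a minimal sketch of how the type of an output variable can select the parsing routine through function overloading. The name `parseField` is hypothetical and is not part of the library's API; the library itself may well work differently.

```cpp
#include <cstdlib>

// Hypothetical helpers (not the library's API): one overload per output type.
// The compiler picks the overload from the type of `out`.
inline void parseField(char const* field, int& out) {
  out = static_cast<int>(std::strtol(field, nullptr, 10));
}

inline void parseField(char const* field, double& out) {
  out = std::strtod(field, nullptr);
}

inline void parseField(char const* field, float& out) {
  out = std::strtof(field, nullptr);
}
```

With overloads like these, changing a variable from `int` to `float` changes the conversion that is performed, without touching the call site.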

This tight coupling between fields and output variables eliminates the need for a preconfigured, static line format. It also keeps the code compact: changing the type of a field only requires changing the type of the output variable, not code in several different places.

All basic data types and arrays of basic types are supported. This library is type safe and does not use any dynamic memory.

Examples
If all fields are of the same type, we can extract the data using an array.

TextParser parser(", ");
int a[5];
parser.parseLine("1, 2, 3, 4, 5", a);
// `a` now contains values 1, 2, 3, 4 and 5.

If we want to change the type from int to float, we simply change the type of the output variable.

float a[5];
parser.parseLine("1, 2, 3, 4, 5", a);
// `a` now contains values 1.0f, 2.0f, 3.0f, 4.0f and 5.0f.

This also works for multidimensional arrays.

int a[2][3];
parser.parseLine("1, 2, 3, 4, 5, 6", a);
// `a[0]` now contains 1, 2 and 3, `a[1]` contains 4, 5 and 6.
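One way arrays of any dimension could be handled is with a function template that recurses over the array elements. The sketch below assumes a hypothetical `parseNext` helper that reads one integer field and advances a cursor; it is meant as an illustration, not as the library's implementation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: parse one integer field and move `cursor` past the
// delimiter that follows it (if any).
inline void parseNext(char const*& cursor, char const* delimiter, int& out) {
  char* end = nullptr;
  out = static_cast<int>(std::strtol(cursor, &end, 10));
  cursor = end;
  std::size_t const length = std::strlen(delimiter);
  if (std::strncmp(cursor, delimiter, length) == 0) {
    cursor += length;
  }
}

// Arrays are parsed element by element. For `int[2][3]`, each element is an
// `int[3]`, so this template is selected again for the inner arrays.
template <class T, std::size_t n>
void parseNext(char const*& cursor, char const* delimiter, T (&out)[n]) {
  for (std::size_t i = 0; i < n; ++i) {
    parseNext(cursor, delimiter, out[i]);
  }
}
```

Because the array sizes are deduced at compile time, no dynamic memory is needed for this approach.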

If the fields have different types, we can use multiple variables.

char a[10];
int b;
double c;
parser.parseLine("one, 2, 3.4", a, b, c);
// `a` now contains "one", `b` contains 2 and `c` contains 3.4.
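Accepting any number of output variables of different types can be done with a variadic template. The following is a rough sketch under my own assumptions: the names `parseFields`, `parseField` and `skipDelimiter` are hypothetical, and string fields are left out to keep it short.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: skip the delimiter if the cursor points at one.
inline void skipDelimiter(char const*& cursor, char const* delimiter) {
  std::size_t const length = std::strlen(delimiter);
  if (std::strncmp(cursor, delimiter, length) == 0) {
    cursor += length;
  }
}

inline void parseField(char const*& cursor, char const* delimiter, int& out) {
  char* end = nullptr;
  out = static_cast<int>(std::strtol(cursor, &end, 10));
  cursor = end;
  skipDelimiter(cursor, delimiter);
}

inline void parseField(char const*& cursor, char const* delimiter, double& out) {
  char* end = nullptr;
  out = std::strtod(cursor, &end);
  cursor = end;
  skipDelimiter(cursor, delimiter);
}

// Base case: no output variables left, so the rest of the line is ignored.
inline void parseFields(char const*, char const*) {}

// Recursive case: parse one field into `first`, then handle the remaining
// output variables with the remainder of the line.
template <class T, class... Ts>
void parseFields(char const* cursor, char const* delimiter, T& first, Ts&... rest) {
  parseField(cursor, delimiter, first);
  parseFields(cursor, delimiter, rest...);
}
```

In this sketch the output variables drive the parsing, so any fields beyond the last variable are simply never looked at.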

An end of line string can be provided to strip newlines or other symbols we do not care about.

TextParser parser(" ", ".");
char words[5][6];
parser.parseLine("This is a nice line.", words);
// `words` now contains "This", "is", "a", "nice" and "line".
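Conceptually, stripping an end of line string amounts to ignoring a known suffix of the line. A small sketch of that idea, using a hypothetical helper `lengthWithoutEol` (this is an assumption about the behaviour, not the library's code):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: length of `line` when a trailing `eol` string, if
// present, is not counted.
inline std::size_t lengthWithoutEol(char const* line, char const* eol) {
  std::size_t const lineLength = std::strlen(line);
  std::size_t const eolLength = std::strlen(eol);
  if (eolLength <= lineLength &&
      std::strcmp(line + lineLength - eolLength, eol) == 0) {
    return lineLength - eolLength;
  }
  return lineLength;
}
```

The same approach would work for "\r\n" or any other terminator.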

Further reading
The source is available on GitHub and the code is released under the MIT open source license. More information can be found in the online documentation. There is also an online demo available for testing purposes.

I hope this library is of use to you. Any feedback is welcome of course.


damn, you've been busy :wink:. Nice work!

I just adapted the Serial input basics example (Serial Input Basics - updated - #3 by Robin2) to use the library, and it seems to work great. The result is much cleaner looking and more versatile!

void parseData() {      // split the data into its parts

  TextParser parser(",");

  parser.parseLine(tempChars, messageFromPC, integerFromPC, floatFromPC);

}


However, in the example above, if one of the words is longer than 5 characters and/or there are 6 words in total, can that cause the char array to be accessed "out of bounds", or are the extra characters/words just ignored?

Thanks for the compliment, but it really was not that much work (it is only 84 lines of code in total according to cloc).

The caller is responsible for the memory allocation of the output variables, so if a word is longer than 5 characters, a buffer overflow will occur in the current implementation. It should be easy to guard against this by letting the size of the output string, rather than the length of the field, determine how many characters are copied. I will put this on the to do list for the next release. Thank you for pointing this out.

If there are more words than output variables, the remaining data is ignored, since the structure of the output variables determines how the input is parsed (except for the flaw you pointed out).

Whoops, I got that back to front. I meant "more than 6 characters and/or more than 5 words (arrays)", but you got the point.

Yeah seems not too hard to implement. Just performing a sizeof() for each array and then using a for loop perhaps.

This library should be super handy for some debugging purposes I have.

I added a guard against string buffer overflows. The new version should be available in the package manager by now.

Example:

char words[5][4];
parser.parseLine("This is a nice line.", words);
// `words` now contains "Thi", "is", "a", "nic" and "lin".
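For those curious how such a guard can work: the size of a char array can be deduced by a template, so the copy can be limited to what fits. This is only a sketch with a hypothetical helper `copyField`; the actual implementation in the library may differ.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy at most n - 1 characters of a field into `out`
// and always terminate the result. The array size `n` is deduced by the
// compiler from the output variable.
template <std::size_t n>
void copyField(char const* field, std::size_t fieldLength, char (&out)[n]) {
  std::size_t const length = fieldLength < n - 1 ? fieldLength : n - 1;
  std::memcpy(out, field, length);
  out[length] = '\0';
}
```

One character is always reserved for the terminating zero, which is why a `char[4]` holds at most three characters of the field.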

[edit]

Something went wrong with the previous release (1.0.1). This is fixed in the current release (1.0.2), which should be available in the package manager now.


Update

To make this library a bit more user friendly, I added support for:

  • Text based boolean representations.
  • Categorical data.
  • Integers in bases other than 10.

These additions are included in version 1.1.0, which should be available in the package manager by now.


Booleans

By default, integers are used to represent boolean values, e.g.,

bool a;
parser.parseLine("0", a);  // `a` contains `false`.
parser.parseLine("1", a);  // `a` contains `true`.

In many cases however, a text representation is used. In this case we can define the truth value as a global string,

char const truth[] = "Yes";

and use this to create a variable of type Bool.

Bool<truth> a;
parser.parseLine("Yes", a);  // `a.value` contains `true`.
parser.parseLine("No", a);   // `a.value` contains `false`.
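The truth value has to be a global because a pointer that is used as a template argument must refer to an object with static storage duration. A simplified stand-in for such a type could look like the sketch below; `BoolSketch` and its `parse` member are hypothetical names used for illustration only, not the library's definition of `Bool`.

```cpp
#include <cstring>

// Hypothetical stand-in for a text based boolean: the truth string is part
// of the type, so no extra memory is needed per variable.
template <char const* text>
struct BoolSketch {
  bool value = false;

  // Any field that is not exactly equal to the truth string counts as false.
  void parse(char const* field) {
    value = std::strcmp(field, text) == 0;
  }
};

// Must be a global (static storage duration) to be usable as template argument.
char const truth[] = "Yes";
```

Note that in this sketch the comparison is case sensitive, so "yes" would yield `false`.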

Categorical data

For categorical data, we need to define a global list of labels that is terminated by a nullptr, similar to the global string we defined for the boolean truth value.

char const* labels[] = {"red", "green", "blue", nullptr};

These labels can then be used to create a variable of type Category.

Category<int, labels> a;
parser.parseLine("red", a);     // `a.value` contains 0.
parser.parseLine("blue", a);    // `a.value` contains 2.
parser.parseLine("yellow", a);  // `a.value` contains -1.
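The lookup behind this can be pictured as a linear search through the list until the terminating nullptr is reached. A minimal sketch, using a hypothetical helper `findLabel` (not a function of the library):

```cpp
#include <cstring>

// Hypothetical helper: index of `field` in a nullptr terminated list of
// labels, or -1 if the field matches none of them.
inline int findLabel(char const* field, char const* const* labels) {
  for (int i = 0; labels[i] != nullptr; ++i) {
    if (std::strcmp(field, labels[i]) == 0) {
      return i;
    }
  }
  return -1;
}
```

The terminating nullptr is what allows the search to stop without knowing the number of labels in advance.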

Integers in other bases

Integers in arbitrary bases are supported via the Number type.

Number<int, 16> a;  // Hexadecimal number.
Number<int, 2> b;   // Binary number.
parser.parseLine("0x1f, 101001", a, b);
// Now `a.value` contains 31, `b.value` contains 41.
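As a rough idea of how the base can be carried in the type, here is a sketch that forwards a template parameter to the standard `strtol` function. `NumberSketch` and its `parse` member are hypothetical names; this is not the library's definition of `Number`.

```cpp
#include <cstdlib>

// Hypothetical stand-in for a number in an arbitrary base: the base is a
// template parameter that is passed on to strtol.
template <class T, int base>
struct NumberSketch {
  T value = 0;

  void parse(char const* field) {
    value = static_cast<T>(std::strtol(field, nullptr, base));
  }
};
```

With base 16, `strtol` accepts an optional "0x" prefix, so both "0x1f" and "1f" are read as 31.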
