Tokenise a string in program memory

I have a long string in program memory containing subsequences separated by delimiters.
For example

F("AAA:BBB:CCC")

I don't want to pull the entire string into RAM as it's potentially more than would fit.
Instead I want to pull back "AAA", "BBB", "CCC" in turn from an iterator so I can process them individually.

I considered strtok_P; however, it seems to tokenise a string held in RAM using delimiters held in PROGMEM, which is the wrong way round for me. Additionally, strtok modifies its first argument, which also won't work for a string in program memory.


I imagine I'm looking for a function that behaves a bit like strtok but copies the tokens into a buffer that I provide.
My delimiter is just a single char at the moment, so something like the API below would work.
To avoid having to modify the source string, I'm proposing to pass in a working buffer.

char buf[100+1];
int len = tokenise(F("AAA:BBB:CCC"), ':', buf, 100); // pass NULL as the first arg in subsequent calls

  • Fills the given buffer with the next token, if there is one, and returns the number of chars in that token.
  • If no token was found, returns 0.
  • If the token is longer than the buffer (e.g. 100), fills the buffer with whatever fits and returns the length of the complete token (e.g. more than 100). In this case the rest of the token is skipped, and the next call starts at the next token.

I tried to work out how to read the progmem string byte by byte using pgm_read_byte() but it was beyond me as a beginner.
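For anyone stuck at the same point, here is a minimal sketch of reading a PROGMEM string one byte at a time with pgm_read_byte(). The names demoText and firstTokenLength are hypothetical, and the #else branch is only an assumption that lets the same code compile on a laptop, where flash and RAM are one address space.

```cpp
#include <stddef.h>

#ifdef ARDUINO
#include <Arduino.h>   // brings in PROGMEM and pgm_read_byte()
#else
// Host fallback (assumption, for off-target testing only).
#define PROGMEM
#define pgm_read_byte(p) (*(const unsigned char *)(p))
#endif

// Hypothetical test string stored in program memory.
const char demoText[] PROGMEM = "AAA:BBB:CCC";

// Count the characters before the first delimiter (or the end of the
// string), reading one byte at a time so nothing is copied into RAM.
size_t firstTokenLength(const char *flashStr, char delimiter)
{
  size_t n = 0;
  for (;;) {
    char c = (char)pgm_read_byte(flashStr + n);
    if (c == '\0' || c == delimiter) {
      return n;
    }
    n++;
  }
}
```

On an Arduino, a string wrapped in F() can be passed in by casting it first, e.g. firstTokenLength(reinterpret_cast<const char *>(F("AAA:BBB")), ':').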

Code tested on a laptop with the following case and results:

 xy:aaa:bbbbb:ccccccccc:ddd:eeeeeeee:ffff
  xy
  aaa
  bbbbb
  ccccc
  ddd
  eeeee
  ffff

The code simulates PROGMEM by referencing str only through a pointer (could be an index) and reading one character at a time.

#include <stdio.h>

// Copy the next token from s into buf (truncated to bufSize - 1 chars)
// and return a pointer to the start of the following token.
const char*
getTok (
    const char  *s,
    char        *buf,
    int          bufSize,
    char         delimiter )
{
    char  c;
    int   n = 0;

    for ( ; (c = *s) != 0; s++) {
        if (delimiter == c)
            break;
        if ((bufSize - 1) > n)
            buf [n++] = c;
      //printf ("      %s: %c  %d\n", __func__, c, n);
    }
    buf [n] = 0;

    if (delimiter == c)
        s++;            // step past the delimiter
    return s;
}

// ----------------------------
const char *str = "xy:aaa:bbbbb:ccccccccc:ddd:eeeeeeee:ffff";
#define B  6

void
application ()  {
    const char *s = str;
    char  buf [B];

    printf (" %s\n", str);

    do  {
        s = getTok (s, buf, B, ':');
        printf ("  %s\n", buf);
    } while (*s);
}

int
main ()  {
    application ();
    return 0;
}

This seems to do what you want:

void setup()
{
  char buf[7];

  Serial.begin(115200);
  while (!Serial);

  int len = tokenise(F("AAA:BBB::CCC:DDDDDDDDDDDD"), ':', buf, sizeof buf);

  do
  {
    Serial.print(len);
    Serial.print(" characters: \"");
    Serial.print(buf);
    Serial.println("\"");

    len = tokenise((__FlashStringHelper *)NULL, ':', buf, sizeof buf);
  } while (len >= 0);

  Serial.println("End of string.");
}

void loop() {}

// Copies the next token into buffer (truncated to bufLength - 1 characters).
// Returns the number of characters stored, or -1 once the string is used up.
// Pass the string on the first call and NULL on subsequent calls.
int tokenise(const __FlashStringHelper *FSH, const char delimiter, char *buffer, size_t bufLength)
{
  const char *s = (const char *) FSH;
  static const char *str = NULL; // Current read position in PROGMEM

  if (s != NULL)
  {
    str = s;
  }

  if (str == NULL) // Need a new string
  {
    buffer[0] = '\0';
    return -1;
  }

  size_t buffIndex = 0;

  while (1)
  {
    char c = pgm_read_byte(str++);

    if (c == '\0')
    {
      str = NULL; // Reached end of string
      buffer[buffIndex] = '\0';
      return buffIndex;
    }

    if (c == delimiter)
    {
      buffer[buffIndex] = '\0';
      return buffIndex;
    }

    if (buffIndex < bufLength - 1) // Characters that do not fit are skipped
    {
      buffer[buffIndex++] = c;
    }
  }
}

Rather than store information in program memory in a form that needs to be processed before use, consider storing that same information, but already processed.
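As an illustration of what "already processed" could look like, here is a minimal sketch that stores each token as its own PROGMEM string plus a PROGMEM table of pointers. The names tokA, tokenTable and copyToken are hypothetical, and the #else branch is only a host fallback assumed for testing off the board.

```cpp
#include <stddef.h>
#include <string.h>

#ifdef ARDUINO
#include <Arduino.h>   // PROGMEM, pgm_read_ptr(), strncpy_P()
#else
// Host fallback (assumption, for off-target testing only).
#define PROGMEM
#define pgm_read_ptr(p) (*(p))
#define strncpy_P strncpy
#endif

// Each token is stored separately in program memory...
const char tokA[] PROGMEM = "AAA";
const char tokB[] PROGMEM = "BBB";
const char tokC[] PROGMEM = "CCC";

// ...and a table of pointers, also in program memory, indexes them.
const char *const tokenTable[] PROGMEM = { tokA, tokB, tokC };
const size_t tokenCount = sizeof tokenTable / sizeof tokenTable[0];

// Copy token number `index` into buf, truncating if needed.
// Returns false if the index is out of range or the buffer is empty.
bool copyToken(size_t index, char *buf, size_t bufSize)
{
  if (index >= tokenCount || bufSize == 0) {
    return false;
  }
  const char *p = (const char *)pgm_read_ptr(&tokenTable[index]);
  strncpy_P(buf, p, bufSize - 1);
  buf[bufSize - 1] = '\0';   // strncpy does not terminate on truncation
  return true;
}
```

The cost of this layout is the pointer table itself; the benefit is that any token can be reached directly by index without scanning for delimiters.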

Thanks folks - will take a look!

(I think late processing - ie chopping up the string - will be more memory efficient)

gcjr:
Code tested on a laptop with the following case and results:


Thanks - I need it to work on program memory. ie F("some progmem string")

That logic looks spot on.
It will help avoid the pointless materialisation of the full string in RAM.
This function seems generally useful - surprising it's not in the stdlib?
johnwasser, you might want to make a PR to add this to the PM lib alongside strtok_P, which isn't that useful by comparison. A few versions seem useful: your tokenise(FSH, char), and also tokenise(FSH, char*) and tokenise(FSH, FSH).
Anyway, thanks loads.
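To make the tokenise(FSH, char*) idea concrete, here is a rough sketch of a variant that accepts a set of delimiter characters held in RAM. It is not johnwasser's code: the names nextTokenAny and sampleText are hypothetical, it tracks its position through a caller-owned cursor instead of a static variable, and it returns the full token length even when the copy is truncated, as in the original spec. The #else branch is a host fallback assumed for testing.

```cpp
#include <stddef.h>
#include <string.h>

#ifdef ARDUINO
#include <Arduino.h>   // PROGMEM and pgm_read_byte()
#else
// Host fallback (assumption, for off-target testing only).
#define PROGMEM
#define pgm_read_byte(p) (*(const unsigned char *)(p))
#endif

// Hypothetical test string using two different delimiters.
const char sampleText[] PROGMEM = "A,B;CCCCCC";

// Copy the next token from the PROGMEM string at `cursor` into buf.
// Any character in `delims` (a RAM string) ends the token.
// Returns the full length of the token, which exceeds bufSize - 1 when
// the copy was truncated, or -1 when no input is left.
// `cursor` is advanced past the token and set to NULL at the end.
int nextTokenAny(const char *&cursor, const char *delims, char *buf, size_t bufSize)
{
  if (cursor == NULL) {
    if (bufSize > 0) buf[0] = '\0';
    return -1;
  }
  size_t len = 0;
  for (;;) {
    char c = (char)pgm_read_byte(cursor++);
    if (c == '\0') {          // end of the source string
      cursor = NULL;
      break;
    }
    if (strchr(delims, c) != NULL) {
      break;                  // hit one of the delimiters
    }
    if (len + 1 < bufSize) {
      buf[len] = c;           // store only what fits
    }
    len++;                    // but keep counting the whole token
  }
  if (bufSize > 0) {
    buf[len < bufSize ? len : bufSize - 1] = '\0';
  }
  return (int)len;
}
```

Because the cursor lives in the caller, two strings can be tokenised at the same time, which the static-pointer approach does not allow.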

johnwasser:
This seems to do what you want:


johnlon:
(I think late processing - ie chopping up the string - will be more memory efficient)

That is going to depend a lot on how you are going to use the strings. The strings themselves will not take any less memory to store, since each delimiter takes the same amount of memory that a terminating null would. You will save the memory needed to store an array of pointers to the strings (two bytes per string) and avoid the double reference to PROGMEM needed to pull a string out via an array of pointers, but you sacrifice the ability to use the strings directly from program memory without copying to RAM.

johnwasser:
This seems to do what you want:

Yes thanks John that worked !

johnlon:
I imagine I'm looking for a function that behaves a bit like strtok but copies the tokens into a buffer that I provide.
My delimiter is just a single char at the moment, so something like the API below would work.
To avoid having to modify the source string, I'm proposing to pass in a working buffer.

I know a solution has already been accepted, but this problem intrigued me. So, after some tinkering, I came up with this, just in case somebody wants to use it.

// tokenize string in progmem - testing delimiterPtr for termination

// https://forum.arduino.cc/index.php?topic=663437.0

/* This demo utilizes memccpy_P().

    https://www.nongnu.org/avr-libc/user-manual/group__avr__pgmspace.html

    This function is similar to memccpy() except that src is pointer to a
    string in program space.

    void * memccpy_P ( void *  dest,
                     const void *  src,
                     int   val,
                     size_t  len
                     )

    Description of memccpy(): Copy memory area.

  The memccpy() function copies no more than len bytes from memory area src to
  memory area dest, stopping when the character val is found.

  Returns
  The memccpy() function returns a pointer to the next character in dest after
  val, or NULL if val was not found in the first len characters of src.
*/
//--------------------------------------------
// some test strings
//const char testData[] PROGMEM = "12345:67890:hello world:2121:344255:00000";
const char testData[] PROGMEM = "12:34:56:78:90:11:22:33:44:55:2121:344255:00000";
//const char testData[] PROGMEM = "xy:aaa:bbbbb:ccccccccc:567:eeeeeeee:ffff";

const byte segmentSize = 15;
char scratchPad[segmentSize]; // A buffer to pull the progmem bytes into for printing
PGM_P nextSegmentPtr = testData;// declare two pointers to the progmem string
PGM_P delimiterPtr = testData;

void setup() {
  Serial.begin(115200);
  byte len = 0;
  while (delimiterPtr) {
    delimiterPtr = (char*)memccpy_P(scratchPad, nextSegmentPtr, ':', segmentSize); // Locate character in progmem string.
    len = delimiterPtr - scratchPad;
    scratchPad[len-1] = '\0'; // Add terminator to extracted characters.  Note: len points to character following delimiter
    Serial.println(scratchPad);
    nextSegmentPtr += len; // Reposition to next segment
  }
  Serial.println("\n*** done ***");
}

void loop() {

}

No claims are made to elegance, or even proper programming but it does seem to work.

Hi dougp, thanks.
Does the proposal skip over the remaining chars when buf is too small to take the entire token?
I think it probably returns a partial token but doesn't skip the remaining bit?

So, I guess I didn't test hard enough. It terminates on a segment too big for the buffer. I'll tinker some more. No guarantees. :wink:
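One possible way to handle the oversized-segment case while still using memccpy_P() is sketched below. This is not dougp's code: the names copyNextToken and sampleData are hypothetical, it assumes the avr-libc functions memccpy_P(), strlen_P() and strchr_P() are available on the target core, and the #else branch is a host fallback assumed for testing on a laptop.

```cpp
#include <stddef.h>
#include <string.h>

#ifdef ARDUINO
#include <Arduino.h>   // PROGMEM, PGM_P and the *_P string functions
#else
// Host fallback (assumption, for off-target testing only).
#define PROGMEM
typedef const char *PGM_P;
#define memccpy_P memccpy
#define strlen_P  strlen
#define strchr_P  strchr
#endif

// Hypothetical test string with one segment longer than the buffer.
const char sampleData[] PROGMEM = "xy:ccccccccc:ddd";

// Copy the next segment at src into buf, truncating it if it does not fit.
// Returns a pointer to the start of the following segment, or NULL if this
// was the last one. bufSize must be at least 1.
PGM_P copyNextToken(PGM_P src, char delimiter, char *buf, size_t bufSize)
{
  // Never ask memccpy_P for more than remains in the string or fits in buf.
  size_t avail = strlen_P(src);
  size_t n = (avail < bufSize - 1) ? avail : bufSize - 1;

  char *end = (char *)memccpy_P(buf, src, delimiter, n);
  if (end != NULL) {
    end[-1] = '\0';            // overwrite the copied delimiter
    return src + (end - buf);  // next segment starts right after it
  }

  // No delimiter within n bytes: last segment, or segment too long.
  buf[n] = '\0';
  PGM_P rest = strchr_P(src + n, delimiter);  // skip what did not fit
  return (rest != NULL) ? rest + 1 : NULL;
}
```

A caller would loop with something like `for (PGM_P p = sampleData; p != NULL; ) { p = copyNextToken(p, ':', buf, sizeof buf); Serial.println(buf); }`. Calling strlen_P() on every segment rescans the rest of the string, so for long strings the byte-by-byte pgm_read_byte() loop is likely the cheaper option.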

Further related discussion here: "strpbrk() works, strpbrk_P does not - Why? [SOLVED]" (Programming Questions, Arduino Forum)