Weird behavior with PROGMEM and escaped chars

Hi all,

I have stored all my text descriptions by using PROGMEM and my program is working like a charm.

Later, I thought I could save flash space by representing the most common words as tokens mixed into the text strings.
Here is the idea: when the program encounters a token, it expands it into the corresponding word picked from a “vocabulary” (i.e. another array of strings).
In this way I can save a lot of flash space! (I’m very short on space :frowning: )

I thought I would use codes greater than 127 as tokens. Those values never occur when the string contains only plain ASCII text.
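A minimal sketch of the token idea, written as plain desktop C++ so it can be tried outside the Arduino IDE. The names `vocabulary` and `expandTokens`, the rule "token 0x80 + i stands for word i", and the use of `std::string` are all assumptions for illustration; on the board the bytes would be read from flash (for example with `pgm_read_byte`) and printed one by one instead of being collected in a string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical vocabulary: the byte 0x80 + i stands for vocabulary[i].
static const char* const vocabulary[] = { "does not", "is not", "airplane" };
static const std::size_t vocabularySize =
    sizeof(vocabulary) / sizeof(vocabulary[0]);

// Expand a NUL-terminated byte string, replacing each token byte
// (value >= 0x80 with a matching vocabulary entry) by its word.
std::string expandTokens(const unsigned char* text) {
  std::string out;
  for (; *text != 0; ++text) {
    const unsigned char c = *text;
    if (c >= 0x80 && static_cast<std::size_t>(c - 0x80) < vocabularySize) {
      out += vocabulary[c - 0x80];   // token: append the whole word
    } else {
      out += static_cast<char>(c);   // plain text: copy the byte as-is
    }
  }
  return out;
}
```

Building the packed text from an initializer list such as `{ 'i', 't', ' ', 0x81, 0 }` rather than from a string literal keeps escape sequences out of the picture entirely.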

Unfortunately, the above approach does not seem to work. Sometimes the tokens were expanded, sometimes not. Why?
After digging for many hours, I discovered that whether the same byte (one char) is translated into the right numeric value depends on the char following it. Strange, isn’t it?

The following example sketch just prints out the numeric value of each char of a 4-char string (which is followed by the string terminator):

#define   ROWS   5
#define   COLS   5

static const unsigned char faultDescriptions[5][18] PROGMEM =
{
  "abcd",
  "\x87\x86\x85\x84", 
  "\x87\x86aa",
  "\x87\x86bb",
  "\x87\x86ii"
};

void setup() {
  // put your setup code here, to run once:
  Serial.begin(9600);

  char descr[COLS];
  unsigned char value;

  for(int row=0;row<ROWS;row++)
  {
    Serial.print("row ");
    Serial.print(row);
    Serial.print(" = ");
    
    // copy the first COLS bytes of this row from flash into RAM
    memcpy_P(descr, faultDescriptions[row], COLS);
    
    // print row (skip last char: NUL terminator)
    
    for(int col=0;col<COLS-1;col++)
    {
      value = (unsigned char) descr[col];
      Serial.print(value);
      Serial.print(',');
    }
    Serial.println();
  }
}

void loop() { }

The produced output is:

row 0 = 97,98,99,100,        (Great! These are the ASCII values for the "abcd" string. Computers are better than humans!)
row 1 = 135,134,133,132,     (Marvelous! These are exactly the hex values I typed in the PROGMEM!)
row 2 = 135,170,0,0,         (Wait... why is the second char not 134?)
row 3 = 135,187,0,0,         (Hey! Why has the second char changed again!?)
row 4 = 135,134,105,105,     (Now the second char is right. Are you kidding me, stupid computer???)

I’m going crazy. Maybe the solution is in front of my eyes and I cannot see it. But it is hard to understand why the value of \x86 depends on what character I type after it in the PROGMEM section.

Maybe, for some obscure reason, memcpy_P copies words, not bytes, so the final result is influenced by that?
Or maybe, for some obscure reason, bytes greater than 127 are considered signed values somewhere?
Or maybe… what?

Any idea?

Because e.g. \x86aa is a valid hex number and \x86ii is not? A hex escape keeps consuming characters for as long as they are hex digits, so the compiler reads \x86aa as one value and then truncates it to fit in an unsigned char.

These are the warnings that I get:

C:\Users\sterretje\AppData\Local\Temp\arduino_modified_sketch_52481\sketch_may06a.ino:9:3: warning: hex escape sequence out of range

   "\x87\x86aa",

   ^

C:\Users\sterretje\AppData\Local\Temp\arduino_modified_sketch_52481\sketch_may06a.ino:10:3: warning: hex escape sequence out of range

   "\x87\x86bb",

   ^
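The numbers in the output are consistent with that explanation. If the compiler reads \x86aa as the single value 0x86AA and keeps only the low byte, the result is 0xAA = 170; \x86bb gives 0xBB = 187. A minimal sketch of that arithmetic follows, where `keepLowByte` is a hypothetical helper and "keep the low 8 bits" is an assumption about what this compiler does after emitting the warning (other compilers may reject the literal outright).

```cpp
// Hypothetical helper: model the truncation of an out-of-range hex
// escape by keeping only the low 8 bits of the value that was parsed.
unsigned char keepLowByte(unsigned long parsedValue) {
  return static_cast<unsigned char>(parsedValue & 0xFFUL);
}
```

With this model, `keepLowByte(0x86AA)` is 170 and `keepLowByte(0x86BB)` is 187, matching rows 2 and 3 of the output. The two zeros that follow in those rows are simply the string terminator and the unused rest of the row.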

Hi sterretje, thanks for your reply.

Yeah, I noticed that the hex value changes when I type chars in the A/a-F/f range... but I thought it was just my imagination.
I supposed that strings (array of chars) are interpreted byte-by-byte in any case. I was wrong.

Any workaround? What can I type in the PROGMEM section when an a|b|c|d|e|f char follows my hex code?

I'm not 100% sure this works, but try with:

"\x87\x86""aa"

In order to break the escape sequence.
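A sketch of why this should work, assuming a standard C++ compiler: escape sequences are converted before adjacent string literals are joined, so the hex escape ends at the closing quote of the first literal. Octal escapes are another option, because they take at most three digits. The array names below are hypothetical, and the code is plain desktop C++ without PROGMEM.

```cpp
#include <cstddef>

// Adjacent literals: "\x86" is complete before "aa" is appended,
// so the bytes are 0x87, 0x86, 'a', 'a', and the terminating 0.
static const unsigned char splitLiteral[] = "\x87\x86" "aa";

// Octal escapes stop after three digits, so no split is needed:
// \207 is 0x87 and \206 is 0x86.
static const unsigned char octalLiteral[] = "\207\206aa";

static const std::size_t splitLiteralSize = sizeof(splitLiteral);
static const std::size_t octalLiteralSize = sizeof(octalLiteral);
```

Both arrays occupy five bytes, so neither form costs an extra byte compared with the intended string.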

You can put a space between '\x86' and 'aa'; don't know if it suits you. What is 'aa' supposed to be; part of the 'replacement string' or text? I suspect the latter.

By the way, if you're worried about flash memory usage, why do you seem to piss away flash memory by using fixed-width strings and only using part of that fixed width (in your example)? It's probably a fine balance, but usually one stores single strings (not an array of them) in PROGMEM and then stores an array of pointers to those strings; this array can be in RAM or PROGMEM. If you have lots of short strings and only a few long ones, it pays off.
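A minimal sketch of that layout and of the size trade-off, in plain desktop C++. The message texts and all names are made up for illustration, and the cost functions assume 2-byte pointers as on an 8-bit AVR. On the board, both the strings and the table would carry PROGMEM and be read through the pgmspace helpers rather than dereferenced directly.

```cpp
#include <cstddef>
#include <cstring>

// Each message is its own array, so it occupies only strlen + 1 bytes.
static const char msg0[] = "abcd";
static const char msg1[] = "a much longer description";
static const char msg2[] = "ok";

// Table of pointers to the individual messages.
static const char* const messages[] = { msg0, msg1, msg2 };
static const std::size_t messageCount = sizeof(messages) / sizeof(messages[0]);

// Return message i, or an empty string if i is out of range.
const char* getMessage(std::size_t i) {
  return i < messageCount ? messages[i] : "";
}

// Bytes needed with a pointer table, assuming 2-byte pointers (AVR).
std::size_t pointerTableCost() {
  std::size_t total = 0;
  for (std::size_t i = 0; i < messageCount; ++i) {
    total += std::strlen(messages[i]) + 1 + 2;  // text + NUL + pointer
  }
  return total;
}

// Bytes needed if every row is as wide as the longest message.
std::size_t fixedWidthCost() {
  std::size_t widest = 0;
  for (std::size_t i = 0; i < messageCount; ++i) {
    const std::size_t len = std::strlen(messages[i]) + 1;
    if (len > widest) widest = len;
  }
  return widest * messageCount;
}
```

For these three sample messages the pointer table needs 40 bytes against 78 for fixed-width rows; with messages of nearly equal length the pointers could cost more than they save.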

Danois90:
I'm not 100% sure this works, but try with:

"\x87\x86""aa"

In order to break the escape sequence.

It works!!!! :smiley:

row 0 = 97,98,99,100,0,
row 1 = 135,134,133,132,0,
row 2 = 135,134,97,97,0,
row 3 = 135,134,98,98,0,
row 4 = 135,134,105,105,0,

A big huge thanks!

sterretje:
You can put a space between '\x86' and 'aa'; don't know if it suits you. What is 'aa' supposed to be; part of the 'replacement string' or text? I suspect the latter.

Yeah, "aa" is an example of text. It can be any text like "does not", "is not", "airplane", "star trek", and so on.
If I put in an extra space then I'm going to waste 1 byte for every special char, and many rows have two or even three special chars (I have 60 rows, so I'm going to lose around 100 bytes).

sterretje:
By the way, if you're worried about flash memory usage, why do you seem to piss away flash memory by using fixed-width strings and only using part of that fixed width (in your example)? It's probably a fine balance, but usually one stores single strings (not an array of them) in PROGMEM and then stores an array of pointers to those strings; this array can be in RAM or PROGMEM. If you have lots of short strings and only a few long ones, it pays off.

Yeah, I thought about that. That is the next evolution step of the program. It will save another 190 bytes in my case.
Also, I will eliminate the NULL char terminating the string 'cause I'm going to read chars byte by byte.

Just one drawback: storing an array of pointers forces you to store the length of the text somewhere/somehow. So you're going to use 2 bytes per text description. In my case (60 rows) you need 120 bytes extra.
A better approach is to use a special char as separator and loop over the one-dimensional array until you reach the right text message.
(But I still have to check whether this will slow down the printing ops on a TFT display too much...)

Anyway thanks for the help guys! You're awesome!

gimpo:
Just one drawback: storing an array of pointers forces you to store the length of the text somewhere/somehow. So you're going to use 2 bytes per text description. In my case (60 rows) you need 120 bytes extra.

For the pointer, yes. For the length, no.

1) You can store the length in a byte (unless the message is longer than 255).
2) A C string is NUL-terminated, so you can use strlen_P. Or read byte by byte until you see the '\0', as you are considering; that's one extra byte per string, the same cost as a length byte.

gimpo:
A better approach is to use a special char as separator and loop over the one-dimensional array until you reach the right text message.
(But I still have to check whether this will slow down the printing ops on a TFT display too much...)

If you need the last string, it will indeed be slower than when trying to find the first string. And it will still cost you 60 bytes extra.
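A minimal sketch of the separator approach, in plain desktop C++. It assumes '\0' is used as the separator, so each entry is still an ordinary C string; `packedMessages` and `findMessage` are hypothetical names. On the board the scan would read the bytes from flash (for example with `strlen_P`) instead of calling `strlen` on RAM.

```cpp
#include <cstddef>
#include <cstring>

// All messages in one array, separated by '\0'. The string literal
// adds the final terminator itself, so the last entry is closed too.
static const char packedMessages[] = "abcd\0longer text\0ok";

// Walk the blob and return a pointer to entry `index`,
// or nullptr if the blob holds fewer entries than that.
const char* findMessage(const char* blob, std::size_t size, std::size_t index) {
  std::size_t pos = 0;
  while (pos < size) {
    if (index == 0) {
      return blob + pos;
    }
    pos += std::strlen(blob + pos) + 1;  // skip this entry and its separator
    --index;
  }
  return nullptr;
}
```

A call such as `findMessage(packedMessages, sizeof(packedMessages), 2)` has to step over every earlier entry first, which is the slowdown mentioned above, and the one separator byte per entry is where the 60 extra bytes for 60 rows come from.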