array of UTF-8 strings in FLASH (PROGMEM / strcpy_P / ...)

I want to have an array of UTF-8 strings in flash memory,
with some kind of “indexed access” to them.
[working with Win7 / IDE 1.6.6 / Leonardo]

For testing, I use an external terminal program that is fully UTF-8 capable
(not the built-in Arduino serial monitor).
(You can use e.g. PuTTY or the latest CoolTerm beta 1.4.6.b5.)

[in “real” UTF-8, every character is encoded in a variable number of bytes, from 1 to 4]
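To make the "1 to 4 bytes" rule concrete, here is a minimal sketch that reads the length of a UTF-8 sequence from its first byte. The function name utf8SequenceLength is made up for this illustration and is not part of the Arduino core; the code only looks at the lead byte and does not validate the bytes that follow.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not an Arduino API): returns how many bytes the UTF-8
// sequence starting at s occupies (1..4), judged from its first byte alone.
// Returns 0 if that byte is a continuation byte (10xxxxxx) or a value that
// can never start a sequence.
size_t utf8SequenceLength(const char* s) {
  assert(s != nullptr);
  const uint8_t lead = static_cast<uint8_t>(*s);
  if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx: e.g. µ, ö, Cyrillic
  if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx: e.g. €
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx: beyond U+FFFF
  return 0;                                    // continuation or invalid lead
}
```

For example, "µ" is stored as the two bytes C2 B5 and "€" as the three bytes E2 82 AC, so the function returns 2 and 3 for them.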

first I made a test with this sketch:

// --> http://playground.arduino.cc/Code/UTF-8
// Test for normal strings with UTF-8
// Public Domain

char three[] = "3µV";
const char four[] PROGMEM = "4µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱";
String five = "5µF";
char six[] = "60€";
String seven = "70€₡₢₣₤₥₦";

char buffer[80];

void setup() 
{
  Serial.begin( 9600);
#if defined (__AVR_ATmega32U4__)
  while(!Serial);        // For Leonardo, wait for serial port
#endif

  Serial.println("\n+++++++++++++++++++++++++++++++++++++++++");
  Serial.println("Use a serial terminal that supports UTF-8");

  Serial.println(F("1µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱"));        // Good, text in flash

  // copy a string from flash memory to a buffer.  
  sprintf_P( buffer, PSTR("2µH")); // Good, text in flash
  Serial.println( buffer);

  // copy a string in ram to a buffer
  strcpy( buffer, three);          // Good
  Serial.println( buffer);

  // add one to strlen for the zero terminator
  strncpy( buffer, three, strlen(three) + 1); // Good, strlen works with UTF-8 string
  Serial.println( buffer);

  strcpy_P( buffer, four);         // Good, text in flash with PROGMEM
  Serial.println( buffer);
  
  // copy a string in flash to buffer byte for byte
  for( int i = 0 ; i < sizeof( four) ; i++)  // Good, sizeof works with UTF-8 string
  {
    buffer[i] = pgm_read_byte( four + i);
  }
  Serial.println( buffer);

  Serial.println( five);           // Good, a String class with UTF-8 character
  
  Serial.print( "array of char: \"");
  Serial.print( six);
  Serial.print( "\", strlen=");
  Serial.println( strlen(six));
  
  Serial.print( "String object: \"");
  Serial.print( seven);
  Serial.print( "\", String.length()=");
  Serial.println( seven.length());

  Serial.println( "+++++++++++++++++++++++++++++++++++++++++");

  Serial.println( "Enter a UTF-8 character and press <enter>");
  Serial.println( "The hexadecimal value will be displayed.");
}

void loop()
{
  if( Serial.available())
  {
    Serial.print( "You have entered: ");
    delay(100);      // allow the rest of the line to be received.
    while( Serial.available())
    {
      byte c = Serial.read();
      if( c != '\r' && c != '\n')  // ignore trailing CR and LF
      {
        if( c <= 0x0F)
          Serial.print( "0");
        Serial.print( c, HEX);
        Serial.print( ", ");
      }
    }
    Serial.println();
  }
}

This works:
everything is output correctly in UTF-8 on the external UTF-8 terminal program.
As you can see, different techniques are used to store strings in flash and to print them out again (PROGMEM / the F() macro / strcpy_P).
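One detail behind the "strlen works with UTF-8" comments in the sketch: strlen and sizeof count bytes, not visible characters. That is exactly what is needed for sizing the buffer, but it is not the number of glyphs on screen. A minimal sketch of the difference, with a hypothetical helper name (utf8CodePointCount is not an Arduino function) and assuming the input is well-formed UTF-8:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: counts characters (code points) in a NUL-terminated
// UTF-8 string by counting every byte that is NOT a continuation byte.
// Assumes the input is well-formed UTF-8; it does not validate it.
size_t utf8CodePointCount(const char* s) {
  assert(s != nullptr);
  size_t count = 0;
  for (; *s != '\0'; ++s) {
    // Continuation bytes look like 10xxxxxx; all other bytes start a character.
    if ((static_cast<uint8_t>(*s) & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}
```

With the string "60€" from the sketch, stored as UTF-8, strlen reports 5 (two ASCII bytes plus three bytes for the euro sign), while this helper reports 3. Buffers must always be sized by the byte count plus one for the terminator.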

Then I tried the example from here, which stores an array of strings in flash:
https://www.arduino.cc/en/Reference/PROGMEM
(in the lower part of that page)
because it shows the “indexed access” to the strings that I would like to have.
(I modified the original example code with some UTF-8 strings.)

/*
 PROGMEM string demo
 How to store a table of strings in program memory (flash),
 and retrieve them.

 Information summarized from:
 http://www.nongnu.org/avr-libc/user-manual/pgmspace.html

 Setting up a table (array) of strings in program memory is slightly complicated, but
 here is a good template to follow.

 Setting up the strings is a two-step process. First define the strings.
*/

// #include <avr/pgmspace.h>
const char string_0[] PROGMEM = "ÄÖÜß²³°~|€@";   // "String 0" etc are strings to store - change to suit.
const char string_1[] PROGMEM = "String 1";
const char string_2[] PROGMEM = "4µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱";
const char string_3[] PROGMEM = "String 3";
const char string_4[] PROGMEM = "70€₡₢₣₤₥₦";
const char string_5[] PROGMEM = "String 5";
const char string_6[] PROGMEM = "¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿";
const char string_7[] PROGMEM = "Ä ä Ö ö Ü ü ß ç ñ ò ó ô õ ÷ ø ù ú û ý þ ÿ";
const char string_8[] PROGMEM = "Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę";
const char string_9[] PROGMEM = "т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і ї ј љ њ ћ ќ ѝ ў џ Ѡ ѡ"; 

// Then set up a table to refer to your strings.

const char* const string_table[] PROGMEM = {
  string_0, string_1, string_2, string_3, string_4, string_5,
  string_6, string_7, string_8, string_9
  };

char buffer[130];    // make sure this is large enough for the largest string it must hold

void setup()
{
  Serial.begin(9600);
  while(!Serial);
  Serial.println("OK");
}


void loop()
{
  /* Using the string table in program memory requires the use of special functions to retrieve the data.
     The strcpy_P function copies a string from program space to a string in RAM ("buffer").
     Make sure your receiving string in RAM  is large enough to hold whatever
     you are retrieving from program space. */


  for (int i = 0; i < 10; i++)
  {
    strcpy_P(buffer, (char*)pgm_read_word(&(string_table[i]))); // Necessary casts and dereferencing, just copy.
    Serial.println(buffer);
    delay( 500 );
  }
}

As you can see, two of the test strings are the same ones I used in the first sketch above,
but with this second sketch the strings are no longer output correctly on the external UTF-8 terminal program.

Why does that happen?

in the first example sketch strcpy_P is used directly with a variable name:

strcpy_P( buffer, four);         
Serial.println( buffer);

in the second example sketch strcpy_P is used with only a pointer instead:

for (int i = 0; i < 10; i++)
  {
    strcpy_P(buffer, (char*)pgm_read_word(&(string_table[i]))); // Necessary casts and dereferencing, just copy.
    Serial.println(buffer);
    delay( 500 );
  }

Is this the reason why the second example doesn’t work with variable-byte-length (UTF-8) strings?

How can I overcome this?
How can I store an array of UTF-8 strings in flash
and output them correctly on a UTF-8 terminal program?

I didn’t see an error in the sketch, so I tried it with an Arduino Leonardo.
The Arduino IDE 1.6.6 serial monitor in Linux shows the UTF-8 characters! That is a nice improvement. Only the character ‘ĕ’ is still missing, but the others are okay.
In Linux the program gtkterm shows every character correctly.

You described the problem well, but you forgot to mention what is not shown correctly, for example with a screenshot.

Thanks a lot for testing.
This shows that there’s no error in the sketch or in the processing of multi-byte chars on the Arduino side.

I will try another terminal program and test again.

[edit] With a fresh install of PuTTY everything works well on my side.
CoolTerm still does not work with some 4-byte UTF-8 chars and then scrambles the other (plain ASCII) lines as well.

But if I try this “extended example”,
nothing is displayed correctly; even PuTTY fails (see attachment).

What’s wrong with this?

“extended example”:

/*
 PROGMEM string demo
 How to store a table of strings in program memory (flash),
 and retrieve them.

 Information summarized from:
 http://www.nongnu.org/avr-libc/user-manual/pgmspace.html

 Setting up a table (array) of strings in program memory is slightly complicated, but
 here is a good template to follow.

 Setting up the strings is a two-step process. First define the strings.
*/

// #include <avr/pgmspace.h>
const char string_0[] PROGMEM = " ";
const char string_1[] PROGMEM = "examples of normal 1-byte-chars:";
const char string_2[] PROGMEM = "--------------------------------- ";
const char string_3[] PROGMEM = "! \" # $ % & ' ( ) * + , - . / ";
const char string_4[] PROGMEM = "[ \\ ] ^ _ ` { | } ~ ";
const char string_5[] PROGMEM = " ";
const char string_6[] PROGMEM = "examples of (windows-compatible) 2-byte-chars:";
const char string_7[] PROGMEM = "---------------------------------";
const char string_8[] PROGMEM = "¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿";
const char string_9[] PROGMEM = "Ä ä Ö ö Ü ü ß ç ñ ò ó ô õ ÷ ø ù ú û ý þ ÿ";
const char string_10[] PROGMEM = "Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę";
const char string_11[] PROGMEM = "т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і ї ј љ њ ћ ќ ѝ ў џ Ѡ ѡ ";
const char string_12[] PROGMEM = " ";
const char string_13[] PROGMEM = "examples of (windows-compatible) 3-byte-chars:";
const char string_14[] PROGMEM = "---------------------------------";
const char string_15[] PROGMEM = "ᄀ ᄁ ᄂ ᄃ ᄄ ᄅ ᄆ ᄇ ᄈ ᄉ ᄊ ᄋ ᄌ ᄍ ᄎ ᄏ ᄐ ᄑ ᄒ ᄓ ᄔ ᄕ ᄖ ";
const char string_16[] PROGMEM = "Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ";
const char string_17[] PROGMEM = "€ ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₸ ₹ ₺ ¢ £ ¬  ̄ ¦ ¥ ₩ │ ← ↑ → ↓ ■ ○ ";




// Then set up a table to refer to your strings.

const char* const string_table[] PROGMEM = {
  string_0, string_1, string_2, string_3, string_4,string_5, string_6, string_7, string_8, string_9, string_10,
  string_11, string_12, string_13, string_14, string_15, string_16, string_17
  };

char buffer[130];    // make sure this is large enough for the largest string it must hold

void setup()
{
  Serial.begin(9600);
  while(!Serial);
  Serial.println("OK");
}


void loop()
{
  /* Using the string table in program memory requires the use of special functions to retrieve the data.
     The strcpy_P function copies a string from program space to a string in RAM ("buffer").
     Make sure your receiving string in RAM  is large enough to hold whatever
     you are retrieving from program space. */


  for (int i = 0; i < 18; i++)
  {
    strcpy_P(buffer, (char*)pgm_read_word(&(string_table[i]))); // Necessary casts and dereferencing, just copy.
    Serial.println(buffer);
    delay( 500 );
  }
}

If I do it with normal println commands,
PuTTY shows everything correctly (see attachment):

void setup() 
{
  Serial.begin( 9600);
#if defined (__AVR_ATmega32U4__)
  while(!Serial);        // For Leonardo, wait for serial port
#endif
}

void loop()
{
  Serial.println( "examples of normal 1-byte-chars:");
  Serial.println( "---------------------------------");
  Serial.println( "! \" # $ % & ' ( ) * + , - . / ");
  Serial.println( "[ \\ ] ^ _ ` { | } ~"); 
  Serial.println();  
  Serial.println( "examples of (windows-compatible) 2-byte-chars:");
  Serial.println( "---------------------------------");
  Serial.println( "¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿");
  Serial.println( "Ä ä Ö ö Ü ü ß ç ñ ò ó ô õ ÷ ø ù ú û ý þ ÿ");
  Serial.println( "Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę");
  Serial.println( "т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і ї ј љ њ ћ ќ ѝ ў џ Ѡ ѡ"); 
  Serial.println();  
  Serial.println( "examples of (windows-compatible) 3-byte-chars:");
  Serial.println( "---------------------------------");
  Serial.println( "ᄀ ᄁ ᄂ ᄃ ᄄ ᄅ ᄆ ᄇ ᄈ ᄉ ᄊ ᄋ ᄌ ᄍ ᄎ ᄏ ᄐ ᄑ ᄒ ᄓ ᄔ ᄕ ᄖ"); 
  Serial.println( "Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ");
  Serial.println( "€ ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₸ ₹ ₺ ¢ £ ¬  ̄ ¦ ¥ ₩ │ ← ↑ → ↓ ■ ○"); 
  Serial.println();
  Serial.println( "examples of (windows-compatible) 4-byte-chars:");
  Serial.println( "---------------------------------");  
  Serial.println();  
  Serial.println();  
  delay(1000);
}

putty_with_normal_println.png

strcpy_P(buffer, (char*)pgm_read_word(&(string_table[i])));

strcpy_P already expects its argument to be in PROGMEM. Try this way and what happens:

strcpy_P(buffer, string_table[i]);

Delta_G: strcpy_P(buffer, (char*)pgm_read_word(&(string_table[i])));

strcpy_P already expects its argument to be in PROGMEM. Try this way and what happens:

strcpy_P(buffer, string_table[i]);

Thanks, I tried it, but it didn't work. It cannot work, since we do not want the content of string_table[ i ] itself in our text buffer. As you can see, string_table[] contains a list of addresses (pointers) ...

Delta_G, I think you are confused with a list of pointers to strings. The pointer is retrieved from the list (the list is in Flash and the pointer points to Flash memory), after that the string itself is copied from Flash. See the example : https://www.arduino.cc/en/Reference/PROGMEM
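The two steps described above (fetch the pointer from the table in flash, then copy the string it points to, also from flash) can be sketched as a small bounds-checked helper. This is an illustration, not the reference example: copyFlashString and the demo_* names are made up here, and the #else branch provides host stand-ins so the same code also compiles on a PC, where "flash" is just ordinary memory. The index check also guards against looping past the end of the table, which was the 26-versus-18 mismatch mentioned in this thread.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__AVR__)
  #include <avr/pgmspace.h>
#else
  // Host stand-ins (assumption for illustration only): no separate flash
  // address space, so reading "from flash" is a plain memory read.
  #define PROGMEM
  #define pgm_read_word(addr) (*(addr))
  #define strncpy_P strncpy
#endif

// Hypothetical demo table; the escapes spell out the UTF-8 bytes of "4µ".
const char demo_0[] PROGMEM = "String 1";
const char demo_1[] PROGMEM = "4\xC2\xB5";
const char* const demo_table[] PROGMEM = { demo_0, demo_1 };
const size_t demo_count = sizeof(demo_table) / sizeof(demo_table[0]);

// Copies entry `index` of a flash-resident pointer table into `out`.
// Returns false if the index is out of range or the buffer has no room.
bool copyFlashString(const char* const* table, size_t count, size_t index,
                     char* out, size_t outSize) {
  assert(table != nullptr && out != nullptr);
  if (index >= count || outSize == 0) return false;
  // Step 1: read the pointer itself out of the table (the table is in flash).
  const char* src = (const char*)pgm_read_word(&table[index]);
  // Step 2: copy the bytes that pointer refers to (the string is in flash too).
  strncpy_P(out, src, outSize - 1);
  out[outSize - 1] = '\0';  // strncpy_P does not terminate on truncation
  return true;
}
```

In a sketch this would be called as copyFlashString(string_table, 18, i, buffer, sizeof(buffer)). Note that truncation works on bytes, so a too-small buffer can cut a multi-byte character in half; sizing the buffer for the longest string avoids that.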

Dirk67, In linux everything is fine ! It's okay, no problem at all, all characters good. The sketch prints 26 strings and the list has only 18 strings, but beside that, the characters are okay.

I'm going to restart into Windows10.

Koepel:
Dirk67, In linux everything is fine ! It’s okay, no problem at all, all characters good.
The sketch prints 26 strings and the list has only 18 strings, but beside that, the characters are okay.

Thanks again, that is really good to know;
it means the code is fine.

It is weird:
after uploading to my Leonardo 4 or 5 times, it suddenly works on the PuTTY console without any errors…

Then I uploaded one more time to the Leonardo (exactly the same code) and the wrong output was back on the PuTTY console.

I think it is something with my PC or with PuTTY or with the USB connection or whatever … no idea :-/

[edit]:
after a random number of uploading to my leonardo (again and again) with the IDE
I get the following good result in putty → see attachment
after uploading one more time you get again a totally wrong output on the putty console …

can it be something with the bootloader ?

putty_after_new_upload_OK.png

OK now I found out something:

If I upload 10 times with IDE 1.6.6* (AVR 1.6.9),
I get a good result only 2 times (correct UTF-8 output on PuTTY).

If I upload 10 times with IDE 1.6.5* (AVR 1.6.8),
I get a good result every time (correct UTF-8 output on PuTTY).

what the hell … ???

*(both are portable versions installed on the same PC (Win7 64 bit) at the same time)

Windows 10 64-bit. Serial terminals testing UTF-8:

Arduino 1.6.6 serial monitor : bad
Q Serial Terminal : bad
RealTerm : crashed, not tested.
PuTTY 0.64 : good
PuTTY beta 0.66 : good
YAT : would not work with Leonardo, crashed.
Hype! Terminal : bad
Tera Term 4.88 : bad
Hilgraeve HyperTerminal Private Edition 7.0 Free Trial : bad

My PuTTY is always correct. Sometimes it has a rough start. When the Arduino is busy sending UTF-8 characters and I start PuTTY, then sometimes PuTTY starts in the middle of a UTF-8 character. It resynchronizes quickly once a few normal characters have been received.
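The reason a receiver can recover from starting mid-character is that UTF-8 continuation bytes are always of the form 10xxxxxx, so they can be told apart from the start of a character. A minimal sketch of that idea follows; utf8Resync is a made-up name, and this is not a claim about how PuTTY is implemented internally.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns the index of the first byte at or after
// `start` that is not a continuation byte (10xxxxxx), i.e. the first place
// where a complete character can begin. Returns len if there is none.
size_t utf8Resync(const char* data, size_t len, size_t start) {
  assert(data != nullptr);
  size_t i = (start < len) ? start : len;
  while (i < len && (static_cast<uint8_t>(data[i]) & 0xC0) == 0x80) {
    ++i;  // skip the tail of a character whose lead byte was missed
  }
  return i;
}
```

If a capture begins with the last two bytes of a euro sign (82 AC) followed by "OK", the function returns 2, the position of the 'O'.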

I don't know what the difference between your setup and mine could be. You could join me and use Ubuntu Linux. UTF-8 was adopted by Linux a while ago, and Windows is not there yet.

Could you try to upgrade the drivers for Windows ? They are in the 'driver' folder. I think dpinst-amd64.exe installs everything.

Started PuTTY many times; disconnected and connected the Leonardo; pressed the reset on the Leonardo; and so on. Whatever I try, the result is 100% okay for UTF-8.

Started PuTTY many times; disconnected and connected the Leonardo; pressed the reset on the Leonardo; and so on. Whatever I try, the result is 100% okay for UTF-8.

As I wrote before: the process of uploading to the Leonardo makes the difference. Starting PuTTY, disconnecting, reconnecting, resetting the Leonardo ... none of those actions has any influence.

If I get a "good upload" (by chance) with 1.6.6 (see my post above), it delivers a correct UTF-8 result on PuTTY, no matter what I do in between. If I do just one more upload with 1.6.6, everything is bad again.

With 1.6.5 every single upload "is good" and delivers a correct UTF-8 result on PuTTY.

Setup: Arduino IDE 1.6.6, Windows 10 64-bit, Arduino Leonardo with bootloader from Arduino 1.6.5, PuTTY 0.64. My port is COM3 (and temporary COM9 during uploading I think).

Sequence: Upload and wait until upload has finished. Not closing the Arduino IDE. Starting PuTTY. Check characters. Fully closing PuTTY. Uploading again. and so on.

Result: 100% good UTF-8 characters.

Have you read my Reply #9 ? Did you re-install all the drivers ? Can you confirm that my sequence is the same as you are doing ?

Koepel: Have you read my Reply #9 ? Did you re-install all the drivers ?

I reinstalled the drivers as you suggested with "dpinst-amd64.exe" from the 1.6.6 installation and everything works fine now.

Thanks again for your help !

But after that my AVRISPmkII (used for other arduino projects) is not recognized anymore (by the arduino IDE / by avrdude), but this is another problem ...

You have bad luck. I hope you fix the AVRISPmkII as well.

BTW, you don’t need that intermediate buffer to print the strings:

Serial.println(reinterpret_cast<const __FlashStringHelper *>(pgm_read_word(&string_table[i])));

That's new for me oqibidipo. Cool. So the pointer from the list is cast to something like F(), and inside the Serial.println() the reading from Flash memory is done.

@oqibidipo
If it really does not consume any RAM at runtime,
that’s cool, I will test that.

@Koepel
yesterday I added just one line to my extended example from here Post #3,
instead of

void setup()
{
  Serial.begin(9600);
  while(!Serial);
  Serial.println("OK");
}

I changed to

void setup()
{
  Serial.begin(9600);
  while(!Serial);
  Serial.println("OK");
  Serial.println("Größe");
}

and then everything (all the UTF-8 test strings) showed up correctly on PuTTY
except this newly added line with the two German special characters (“ö” and “ß”) in it.
Weird, isn’t it?
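One way to narrow down a case like this on the board itself is to check whether the bytes of the string are well-formed UTF-8 at all. The sketch below is only an illustration: isWellFormedUtf8 is a made-up name, and the idea that the literal might have been compiled in a non-UTF-8 encoding such as Windows-1252 is an assumption, not something established at this point in the thread. The check covers lead bytes and continuation bytes only; it does not reject overlong 3- and 4-byte forms or surrogates.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns true if every lead byte in the NUL-terminated
// string is followed by the right number of continuation bytes (10xxxxxx).
// Simplified check: overlong 3/4-byte forms and surrogates are not rejected.
bool isWellFormedUtf8(const char* s) {
  assert(s != nullptr);
  while (*s != '\0') {
    const uint8_t lead = static_cast<uint8_t>(*s++);
    size_t extra;
    if (lead < 0x80) extra = 0;                         // ASCII
    else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;   // 2-byte sequence
    else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;   // 3-byte sequence
    else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;   // 4-byte sequence
    else return false;                                  // stray or invalid byte
    for (; extra > 0; --extra) {
      // A NUL or any non-continuation byte here means a truncated sequence.
      if ((static_cast<uint8_t>(*s) & 0xC0) != 0x80) return false;
      ++s;
    }
  }
  return true;
}
```

"Größe" stored as UTF-8 is the byte sequence 47 72 C3 B6 C3 9F 65 and passes. The same word stored in a single-byte Windows encoding would be 47 72 F6 DF 65 and fails, because F6 can never start a UTF-8 sequence.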

Tested also that, and 100% okay.

Why didn't you update to Windows 10 ?

You mentioned the bootloader before. Maybe it helps if you burn the new bootloader, since I'm using the bootloader of the Leonardo from Arduino IDE 1.6.5. Perhaps there is something that has a little trouble with the changing of the COM ports and perhaps the bootloader has some influence on it (it seems almost impossible to me, but you never know).

Perhaps there is some old driver somewhere. They can be found with RAPR ( driverstoreexplorer.codeplex.com ).

Perhaps it is the hardware. I'm using a USB 2.0 port on my computer, no hub.

Does changing the baudrate to 115200 make a difference ?

@Koepel: I found out another thing:

I use 1.6.6 in a portable version on an USB-Stick.

I disabled the option "save when verifying or uploading"

If you are working on a file (ino, h, cpp) with UTF-8 chars in string-declarations within this file (see my examples above), and you change something (its enough to add a space char in a string), and you DO NOT save this file before compiling and uploading, the serial output of this UTF-8 chars (via UTF-8 capable Terminal) will be with errors.

If you save this file before compiling and uploading, (either manually Ctrl+S or by enabling the option "save when verifying or uploading") the serial output of this UTF-8 chars (via UTF-8 capable Terminal) will work correctly.

--

So it seems that when the IDE compiles an unsaved sketch ("from cache"), UTF-8 chars are not converted or handled properly on Windows. Saving the sketch first (to a file on HDD / USB stick) seems to lead to proper handling of the UTF-8 chars ...
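If the editor's encoding is the suspect, one way to take it out of the picture is to keep the source pure ASCII and write the UTF-8 bytes as hex escapes. This was not tested in the thread; it is a suggestion under that assumption. The hypothetical helper below (encodeUtf8 is not an Arduino function) shows how the bytes for a given Unicode code point are derived, so the escapes can be worked out by hand or on a PC.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: writes the UTF-8 encoding of code point cp into out
// (at most 4 bytes plus a NUL terminator) and returns the number of bytes.
// Returns 0 for surrogates and for values above U+10FFFF.
size_t encodeUtf8(uint32_t cp, char out[5]) {
  assert(out != nullptr);
  size_t n = 0;
  if (cp < 0x80) {
    out[n++] = static_cast<char>(cp);                          // 0xxxxxxx
  } else if (cp < 0x800) {
    out[n++] = static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
    out[n++] = static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
  } else if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF) { out[0] = '\0'; return 0; }
    out[n++] = static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
    out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    out[n++] = static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
    out[n++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
  }
  out[n] = '\0';
  return n;
}
```

For "ö" (U+00F6) this gives C3 B6, for "ß" (U+00DF) C3 9F, and for "€" (U+20AC) E2 82 AC, so "Größe" can be written as "Gr\xC3\xB6\xC3\x9F" "e". The literal is split before the final "e" because a \x escape consumes every hex digit that follows it, and 'e' is a hex digit.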