Serial.print with UTF-8 characters

Hi,
I have updated my old "ShowInfo" sketch : Arduino Playground - ShowInfo
and added a test for UTF-8 characters.

This part:

  Serial.println(F("UTF-8 test:"));
  Serial.println(F("    Micro µ µ µ µ µ µ µ µ µ µ"));
  Serial.println(F("    Euro  € € € € € € € € € €"));
  Serial.println(F("    (c)   © © © © © © © © © ©"));

shows some characters okay, and some not. The result is attached.
Each run of the same sketch shows different wrong characters.
Type 'i' to show it.
Removing the 'F()' macro doesn't help.

With Arduino 1.5.8, Arduino Mega 2560, Linux 14.04 64-bit.

Using 1200 baud makes every UTF-8 character show wrong, and using 115200 makes almost 90% show okay. That is very strange.
Adding delays or using Serial.write() does not help.

Does Arduino support UTF-8 or not ? Or only now and then ?

arduino-utf8.png

Does Arduino support UTF-8 or not ?

I believe the editor does but Serial Monitor does not.

Try a terminal application.

You need to be careful with wide chars. The F() macro does not appear to be compatible with them:

const wchar_t arr[] PROGMEM = L"©€µ©€µ";

void setup() {
  Serial.begin( 115200);
  Serial.println("+++++++++++++++++++++++++++");
  
  Serial.println(F("©€µ©€µ"));
  
  Serial.println("+++++++++++++++++++++++++++");
  
  for( int idx = 0 ; idx < sizeof( arr ) ; ++idx ){
    Serial.write( pgm_read_byte( (char*)arr + idx ) );
  }
  Serial.println("\r\n+++++++++++++++++++++++++++");
}

void loop(){}

Produces this HEX output (Not the same):

2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B0D0A
C2A9E282ACC2B5C2A9E282ACC2B50D0A
2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B0D0A
A900AC20B500A900AC20B50000000D0A
2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B2B0D0A

The F() prints out just as much data even though the null is included in the second version. And it is not correct.
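A hex dump like the one above can be reproduced on a desktop machine by formatting each byte as two hex digits. This is only a minimal host-side sketch, not the tool used for the dump: toHex is a hypothetical helper, and the test string is written with explicit byte escapes so it does not depend on the encoding of the source file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Format every byte of the input as two upper-case hex digits.
std::string toHex(const std::string& bytes) {
  static const char kDigits[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(bytes.size() * 2);
  for (unsigned char b : bytes) {
    out += kDigits[b >> 4];     // high nibble
    out += kDigits[b & 0x0F];   // low nibble
  }
  assert(out.size() == bytes.size() * 2);
  return out;
}
```

Feeding it the bytes C2 A9 E2 82 AC C2 B5 gives "C2A9E282ACC2B5", the first half of the second line of the dump.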

Works perfectly for me using Arduino IDE 1.5.6 R2 on Windows 8.1 64-bit. I tested two Arduino Mega 2560 clones, one from Sainsmart and one from "DCcEle" (which, unlike other clones or the original board, uses a CH340 UART), and tried different baud rates. Everything worked perfectly!

Thanks everyone for looking into this.
I tested it with other serial terminal applications. Some support UTF-8, some don't, but the result is always consistent.
Only certain versions of the Arduino serial monitor show this random mix of correct and incorrect output.

I would call this : UTF-8 to the serial monitor is not supported and buggy.

Hang on guys, what are you basing this on, or what are you smoking!

The output I gave you was the hexadecimal output of the raw bytes, not from the IDE serial monitor.

The F() macro will not work with wide characters. You must use the method I showed you!

A page to the Playground for UTF-8 is added : Arduino Playground - UTF-8
The link to it is here : Arduino Playground - TutorialList

I still have to test the F() macro...

I think that most of what I wrote is correct, and I hope you will help to make some improvement to that page :smiley_cat:

I can't attach a file in the Playground, and I need to do that because of a bug in "Get code". So I attach the file here and use it in the Playground : Arduino Playground - UTF-8

Bug reported here : ( 1 ) Bug in "Get code" ( 2 ) attach in Playground - Website and Forum - Arduino Forum

utf-8.ino (2.09 KB)

@pYro_65
Thanks for pointing out the "malfunction" of the F macro with 2-byte (UTF-8) characters.
But I had to modify your for-loop in the following way:

 for( int idx = 0 ; idx < (sizeof( arr )/2) ; ++idx ){
    Serial.write( pgm_read_byte( (wchar_t*)arr + idx ) );
  }

Otherwise I got an additional space (or maybe it's a '00' char) after each char (in the terminal of the Arduino IDE under Win7).
I could not figure out why that is the case.

That example shows the extra bytes that are part of the wide string, which shows that the content differs from what the F() macro produces. But this is essentially what wide characters are. The problem is that your serial monitor treats the data as ASCII. If you save the text into a UTF file or view it in a UTF-enabled editor, the characters will display correctly.

The way you have it outlined is how you could use it in code (if Serial.write accepted wchar_t). Except as you may notice, the second and fifth character do not print properly.

Serial.write() only prints a single byte (pgm_read_byte was used too). And using wchar_t as the pointer type causes pointer arithmetic which steps two bytes per pointer increment. ((wchar_t*)arr + idx)
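The difference between the two pointer types can be tried out on a desktop machine. This is a rough sketch under stated assumptions: kImage is a hand-written copy of the bytes that L"©€" occupies on a little-endian target with a 2-byte wchar_t (as in the hex dump earlier in the thread), and readViaBytePointer / readViaWidePointer are hypothetical stand-ins for the two pgm_read_byte calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Memory image of L"©€" with a 16-bit, little-endian wchar_t:
// U+00A9, U+20AC, then the terminator, two bytes each.
static const uint8_t kImage[] = { 0xA9, 0x00, 0xAC, 0x20, 0x00, 0x00 };
static const std::size_t kWideSize = 2;  // assumed sizeof(wchar_t) on AVR

// Like pgm_read_byte( (char*)arr + idx ): idx counts single bytes.
uint8_t readViaBytePointer(std::size_t idx) {
  assert(idx < sizeof(kImage));
  return kImage[idx];
}

// Like pgm_read_byte( (wchar_t*)arr + idx ): idx counts whole elements,
// so only the first (low) byte of each element is ever read.
uint8_t readViaWidePointer(std::size_t idx) {
  assert(idx * kWideSize < sizeof(kImage));
  return kImage[idx * kWideSize];
}
```

With the wide pointer, index 1 yields 0xAC and the 0x20 high byte of the euro sign is never sent, which would explain why that character comes out wrong while © (high byte 0x00) looks fine.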

const wchar_t arr[] PROGMEM = L"©€µ©€µ";

If you have a look at the HEX values you can see the characters that render properly only use the first byte of the wide char:

A900AC20B500A900AC20B50000000D0A

Here is how you could print the string correctly:

  #define arrlen( x )  (sizeof(x)/sizeof(*x))
   
  for( int idx = 0 ; idx < arrlen( arr ) ; ++idx ){
    uint16_t wchar = pgm_read_word( arr + idx ); //arr is already a wchar_t array, no cast needed.
    Serial.write( (char*)&wchar, 2 );
  }

However, still, in a non UTF monitor, you'll see spaces (or some other char) for the second byte of each wchar_t.

Ah OK,
now I see that the monitor of the Arduino IDE can display
1 byte UTF-8 characters without problems
e.g. ("Ü ü Ö ö Ä ä ß µ ° ² ³ © @ ~ { [ ] }")
but real 2 byte UTF-8 characters (including €) do not work
e.g. ("₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴")

Now I wanted to wrap your last suggestion in a simple-to-use function.
I thought I could use this code:

// put UTF-string into Flash:
const wchar_t text_utf[] PROGMEM = {L"Ü ü Ö ö Ä ä ß µ ° ² ³ © € @ ~ { [ ] } \\ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴"};

// define a simple function  
// printing from Flash Memory (2byte UTF)
void printFlashUTF (const wchar_t * str)
{
    wchar_t c;
    if (!str) 
      return;
    while ((c = pgm_read_word(str++)))
      Serial.write( (char*)&c, 2 );
}

//[..]

// using this simple function:
// print from flash memory (2 bytes per character);
// the array name already converts to const wchar_t *, so no cast is needed
printFlashUTF (text_utf);

is this correct so far ?


Writing out 2 bytes for each character is not UTF-8; that's UTF-16LE (or UCS-2). And characters like the copyright symbol just happen to print correctly because the serial monitor is interpreting the bytes as some 8-bit (not UTF-8) character encoding, perhaps ISO8859-1 or Windows-1252.
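The two encodings can be compared side by side for a single character. The following is a minimal host-side sketch, not Arduino library code: encodeUtf8 and encodeUtf16le are hypothetical helpers, and they only handle code points up to U+FFFF (no surrogate pairs).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point from U+0000..U+FFFF (surrogates excluded)
// as UTF-8: 1 to 3 bytes depending on its value.
std::string encodeUtf8(uint16_t cp) {
  assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not characters
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// The same code point as UTF-16LE: always 2 bytes, low byte first.
std::string encodeUtf16le(uint16_t cp) {
  std::string out;
  out += static_cast<char>(cp & 0xFF);
  out += static_cast<char>(cp >> 8);
  return out;
}
```

For € (U+20AC) the first function gives E2 82 AC and the second gives AC 20. For © (U+00A9) they give C2 A9 and A9 00. The second form starts with A9, which is also the code for © in ISO8859-1, so an 8-bit monitor happens to show it correctly.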

Yes, you are right,
thanks for pointing that out.
The examples here make it clear:
--> UTF-8 - Wikipedia
(UTF-8 characters are 1 to 4 bytes long; the € sign, for example, is three bytes long [E2 82 AC])
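The 1 to 4 byte rule can be written down in a few lines. This is a rough sketch with hypothetical helper names (utf8EncodedLength, utf8LengthFromLeadByte). The second function only looks at the bit pattern of the first byte and does not validate the rest of the sequence.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to encode a Unicode code point in UTF-8.
int utf8EncodedLength(uint32_t codePoint) {
  assert(codePoint <= 0x10FFFF);      // upper limit of Unicode
  if (codePoint < 0x80)    return 1;  // ASCII
  if (codePoint < 0x800)   return 2;  // e.g. µ (U+00B5), © (U+00A9)
  if (codePoint < 0x10000) return 3;  // e.g. € (U+20AC)
  return 4;
}

// Sequence length announced by the first byte of a UTF-8 sequence.
// Returns 0 for a continuation byte (10xxxxxx) or any other byte
// that does not match a lead-byte pattern.
int utf8LengthFromLeadByte(uint8_t lead) {
  if (lead < 0x80)           return 1;  // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}
```

For the € sign, utf8EncodedLength(0x20AC) is 3 and the lead byte E2 announces 3 bytes as well, while 82 and AC are continuation bytes.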


In the meantime I found out
that the Arduino IDE works seamlessly with UTF-8.
That means all of the following work with UTF-8 strings of variable byte length:

  • the F macro
  • the PROGMEM modifier
  • the PSTR macro
  • strcpy, strncpy
  • strcpy_P
  • String class (including the length() method)
  • sizeof (with an array of chars)

using Krodal's example program from here
http://playground.arduino.cc/Code/UTF-8
with some "weird" UTF-8 Characters:

// Test for normal strings with UTF-8
// Public Domain

char three[] = "3µV";
const char four[] PROGMEM = "4µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱";
String five = "5µF";
char six[] = "60€";
String seven = "70€₡₢₣₤₥₦";

char buffer[80];

void setup() 
{
  Serial.begin( 9600);
#if defined (__AVR_ATmega32U4__)
  while(!Serial);        // For Leonardo, wait for serial port
#endif

  Serial.println("\n+++++++++++++++++++++++++++++++++++++++++");
  Serial.println("Use a serial terminal that supports UTF-8");

  Serial.println(F("1µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱"));        // Good, text in flash

  // copy a string from flash memory to a buffer.  
  sprintf_P( buffer, PSTR("2µH")); // Good, text in flash
  Serial.println( buffer);

  // copy a string in ram to a buffer
  strcpy( buffer, three);          // Good
  Serial.println( buffer);

  // add one to strlen for the zero terminator
  strncpy( buffer, three, strlen(three) + 1); // Good, strlen works with UTF-8 string
  Serial.println( buffer);

  strcpy_P( buffer, four);         // Good, text in flash with PROGMEM
  Serial.println( buffer);
  
  // copy a string in flash to buffer byte for byte
  for( int i = 0 ; i < sizeof( four) ; i++)  // Good, sizeof works with UTF-8 string
  {
    buffer[i] = pgm_read_byte( four + i);
  }
  Serial.println( buffer);

  Serial.println( five);           // Good, a String class with UTF-8 character
  
  Serial.print( "array of char: \"");
  Serial.print( six);
  Serial.print( "\", strlen=");
  Serial.println( strlen(six));
  
  Serial.print( "String object: \"");
  Serial.print( seven);
  Serial.print( "\", String.length()=");
  Serial.println( seven.length());

  Serial.println( "+++++++++++++++++++++++++++++++++++++++++");

  Serial.println( "Enter a UTF-8 character and press <enter>");
  Serial.println( "The hexadecimal value will be displayed.");
}

void loop()
{
  if( Serial.available())
  {
    Serial.print( "You have entered: ");
    delay(100);      // allow the rest of the line to be received.
    while( Serial.available())
    {
      byte c = Serial.read();
      if( c != '\r' && c != '\n')  // ignore trailing CR and LF
      {
        if( c <= 0x0F)
          Serial.print( "0");
        Serial.print( c, HEX);
        Serial.print( ", ");
      }
    }
    Serial.println();
  }
}

This produces the following (correct) output in a UTF-8 capable terminal.
(Note the strlen and String.length() outputs: both count bytes, not characters, so "60€" gives 5.)

+++++++++++++++++++++++++++++++++++++++++
Use a serial terminal that supports UTF-8
1µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱
2µH
3µV
3µV
4µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱
4µ€ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱
5µF
array of char: "60€", strlen=5
String object: "70€₡₢₣₤₥₦", String.length()=23
+++++++++++++++++++++++++++++++++++++++++
Enter a UTF-8 character and press <enter>
The hexadecimal value will be displayed.
You have entered: E2, 82, AC,
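Since strlen and String.length() report bytes, the number of characters has to be counted separately if it is needed. A minimal sketch, assuming the input is valid UTF-8 (utf8CodePoints is a hypothetical helper, not part of the Arduino core): every byte that is not a continuation byte (10xxxxxx) starts a new character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8CodePoints(const char* s) {
  assert(s != nullptr);
  std::size_t count = 0;
  for (; *s != '\0'; ++s) {
    if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
      ++count;  // lead byte or ASCII byte: a new character starts here
    }
  }
  return count;
}
```

For "60€" this counts 3 characters where strlen reports 5 bytes, and for "70€₡₢₣₤₥₦" it counts 9 characters where String.length() reports 23.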