Comparing UTF-8 chars

Hello,

for printing German special characters (umlauts) with a thermal printer I have to call a special print function.
So I wrote a subroutine that looks for umlauts and then calls the special print function for those umlauts.

Example: I call my function like this, e.g.

utf8print("Dieser kümmerliche Beitrag von 5€...");

The function looks like this:

void utf8print(const String& text) {
  for(int i=0; i <= text.length(); i++) {
    if(text.charAt(i) == 'ü') {   // this comparison doesn't work here!!!
        printer.write(0x81); 
    } else if(text.charAt(i) == '€') {
        printer.write(0xD5);
    } else {
      printer.print(text.charAt(i));
    }
   }
}

But it doesn’t work this way, the comparison fails (it doesn’t detect the ü or € char).

I read a bit, and it seems this is because ü and € are multi-byte UTF-8 sequences and == only compares a single byte, or something like that.

Can you please tell me how I should do this?

Board?

Hello, here is something that may be useful

http://ideone.com/NCGONA

ASCII is the default for Arduino, not UTF-8.

Some characters are the same in both, but I'm guessing your uncommon ones aren't.

So hard coding the characters into your sketch won't do you much good.
Regardless of the encoding, every character is represented by byte values in the range 0-255.

If your only problem is those two characters, why not hard code a comparison to the actual byte values that those characters are producing?

I'm sure it can't be too hard to either look up a UTF-8 chart or add a couple of debug lines to find those byte values.
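The "couple of debug lines" could look roughly like the sketch below. It is only an illustration: hexDump is a hypothetical helper name, and it writes into a caller-supplied buffer so it can be tried off the board as well. On the Uno you could just as well print each byte with Serial.print(b, HEX).

```cpp
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes the bytes of `text` as space-separated
// uppercase hex into `out` (at most outSize bytes, always terminated
// when outSize > 0).
void hexDump(const char *text, char *out, size_t outSize) {
  size_t used = 0;
  if (outSize > 0) out[0] = '\0';
  for (size_t i = 0; text[i] != '\0'; i++) {
    // Each byte needs up to 3 characters (separator + 2 hex digits)
    // plus room for the terminator.
    if (used + 4 > outSize) break;
    // Cast to unsigned char so bytes >= 0x80 are not sign-extended.
    int n = snprintf(out + used, outSize - used, i ? " %02X" : "%02X",
                     (unsigned)(unsigned char)text[i]);
    used += (size_t)n;
  }
}
```

Feeding it the two bytes of ü should give "C3 BC", which matches the 0xc3 0xbc value mentioned further down in the thread.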

Coding Badly: Board?

Arduino UNO

I still didn't manage to do what I wanted :(

taterking: why not hard code a comparison to the actual byte value that those characters are producing?

How do I do this?

"ü" is 0xc3 0xbc as hex value

How can I do this here?

if(text.charAt(i) == 'ü') {

cybtrash: How can I do this here? if(text.charAt(i) == 'ü') {

You can't. Sorry. The String object is for single-byte characters, see String.cpp and String.h.

You could search for the substring ü, but that might cause a lot of trouble: when there is a € in front of the ü, the byte index of the ü moves up by three instead of one. You could translate everything to 16-bit characters; I think Java uses that. That is also not 100% fail-safe, because some characters need 32 bits. The link by @guix calculates everything. That will work, but a mistake is easily made. The std::string operates on bytes as well, but the standard library also has some extras for 16-bit characters. This is about that: https://en.cppreference.com/w/cpp/locale/codecvt_utf8, but I don't know how you can use it.
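To give an idea of what "translating" to wider characters involves, here is a minimal sketch of decoding one UTF-8 sequence into a 32-bit code point. The name utf8Decode is made up for this example, and the function only checks the basic bit patterns; it does not reject overlong encodings or surrogate values, so treat it as a starting point rather than a complete decoder.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: decodes the UTF-8 sequence starting at s.
// On success stores the code point in *cp and returns the number of
// bytes consumed (1..4). Returns 0 at the string terminator or when
// the bytes do not follow the UTF-8 bit patterns; *cp is then untouched.
size_t utf8Decode(const char *s, uint32_t *cp) {
  const unsigned char *p = (const unsigned char *)s;
  uint32_t value;
  size_t len;

  if (p[0] == 0) return 0;
  if (p[0] < 0x80)                { value = p[0];        len = 1; }  // 0xxxxxxx
  else if ((p[0] & 0xE0) == 0xC0) { value = p[0] & 0x1F; len = 2; }  // 110xxxxx
  else if ((p[0] & 0xF0) == 0xE0) { value = p[0] & 0x0F; len = 3; }  // 1110xxxx
  else if ((p[0] & 0xF8) == 0xF0) { value = p[0] & 0x07; len = 4; }  // 11110xxx
  else return 0;  // stray continuation byte or invalid lead byte

  for (size_t i = 1; i < len; i++) {
    // Every following byte must look like 10xxxxxx; this also stops
    // at a premature terminator.
    if ((p[i] & 0xC0) != 0x80) return 0;
    value = (value << 6) | (uint32_t)(p[i] & 0x3F);
  }
  *cp = value;
  return len;
}
```

With this, the bytes 0xC3 0xBC decode to U+00FC (ü) and 0xE2 0x82 0xAC decode to U+20AC (€). A lookup table could then be keyed on the code point instead of on the raw bytes.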

I think you need a UTF-8 library. If you want to use the String object, then someone should write a UTF-8 version of the String class. As far as I know, it does not exist :'(

Can you try to search for the substring ü ? Perhaps it is just enough for you to print the special characters. You have to advance two bytes for the next character of course. You can either use the String object, or use a normal char array with the code in the link by @guix. I prefer a normal char array for an Arduino Uno.

P.S.: We prefer that you show a small sketch that demonstrates the problem. Then we have something that we can try ourselves.

This would be my approach:

struct UTF8_mapping {
  UTF8_mapping(const char *utf8, unsigned char replacement)
    : utf8(utf8), length(strlen(utf8)), replacement(replacement) {}
  const char *utf8;
  size_t length;
  unsigned char replacement;
};

UTF8_mapping mappings[] = {
  {"ü", 0x81},
  {"€", 0xD5},
};

void utf8print(const char *str) {
  while (*str) {
    if ((unsigned char)*str < 0x80) { // ASCII
      printer.write(*str);
      str += 1;
      continue;
    }
    // else: UTF-8 multiple bytes
    bool matched = false;
    for (auto &mapping : mappings) {
      if (strncmp(str, mapping.utf8, mapping.length) == 0) {
        printer.write(mapping.replacement);
        str += mapping.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      str += 1; // no mapping for this byte: skip it so the loop cannot hang
    }
  }
}

Pieter

@PieterP, that is very nice ! I was working on the same thing and also used a struct to translate it. I will show it here anyway, but your version is so much more advanced and better.

// Translation table
struct translate_STRUCT
{
  char utf8Data[5];    // UTF-8 is maximum 4 bytes plus zero terminator
  byte printByte;
} translate[] =
{
  { "ü", 0x81 },
  { "€", 0xD5 },
  { "°", 0x01 },
  { "Ω", 0x02 },
  { "µ", 0x03 },
  { "ß", 0x04 },
};

void setup()
{
  Serial.begin( 9600);
  utf8print( "Dieser kümmerliche Beitrag von 5€...");
  Serial.println();
  utf8print( "It is 30°C. The resistor is 120Ω. The capacitor is 47µF. Straße");
  Serial.println();
}

void loop()
{
}

void utf8print(const char * pText)
{
  size_t n = strlen( pText);        // length in bytes

  for( size_t i=0; i<n; i++)
  {
    bool found = false;
    for( size_t j=0; j<sizeof(translate)/sizeof(translate[0]); j++)
    {
      size_t x = strlen( translate[j].utf8Data);
      if( strncmp( &pText[i], translate[j].utf8Data, x) == 0)
      {
        found = true;
        // The code for the printer is:
        //   printer.write( translate[j].printByte);
        Serial.print( "{0x");       
        Serial.print( translate[j].printByte, HEX);
        Serial.print( "}");
        i += x - 1;               // the for loop adds 1 itself, so add only the extra bytes
        break;                    // stop searching the table once a match is found
      }
    }

    if( !found)
    {
      Serial.print( pText[i]);
    }
  }
}

cybtrash:
“ü” is 0xc3 0xbc as hex value
How can I do this here?

if((byte)text.charAt(i) == 0xC3 && (byte)text.charAt(i+1) == 0xBC) {   // ü == 0xC3 0xBC; cast needed because char is signed on the Uno
 i++; // skip over the second byte which is already used
}

The IDE does store the characters as UTF-8 in the String, but a single character can end up occupying more than one byte.

It’s pretty easy to decode UTF-8. UTF-8 uses the ASCII set for the first 128 characters. That’s handy because it means ASCII text is also valid UTF-8. If you go beyond that range, you know you have more bytes (in UTF-8, a code point takes between 1 and 4 bytes; see Wikipedia).
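As a rough illustration of the "1 to 4 bytes" rule, the first byte of a sequence tells you how many bytes belong to it. The helper below is only a sketch with a made-up name (utf8SequenceLength), and it looks at the bit pattern only, so it does not catch every invalid lead byte.

```cpp
#include <stddef.h>

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with `lead`. Returns 0 for a continuation byte (10xxxxxx)
// or a byte of the form 11111xxx, neither of which can start a sequence.
size_t utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}
```

For ü the lead byte is 0xC3, which gives 2; for € it is 0xE2, which gives 3.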

Thus, looking for extended text means looking for a sequence of bytes, not just comparing one byte (John Wasser's approach would indeed work for the character ü). The notation char c = ‘ü’; is incorrect because the symbol ü does not fit in one byte (on the Uno, the char type is a signed byte).

Note that when a String holds UTF-8, length() is not semantically correct: it tells you the number of bytes, not the real number of symbols (for example, the length of “ü234” is not 4 but 5, because ü takes 2 bytes). Likewise, the String function charAt() behaves more like a byteAt() function (which does not exist): it returns the byte at a given index and does not handle UTF-8.
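If you need the number of symbols rather than the number of bytes, one common trick is to count every byte that is not a continuation byte. This is a minimal sketch with a hypothetical name (utf8Length); it assumes the input is valid UTF-8 and counts code points, which is not always the same as what a reader sees as one character (combining accents, for example).

```cpp
#include <stddef.h>

// Hypothetical helper: counts the code points in a NUL-terminated
// UTF-8 string by skipping continuation bytes (10xxxxxx).
size_t utf8Length(const char *s) {
  size_t count = 0;
  for (; *s != '\0'; s++) {
    // Cast to unsigned char so the mask works the same whether
    // char is signed (AVR) or unsigned.
    if (((unsigned char)*s & 0xC0) != 0x80) count++;
  }
  return count;
}
```

For the bytes of “ü234” this returns 4, while strlen() on the same text returns 5.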