The IDE (and the String class) represents string literals in UTF-8, which is a variable-width character encoding scheme. Depending on your symbols (glyphs), each glyph can take 1, 2, 3 or even 4 bytes.
The Serial monitor should support UTF-8 (at least for strings), which is why your code example works.
The strlen() or length() functions, though, are totally unaware of the concept of glyphs or UTF-8 and only count the number of bytes used to represent your text.
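If you want the number of glyphs instead of bytes, here is a minimal host-side sketch (utf8Length is a made-up helper, not part of the String class). It relies on the fact that every UTF-8 continuation byte starts with the bit pattern 10xxxxxx, so counting only the non-continuation bytes gives the glyph count:

```cpp
#include <cstddef>

// Hypothetical helper: counts glyphs rather than bytes by skipping
// UTF-8 continuation bytes (any byte of the form 10xxxxxx).
size_t utf8Length(const char *s) {
  size_t glyphs = 0;
  for (; *s; ++s) {
    // A byte whose top two bits are NOT "10" starts a new glyph.
    if (((unsigned char)*s & 0xC0) != 0x80) glyphs++;
  }
  return glyphs;
}
```

On the Arduino you could call it as utf8Length(SRussian.c_str()) and get 4 instead of the 6 that length() reports.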
When it comes to decoding, UTF-8 was designed for backward compatibility with ASCII: the first 128 characters of UTF-8 correspond one-to-one with ASCII and are encoded as a single byte with the same binary value, so valid ASCII text is also valid UTF-8-encoded text.
Note that you used a special A and a… You also need to make sure your text editor is set to UTF-8 if you don't use the IDE.
If you were to run this code:
void setup() {
  Serial.begin(115200);

  String SEnglish = String("AZaz"); // English characters
  String SRussian = String("AЯaя"); // Russian characters

  Serial.print("English Str: "); Serial.println(SEnglish);
  Serial.print("Russian Str: "); Serial.println(SRussian);
  Serial.print("English length: "); Serial.println(SEnglish.length());
  Serial.print("Russian length: "); Serial.println(SRussian.length());

  Serial.print("English hex:");
  for (uint16_t i = 0; i < SEnglish.length(); i++) {
    Serial.print((byte) SEnglish.charAt(i), HEX);
    Serial.write(' ');
  }
  Serial.println();

  Serial.print("Russian hex:");
  for (uint16_t i = 0; i < SRussian.length(); i++) {
    Serial.print((byte) SRussian.charAt(i), HEX);
    Serial.write(' ');
  }
  Serial.println();
}

void loop() {}
you would see this in the Serial monitor at 115200 baud:
[color=purple]
English Str: [color=red]A[/color]Z[color=blue]a[/color]z
Russian Str: [color=red]A[/color]Я[color=blue]a[/color]я
English length: 4
Russian length: 6
English hex:[color=red]41[/color] 5A [color=blue]61[/color] 7A
Russian hex:[color=red]41[/color] D0 AF [color=blue]61[/color] D1 8F
[/color]
You'll recognize that A and a actually use 1 byte each and are coded the same way (0x41 and 0x61) in both Strings, but your Russian characters Я and я are coded on 2 bytes (and some other glyphs could be 3 or 4 bytes long).
You can see all the UTF-8 glyphs here, and the UTF-8 information for Я here (where you'll find the D0 AF code) as an example.
The way things are encoded to determine whether a glyph needs 1, 2, 3 or 4 bytes is deterministic and needs no forward byte lookup → e.g. if the byte you read is strictly less than 0x80 then it's the ASCII equivalent; if not, there are rules with continuation bytes that let you decide how many bytes you need.
The value of each individual byte indicates its UTF-8 function, as follows:
00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to F4 hex (240 to 244): first byte of a four-byte sequence (F5 to FF never appear in valid UTF-8).
→ you can read that in the spec.
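That table translates directly into a tiny classifier. Here's a sketch (utf8SequenceLength is a made-up name) that returns the sequence length from the lead byte alone, which is exactly the "no forward lookup" property mentioned above:

```cpp
#include <cstdint>

// Sketch of the lead-byte rules above: given the first byte of a sequence,
// return how many bytes the whole sequence occupies.
// Returns 0 for a continuation byte, which is never a valid sequence start.
uint8_t utf8SequenceLength(uint8_t b) {
  if (b < 0x80) return 1;  // 00..7F: single ASCII byte
  if (b < 0xC0) return 0;  // 80..BF: continuation byte, not a start
  if (b < 0xE0) return 2;  // C2..DF (C0/C1 would be invalid overlong starts)
  if (b < 0xF0) return 3;  // E0..EF: three-byte sequence
  return 4;                // F0..F4 (F5..FF are invalid in UTF-8)
}
```

For the Russian example above, utf8SequenceLength(0xD0) returns 2, so after reading the D0 you know the AF that follows belongs to the same glyph.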
The challenge you have is that UTF-8 can represent all 1,112,064 valid Unicode code points… so mapping every possible byte sequence to something else is quite a task.
If I were you, I would identify the special glyphs you care about, detect them as you go through the bytes of your string, and use a dedicated bitmap to paint them. Just remember that they can be 1, 2, 3 or 4 bytes long, so the matching needs to take that into account: if you find an unknown character, skip the appropriate number of bytes and keep on with the next one.
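A hypothetical sketch of that approach (Glyph, glyphTable, bitmapId and nextGlyph are all made-up names): walk the byte string, compare the next sequence against a small table of the glyphs you care about, and skip by the sequence length when there is no match.

```cpp
#include <cstring>
#include <cstddef>
#include <cstdint>

// Maps the UTF-8 byte sequences you care about to whatever identifies
// their custom bitmap (here just an int as a stand-in).
struct Glyph { const char *utf8; int bitmapId; };

const Glyph glyphTable[] = {
  { "\xD0\xAF", 1 },  // Я
  { "\xD1\x8F", 2 },  // я
};

// Sequence length from the lead byte; a stray continuation byte
// advances by 1 so the scan can't get stuck.
static uint8_t seqLen(uint8_t b) {
  if (b < 0x80) return 1;
  if (b < 0xC0) return 1;
  if (b < 0xE0) return 2;
  if (b < 0xF0) return 3;
  return 4;
}

// Returns the bitmap id for the glyph starting at s[i], or -1 if it is
// not in the table; advances i past the whole sequence either way.
int nextGlyph(const char *s, size_t &i) {
  uint8_t n = seqLen((uint8_t)s[i]);
  for (const Glyph &g : glyphTable) {
    if (strlen(g.utf8) == n && strncmp(s + i, g.utf8, n) == 0) {
      i += n;
      return g.bitmapId;
    }
  }
  i += n;  // unknown glyph: skip the right number of bytes and carry on
  return -1;
}
```

Scanning "AЯaя" this way yields -1, 1, -1, 2: the ASCII characters fall through as unknown (one byte each), while both Russian glyphs are matched as whole two-byte sequences.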