Display text on LED matrix - how to properly handle Unicode (Russian) letters

I am writing a library to display text on an LED matrix.
English letters are OK, but I have a problem with Russian letters, I believe due to Unicode: each Russian letter is represented by two bytes, the first one is 0xD0 or 0xD1, and the second one is the letter itself.

Here is simplified code demonstrating the problem:

String SEnglish = String ("AZaz"); //english characters
String SRussian = String ("АЯая"); //russian characters
//
Serial.print("English Str: "); Serial.println(SEnglish);
Serial.print("Russian Str: "); Serial.println(SRussian);
//
Serial.printf("English length: %d\n", SEnglish.length());
Serial.printf("Russian length: %d\n", SRussian.length()); //0xD0 is first!
//
Serial.print("English hex:");
for (uint16_t i=0;  i<(SEnglish.length()); i++) {Serial.printf(" %X", SEnglish.charAt(i));}
Serial.println();
//
Serial.print("Russian hex:");
for (uint16_t i=0;  i<(SRussian.length()); i++) {Serial.printf(" %X", SRussian.charAt(i));}

Output is:

English Str: AZaz
Russian Str: АЯая
English length: 4
Russian length: 8
English hex: 41 5A 61 7A
Russian hex: D0 90 D0 AF D0 B0 D1 8F

I do not care about other languages in this project, only English and Russian, so what is the proper (standard) way to process this doubled number of bytes?
Are there standard functions to convert a Unicode string to a non-Unicode encoding?
Is it OK if I process every second byte (making sure every first byte is D0 or D1), or is there a better solution?
What do these D0, D1 mean? I have googled a little about Unicode and it seems like a first byte equal to D0 or D1 points to some Chinese symbols…

You could make a lookup table keyed on just the pairs, for example 'D0 90'. Each pair is a 16-bit quantity, so you search the table for it, and the other column holds the single-byte character that you want to display. That is a fallback, just in case you don't find a proper way.
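The lookup table idea could look roughly like the sketch below. It is only an illustration: the names `GlyphEntry`, `kGlyphTable`, `lookupGlyph` and `kUnknownGlyph`, and the glyph codes 0x80 to 0x83, are made up here and would have to match whatever font layout your library really uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct GlyphEntry {
  uint16_t utf8Pair;  // both UTF-8 bytes packed as (first << 8) | second
  uint8_t glyph;      // hypothetical single-byte code used by the font
};

// Only the four letters from the example; a real table needs every letter.
static const GlyphEntry kGlyphTable[] = {
  {0xD090, 0x80},  // А
  {0xD0AF, 0x81},  // Я
  {0xD0B0, 0x82},  // а
  {0xD18F, 0x83},  // я
};

static const uint8_t kUnknownGlyph = 0xFF;  // hypothetical "not in table" marker

// Looks up the two-byte UTF-8 sequence that starts at p.
uint8_t lookupGlyph(const char* p) {
  assert(p != nullptr);
  const uint16_t key = static_cast<uint16_t>(
      (static_cast<uint8_t>(p[0]) << 8) | static_cast<uint8_t>(p[1]));
  for (size_t i = 0; i < sizeof(kGlyphTable) / sizeof(kGlyphTable[0]); ++i) {
    if (kGlyphTable[i].utf8Pair == key) return kGlyphTable[i].glyph;
  }
  return kUnknownGlyph;
}
```

A linear search is fine for a table of this size; for all 66 Russian letters a sorted table with a binary search, or plain arithmetic on the code point, may be a better fit.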

The IDE (and the String class) represents string literals in UTF8, which is a variable-width character encoding scheme. Depending on your symbols (glyphs), you can get 1, 2, 3 or even 4 bytes for each glyph.

The serial monitor should support UTF8 (but only for strings), which is why your code example works.

The strlen() or length() functions, though, are totally unaware of the concept of glyphs or UTF8 and only count the number of bytes used for the representation of your text.
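If you need the number of characters rather than the number of bytes, one way is to count every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). This is a minimal sketch; `utf8Length` is a name made up for this example, and it assumes the input is valid, NUL-terminated UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counts characters (code points) instead of bytes in a UTF-8 string.
size_t utf8Length(const char* s) {
  assert(s != nullptr);
  size_t count = 0;
  for (; *s != '\0'; ++s) {
    // Continuation bytes look like 10xxxxxx; any other byte starts a character.
    if ((static_cast<uint8_t>(*s) & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

With a String you could call it as `utf8Length(SRussian.c_str())`.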

When it comes to decoding, UTF8 was designed for backward compatibility with ASCII, that means that the first 128 characters of UTF8 correspond one-to-one with ASCII. They are encoded using a single byte with the same binary value as ASCII, so a valid ASCII text is also a valid UTF-8-encoded text.

Note that you used a special A and a (the Cyrillic letters А and а, not the Latin ones)… You also need to ensure your text editor is set to UTF8 if you don’t use the IDE.

If you were to run this code:

void setup() {
  Serial.begin(115200);
  String SEnglish = String ("AZaz"); //english characters
  String SRussian = String ("AЯaя"); //russian characters

  Serial.print("English Str: "); Serial.println(SEnglish);
  Serial.print("Russian Str: "); Serial.println(SRussian);
  Serial.print("English length: "); Serial.println(SEnglish.length());
  Serial.print("Russian length: "); Serial.println(SRussian.length());
  
  Serial.print("English hex:");
  for (uint16_t i = 0;  i < (SEnglish.length()); i++) {
    Serial.print((byte) SEnglish.charAt(i), HEX);
    Serial.write(' ');
  }
  Serial.println();

  Serial.print("Russian hex:");
  for (uint16_t i = 0;  i < (SRussian.length()); i++) {
    Serial.print((byte) SRussian.charAt(i), HEX);
    Serial.write(' ');
  }
  Serial.println();
}

void loop() {}

You would see this in the Serial monitor at 115200 baud:

English Str: AZaz
Russian Str: AЯaя
English length: 4
Russian length: 6
English hex:41 5A 61 7A 
Russian hex:41 D0 AF 61 D1 8F 

You’ll recognize that A and a actually use 1 byte each and are coded in the same way (0x41 and 0x61) in both Strings, but your Russian characters Я and я are coded on 2 bytes (and some other glyphs could be 3 or 4 bytes long).

You can see all the glyphs in UTF8 here, and the UTF8 information for Я here (and find the D0 AF code) as an example

The way things are encoded to determine whether you need 1, 2, 3 or 4 bytes for the glyph is deterministic and needs no look-ahead: e.g. if the byte you read is strictly less than 0x80 then it’s the ASCII equivalent; if not, there are rules for lead bytes and continuation bytes that let you decide how many bytes you need.

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to F4 hex (240 to 244): first byte of a four-byte sequence.
C0, C1 and F5 to FF hex: never appear in valid UTF-8.

→ you can read that in the spec.
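The byte ranges above translate almost directly into code. This is a sketch, not taken from any library; `utf8SequenceLength` is a made-up name. It accepts only F0 to F4 as four-byte lead bytes, which is the limit set by RFC 3629, and returns 0 for a byte that cannot start a sequence.

```cpp
#include <cstdint>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
uint8_t utf8SequenceLength(uint8_t lead) {
  if (lead <= 0x7F) return 1;                  // ASCII, single byte
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // e.g. 0xD0 and 0xD1 (Cyrillic)
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  if (lead >= 0xF0 && lead <= 0xF4) return 4;
  return 0;  // 0x80..0xBF continuation byte, or an invalid value
}
```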

The challenge you have is that UTF8 can represent all the 1,112,064 valid code points in Unicode… So mapping a special byte sequence to something else is quite a task.

If I were you, I would identify the special glyphs you care about, detect them as you go through the bytes of your string, and have a special bitmap for painting each of them. Just remember that they can be 1, 2, 3 or 4 bytes long, so the matching needs to take that into account: if you find an unknown character, you can skip the appropriate number of bytes and carry on with the next one.
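Here is a rough sketch of that walk through the string. Everything about the font is an assumption made for the example: glyph indices 0 to 127 are taken to be ASCII, indices 128 to 191 the letters А to я (U+0410 to U+044F), and 0xFF marks "no bitmap". The names `utf8ToGlyphs`, `sequenceLength` and `kUnknownGlyph` are made up as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const uint8_t kUnknownGlyph = 0xFF;  // hypothetical "no bitmap" marker

// Length announced by a lead byte; stray bytes count as 1 so the walk advances.
static size_t sequenceLength(uint8_t lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;
}

// Converts UTF-8 text into glyph indices, one per character.
// Returns the number of glyph indices written to `out`.
size_t utf8ToGlyphs(const char* text, uint8_t* out, size_t outSize) {
  assert(text != nullptr && out != nullptr);
  const uint8_t* p = reinterpret_cast<const uint8_t*>(text);
  size_t n = 0;
  while (*p != 0 && n < outSize) {
    const size_t len = sequenceLength(*p);
    uint8_t glyph = kUnknownGlyph;
    if (*p < 0x80) {
      glyph = *p;  // ASCII maps to itself
    } else if (len == 2 && (p[1] & 0xC0) == 0x80) {
      const uint16_t cp = static_cast<uint16_t>(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
      if (cp >= 0x0410 && cp <= 0x044F) {
        glyph = static_cast<uint8_t>(128 + (cp - 0x0410));  // А..я
      }
    }
    out[n++] = glyph;
    // Skip the lead byte, then its continuation bytes; stops at the terminator.
    ++p;
    for (size_t i = 1; i < len && (*p & 0xC0) == 0x80; ++i) ++p;
  }
  return n;
}
```

Unknown characters still produce one output entry (0xFF), so the display code can decide whether to draw a placeholder or nothing at all.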

The first byte of a UTF-8 character already gives you the number of bytes used for that character. When you read 0xD0 = 0b11010000, the leading bits 110 tell you it’s a two-byte character. See J-M-L's link, it’s explained there.
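To make the bit layout concrete: a two-byte sequence is 110xxxxx 10yyyyyy, and the code point is the eleven bits xxxxxyyyyyy. For D0 AF that gives ((0xD0 & 0x1F) << 6) | (0xAF & 0x3F) = 0x400 | 0x2F = U+042F, which is Я. It also shows why looking at the second byte alone is not quite enough: D0 91 is Б but D1 91 is ё. The sketch below does the decoding and, assuming you want a single-byte encoding such as Windows-1251 as the target, maps the Russian letters to it. The function names are made up for the example.

```cpp
#include <cstdint>

// Decodes a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
uint16_t decodeTwoByte(uint8_t first, uint8_t second) {
  return static_cast<uint16_t>(((first & 0x1F) << 6) | (second & 0x3F));
}

// Maps a code point to its Windows-1251 byte.
// Handles ASCII and the Russian alphabet only; returns 0 for anything else.
uint8_t toCp1251(uint16_t cp) {
  if (cp < 0x80) return static_cast<uint8_t>(cp);  // ASCII is unchanged
  if (cp >= 0x0410 && cp <= 0x044F) {
    return static_cast<uint8_t>(cp - 0x0350);      // А..я -> 0xC0..0xFF
  }
  if (cp == 0x0401) return 0xA8;                   // Ё
  if (cp == 0x0451) return 0xB8;                   // ё
  return 0;
}
```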

The explanation here may help you understand your issue: Parola A to Z – Handling non ASCII characters (UTF-8) – Arduino++