Characters with higher ASCII-values - problem

I made an array with font definitions for an LED matrix.
That works fine and all letters and numbers show well.
BUT: some other languages need characters with a higher ASCII value, like the ü, which has a value of 129.
Strangely enough, when I ask for the ASCII value of 'ü' it returns 135 instead,
and when I ask for the values of all characters between 128 and 150, I get all kinds of values, but most of them are 195...
This is done on an ESP32C3. Example code:

scrollText = "ÇüéâäàåçêëèïîìÄÅÉæÆôöòû";  // all characters from 128-150
for (int i = 0; i < 23; i++) {
  Serial.print(scrollText[i], DEC); Serial.print(" "); Serial.write(scrollText[i]);
}

this is the output:
195 �135 �195 �188 �195 �169 �195 �162 �195 �164 �195 �160 �195 �165 �195 �167 �195 �170 �195 �171 �195 �168 �195 �

As you can see, there are a lot of 195 values and no characters.

for the lower values it works ok:

scrollText = "ABCDEFGHIJKLMNOPQRSTUVW";
output:
65 A66 B67 C68 D69 E70 F71 G72 H73 I74 J75 K76 L77 M78 N79 O80 P81 Q82 R83 S84 T85 U86 V87 W

How do I do this right?

ASCII only defines values from 0-127.
Other characters are encoded in UTF-8 as multi-byte sequences (2 bytes for the accented letters here).

When I enter the degree symbol, °, in the serial monitor and output the hex value, I get

⸮ C2
⸮ B0

using

void setup ()
{
    Serial.begin    (9600);
    Serial.println ("ready");
}
void loop ()
{
    if (Serial.available ()) {
        int c = Serial.read ();
        Serial.print ((char) c);
        Serial.print (" ");
        Serial.println (c, HEX);
    }
}

I get the following output for "ÇüéâäàåçêëèïîìÄÅÉæÆôöòû"

⸮ C3
⸮ 87
⸮ C3
⸮ BC
⸮ C3
⸮ A9
⸮ C3
⸮ A2
⸮ C3
⸮ A4
⸮ C3
⸮ A0
⸮ C3
⸮ A5
⸮ C3
⸮ A7
⸮ C3
⸮ AA
⸮ C3
⸮ AB
⸮ C3
⸮ A8
⸮ C3
⸮ AF
⸮ C3
⸮ AE
⸮ C3
⸮ AC
⸮ C3
⸮ 84
⸮ C3
⸮ 85
⸮ C3
⸮ 89
⸮ C3
⸮ A6
⸮ C3
⸮ 86
⸮ C3
⸮ B4
⸮ C3
⸮ B6
⸮ C3
⸮ B2
⸮ C3
⸮ BB

The ü does have the value 129 in the old DOS code page 437, but modern compilers use Unicode, where its code point is 252. Furthermore, strings are encoded as UTF-8, which is what the Serial Monitor expects.

If you check strlen(scrollText), it's longer than 23.
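A rough sketch of why the byte count and the character count differ: strlen() counts bytes, while a character count has to skip the continuation bytes (those of the form 10xxxxxx). The helper name utf8Length is made up for this illustration, and the literals use \x escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: count characters (code points) in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new character.
size_t utf8Length(const char *s) {
    size_t count = 0;
    for (; *s; ++s) {
        if ((static_cast<uint8_t>(*s) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For "\xC3\x87\xC3\xBC" (the bytes of "Çü" from the hex dump above), strlen() gives 4 while utf8Length() gives 2. By the same logic the 23-character string in the question should occupy 46 bytes, which matches the 46 lines of the hex dump.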

The first byte follows the 110xxxyy pattern, indicating a two-byte encoding. Do the bit math with xxx = 000 and yy = 11: 192 + 3 = 195. That means each of the code points is at least 192, since the two high bits of yyyy are 11 (as the top bits of an 8-bit value, 11000000 = 192), and indeed 252 >= 192.

The second/last byte pattern is 10yyzzzz.

  • 252 - 192 = 60
  • 60 + 128 = 188

which is the fourth decimal value printed: the second byte for the second character ü
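The same bit math can be sketched in code. This is only an illustration for code points in the two-byte range (U+0080 to U+07FF); the names Utf8Pair and encodeTwoByte are made up here and are not part of any Arduino API.

```cpp
#include <cstdint>

// Hypothetical result type: the two bytes of a 2-byte UTF-8 sequence.
struct Utf8Pair {
    uint8_t first;   // 110xxxyy
    uint8_t second;  // 10yyzzzz
};

// Hypothetical helper: encode a code point in U+0080..U+07FF as UTF-8.
// No range checking is done; values outside that range give wrong results.
Utf8Pair encodeTwoByte(uint16_t codePoint) {
    Utf8Pair p;
    p.first  = 0xC0 | (codePoint >> 6);    // 110 prefix + upper 5 bits
    p.second = 0x80 | (codePoint & 0x3F);  // 10 prefix + lower 6 bits
    return p;
}
```

encodeTwoByte(252) gives 195 and 188 for ü, and encodeTwoByte(0xB0) gives 0xC2 and 0xB0 for the degree symbol, matching the serial output shown earlier.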

So to do this correctly, you need to decode UTF-8, and have your font table be indexed by Unicode, or its single-byte subset ISO-8859-1, if it has all the characters you want to support.
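A minimal decoding sketch, assuming the font table only covers ASCII and ISO-8859-1 (code points up to 255), so only 1-byte and 2-byte sequences need handling. The function name nextCodePoint and the '?' fallback are choices made for this example, not anything from the original code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read one character from the UTF-8 string s starting
// at index i, advance i past it, and return its code point.
// The caller must stop when s[i] is the terminating NUL.
// Sequences longer than 2 bytes (or malformed ones) are skipped and
// reported as '?' so something can still be drawn.
uint16_t nextCodePoint(const char *s, size_t &i) {
    uint8_t b0 = static_cast<uint8_t>(s[i++]);
    if (b0 < 0x80) return b0;                      // plain ASCII, 1 byte
    if ((b0 & 0xE0) == 0xC0) {                     // 110xxxyy lead byte
        uint8_t b1 = static_cast<uint8_t>(s[i]);
        if ((b1 & 0xC0) == 0x80) {                 // 10yyzzzz continuation
            i++;
            return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
        }
    }
    // Unsupported or malformed: skip any continuation bytes that follow.
    while ((static_cast<uint8_t>(s[i]) & 0xC0) == 0x80) i++;
    return '?';
}
```

With a font table that starts at space, the table index would then be the returned code point minus 0x20, for example 252 - 32 = 220 for ü, provided the table really is laid out in ISO-8859-1 order.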

usually a 2-byte? I think


The 3-byte codes have a larger number of possible characters than the 2-byte codes, and since there are currently some 4-byte UTF-8 codes I'd assume the shorter codes are fully populated. The shorter codes would tend to be the more commonly used, since they were allocated first. Some obscure emoji is likely to be a 4-byte code.

Many fit in 2 bytes indeed, but UTF-8 (which is the encoding used in the IDE) represents each character with 1 to 4 bytes.

Think of the terminal as dumb. There is no control capability. It uses the 7 bit ASCII character set. If you want that you can use a terminal emulator on your PC.

The Serial monitor is fully capable of interpreting UTF8, as is the IDE.


It only appears to support the lower 7 bits and a few control characters on my IDE. No cursor control or anything like that, mainly tab, CR, LF. What is the trick to turn on the rest of it? According to Wikipedia, UTF-8 is a variable-length encoding with 1-4 bytes per code point, and ASCII values are encoded as ASCII using 1 byte. How do I enable the rest of it?

Which IDE? It does this automatically with 1.x or 2.x on my Mac.

Run this

void setup() {  
  Serial.begin(115200);  
  Serial.println("Hello 🌍! π ≈ 3.1416 🚀");  
}  

void loop() {}


Thanks all,
I think I solved it:

int acar = int(scrollText[letter] - 0x20);
if(acar==163) acar=int(scrollText[++letter]-0x20);

Simply put: read a character from the string at the index "letter", and if it is 195, increment "letter" and read the next character instead.
(As my font table starts with space = 32, I subtract 32 = 0x20 from the value, so 195 becomes 163.)

The output is OK now on the LED matrix ;)

To make the code more readable, you could let the compiler provide the ASCII code:

int acar = int(scrollText[letter] - ' '); // the cast is not necessary 
if(acar==163) acar=int(scrollText[++letter]-' ');

If you think that " " is more readable than 0x20..... fine.

That should be ' ', not " ". " " is an array with a space followed by a NUL.
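The difference can be checked at compile time. The sketch below is only an illustration; fontIndex is a made-up name that mirrors the subtraction used in the lookup above.

```cpp
#include <cstddef>

// ' ' is a single char with the value 0x20.
static_assert(' ' == 0x20, "space is 0x20 in ASCII");
// " " is an array of two chars: the space and the terminating NUL.
static_assert(sizeof(" ") == 2, "string literal includes the NUL");

// Hypothetical helper: zero-based index into a font table whose first
// entry is the space character.
int fontIndex(unsigned char c) {
    return c - ' ';
}
```

Here fontIndex('A') is 33 and fontIndex(195) is 163, the same value tested for in the code above.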

Look up Extended ASCII, the chars from 128 to 255 (0x80 to 0xFF), as shown in the backs or fronts of programming books since the late 70's. All are single 8-bit chars.

They don't print without a FONT that goes that far and they do if it does.

They are great if you want to make linework boxes with text, I used them to make a POV 25x25 maze game.

I think that the 2 byte chars came along with LIMMS.

Of course I think so. It makes it clear what you do rather than using a magic number.

Since " " is 0x20, 0x0, perhaps ' ' is a better representation.

What is 'magical' about using hex?
Easily explained.
Arthur C. Clarke famously stated that "any sufficiently advanced technology is indistinguishable from magic".

It’s not the HEX, it’s the 20 :)

0x20 is sufficiently advanced to appear magical then.

I have an attitude after being crapped on for using pre-calculated data in my on-the-fly word match algorithm by someone who played superiority at (not on) me about my "spaghetti" code who was unable to figure out how it works if his failed attempt to clean it up is any indication. What can I say? He left the bits out that make it work for wrong words so it passed a test that fed only correct words in to get worst-case match speed and when I pointed that out, did not return with corrected code after claiming he could make it smaller and faster with his superior yadda-yadda. What a shame, he made it a few bytes smaller by making it fail!

Magic numbers... when I read that, it triggers me. People who can read hex have never cracked hex dumps.

Definitely magical (which is not always a good thing in programming :) )

or never had to maintain someone else's code...