Correct/real encoding for Serial.print / (char) / Serial Monitor

Hello folks! Good $time.

I’m writing a serial monitor program and my intention is to add options to choose among different encodings.
I've spent some time trying to understand what the default Arduino character encoding is. After searching the forum and Google, and experimenting with the Arduino, my conclusion is that the information on this subject is generally misleading or imprecise. So I decided to post this topic in the hope of confirming whether some of my conclusions are correct.

Recapping from Arduino pages:
“The char datatype is a signed type, meaning that it encodes numbers from -128 to 127.” https://www.arduino.cc/en/Reference/Char

The Arduino reference pages (Serial.print and others) indicate that the encoding used in sketches and by the Serial Monitor is ASCII: Arduino - ASCII chart

Sure, from 32 to 126 it really matches ASCII, but overall the compiler and Serial Monitor encoding is something else.

If I do

for (int r = 32; r < 256; r++) Serial.print((char)r);

or

for (int r = 32; r < 256; r++) Serial.write(r);

or

for (int r = 32; r < 256; r++) Serial.write((char)r);

The output in the Serial Monitor is exactly what's expected for cp1252 encoding (Windows-1252).
It may sound obvious, but since I didn't find this information anywhere, my real question is whether it's correct to make these assumptions:

1- When I cast any byte/int (0 - 255) to (char), what the compiler really does is apply cp1252 encoding.
2- Arduino’s Serial Monitor encoding is configured internally for cp1252.

Are these assumptions correct?
If so, it would be worth updating the reference pages to reflect this.

All characters in the ASCII range are also represented in ISO-8859-1.
All characters in the ISO-8859-1 range are also represented in cp1252.
Not the most correct term, but they are “backward compatible”, and cp1252 is the most complete (in terms of unique character representations) of the three.
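As far as I can tell, the only range where the last two differ visibly is 0x80 - 0x9F: ISO-8859-1 keeps control codes there, while cp1252 assigns printable characters (€, †, ™ and so on). So a minimal sketch like this (my own test idea, assuming the usual 9600 baud monitor) should tell the two apart:

void setup() {
  Serial.begin(9600);
  // 0x80-0x9F: control codes in ISO-8859-1, printable characters in cp1252
  for (int r = 0x80; r <= 0x9F; r++) {
    Serial.write((uint8_t)r);
  }
}

void loop() {}

If the first position shows the Euro sign, the monitor is decoding cp1252 rather than ISO-8859-1.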

Salute!

I doubt the AVR compiler is applying a specific encoding; more probably what you see in the Serial Monitor is your system's default encoding. Even when the ready-to-be-compiled source is saved to a temporary folder, the IDE (Java) may default to the OS encoding when saving it as a text file.

I may be wrong; try changing the Windows language (or encoding) without recompiling a sketch and see what happens in the Serial Monitor.

Thanks blimpyway.

I can't run this test right now but will do it by tomorrow. What you stated about (char) is interesting. Maybe what the cast to (char) does is just take incoming values from 0 - 255 and distribute them between -128 and +127. Looks like another thing to test.

Maybe what the cast to (char) does is just take incoming values from 0 - 255 and distribute them between -128 and +127. Looks like another thing to test.

Not even that. Char is just an unsigned byte; the difference between signed and unsigned bytes is how the most significant bit (bit 7) is handled by the program - as a sign bit, or as 128 (2^7) to be added to the number’s value.

So casting in this case doesn’t change any bits in the number, it just tells the compiler how to “look” at that 8-bit number.
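A quick way to check that (just a sketch of the idea) is to print the same 8 bits under both interpretations:

void setup() {
  Serial.begin(9600);
  byte a = 200;                  // bit pattern 11001000
  char b = (char)a;              // same bits, now read as signed
  Serial.println(a, BIN);        // 11001000
  Serial.println((byte)b, BIN);  // 11001000 - the cast changed nothing
  Serial.println((int)b);        // -56 - only the interpretation differs
}

void loop() {}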

NO, it doesn't work like that. It doesn't 'pick' or 'distribute' values.

The Arduino is a binary computer. The basic unit is a bit and ALL the low-level instructions inside the AVR Arduinos can ONLY operate on 8 bits at a time. 8 bits, with no other information, is a byte. It's awkward to write 8 bits and difficult to read, so the easiest way of talking about bytes is to use hexadecimal. A byte can hold all of the values between 0x00 and 0xFF inclusive.

When we output those values to the humans sitting outside the computer, there's a number of different ways of doing it. 0x32 is pretty much always going to represent the number 50 since it's the 50th value after zero.

The first problem is "how do we make negative binary numbers?" The simple solution is to make the first bit a 1, indicating that there's a minus sign in front. That is a little too simple and it turns out that negative maths is a lot easier if we flip all the bits and add 1. (The problem is the negative zero.) So 0x01 is positive 1 and 0xFF is negative 1 and there's no negative zero.
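You can check the "flip all the bits and add 1" rule on the Arduino itself; a minimal sketch:

void setup() {
  Serial.begin(9600);
  char one = 0x01;
  char minusOne = ~one + 1;             // flip all the bits and add 1
  Serial.println((int)minusOne);        // prints -1
  Serial.println((byte)minusOne, HEX);  // prints FF
}

void loop() {}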

Due to an unfortunate early lack of specification, C leaves it up to each compiler whether plain char is signed or unsigned, and on the AVR Arduinos it is signed. If you ask it to print out the value of a char, it will show the range 0x80 to 0xFF as the negatives -128 to -1. If you cast it to a byte or a uint8_t then you will see only positive numbers, up to 255.
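For example (same bits, two interpretations):

void setup() {
  Serial.begin(9600);
  char c = 0xFF;
  Serial.println((int)c);      // -1, because char is signed here
  Serial.println((uint8_t)c);  // 255, the same bits read as unsigned
}

void loop() {}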

The Arduino literally doesn't know anything about character encoding. It's too simple for that. All of the conversion to characters is done on your computer when it displays them to you. The compiler, too, converts characters into bytes according to a standard, and those bytes are what end up in the program uploaded to the Arduino.
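For instance, you can send the three UTF-8 bytes of the umbrella by hand; whether they show up as one character or as three is decided entirely by the program on the PC side:

void setup() {
  Serial.begin(9600);
  // UTF-8 byte sequence for U+2602 (umbrella)
  Serial.write(0xE2);
  Serial.write(0x98);
  Serial.write(0x82);
}

void loop() {}

A UTF-8 terminal shows ☂; a cp1252 one shows three unrelated characters.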

the easiest way of talking about bytes is to use hexadecimal

That is true when you need to be able to visualise the bit pattern of the byte, otherwise you might just as well use the decimal numbers between 0 and 255.

blimpyway: Not even that. Char is just an unsigned byte,

Nope. The data type "char" is signed on the Arduino.

blimpyway:
I doubt the AVR compiler is applying a specific encoding …

I’m not sure how to interpret this but if I do on Arduino IDE (Windows 7):

char control[] = "ABC";
char test[] = "¼";
char test2[] = "☂";
char test3[] = "¼☂";

Serial.println("control:");
for (byte s = 0; s < sizeof(control); s++) { byte a = control[s]; Serial.println(a); }

Serial.println("quarter:");
for (byte s = 0; s < sizeof(test); s++) { byte a = test[s]; Serial.println(a); }

Serial.println("umbrella:");
for (byte s = 0; s < sizeof(test2); s++) { byte a = test2[s]; Serial.println(a); }

Serial.println("quarterumbrella:");
for (byte s = 0; s < sizeof(test3); s++) { byte a = test3[s]; Serial.println(a); }

The results are the expected values for Unicode, in UTF-8, except the zeros. No idea why sizeof is returning unexpected values (or why the char arrays are storing an unexpected number of bytes), generating a zero at the end.

control:
65
66
67
0
quarter:
194
188
0
umbrella:
226
152
130
0
quarterumbrella:
194
188
226
152
130
0
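These values can be checked by hand against the UTF-8 rules. For ¼ (U+00BC, code point 188), the two-byte form 110xxxxx 10xxxxxx gives 0xC0 | (188 >> 6) = 0xC2 = 194 and 0x80 | (188 & 0x3F) = 0xBC = 188, exactly the pair printed above. A throwaway sketch confirming it:

void setup() {
  Serial.begin(9600);
  unsigned int cp = 0xBC;              // code point of ¼
  Serial.println(0xC0 | (cp >> 6));    // 194
  Serial.println(0x80 | (cp & 0x3F));  // 188
}

void loop() {}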

It looks like the compiler is doing UTF-8 encoding, though I'm not sure how the IDE passes the source code to the compiler. Maybe the IDE is converting these symbols to UTF-8 bytes before passing the code to the compiler (doing the encoding itself); no idea.

Anyway, I found this page saying “The gcc compiler that is used with Arduino supports UTF-8 encoding only.” http://www.visualmicro.com/page/User-Guide.aspx?doc=Non-ASCII.html

AWOL:
Nope. The data type "char" is signed on the Arduino.

If I do:

char control[] = "ABC";
char test[] = "¼";
char test2[] = "☂";
char test3[] = "¼☂";

Serial.println("control:");
for (byte s = 0; s < sizeof(control); s++) { int a = control[s]; Serial.println(a); }

Serial.println("quarter:");
for (byte s = 0; s < sizeof(test); s++) { int a = test[s]; Serial.println(a); }

Serial.println("umbrella:");
for (byte s = 0; s < sizeof(test2); s++) { int a = test2[s]; Serial.println(a); }

Serial.println("quarterumbrella:");
for (byte s = 0; s < sizeof(test3); s++) { int a = test3[s]; Serial.println(a); }

The result is:

control:
65
66
67
0
quarter:
-62
-68
0
umbrella:
-30
-104
-126
0
quarterumbrella:
-62
-68
-30
-104
-126
0

This is confusing. Looks like char array values are signed?

This is confusing. Looks like char array values are signed?

If a char is signed, then an element of a char array is also going to be signed.

The results are the expected values for Unicode, in UTF-8, except the zeros. No idea why sizeof is returning unexpected values (or why the char arrays are storing an unexpected number of bytes), generating a zero at the end.

Because small-s strings (C strings, as opposed to the String class) must have a null character at the end. Otherwise there is no way of knowing where the string ends.

Usually the null character is written as '\0' and in 99.9% of encodings it is a zero. It is not a good idea to rely on that. Always use '\0' when using it as a constant in your code.

This is also why the following code has problems:

  char wontwork[4] = "1234";  // no room for the terminating '\0'
  char willwork[5] = "1234";  // four characters plus the terminating '\0'
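A quick way to see the terminator being counted (just a sketch):

void setup() {
  Serial.begin(9600);
  char s[] = "ABC";
  Serial.println(sizeof(s));  // 4: three characters plus the '\0'
  Serial.println(strlen(s));  // 3: strlen stops at the '\0'
}

void loop() {}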

Sorry AWOL, my fault. While reading, I mixed up what you wrote with what you quoted in the post.
Thanks MorganS, that’s really important to know.

I did some more tests. For:

for (int a = 0; a <= 255; a++) {
  char b = a;
  int c = b;
  Serial.print(a);
  Serial.print(" | ");
  Serial.println(c);
}

Result is:

0 | 0
1 | 1
2 | 2
3 | 3
4 | 4

... etc etc

126 | 126
127 | 127
128 | -128
129 | -127

... etc etc

252 | -4
253 | -3
254 | -2
255 | -1
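The pattern in this table is just modular wrap-around: values below 128 stay the same, and anything from 128 up comes out as the value minus 256. Assuming that behaviour (it is what AVR-GCC does when converting out-of-range values to a signed char), the whole table can be reproduced without the cast:

void setup() {
  Serial.begin(9600);
  for (int a = 0; a <= 255; a++) {
    int c = (a < 128) ? a : a - 256;  // same result as going through char
    Serial.print(a);
    Serial.print(" | ");
    Serial.println(c);
  }
}

void loop() {}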

For:

String a = "☂";
char b[sizeof(a)];
a.toCharArray(b, sizeof(a));

for (byte c = 0; c < sizeof(a); c++) {
  byte d = b[c];
  Serial.println(d);
}

Result is:

226
152
130
0
0
0

The expected UTF-8 values for ‘umbrella’, plus some zeros: the null terminator, and then leftovers because sizeof(a) measures the String object itself (6 bytes on AVR: a buffer pointer plus capacity and length fields), not the text it holds.
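If the goal is to size the buffer to the text, a.length() is probably the right tool (my revised sketch, assuming the same umbrella string):

void setup() {
  Serial.begin(9600);
  String a = "☂";
  char b[a.length() + 1];        // +1 for the '\0'; length() is 3 here
  a.toCharArray(b, sizeof(b));
  for (byte c = 0; c < a.length(); c++) {
    Serial.println((byte)b[c]);  // 226, 152, 130 - and no stray zeros
  }
}

void loop() {}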

Finally:

byte a = 128;
char b = a;
Serial.print(b);
Serial.write(a);
delay(5000);

Will result in the same character displayed in my Serial Monitor.
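That matches how the Print class seems to be written: as far as I can tell, Serial.print(char) just forwards the byte to write(), while the numeric overloads format the value as decimal digits instead. A small sketch showing the three behaviours side by side:

void setup() {
  Serial.begin(9600);
  byte a = 65;
  Serial.println((char)a);  // prints the character: A
  Serial.println(a);        // prints the number: 65
  Serial.write(a);          // sends the raw byte, displayed as: A
}

void loop() {}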

Researching the impact of source file encoding, it seems that avr-gcc inherits the same preprocessor options defined for GCC (though probably not all of them apply to avr-gcc): Preprocessor Options (Using the GNU Compiler Collection (GCC))

There are options for -fwide-exec-charset=charset and -finput-charset=charset

My new list of conclusions:
1- In fact, like blimpyway said, the compiler does not perform any decoding/encoding. But the preprocessor does decode: one can decide what bytes will be passed to the compiler by setting -finput-charset, UTF-8 being the default.

2- Serial.print just sends the char's byte value over the serial line. In practice, for a char it sends bytes that line up with UTF-8, cp1252 or ISO-8859-1 (since 0 to 255 translate to the same characters in these, with some gaps/unused positions depending on the encoding). ASCII shares the same byte values up to 126.

Finally, neither the compiler nor Serial.print makes any commitment about charset encoding; it's the outside world's responsibility to deal with it, or to configure the preprocessor so the compiler receives the bytes you meant.
I guess these may be very obvious conclusions (not for me, anyway); I'm just trying to understand how it works and the scope of Serial.print.

EDIT:
Correction: UTF-8 is not like this. The correct statement is: “Under UTF-8, the byte encoding of character numbers between 0 and 127 is the binary value of a single byte, just like ASCII. For character numbers between 128 and 65535, however, multiple bytes are used.”