Wokwi String bug with UTF8 data — interesting !

Here is some code to run on an Uno:

const char * utf8Message = "éè§à€£";
String result = "";

void setup() {
  Serial.begin(115200);
  size_t n = strlen(utf8Message);
  Serial.print("\nCopying "); Serial.print(n);
  Serial.println(" bytes into the String.");

  for (size_t i = 0; i < n; i++) {
    result.concat(utf8Message[i]);
    //Serial.print("Copy at this stage ["); Serial.print(result); Serial.println("]"); // <==== uncomment
  }

  Serial.print("Original: ["); Serial.print(utf8Message); Serial.println("]");
  Serial.print("Copy: ["); Serial.print(result); Serial.println("]");

}

void loop() {}

The output of this code is as expected:

Copying 13 bytes into the String.
Original: [éè§à€£]
Copy: [éè§à€£]

Now uncomment the debug line in the for loop where the concat happens.

The output becomes a mess, and even the original const cString is printed incorrectly:

Copying 13 bytes into the String.
Copy at this stage [Ã]
Copy at this stage [é]
Copy at this stage [éÃ]
Copy at this stage [éè]
Copy at this stage [éèÂ]
Copy at this stage [éè§]
Copy at this stage [éè§Ã]
Copy at this stage [éè§à ]
Copy at this stage [éè§à â]
Copy at this stage [éè§à â‚]
Copy at this stage [éè§à €]
Copy at this stage [éè§à €Â]
Copy at this stage [éè§à €£]
Original: [éè§à €£]
Copy: [éè§à €£]

Testing this on a real Uno does the right thing:

Copying 13 bytes into the String.
Copy at this stage [⸮]
Copy at this stage [é]
Copy at this stage [é⸮]
Copy at this stage [éè]
Copy at this stage [éè⸮]
Copy at this stage [éè§]
Copy at this stage [éè§⸮]
Copy at this stage [éè§à]
Copy at this stage [éè§à⸮]
Copy at this stage [éè§à⸮]
Copy at this stage [éè§à€]
Copy at this stage [éè§à€⸮]
Copy at this stage [éè§à€£]
Original: [éè§à€£]
Copy: [éè§à€£]

Some of the intermediate output is bogus because the UTF8 characters I've used take multiple bytes each, so a partial sequence can't be interpreted. In the end it all works out and the String matches the original cString.

This is the first time I have caught the simulator behaving differently from the real hardware on basic code.

Does anyone have an explanation for why that is? Maybe a bug in the console's UTF8 character display that does not recover?

Interesting.

You have a 6-character UTF8 string in a char buffer which, with the terminator \0, is 13 bytes long.
You iterate through the buffer 1 byte at a time. You hand each byte, that is, one half of a UTF8 character, to a String method, and finally the terminating null character, and expect it to join them up again in a way that is once again recognisable as 6 individual UTF8 characters.

I'm surprised that there are situations where this actually works! It is difficult to get my mind around how.

The really odd thing is that if you don't print the String after adding each character, it works correctly in the simulator.

Not a byte, a char, so the addition to the String should work regardless of the value in that char.

I can see printing the String giving bad results when the final character in the String is the first byte of a UTF-8 character: the display would try to take the next character as part of the UTF-8 sequence instead of the terminating null. It's possible that this is what is messing up the simulator: instead of looking for a terminating null as the end of the String, it looks for a terminating null after each full UTF-8 character.

Typical use case: you build up a String from user input arriving on a UTF8-enabled Stream (terminal, internet, ...)

I do indeed have 6 glyphs in my cString, which translate into 13 bytes of valid data representing "éè§à€£", plus an additional null char in the buffer (so the cString actually uses 14 bytes).
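To make the 6 glyphs / 13 bytes arithmetic concrete, here is a minimal desktop C++ sketch (not Arduino code). The helper names `utf8SequenceLength`, `countGlyphs` and `checkThreadMessage` are made up for this illustration, and the message is written as explicit bytes so the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
// Returns 0 for a continuation byte (10xxxxxx) or a byte >= 0xF8.
std::size_t utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Counts glyphs (code points) by counting every byte that is NOT a
// continuation byte. Assumes the input is well-formed UTF-8.
std::size_t countGlyphs(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) count++;
  }
  return count;
}

// The message from the first post, spelled out byte by byte:
// é = C3 A9, è = C3 A8, § = C2 A7, à = C3 A0, € = E2 82 AC, £ = C2 A3
void checkThreadMessage() {
  const std::string msg =
      "\xC3\xA9\xC3\xA8\xC2\xA7\xC3\xA0\xE2\x82\xAC\xC2\xA3";
  assert(msg.size() == 13);                // what strlen() reports
  assert(countGlyphs(msg) == 6);           // what a human sees
  assert(utf8SequenceLength(0xE2) == 3);   // the euro sign is the 3-byte one
}
```

Five glyphs take 2 bytes and the euro sign takes 3, hence 5 × 2 + 3 = 13.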

When I hand a byte over to the String class, it appends it to the underlying cString and moves its hidden terminating null char one byte further. So it's not surprising that along the way you get some strings that make no sense from a UTF8 perspective and cannot be interpreted.

But in the end, when all bytes have been added, the result should be fine, as proven by the test on the real Arduino.
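As a rough illustration of which intermediate copies can be interpreted at all, the desktop C++ sketch below replays the byte-by-byte append with `std::string` standing in for the Arduino String. The names `isCompleteUtf8`, `printableLengths` and `checkPrintableLengths` are hypothetical, and the validity check is simplified (it does not reject overlong encodings or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True when every multi-byte sequence in s is complete: each lead byte is
// followed by the right number of continuation bytes (10xxxxxx).
bool isCompleteUtf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (lead < 0x80) len = 1;
    else if ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    else return false;                     // stray continuation or bad lead
    if (i + len > s.size()) return false;  // sequence cut short at the end
    for (std::size_t k = 1; k < len; k++) {
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    }
    i += len;
  }
  return true;
}

// Appends one byte at a time, like the loop in the first post, and records
// the lengths at which the partial copy is well-formed UTF-8.
std::vector<std::size_t> printableLengths(const std::string& message) {
  std::vector<std::size_t> lengths;
  std::string copy;
  for (char c : message) {
    copy += c;
    if (isCompleteUtf8(copy)) lengths.push_back(copy.size());
  }
  return lengths;
}

void checkPrintableLengths() {
  // "éè§à€£" written as explicit bytes (13 of them).
  const std::string msg =
      "\xC3\xA9\xC3\xA8\xC2\xA7\xC3\xA0\xE2\x82\xAC\xC2\xA3";
  const std::vector<std::size_t> expected = {2, 4, 6, 8, 11, 13};
  assert(printableLengths(msg) == expected);
}
```

Those six lengths line up with the six clean lines in the real Uno output; the other seven lengths (1, 3, 5, 7, 9, 10, 12) are the ones that show a ⸮.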

What's "interesting" is that the Wokwi terminal seems to get into limbo when you try to display a badly formed UTF8 string, and this has ripple effects on the rest of the display (I can't imagine it actually modified my cString in memory).

My guess is that the terminal never recovers from a bogus UTF8 data flow, whereas the one we have in the IDE somehow does.

Seems so. If I modify the code to print the String only when it contains an even number of characters, it prints correctly up until the next-to-last character, which happens to be a 3-byte UTF-8 code.
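For what "recovering" could look like, here is a minimal desktop C++ sketch of a decoder that emits U+FFFD for a byte it cannot use and then resynchronises on the very next byte. This is only a guess at the general technique; it is not taken from the IDE's Serial Monitor or from Wokwi, and `decodeWithRecovery`, `kReplacement` and `checkRecovery` are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::uint32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Decodes bytes into code points. A byte that cannot start or complete a
// sequence yields one U+FFFD, and decoding restarts at the following byte.
// Simplified: overlong encodings and surrogates are not rejected.
std::vector<std::uint32_t> decodeWithRecovery(const std::string& bytes) {
  std::vector<std::uint32_t> out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80) { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

    bool ok = (len != 0) && (i + len <= bytes.size());
    for (std::size_t k = 1; ok && k < len; k++) {
      unsigned char next = static_cast<unsigned char>(bytes[i + k]);
      if ((next & 0xC0) != 0x80) ok = false;  // not a continuation byte
      else cp = (cp << 6) | (next & 0x3F);
    }

    if (ok) {
      out.push_back(cp);
      i += len;
    } else {
      out.push_back(kReplacement);
      i += 1;  // drop only the offending byte, then resynchronise
    }
  }
  return out;
}

void checkRecovery() {
  // Two debug prints back to back: "[" C3 "]" and then "[" C3 A9 "]".
  const std::string stream = "[\xC3][\xC3\xA9]";
  const std::vector<std::uint32_t> expected =
      {0x5B, kReplacement, 0x5D, 0x5B, 0xE9, 0x5D};  // [ ? ] [ é ]
  assert(decodeWithRecovery(stream) == expected);
}
```

The lone 0xC3 costs one replacement character and nothing more; the é in the next print still decodes to U+00E9.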

That's my impression too.

Everyone complains about the Serial Terminal emulation in the IDE, but it's smarter and can recover from bad UTF8 input, it would seem.

:slight_smile:

Woyncha drop the boys over at wokwi a note about this? They appear to want to make it work.

OIC you have done. THX never mind.

a7

:wink: I did


Except that the Serial Monitor is software, not hardware.

Looks like it is switching to Latin1 (ISO-8859-1) to print that first single byte, which is not valid UTF-8. You can reproduce this by changing the start of the "working" code to:

  Serial.begin(115200);
  char cur = 0xA4;  // currency sign (generic) ¤
  Serial.println(cur);

That prints the currency sign, and then "breaks" the UTF-8 as before by treating each byte in the sequence as Latin1 as well.
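To see what "treating each byte as Latin1" does to the bytes, here is a small desktop C++ sketch. It only illustrates the hypothesis above and says nothing about how Wokwi is actually implemented; `latin1ToUtf8` and `checkMojibake` are made-up names.

```cpp
#include <cassert>
#include <string>

// Re-encodes a byte string as UTF-8 under the assumption that every input
// byte is one Latin-1 (ISO-8859-1) character, i.e. code point == byte value.
std::string latin1ToUtf8(const std::string& bytes) {
  std::string out;
  for (unsigned char c : bytes) {
    if (c < 0x80) {
      out += static_cast<char>(c);                  // ASCII stays as is
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));    // 110000xx
      out += static_cast<char>(0x80 | (c & 0x3F));  // 10xxxxxx
    }
  }
  return out;
}

void checkMojibake() {
  // The lone byte 0xA4 read as Latin-1 is U+00A4, the currency sign.
  assert(latin1ToUtf8("\xA4") == "\xC2\xA4");
  // "é" is C3 A9 in UTF-8. Read as two Latin-1 characters those bytes are
  // U+00C3 and U+00A9, which display as "Ã" followed by "©".
  assert(latin1ToUtf8("\xC3\xA9") == "\xC3\x83\xC2\xA9");
}
```

Every UTF-8 lead byte in the 0xC3 range therefore shows up as "Ã", which is the pattern visible in the broken simulator output.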

Arguably this is a feature: defaulting to UTF-8 and auto-detecting usage of a single-byte character set, which is still quite common. (I wonder if you get a locale-specific variant of the SBCS with whatever your browser/OS defaults to, like 8859-4 or even Windows-1252.)

There's just no good way to switch back, automatically or not. So you'd want an option to disable the auto-detect, and/or a menu to switch the active character set encoding.

True

And the browser probably does not help either.

To close this thread: superhero Uri Shaked fixed this issue in no time.

The Wokwi example from the first post now works like a charm!

And then there was also this one.
https://forum.arduino.cc/t/physical-uno-and-wokwi-give-different-results-for-a-simple-test/1328159/43
