Unicode characters in Arduino

Hi guys,

I'm currently working on a project in which I'm trying to get data from a web page. I'm using a slightly modified version of the Ethernet Shield Web client example, and it works fine! The problem I'm facing is that the web page (HTML code) has Unicode characters [Greek (charset=iso-8859-7)], and what the Arduino is reading is not what it should be (e.g. "ÍçóéÜ Áíáôïëéêïý Áéãáßïõ" instead of "Νησιά Ανατολικού Αιγαίου"). Is there anything I can do about that, or is it one of the Arduino's constraints?

thanks in advance for your help!!!

what the arduino is reading is not what it should be

It's more likely that the Arduino is reading the Unicode just fine but is not managing to transmit or display it on whatever display device you are using. What you're getting is probably the result of displaying 8-bit codes of 128 and above, while what you want is to display 16-bit Unicode. (What IS your display? I don't know offhand of any "Serial Monitor"-like programs that read Unicode. The Arduino Serial Monitor certainly doesn't.)

ISO-8859-7 appears to be not Unicode.

It appears to be a standard which uses the full set of one-byte numbers (considered unsigned) from 0 to 255.

Regular ASCII uses the numbers from 0 to 127, and 128 to 255 represent characters for other languages (which ones depends on which part of ISO 8859 you are using).

This is sometimes referred to as 8-bit ASCII instead of 7-bit ASCII. As far as I know, these will be transmitted through the serial hardware and software just fine.

As the previous post says, the problem is with your display device and its font, not with Serial. If you are seeing all those French accented characters, your display device is displaying the byte codes 128-255 correctly, but it is apparently assuming that they are from ISO-8859-1. The important number there is the 1, which means part 1 of the standard, full of French and German accented characters, instead of part 7, where the Greek characters are. The character that looks like a Greek beta is actually the German "ß" (sharp s) character.

Wikipedia has a useful explanation of this in its ISO/IEC 8859 article,

and you can see that the first two bytes of your message are 0xCD 0xE7, which are displayed as a capital I with an acute accent (Í) and a c with cedilla (ç) from ISO-8859-1, where you want upper-case nu (Ν) and lower-case eta (η) from ISO-8859-7.
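
To make the two readings of those bytes concrete, here is a minimal sketch in plain C++, not tied to any Arduino library. The names latin1ToCodePoint and greekToCodePoint are made up for illustration. The Greek helper only covers ASCII and the letter range 0xBE to 0xFE, where ISO-8859-7 sits at a fixed offset of 0x2D0 from the Unicode Greek block; for anything else it simply returns 0.

```cpp
#include <cstdint>

// ISO-8859-1 maps every byte straight onto the Unicode code point
// with the same number.
uint32_t latin1ToCodePoint(uint8_t b) {
    return b;
}

// Hypothetical helper: the ISO-8859-7 reading of the same byte.
// Only ASCII and the Greek letters in 0xBE..0xFE are handled here;
// 0xD2 and 0xFF are unassigned in ISO-8859-7.
// Returns 0 for every byte this sketch does not support.
uint32_t greekToCodePoint(uint8_t b) {
    if (b < 0x80) return b;  // plain ASCII is identical in both sets
    if (b >= 0xBE && b <= 0xFE && b != 0xD2) return b + 0x2D0u;
    return 0;
}
```

With the bytes from the message, 0xCD gives U+00CD (Í) under ISO-8859-1 and U+039D (Ν) under ISO-8859-7, and 0xE7 gives U+00E7 (ç) and U+03B7 (η). Same bytes, different table.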

I believe that in the absence of actual unicode (ie using an 8-bit character set), you MUST set your display device to match the particular character set you are using MANUALLY.

PuTTY is a popular Windows terminal emulator that is more powerful than the Arduino Serial Monitor. Its documentation says:

4.10.1 Controlling character set translation

During an interactive session, PuTTY receives a stream of 8-bit bytes from the server, and in order to display them on the screen it needs to know what character set to interpret them in.

There are a lot of character sets to choose from. The ‘Received data assumed to be in which character set’ option lets you select one. By default PuTTY will attempt to choose a character set that is right for your locale as reported by Windows; if it gets it wrong, you can select a different one using this control.

A few notable character sets are:

The ISO-8859 series are all standard character sets that include various accented characters appropriate for different sets of languages.
The Win125x series are defined by Microsoft, for similar purposes. In particular Win1252 is almost equivalent to ISO-8859-1, but contains a few extra characters such as matched quotes and the Euro symbol.
If you want the old IBM PC character set with block graphics and line-drawing characters, you can select ‘CP437’.
PuTTY also supports Unicode mode, in which the data coming from the server is interpreted as being in the UTF-8 encoding of Unicode. If you select ‘UTF-8’ as a character set you can use this mode. Not all server-side applications will support it.

If you need support for a numeric code page which is not listed in the drop-down list, such as code page 866, then you can try entering its name manually (CP866 for example) in the list box. If the underlying version of Windows has the appropriate translation table installed, PuTTY will use it.

Open C:\Users\<your user>\AppData\Roaming\ArduinoXX\preferences.txt (where XX is your Arduino IDE version).

Change preproc.substitute_unicode from true to false.

This solved a problem for me with the Arduino not being able to correctly read characters such as é à etc from the Serial monitor.. maybe it will work for you too :wink:

It is not clear from the OP's post whether he is referring to problems with characters being sent to, or from, his Arduino.

If characters are being sent from the arduino, then changing the setting of your display device on your computer will work.

If the problem is with characters being sent to the Arduino, then the problem is going to be with the font installed on the Arduino's LCD (or wherever it is that the OP is seeing characters he doesn't want). If this is the case, then changing display settings on the computer terminal, or in the C++ preprocessor, is unlikely to work. He would need to somehow select or install an alternative font in his LCD module.

And ISO-8859 is NOT Unicode. It is a now largely obsolete scheme for representing alternative larger sets of non-ASCII characters within a single-byte character scheme.

First of all, thanks for the replies! I assumed it was Unicode because when I put these strange chars into the Free Unicode to ASCII Converter (The PCman Website), the result was the actual word! I don't believe the problem lies with the Serial Monitor, as it just shows the characters that the Arduino is reading. When I just read the web page (and print it to Serial), everything is exactly the same except for the Greek words. What I'm actually trying to do, though, is find a specific word inside the web page's HTML code. When the key word is in English it works fine, but when it is in Greek the Arduino can't find it.
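if (client.connected()) {
  // Skip ahead in the response until the key word is found
  if (client.find("??????")) {
    char c = client.read();  // first character after the key word
    Serial.print(c);
  } else {
    Serial.print("1");  // key word not found
  }
}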

I tried what guix suggested

guix:
Change preproc.substitute_unicode from true to false.

but nothing happened :frowning: .

Are you reading a web page from the internet on your Arduino? Or is your Arduino serving web pages? What is the actual pathway of these mystery characters between the internet, the PC and the Arduino?

if(client.find("??????")){

I am not sure how that piece of code will actually be processed.

The thing is, the Greek characters are going to appear as single bytes if ISO-8859-7 encoding is used, and as two-byte sequences in UTF-8 (those two bytes carry the character's Unicode code point).

If you are trying to "match" that, you need to know which encoding the string you are attempting to match is in.

If there is only one string you want, I'd figure out what the actual bytes are, and match that byte by byte.
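
One way to do that, sketched in plain C++ so it can be tried on a PC first: dump the received bytes as hex to see what is really there, then search the buffer for that exact byte sequence. toHex and findBytes are hypothetical helper names, not part of the Ethernet library, and the sketch assumes the interesting part of the response has already been copied into a buffer. On the Arduino itself the dump would presumably go through Serial.print(b, HEX) rather than a std::string.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Render a byte buffer as space-separated upper-case hex, e.g. "CD E7".
std::string toHex(const uint8_t* data, size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (size_t i = 0; i < len; ++i) {
        if (i != 0) out += ' ';
        out += digits[data[i] >> 4];
        out += digits[data[i] & 0x0F];
    }
    return out;
}

// Return the index of the first occurrence of needle in hay, or -1.
// The comparison is on raw byte values, so no character set is involved.
long findBytes(const uint8_t* hay, size_t hayLen,
               const uint8_t* needle, size_t needleLen) {
    const uint8_t* end = hay + hayLen;
    const uint8_t* hit = std::search(hay, end, needle, needle + needleLen);
    return hit == end ? -1 : static_cast<long>(hit - hay);
}
```

For example, the first word of the garbled text in the original post is the five bytes CD E7 F3 E9 DC, so that is the sequence to look for, regardless of how any terminal chooses to draw it.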

I'm trying to read data from a web page (http://www.hnms.gr/hnms/greek/forecast/forecast_city_html?&dr_city=Xanthi) through its HTML code! I've tried searching for those weird chars, but with no luck!

michinyon:
If there is only one string you want, I'd figure out what the actual bytes are, and match that byte by byte.

How can I do that? Even though it's not only one specific string.
I have attached a PDF with the Arduino code, the page's HTML, and what the Arduino is reading (printing) (char by char if you comment out if(client.find("??????")){ ).

dig beeper.pdf (241 KB)

Maybe you could read it in english to not be annoyed with those characters?

http://www.hnms.gr/hnms/english/forecast/forecast_city_html?&dr_city=Xanthi

LOL!!! I have to admit that this is the easiest way to do that!! I think I'll follow this road :stuck_out_tongue: !!! the question remains though...
Thanks for your help!!

You might need to make your non-ASCII string into a "wide" string constant like this:
L"Wide Constant"
which will be turned into an array of wchar_t instead of char.

Thanks for your reply, pepe!! I tried what you suggested, but if(client.find("\xc4\xf9\xe4\xe5\xea\xdc")) didn't seem to work! I also tried another web page with UTF-8 encoding (Καιρός Ξάνθη - 14 ημέρες). Here's what it prints:

Καιρός Ξάνθη - 14 ημέρες

instead of

?????? ????? - 14 ??????

.

Any suggestions? To be honest, I'm using the English version now, but I keep posting because I want to learn, maybe for the next time this problem comes up! Something else that came up: sometimes the Arduino reads many "yyyyyyyyyy". Has anyone else seen that?

Its charset is iso-8859-7. It uses a one-byte encoding, but the decoding is not the same as for US-ASCII.

I don't think that is really correct. ISO-8859-7 uses exactly the same encoding as US-ASCII for the byte values 0 to 127, and in addition assigns Greek character meanings to byte values in the range 128 to 255.

To reiterate what I said last week, unicode and UTF-8 are not your problem.

The web page you are looking at appears to be encoded using ISO-8859-7. That means the byte values 0-127 are standard 7-bit ASCII, and the byte values 128-255 are Greek. It is a single-byte code with 256 possible byte values.

When you come to display those values, they are not being displayed or misinterpreted as UTF-8 nor as any kind of unicode.

They are being displayed as if they are encoded as ISO-8859-1. That version of ISO-8859 has the same standard ascii encoding for byte values 0 to 127, and values 128-255 represent mainly french and german characters using accents and diacritic marks.

You can see this for yourself. The Wikipedia page lists the characters represented by the byte values 128-255 for each of the 15 parts of ISO-8859. For each Greek character you wanted, you are seeing the French or German special character that has the same hexadecimal byte value.

Therefore, it isn't a problem with the encoding of the values themselves. There is no translation of character values required. It is fundamentally a font problem with your display device, which is automatically displaying the font characters from the ISO-8859-1 character set, instead of ISO-8859-7.

There would seem to be two possible approaches to solving your problem.

The first method would be to somehow get your display device to use a font appropriate for ISO-8859-7 instead of ISO-8859-1.

The second method would be to translate your character stream into UTF-8, which would in practice require replacing byte values in the range 128-255 with sequences of two or three bytes, and then using a device which can display Unicode/UTF-8 encoding.
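
A rough sketch of the second method in plain C++, assuming the incoming bytes really are ISO-8859-7. iso88597ToUtf8 and isGreekLetterByte are hypothetical names. It relies on the Greek letters in ISO-8859-7 sitting at a fixed offset of 0x2D0 from their Unicode code points, all of which fall in the two-byte UTF-8 range. To stay short it passes ASCII through, converts only the Greek letters, and writes '?' for every other high byte (punctuation such as the euro sign would need a small lookup table and three-byte sequences).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True for the byte values that ISO-8859-7 assigns to Greek letters,
// accented or plain. 0xD2 and 0xFF are unassigned in the standard.
bool isGreekLetterByte(uint8_t b) {
    if (b == 0xB6 || b == 0xBC) return true;
    if (b >= 0xB8 && b <= 0xBA) return true;
    return b >= 0xBE && b <= 0xFE && b != 0xD2;
}

// Convert an ISO-8859-7 buffer to a UTF-8 string.
// Simplification: high bytes that are not Greek letters become '?'.
std::string iso88597ToUtf8(const uint8_t* data, size_t len) {
    std::string out;
    for (size_t i = 0; i < len; ++i) {
        const uint8_t b = data[i];
        if (b < 0x80) {
            out += static_cast<char>(b);  // ASCII: one byte, unchanged
        } else if (isGreekLetterByte(b)) {
            // Code point is at most 0x3CE, so the two-byte form applies:
            // 110xxxxx 10xxxxxx
            const uint16_t cp = static_cast<uint16_t>(b + 0x2D0);
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += '?';  // not handled in this sketch
        }
    }
    return out;
}
```

Fed the bytes CD E7 F3 E9 DC from the original post, this produces the ten bytes CE 9D CE B7 CF 83 CE B9 CE AC, which a UTF-8 terminal shows as Νησιά.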