Topic: Unicode characters in Arduino
Newbie
Karma: 0
Posts: 8

Hi guys,

I'm currently working on a project in which I'm trying to get data from a web page. I'm using the Ethernet Shield Web Client example, slightly modified, and it works fine! The problem I'm facing is that the web page (HTML code) has unicode characters [Greek (charset=iso-8859-7)] and what the arduino is reading is not what it should be (e.g. "ÍçóéÜ Áíáôïëéêïý Áéãáßïõ" instead of "Νησιά Ανατολικού Αιγαίου"). Is there anything I can do about that, or is it one of the arduino's constraints?

thanks in advance for your help!!!

Logged

SF Bay Area (USA)
Tesla Member
Karma: 135
Posts: 6786
Strongly opinionated, but not official!

Quote
what the arduino is reading is not what it should be

It's more likely that the arduino is reading the unicode just fine, but is not managing to transmit/display to whatever display device you are using.  What you're getting is probably  the result of displaying 8bit codes greater than 128, while what you want is to display 16bit unicode.  (what IS your display?  For instance, I don't know offhand of any "Serial monitor" like programs that read unicode...  The Arduino Serial Monitor certainly doesn't do it.)
Logged

Faraday Member
Karma: 62
Posts: 3077

ISO-8859-7 appears not to be Unicode.

It appears to be a standard which uses the full set of one-byte numbers (considered unsigned) from 0 to 255.

Regular ASCII uses the numbers from 0 to 127, and 128-255 represent characters in other languages (which ones depends on which part of the 8859 standard you are using).

This is sometimes referred to as 8-bit ASCII instead of 7-bit ASCII. As far as I know, these will be transmitted through the serial hardware and software just fine.

As the previous post says, the problem is with your display device and its font, not with Serial. If you are seeing all that French accent crap, your display device is displaying the byte codes 128-255 correctly, however it is apparently assuming that they are from ISO-8859-1, where the important number is the 1, which means part 1 of the standard, which is full of French and German accented characters, instead of part 7 where the Greek characters are. The character that looks like a Greek beta is actually a German "ss" character (ß).

Wikipedia has a useful explanation of this here: https://en.wikipedia.org/wiki/ISO/IEC_8859

and you can see that the first two bytes of your message are 0xCD 0xE7, which are displayed as a capital I with an acute accent (Í) and a c with cedilla (ç) from ISO-8859-1, where you want upper-case Nu (Ν) and lower-case eta (η) from ISO-8859-7.
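
To make the difference between the two tables concrete, here is a minimal sketch in plain C++ (nothing Arduino-specific) that interprets the same byte under both standards. The function names are made up for illustration, and the Greek mapping only covers ASCII plus the range 0xB4-0xFE; the punctuation in 0xA0-0xB3 is left out and reported as U+FFFD.

```cpp
#include <cstdint>

// ISO-8859-1: every byte maps to the Unicode code point with the same value.
uint32_t latin1ToCodePoint(uint8_t b) {
  return b;
}

// ISO-8859-7 (partial, illustrative): ASCII plus the Greek block 0xB4-0xFE.
// Anything this sketch does not cover is returned as U+FFFD.
uint32_t greekToCodePoint(uint8_t b) {
  if (b < 0x80) return b;                              // plain ASCII
  if (b == 0xB7 || b == 0xBB || b == 0xBD) return b;   // same as ISO-8859-1
  if (b >= 0xB4 && b <= 0xFE && b != 0xD2) {
    // Greek letters sit at a fixed offset from their Unicode code points.
    return static_cast<uint32_t>(b) + 0x2D0;
  }
  return 0xFFFD;                                       // not covered here
}
```

With these, 0xCD gives U+00CD (Í) under ISO-8859-1 but U+039D (Ν) under ISO-8859-7, and 0xE7 gives U+00E7 (ç) versus U+03B7 (η). The bytes are the same; only the table used to read them changes.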

« Last Edit: July 27, 2014, 09:19:38 pm by michinyon » Logged

SF Bay Area (USA)
Tesla Member
Karma: 135
Posts: 6786
Strongly opinionated, but not official!

I believe that in the absence of actual unicode (ie using an 8-bit character set), you MUST set your display device to match the particular character set you are using MANUALLY.

PuTTY is a popular Windows terminal emulator that is more powerful than the Arduino Serial Monitor. Its documentation says:
Quote
4.10.1 Controlling character set translation

During an interactive session, PuTTY receives a stream of 8-bit bytes from the server, and in order to display them on the screen it needs to know what character set to interpret them in.

There are a lot of character sets to choose from. The ‘Received data assumed to be in which character set’ option lets you select one. By default PuTTY will attempt to choose a character set that is right for your locale as reported by Windows; if it gets it wrong, you can select a different one using this control.

A few notable character sets are:

    The ISO-8859 series are all standard character sets that include various accented characters appropriate for different sets of languages.
    The Win125x series are defined by Microsoft, for similar purposes. In particular Win1252 is almost equivalent to ISO-8859-1, but contains a few extra characters such as matched quotes and the Euro symbol.
    If you want the old IBM PC character set with block graphics and line-drawing characters, you can select ‘CP437’.
    PuTTY also supports Unicode mode, in which the data coming from the server is interpreted as being in the UTF-8 encoding of Unicode. If you select ‘UTF-8’ as a character set you can use this mode. Not all server-side applications will support it.

If you need support for a numeric code page which is not listed in the drop-down list, such as code page 866, then you can try entering its name manually (CP866 for example) in the list box. If the underlying version of Windows has the appropriate translation table installed, PuTTY will use it.
Logged

France
Edison Member
Karma: 38
Posts: 1012
Scientia potentia est.

Open C:\Users\<your user>\AppData\Roaming\ArduinoXX\preferences.txt (where XX is your Arduino IDE version).

Change preproc.substitute_unicode from true to false.

This solved a problem for me with the Arduino not being able to correctly read characters such as é, à etc. from the Serial monitor. Maybe it will work for you too ;)
« Last Edit: July 27, 2014, 10:26:01 pm by guix » Logged

Faraday Member
Karma: 62
Posts: 3077

It is not clear from the OP's post whether he is referring to problems with characters being sent to, or from, his arduino.

If characters are being sent from the arduino, then changing the setting of your display device on your computer will work.

If the problem is with characters being sent to the arduino, then the problem is going to be with the font installed on the arduino's LCD (or wherever it is that the OP is seeing characters he doesn't want). If this is the case, then changing display settings on the computer terminal, or in the C++ preprocessor, is unlikely to work. He would need to somehow select or install an alternative font into his LCD module.

And, ISO-8859 is NOT Unicode. It is a now largely obsolete scheme for representing alternative larger sets of non-ASCII characters within a single-byte character scheme.
Logged

Newbie
Karma: 0
Posts: 8

First of all, thanks for the replies! I assumed it was unicode because when I put these strange chars in here http://www.thepcmanwebsite.com/unicode_converter.shtml the result was the actual word! I don't believe the problem lies with the serial monitor, as it just shows the characters that the arduino is reading. When I just read the web page (and print it to Serial) everything is exactly the same except for the Greek words. However, what I'm trying to do is find a specific word inside the web page's HTML code. When the keyword is in English it works fine, but when it is in Greek the arduino can't find it.
Code:
if (client.connected()) {
  // Look for the Greek keyword in the incoming HTML
  if (client.find("Δωδεκά")) {
    char c = client.read();  // first character after the match
    Serial.print(c);
  } else {
    Serial.print("1");       // keyword not found
  }
}

I tried what guix suggested:
Quote
Change preproc.substitute_unicode from true to false.
but nothing happened :( .

« Last Edit: July 28, 2014, 05:22:29 am by PascP » Logged

Faraday Member
Karma: 62
Posts: 3077

You are reading a web page from the internet  to your arduino ?    Or is your arduino serving web pages ?       What is the actual pathway of these mystery characters  between the internet,  the pc and the arduino ?
Logged

Faraday Member
Karma: 62
Posts: 3077

Code:
if(client.find("Δωδεκά")){

I am not sure how that piece of code will actually be processed.

The thing is, the Greek characters are going to appear as single bytes if ISO-8859-7 encoding is used, and as two-byte sequences if the Unicode text is encoded as UTF-8 (those two bytes carry the bits of a single character number).

If you are trying to "match" that, you would need to know what encoding the string you are attempting to match is in.

If there is only one string you want, I'd figure out what the actual bytes are, and match that byte by byte.
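
One way to figure out the actual bytes is to dump the string as hex and compare. The following is only a rough sketch in plain C++; `toHexEscapes` is a hypothetical helper name, not part of any Arduino library. On the Arduino itself, printing each received byte with `Serial.print(b, HEX)` should give the same information.

```cpp
#include <cstdio>
#include <string>

// Render every byte of a string as a \xNN escape, so the result can be
// compared with (or pasted into) a C string literal.
std::string toHexEscapes(const std::string& bytes) {
  std::string out;
  char buf[5];  // backslash, 'x', two hex digits, terminating NUL
  for (unsigned char c : bytes) {
    std::snprintf(buf, sizeof buf, "\\x%02x", c);
    out += buf;
  }
  return out;
}
```

Fed with a capital delta, this gives `\xc4` if the text is ISO-8859-7 and `\xce\x94` if it is UTF-8, so the dump also tells you which encoding you are looking at.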
Logged

Newbie
Karma: 0
Posts: 8

I'm trying to read data from a web page (http://www.hnms.gr/hnms/greek/forecast/forecast_city_html?&dr_city=Xanthi) through its HTML code! I've tried searching for those weird chars but with no luck!
Quote
If there is only one string you want,  I'd figure out what the actual bytes are,   and match that byte by byte.
How can I do that? Even though it's not only one specific string.
I have attached a pdf with the arduino code, the page's HTML and what the arduino is reading (printing) (char by char if you comment out
Code:
if(client.find("Δωδεκά")){
).

* dig beeper.pdf (241.04 KB - downloaded 9 times.)
Logged

France
Edison Member
Karma: 38
Posts: 1012
Scientia potentia est.

Maybe you could read it in English, so you aren't bothered by those characters?

http://www.hnms.gr/hnms/english/forecast/forecast_city_html?&dr_city=Xanthi
Logged

Newbie
Karma: 0
Posts: 8

LOL!!! I have to admit that this is the easiest way to do that!! I think I'll follow this road :P !!! The question remains though...
Thanks for your help!!
Logged

Edison Member
Karma: 33
Posts: 1468

You might need to make your non-ascii string into a "wide" string constant like this:
 
Code:
L"Wide Constant"

which will be turned into an array of wchar_t instead of char.
Logged

Full Member
Karma: 3
Posts: 205

Hi,

Of course, one can avoid the problem by getting the English version of the web page. But the problem with the non-English version of the page can also be solved.


The characters typed in the Arduino IDE and programs appear to be stored in the UTF-8 format, that is, as a sequence of one or more bytes representing 7- to 21-bit coded values. As the 7-bit values (decimal 0 to 127) correspond to US-ASCII codes, the encoding of simple English words, digits and punctuation requires only one-byte characters.


On the other hand, your program reads and tests the exact content of the original web page, in HTML format, without any transcoding.

Unfortunately, the encoding of HTML pages can vary from one page to another. Normally, it's explicitly specified in the header of the page, but sometimes it isn't. For instance, the web page you have indicated says:

Code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">

Its charset is iso-8859-7. It uses a one-byte encoding, but the decoding of the values above 127 differs from both US-ASCII and UTF-8. So it is incompatible with the UTF-8 encoding and decoding of the Arduino IDE and programs. If the charset specified in the page had been UTF-8, your program could have read and tested its Greek content a bit more directly.


As a workaround, you can type the characters using their values in the iso-8859-7 charset, e.g. "\xe1" for "α" or "\xf9" for "ω".

Instead of:

Code:
if(client.find("Δωδεκά")){

you have to write:

Code:
if(client.find("\xc4\xf9\xe4\xe5\xea\xdc")){
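
If typing the escapes by hand gets tedious, the translation can also be done in code. What follows is only a sketch in plain C++, assuming the source file really is saved as UTF-8; `utf8ToIso88597` is a hypothetical helper, and it only handles ASCII and the Greek letters between U+0384 and U+03CE.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-8 text to ISO-8859-7 bytes (partial, illustrative).
// ASCII is copied as-is, Greek letters are shifted into 0xB4-0xFE, and
// every byte or sequence this sketch does not handle becomes '?'.
std::string utf8ToIso88597(const std::string& utf8) {
  std::string out;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char b0 = static_cast<unsigned char>(utf8[i]);
    if (b0 < 0x80) {                       // one-byte sequence: ASCII
      out += static_cast<char>(b0);
      continue;
    }
    // Two-byte sequence: 110xxxxx 10xxxxxx
    if ((b0 & 0xE0) == 0xC0 && i + 1 < utf8.size()) {
      unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
      if ((b1 & 0xC0) == 0x80) {
        unsigned cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
        ++i;                               // consume the continuation byte
        bool greek = cp >= 0x0384 && cp <= 0x03CE &&
                     cp != 0x0387 && cp != 0x038B &&
                     cp != 0x038D && cp != 0x03A2;
        if (greek) {
          out += static_cast<char>(cp - 0x2D0);
          continue;
        }
      }
    }
    out += '?';                            // not representable in this sketch
  }
  return out;
}
```

For the keyword above, the UTF-8 bytes `\xce\x94\xcf\x89\xce\xb4\xce\xb5\xce\xba\xce\xac` come out as `\xc4\xf9\xe4\xe5\xea\xdc`, which is the same sequence as the hand-written escapes.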


The hexadecimal codes for the iso-8859-7 characters are:

│   nbsp :a0   │   °:b0   │   ΐ:c0   │   Π:d0   │   ΰ:e0   │   π:f0   │
│   ʽ:a1   │   ±:b1   │   Α:c1   │   Ρ:d1   │   α:e1   │   ρ:f1   │
│   ʼ:a2   │   ²:b2   │   Β:c2   │     │   β:e2   │   ς:f2   │
│   £:a3   │   ³:b3   │   Γ:c3   │   Σ:d3   │   γ:e3   │   σ:f3   │
│     │   ΄:b4   │   Δ:c4   │   Τ:d4   │   δ:e4   │   τ:f4   │
│     │   ΅:b5   │   Ε:c5   │   Υ:d5   │   ε:e5   │   υ:f5   │
│   ¦:a6   │   Ά:b6   │   Ζ:c6   │   Φ:d6   │   ζ:e6   │   φ:f6   │
│   §:a7   │   ·:b7   │   Η:c7   │   Χ:d7   │   η:e7   │   χ:f7   │
│   ¨:a8   │   Έ:b8   │   Θ:c8   │   Ψ:d8   │   θ:e8   │   ψ:f8   │
│   ©:a9   │   Ή:b9   │   Ι:c9   │   Ω:d9   │   ι:e9   │   ω:f9   │
│     │   Ί:ba   │   Κ:ca   │   Ϊ:da   │   κ:ea   │   ϊ:fa   │
│   «:ab   │   »:bb   │   Λ:cb   │   Ϋ:db   │   λ:eb   │   ϋ:fb   │
│   ¬:ac   │   Ό:bc   │   Μ:cc   │   ά:dc   │   μ:ec   │   ό:fc   │
│   sh:ad   │   ½:bd   │   Ν:cd   │   έ:dd   │   ν:ed   │   ύ:fd   │
│     │   Ύ:be   │   Ξ:ce   │   ή:de   │   ξ:ee   │   ώ:fe   │
│   ―:af   │   Ώ:bf   │   Ο:cf   │   ί:df   │   ο:ef   │     │

nbsp: non-breaking space
sh: soft hyphen


But unfortunately, that's not all!

Web pages can also contain HTML sequences of US-ASCII characters beginning with "&" and ending with ";" that are used to encode charset-independent characters, e.g. "&alpha;", "&#945;" or "&#x3B1;" standing for "α".

If required, such sequences have to be considered independently and treated as character strings, or the program has to convert them into their equivalent single characters.
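
As an illustration of the second option, here is a small sketch in plain C++ that turns a numeric reference such as "&#945;" or "&#x3B1;" into its character number. `parseNumericEntity` is a hypothetical name, and named references like "&alpha;" are deliberately not handled, since they would need a lookup table.

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Parse a complete numeric character reference ("&#945;" or "&#x3B1;").
// Returns true and stores the code point on success, false otherwise.
bool parseNumericEntity(const std::string& ref, uint32_t& codePoint) {
  if (ref.size() < 4 || ref[0] != '&' || ref[1] != '#' || ref.back() != ';') {
    return false;
  }
  bool hex = (ref[2] == 'x' || ref[2] == 'X');
  // Digits sit between the "&#" (or "&#x") prefix and the trailing ';'.
  std::string digits = ref.substr(hex ? 3 : 2, ref.size() - (hex ? 4 : 3));
  if (digits.empty() || !std::isxdigit(static_cast<unsigned char>(digits[0]))) {
    return false;
  }
  char* end = nullptr;
  unsigned long value = std::strtoul(digits.c_str(), &end, hex ? 16 : 10);
  if (*end != '\0' || value > 0x10FFFF) {
    return false;  // trailing junk or outside the Unicode range
  }
  codePoint = static_cast<uint32_t>(value);
  return true;
}
```

Both "&#945;" and "&#x3B1;" yield 0x3B1, the character number of "α", which could then be mapped to the page's charset before comparing.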
« Last Edit: July 30, 2014, 12:55:11 pm by _pepe_ » Logged

Newbie
Karma: 0
Posts: 8

Thanks for your reply pepe!! I tried what you suggested, but "if(client.find("\xc4\xf9\xe4\xe5\xea\xdc"))" didn't seem to work! I also tried another web page with UTF-8 encoding (http://www.okairos.gr/%CE%BE%CE%AC%CE%BD%CE%B8%CE%B7.html). Here's what it prints

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#" lang="gr" xml:lang="en">

   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

      <!--[if lte IE 6]>
         <meta http-equiv="imagetoolbar" content="no" />
      <![endif]-->

      <title>Καιρός Ξάνθη - 14 ημέρες</title>
      <meta name="description" content="Δείτε τον καιρό σε Ξάνθη για 14 ημέρες. Καιρός, άνεμοι, μέγιστες και ελάχιστες θερμοκρασίες, ανατολή και δύση." />
      <meta name="keywords" content="kairos Xanthi, o kairos Xanthi, καιρός Ξάνθη, ο καιρός Ξάνθη, Ξάνθη καιρός, ο καιρός σε Ξάνθη, ο καιρός για Ξάνθη" />

      <meta name="locality" content="Ξάνθη, Ανατολική Μακεδονία και Θράκη, Ελλάδα" />

      <meta name="revisit-after" content="1 days" />
      <meta name="lang" content="gr" />

instead of

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#" lang="gr" xml:lang="en">

   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

      <!--[if lte IE 6]>
         <meta http-equiv="imagetoolbar" content="no" />
      <![endif]-->

      <title>Καιρός Ξάνθη - 14 ημέρες</title>
      <meta name="description" content="Δείτε τον καιρό σε Ξάνθη για 14 ημέρες. Καιρός, άνεμοι, μέγιστες και ελάχιστες θερμοκρασίες, ανατολή και δύση." />
      <meta name="keywords" content="kairos Xanthi, o kairos Xanthi, καιρός Ξάνθη, ο καιρός Ξάνθη, Ξάνθη καιρός, ο καιρός σε Ξάνθη, ο καιρός για Ξάνθη" />

      <meta name="locality" content="Ξάνθη, Ανατολική Μακεδονία και Θράκη, Ελλάδα" />

      <meta name="revisit-after" content="1 days" />
      <meta name="lang" content="gr" />

.

Any suggestions? To be honest I'm using the English version now, but I keep posting as I want to learn, maybe for the next time this problem comes up!! Something else that came up: sometimes the arduino reads many "yyyyyyyyyy". Has anyone else seen that?
Logged
