Detecting non-ascii characters in a String

kaan0 · May 22, 2015, 5:30pm

Hello. I'm new to Arduino. I'm writing a library for a 3.5" TFT display with an ILI9327 controller. My problem is with the Turkish characters "ĞÜŞİÖÇğüşıöç". My String printing function is:

void YAZI(String yaz, uint8_t x, uint8_t y, uint16_t renk, uint16_t ardrenk) { // x and y are coordinates, renk is color and ardrenk is background color.
  uint8_t uz=yaz.length();
  char buf[uz+1];
  yaz.toCharArray(buf, uz+1);

  for(uint8_t a=0;a<uz;a++) {
    KARAKTER(buf[a], x + (a*8), y, renk, ardrenk); //KARAKTER function prints a char.
  }
  
}

and my character printing function is:

void KARAKTER(char harf, uint8_t x, uint8_t y, uint16_t renk, uint16_t ardrenk) { //harf is the character going to be written.
  
  SETXY(x,y,x+7,y+11); //My font data has 8 pixels of width and 12 pixels of height.
  
  if(harf>=32&&harf<0x7F){ // I want to check whether the char is a printable ASCII character (SmallFont starts at 32) or a NON-ASCII character. It detects well.
   byte ch;
   for(uint8_t j=0;j<12;j++) {
     ch =pgm_read_byte(&SmallFont[harf-32][j]);
     for(uint8_t k=0;k<8;k++) {
       if((ch&(1<<(7-k)))!=0)   
           {
             NOKTA(renk); //sets a pixel at font color
           } 
           else
           {
             NOKTA(ardrenk); //sets a pixel at background color
           }   

     }
    
   }
 }
 


}

So the char datatype is 8 bit and my non-ascii characters are more than 8 bit. I use the String.toCharArray function to create a character array that contains the text data.

I tried this code to convert a non-ASCII character to an integer:

char ch='Ğ';
int i = ch;     // I tried int, byte, long, etc.

but it gives the same result for several different characters.

I want to do something like this:

if(ch=='Ğ') {
  ch=128;
}

Sorry for the long post and bad English.

michinyon · May 22, 2015, 5:40pm

Well, the ASCII character set (which is 7 bit) doesn't include Turkish characters.

There are at least two ways this has been handled. One way is to have an alternative set of characters for the unsigned single-byte values 128 to 255. These are called "code pages", but they seem to be no longer fashionable.

The other way is to use Unicode, which introduces all sorts of problems in trying to deal with it in the C language, where a char is a single byte.

Java doesn't have that issue because all strings in Java are inherently Unicode.

I don't know how to solve your problem for the Arduino.

michinyon · May 22, 2015, 5:45pm

There seem to be several code pages for Turkish. Looking at "code page" on Wikipedia may be helpful.

You need a display device which is compatible with the code page, so that when you send the byte for the Turkish character you want, the device will display it.

michinyon · May 22, 2015, 5:49pm

kaan0:
So the char datatype is 8 bit and my non-ascii characters are more than 8 bit.

What system are you using for the "non-ascii characters" ?

kaan0 · May 22, 2015, 6:19pm

My text editor uses UTF-8, I think. I wrote the TFT library so that it can display Turkish characters with a custom font array. I tried storing 'Ğ' as an integer and it can detect the letter, but now the String printing function becomes useless because of the String to integer conversion.

michinyon · May 23, 2015, 1:06am

Here are some things you need to consider.

Do you need to send these strings to or from any other device ?

Do you need to display special characters from more than one language ?

Does your display device have the required font display information in its chip already ?

Does your display device allow you to download custom font data to it ?

Does your display device allow you to display text by sending pixel blocks to it ?

Is it really necessary to be able to use these characters with your arduino code text editor ?

If you don't need to communicate with any other device, and you can construct your own 8x12 pixel font, then this is the simplest solution:

You do not need to use multibyte characters at all. It is unnecessary and complicated. The Turkish alphabet is not large, so you can easily fit it in a single-byte character set. Remember, basic ASCII only uses the values 0 to 127. For most alphabetic languages, you can use single-byte characters with the code page scheme I mentioned before.

For example, in code page 857, upper case dotted I (İ) is represented by the single byte 0x98, which is 152 decimal. It is a number in the 128-255 range.

You can represent this in your code like

char city[] = { (char)0x98, 's', 't', 'a', 'n','b','u','l' };  // cast because 0x98 does not fit in a signed char

You can then use the code which you already have to display this, by sending pixels to the LCD. Your code will detect that the first char in this array, read as an unsigned byte, is bigger than 127, and will use your special font for it. (On the Arduino a plain char is signed, so 0x98 shows up as a negative number unless you cast it to an unsigned type.)

Similarly, ç and ş (c and s with cedillas) are 0x87 and 0x9F, so you can write in code

char cikis[] = {(char)0x87,'i','k','i',(char)0x9f } ;  // not named exit, which clashes with exit() from the C library at global scope

This is not as nice as having a WYSIWYG code editor which will display those characters, but if you only have a few of them, and you only need to do this once, you can put up with that inconvenience.

I wouldn't use the String class at all. I don't know how the String class works with chars in the 128-255 range. It might work. Try it.
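
The detection step described above (a byte above 127 selects the special font) could be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: the font table is assumed to hold the 95 printable ASCII glyphs first, followed by twelve extra Turkish glyphs in the order of the lookup table. The names glyphIndex, kTurkishCp857, kAsciiGlyphs and kNoGlyph are hypothetical. The byte values are the code page 857 codes for the twelve letters.

```cpp
#include <stdint.h>

// Assumed layout (hypothetical): glyphs 0..94 are ASCII 32..126, and
// glyphs 95..106 are the Turkish letters in the order listed below.
static const uint8_t kAsciiGlyphs = 95;
static const uint8_t kNoGlyph = 0xFF;

// Code page 857 byte values for: Ç Ğ İ Ö Ş Ü ç ğ ı ö ş ü
static const uint8_t kTurkishCp857[12] = {
  0x80, 0xA6, 0x98, 0x99, 0x9E, 0x9A,
  0x87, 0xA7, 0x8D, 0x94, 0x9F, 0x81
};

// Returns the font table row for one byte of text,
// or kNoGlyph if the assumed font has no glyph for that byte.
uint8_t glyphIndex(char c) {
  uint8_t b = (uint8_t)c;                  // plain char is signed on AVR, so compare as unsigned
  if (b >= 32 && b < 0x7F) return b - 32;  // printable ASCII
  for (uint8_t i = 0; i < 12; i++) {
    if (kTurkishCp857[i] == b) return kAsciiGlyphs + i;
  }
  return kNoGlyph;                         // control characters and unmapped bytes
}
```

With a font extended this way, a character routine like KARAKTER could read its rows from SmallFont[glyphIndex(harf)] and skip drawing when the result is kNoGlyph. On an Arduino the lookup table would normally be placed in PROGMEM as well; that is left out here to keep the sketch short.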

michinyon · May 23, 2015, 1:11am

Actually, when you initialize a char array one char at a time, you need to add the zero byte at the end yourself:

char attention[] = {"Dikkat"} ;                             // the 7th byte, a zero, is added automatically here
char cikis[] = {(char)0x87,'i','k','i',(char)0x9f,0x00 } ;  // you have to add the zero byte yourself

michinyon · May 23, 2015, 1:27am

You can also initialize exactly the same strings, like this

char city[] = "\x98stanbul" ;
char cikis[] = "\x87iki\x9F" ;   // not named exit, which clashes with exit() from the C library at global scope
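
As a quick check that the two initialization styles really produce the same bytes, here is a small sketch. The names cityBraces, cityLiteral, sThenA and sameBytes are made up for this illustration. It also shows one pitfall of the \x form: the escape swallows every hex digit that follows it, so a letter from a to f (or a digit) straight after the escape has to go in a separate literal.

```cpp
#include <string.h>

// Brace form: the terminating zero byte is written out by hand.
const char cityBraces[] = { (char)0x98, 's', 't', 'a', 'n', 'b', 'u', 'l', 0 };

// Literal form: the compiler appends the zero byte.
const char cityLiteral[] = "\x98stanbul";

// Pitfall: "\x9Fa" would be read as one escape with the value 0x9FA,
// so the literal is split and the compiler joins the two pieces.
const char sThenA[] = "\x9F" "a";

// True when both arrays have the same length and the same contents.
bool sameBytes() {
  return sizeof(cityBraces) == sizeof(cityLiteral) &&
         memcmp(cityBraces, cityLiteral, sizeof(cityLiteral)) == 0;
}
```

Both city arrays are 9 bytes long (8 characters plus the zero byte), and sThenA is 3 bytes long.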

PaulMurrayCbr · May 23, 2015, 3:46am

kaan0:
My text edior uses UTF-8 I think.

Ok.

To get your head around this, you need to understand that a character is not the same thing as a byte. Characters are encoded as a sequence of bytes. There are other ways of doing it - consider Morse code, where characters are encoded as series of dots and dashes separated by pauses.

A byte (in C) is the smallest chunk of addressable memory. Pretty much all machines these days have 8-bit bytes, but it ain't necessarily so, which is why standards often talk about "octets" rather than bytes.

Now, the most widely used standard for representing characters as sequences of bits and bytes is the American Standard Code for Information Interchange. ASCII. Most of the things you can type on a regular US keyboard are represented as 7-bit binary numbers.

So when I said earlier that "characters are encoded as a sequence of bytes", in ASCII 1 byte == 1 character. Easy as.

Now, various people noted that ASCII does not use the values 128-255, so for encoding languages that need characters outside the basic Latin set, the "code page" was born. You would create a set of characters occupying 128-255, and that set would be a code page. To use this, you simply have to know - somehow - which particular code page the chunk of text you are dealing with uses.

Anyway. This became a pain in the butt, because to produce arbitrary text you'd need to have some way to switch code pages midstream.

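Unicode deals with this by giving every character in every language its own number, called a code point. It started out with 16-bit values, which is enough space for most national languages, Turkish included. (The code space has since grown past 16 bits, up to U+10FFFF, to make room for everything else.)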

In Java, a 'char' is a 16-bit value. But since most electronic communication only uses the trusty old ASCII character set, transmitting Unicode text as 16-bit values wastes a lot of electrons. Not only that, but a stream of 16-bit characters is incompatible with a stream of trusty old 7-bit ASCII characters (padded out to 8 bits).

So Unicode defines an encoding named UTF-8. Each Unicode character is encoded as a run of one to four 8-bit bytes. In particular, all of the 7-bit ASCII characters are encoded as a run of length 1, so any ASCII-encoded document is automatically also a valid UTF-8 encoded document. The Turkish letters that are not in ASCII each take two bytes.
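
To make the "runs of varying length" concrete, here is a small sketch (the names utf8Length, kGBreve and kSCedilla are invented for it). The length of a run can be read from the top bits of its first byte. The two constants hold the UTF-8 bytes of Ğ and Ş, written as escapes so they do not depend on the editor's encoding. Note that both end in the byte 0x9E, which is probably why squeezing such a letter into a single char gives the same value for more than one letter. That explanation is a guess at what the compiler does with the extra byte, not something confirmed in this thread.

```cpp
#include <stdint.h>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or cannot start a sequence.
// This is not a full validator: it does not reject overlong forms.
uint8_t utf8Length(uint8_t lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // 10xxxxxx (continuation) or invalid
}

const char kGBreve[]   = "\xC4\x9E";  // Ğ, U+011E
const char kSCedilla[] = "\xC5\x9E";  // Ş, U+015E
```

The two letters differ only in their first byte (0xC4 versus 0xC5), so any code that keeps just the last byte cannot tell them apart.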

So.

To handle UTF-8 Turkish, the way to do it is to decode these strings into arrays of uint16_t quantities, one per character. This means, BTW, that most of the standard library string functions will not work as you'd expect.

And you need to explicitly encode and decode your character stream. With ASCII you don't need to think about this, but once you want Turkish (etc.) characters you do.
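
A minimal sketch of that decode step, under some assumptions: only 1-byte and 2-byte sequences are handled (enough for Turkish), anything else becomes U+FFFD, and each decoded letter is then mapped to a one-byte font code starting at 128, in the order the letters were listed in the question. The function names nextCodePoint, fontCode and toFontCodes and the 128..139 numbering are made up for this example. They are not part of any Arduino library.

```cpp
#include <stdint.h>
#include <stddef.h>

// Decodes the 1- or 2-byte UTF-8 sequence at s[*pos] and advances *pos.
// Longer or malformed sequences return 0xFFFD and skip a single byte.
uint16_t nextCodePoint(const char *s, size_t *pos) {
  uint8_t b0 = (uint8_t)s[*pos];
  if (b0 < 0x80) { *pos += 1; return b0; }   // ASCII
  uint8_t b1 = (uint8_t)s[*pos + 1];         // at worst this reads the terminating 0
  if ((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80) {
    *pos += 2;
    return (uint16_t)(((b0 & 0x1F) << 6) | (b1 & 0x3F));
  }
  *pos += 1;
  return 0xFFFD;
}

// Code points of Ğ Ü Ş İ Ö Ç ğ ü ş ı ö ç, mapped to font codes 128..139.
static const uint16_t kTurkish[12] = {
  0x011E, 0x00DC, 0x015E, 0x0130, 0x00D6, 0x00C7,
  0x011F, 0x00FC, 0x015F, 0x0131, 0x00F6, 0x00E7
};

// Maps a code point to a single-byte font code; unknown characters become '?'.
uint8_t fontCode(uint16_t cp) {
  if (cp < 0x80) return (uint8_t)cp;
  for (uint8_t i = 0; i < 12; i++) {
    if (kTurkish[i] == cp) return (uint8_t)(128 + i);
  }
  return '?';
}

// Converts a zero-terminated UTF-8 string into font codes.
// Writes at most outSize codes and returns how many were written.
size_t toFontCodes(const char *utf8, uint8_t *out, size_t outSize) {
  size_t pos = 0, n = 0;
  while (utf8[pos] != '\0' && n < outSize) {
    out[n++] = fontCode(nextCodePoint(utf8, &pos));
  }
  return n;
}
```

With this mapping, the UTF-8 text "Ğaş" comes out as the three codes 128, 'a' and 136, which a character routine can then look up in a font table with one entry per code.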

Google "arduino unicode utf-8 library".

http://playground.arduino.cc/Code/UTF-8

And the wikipedia page:

guix · May 23, 2015, 7:34am

I'm not sure if it will help you but try:

Open the Preferences window of the Arduino IDE, look at the bottom, there is a link to preferences.txt, click it.
Close the Arduino IDE
Open preferences.txt
Locate setting preproc.substitute_unicode, change its value to false
Save and close this file
Open the Arduino IDE and try your code again
