Printing and understanding international chars

I'm trying to make a simple sketch that prints international (more specifically Greek) characters to the Serial port. Later I'll modify it to print Greek chars to an LED matrix. Is that even possible from the hardware perspective?

The sketch below doesn't work. Is there any known workaround? Do AVRs only support ASCII?

void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.println("γεια!");
}

I'm trying to make a simple sketch that prints international (more specifically Greek) characters to the Serial port. Later I'll modify it to print Greek chars to an LED matrix. Is that even possible from the hardware perspective?

Hard to say without knowing more about your LED matrix. If it's something you've built then you can print anything you'd like on it.

The sketch below doesn't work.

In what way?

Do AVRs only support ASCII?

The AVR compiler supports eight-bit characters. Traditionally, the first 128 characters (0 to 127) are ASCII. In your example, it's up to the receiver to decide how those eight bits are interpreted.

Bear in mind that the "ε" you included in your post is a Unicode character that takes 16 bits here (two bytes in UTF-8). The compiler will store the "ε" in two adjacent bytes. Serial.println("ε") will send those two bytes to the PC. Most terminal applications (like Serial Monitor) will display each byte as a single character; eight bytes of "garbage" will be displayed instead of four Greek characters.
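To make the byte counts concrete, here is a minimal sketch in plain C++ (it runs on a PC and should adapt to a sketch). It assumes the source text is UTF-8, so the Greek letters are spelled out as explicit byte escapes. The names kHello, byteLength and countUtf8Chars are made up for illustration.

```cpp
#include <cstddef>
#include <cstring>

// "γεια!" spelled out as UTF-8 bytes: each Greek letter takes two bytes,
// the '!' takes one.
static const char kHello[] = "\xCE\xB3\xCE\xB5\xCE\xB9\xCE\xB1!";

// Number of bytes a print call would put on the wire (line ending excluded).
inline std::size_t byteLength(const char *s) {
  return std::strlen(s);
}

// Number of characters a UTF-8 aware receiver would show: count every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
inline std::size_t countUtf8Chars(const char *s) {
  std::size_t n = 0;
  for (; *s != '\0'; ++s) {
    if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
      ++n;
    }
  }
  return n;
}
```

For kHello, byteLength gives 9 and countUtf8Chars gives 5: eight bytes for the four Greek letters plus one for the '!'. A terminal that shows one character per byte prints nine symbols.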

The serial port of the Arduino can only send bytes (values 0..255) over the line. The receiving application interprets these byte values, typically as ASCII, but you are free to use another interpretation, including Greek characters. The Arduino does not know; it just sends bytes.

It's similar to writing a Word document, selecting all, and changing the font to Greek. The underlying bytes won't change, but the interpretation and visualization do.

Several LCD screens have the option to define your own character set. If there are enough free definable chars in the LCD, you could define the whole Greek alphabet.
If you are sending the data to a PC/Mac, it is up to the receiving app to translate the bytes to Greek.

A workaround I can think of is to place the Unicode (Greek subset) characters in EEPROM and overload the print statement in such a way that instead of [byte] it sends EEPROM[2*byte] and EEPROM[2*byte+1]. However, the receiving app must expect and understand Unicode. As there are 512 bytes in EEPROM, there are just enough memory locations to do this (256 entries of two bytes each).

Another workaround is to define two special characters to shift back and forth between font sets. This technique was already used in Morse code. If '>' is sent, the next font set is used, and if '<' is sent, the previous font set is used. The receiving app must understand this protocol, and it probably will only do so if you write it yourself.

Hope this helps.
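A rough sketch of the receiving side of such a shift protocol, in plain C++. The struct name ShiftDecoder, the constant kFontCount, and the choice of exactly two font sets (ASCII and Greek) are assumptions for illustration only.

```cpp
#include <cstdint>

const std::uint8_t kFontCount = 2;  // assumed: 0 = ASCII font, 1 = Greek font

// Hypothetical receiver state for the '>' / '<' shift protocol.
struct ShiftDecoder {
  std::uint8_t fontset = 0;

  // Feed one received byte. Returns true when 'glyph' holds a byte that
  // should be drawn using the current font set, false for a shift byte.
  bool feed(char c, char &glyph) {
    if (c == '>') {  // switch to the next font set
      fontset = static_cast<std::uint8_t>((fontset + 1) % kFontCount);
      return false;
    }
    if (c == '<') {  // switch to the previous font set
      fontset = static_cast<std::uint8_t>((fontset + kFontCount - 1) % kFontCount);
      return false;
    }
    glyph = c;
    return true;
  }
};
```

One consequence of this design is that '>' and '<' themselves can no longer be displayed, so rarely used control codes would probably be a better choice for the shift bytes.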

Hard to say without knowing more about your LED matrix. If it's something you've built then you can print anything you'd like on it.

It's a matrix based on the Holtek HT1632. The Arduino controls it. Forgot to mention!

In what way?

It prints garbage characters, as you said.

The AVR compiler supports eight-bit characters. Traditionally, the first 128 characters (0 to 127) are ASCII. In your example, it's up to the receiver to decide how those eight bits are interpreted.

What I want to do is make the Arduino print Greek characters on the matrix. I have a function that reads the "font" from an array and then prints the corresponding character on the display. I want to extend this function and make it print more characters. The array contains the printable ASCII characters.

This is what came to mind after your reply. I'll make another array containing the Unicode characters and modify the function to use that array when needed. Will that work, considering that Unicode chars use 16 bits?

EDIT:

The serial port of the Arduino can only send bytes (values 0..255) over the line. The receiving application interprets these byte values, typically as ASCII, but you are free to use another interpretation, including Greek characters. The Arduino does not know; it just sends bytes.

It's similar to writing a Word document, selecting all, and changing the font to Greek. The underlying bytes won't change, but the interpretation and visualization do.

Several LCD screens have the option to define your own character set. If there are enough free definable chars in the LCD, you could define the whole Greek alphabet.
If you are sending the data to a PC/Mac, it is up to the receiving app to translate the bytes to Greek.

Hope this helps.

Thanks! Yes it helped me clear some things :slight_smile:

Unicode, if UTF-8 encoding is used, represents each character using 1..4 bytes.

But you could try to use iso-8859-7 charset:

It is 8 bits per character and it should suit you well if you don't need to use any other "exotic" language simultaneously.
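As a rough illustration of how the 8-bit ISO-8859-7 codes relate to Unicode, here is a small plain C++ sketch. For the Greek letters (bytes 0xC1..0xFE, with 0xD2 unassigned) the Unicode code point is the byte value plus 0x02D0. The function names are made up for this example, and the second helper only covers code points that need exactly two UTF-8 bytes.

```cpp
#include <cstdint>

// Convert one ISO-8859-7 Greek letter (0xC1..0xFE, except the unassigned
// 0xD2) to its Unicode code point. Returns 0 for anything else.
inline std::uint16_t greekIsoToCodePoint(std::uint8_t b) {
  if (b < 0xC1 || b > 0xFE || b == 0xD2) {
    return 0;
  }
  return static_cast<std::uint16_t>(b + 0x02D0);  // 0xC1 -> U+0391 (Alpha)
}

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
inline void encodeUtf8TwoBytes(std::uint16_t cp, std::uint8_t out[2]) {
  out[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));    // 110xxxxx
  out[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));  // 10xxxxxx
}
```

With these, the single byte 0xC4 (Δ) maps to U+0394, which a UTF-8 terminal expects as the two bytes CE 94. The sketch can keep one byte per character internally and only expand it when talking to a UTF-8 receiver.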

This is what came to mind after your reply. I'll make another array containing the Unicode characters and modify the function to use that array when needed. Will that work, considering that Unicode chars use 16 bits?

It may work.

I can think of several things that can go wrong...

  • The editor allows you to enter Unicode characters but saves the source file in another format (like UTF-8). From your perspective, the characters will be corrupt.
  • The code that determines which character to output is written with the assumption that characters are eight bits.
  • The glyph data (what the character looks like) is a big array with an element for each character. Extending such an array to 16 bits will make it way too big for an Arduino.
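To put numbers on the last point, here is a tiny plain C++ calculation. It assumes 8 bytes per glyph, as in the font arrays in this thread; the names kBytesPerGlyph and glyphTableBytes are made up.

```cpp
#include <cstdint>

const std::uint32_t kBytesPerGlyph = 8;  // one byte per row, 8 rows per glyph

// Flash needed for a table holding one glyph per character code.
constexpr std::uint32_t glyphTableBytes(std::uint32_t codeCount) {
  return codeCount * kBytesPerGlyph;
}
```

A table indexed by every 16-bit code would need glyphTableBytes(65536) = 524288 bytes (512 KB), far more than the 32 KB of flash on an ATmega328. A table for just the 62 codes 0xC1..0xFE of ISO-8859-7 needs glyphTableBytes(62) = 496 bytes.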

But, my gut tells me it's worth a try. :wink:

Unicode, if UTF-8 encoding is used, represents each character using 1..4 bytes.

UTF-8 - Wikipedia

But you could try to use iso-8859-7 charset:
ISO 8859 Alphabet Soup
It is 8 bits per character and it should suit you well if you don't need to use any other "exotic" language simultaneously.

Nice idea. I'll create an array containing all the ISO-8859-7 chars in that order. But how will the function know that "Δ" is C4 and print it?

This is the function I'm talking about.

//*********************************************************************************************************
/*
 * Copy a character glyph from the myfont data structure to
 * display memory, with its upper left at the given coordinate
 * This is unoptimized and simply uses plot() to draw each dot.
 */
void ht1632_putchar(int x, int y, char c)
{
      // fonts defined for ascii 32 and beyond (index 0 in font array is ascii 32);
      byte charIndex;

      // replace undisplayable characters with blank;
      if (c < 32 || c > 126)
      {
            charIndex      =      0;
      }
      else
      {
            charIndex      =      c - 32;
      }

      // move character definition, pixel by pixel, onto the display;
      // fonts are defined as one byte per row;
      for (byte row=0; row<8; row++)
      {
            byte rowDots      =      pgm_read_byte_near(&myfont[charIndex][row]);
            for (byte col=0; col<6; col++)
            {
                  if (rowDots & (1<<(5-col)))
                        ht1632_plot(x+col, y+row, 1);
                  else 
                        ht1632_plot(x+col, y+row, 0);
            }
      }
}

Thanks for your answers!

// replace undisplayable characters with blank;
      if (c < 32 || c > 126)
      {
            charIndex      =      0;
      }

becomes something like

      // replace undisplayable characters with blank;
      // (char is signed on AVR, so compare the value as an unsigned byte)
      byte code = (byte)c;
      if (code < 32)
      {
            charIndex      =      0;
      }
      else if (code > 127)
      {
            // display greek font, e.g. iso-8859-7 // thanks to VilluV
      }
      else
      {
            // display ascii font
            ...
      }

I would try to configure my code editor so that it stores files in ISO-8859-7 encoding, and then it should happen "automagically". You just type the characters in strings and that's about it. When you write "Δ" it will store 0xC4 in the source file, just like it stores 0x61 when you write 'a', etc.

EDIT: That is: in your example, the "c" variable should already contain the correct code, exactly as it behaves for lower part of the ASCII table.

But you must be careful with those files: if you edit them with some other editor (or encoding), it will probably corrupt the strings.

Or you could escape Greek characters as codes in strings, like:
"this is delta \xc4 isn't it?"
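One thing to watch with hex escapes, shown as a small plain C++ sketch (the array names are made up): a \x escape swallows every hex digit that follows it, so a Greek letter followed directly by 0-9 or a-f needs the string literal to be split.

```cpp
#include <cstring>

// Delta (0xC4 in ISO-8859-7) followed by a space: the escape ends at the
// space, so this is safe as written.
static const char kSafe[] = "this is delta \xc4 isn't it?";

// If the next character is a hex digit (here 'e'), it would be read as part
// of the escape. Ending the literal right after the escape avoids that;
// the compiler joins adjacent string literals into one string.
static const char kSplit[] = "\xc4" "elta";
```

kSplit holds five bytes: 0xC4 followed by 'e', 'l', 't', 'a'.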

This is what I have so far, and unfortunately it isn't working. I'm trying to make it print the char 'Γ'.

This is the function:

void ht1632_putchar_greek(int x, int y, char c)
{
  byte charIndex;

  if (c < 193 || c > 254)
    charIndex = 0;
  else
    charIndex = c - 33;

  for (byte row=0; row<8; row++) {
    byte rowDots = pgm_read_byte_near(&greek[charIndex][row]);
    for (byte col=0; col<6; col++) {
      if (rowDots & (1<<(5-col)))
        ht1632_plot(x+col, y+row, 1);
      else 
        ht1632_plot(x+col, y+row, 0);
      }
  }
}

This is the array (just the first 4 chars for now):

unsigned char PROGMEM greek[4][8] = {
  {
    0x00,    // ________   blank (ascii 32)
    0x00,    // ________
    0x00,    // ________
    0x00,    // ________
    0x00,    // ________
    0x00,    // ________
    0x00,    // ________
    0x00     // ________
  },
  {
    0x00,    // ________   A
    0x1C,    // ___XXX__
    0x22,    // __X___X_
    0x22,    // __X___X_
    0x3E,    // __XXXXX_
    0x22,    // __X___X_
    0x22,    // __X___X_
    0x00     // ________
  },
  {
    0x00,    // ________   B
    0x3C,    // __XXXX__
    0x22,    // __X___X_
    0x3C,    // __XXXX__
    0x22,    // __X___X_
    0x22,    // __X___X_
    0x3C,    // __XXXX__
    0x00     // ________
  },
  {
    0x00,    // ________
    0x3E,    // __XXXXX_
    0x20,    // __X_____
    0x20,    // __X_____
    0x20,    // __X_____
    0x20,    // __X_____
    0x20,    // __X_____
    0x00     // ________
  }
};

EDIT:
I call the function like this:

ht1632_putchar_greek(0, 0, 'Γ');

Open the sketch source file with a hex editor and check how the char is stored there. It must be the same code that you are using in your program. If it is not, the file is still not in the correct encoding. If there are two or more bytes used for that char, it is still in UTF-8.

If it is correct, then the next step I'd take is to peek into the compiled .o file with a hex editor (or disassembler, like: http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1290209328/16#16 ) and see how that char is stored there.

PS:
How to change the default encoding in the Arduino IDE, I don't (yet) know; maybe someone else can help with that. If it is not possible, create a .h file with your string constants in some other editor and include this file in your sketch, but don't open or edit it with the IDE.

(oh the joys of i18n... :wink: )

EDIT: Maybe I'm reading your source code wrong, but I think that the character you're trying to display is not in the right place in the character map array. It must be at the place that it has in the ISO character map (minus 33), but right now it seems to be at the place of 'c'.
If you just want to test whether the characters are stored with the right codes, and don't want to create the full array yet, use a special condition in the 'if' clause that tests just this character code and gives the correct array index for it. If that works, you can go to the full-blown character set and remove the condition.

(Your character has code 0xC3, so it has to end up at position 195 - 33 = 162 in the char map array.)
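An alternative to padding the array out to position 162 is to change the offset instead, so a compact table keeps working. This is only a sketch under assumptions: entry 0 of the table is a blank, entries 1 and up follow ISO-8859-7 order starting at 0xC1, and the function name greekGlyphIndex is made up.

```cpp
#include <cstdint>

// Map an ISO-8859-7 code to a row of a compact Greek glyph table whose
// entry 0 is a blank and whose entries 1.. follow the ISO order from 0xC1.
// Taking the code as an unsigned byte avoids comparing a signed char
// against values above 127, which can never succeed.
inline std::uint8_t greekGlyphIndex(std::uint8_t code) {
  if (code < 0xC1 || code > 0xFE) {
    return 0;  // not a Greek letter: use the blank glyph
  }
  return static_cast<std::uint8_t>(code - 0xC0);  // 0xC1 -> 1, 0xC3 -> 3
}
```

With this mapping, code 0xC3 lands on entry 3 of a four-entry table (blank, Α, Β, Γ). While the table is still incomplete, the result should also be checked against the number of glyphs actually defined.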

Open the sketch source file with a hex editor and check how the char is stored there. It must be the same code that you are using in your program. If it is not, the file is still not in the correct encoding. If there are two or more bytes used for that char, it is still in UTF-8.

You are right. It is stored as CE 93 (two bytes). It should be C3.
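If the file has to stay in UTF-8, the two stored bytes can be turned back into a single ISO-8859-7 code at run time. A minimal plain C++ sketch follows; the function names are made up and it only handles two-byte sequences, which is enough for the Greek letters.

```cpp
#include <cstdint>

// Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
// Returns 0 if the bytes do not form such a sequence.
inline std::uint16_t decodeUtf8TwoBytes(std::uint8_t lead, std::uint8_t cont) {
  if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80) {
    return 0;
  }
  return static_cast<std::uint16_t>(((lead & 0x1F) << 6) | (cont & 0x3F));
}

// Map a Greek code point (U+0391..U+03CE, except the unassigned U+03A2)
// to its ISO-8859-7 byte. Returns 0 for anything else.
inline std::uint8_t codePointToGreekIso(std::uint16_t cp) {
  if (cp < 0x0391 || cp > 0x03CE || cp == 0x03A2) {
    return 0;
  }
  return static_cast<std::uint8_t>(cp - 0x02D0);
}
```

For the bytes CE 93, decodeUtf8TwoBytes gives 0x0393 (Γ), and codePointToGreekIso turns that into 0xC3, the ISO-8859-7 code for Γ.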

How to change the default encoding in the Arduino IDE, I don't (yet) know; maybe someone else can help with that. If it is not possible, create a .h file with your string constants in some other editor and include this file in your sketch, but don't open or edit it with the IDE.

Can you give a simple example? What do you mean with string constants?

Thanks for your help!

What I meant is basically this:
create a file called, for example, messages.h

#define SOME_MSG "blah blah"
#define OTHER_MSG "oh yea"

etc.
Edit this with some editor that can save files with iso encoding.

And now in your sketch, use:

#include "messages.h"

...

do_something_with_string(SOME_MSG);

Added bonus: If this works, you can have separate header files for different languages and include the one you need for translated versions of your software without touching other files!

But what I'm not sure about is how the C++ preprocessor handles files with a different encoding; maybe it doesn't work as I expect...

EDIT: You may have to sneak in the following parameter to the compiler:

-finput-charset="iso-8859-7"

But I don't yet know how to do that with the Arduino IDE...

I think I made some progress. I will try not to mess with file encodings and special compiler parameters.

When you cast a char to an integer, the number represents the character's position on the ASCII table. For example, the sketch below will print:

C
67

void setup() {
  Serial.begin(9600);
  
  char latin = 'C';
  
  Serial.println(latin);
  Serial.println((int)latin);
}

void loop() { }

If I change the char to 'Γ', it prints:


-108

Why -108? If I make the char unsigned, it prints 148. The sketch I want to print Greek chars with should work with those numbers (instead of the ISO-8859-7 ones). My questions now are: why does it print a negative number? Is it because it's an 8-bit unsigned value and I declared it as signed? In what encoding are the chars stored?

Thanks for your answers, you really helped :slight_smile:

Yes, it is negative because you're using the signed char data type, where the most significant bit is used as the sign bit (range -128..127). Use unsigned char for the range 0..255.

I think that the IDE saves files in UTF-8 encoding by default, but it might also use the system's default encoding. In Linux you can check what the default is from the LANG or LC_ALL env. variables; I don't know how to check that in Windows or OS X.
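The two printed numbers are the same eight bits read in two ways, as this small plain C++ sketch shows. The helper names are made up, and the signed result assumes the usual two's complement representation.

```cpp
#include <cstdint>

// Read a char as an unsigned byte value (0..255), regardless of whether
// plain char is signed on the target.
inline std::uint8_t asByte(char c) {
  return static_cast<std::uint8_t>(c);
}

// The same bits read as a signed 8-bit value (-128..127).
inline int asSignedValue(std::uint8_t b) {
  return static_cast<std::int8_t>(b);
}
```

The byte 0x94 is 148 when read as unsigned and 148 - 256 = -108 when read as signed. Casting to an unsigned type before comparing or indexing keeps the upper half of the code table usable.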