Serial problem with accented characters

Hello, I have a problem with Serial and characters such as é, ô, etc. Well, it's a strange problem! Look at this code:

void setup ()
{
  Serial.begin(9600);
}

void loop()
{
  if ( Serial.available() > 0 ) 
  {
    static char input[16];
    static uint8_t i;
    char c = Serial.read();

    if ( c != '\r' && i < 15)
      input[i++] = c;
    
    else
    {
      input[i] = '\0';
      i = 0;
      
      if ( !strncmp( input, "olé", 3 ) )
      {
      }
      else
      {
        Serial.print( "The strings are NOT equal...\nHowever the input looks correct: \"" );
        Serial.print( input );
        Serial.print( "\"\n\nNow testing the hardcoded strings...\n" );
      }
      
      const char test[] = "olé";
      if ( !strncmp( test, "olé", 3 ) )
      {
        Serial.println( "The strings are equal!" );
      }
    }
  }
}

Now, in the Serial Monitor window, I type olé. Here is the output:

The strings are NOT equal...
However the input looks correct: "olé"

Now testing the hardcoded strings...
The strings are equal!

And with this very small program:

void setup ()
{
  Serial.begin(9600);
}

void loop()
{
  if ( Serial.available() > 0 ) 
    Serial.println( Serial.read() );
}

When I type olé, the output is:

111
108
233

And according to the ASCII table, that should be 111 108 130
Err, no, 233 is correct (I think): http://www.ascii-code.com (different than what this site is showing http://www.asciitable.com :~ )

Do you know how I can fix the problem? Thanks in advance!

It is just a mismatch of font definitions.
There is no fix as such because of the way your computer works. Is it a PC? What operating system are you using?
The best bet is to forget about strings and just compare the character codes numerically.
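A minimal sketch of the numerical comparison suggested above, assuming the Serial Monitor sends é as the single byte 233 (as in the output earlier in the thread). `OLE_BYTES` and `matchesBytes` are hypothetical names, not part of the Arduino API.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Byte values received for "olé" in this thread: 'o', 'l', then 233.
const uint8_t OLE_BYTES[] = { 111, 108, 233 };

// Hypothetical helper: true if the first n bytes of input equal expected.
// memcmp compares raw bytes, so the editor's text encoding does not matter.
bool matchesBytes(const char *input, const uint8_t *expected, size_t n)
{
  return memcmp(input, expected, n) == 0;
}
```

In the sketch from the first post, this would be called as `matchesBytes( input, OLE_BYTES, 3 )` in place of the `strncmp` call.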

Yes it's a PC and I use Windows 7 64 bits, and I was wrong about the ASCII value 130, I edited my post. Strange problem :~

Edit: Not sure why, but if I do:

if ((byte)c == 233) c = 195;

Then the string comparison will work.

I got the value 195 by doing

char test[] = "olé";
Serial.println( test[2] );

Which printed 'Ã', that is 195 in the ASCII table, and that is 233 - 38... While subtracting 38 worked for 'é', it doesn't work for other accented characters... I will update this post later if I find out more about this problem :slight_smile:

Perhaps peeking at the Reference page for the char data type would prove useful. What is the range of values that can be stored in a char? It might surprise you.

PaulS, maybe you can explain what you mean, I know a char can store -127 to + 127 or whatever that is...But if I change the data type to a byte array or int array then how do I use it with strncmp?

Anyway... I think I'm close to getting something working:

/*
for (int i = 128; i < 256; i++)
  Serial.print( (char) i );
//??????????????????????????? ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
*/  
  
char s[] = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ";
   
for (int i = 1; i < strlen(s); i+=2)
  Serial.print( (char) (s[i]+64) ); //the magic number
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

PaulS, maybe you can explain what you mean, I know a char can store -127 to + 127 or whatever that is

Then you should also know that values like 195, 233, etc. can NOT be (successfully) stored.
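To illustrate the point about the range of char, here is a small sketch. It uses `signed char` explicitly, assuming plain `char` is signed on the target (it is with avr-gcc) and a two's-complement machine. `storeInChar` and `toByte` are hypothetical helpers.

```cpp
#include <stdint.h>

// Store a received byte in a signed 8-bit char.
// Values above 127 wrap around to negative numbers.
signed char storeInChar(uint8_t received)
{
  return (signed char)received;
}

// Hypothetical helper: recover the original 0..255 value with a cast.
uint8_t toByte(signed char c)
{
  return (uint8_t)c;
}
```

So 233 is held as -23 and 195 as -61. The bit pattern is unchanged, and casting back to an unsigned type gives the original value again.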

But if I change the data type to a byte array or int array then how do I use it with strncmp?

You can't. You can, however, write your own function that does the same thing with unsigned char arrays, or byte arrays, or whatever data type that you find works to hold YOUR data.

All that strncmp() is doing is looping up to n times comparing values in the ith position of two arrays. It returns as soon as a non-match is found.
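A minimal sketch of such a function for unsigned char arrays, following the loop described above. The name `ustrncmp` is hypothetical. (For what it's worth, the C standard specifies that strncmp itself compares the bytes as unsigned char, so this is mainly useful if the data is already kept in byte arrays.)

```cpp
#include <stddef.h>

// Compare at most n bytes of a and b.
// Returns 0 if equal, a negative value if a sorts first, a positive value otherwise.
int ustrncmp(const unsigned char *a, const unsigned char *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
  {
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;  // first non-match decides the result

    if (a[i] == 0)
      return 0;                     // both strings ended before n bytes
  }
  return 0;                         // first n bytes are identical
}
```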

Ok, but I think I almost fixed it

No, not at all :smiley:

Ok I think I got it, will do some more tests!

if ( c != '\r' && i < 15)
{
  if ( c < 0 )
  {
    input[i++] = 195;
    c += 192;
  }

  input[i++] = c;
}
    input[i++] = 195;
    c += 192;

What type are input and c?

guix:
Do you know how I can fix the problem? Thanks in advance!

The standard ASCII character set only covers values in the range 0 - 127. There are multiple definitions for values outside this range. In Microsoft Windows, these are termed 'code pages'. If you're using values outside the standard set then the sender and receiver would need to agree what definition (or 'code page') they are using. I suspect the problem in this case is that the sender (PC) and receiver (Arduino runtime library) do not agree.

PaulS, they are char, I changed those value to:

input[i++] = -61;
c -= 64;

It's the same result, but anyway that's not working for all characters, only the Latin ones. I may give up on this because some of the other characters actually use more than 2 bytes per character... Here in my code there are 2 chars per character; for example, the letter é is coded as Ã and © (so the length passed to strncmp must also be increased accordingly; hopefully strlen returns the correct length).
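The "magic numbers" above match the standard two-byte UTF-8 encoding, assuming the string literals in the sketch are stored as UTF-8. Here is a sketch of that arithmetic for a single byte in the range 0..255, treating it as ISO-8859-1; `latin1ToUtf8` is a hypothetical helper.

```cpp
#include <stddef.h>
#include <stdint.h>

// Encode one ISO-8859-1 byte as UTF-8.
// out must have room for 2 bytes; returns the number of bytes written (1 or 2).
size_t latin1ToUtf8(uint8_t c, uint8_t *out)
{
  if (c < 0x80)
  {
    out[0] = c;                          // plain ASCII is unchanged
    return 1;
  }
  out[0] = (uint8_t)(0xC0 | (c >> 6));   // 0xC2 for 128..191, 0xC3 (195) for 192..255
  out[1] = (uint8_t)(0x80 | (c & 0x3F)); // same as c - 64 when c is 192..255
  return 2;
}
```

For é (233) this gives 195 and 169, i.e. Ã and ©. Bytes 128..159 are where Windows-1252 differs from ISO-8859-1 (€ is one of them), which is why some characters need a lookup and three bytes.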

PeterH, ok that's what I thought, is there a way to change the code page used by the Arduino ??

PeterH, ok that's what I thought, is there a way to change the code page used by the Arduino ??

There really isn't a "code page" for the Arduino. A code page defines how to stroke the characters on a display device. You've probably noted that there is no display device on the Arduino.

OK, there, I fixed it. It might be useful for someone else, so I'm posting it :slight_smile:

const uint8_t WINDOWS_1252 = 0;

int8_t SerialInputToCharSet( const char *src, char *dest, const size_t num = 128, const uint8_t charset = WINDOWS_1252);
int8_t SerialInputToCharSet( const char *src, char *dest, const size_t num, const uint8_t charset )
{
  uint8_t
    len = strlen( src );
  
  if ( len > num )
  {
    return -1;
  }
  
  uint8_t
    i = 0,
    j = 0,
    n = 0,
    c = 0;
    
  char
    tmp[num];

  switch ( charset )
  {
    case WINDOWS_1252 :
    {
      while ( i < len && j + 3 < num ) // leave room for a 3-byte sequence plus the terminating '\0'
      {
        c = (uint8_t)src[i++];
        
        switch ( c )
        {
          case 0 ... 127 :
          {
            tmp[j++] = c;
            break;
          }
          case 128 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x82;
            tmp[j++] = 0xAC;
            n++;
            break;
          }
          case 130 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0x9A;
            n++;
            break;
          }
          case 131 :
          {
            tmp[j++] = 0xC6;
            tmp[j++] = 0x92;
            n++;
            break;
          }
          case 132 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0x9E;
            n++;
            break;
          }
          case 133 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0xA6;
            n++;
            break;
          }
          case 134 ... 135 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = c + 0x1A;
            n++;
            break;
          }
          case 136 :
          {
            tmp[j++] = 0xCB;
            tmp[j++] = 0x86;
            n++;
            break;
          }
          case 137 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0xB0;
            n++;
            break;
          }
          case 138 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0xA0;
            n++;
            break;
          }
          case 139 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0xB9;
            n++;
            break;
          }
          case 140 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0x92;
            n++;
            break;
          }
          case 142 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0xBD;
            n++;
            break;
          }
          case 145 ... 146 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = c + 0x7;
            n++;
            break;
          }
          case 147 ... 148 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = c + 0x9;
            n++;
            break;
          }
          case 149 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0xA2;
            n++;
            break;
          }
          case 150 ... 151 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = c - 0x3;
            n++;
            break;
          }
          case 152 :
          {
            tmp[j++] = 0xCB;
            tmp[j++] = 0x9C;
            n++;
            break;
          }
          case 153 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x84;
            tmp[j++] = 0xA2;
            n++;
            break;
          }
          case 154 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0xA1;
            n++;
            break;
          }
          case 155 :
          {
            tmp[j++] = 0xE2;
            tmp[j++] = 0x80;
            tmp[j++] = 0xBA;
            n++;
            break;
          }
          case 156 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0x93;
            n++;
            break;
          }
          case 158 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0xBE;
            n++;
            break;
          }
          case 159 :
          {
            tmp[j++] = 0xC5;
            tmp[j++] = 0xB8;
            n++;
            break;
          }
          case 160 ... 191 :
          {
            tmp[j++] = 0xC2;
            tmp[j++] = c;
            n++;
            break;
          }
          case 192 ... 255 :
          {
            tmp[j++] = 0xC3;
            tmp[j++] = c - 0x40;
            n++;
            break;
          }
          default :
          {
            tmp[j++] = '?';
            n++;
            break;
          }
        }
      }
    }
  }
  
  tmp[j] = '\0';
  strcpy( dest, tmp );
  return n;
}

void setup ()
{
  Serial.begin(9600);
  delay(1000);
}

void loop()
{
  if ( Serial.available() > 0 ) 
  {
    static char input[64];
    static uint8_t i;
    char c = Serial.read();

    if ( c != '\r' && i < 64-1)
      input[i++] = c;

    else
    {
      input[i] = '\0';
      i = 0;
      
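      char s[256];
      // Append at an offset: passing s as both the destination and a "%s" argument of sprintf is undefined behaviour.
      int len = snprintf( s, sizeof s, "In:  <%s>\n", input );
      // Pass the real buffer size so the converted text cannot overflow input.
      int8_t n = SerialInputToCharSet( input, input, sizeof input );
      snprintf( s + len, sizeof s - len, "Out: <%s>\n%d character(s) converted!\n", input, n );
      Serial.print( s );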
      
      if ( !strcmp( input, "€" )
        || !strcmp( input, "§" )
        || !strcmp( input, "abc" )
        || !strcmp( input, "olé" )
        || !strcmp( input, "h€lloéèàùlol" ) ) 
      {
        Serial.print( "strcmp: OK!\n\n" );
      }
      else
      {
        Serial.print( "strcmp: FAIL!\n\n" );
      }
    }
  }
}

Argh! I've just noticed this IDE setting:

preproc.substitute_unicode=true

I've changed it to false and now I don't need to use my function anymore! Oh well, at least I tried :slight_smile:

This setting should appear in the Preferences window...