Arduino and Unicode Strings like "\u00e1"

Hi,

I'm fetching and parsing a JSON response. It works quite well, but now I'm getting Unicode escape sequences like "\u00e1". How can I convert these with Arduino/C++?

greetings Sunny

This is my simple JSON "Parser":

void loop()
{
  if (client.connected()) {
    if (client.available()) {
      
      char inChar = client.read();
      
      if(inString){
        
        if(inChar == '\\'){
          escapeNextChar = true;
          currentValue += inChar;
        } else if(!escapeNextChar && inChar == '"'){
          inString = false;
          
          if(currentIdentifier == ""){
            currentIdentifier = currentValue;
            currentValue = "";
          } else {
            printKeyValue();
          }
          
        } else {
          escapeNextChar = false;
          currentValue += inChar;
        }
        
      } else if(inChar == '{'){
        //inObject = true;
        printKeyValue();
      } else if(inChar == '}'){
        //inObject = false;
        printKeyValue();
      } else if(inChar == '['){
        //inArray = true;
        printKeyValue();
      } else if(inChar == ']'){
        //inArray = false;
        printKeyValue();
      } else if(inChar == '"'){
        inString = true;
      } else if(inChar == ',') {
        //colon = true;
        printKeyValue();
      } else if(inChar == '\n') {
        
        printKeyValue();
        
      } else if(inChar == ' ') {
        
        printKeyValue();
        
      } else if(inChar == ':') {
        if(currentValue != ""){
          currentIdentifier = currentValue;
          currentValue = "";
        }
      } else {
        currentValue += inChar;
      }
      
    } else if (millis() - lastAttemptTime > requestInterval) {
      gotResults = false;

      connectToServer();
    }
  }
}

void printKeyValue(){
  
  if(currentIdentifier == "results"){
    
    gotResults = true;
    
  } else if(gotResults){
    
    if(currentIdentifier == "created_at"){
      currentTweetTime = currentValue;
    } else if(currentIdentifier == "from_user"){
      currentTweetUser = currentValue;
    } else if(currentIdentifier == "from_user_name"){
      currentTweetUserName = currentValue;
    } else if(currentIdentifier == "text"){
      currentTweetText = currentValue;
    }
    
    if(currentTweetText != ""){
      
      Serial.println("---------------");
      Serial.println("From @" + currentTweetUser + " (" + currentTweetUserName + ")");
      Serial.println(currentTweetText);
      Serial.println(currentTweetTime);
  
      currentTweetTime = "";
      currentTweetUser = "";
      currentTweetUserName = "";
      currentTweetText = "";
    }
  }
  
  currentIdentifier = "";
  currentValue = "";
}

First thing to mention: in the character notation ' ', the only characters you need to escape are the single quote and the backslash. A double quote can be written with or without the backslash:

'"'

or

'\"'

A colon needs no escaping at all:

':'

And this is the one for a backslash, which I see you have already used correctly:

'\\'

(Note that '\:' is not a valid escape sequence, so don't write the colon that way.)

Hey, great! Thanks for these tips! Do you have any suggestions for my Unicode problem? ^^ I don't know how to convert them. With PHP it would be easy, but I don't know how to do it with C++/Arduino.

thx Sunny

I presume you are just trying to convert them to their integer value?

So the string 00e1 would be converted to an integer with value 225 (0xE1)?

If so, you would need to build up a C string once you detect a ‘\’ inside a quoted string.

And then use this to convert it when you detect the closing quote:

char* blah; //not used other than to fill a gap.
unsigned int integerValue = (unsigned int)strtol(string,&blah,16); //convert the hex string (base 16).

where string would be declared something like this:

void loop()
{
  static char string[5] = {0}; //static, so the buffer survives between calls to loop()
  static char count = 0;
  if (client.connected()) {
    if (client.available()) {
      
      char inChar = client.read();
      
      if(inString){
        
        if(inChar == '\\'){
          escapeNextChar = true;
          currentValue += inChar;
          count = 0;
        } else if(escapeNextChar && inChar == '\"'){
          char* blah; //not used other than to fill a gap.
          unsigned int integerValue = (unsigned int)strtol(string,&blah,16); //convert the hex string (the third argument is the base).
          
          //do something here with your integer value. Maybe if it was a char?
          if(integerValue < 256){
            currentValue += (char)integerValue; //?
          } else {
            //something??
          }
          escapeNextChar = false;
        } else if(!escapeNextChar && inChar == '"'){
          inString = false;
          
          if(currentIdentifier == ""){
            currentIdentifier = currentValue;
            currentValue = "";
          } else {
            printKeyValue();
          }
          
        } else {
          string[count++] = inChar;
          currentValue += inChar;
        }
      }

You haven’t posted complete code, so I can’t really be more helpful.
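For the conversion step itself, here is a minimal sketch, assuming the four hex digits of the escape have already been located. parseHex4 is a made-up helper name, not part of any Arduino library. The two things that matter are the base (16, since the digits are hexadecimal) and a NUL-terminated buffer, which strtol relies on to know where to stop.

```cpp
#include <stdlib.h>  // strtol
#include <string.h>  // memcpy

// Hypothetical helper: converts the four hex digits that follow "\u"
// (for example the "00e1" part of "\u00e1") into a number.
// Assumes `digits` points at no fewer than four characters.
unsigned int parseHex4(const char *digits) {
  char buf[5];             // 4 digits + terminating NUL
  memcpy(buf, digits, 4);  // copy only the four digits
  buf[4] = '\0';           // strtol needs a NUL-terminated string
  return (unsigned int)strtol(buf, NULL, 16);  // base 16 for hex
}
```

With this, parseHex4("00e1") gives 225, which is 0xE1.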

Hi Tom, and thanks again! But I couldn't get it working. Here's my test case:

String unicodeStr = "let\\u00C6s get some german umlauts like \\u00E4\\u00F6\\u00FC";

void setup() {
  Serial.begin(9600);
  
  /*
  
    Expected output:
    
    let´s get some german umlauts like äöü
    
  */
  Serial.println(convertUnicode(unicodeStr));
}

String convertUnicode(String unicodeStr){
  String out = "";
  int len = unicodeStr.length();
  char iChar;
  char* error;
  for (int i = 0; i < len; i++){
     iChar = unicodeStr[i];
     if(iChar == '\\'){ // got escape char
       iChar = unicodeStr[++i];
       if(iChar == 'u'){ // got unicode hex
         char unicode[4];
         for (int j = 0; j < 4; j++){
           iChar = unicodeStr[++i];
           unicode[j] = iChar;
         }
         unsigned int integerValue = (unsigned int) strtol(unicode, &error, 4); //convert the string
         out += (char)integerValue;
       }
     } else {
       out += iChar;
     }
  }
  return out;
}

void loop(){}

Do you have any hints as to where I'm failing?

greetings André

I suspect you can't get access to Unicode strings without being able to tell G++ to enable them. GCC 4.7 added Unicode string support along with C++11 standard support (-std=c++11), and in earlier compilers I think you needed to enable GNU plus upcoming-standard support (-std=gnu++0x). However, unless the IDE gives you a way to modify the options passed to the compiler, it may not be possible to generate those strings.

I'm getting closer ... ;o) Only one char isn't correct ...

#include <cstdlib>

String unicodeStr = "let\\u00C6s get some german umlauts like \\u00E4\\u00F6\\u00FC";

void setup() {
  Serial.begin(9600);
  
  /*
  
    Expected output:
    
    let´s get some german umlauts like äöü
    
  */
  Serial.println(convertUnicode(unicodeStr));
}

String convertUnicode(String unicodeStr){
  String out = "";
  int len = unicodeStr.length();
  char iChar;
  char* error;
  for (int i = 0; i < len; i++){
     iChar = unicodeStr[i];
     if(iChar == '\\'){ // got escape char
       iChar = unicodeStr[++i];
       if(iChar == 'u'){ // got unicode hex
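         char unicode[7]; // "0x" + 4 hex digits + terminating NUL
         unicode[0] = '0';
         unicode[1] = 'x';
         for (int j = 0; j < 4; j++){
           iChar = unicodeStr[++i];
           unicode[j + 2] = iChar;
         }
         unicode[6] = '\0'; // strtol needs a NUL-terminated string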
         long unicodeVal = strtol(unicode, &error, 16); //convert the string
         out += (char)unicodeVal;
       }
     } else {
       out += iChar;
     }
  }
  return out;
}

void loop(){}

Current output is:

letÆs get some german umlauts like äöü
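A likely explanation for the one wrong character is the test data rather than the converter: \u00C6 is the code point for Æ, while the acute accent ´ in the expected output is U+00B4. The sketch below lists the code points found in the test string so that can be checked. It is portable C++ using std::string so it can be tried on a desktop; escapedCodePoints is a hypothetical name, and for brevity it does not treat an escaped backslash followed by a u specially.

```cpp
#include <stdlib.h>  // strtol
#include <string>
#include <vector>

// Hypothetical helper: collects the numeric value of every \uXXXX
// escape found in `s`, in order of appearance.
std::vector<unsigned int> escapedCodePoints(const std::string &s) {
  std::vector<unsigned int> found;
  // An escape needs 6 characters: backslash, 'u', and 4 hex digits.
  for (size_t i = 0; i + 6 <= s.size(); i++) {
    if (s[i] == '\\' && s[i + 1] == 'u') {
      std::string hex = s.substr(i + 2, 4);  // NUL-terminated via c_str()
      found.push_back((unsigned int)strtol(hex.c_str(), NULL, 16));
      i += 5;  // skip the rest of this escape (the loop adds one more)
    }
  }
  return found;
}
```

For the test string above this yields 0xC6, 0xE4, 0xF6 and 0xFC, so the first escape really does ask for Æ. Using \\u00B4 in the test string should give the expected ´.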

OK, that's it. It works very well for decoding Twitter tweets from JSON format:

String convertUnicode(String unicodeStr){
  String out = "";
  int len = unicodeStr.length();
  char iChar;
  char* error;
  for (int i = 0; i < len; i++){
     iChar = unicodeStr[i];
     if(iChar == '\\'){ // got escape char
       iChar = unicodeStr[++i];
       if(iChar == 'u'){ // got unicode hex
         char unicode[7]; // "0x" + 4 hex digits + terminating NUL
         unicode[0] = '0';
         unicode[1] = 'x';
         for (int j = 0; j < 4; j++){
           iChar = unicodeStr[++i];
           unicode[j + 2] = iChar;
         }
         unicode[6] = '\0'; // strtol needs a NUL-terminated string
         long unicodeVal = strtol(unicode, &error, 16); //convert the string
         out += (char)unicodeVal; // keeps only the low byte, so code points above 0xFF are mangled
       } else if(iChar == '/' || iChar == '"' || iChar == '\\'){
         out += iChar; // \/ \" and \\ stand for the character itself
       } else if(iChar == 'n'){
         out += '\n';
       }
     } else {
       out += iChar;
     }
  }
  return out;
}

Output of a Serial.println():

---------------
From @Wheres_October (Tsuki~Chan)
*giggles* ^°^ 
#steampunk #zombies #smallthings #daymade http://t.co/r17d8bkf
Tue, 14 Aug 2012 19:18:56 +0000

SunboX: I'm fetching and parsing a JSON response. It works quite well, but now I'm getting Unicode escape sequences like "\u00e1". How can I convert these with Arduino/C++?

You seem to have an encoding problem. To solve this you need to understand how character encoding works. What encoding scheme is used on your input? (ASCII? UTF-8? Something else?) What encoding scheme are you trying to provide on your output?
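To illustrate why the output encoding matters: casting the code point to char emits a single byte, which only looks right if the receiving terminal interprets bytes as Latin-1. If the terminal expects UTF-8 instead, each code point of 0x80 or above has to be written as two or three bytes. A rough sketch, assuming UTF-8 output is wanted. appendUtf8 is a hypothetical helper written in portable C++ with std::string (on an Arduino, String would take its place), and surrogate pairs, which JSON uses for characters beyond U+FFFF, are not handled.

```cpp
#include <string>

// Hypothetical helper: appends one code point (0x0000 to 0xFFFF, the
// range a single \uXXXX escape can express) to `out` as UTF-8 bytes.
void appendUtf8(std::string &out, unsigned long cp) {
  if (cp < 0x80) {
    // Plain ASCII: one byte, unchanged.
    out += (char)cp;
  } else if (cp < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    out += (char)(0xC0 | (cp >> 6));
    out += (char)(0x80 | (cp & 0x3F));
  } else {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += (char)(0xE0 | (cp >> 12));
    out += (char)(0x80 | ((cp >> 6) & 0x3F));
    out += (char)(0x80 | (cp & 0x3F));
  }
}
```

For example, the code point 0xE1 (á) becomes the two bytes 0xC3 0xA1, whereas the plain cast would send the single byte 0xE1.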

OK, that's it. It works very well for decoding Twitter tweets from JSON format:

Only because you have been very lucky so far.

  char* error;

This variable does not point to any allocated space.

         long unicodeVal = strtol(unicode, &error, 16); //convert the string

This call to strtol tells it to convert the value to a string, and put that string in the array that is the second argument. The 2nd argument is NOT an array.

You really need to fix error so that it IS an array, not a pointer.

In most cases, the argument can be made that a pointer and an array are interchangeable, but that is true ONLY when the pointer points to allocated space.

Yours does NOT.

No, strtol converts a string to a long type variable (which it returns), and then saves a pointer to the end of the last conversion in the second variable.

Reference to an object of type char*, whose value is set by the function to the next character in str after the numerical value.
This parameter can also be a null pointer, in which case it is not used.

Notice the example from the C++ reference:

/* strtol example */
#include <stdio.h>
#include <stdlib.h>

int main ()
{
  char szNumbers[] = "2001 60c0c0 -1101110100110100100000 0x6fffff";
  char * pEnd;
  long int li1, li2, li3, li4;
  li1 = strtol (szNumbers,&pEnd,10);
  li2 = strtol (pEnd,&pEnd,16);
  li3 = strtol (pEnd,&pEnd,2);
  li4 = strtol (pEnd,NULL,0);
  printf ("The decimal equivalents are: %ld, %ld, %ld and %ld.\n", li1, li2, li3, li4);
  return 0;
}

I don’t see pEnd being declared as an array.
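Applied to the four-digit case from this thread, the end pointer is actually useful: it tells you how many characters strtol consumed. A small sketch under that assumption, with parseHex4Checked as a made-up name. Note that strtol also accepts leading whitespace, a sign and a 0x prefix, so a stricter check would test each character with isxdigit first.

```cpp
#include <stdlib.h>  // strtol

// Hypothetical helper: returns true and stores the value only if strtol
// consumed all four characters of the escape's hex part.
bool parseHex4Checked(const char *digits, long *value) {
  char buf[5] = {0};  // 4 digits + NUL; zero-filled so it is always terminated
  for (int i = 0; i < 4 && digits[i] != '\0'; i++) {
    buf[i] = digits[i];
  }
  char *end;  // plain pointer, no allocation needed: strtol only writes to it
  long v = strtol(buf, &end, 16);
  if (end != buf + 4) {
    return false;  // parsing stopped early, e.g. at a non-hex character
  }
  *value = v;
  return true;
}
```

Here "00e1" is accepted with the value 225, while "00zz" is rejected because strtol stops after the second character.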