UTF-8 String

Hello,
I am doing some tests on Arduino + SIM800L (GSM modem).

Consider a table of Unicode code points (UCS-2):

uint16_t P16[] = {0x0672,0x0631,0x062F,0x0648,0x064A,0x0646,0x0648,0x03B2};

Is it possible to manipulate it as a String object, to benefit from things like .indexOf() or .substring()?

Thank you,

What exactly are you trying to achieve?

I receive an SMS in UCS-2 (hex) format (Arabic), something like this: "06720631062F0648064A06460648".
I split it into groups of 4 hex digits, convert each group to a number with strtoul(), and encode the code point as UTF-8. That works: I can display the correct message in the correct language on the Serial Monitor.
Now I would like to find specific substrings in the message to perform specific tasks.

If you insist on using Strings, then why not put what you received in a String and use that?
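
For what it is worth, here is a minimal sketch of searching the received hex text directly, without converting it first. The function name findUcs2Hex is hypothetical (it is not an Arduino API), and it assumes the keyword is also written as UCS-2 hex with the same letter case as the message. It only compares at offsets that are multiples of 4, so a match cannot start in the middle of a character, which a plain strstr() or .indexOf() on the hex text could do.

```cpp
#include <string.h>

// Hypothetical helper: returns the character index (not the byte index) of
// the first occurrence of needle in haystack, or -1 if it is not found.
// Both arguments are UCS-2 hex text, 4 hex digits per character, and are
// assumed to use the same letter case for the hex digits.
int findUcs2Hex(const char *haystack, const char *needle) {
    size_t h = strlen(haystack);
    size_t n = strlen(needle);
    if (n == 0 || n > h) return -1;
    // Step by 4 so that a match can only start on a character boundary.
    for (size_t i = 0; i + n <= h; i += 4) {
        if (strncmp(haystack + i, needle, n) == 0) return (int)(i / 4);
    }
    return -1;
}
```

With the message above, findUcs2Hex("06720631062F0648064A06460648", "0648064A") gives 3, while "7206" (which only appears across a character boundary) is not found.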

The received string "06720631062F0648064A06460648" is made of the Unicode code points (2 bytes each), in hex representation, of the word ٲردوينو. To display it on the Serial Monitor, it has to be converted to UTF-8. To do that, I had to convert the string to numbers. But now, I don't know how to cast it back to a String.

void setup() {
    Serial.begin(9600);            
    char UCS2[] = "06720631062F0648064A06460648";    
    uint8_t n = strlen(UCS2);    
    char S[5] = {0};  // 4 digits + \0 (strncpy() does not add the terminator here)
    for(uint8_t i = 0; i < n ; i+=4){
        strncpy(S, &UCS2[i], 4);
        uint16_t CP = strtoul(S,NULL,16);
        unicode2utf8(CP);  // bytes swapped [L H]
        Serial.write((byte*)&CP,2);  // little endian: the lead byte is sent first
    } 
}

void loop() {
}

void unicode2utf8(uint16_t& U){
    // for code points 0 --> U+07FF
    if(U > 127){
        uint8_t UL = (U & 0x003F)| 0B10000000;
        uint8_t UH = (U >> 6)    | 0B11000000;
        U = (UL << 8) | UH ;  // swapped
    }
}

But now, I don't know how to cast it back to a String

Copy it before you change it; then you can use either format.

Thank you,
I am going to do some tests.

You could change unicode2utf8() to append the byte or two to a String passed as an argument.

void unicode2utf8(String &result, uint16_t U)
{
  // for code points 0 --> U+07FF
  if(U > 127)
  {
    char UL = (U & 0x003F) | 0B10000000;
    char UH = (U >> 6) | 0B11000000;
    result += UH;  // lead byte first
    result += UL;  // then the continuation byte
  }
  else
    result += (char) U;
}
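
The function above stops at U+07FF, which is enough for Arabic and Greek. In case other characters show up in an SMS, here is a rough sketch that also covers the 3-byte range up to U+FFFF. The name ucs2ToUtf8 and the char buffer interface are my own assumptions, chosen so the same code compiles on a PC as well as on an Arduino; surrogate values (U+D800 to U+DFFF) are not treated specially.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: writes the UTF-8 encoding of one UCS-2 code point
// into out (which must have room for 3 bytes) and returns the byte count.
// Surrogates (U+D800..U+DFFF) are not handled specially in this sketch.
size_t ucs2ToUtf8(uint16_t cp, char *out) {
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                       // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}
```

On the Arduino side, the returned bytes could then be appended to a String one at a time with result += out[i], in the order they were written.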

That's it (result += (char) U;): we just have to do it byte by byte, probably because a char is one byte.

Here is a little code to show it:

void setup() {
    Serial.begin(9600);
    
    // Extract utf-8 codes of String characters
    String STR1 = "ββδΨωββ";  // utf-8(β) = CE B2    
    uint16_t *P16 = (uint16_t*) STR1.c_str();  // better than .toCharArray()
    for (int i = 0; i < 7; i++){
        Serial.println(P16[i],HEX); //--> B2CE,B2CE,B4CE,A8CE,89CF,B2CE,B2CE
    }

    // make a String from utf-8 codes
    uint16_t M16[]={0xCEB2,0xCEB2,0xCEB4, 0xCEA8,0xCF89,0xCEB2,0xCEB2};
    String STR2 = "";
    for (int i = 0; i < 7; i++){        
        STR2 += (char) highByte(M16[i]);
        STR2 += (char) lowByte(M16[i]);
    }
    Serial.println(STR1); // --> ββδΨωββ
    Serial.println(STR2); // --> ββδΨωββ
    Serial.println(STR2.indexOf("ω")); // --> 8 (byte index: 2 bytes per character)
    String STR3 = STR2.substring(4,10);
    Serial.println(STR3); // --> δΨω

}

void loop() {
    // put your main code here, to run repeatedly:

}

Here is the output:
B2CE
B2CE
B4CE
A8CE
89CF
B2CE
B2CE
ββδΨωββ
ββδΨωββ
8
δΨω
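
Note that the 8 printed by .indexOf() is a byte offset, not a character position. If a character position is needed, one possible sketch is to count the bytes that start a character, i.e. every byte that is not a continuation byte of the form 10xxxxxx. The name utf8CharIndex is hypothetical, and the input is assumed to be valid UTF-8.

```cpp
// Hypothetical helper: converts a byte offset in a UTF-8 string into a
// character index by counting the bytes that start a character
// (every byte except continuation bytes of the form 10xxxxxx).
// Assumes s is valid, null-terminated UTF-8.
int utf8CharIndex(const char *s, int byteIndex) {
    int chars = 0;
    for (int i = 0; i < byteIndex && s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) chars++;
    }
    return chars;
}
```

In the example above it could be called as utf8CharIndex(STR2.c_str(), STR2.indexOf("ω")), which should give 4 because four 2-byte characters come before ω.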

Thank you very much
