SOLVED: UTF8 to extended ASCII conversion partially fails

Amiga · January 16, 2020, 9:17am

Hello,

In a sketch for an ESP I use the following code, based on an example from the Arduino Playground:

uint8_t utf8Ascii(uint8_t ascii) {                      
  static uint8_t cPrev;
  uint8_t c = '\0';
  
  if (ascii < 0x7f || ascii == degCascii || ascii == degFascii) {
    cPrev = '\0';
    c = ascii;
  } else {
    switch (cPrev) {
      case 0xC2: c = ascii;  break;
      case 0xC3: c = ascii | 0xC0;  break;
      case 0x82: if (ascii==0xAC) c = 0x80;   // Euro symbol special case
    }
    cPrev = ascii;                                     // save last char
  }
  return(c);
}


String utf8AsciiStr(char * s) {
                                                           
  uint8_t c, k = 0;
  
  while (*s != '\0') {
    c = utf8Ascii(*s++);
    if (c != '\0')
      tmpBuffer[k++] = c;
  }
  tmpBuffer[k] = '\0';
  return(tmpBuffer);
}

I set all relevant char arrays, including "tmpBuffer", to a size of 500, but the conversion only works up to a length of 243 characters. If I try to convert a longer array, the beginning is cut off and only the last 243 characters of the array are converted.

Can someone explain why this happens - is there any limitation with this conversion function?
I also tried larger array and buffer sizes, but that didn't help.

Amiga · January 16, 2020, 8:17pm

Update:
Having done some further testing, I can report that the limit is not exactly 243 characters; the strange behaviour starts at about 250 characters. Sometimes the beginning of the string is cut off, sometimes only the last character is shown. It seems to depend on the kind of characters being converted: single characters separated by spaces (a b c d) seem to exceed the limit, whereas whole words (abcd) have the opposite effect.
This is really strange, and looking at the conversion function I have no explanation for what causes this limitation. Memory? I increased the buffer size significantly without any effect. Or could it be a timing problem?

Amiga · January 17, 2020, 5:45pm

Update 2 - I got it running.

I used a code snippet picked up somewhere on the web without understanding exactly how it works, which wasn't a good idea...
Unlike the example code from the Arduino Playground above, "my" function used an additional variable, "tmpBuffer". Without knowing the exact reason, I found out that this additional variable caused the malfunction.
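A likely cause, though nobody in this thread confirms it: in utf8AsciiStr the write index k is declared as uint8_t, which can only hold 0 to 255. Once 256 converted characters have been written, k wraps around to 0 and the start of tmpBuffer is overwritten, so only the tail of the text survives. A 499-character ASCII input would leave 499 - 256 = 243 characters, which matches the number reported above, and because multi-byte UTF-8 characters shrink to one output byte, the input length at which the wrap happens depends on the content. The sketch below isolates the effect; copyWithNarrowIndex and copyWithWideIndex are hypothetical helper names, not part of the original code.

```cpp
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copies src into dst with an 8-bit index,
// mirroring "uint8_t c, k = 0;" in utf8AsciiStr.
// dst must hold at least 257 bytes. Returns the resulting string length.
size_t copyWithNarrowIndex(const char *src, char *dst) {
  uint8_t k = 0;                 // wraps from 255 back to 0
  while (*src != '\0') {
    dst[k++] = *src++;
  }
  dst[k] = '\0';
  return strlen(dst);
}

// Hypothetical helper: same loop with an index wide enough for the buffer.
// dst must hold at least strlen(src) + 1 bytes.
size_t copyWithWideIndex(const char *src, char *dst) {
  size_t k = 0;
  while (*src != '\0') {
    dst[k++] = *src++;
  }
  dst[k] = '\0';
  return strlen(dst);
}
```

If this is indeed the cause, declaring k as size_t (or int) in the original function should remove the limit without any other change.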
So I took the "in place conversion" example from the Arduino Playground, which was also a good opportunity to get rid of the String class. I made a small change because I also want to keep the unconverted array:

uint8_t utf8Ascii(uint8_t ascii) {                                    //converts a single character
  static uint8_t cPrev;
  uint8_t c = '\0';
  
  if (ascii < 0x7f || ascii == degCascii || ascii == degFascii) {
    cPrev = '\0';
    c = ascii;
  } else {
    switch (cPrev) {
      case 0xC2: c = ascii;  break;
      case 0xC3: c = ascii | 0xC0;  break;
      case 0x82: if (ascii==0xAC) c = 0x80;   // Euro symbol special case
    }
    cPrev = ascii;                            // save last char
  }
  return(c);
}



// Picks every character from source, converts it and writes it into destination.
void convert(char* source, char* destination)
{
  int k=0;
  char c;
  for (int i=0; i<strlen(source); i++)
  {
    c = utf8Ascii(source[i]);
    if (c!=0)
      destination[k++]=c;
  }
  destination[k]=0;
}

So if you want to convert an array, just call convert(--source-array--, --destination-array--);
In case you want to overwrite the source array you can use convert(--source-array--, --source-array--);
Perhaps this can help someone in the future.

christop · January 17, 2020, 11:55pm

for (int i=0; i<strlen(source); i++)

Can be improved to:

for (int i=0; source[i]; i++)

Otherwise you'll count the characters in the source for each character you convert (it's O(n²)).

Also, since the convert function isn't modifying source, you should mark it as const:

void convert(const char* source, char* destination)

Amiga · January 22, 2020, 9:26pm

Thanks a lot for your hints, christop!

Amiga · January 28, 2020, 7:28pm

I'm facing a new problem and I haven't been able to solve it yet:
The JSON-formatted char strings I receive from an API sometimes contain the strange character "…". In Unicode it is known as "U+2026 horizontal ellipsis"; as far as I know there is no ASCII code for this character.
I run an LED matrix (MAX7219) and use the function described above to be able to show special characters like ä, ö, ü, ß, € etc. I'd like to replace the "…" with one or more characters that cause no problems (+ or +++ for example). This should be built into the function described above. I tried it, but unfortunately without any success. How could I change this function to achieve my goal?

uint8_t utf8Ascii(uint8_t ascii) {                                    //converts a single character
  static uint8_t cPrev;
  uint8_t c = '\0';
  
  if (ascii < 0x7f || ascii == degCascii || ascii == degFascii) {
    cPrev = '\0';
    c = ascii;
  } else {
    switch (cPrev) {
      case 0xC2: c = ascii;  break;
      case 0xC3: c = ascii | 0xC0;  break;
      case 0x82: if (ascii==0xAC) c = 0x80;   // Euro symbol special case
    }
    cPrev = ascii;                            // save last char
  }
  return(c);
}

Thanks in advance for your help!

Amiga · January 29, 2020, 9:47am

I got help and found a solution.

I use this conversion function in combination with the MD_Parola library for an LED matrix. This library supports custom font sets, which can easily be generated with a font editor.

Keeping this in mind, and knowing that in my own font set the "horizontal ellipsis" has the number 133, see the solution posted by the creator of the library.
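The linked solution is not reproduced in this thread, so the following is only a sketch of one way it could be done, not necessarily what the library author posted. U+2026 is encoded in UTF-8 as the three bytes E2 80 A6, so it can be handled like the Euro special case (E2 82 AC): when the previous byte was 0x80 and the current byte is 0xA6, return the font position 133. The names utf8AsciiWithEllipsis and ellipsisCode are hypothetical, and the values of degCascii and degFascii are placeholders because the thread never shows them.

```cpp
#include <stdint.h>

// Placeholder values (assumption): the real definitions are not shown in the thread.
const uint8_t degCascii = 0x8E;
const uint8_t degFascii = 0x8F;

// Position of the horizontal ellipsis in the custom font, as stated in the post.
const uint8_t ellipsisCode = 133;

// Hypothetical variant of utf8Ascii with one extra case for U+2026.
uint8_t utf8AsciiWithEllipsis(uint8_t ascii) {
  static uint8_t cPrev;
  uint8_t c = '\0';

  if (ascii < 0x7f || ascii == degCascii || ascii == degFascii) {
    cPrev = '\0';
    c = ascii;
  } else {
    switch (cPrev) {
      case 0xC2: c = ascii;  break;
      case 0xC3: c = ascii | 0xC0;  break;
      case 0x82: if (ascii == 0xAC) c = 0x80;  break;          // Euro: E2 82 AC
      case 0x80: if (ascii == 0xA6) c = ellipsisCode;  break;  // ellipsis: E2 80 A6
    }
    cPrev = ascii;                                             // save last char
  }
  return c;
}
```

Like the Euro case, this only looks at the last two bytes of the sequence, so any other three-byte character that ends in 80 A6 would be mapped to the same code.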
