Problems with string methods in combination with other charsets

Hey,
for a project I need to process some information from a website. The problem is the umlauts (like ä, ö, ü) that come with it: they are encoded in ISO 8859-1, not in UTF-8. So when I try to edit them with the standard string tools, the tools find nothing and I can't do anything about it.

I would be very thankful for concrete solutions! I also thought about reading/editing the memory directly and putting a library together to do the job, but I have no clue how to do that.

PS: For this project I'm using an ESP32 (with the Arduino core, of course), and I'm a beginner at programming. I found the problem by using PuTTY with different charsets.

What exactly are you trying to do with these strings? ISO 8859-1 is a well-defined set of 256 characters. UTF-8 is also a well-defined encoding, so I'm not sure why one would be better than the other. If you want to do something like map all the umlauts to their English (non-umlaut) cousins, then you will have to do that yourself.

The ISO 8859-1 charset comes from the website and I can't change it. I just need a switch/case or if statement with the equals method to tell the letters apart. Translating an 'ä' to an 'ae' isn't perfect, and it also needs a case distinction, which is not possible because the equals method works with UTF-8 (at least it seems so).

You could use the hex or decimal equivalents.
ä (a-umlaut) seems to be 228 (0xE4), for example...
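
A minimal sketch of that idea, assuming the text arrives as raw ISO 8859-1 bytes: a `switch` on the numeric byte value does the case distinction without any umlaut literal in the source file. The names `latin1ToAscii` and `transliterate` are made up for illustration, and the ae/oe/ue/ss spellings are just one possible mapping.

```cpp
#include <stddef.h>

// Hypothetical helper: returns a two-letter ASCII spelling for the German
// special characters of ISO 8859-1, or NULL when no replacement is needed.
const char *latin1ToAscii(unsigned char c) {
  switch (c) {
    case 0xC4: return "Ae";  // Ä (196)
    case 0xD6: return "Oe";  // Ö (214)
    case 0xDC: return "Ue";  // Ü (220)
    case 0xDF: return "ss";  // ß (223)
    case 0xE4: return "ae";  // ä (228)
    case 0xF6: return "oe";  // ö (246)
    case 0xFC: return "ue";  // ü (252)
    default:   return NULL;
  }
}

// Copies `in` to `out` and expands the special characters on the way.
// `outSize` includes the terminator; the output is truncated if it is too small.
size_t transliterate(const char *in, char *out, size_t outSize) {
  size_t n = 0;
  if (outSize == 0) return 0;
  for (; *in != '\0'; ++in) {
    // Cast to unsigned char so bytes above 127 compare as 128..255
    const char *repl = latin1ToAscii((unsigned char)*in);
    if (repl != NULL) {
      if (n + 2 >= outSize) break;  // need 2 bytes plus the terminator
      out[n++] = repl[0];
      out[n++] = repl[1];
    } else {
      if (n + 1 >= outSize) break;  // need 1 byte plus the terminator
      out[n++] = *in;
    }
  }
  out[n] = '\0';
  return n;
}
```

The code is plain C++ so it can be tried on a PC as well as pasted into a sketch. The cast to `unsigned char` matters on platforms where plain `char` is signed, because there the byte 0xE4 would otherwise be seen as -28 and never match `case 0xE4`.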

What are you trying to edit? The code below finds 'ä' in a string and changes it to a plain 'a'. Unfortunately, for testing I had to use westfw's 228 in sometext as well.

// received from web page
char sometext[] = {'H', (char)228, 'l', 'l', 'o', '\0'};

void setup()
{
  Serial.begin(57600);

  Serial.println(sometext);

  char *ptr;
  ptr = strchr(sometext, 228);
  if(ptr == NULL)
  {
    Serial.println("Not found");
  }
  else
  {
    *ptr='a';
    Serial.println(sometext);
  }

  
}

void loop()
{

}
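
`strchr` returns only the first match, so the sketch above changes a single 'ä'. A possible extension, assuming ISO 8859-1 input and that dropping the dots is acceptable, walks the whole string and folds every umlaut to its base letter. `foldUmlaut` and `foldUmlautsInPlace` are hypothetical names, not part of any library.

```cpp
#include <stddef.h>

// Hypothetical helper: maps an ISO 8859-1 umlaut byte to its base letter and
// returns every other byte unchanged.
char foldUmlaut(char ch) {
  switch ((unsigned char)ch) {
    case 0xC4: return 'A';  // Ä
    case 0xD6: return 'O';  // Ö
    case 0xDC: return 'U';  // Ü
    case 0xE4: return 'a';  // ä
    case 0xF6: return 'o';  // ö
    case 0xFC: return 'u';  // ü
    default:   return ch;
  }
}

// Replaces every umlaut in `text` in place and returns how many were replaced.
// The length of the string does not change, so no extra buffer is needed.
size_t foldUmlautsInPlace(char *text) {
  size_t replaced = 0;
  for (char *p = text; *p != '\0'; ++p) {
    char folded = foldUmlaut(*p);
    if (folded != *p) {
      *p = folded;
      ++replaced;
    }
  }
  return replaced;
}
```

Calling `foldUmlautsInPlace(sometext)` on the array from the sketch would turn the bytes for "Hällo" into "Hallo".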

westfw:
You could use the hex or decimal equivalents.
ä (a-umlaut) seems to be 228 (0xE4), for example...

I would love to, but when I use the equals method it finds nothing, so I can't tell the letters apart and can't change them. For example: I want to convert 'Hällo' into decimal values. The method needs to know which letter is at which position in order to convert it, but in the case of an 'ä' it finds no match and can't replace it, because the ISO 8859-1 'ä' is not a UTF-8 'ä'.
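
The mismatch described here can be made visible at the byte level. The sketch below assumes the source file is saved as UTF-8, in which case a literal "ä" in the code compiles to the two bytes 0xC3 0xA4, while the website delivers the single byte 0xE4. The strings are written with hex escapes so the example does not depend on the editor's encoding; all names are illustrative.

```cpp
#include <string.h>

// "Hällo" as ISO 8859-1: ä is the single byte 0xE4 (228), 5 bytes in total
const char latin1Hello[] = "H\xE4llo";
// "Hällo" as UTF-8: ä is the byte pair 0xC3 0xA4, 6 bytes in total
const char utf8Hello[] = "H\xC3\xA4llo";

// Looks for the ISO 8859-1 encoding of ä (one byte)
bool containsLatin1AUmlaut(const char *s) {
  return strchr(s, 0xE4) != NULL;
}

// Looks for the UTF-8 encoding of ä (two bytes), which is what a search for
// a literal "ä" amounts to when the source file is UTF-8
bool containsUtf8AUmlaut(const char *s) {
  return strstr(s, "\xC3\xA4") != NULL;
}
```

Searching `latin1Hello` for the UTF-8 pair fails, and searching it for the byte 0xE4 succeeds. The string functions themselves only compare bytes and know nothing about UTF-8.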

sterretje:
What are you trying to edit? The code below finds 'ä' in a string and changes it to a plain 'a'. Unfortunately, for testing I had to use westfw's 228 in sometext as well.

I'd like to know in general how to handle this; in my project I just want to make a case distinction. I'm sorry that I wasn't clear enough. The problem is that an 'ä' in ISO 8859-1 is not an 'ä' in UTF-8. And because these methods work with UTF-8, I cannot do anything.

Try this:

// Pointer version: the caller must provide a buffer that is large enough.
// Worst case is 3 output bytes per input byte, plus the terminator.
size_t convert_ISO_8859_1_to_UTF_8(char *UTF_8, size_t bufferLength, const char *ISO_8859_1, size_t inputLength) {
  (void)bufferLength; // not checked here; the array version below checks the size at compile time
  size_t n = 0;
  size_t m = 0;
  while (m < inputLength && ISO_8859_1[m] != '\0') {
    unsigned char iso8859 = ISO_8859_1[m];
    if (iso8859 >= 0x20 && iso8859 <= 0x7E) {
      // Printable ASCII is identical in both encodings
      UTF_8[n++] = iso8859;
    } else if (iso8859 >= 0xA0) {
      // 0xA0..0xBF -> 0xC2 0xA0..0xBF, 0xC0..0xFF -> 0xC3 0x80..0xBF
      UTF_8[n++] = '\xC2' | (iso8859 > 0xBF);
      UTF_8[n++] = (iso8859 & 0x3F) | 0x80;
    } else {
      // Control characters become U+FFFD, the replacement character (�)
      UTF_8[n++] = '\xEF';
      UTF_8[n++] = '\xBF';
      UTF_8[n++] = '\xBD';
    }
    m++;
  }
  UTF_8[n] = '\0';
  return n; // return strlen(UTF_8);
}

// Array version: deduces both sizes and rejects buffers that could overflow.
template <size_t N, size_t M>
size_t convert_ISO_8859_1_to_UTF_8(char (&UTF_8)[N], const char (&ISO_8859_1)[M]) {
  static_assert(3 * (M - 1) < N, "Error: buffer too small");
  return convert_ISO_8859_1_to_UTF_8(&UTF_8[0], N, &ISO_8859_1[0], M);
}

void setup() {
  Serial.begin(115200);
  while (!Serial);
  const char iso8859_1[] = "\xA1 H\xEBll\xF6, W\xF6rld !"; // "¡ Hëllö, Wörld !" in ISO 8859-1
  char utf8[16 * 3 + 1];
  Serial.println(convert_ISO_8859_1_to_UTF_8(utf8, iso8859_1));
  Serial.println(iso8859_1);
  Serial.println(utf8);

  // All 224 bytes from 0x20 to 0xFF
  char iso8859_full[225];
  for (uint8_t i = 0; i < 224; i++)
    iso8859_full[i] = i + '\x20';
  iso8859_full[224] = '\0';
  char utf8_full[224 * 3 + 1];
  Serial.println(convert_ISO_8859_1_to_UTF_8(utf8_full, iso8859_full));
  Serial.println(iso8859_full);
  Serial.println(utf8_full);
}

void loop() {}

Output:

20
� H�ll�, W�rld !
¡ Hëllö, Wörld !
386
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~��������������������������������������������������������������������������������������������������������������������������������
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~��������������������������������� ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Pieter
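
The two-byte branch in the code above works because the first 256 Unicode code points are the same as the ISO 8859-1 byte values, so the conversion is pure bit shuffling. A worked sketch of that arithmetic for a single byte follows; the function name `latin1ByteToUtf8` is an assumption made for this example, and it is only meant for bytes from 0x80 upwards (smaller values are plain ASCII and stay one byte).

```cpp
#include <stdint.h>

// Encodes one ISO 8859-1 byte in the range 0x80..0xFF as two UTF-8 bytes.
// A two-byte UTF-8 sequence has the bit layout 110xxxxx 10xxxxxx.
void latin1ByteToUtf8(uint8_t c, uint8_t out[2]) {
  out[0] = 0xC0 | (c >> 6);    // lead byte: the top two bits of the code point
  out[1] = 0x80 | (c & 0x3F);  // continuation byte: the low six bits
}
```

For 'ä' (0xE4 = 1110 0100 in binary) the top two bits are 11, giving the lead byte 0xC3, and the low six bits are 10 0100, giving the continuation byte 0xA4. That is the 0xC3 0xA4 pair a UTF-8 terminal expects.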

I just want to make a case distinction. I'm sorry that I wasn't clear enough. The problem is that an 'ä' in ISO 8859-1 is not an 'ä' in UTF-8. And because these methods work with UTF-8, I cannot do anything.

I don't quite understand what you're trying to do. Doesn't this work:

#define CHAR_A_UMLAUT 0xC4  // ISO 8859-1 encoding for Ä
#define CHAR_a_UMLAUT 0xE4  // ISO 8859-1 encoding for ä
const char *vowels = "AaEeIiOoUu\xC4\xE4\xCB\xEB\xCF\xEF\xD6\xF6\xDC\xFC";  // AaEeIiOoUuÄäËëÏïÖöÜü
 :
   c = Serial.read();  // c should be an int (or unsigned char) so that 0xC4 and 0xE4 can match
   if (c > 0 && strchr(vowels, c)) {  // skip -1 (nothing read) and 0 (would match the terminator)
     count_vowels++;
     switch (c) {
     case 'a': case 'A': case CHAR_a_UMLAUT: case CHAR_A_UMLAUT:
        // character is some variety of ISO 8859-1 'a'
        count_as++;
        break;
     :
     }
   }

Getting your keyboard or terminal window to generate ISO8859 is a separate problem...
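
For testing without a terminal that sends ISO 8859-1, the same counting can be run on a string written with hex escapes. This is only a sketch under that assumption; `VowelCounts` and `countVowels` are names invented here, and only the 'a' family is counted separately, as in the fragment above.

```cpp
#include <stddef.h>
#include <string.h>

// Vowels as ISO 8859-1 bytes: AaEeIiOoUu followed by ÄäËëÏïÖöÜü
static const char kVowels[] = "AaEeIiOoUu\xC4\xE4\xCB\xEB\xCF\xEF\xD6\xF6\xDC\xFC";

struct VowelCounts {
  size_t vowels;  // all vowels, with or without dots
  size_t as;      // every variety of 'a'
};

VowelCounts countVowels(const char *text) {
  VowelCounts counts = {0, 0};
  for (; *text != '\0'; ++text) {
    // Work on an unsigned value so 0xC4 and 0xE4 match the case labels
    unsigned char c = (unsigned char)*text;
    if (strchr(kVowels, c) == NULL) continue;
    counts.vowels++;
    switch (c) {
      case 'a': case 'A': case 0xE4: case 0xC4:
        counts.as++;
        break;
      default:
        break;
    }
  }
  return counts;
}
```

With the input `"H\xE4llo, W\xF6rld"` (the ISO 8859-1 bytes for "Hällo, Wörld") the vowels are ä, o and ö, of which only ä belongs to the 'a' family.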