Hi !
Is there a function to convert a string with French characters (éèàùç...) to hexadecimal bytes?
For example:
String ("L'ÉTÉ")
becomes:
messagebyte[5] = {0x4c, 0x27, 0xc9, 0x54, 0xc9}
Thanks (I'm new!)
there was some talk here.
@ben1234ben you posted in the wrong section. Please read the description of where you posted. It says it is NOT for your project, and your post is about your project. Please be more careful where you post in future.
I have moved your thread to a better place.
Continued from the topic Convert String to hex:
Hi @Grumpy_Mike! Sorry for my mistake...
The IDE supports UTF8, so by default the bytes in your string already encode "l'été" in UTF8, using multibyte sequences where necessary.
Hi @J-M-L and thanks !
OK, but I want to send this string to a Nextion TFT but it only knows ASCII.
The Nextion does not understand: Serial.print(String("L'ÉTÉ"));
but it does understand: byte message[5] = {0x4c, 0x27, 0xc9, 0x54, 0xc9};
Serial.write(message, 5);
this is an example showing all the bytes
void setup() {
  Serial.begin(115200); Serial.println();
  String message = "C'est l'été àçèéûô";
  Serial.print("String Length : "); Serial.println(message.length());
  for (byte i = 0; i < message.length(); i++) {
    Serial.print(F("0x")); Serial.print((byte) message.charAt(i), HEX);
    if (isAscii(message.charAt(i))) {
      Serial.print(F("\t-> ")); Serial.write(message.charAt(i));
    }
    Serial.println();
  }
}

void loop() {}
it will print
String Length : 26
0x43 -> C
0x27 -> '
0x65 -> e
0x73 -> s
0x74 -> t
0x20 ->
0x6C -> l
0x27 -> '
0xC3
0xA9
0x74 -> t
0xC3
0xA9
0x20 ->
0xC3
0xA0
0xC3
0xA7
0xC3
0xA8
0xC3
0xA9
0xC3
0xBB
0xC3
0xB4
so you can see that characters known to ASCII, like "C'est" at the beginning, really take just one byte each, but 'é' is a 2-byte character in UTF8: 0xC3 0xA9
the 't' of été is plain ASCII so it takes just one byte, which means "été" is encoded as
0xC3
0xA9
0x74 -> t
0xC3
0xA9
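To make the difference between byte count and character count concrete, here is a minimal sketch that counts UTF8 characters by skipping continuation bytes (those of the form 10xxxxxx). The name utf8CharCount is a hypothetical helper, not an Arduino function, and the code assumes the input is valid, NUL-terminated UTF8.

```cpp
#include <stddef.h>

// Hypothetical helper (not part of the Arduino core): counts UTF8 characters.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new character.
// Assumes the input is valid, NUL-terminated UTF8.
size_t utf8CharCount(const char* s) {
  size_t count = 0;
  for (; *s; ++s) {
    if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For the message above, this would return 18 characters while message.length() reports 26 bytes, because each of the 8 accented letters takes 2 bytes.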
if you want to "clean up" UTF8 to just go to ASCII, it's quite a task if you need to support all the UTF8 characters.
if you want to support only a subset so that you could replace 'é', 'è' or 'ê' with 'e' and other common characters with a diacritical mark then you need to write a parser that will understand how to decode UTF8 (it's not complicated) and then find which multibyte UTF8 code you are looking at and provide its replacement in ASCII.
0x4c is 'L' in ASCII
0x27 is ' (apostrophe) in ASCII
0x54 is 'T' in ASCII
you would need to know which encoding rule makes 0xc9 an 'É' ==> what character table are they using? (it's likely Roman Extended ASCII or ISO Latin 1)
Yes I understand, thank you for your explanation. Now I don't have the skills to write a parser...
But if you guide me I will be able and I like to learn!
For example if you tell me for "é" and "è" I think I can add all the other characters
ISO Latin 1
Sorry, I'm French
Nothing to be ashamed of... me too.
there is a very active French forum if you want...
With pleasure
In ISO Latin 1, the 0xC3 character is Ã, but in Unicode UTF-8 it's a prefix byte indicating that the NEXT byte, 0x80 to 0xBF, maps to the Unicode characters from C0 to FF. I think that means if you take the byte after the 0xC3, subtract 0x80 and add 0xC0 (in other words, add 0x40), you will get the ISO Latin-1 character.
For example, C3 A9 -> E9 (0xA9 + 0x40) -> é
C2 is a different prefix mapping 80 through BF to 80 through BF. In other words, ignore the 0xC2 and use the next byte directly.
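The two prefix rules above can be sketched as a small function. The name latin1FromUtf8Pair is hypothetical, and returning 0 for "no Latin-1 equivalent" is just a convention chosen for this example.

```cpp
#include <stdint.h>

// Hypothetical helper applying the C2/C3 rule to one 2-byte UTF-8 sequence.
// Returns the ISO Latin-1 byte, or 0 if the pair has no Latin-1 equivalent.
uint8_t latin1FromUtf8Pair(uint8_t lead, uint8_t next) {
  if (next < 0x80 || next > 0xBF) return 0;                    // not a continuation byte
  if (lead == 0xC2) return next;                               // C2 80..BF -> 80..BF
  if (lead == 0xC3) return static_cast<uint8_t>(next + 0x40);  // C3 80..BF -> C0..FF
  return 0;                                                    // other prefixes: outside Latin-1
}
```

For instance, latin1FromUtf8Pair(0xC3, 0x89) gives 0xC9, the byte expected for 'É' in the original question.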
thanks @johnwasser
very interesting ! I watch how it reacts
arg.... you can't post the same question in 2 forums....
please decide which one you want to close...
Just to expand on what @johnwasser was saying with more data:
The ISO Latin table is
the full list of UTF-8 characters is here
You can indeed see that
from UTF8 NO-BREAK SPACE (C2A0) to INVERTED QUESTION MARK (C2BF) you match the ISO Latin codes A0 to BF in the same order. So if you are in this range, you ignore C2 and can directly use the byte that follows
from UTF8 LATIN CAPITAL LETTER A WITH GRAVE (C380) to LATIN SMALL LETTER Y WITH DIAERESIS (C3BF) you match the ISO Latin codes C0 to FF in the same order. So if you are in this range, you can ignore C3 and take the next byte, subtract 0x80 and add 0xC0 (so indeed add 0x40).
if you are outside this range (but above 0x7F otherwise it's ASCII), you don't have a direct match for ISO Latin
here it is in code (with a UTF8 sequence decoder)
byte isolatin1ForCode(uint32_t code) {
  if ((code >= 0xC380) && (code <= 0xC3BF)) return (code & 0xFF) + 0x40;
  else if ((code >= 0xC2A0) && (code <= 0xC2BF)) return (code & 0xFF);
  return '_';
}

size_t utf8ToIsoLatin1 (const char* input, byte* output)
{
  size_t outputIndex = 0;
  uint32_t code = 0;
  while (*input) {
    uint32_t currentByte = *input++ & 0xFF;
    if (currentByte <= 0x7F) { // it's the same as ASCII
      output[outputIndex++] = currentByte; // 1 byte 0ƀƀƀ·ƀƀƀƀ
      code = currentByte;
    }
    else if ((currentByte >= 0xC2) && (currentByte <= 0xDF)) { // 2 bytes 110ƀ·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ
      code = (currentByte << 8);
      currentByte = *input++ & 0xFF;
      code |= currentByte;
      output[outputIndex++] = isolatin1ForCode(code);
    }
    else if ((currentByte >= 0xE0) && (currentByte <= 0xEF)) { // 3 bytes 1110·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ
      code = (currentByte << 16);
      currentByte = *input++ & 0xFF;
      code |= currentByte << 8;
      currentByte = *input++ & 0xFF;
      code |= currentByte;
      output[outputIndex++] = isolatin1ForCode(code);
    }
    else if ((currentByte >= 0xF0) && (currentByte <= 0xF3)) { // 4 bytes 1111·00ƀƀ 10ƀƀ·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ
      code = (currentByte << 24);
      currentByte = *input++ & 0xFF;
      code |= currentByte << 16;
      currentByte = *input++ & 0xFF;
      code |= currentByte << 8;
      currentByte = *input++ & 0xFF;
      code |= currentByte;
      output[outputIndex++] = isolatin1ForCode(code);
    }
    else if (currentByte == 0xF4) { // 4 bytes 1111·0100 1000·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ 10ƀƀ·ƀƀƀƀ
      code = (currentByte << 24);
      currentByte = *input++ & 0xFF;
      code |= currentByte << 16;
      currentByte = *input++ & 0xFF;
      code |= currentByte << 8;
      currentByte = *input++ & 0xFF;
      code |= currentByte;
      output[outputIndex++] = isolatin1ForCode(code);
    }
    else break; // UTF8 error
  }
  return outputIndex;
}
void setup() {
  Serial.begin(115200); Serial.println();
  const char* UTF8_message = "L'ÉTÉ";
  byte iso_Message[strlen(UTF8_message)]; // we know it will be shorter as ISO Latin 1 has only 1 byte per glyph
  size_t length = utf8ToIsoLatin1(UTF8_message, iso_Message);
  Serial.print(F("Converting [")); Serial.print(UTF8_message); Serial.println(F("] gives the following bytes"));
  Serial.print(length); Serial.println(F(" bytes required"));
  for (size_t i = 0; i < length; i++) {
    Serial.print(iso_Message[i], HEX);
    if (isAscii(iso_Message[i])) {
      Serial.print(F(" ('")); Serial.write(iso_Message[i]); Serial.print(F("')"));
    }
    Serial.println();
  }
  Serial.println();
}

void loop() {}
the Serial monitor shows
Converting [L'ÉTÉ] gives the following bytes
5 bytes required
4C ('L')
27 (''')
C9
54 ('T')
C9
➜ seems to be what you wanted to get !
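For anyone wondering why "add 0x40" works: ISO Latin 1 is the first 256 Unicode code points, and a 2-byte UTF8 sequence carries the code point as ((lead & 0x1F) << 6) | (next & 0x3F), so the 0xC3 prefix contributes 0x03 << 6 = 0xC0. Below is an alternative sketch (not the code posted above) that decodes each sequence to its code point and keeps it when it fits in one byte. The name utf8ToLatin1ViaCodePoint is hypothetical. It stops at the first invalid or truncated sequence, and it does not check for overlong encodings.

```cpp
#include <stddef.h>
#include <stdint.h>

// Alternative sketch: decode each UTF8 sequence to its Unicode code point and
// keep it if it fits in one ISO Latin 1 byte (<= 0xFF), otherwise emit '_'.
// 'out' needs room for at most strlen(in) bytes. Returns the number written.
size_t utf8ToLatin1ViaCodePoint(const char* in, uint8_t* out) {
  size_t n = 0;
  while (*in) {
    uint8_t lead = static_cast<uint8_t>(*in++);
    uint32_t cp;   // code point being assembled
    int extra;     // number of continuation bytes expected
    if (lead <= 0x7F)                      { cp = lead;        extra = 0; }
    else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
    else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; extra = 2; }
    else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
    else return n;                         // invalid lead byte: stop
    for (; extra > 0; --extra) {
      uint8_t next = static_cast<uint8_t>(*in);
      if ((next & 0xC0) != 0x80) return n; // truncated sequence (also catches NUL)
      cp = (cp << 6) | (next & 0x3F);      // append 6 payload bits
      ++in;
    }
    out[n++] = (cp <= 0xFF) ? static_cast<uint8_t>(cp) : static_cast<uint8_t>('_');
  }
  return n;
}
```

For "L'ÉTÉ" this yields the same 5 bytes as above (4C 27 C9 54 C9). The design difference is that the continuation bytes are validated before being consumed, so a string cut in the middle of a character does not make the loop read past the terminator.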
wow !
I am amazed by your work ! delighted with this demonstration which will allow me to really understand !!! I'm going to spend my weekend going through your code! Thank you very much for your generosity @J-M-L (a beer for you in Vendée...)
Have fun