ASCII to char

Thanks man :slight_smile: I was away.

Seen that :slight_smile:

Can I ask one more question?
This is my function, and when I call AsciiToHex("č"); it prints �

String AsciiToHex(String ascii)
{
	String hex="";
	int i = 0;
	while(i < ascii.length())
	{
		String letter = ascii.substring(i, i + 1);
		Serial.println(letter);
		if (letter == "A") hex += "";
		else if (letter == "B") hex += "";
		else if (letter == "C") hex += "";
		else if (letter == "D") hex += "";
		else if (letter == "E") hex += "";
		else if (letter == "F") hex += "";
		else if (letter == "G") hex += "";
		else if (letter == "H") hex += "";
		else if (letter == "I") hex += "";
		else if (letter == "J") hex += "";
		else if (letter == "K") hex += "";
		else if (letter == "L") hex += "";
		else if (letter == "M") hex += "";
		else if (letter == "N") hex += "";
		else if (letter == "O") hex += "";
		else if (letter == "P") hex += "0050";
		else if (letter == "Q") hex += "";
		else if (letter == "R") hex += "";
		else if (letter == "S") hex += "";
		else if (letter == "T") hex += "";
		else if (letter == "U") hex += "";
		else if (letter == "V") hex += "";
		else if (letter == "W") hex += "";
		else if (letter == "X") hex += "";
		else if (letter == "Y") hex += "";
		else if (letter == "Z") hex += "";
		else if (letter == "a") hex += "";
		else if (letter == "b") hex += "";
		else if (letter == "c") hex += "";
		else if (letter == "d") hex += "";
		else if (letter == "e") hex += "";
		else if (letter == "f") hex += "";
		else if (letter == "g") hex += "";
		else if (letter == "h") hex += "";
		else if (letter == "i") hex += "";
		else if (letter == "j") hex += "";
		else if (letter == "k") hex += "";
		else if (letter == "l") hex += "";
		else if (letter == "m") hex += "";
		else if (letter == "n") hex += "";
		else if (letter == "o") hex += "006F";
		else if (letter == "p") hex += "";
		else if (letter == "q") hex += "";
		else if (letter == "r") hex += "0072";
		else if (letter == "s") hex += "";
		else if (letter == "t") hex += "";
		else if (letter == "u") hex += "";
		else if (letter == "v") hex += "";
		else if (letter == "w") hex += "";
		else if (letter == "x") hex += "";
		else if (letter == "y") hex += "";
		else if (letter == "z") hex += "";
		else if (letter == "Á") hex += "";
		else if (letter == "Ä") hex += "";
		else if (letter == "Č") hex += "";
		else if (letter == "Ď") hex += "";
		else if (letter == "É") hex += "";
		else if (letter == "Í") hex += "";
		else if (letter == "Ĺ") hex += "";
		else if (letter == "Ľ") hex += "";
		else if (letter == "Ň") hex += "";
		else if (letter == "Ó") hex += "";
		else if (letter == "Ô") hex += "";
		else if (letter == "Ŕ") hex += "";
		else if (letter == "Š") hex += "";
		else if (letter == "Ť") hex += "";
		else if (letter == "Ú") hex += "";
		else if (letter == "Ý") hex += "";
		else if (letter == "Ž") hex += "";
		else if (letter == "á") hex += "";
		else if (letter == "ä") hex += "";
		else if (letter == "č") hex += "010D";
		else if (letter == "ď") hex += "";
		else if (letter == "é") hex += "00E9";
		else if (letter == "í") hex += "";
		else if (letter == "ĺ") hex += "";
		else if (letter == "ľ") hex += "";
		else if (letter == "ň") hex += "";
		else if (letter == "ó") hex += "";
		else if (letter == "ô") hex += "";
		else if (letter == "ŕ") hex += "";
		else if (letter == "š") hex += "";
		else if (letter == "ť") hex += "";
		else if (letter == "ú") hex += "";
		else if (letter == "ý") hex += "";
		else if (letter == "ž") hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "")  hex += "";
		else if (letter == "0")  hex += "0030";
		else if (letter == "1")  hex += "0031";
		else if (letter == "2")  hex += "0032";
		else if (letter == "3")  hex += "0033";
		else if (letter == "4")  hex += "0034";
		else if (letter == "5")  hex += "0035";
		else if (letter == "6")  hex += "0036";
		else if (letter == "7")  hex += "0037";
		else if (letter == "8")  hex += "0038";
		else if (letter == "9")  hex += "0039";
		else if (letter == "+")  hex += "002B";
		i++;
	}
	Serial.print(hex);
	return hex; // the function is declared to return a String, so return the result
}

That's possibly because the Arduino IDE encodes this ("č") as UTF-8 -> see the UTF-8 link I gave you in answer #8 to understand how cStrings are built when you don't stick to ASCII in the code.

You might also want to force the creation of the String explicitly in the call, as in String("č"), rather than leave it implicit; not sure... In any case, String is a bad idea and I don't use it; I would stick to cStrings.

So how do I do it? I know that something like a cstring library exists, but what then?

A cString is just an array of char, terminated by a null character '\0'

char s[] = "HELLO"; will create an array of 6 chars, filled with the ASCII codes of the letters HELLO and a trailing null char

Now, when you use non-ASCII characters in your code, the IDE and the gcc compiler internally use UTF-8 encoding, which takes two or more bytes for each special character.

For example writing char amp[] = "5µA"; would internally be represented in UTF-8 as 5 bytes: char amp[] ={0x35, 0xC2, 0xB5 , 0x41, 0x00}; because the ASCII code for the character ‘5’ is 0x35, the UTF8 representation of the ‘µ’ character is 0xC2 0xB5, the ASCII character ‘A’ is 0x41 and because it’s a c-String the compiler adds the ‘\0’ at the end.

With the String class the representation is similar: the class treats UTF-8 text as a plain sequence of bytes and knows nothing about multi-byte characters. For example, the myString.length() call returns the number of bytes, not the number of characters.

So when you do String letter = ascii.substring(i, i + 1); it's too late: you are extracting single bytes that do not make any sense as characters on their own.

if you try this code

char amp[] = "5µA";
String ampS = "5µA";

void setup() {
  Serial.begin(115200);
  Serial.print("The amp array is ");
  Serial.print(sizeof(amp));
  Serial.println(" bytes long");
  
  for (int i = 0; i < sizeof(amp); i++) {
    Serial.print("0x");
    if ((byte) amp[i] <= 0xF) Serial.print(0);
    Serial.print((byte) amp[i], HEX);
    Serial.print(" ");
  }
  Serial.println();

  Serial.print("The ampS String Object length is ");
  Serial.println(ampS.length());
  
  for (int i = 0; i < ampS.length(); i++) {
    Serial.print(ampS.substring(i, i + 1));
    Serial.print(" ");
  }
  Serial.println();
}

void loop() {}

you will see in the console

The amp array is 5 bytes long
0x35 0xC2 0xB5 0x41 0x00 
The ampS String Object length is 4
5 ⸮ ⸮ A 

See the two ⸮ ==> they correspond to the weird bytes that were not ASCII and that put together were the ‘µ’ character originally

Makes sense?

Yes, but how can I parse 0x35 0xC4 0x8C 0x41 0x00 and get the UCS2 chars from it?
As you can see in my code, I'm working on my encoding/decoding table, and I don't know how to write a function that finds the equivalent values... because when I replace the first character with a non-ASCII char from UCS2, there will be one more byte...

Where does this

0x35 0xC4 0x8C 0x41

Come from?

You can read how to decode UTF8 from this page

That's 5ČA...

I just need to get 0x35 0xC4 0x8C 0x41 into UCS2.
I can do it with if(... == "Č" ..., but I don't know what to do with 0x35, and I can't parse it because I don't know how... the number of bytes per character is not fixed.

Did you read the link I just gave you? How to decode UTF8 from this page...

You have all you need in this:

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence.
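
As a rough sketch, the ranges above can be turned into a small helper that tells you how many bytes belong to the character starting at a given byte. The function name utf8SequenceLength is made up for this example. One assumption that differs from the list: only F0 to F4 are accepted as four-byte lead bytes, because F5 to FF never occur in valid UTF-8.

```cpp
#include <stdint.h>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` cannot start a sequence (continuation byte or invalid value).
uint8_t utf8SequenceLength(uint8_t lead) {
  if (lead <= 0x7F) return 1;                  // plain ASCII, e.g. 0x35 for '5'
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // e.g. 0xC4, the first byte of Č
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // assumption: F5..FF rejected
  return 0;                                    // 0x80..0xBF, 0xC0, 0xC1, 0xF5..0xFF
}
```

A return value of 0 means "this byte cannot be the start of a character", which is what you get if you land in the middle of a multi-byte sequence.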

So if you receive 0x35 0xC4 0x8C 0x41 0x00

You start with 0x35, which is between 00 and 7F, so this is only 1 byte, directly matching ASCII -> you can use it directly, "0035" will need to be in your UCS2 output (which represents '5')

Then you have 0xC4 which is between C2 to DF so you know according to the UTF8 rule above that this is the first byte of a two-byte sequence -> grab next byte and you get {0xC4, 0x8C} in UTF8

--> Your Arduino code will need to recognize this by comparing to cStrings such as "Č" (if you read my previous links and test code you should know that by now), as the same bytes will be in memory. You can use memcmp() for this, for example, comparing against all the UTF8 chars you need to translate into UCS2.

Č in UTF8 is C4 8C and Č in UCS2 is "010C"

--> Once you found out it's "Č", then associate to the output "010C"

Then you have 0x41, which is between 00 to 7F so this is only 1 byte, directly matching ASCII -> you can use it directly, 0041 will need to be in your UCS2 output (which represents 'A')

last you have 0x00 which denotes the end of the entry.

So as you scanned your input you built "0035 010C 0041" as an output

And if you want to make sure this is correct, go back to the tool I mentioned before (Unicode code converter), enter 0035 010C 0041 at the very bottom (in the Hexadecimal box) and click convert; at the top (in the characters box) you'll see 5 Č A...
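
Here is a minimal sketch of that scan, under a few assumptions: the names Utf8ToUcs2, kTable and utf8ToUcs2Hex are invented for the example, the table holds only three sample characters, the output is written without spaces (as in an SMS payload), and three- and four-byte sequences are simply rejected. The table entries use hex escapes so the bytes do not depend on how the source file is saved; in a UTF-8 source file "\xC4\x8C" is the same as "Č".

```cpp
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical lookup entry: UTF-8 bytes of a character and its UCS2 hex text.
struct Utf8ToUcs2 {
  const char *utf8;  // the two bytes as stored in memory
  const char *ucs2;  // four hex digits
};

static const Utf8ToUcs2 kTable[] = {
  {"\xC4\x8C", "010C"},  // Č
  {"\xC4\x8D", "010D"},  // č
  {"\xC3\xA9", "00E9"},  // é
};

// Converts a UTF-8 cstring into UCS2 hex text (4 hex digits per character).
// Returns false on unknown or unsupported characters, or if `out` is too small.
bool utf8ToUcs2Hex(const char *in, char *out, size_t outSize) {
  if (outSize == 0) return false;
  size_t used = 0;
  out[0] = '\0';
  while (*in != '\0') {
    if (used + 5 > outSize) return false;        // 4 digits + terminator
    uint8_t lead = (uint8_t)*in;
    if (lead <= 0x7F) {                          // ASCII: UCS2 value equals the byte
      snprintf(out + used, 5, "%04X", (unsigned)lead);
      in += 1;
    } else if (lead >= 0xC2 && lead <= 0xDF) {   // two-byte sequence: look it up
      const char *hex = NULL;
      for (size_t k = 0; k < sizeof(kTable) / sizeof(kTable[0]); k++) {
        if (memcmp(in, kTable[k].utf8, 2) == 0) { hex = kTable[k].ucs2; break; }
      }
      if (hex == NULL) return false;             // character missing from the table
      memcpy(out + used, hex, 4);
      in += 2;
    } else {
      return false;                              // not handled in this sketch
    }
    used += 4;
    out[used] = '\0';
  }
  return true;
}
```

Called with the bytes 0x35 0xC4 0x8C 0x41 0x00 and a large enough buffer, this fills the buffer with 0035010C0041.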

Now if you want to do the opposite and transform 005000720065010D006F00200074006F0020006E0065006601480075006A006701610069006A0065 into Prečo to nefňujgšije, then what you need to do is take groups of 4 hex digits (ASCII characters) and match those against the UCS2 table... You can be a bit smart there and group things together for the standard ASCII stuff:

A to Z will be between 0041 and 005A
a to z will be between 0061 and 007A
0 to 9 will be between 0030 and 0039

If outside these, maybe map the special chars you are likely to see, such as č, ň or š for example.

that should get you going (the full UCS2 list can be found here)
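
For the reverse direction, here is one possible sketch. Note that it swaps the table matching described above for a computed conversion: each group of 4 hex digits is parsed into a 16-bit value and then re-encoded as UTF-8 bytes, so no per-character table is needed. The names hexValue and ucs2HexToUtf8 are invented for the example, and surrogate values (D800 to DFFF) are rejected as an assumption about what the input should contain.

```cpp
#include <stdint.h>
#include <string.h>

// Value of one hex digit, or -1 if `c` is not a hex digit.
static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

// Converts UCS2 hex text (4 hex digits per character) into a UTF-8 cstring.
// Returns false on malformed input or if `out` is too small.
bool ucs2HexToUtf8(const char *hex, char *out, size_t outSize) {
  size_t len = strlen(hex);
  if (len % 4 != 0) return false;
  size_t used = 0;
  for (size_t i = 0; i < len; i += 4) {
    uint16_t cp = 0;
    for (size_t k = 0; k < 4; k++) {
      int v = hexValue(hex[i + k]);
      if (v < 0) return false;
      cp = (uint16_t)((cp << 4) | (uint16_t)v);
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // not a character on its own
    uint8_t n = (cp <= 0x7F) ? 1 : (cp <= 0x7FF) ? 2 : 3;
    if (used + n + 1 > outSize) return false;        // keep room for '\0'
    if (n == 1) {
      out[used++] = (char)cp;                        // ASCII range
    } else if (n == 2) {
      out[used++] = (char)(0xC0 | (cp >> 6));        // 110xxxxx
      out[used++] = (char)(0x80 | (cp & 0x3F));      // 10xxxxxx
    } else {
      out[used++] = (char)(0xE0 | (cp >> 12));       // 1110xxxx
      out[used++] = (char)(0x80 | ((cp >> 6) & 0x3F));
      out[used++] = (char)(0x80 | (cp & 0x3F));
    }
  }
  if (used >= outSize) return false;
  out[used] = '\0';
  return true;
}
```

With the long hex string from this post and a 64-byte buffer, the result is the UTF-8 bytes of Prečo to nefňujgšije, which Serial.print() can send as-is to a terminal that understands UTF-8.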

So it's pretty straightforward... just need to make sure you capture exactly what is UTF8 and what is UCS2 and do the right transformation. will just require coding... (a b-tree structure could help instead of tons of if-else)

J-M-L:
Then you have 0xC4 which is between C2 to DF so you know according to the UTF8 rule above that this is the first byte of a two-byte sequence -> grab next byte and you get {0xC4, 0x8C} in UTF8

--> Your Arduino code will need to recognize this by comparing to cStrings such as "Č" (if you read my previous links and test code you should know that by now), as the same bytes will be in memory. You can use memcmp() for this, for example, comparing against all the UTF8 chars you need to translate into UCS2.

Č in UTF8 is C4 8C and Č in UCS2 is "010C"

--> Once you found out it's "Č", then associate to the output "010C"

Then you have 0x41, which is between 00 to 7F so this is only 1 byte, directly matching ASCII -> you can use it directly, 0041 will need to be in your UCS2 output (which represents 'A')

last you have 0x00 which denotes the end of the entry.

So as you scanned your input you built "0035 010C 0041" as an output

Yeah bro, I'm not as stupid as I look...
The problem is, when I set up the decoder for 5 Č A and later try it on Č 5 A, there will be a problem...
I know how to convert from UTF-8 to UCS2, you wrote that, but the problem is how to write a function based on how many chars I have and the number of bytes... how to find out which character has more than one byte...

Yeah bro, I'm not as stupid as I look...
The problem is, when I set up the decoder for 5 Č A and later try it on Č 5 A, there will be a problem...

Yeah bro. If I answer it means that I don’t judge if you are stupid or not. I’m trying to help and make it as understandable as possible, for you or whoever will come read this after you. If you don’t like my writing I can go do something else - just let me know.

Care to explain why you think it's a problem? It works the same way...

you will have 0xC4 which is between C2 to DF so you know according to the UTF8 rule above that this is the first byte of a two-byte sequence -> grab next byte and you get {0xC4, 0x8C} in UTF8 and you find it’s Č

Then the two others are in low ASCII range so just 5 and A

Not sure I get your issue. It's just parsing the input using the UTF-8 rule.
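
To illustrate that the order of the characters does not matter, here is a small sketch that walks a UTF-8 cstring one character at a time, using the first byte of each character to decide how many bytes to skip. The name countUtf8Characters is invented for the example, and treating F5 to FF as invalid is an assumption of the sketch.

```cpp
#include <stdint.h>

// Counts the characters (not bytes) in a UTF-8 cstring.
// Returns -1 if the bytes do not follow the UTF-8 rules.
int countUtf8Characters(const char *s) {
  int count = 0;
  while (*s != '\0') {
    uint8_t lead = (uint8_t)*s;
    int len;
    if (lead <= 0x7F) len = 1;                        // ASCII
    else if (lead >= 0xC2 && lead <= 0xDF) len = 2;   // e.g. Č = C4 8C
    else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
    else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
    else return -1;                                   // cannot start a character
    for (int k = 1; k < len; k++) {
      // every following byte must look like 10xxxxxx; this also stops
      // safely if the string ends in the middle of a character
      if (((uint8_t)s[k] & 0xC0) != 0x80) return -1;
    }
    s += len;   // jump to the first byte of the next character
    count++;
  }
  return count;
}
```

Both 5ČA (35 C4 8C 41) and Č5A (C4 8C 35 41) are 4 bytes long and both give 3 characters: the loop never needs to know in advance where the multi-byte character sits.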

lukeesvk:
the problem is how to write a function based on how many chars I have and the number of bytes... how to find out which character has more than one byte...

You use ISO/IEC 2022 escape sequences.

aarg:
You use ISO/IEC 2022 escape sequences.

This is not applicable to UTF-8 or UCS2 (UTF-8 = UCS Transformation Format, 8-bit).

The UCS is an encoding system different from that specified in ISO/IEC 2022. (ISO/IEC 10646:2014 specifies the method to designate UCS from ISO/IEC 2022.)

If you go back to my post #49 and the link at the top, you have the rules to decide at parse time how many bytes to extract for a given char when reading your UTF-8 cstring

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence.

Look as well in the link for the Binary format of bytes in sequence information to see where the useful bits for the character are stored then.

For example; Č in UTF8 is C4 8C and Č in UCS2 is "010C"
To go from the UTF8 version to UCS2 you do the following steps

Read the first byte 0xC4: it's between C2 and DF, so you know it is the first byte of a two-byte sequence.
You thus take the two bytes in binary, split here into the UTF8 marker bits and the payload bits:
C4 = 110 | 0 0100
8C = 10 | 00 1100

You get rid of the marker bits (110 and 10, which are UTF8 coding info) and stitch together the payload bits: 0 0100 00 1100. Regrouped from the right in groups of 4 bits this is 001 0000 1100, and if you pad to the left with enough zeros to get a full 16 bits you have 0000 0001 0000 1100, which in hexadecimal is 0x010C, represented in HEX as the ASCII cstring "010C" ==> there you go, you have your UCS2 hex notation for SMS

I’m still unsure where the OP is confused....

(If security is important for what you do, be careful: UTF or HEX decoding requires checking that an encoding is in its minimal form. One of the smart UTF-based attacks against Microsoft IIS v4 and v5 in the past was based on passing / coded in a longer form to escape from a script virtual directory and launch cmd.exe.)
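
A small illustration of what "longer form" means, as a sketch only: the character / is 0x2F in ASCII, but a decoder that stitches bits together without checking the lead byte will also accept the two bytes C0 AF as the same character. The function names below are invented for the example.

```cpp
#include <stdint.h>

// Naive decoder: stitches the payload bits without validating the lead byte.
uint16_t naiveDecodeTwoByte(uint8_t first, uint8_t second) {
  return (uint16_t)(((first & 0x1F) << 6) | (second & 0x3F));
}

// A two-byte sequence is overlong when its value would fit in one byte,
// which is the case exactly when the lead byte is 0xC0 or 0xC1.
bool isOverlongTwoByte(uint8_t first) {
  return first == 0xC0 || first == 0xC1;
}
```

naiveDecodeTwoByte(0xC0, 0xAF) gives 0x2F, the same value as a plain /. This is why the two-byte range in the list starts at C2 and not C0: rejecting C0 and C1 as lead bytes rules out these overlong forms.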

Hi again! I found an algorithm to convert between Unicode and UTF-8 for c1.., c2.., c3.., but I can't work it out for the others.
This is the page: Complete Character List for UTF-8.
Could you help me? I know that when you have c3 you have to calculate it with the 3rd char and you get the 3rd char of the UCS2 code, and for c2 you replace c2 with 00. Any ideas?

I don't understand your question... can you give a very clear example of what you are trying to do?

lukeesvk:
Hi again! I found an algorithm to convert between Unicode and UTF-8 for c1.., c2.., c3..,

So, post your code that implements this algorithm that you found. Perhaps then others can understand what you're trying to do and help you expand the code as needed.

Just tell me how I can get D from B, when I can only do the calculation with int.

lukeesvk:
Just tell me how I can get D from B, when I can only do the calculation with int.

Post your code and maybe someone can help you.

I need to do: String i = "b" +2 and get D ... understood ?