The String class stores a UTF-8 representation of your characters, so when you extract the cString you just get a char[N] array, that is, individual bytes in the UTF-8 encoding.
Unicode currently defines 1,114,112 code points (17 planes of 2^16 code points each). A code point is something rather abstract: it can be a glyph, a formatting code, or just reserved for future use. The point is, you have tons of them!
UTF-8 uses an 8-bit variable-length encoding scheme that encodes each Unicode code point using one to four bytes. Since a single 8-bit code unit can only represent 256 values, a code point is represented by a sequence of one to four code units (one to four bytes) to cover all code points, and not all byte sequences are valid. The representation is unambiguous: the lead byte tells you how many bytes to read to get the code point.
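To show what "the lead byte tells you" means in practice, here is a minimal sketch in C (the function name `utf8_seq_len` is mine, not a standard API):

```c
#include <stddef.h>

/* A minimal sketch: derive the sequence length from the UTF-8 lead byte.
   Returns 0 for bytes that cannot start a sequence (continuation
   bytes 0x80..0xBF and the invalid leads 0xC0, 0xC1, 0xF5..0xFF). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F)                  return 1;  /* 0xxxxxxx (ASCII) */
    if (lead >= 0xC2 && lead <= 0xDF)  return 2;  /* 110xxxxx         */
    if (lead >= 0xE0 && lead <= 0xEF)  return 3;  /* 1110xxxx         */
    if (lead >= 0xF0 && lead <= 0xF4)  return 4;  /* 11110xxx         */
    return 0;                                     /* invalid lead byte */
}
```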
UTF-16 uses a 16-bit variable-length encoding scheme that encodes each Unicode code point using either 2 or 4 bytes (never 1 or 3): code points up to U+FFFF fit in a single 16-bit code unit, everything above needs a pair of code units called a surrogate pair.
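Building the surrogate pair is mechanical; here is a sketch of that step (again, `utf16_encode` is a name I made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* A minimal sketch: encode one code point as UTF-16 code units.
   Code points up to U+FFFF fit in a single 16-bit unit; anything
   above is split into a high/low surrogate pair. Returns the number
   of units written (1 or 2). Rejecting the reserved surrogate range
   U+D800..U+DFFF in the input is omitted for brevity. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {                          /* BMP: one code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```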
If you want to send a UTF-16 representation, your function needs to decode the UTF-8 representation and construct the UTF-16 one, where 2 or 4 bytes are always used for each character even if only one was needed in UTF-8 (there is a sketch of the full conversion at the end).
Decoding UTF-8 (finding out how many bytes are used) is not difficult, but you need the mapping to 16 or 32 bits for the characters. Because of the way it uses the first byte of multi-byte sequences, UTF-8 needs 3 bytes for some characters (everything from U+0800 up to U+FFFF) that require only 2 bytes in UTF-16.
That’s where the challenge is, but it does work out for most characters you’ll want to use: go to the bit representation, extract the Unicode code point, and move it to the other representation.
For ASCII characters it’s easy: they fit in one byte and the most significant bit is always 0 in UTF-8. "A" in ASCII is hex 0x41; in UTF-8 it is also 0x41, and there is a straight mapping to UTF-16 with 2 bytes as 0x0041.
When you get out of ASCII it’s a bit more complicated. For example, "À" in UTF-8 takes two bytes, 0xC3 0x80, while in UTF-16 it is 0x00C0.
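Worked out at the bit level, that conversion looks like this:

```c
#include <stdio.h>

int main(void)
{
    /* "À" (U+00C0) arrives as the UTF-8 bytes 0xC3 0x80.
       Strip the 110xxxxx / 10xxxxxx control bits and recombine: */
    unsigned int cp = ((0xC3u & 0x1Fu) << 6)   /* keep the low 5 bits */
                    |  (0x80u & 0x3Fu);        /* keep the low 6 bits */
    printf("U+%04X\n", cp);                    /* prints U+00C0       */
    return 0;
}
```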
You get from one to the other by removing the control bits of UTF-8 and rebuilding the code point value, following the table below (a decoding sketch follows the table).
| code point | UTF-8 | possible 1st byte | # bits to code |
|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx (ASCII) | 00 to 7F | 7 |
| U+0080 to U+07FF | 110xxxxx 10xxxxxx | C2 to DF | 5+6=11 |
| U+0800 to U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 to EF | 4+6+6=16 |
| U+10000 to U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 to F4 | 3+6+6+6=21 |
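Here is the table turned into a decoding sketch (`utf8_decode` is an illustrative name; full validation of continuation bytes, overlong forms, and surrogates is omitted for brevity):

```c
#include <stdint.h>
#include <stddef.h>

/* A minimal sketch: read one UTF-8 sequence, strip the control bits
   per the table above, and rebuild the code point. Returns the number
   of bytes consumed, or 0 on an invalid lead byte. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] <= 0x7F) {                     /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {     /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (s[0] >= 0xE0 && s[0] <= 0xEF) {     /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            |  (s[2] & 0x3F);
        return 3;
    }
    if (s[0] >= 0xF0 && s[0] <= 0xF4) {     /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            |  (s[3] & 0x3F);
        return 4;
    }
    return 0;                               /* invalid lead byte */
}
```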
Extracting the code point with this mapping always gives you the right value; for code points up to U+FFFF that value is directly the UTF-16 code, and for the rare ones above you build a surrogate pair as shown earlier. That should be enough for your needs.
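Putting the pieces together, a minimal sketch of the whole conversion, assuming the hypothetical `utf8_decode` and `utf16_encode` helpers sketched above are in scope:

```c
/* Convert a NUL-terminated UTF-8 C string into UTF-16 code units.
   Assumes `out` is large enough (worst case: one 16-bit unit per
   input byte). Returns the number of UTF-16 units written. */
static size_t utf8_to_utf16(const unsigned char *in, uint16_t *out)
{
    size_t n = 0;
    while (*in) {
        uint32_t cp;
        size_t used = utf8_decode(in, &cp);   /* sketch after the table */
        if (used == 0) break;                 /* invalid input: stop    */
        uint16_t units[2];
        size_t k = utf16_encode(cp, units);   /* surrogate-pair sketch  */
        for (size_t i = 0; i < k; i++) out[n++] = units[i];
        in += used;
    }
    return n;
}
```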