When I try to put a string of exotic characters in a char array like this:
and if I just print this string to the serial monitor, I get the following:
Why does it change? Are certain characters not allowed? If so, how do I work with the base64 library, since it only lets me encode/decode chars?
char ch[] = "H<k‘zeµhØ¥.$ýõ#jÊÞynYh…v„"; // must be an array, not a single char
Serial.println(ch); // prints the wrong string
base64_encode(ch2, ch, strlen(ch)); // gives a messed-up result
Characters above 127 (i.e., anything that's not a-z, A-Z, 0-9, or basic punctuation) are purely implementation-specific. The ASCII specification only defines characters up to 127; anything above that is down to whatever is doing the displaying. DOS introduced the concept of "code pages", which put different characters in the upper 128 slots, including box-drawing characters, accented characters, etc.
Then along came UTF-8 and UTF-16. They combine multiple bytes to represent many, many more characters than a single char can hold.
The Arduino IDE's editor speaks UTF-8 (I believe), so you can enter UTF-8 characters in a string. However, each of those characters is represented by multiple actual bytes within the string. Notice all the extra Ã characters? That is a sure-fire sign of expanded UTF-8: the Ã combined with the byte next to it forms the actual UTF-8 character.
If you were to view the serial output in a UTF-8-capable serial terminal (which the serial monitor isn't), you would see the proper text again.
For more information, see the UTF-8 Wikipedia article.
Thanks for the explanation, but if the actual string data is correct, why am I not getting the correct Base64 output
instead of getting:
You can check it yourself at: Base64 Online - base64 decode and encode
Base64 isn't encoding it as UTF-8. Try taking the first Base64 string and decoding it using the UTF-7 charset: look familiar?
I don't see any difference; I get the same output for UTF-7 as for UTF-8.
Aha - it's actually encoding in ISO-8859-1, not UTF-8.
Take the second (incorrect) string and decode it with ISO-8859-1, and you will see the correct characters.
Your UTF-8 string is being encoded as ISO-8859-1 (ASCII with a "standard" code-page upper half). If you try decoding the result as UTF-8 it doesn't work. Decode it with ISO-8859-1 and you get the original bytes back; the browser then interprets that raw data as UTF-8 and displays it correctly.