Hello everybody. My application reads SMS messages from a SIM800L module connected to an Arduino. I have set the SIM800L to text mode. When the SMS text contains non-Latin characters (Cyrillic in my case), I receive the text as a hex string in a big-endian encoding:
The SMS body is at the end of this string.
Is there a way to decode this big-endian text to Win-1251 (in the best case) or UTF-8 (in the worst case)? My build is limited to 500 bytes of RAM (1024 bytes total).
Thanks. I will be glad of any answer, because otherwise I will be forced to decode it on a higher-level device, where I can use my knowledge of C#, for example.
I am assuming you are using an UNO. Store a table of translated characters in EEPROM and read the entry at the address given by the incoming character. I did something similar many years ago. Here is a table: ASCII Table for Windows-1251: ASCII Code Reference. I believe you will need to start at 128 (0x80); the first 128 codes (0 to 127) are the same as ASCII.
I use a Pro Mini. My whole EEPROM is used for storing the phone numbers that are allowed to work with my SIM800L.
I did something similar at university in 2006, but that was a 1-byte encoding, and here it is big-endian (a 2-byte encoding).
Sorry, but the word "BigEndian" says nothing about the encoding method; it only describes byte order.
Presumably you meant UTF-16BE encoding.
This link could be useful for understanding:
Sure. UTF-16 and Win-1251 codes correspond directly to each other. All you need is a function that translates one to the other, similar to yours in #3.
You could express the lookup table compactly in code (and with less indenting than what you had before)
unsigned char ToWindows1251(uint16_t unicode) {
  switch (unicode) {
    // Note: "case A ... B" ranges are a GCC extension, not standard C++.
    case 0 ... 127:  // ASCII
    case 0x98:       // unused, return unchanged as C1 control code
      return (unsigned char) unicode;
    case 0x0402: return 0x80; // Ђ
    case 0x0403: return 0x81; // Ѓ
    case 0x201A: return 0x82; // ‚ (same as 1252)
    case 0x0453: return 0x83; // ѓ
    case 0x201E: return 0x84; // „ (same as 1252)
    case 0x2026: return 0x85; // … (same as 1252)
    case 0x2020: return 0x86; // † (same as 1252)
    case 0x2021: return 0x87; // ‡ (same as 1252)
    case 0x20AC: return 0x88; // € (0x80 in 1252)
    case 0x2030: return 0x89; // ‰ (same as 1252)
    case 0x0409: return 0x8a; // Љ
    case 0x2039: return 0x8b; // ‹ (same as 1252)
    case 0x040a: return 0x8c; // Њ
    case 0x040c: return 0x8d; // Ќ
    case 0x040b: return 0x8e; // Ћ
    case 0x040f: return 0x8f; // Џ
    //
    // TODO return 0x90 through 0xbf (except 0x98, handled above)
    //
    case 0x0410 ... 0x044f:
      // А Б В Г Д Е Ж З И Й К Л М Н О П
      // Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
      // а б в г д е ж з и й к л м н о п
      // р с т у ф х ц ч ш щ ъ ы ь э ю я
      return (unsigned char) (unicode - 0x410 + 0xc0); // 0xc0 ... 0xff
  }
  return 127; // DEL as "unsupported"
}
I left the middle of the table as a TODO; it might take half an hour to fill in. The character set table at Wikipedia requires hovering to see the code point. You might find something better.
So convert each group of four hex digits, e.g. 0432, to a uint16_t and run it through the function.
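A minimal sketch of that step, assuming the hex text sits in a NUL-terminated char buffer. The names hexDigitValue, parseCodeUnit and decodeHexInPlace are hypothetical, not from any library. Because every four hex characters shrink to one output byte, the result can overwrite the start of the same buffer, which may help with the tight RAM budget.

```cpp
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
static int8_t hexDigitValue(char c) {
  if (c >= '0' && c <= '9') return (int8_t)(c - '0');
  if (c >= 'A' && c <= 'F') return (int8_t)(c - 'A' + 10);
  if (c >= 'a' && c <= 'f') return (int8_t)(c - 'a' + 10);
  return -1;
}

// Hypothetical helper: reads four hex digits at `hex` as one 16-bit code unit,
// most significant digit first. Returns false (leaving *out untouched) as soon
// as a non-hex character, including the terminating NUL, is found.
bool parseCodeUnit(const char *hex, uint16_t *out) {
  uint16_t value = 0;
  for (uint8_t i = 0; i < 4; i++) {
    int8_t d = hexDigitValue(hex[i]);
    if (d < 0) return false;
    value = (uint16_t)((value << 4) | (uint16_t)d);
  }
  *out = value;
  return true;
}

// Hypothetical helper: converts the hex text in `buf` in place, one output byte
// per four hex characters, using `convert` (for example ToWindows1251) for each
// code unit. Stops at the first incomplete or invalid group.
// Returns the number of bytes written; the result is NUL-terminated.
size_t decodeHexInPlace(char *buf, unsigned char (*convert)(uint16_t)) {
  size_t written = 0;
  uint16_t unit;
  for (size_t read = 0; parseCodeUnit(buf + read, &unit); read += 4) {
    buf[written++] = (char)convert(unit);  // written never overtakes read
  }
  buf[written] = '\0';
  return written;
}
```

Usage would be along the lines of decodeHexInPlace(smsBody, ToWindows1251), where smsBody is an assumed name for the buffer that holds the hex part of the message.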
string data = "042F0020043404370432043E043D043804320028043B04300029002004120430043C00200020000A043E002000320033003A00310033002C002000300037002E00300034002E00320034002E003B000A";
int len = data.Length / 2;
byte[] array = new byte[len];
for (int i = 0; i < array.Length; i++)
{
    string tmp = data.Substring(i * 2, 2);
    array[i] = byte.Parse(tmp, System.Globalization.NumberStyles.HexNumber);
}
string text = Encoding.BigEndianUnicode.GetString(array);
I am trying to find a ready-made solution for the last line. Yes, I can look at the C# code for this function and rewrite it in C++, but I don't want to reinvent the wheel if it has already been invented.
You could take the data that you already have in C# and generate the C++ code, as I had it. That might also take about half an hour, but then it is done and you don't have to process it elsewhere.
Those are different representations of the same code points -- the numeric value is the same. (In C source the 0x literals are big-endian, although not limited to 16 bits.)
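To illustrate the point about representations, here is a small sketch, assuming the two bytes of a code unit arrive high byte first, as in big-endian order. The name codeUnitFromBigEndian is hypothetical.

```cpp
#include <stdint.h>

// Hypothetical helper: combines two bytes, given high byte first (big-endian),
// into one 16-bit code unit.
uint16_t codeUnitFromBigEndian(uint8_t high, uint8_t low) {
  return (uint16_t)(((uint16_t)high << 8) | low);
}
```

With the bytes 0x04 and 0x32 this gives the same number as the literal 0x0432 in source code, or 1074 in decimal; only the notation differs.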
PS. I wrote software for DMD LED panels with scrolling text. A common task in that work is decoding messages with national letters: Cyrillic, Spanish, Turkish and even Hindi.
The entire printable ASCII sequence repeats every 256. Looks like pure C++ to me.
What I don't see is where to set the charset/encoding of the Serial Monitor to something other than UTF-8; and more importantly how that affects what you're actually trying to do.
Serial.printf("%c%c\n", 0xD0, 0xAF); // UTF-8
prints Capital Ya Я (you should pick Cyrillic examples that don't look like Latin ones)
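If UTF-8 output is acceptable, the two bytes in that example can be computed from the 16-bit code unit instead of being hard-coded. This is a minimal sketch: the name encodeUtf8 is hypothetical, and it assumes every code unit is a complete code point, so surrogate pairs (characters outside the Basic Multilingual Plane) are not handled.

```cpp
#include <stdint.h>

// Hypothetical helper: writes the UTF-8 encoding of one 16-bit code point into
// `out` (which must have room for 3 bytes) and returns the number of bytes used.
// Assumption: cp is not half of a surrogate pair.
uint8_t encodeUtf8(uint16_t cp, uint8_t *out) {
  if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
    out[0] = (uint8_t)cp;
    return 1;
  }
  if (cp < 0x800) {                       // 2 bytes: 110xxxxx 10xxxxxx
    out[0] = (uint8_t)(0xC0 | (cp >> 6));
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  out[0] = (uint8_t)(0xE0 | (cp >> 12));  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
  out[2] = (uint8_t)(0x80 | (cp & 0x3F));
  return 3;
}
```

For 0x042F (Я) this produces the bytes 0xD0 0xAF, matching the example above. All Cyrillic letters in the 0x0400 block fall in the two-byte case, so the UTF-8 output needs two bytes per letter compared with one in Win-1251.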