Decoding BigEndian to Win-1251 encoding

Hello everybody. My application reads SMS from a SIM800L module connected to an Arduino. I set text mode on the SIM800L module. When the SMS text contains non-Latin characters (Cyrillic in my case) I get the text as hex, encoded with BigEndian encoding:

AT+CMGR=1 +CMGR: "REC READ","+380954785245","","24/04/08,00:44:23+12",145,4,0,8,"+380672021111",145,160 042F0020043404370432043E043D043804320028043B04300029002004120430043C00200020000A043E002000320033003A00310033002C002000300037002E00300034002E00320034002E003B000A

The SMS body is at the end of this string.
Is there some way to decode BigEndian to Win-1251 (in the best case) or UTF-8 (in the worst case)? My sketch has a limit of 500 bytes of RAM (1024 bytes total).
Thanks, I will be glad for any answer. Otherwise I will be forced to decode it on a higher-level device, where I can use my knowledge of, for example, C#.

I am assuming you are using an UNO. Store a table of translated characters in EEPROM and read what is at the address of the incoming character. I did something similar many years ago. Here is a table: ASCII Table for Windows-1251: ASCII Code Reference. I believe you will need to start at 128 (0x80); the first 127 are the same.

I use a Pro Mini. My whole EEPROM is used for saving the phone numbers that are allowed to work with my SIM800L.
I did something similar at university in 2006, but that was a 1-byte encoding, and here it is BigEndian (a 2-byte encoding).

unsigned char* ToCyrilic(const char* st)
{
    int length = strlen(st);
    unsigned char* p = new unsigned char[length + 1];  // caller must delete[]
    int i;
    for (i = 0; i < length; i++)
    {
        if ((st[i] >= -64) && (st[i] <= -17))
            p[i] = (char)((int)st[i] + 64 + 128);
        else if ((st[i] >= -16) && (st[i] <= 0))
            p[i] = (char)((int)st[i] + 16 + 224);
        else if ((st[i] == '³') || (st[i] == 'i'))
            p[i] = 'i';
        else if ((st[i] == 'I') || (st[i] == '²'))
            p[i] = 'I';
        else if (st[i] == 'ª')
            p[i] = (char)242;
        else if (st[i] == 'º')
            p[i] = (char)243;
        else if (st[i] == '¯')
            p[i] = (char)244;
        else if (st[i] == '¿')
            p[i] = (char)245;
        else if (st[i] == '¹')
            p[i] = (char)252;
        else
            p[i] = st[i];
    }
    p[i] = '\0';
    return p;
}

I'm not sure that it will work

You can add external FRAM or EEPROM if you want. If you have an RTC module, there is often EEPROM on that board.

Sorry, but the word "BigEndian" says nothing about the encoding method; it refers to byte order only.
Presumably you meant UTF-16BE encoding.

This link could be useful for understanding:

Sure. The UTF-16 and Win-1251 codes have a direct correspondence to each other. The only thing you need is a function to translate one to the other, similar to yours in #3.

You could express the lookup table compactly in code (and with less indenting than what you had before)

unsigned char ToWindows1251(uint16_t unicode) {
  switch (unicode) {
    case 0 ... 127:  // ASCII (case ranges are a GCC extension)
    case 0x98:  // unused, return unchanged as C1 control code
      return (unsigned char) unicode;
    case 0x0402: return 0x80; // Ђ
    case 0x0403: return 0x81; // Ѓ
    case 0x201A: return 0x82; // ‚ (same as 1252)
    case 0x0453: return 0x83; // ѓ
    case 0x201E: return 0x84; // „ (same as 1252)
    case 0x2026: return 0x85; // … (same as 1252)
    case 0x2020: return 0x86; // † (same as 1252)
    case 0x2021: return 0x87; // ‡ (same as 1252)
    case 0x20AC: return 0x88; // € (0x80 in 1252)
    case 0x2030: return 0x89; // ‰ (same as 1252)
    case 0x0409: return 0x8a; // Љ
    case 0x2039: return 0x8b; // ‹ (same as 1252)
    case 0x040a: return 0x8c; // Њ
    case 0x040c: return 0x8d; // Ќ
    case 0x040b: return 0x8e; // Ћ
    case 0x040f: return 0x8f; // Џ
    //
    // TODO return 0x90 through 0xbf (except 0x98, handled above)
    //
    case 0x0410 ... 0x044f:
      // 	А 	Б 	В 	Г 	Д 	Е 	Ж 	З 	И 	Й 	К 	Л 	М 	Н 	О 	П
      // 	Р 	С 	Т 	У 	Ф 	Х 	Ц 	Ч 	Ш 	Щ 	Ъ 	Ы 	Ь 	Э 	Ю 	Я
      // 	а 	б 	в 	г 	д 	е 	ж 	з 	и 	й 	к 	л 	м 	н 	о 	п
      // 	р 	с 	т 	у 	ф 	х 	ц 	ч 	ш 	щ 	ъ 	ы 	ь 	э 	ю 	я
      return (unsigned char) (unicode - 0x410 + 0xc0);  // 0xc0 ... 0xff
  }
  return 127;  // DEL as "unsupported"
}

I left a TODO in the middle. It might take half an hour to fill in. The character-set table at Wikipedia requires hovering to see the code point. You might find something better.

So convert each group of four hex digits, e.g. 0432, to a uint16_t and run it through the function.

I have this code in C#:

using System.Text;  // needed for Encoding

string data = "042F0020043404370432043E043D043804320028043B04300029002004120430043C00200020000A043E002000320033003A00310033002C002000300037002E00300034002E00320034002E003B000A";
int len = data.Length / 2;
byte[] array = new byte[len];
for (int i = 0; i < array.Length; i++)
{
  string tmp = data.Substring(i * 2, 2);
  array[i] = byte.Parse(tmp, System.Globalization.NumberStyles.HexNumber);
}
string text = Encoding.BigEndianUnicode.GetString(array);

I am trying to find a ready-made solution for the last line. Yes, I can look at the C# source of this function and rewrite it in C++, but I don't want to reinvent the wheel if it has already been invented.

I think that would be the fastest way to solve the issue. But it is your decision.

Your code is probably a conversion from UTF-8 and not from UTF-16BE.

You could take the data that you already have in C# and generate the C++ code, as I had it. Might also take about a half-hour :slight_smile: But then it is done and you don't have to process it elsewhere.

Those are different representations of the same code points -- the numeric value is the same. (In C source the 0x literals are big-endian, although not limited to 16 bits.)

No.
Please read the link in #5 for details.

Look at the last line of my code:

Encoding.BigEndianUnicode

For UTF-8 it must be Encoding.UTF8

Sorry,
message #9 was not addressed to you, but to @kenb4.

Yes. Capital Ya Я

  • is Unicode U+042F
  • the numeric value is 0x42F
  • in UTF-8, two bytes D0 AF
  • in UTF-16BE 04 2F (that one is easy)
  • in UTF-16LE 2F 04 (almost as easy)

You read the bytes of the encoding to arrive at a numeric value

$ echo -n "Я" | iconv -t utf-8 | hexdump
00000000  d0 af
00000002
$ echo -n "Я" | iconv -t utf-16BE | hexdump
00000000  04 2f
00000002
$ echo -n "Я" | iconv -t utf-16BE | iconv -f utf-16BE -t utf-32 | hexdump
00000000  ff fe 00 00 2f 04 00 00
00000008
$ echo -n "Я" | iconv -t utf-8 | iconv -f utf-8 -t utf-32 | hexdump
00000000  ff fe 00 00 2f 04 00 00
00000008

Look, there's also a BOM.

My "No" applies to your statement

You yourself prove that the statement is incorrect:

0xD0AF != 0x042F
:slight_smile:

Addendum:
however, I was wrong when I wrote that your code is for UTF-8.

It looks like "mission impossible". I tried to get the code of the Cyrillic 'А':

Serial.println((int)'А');

and get -12144.

Then I do

Serial.println((char)-12144);

and get nothing (empty output).
Then I moved on and wrote a for loop from -128 to 1024:

for(int i=-128; i<1024; i++)
  {
    Serial.print(i);
    Serial.print("-");
    Serial.println((char)i); 
  }

And I get a list of different characters, but the Cyrillic ones are missing.
Arduino is not C++. In pure C++ this sample worked fine. I'm disappointed.

I have found the same problem here

I have tested this way:

Serial.println((uint16_t)'А');

the result is 53392, or 0xD090.
That seems to be correct for the capital 'А' in UTF-8 encoding:

The table:
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024&number=128&names=2&utf8=0x

PS. I wrote software for DMD LED panels with scrolling text. In that work, a common task is decoding messages with national letters - Cyrillic, Spanish, Turkish and even Hindi :slight_smile:

Try adding

Serial.printf("sizeof(char) %d\n", sizeof(char));

I get

sizeof(char) 1

and the loop prints both

100-d

and

356-d

The entire printable ASCII sequence repeats every 256. Looks like pure C++ to me.

What I don't see is where to set the charset/encoding of the Serial Monitor to something other than UTF-8; and more importantly how that affects what you're actually trying to do.

Serial.printf("%c%c\n", 0xD0, 0xAF);  // UTF-8

prints Capital Ya Я (you should pick Cyrillic examples that don't look like Latin ones)

that's how it should be, isn't it?

ya