Macro to convert utf-8

artem_on · March 2, 2024, 9:20pm

Hello!
I want to store a bunch of strings in program memory (flash). They use only a subset of UTF-8 characters, and to save memory I'd like to convert them to a 1-byte encoding occupying codes, let's say, 128-200.
Of course, I could encode them with some sort of converter and paste the result into my code.
But I'd much prefer a macro placed just before each UTF-8 string that produces the required array of single-byte character codes at compile time.
I tried to write that macro with the help of ChatGPT:

#define CUSTOM_ENCODING(c) ((c) + 0x80)

#define UTF8_TO_BYTE(utf8String) \
    ([] {                                      \
        const char* utf8 = utf8String;         \
        while (*utf8) {                        \
            if ((*utf8 & 0xC0) == 0xC0) {     \
                return CUSTOM_ENCODING((static_cast<unsigned char>(utf8[0]) & 0x0F) << 6 | \
                                         (static_cast<unsigned char>(utf8[1]) & 0x3F));   \
                utf8 += 2;                     \
            } else {                            \
                return *utf8;                  \
                ++utf8;                        \
            }                                  \
        }                                      \
        return '\0';                           \
    }())

The macro assumes each UTF-8 character is 1-2 bytes long. That's not the important part.
If I get a working macro I will use PROGMEM to place the result in flash, but for now I'm just trying to print it.
This is the compiler error I get when I use the macro in code like somePrintFunc(UTF8_TO_BYTE(""));
Error: call to non-constexpr function '<lambda()>'

My knowledge of C++ is not enough in this case. So I wonder: is it even possible to create a macro that keeps only the converted byte codes in program memory and gets rid of the UTF-8 strings?

bobcousins · March 3, 2024, 5:27am

I would say that it is impossible.

artem_on · March 3, 2024, 1:21pm

Also, wasn't there an option in the Arduino IDE some time ago to choose the editor encoding? I can't find it now. It would solve the problem too.

noiasca · March 3, 2024, 1:58pm

what is the real problem you have?
UTF-8 characters in the source code (content of variables or literals) are usually no problem at all.

J-M-L · March 3, 2024, 2:07pm

C++ does not support recursive macros so I doubt you could write a general one for what you try to do.

how would that solve your problem? which encoding would you pick?

UTF8 only uses 1 byte for ASCII characters so there is a one to one mapping there but what would be your mapping for € à ç or 你好 ?

artem_on · March 3, 2024, 2:20pm

That's not really a problem. In short, I want to save some memory. Yeah, that's pretty much it.

artem_on · March 3, 2024, 2:20pm

I would pick Windows-1251, I guess (Cyrillic characters).
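
For the Russian alphabet, the Windows-1251 layout is regular enough to sketch as a compile-time function: U+0410 to U+044F (А to я) map to 0xC0 to 0xFF, with Ё and ё at 0xA8 and 0xB8. The helper name toWindows1251 is hypothetical and not part of any library. Every other non-ASCII character is deliberately collapsed to '?', so this is an illustration under those assumptions rather than a complete code page table.

```cpp
// Hypothetical helper (not from any library): map a Unicode code point to
// its Windows-1251 byte. Only ASCII and the Russian alphabet are covered;
// anything else becomes '?'. Single return statement, so C++11 is enough.
constexpr unsigned char toWindows1251(unsigned codePoint) {
  return static_cast<unsigned char>(
      codePoint < 0x80                             ? codePoint                  // ASCII is unchanged
      : (codePoint >= 0x0410 && codePoint <= 0x044F) ? codePoint - 0x0410 + 0xC0  // А..я
      : codePoint == 0x0401                        ? 0xA8u                      // Ё
      : codePoint == 0x0451                        ? 0xB8u                      // ё
                                                   : static_cast<unsigned>('?'));
}
```

A target font would then need its glyphs arranged in the same order, which is an assumption about the display library and not something this function checks.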

J-M-L · March 3, 2024, 3:21pm

OK - not possible I think

artem_on · March 4, 2024, 11:42am

So at the moment I have a semi-solution. It really does reduce program memory usage:


#define B(c1, c2) (static_cast<char>(128 + ((static_cast<unsigned char>(c1) << 6)|(static_cast<unsigned char>(c2) & 0x3F))))

#define BS_2(s) {B(s[0], s[1]), B(s[2], s[3]), '\0'}
#define BS_3(s) {B(s[0], s[1]), B(s[2], s[3]), B(s[4], s[5]), '\0'}
#define BS_4(s) {B(s[0], s[1]), B(s[2], s[3]), B(s[4], s[5]), B(s[6], s[7]), '\0'}
...

constexpr char my_ascii[] = BS_13("some unicode of 13 character length");
...

u8x8.drawString(0, 0, my_ascii);

I wonder if I can pack that bunch of macros into a single one.
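
One way the length-specific macros might be folded into a single one is sketched here, under several assumptions: the compiler accepts C++14 constexpr functions (an older default language standard may need changing), and every byte of 0xC0 or above starts a 2-byte sequence. The names Encoded, glyphCount, encode and UTF8_TO_BYTES are hypothetical. The mapping (0x80 plus the low 7 bits of the code point) is only similar in spirit to the B macro, not a standard code page.

```cpp
#include <cstddef>

// Hypothetical holder for the converted text: Len glyphs plus a '\0'.
template <std::size_t Len>
struct Encoded {
  char data[Len + 1];
};

// Count glyphs at compile time. Assumption: a byte >= 0xC0 always starts a
// 2-byte sequence; 3- and 4-byte sequences are NOT handled.
template <std::size_t N>
constexpr std::size_t glyphCount(const char (&s)[N]) {
  std::size_t count = 0;
  for (std::size_t i = 0; i + 1 < N; ++count) {
    i += (static_cast<unsigned char>(s[i]) >= 0xC0) ? 2 : 1;
  }
  return count;
}

// Convert each glyph to one byte: ASCII is copied, a 2-byte sequence becomes
// 0x80 | (low 7 bits of its code point).
template <std::size_t Len, std::size_t N>
constexpr Encoded<Len> encode(const char (&s)[N]) {
  Encoded<Len> out{};  // zero-initialised, so the terminator is already there
  std::size_t i = 0;
  for (std::size_t g = 0; g < Len; ++g) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead >= 0xC0) {
      const unsigned char trail = static_cast<unsigned char>(s[i + 1]);
      const unsigned codePoint = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
      out.data[g] = static_cast<char>(0x80u | (codePoint & 0x7Fu));
      i += 2;
    } else {
      out.data[g] = static_cast<char>(lead);
      i += 1;
    }
  }
  return out;
}

// The single macro: the glyph count is computed from the literal itself.
#define UTF8_TO_BYTES(s) encode<glyphCount(s)>(s)

// Source file assumed to be saved as UTF-8.
constexpr auto greeting = UTF8_TO_BYTES("АБ2");
static_assert(sizeof(greeting.data) == 4, "3 glyphs + terminator");
```

Because glyphCount(s) is used as a template argument, the array is sized for the converted text, and greeting.data can be passed wherever a C string is expected. Whether the original UTF-8 literal disappears from the binary depends on the compiler actually evaluating everything at compile time, which declaring the variable constexpr should enforce.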

J-M-L · March 4, 2024, 1:58pm

The collection of BS_xxx macros is a hack to unroll a recursion, as mentioned previously.

It's also ugly because you need a separate macro for each text length...

Also, it won't work in general because you need to consume the bytes according to the UTF-8 specification: some glyphs take only 1 byte (like ASCII chars), while others take 2, 3, or 4 bytes.

run this example to see the glyphs and associated bytes (calculated at run time)

you should see

Character: A 1 byte 0x41
Character: B 1 byte 0x42
Character: C 1 byte 0x43
Character: D 1 byte 0x44
Character: µ 2 bytes 0x0C2 0x0B5
Character: € 3 bytes 0x0E2 0x082 0x0AC
Character: à 2 bytes 0x0C3 0x0A0
Character: ü 2 bytes 0x0C3 0x0BC
Character: 你 3 bytes 0x0E4 0x0BD 0x0A0
Character: 好 3 bytes 0x0E5 0x0A5 0x0BD
Character: ( 1 byte 0x28
Character: n 1 byte 0x6E
Character: ǐ 2 bytes 0x0C7 0x090
Character: 1 byte 0x20
Character: h 1 byte 0x68
Character: ǎ 2 bytes 0x0C7 0x08E
Character: o 1 byte 0x6F
Character: ) 1 byte 0x29
Character: 😀 4 bytes 0x0F0 0x09F 0x098 0x080

The message [ABCDµ€àü你好(nǐ hǎo)😀] has 33 bytes and 19 glyphs.


/* ============================================
  code is placed under the MIT license
  Copyright (c) 2023 J-M-L
  For the Arduino Forum : https://forum.arduino.cc/u/j-m-l

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
  in the Software without restriction, including without limitation the rights
  to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
  copies of the Software, and to permit persons to whom the Software is
  furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in
  all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  THE SOFTWARE.
  ===============================================
*/

const char * message = "ABCDµ€àü你好(nǐ hǎo)😀";

void printUTF8Bytes(const char *str) {

  size_t currentPos = 0;
  size_t glyphsCount = 0;
  while (str[currentPos] != '\0') {
    unsigned char c = str[currentPos];
    int8_t glyphBytes = 0;
    glyphsCount++;

    if ((c & 0x80) == 0)         glyphBytes = 1;  // 1 byte UTF-8 character
    else if ((c & 0xE0) == 0xC0) glyphBytes = 2;  // 2 bytes UTF-8 character
    else if ((c & 0xF0) == 0xE0) glyphBytes = 3;  // 3 bytes UTF-8 character
    else if ((c & 0xF8) == 0xF0) glyphBytes = 4;  // 4 bytes UTF-8 character
    else                         glyphBytes = -1; // Invalid UTF-8 character

    if (glyphBytes > 0) {
      Serial.print(F("Character: ")); Serial.write(&(str[currentPos]), glyphBytes );
      Serial.write('\t');
      Serial.print(glyphBytes);
      Serial.print(glyphBytes > 1 ? F(" bytes\t") : F(" byte\t"));
      for (byte i = 0; i < glyphBytes; i++) {
        Serial.print(F("0x"));
        if (str[currentPos + i] < 0x10) Serial.write('0');
        Serial.print((byte) str[currentPos + i], HEX);
        Serial.write(' ');
      }
      Serial.println();
    } else {
      Serial.println(F("Invalid UTF-8 character"));
    }

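    // Advance past the glyph; on an invalid lead byte skip a single byte
    // so the position can never step backwards.
    currentPos += (glyphBytes > 0) ? glyphBytes : 1;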
  }

  Serial.print(F("-------------------\nThe message ["));
  Serial.print(str);
  Serial.print(F("] has "));
  Serial.print(strlen(str));
  Serial.print(F(" bytes and "));
  Serial.print(glyphsCount);
  Serial.print(F(" glyphs."));

}

void setup() {
  Serial.begin(9600);
  printUTF8Bytes(message);
}

void loop() {}

artem_on · March 4, 2024, 6:08pm

I just want to make a few remarks.
As I mentioned, I need only a subset of UTF-8 of at most 128 characters, and I made sure they are all encoded in 2 bytes.
This is not really executable code; it is a compile-time routine that helps me get constants known in advance in the most readable way.
To get the correct constant when mixing raw ASCII chars with 2-byte UTF-8 chars, I will use an escape like '0', or any character whose 2 lowest bits are zero, i.e. "абв02" instead of just "абв2".
It is safe if I pick the wrong macro. If I give it a much longer string, it returns a substring. If I give it a string that is too short, it is a compiler error.
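
The "all characters are 2 bytes" assumption could also be checked by the compiler instead of by eye. Here is a minimal sketch, assuming C++14 constexpr support and a source file saved as UTF-8; the function name isAllTwoByteUtf8 is hypothetical.

```cpp
#include <cstddef>

// Hypothetical check: true when the literal consists only of complete
// 2-byte UTF-8 sequences (lead byte 110xxxxx, continuation byte 10xxxxxx).
template <std::size_t N>
constexpr bool isAllTwoByteUtf8(const char (&s)[N]) {
  if ((N - 1) % 2 != 0) return false;  // an odd byte count cannot be all pairs
  for (std::size_t i = 0; i + 1 < N; i += 2) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    const unsigned char trail = static_cast<unsigned char>(s[i + 1]);
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) return false;
  }
  return true;
}

static_assert(isAllTwoByteUtf8("абв"), "expected only 2-byte characters");
static_assert(!isAllTwoByteUtf8("абв2"), "a bare ASCII digit breaks the pairing");
```

Note that this strict check would also reject the escaped form "абв02", because the '0' '2' pair is not a real UTF-8 sequence, so it only suits strings without such escapes.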

J-M-L · March 4, 2024, 6:46pm

So it’s not a general solution… how much text do you have to store in flash? Is it worth it?

artem_on · March 4, 2024, 7:10pm

After adding a couple of fonts I found that 98% of the ATmega328's program space is already in use. Hundreds of bytes, I suppose. Yeah, I think it's worth it. Thank you!

J-M-L · March 4, 2024, 8:22pm

time to go to something else than the good old atmega328
