Hello!
I want to store in program memory (flash) a bunch of strings that use only a subset of UTF-8 characters, converted to a 1-byte encoding occupying codes, say, 128-200, in order to save some memory.
Of course, I could encode them with some sort of converter and paste the result into my code.
But I would much prefer a macro in my code, applied to each UTF-8 string, that produces the required array of single-byte character codes at compile time.
I tried to write that macro with the help of ChatGPT:
The macro assumes each UTF-8 character is 1-2 bytes long. That is not the important part.
Once I get a working macro I will use PROGMEM to place the result in flash, but for now I'm just trying to print it.
The compiler error I get when I use this macro in code like somePrintFunc(UTF8_TO_BYTE("")); is: Error: call to non-constexpr function '<lambda()>'
My knowledge of C++ is not enough in this case. So I wonder: is it even possible to create a macro that keeps only the converted byte codes in program memory and gets rid of the UTF-8 strings?
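For context on that error message: it suggests the macro wraps its work in a lambda that is called immediately. Before C++17 a lambda's call operator cannot be constexpr, and the Arduino AVR core, as far as I know, compiles sketches as gnu++11 by default. A named constexpr function avoids the problem. Below is a minimal sketch of the idea, not the original macro: `decodeTwoByte` is a hypothetical name, and it handles only a single 2-byte sequence.

```cpp
// Hypothetical helper (not the original macro): decodes one 2-byte UTF-8
// sequence into its code point. The body is a single return statement,
// so it is a valid constexpr function even under C++11.
constexpr unsigned int decodeTwoByte(unsigned char lead, unsigned char cont) {
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}

// Cyrillic 'а' (U+0430) is encoded as the bytes 0xD0 0xB0.
// The check runs entirely at compile time.
static_assert(decodeTwoByte(0xD0, 0xB0) == 0x0430, "decoded at compile time");
```

Because the function is usable in a constant expression, its result can initialize a constant array, which is the prerequisite for placing only the converted bytes in flash.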
The collection of BS_xxx macros is a hack to unwind a recursion. As mentioned previously, it's also ugly because you need a macro specific to the length of your text...
It also won't work, because you need to consume the bytes according to the UTF-8 specification: some glyphs take only 1 byte (like ASCII chars), others take 2, 3, or 4 bytes.
Run this example to see the glyphs and their associated bytes (calculated at run time). The last line it prints is:
The message [ABCDµ€àü你好(nǐ hǎo)😀] has 33 bytes and 19 glyphs.
/* ============================================
code is placed under the MIT license
Copyright (c) 2023 J-M-L
For the Arduino Forum : https://forum.arduino.cc/u/j-m-l
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
===============================================
*/
const char * message = "ABCDµ€àü你好(nǐ hǎo)😀";

void printUTF8Bytes(const char *str) {
  size_t currentPos = 0;
  size_t glyphsCount = 0;

  while (str[currentPos] != '\0') {
    unsigned char c = str[currentPos];
    int8_t glyphBytes = 0;
    glyphsCount++;

    // The lead byte tells how many bytes the glyph occupies
    if ((c & 0x80) == 0) glyphBytes = 1;          // 1 byte UTF-8 character
    else if ((c & 0xE0) == 0xC0) glyphBytes = 2;  // 2 bytes UTF-8 character
    else if ((c & 0xF0) == 0xE0) glyphBytes = 3;  // 3 bytes UTF-8 character
    else if ((c & 0xF8) == 0xF0) glyphBytes = 4;  // 4 bytes UTF-8 character
    else glyphBytes = -1;                         // Invalid UTF-8 character

    if (glyphBytes > 0) {
      Serial.print(F("Character: "));
      Serial.write(&(str[currentPos]), glyphBytes);
      Serial.write('\t');
      Serial.print(glyphBytes);
      Serial.print(glyphBytes > 1 ? F(" bytes\t") : F(" byte\t"));
      for (byte i = 0; i < glyphBytes; i++) {
        // Cast first: char is signed on AVR, so bytes >= 0x80 would
        // otherwise compare as negative and get a spurious leading '0'
        byte b = (byte) str[currentPos + i];
        Serial.print(F("0x"));
        if (b < 0x10) Serial.write('0');
        Serial.print(b, HEX);
        Serial.write(' ');
      }
      Serial.println();
      currentPos += glyphBytes;
    } else {
      Serial.println(F("Invalid UTF-8 character"));
      currentPos++;  // skip the offending byte so the loop always moves forward
    }
  }

  Serial.print(F("-------------------\nThe message ["));
  Serial.print(str);
  Serial.print(F("] has "));
  Serial.print(strlen(str));
  Serial.print(F(" bytes and "));
  Serial.print(glyphsCount);
  Serial.print(F(" glyphs."));
}

void setup() {
  Serial.begin(9600);
  printUTF8Bytes(message);
}

void loop() {}
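The same lead-byte rule can also be evaluated at compile time, which is closer to what the question asks for. This is only a sketch: `utf8SequenceLength` and `utf8GlyphCount` are hypothetical names, the input is assumed to be valid UTF-8, and the recursion is limited by the compiler's constexpr depth, so it suits short strings.

```cpp
#include <stddef.h>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
constexpr size_t utf8SequenceLength(unsigned char lead) {
    return (lead & 0x80) == 0x00 ? 1
         : (lead & 0xE0) == 0xC0 ? 2
         : (lead & 0xF0) == 0xE0 ? 3
         : (lead & 0xF8) == 0xF0 ? 4
         : 0;
}

// Counts glyphs by counting every byte that is NOT a continuation byte
// (10xxxxxx). Written recursively so it stays valid C++11 constexpr.
constexpr size_t utf8GlyphCount(const char* s) {
    return *s == '\0'
         ? 0
         : ((static_cast<unsigned char>(*s) & 0xC0) != 0x80 ? 1 : 0)
               + utf8GlyphCount(s + 1);
}

// 'A' is 1 byte, 'µ' is 0xC2 0xB5, '€' is 0xE2 0x82 0xAC: 6 bytes, 3 glyphs.
static_assert(utf8SequenceLength(0xE2) == 3, "euro sign starts a 3-byte sequence");
static_assert(utf8GlyphCount("A\xC2\xB5\xE2\x82\xAC") == 3, "3 glyphs in 6 bytes");
```

A fixed-stride macro has no way to express this: the step size depends on the value of each lead byte.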
I just want to make a few remarks.
As I mentioned, I need only a subset of UTF-8 of at most 128 characters, and I made sure they are all encoded in 2 bytes.
It is not really executable code; it is a compile-time routine that helps me get constants known in advance in the most readable way.
To obtain a correct constant when raw ASCII characters are mixed with 2-byte UTF-8 characters, I will put an escape in front of each ASCII character, such as '0' or any other character whose two lowest bits are zero, i.e. "абв02" instead of just "абв2".
It is safe even if I pick the wrong macro. If I give it a string that is too long, it returns a substring. If I give it a string that is too short, it is a compiler error.
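For comparison, here is a sketch of a different technique from the macro described above: one constexpr routine that takes the length from the string literal itself and accepts ASCII mixed with 2-byte characters without an escape. Everything in it is an assumption made for illustration: the names `PackedText` and `packText`, the mapping of U+0410..U+044F to codes 128..191, and a compiler set to C++14 or later (the loop inside a constexpr constructor is not allowed in C++11).

```cpp
#include <stddef.h>

// Hypothetical mapping, chosen only for illustration:
// U+0410 ('А') -> 128, U+0411 -> 129, ... U+044F ('я') -> 191.
constexpr unsigned int  kFirstCodePoint = 0x0410;
constexpr unsigned int  kLastCodePoint  = 0x044F;
constexpr unsigned char kFirstCode      = 128;

template <size_t N>
struct PackedText {
    unsigned char data[N] = {};  // zero-filled, so the packed text stays NUL-terminated
    size_t length = 0;           // number of packed codes written to data
    bool ok = true;              // false if an unsupported byte sequence was seen

    constexpr explicit PackedText(const char (&src)[N]) {
        size_t i = 0;
        while (i + 1 < N && src[i] != '\0') {
            const unsigned char lead = static_cast<unsigned char>(src[i]);
            const unsigned char next = static_cast<unsigned char>(src[i + 1]);
            if ((lead & 0x80) == 0) {
                data[length++] = lead;  // plain ASCII passes through unchanged
                i += 1;
            } else if ((lead & 0xE0) == 0xC0 && (next & 0xC0) == 0x80) {
                const unsigned int cp = ((lead & 0x1Fu) << 6) | (next & 0x3Fu);
                if (cp >= kFirstCodePoint && cp <= kLastCodePoint) {
                    data[length++] =
                        static_cast<unsigned char>(kFirstCode + (cp - kFirstCodePoint));
                } else {
                    ok = false;  // 2-byte character outside the mapped subset
                }
                i += 2;
            } else {
                ok = false;  // 3- or 4-byte sequence, or malformed input
                i += 1;
            }
        }
    }
};

// Deduces N from the literal, so no length-specific macro is needed.
template <size_t N>
constexpr PackedText<N> packText(const char (&src)[N]) {
    return PackedText<N>(src);
}

// "аб2" written as explicit bytes: 0xD0 0xB0, 0xD0 0xB1, then ASCII '2'.
constexpr auto kDemo = packText("\xD0\xB0\xD0\xB1" "2");
static_assert(kDemo.ok, "unsupported character in string");
static_assert(kDemo.length == 3, "one packed code per glyph");
static_assert(kDemo.data[0] == 160 && kDemo.data[1] == 161 && kDemo.data[2] == '2',
              "packed codes");
```

The `static_assert` on `ok` turns a character outside the subset into a compile error. On AVR the constant would presumably still need a PROGMEM attribute and `pgm_read_byte` to live in and be read from flash; I have not verified that part on hardware.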
After adding a couple of fonts I found that 98% of the ATmega328's program space is already in use. The saving is hundreds of bytes, I suppose. Yeah, I think it's worth it. Thank you!