How long can a string variable be?

String size is limited by program size and other variables used, i.e. memory size.

However, using String objects on the Arduino is NOT RECOMMENDED; use null-terminated character strings (char arrays) instead.

TimEllis:
I seem to be having trouble with long strings, i.e. String stringExample = ".. some long string ...";
I do not see an upper limit on the string length. What is the maximum length?

Your example actually contains two strings:

  1. One is a string literal ".. some long string ...", which is used as an initializer for stringExample object. That string literal will be persistently stored in SRAM, unless you explicitly relocate it to flash (F macro and/or PROGMEM).

  2. Another is the stringExample object which will perform dynamic memory allocation in SRAM at run-time and copy the original string literal into that allocated memory.

So, at run-time, as long as stringExample lives, you will have two copies of your string in memory. And if you keep increasing the length of the original string literal, you will eventually run out of memory. Which memory you run out of first, and at what length, depends on the above factors and on how much memory you have available. In your example everything is in SRAM and you will eventually run out of SRAM.

This also demonstrates that using such declarations with long string literals and long-lived String objects is not a good idea: you are effectively storing the same data twice. This might make sense if you have to make significant modifications to the copy, but it is not clear from your example.
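If the text never has to be modified, one way to avoid the second copy is to keep it as a const char array and read it in place. The following is only a minimal sketch: the names kExample, kExampleSize and countChar are made up for illustration, and the text is the placeholder from the question. On a real AVR board the array would still live in SRAM unless it is moved to flash with PROGMEM or F(), as mentioned above.

```cpp
#include <stddef.h>

// Hypothetical name; the text is the placeholder from the question.
const char kExample[] = ".. some long string ...";

// Size of the array in bytes, including the terminating '\0'.
// It is known at compile time, so no run-time allocation is involved.
const size_t kExampleSize = sizeof kExample;

// Counts occurrences of a character by reading the array in place,
// without first copying it into a String object.
size_t countChar(const char* text, char wanted) {
  size_t count = 0;
  for (; *text != '\0'; ++text) {
    if (*text == wanted) {
      ++count;
    }
  }
  return count;
}
```

With this layout there is exactly one copy of the characters, and its size is visible to the compiler.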


Also, from a bigger picture perspective:

Arduino UNO is a platform with 16-bit data addresses. This means that there will always be a hard upper limit of roughly 65536 bytes (strictly 65535, the largest value a 16-bit size_t can hold) imposed on any contiguous object size. However, due to some peculiarities of C/C++ implementations, C/C++ compilers usually limit the maximum object size to half of that (32767 bytes on a 16-bit platform).

It is not a good idea to use the String (capital S) class on an Arduino as it can cause memory corruption in the small memory on an Arduino. This can happen after the program has been running perfectly for some time. Just use cstrings - char arrays terminated with '\0' (NUL).

Using cstrings puts you in complete control of the amount of memory you allocate for the cstrings.
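As a rough illustration of that control, here is a small sketch with one fixed-size buffer. The buffer name label, its capacity kBufferSize and both helper functions are hypothetical; only strncpy and snprintf are standard library calls. The point is that the buffer never grows, and the code can detect when text did not fit.

```cpp
#include <stddef.h>
#include <stdio.h>
#include <string.h>

const size_t kBufferSize = 16;  // hypothetical capacity, chosen by you
char label[kBufferSize];        // the only memory this text will ever use

// Copies text into the fixed buffer, truncating if necessary and
// always leaving a terminating '\0'.
void setLabel(const char* text) {
  strncpy(label, text, kBufferSize - 1);
  label[kBufferSize - 1] = '\0';
}

// Formats a reading into the buffer. Returns false if the full text
// did not fit (snprintf still terminates the truncated result).
bool formatReading(int value) {
  int written = snprintf(label, kBufferSize, "Reading: %d", value);
  return written >= 0 && static_cast<size_t>(written) < kBufferSize;
}
```

Whether truncation is acceptable depends on the application; the return value of formatReading lets the caller decide.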

...R

That string literal will be persistently stored in SRAM, unless you explicitly relocate it to flash (F macro and/or PROGMEM).

That string literal will be stored in flash. It will be copied to SRAM at run time, unless the F() macro or PROGMEM keyword is used to stop that from happening.

PaulS:
That string literal will be stored in flash. It will be copied to SRAM at run time, unless the F() macro or PROGMEM keyword is used to stop that from happening.

True. On any platform all initialized data with static storage duration has to be stored somewhere "with the program" or "in the program" in one way or another. It does not have to be stored literally, but with string literals this will usually be the case.

Yet I'm presenting it from the point of view of the C or C++ language, which doesn't have a concept of "copying to SRAM". In C and C++ each string literal is a non-modifiable array object with static storage duration. Objects with static storage duration on Arduino reside in SRAM. How they got there is something the C and C++ languages don't concern themselves with.

Montmorency:
Arduino UNO is a platform with 16-bit data addresses. This means that there will always be a hard upper limit of roughly 65536 bytes (strictly 65535, the largest value a 16-bit size_t can hold) imposed on any contiguous object size. However, due to some peculiarities of C/C++ implementations, C/C++ compilers usually limit the maximum object size to half of that (32767 bytes on a 16-bit platform).

Arduino UNO is ATmega328P MCU based Learning Kit. The MCU contains 32768 bytes 'flash memory space' which from 'program codes storage point of view' is organized as 'word (2-byte) locations'. Thus, the MCU contains 16384 'word locations'. What is this: '65536 bytes' figure?

GolamMostafa:
Arduino UNO is ATmega328P MCU based Learning Kit. The MCU contains 32768 bytes 'flash memory space' which from 'program codes storage point of view' is organized as 'word (2-byte) locations'. Thus, the MCU contains 16384 'word locations'. What is this: '65536 bytes' figure?

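The '65536' figure is 2^16, the number of distinct values a 16-bit size_t can take. Strictly speaking, the largest representable size is one less: 65535, the value of the standard SIZE_MAX constant defined in <stdint.h> (as well as of the UINTPTR_MAX constant)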

#define SIZE_MAX UINT16_MAX

It simply means that the standard size_t is a 16-bit type in this implementation. This constant is the upper limit for a contiguous object size in any C or C++ implementation. This upper limit is not guaranteed to be "tight", i.e. it's not guaranteed to be reachable (and normally it is not, as I said above). But it is still an inherent limitation of each C or C++ platform.

It has nothing to do with how much memory you have in your Arduino UNO. It refers to the assumptions and restrictions our avr-gcc compiler relies upon when generating code.
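The arithmetic behind these figures can be checked with a few lines of code. This is only an illustrative sketch: unsignedMax and signedMax are hypothetical helpers, and the assumption that the "half" limit corresponds to the largest value of a signed 16-bit type (such as ptrdiff_t on this platform) is my reading of the "peculiarities" mentioned above.

```cpp
#include <stdint.h>

// Largest value an unsigned type of the given width can hold
// (valid for widths below 32 bits).
constexpr uint32_t unsignedMax(unsigned bits) {
  return (uint32_t{1} << bits) - 1;
}

// Largest value a signed two's-complement type of the given width
// can hold (valid for widths from 1 to 32 bits).
constexpr uint32_t signedMax(unsigned bits) {
  return (uint32_t{1} << (bits - 1)) - 1;
}

// A 16-bit size_t can describe objects of at most 65535 bytes.
static_assert(unsignedMax(16) == UINT16_MAX, "matches UINT16_MAX");
static_assert(unsignedMax(16) == 65535, "2^16 - 1");

// A 16-bit signed type tops out at 32767, half of the range above.
static_assert(signedMax(16) == INT16_MAX, "matches INT16_MAX");
static_assert(signedMax(16) == 32767, "2^15 - 1");
```

So 65536 is the count of representable values, while 65535 and 32767 are the largest sizes that actually fit in the unsigned and signed 16-bit types.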

The question got me wondering how GCC would handle this kind of declaration at namespace scope

char str[256] = "ab";

An optimizing compiler might be smart enough to realize that the above initialized data is predominantly zeros and that the best way to implement this would be to mark str as zero-initialized data (which occupies nothing in compiled code/in flash) and then simply do str[0] = 'a'; str[1] = 'b'; in SRAM at program startup.

However, a quick experiment shows that avr-gcc takes a lazy way out: the whole 256 byte "image" of the future string is plainly and openly stored in program (flash) memory. I.e. changing the array size will change the compiled program size by the same amount. This is surprisingly lazy.

This points to a manual optimization opportunity (with regard to program size, i.e. flash usage):

When you have something like the above in your code, i.e. a static array which is mostly initialized with zeros, a more optimal way to initialize it would be

char str[256];

void setup()
{
  str[0] = 'a'; 
  str[1] = 'b';
}

In this version the "image" of the array will not be present in the flash memory and the program size will not be affected by the array size.


Wow, even this

unsigned char a[512] = {};

makes the program size grow by the array size. This is sloppy.

This is surprisingly lazy.

It is necessary. You have declared an array of 256 elements. The compiler has no way of knowing that you will never use all of them.

PaulS:
It is necessary. You have declared an array of 256 elements. The compiler has no way of knowing that you will never use all of them.

This has nothing to do with whether I will use them or not. In both cases I get a full array in SRAM and I can use all elements, if I wish.

The question is about how this array is initially created for me at program startup.

When I declare

unsigned char a[512];

this array is implicitly zero-initialized. The compiler is smart enough to realize that creation of this array in SRAM can be "compressed" into a very efficient and very compact sequence of steps at program startup:

  1. Make sure a 512-byte block is allocated in SRAM for a.
  2. Use something like memset(a, 0, 512) to fill this memory with zeros.
    Done.

This results in very compact code. Note that if I change the array size to 256 or to 1024, the program's code size will not change at all (try it). In the above sequence of steps 512 will simply get replaced with 256 or with 1024. This is how it should be.

Now, if I do

unsigned char a[512] = { 42 };

the program size suddenly jumps up. Even though it is still very easy to "compress" the initialization code for this array to something very very very compact, avr-gcc now insists on including the "image" of this array into the compiled program code: 42 followed by 511 bytes of zeros. The compiled program size consequently jumps up.

There's no need to do that. This array can be initialized efficiently. But avr-gcc refuses to.

I understand that such optimization is not a priority on "large" platforms. But on microcontrollers it would be nice to have.
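The two startup behaviors described above can be modeled on any machine with a pair of small functions. This is a simplified model, not the actual avr-gcc startup code; the names clearRegion and copyImage are hypothetical, and flashImage here is just an ordinary array standing in for data stored in flash.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Model of zero-initialization: only the destination and its size are
// needed, so nothing proportional to the array size is stored in flash.
void clearRegion(uint8_t* sram, size_t size) {
  memset(sram, 0, size);
}

// Model of explicit initialization as currently done: a full image of
// the array (size bytes) must exist in flash so it can be copied.
void copyImage(uint8_t* sram, const uint8_t* flashImage, size_t size) {
  memcpy(sram, flashImage, size);
}
```

In the first case the flash cost is independent of the array size; in the second it grows byte for byte with the array, which is the jump in program size observed above.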

and what is your proposal?

arduino_new:
and what is your proposal?

A proposal? OK...

Here are two sketches. These are contrived examples that just make use of an array and make sure it is not optimized out. Let's watch the compiled program size and global data size.

Number 1:

const size_t SIZE = 1024;
unsigned char a[SIZE] = { 42 };

void setup()
{
  for (unsigned n = 10; n > 0; --n)
    a[random(SIZE)] = random(100);

  unsigned sum = 0;
  for (unsigned i = 0; i < SIZE; ++i)
    sum += a[i];

  Serial.begin(9600);
  Serial.println(sum);
}

void loop() {}

Sketch uses 3226 bytes (10%) of program storage space. Maximum is 32256 bytes.
Global variables use 1216 bytes (59%) of dynamic memory, leaving 832 bytes for local variables. Maximum is 2048 bytes.

Number 2:

const size_t SIZE = 1024;
unsigned char a[SIZE];

void setup()
{
  a[0] = 42;
  
  for (unsigned n = 10; n > 0; --n)
    a[random(SIZE)] = random(100);

  unsigned sum = 0;
  for (unsigned i = 0; i < SIZE; ++i)
    sum += a[i];

  Serial.begin(9600);
  Serial.println(sum);
}

void loop() {}

Sketch uses 2208 bytes (6%) of program storage space. Maximum is 32256 bytes.
Global variables use 1216 bytes (59%) of dynamic memory, leaving 832 bytes for local variables. Maximum is 2048 bytes.


Both sketches are functionally equivalent. Both sketches use exactly the same amount of data. However, the program size of the first sketch is 1018 bytes larger.

My proposal (as you called it) is: when the avr-gcc compiler is invoked with the -Os option (optimize for size), it should be smart enough to follow the same strategy as I did in sketch #2. It should be smart enough to generate 2208 bytes of code, instead of the 3226 bytes it generates now.

Montmorency:
My proposal (as you called it) is: when the avr-gcc compiler is invoked with the -Os option (optimize for size), it should be smart enough to follow the same strategy as I did in sketch #2. It should be smart enough to generate 2208 bytes of code, instead of the 3226 bytes it generates now.

That's not a proposal. What I meant is: how do you suggest the compiler writers handle this specific case?

arduino_new:
That's not a proposal. What I meant is: how do you suggest the compiler writers handle this specific case?

Oh, easy:

  1. The compiler writers look at the initializers supplied for an aggregate A (array or class) and calculate what percentage of the initialized aggregate will end up being non-zero.

     For example: int A[128] = { 542, 1234 }; Only about 1.6% of this aggregate is initially non-zero.

  2. If the percentage calculated at step 1 is smaller than some threshold value, then the following initialization strategy is used at program startup:
  • Zero-out the whole aggregate in SRAM: e.g. memset (&A, 0, sizeof A)
  • Then place the required non-zero values in the proper location of the aggregate in SRAM. E.g. A[0] = 542; A[1] = 1234.
    Done.
  3. If the percentage calculated at step 1 is greater than the threshold value, then the currently implemented initialization strategy is used:
  • The full "image" of the future array is stored in program's code in flash.
  • At startup this "image" is copied from flash to SRAM.
    Done.
  4. A better decision strategy for choosing between steps 2 and 3 is not a set threshold value, but rather a size comparison between a) size of the code generated for step 2, and b) size of "image" generated for step 3. Choose whichever is smaller.

That's all.
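The size comparison suggested in the last step can be sketched as ordinary code. This is only a toy model of the decision, not anything avr-gcc actually does: InitStrategy, countNonZero and chooseStrategy are hypothetical names, and bytesPerStore (the code size of storing one constant byte) is an assumed cost that the caller supplies. The zero-fill itself is treated as free, on the assumption that the startup code already contains a loop that clears zero-initialized data.

```cpp
#include <stddef.h>
#include <stdint.h>

enum class InitStrategy { ZeroThenPatch, CopyImage };

// Counts the bytes of the initial image that are not zero.
size_t countNonZero(const uint8_t* image, size_t size) {
  size_t count = 0;
  for (size_t i = 0; i < size; ++i) {
    if (image[i] != 0) {
      ++count;
    }
  }
  return count;
}

// Compares the estimated code size of patching the non-zero bytes
// against the size of a full image, and picks the smaller one.
// bytesPerStore is an assumed per-byte code cost, not a measured value.
InitStrategy chooseStrategy(const uint8_t* image, size_t size,
                            size_t bytesPerStore) {
  size_t patchCodeSize = countNonZero(image, size) * bytesPerStore;
  return patchCodeSize < size ? InitStrategy::ZeroThenPatch
                              : InitStrategy::CopyImage;
}
```

For an array like unsigned char a[512] = { 42 } and any plausible per-store cost, the model picks ZeroThenPatch; for a small, densely initialized array it falls back to CopyImage.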

Cool, let us sit here and wait for the compiler writer people to come and fetch this idea.

arduino_new:
Cool, let us sit here and wait for the compiler writer people to come and fetch this idea.

Oh, no need to fetch anything. This idea has been around for quite a while. It is very well known to "compiler writer people". Nothing revolutionary here. It's just that it is unlikely to become a priority in the near future.

My posts above are not intended for "compiler writer people". They simply demonstrate how, given the current behavior of the compiler, a well-meaning use of a seemingly innocent language feature might result in an unjustified jump in program code size. And how it can be worked around.

Have you disassembled the code?

bidouilleelec:
Have you disassembled the code?

Disassemble? For what?

One can inspect the sizes of the .text, .data and .bss segments in the compiled code using avr-size to see what is going on.

But the information about program and data size you see after successful compilation provides pretty much the same thing (since avr-size is exactly what Arduino IDE uses to obtain these values).