Using multiple files increases object size?

While restructuring my first non-trivial Arduino program, I noticed that using multiple files increases the size of the binary. Attached is a very simple demo in two flavors. SizeTest-1file.zip has everything together in SizeTest.ino; it compiles to 2756 bytes. SizeTest-2files.zip moves the data and function bodies to a new file Module.cpp, and puts just the declarations in SizeTest.ino; it compiles to 2812 bytes. That's 56 bytes more for no functional difference. I wonder what causes the increase? I would expect that once everything is linked together, there would be no significant difference in object size (maybe a few bytes for alignment or so).

SizeTest-1file.zip (484 Bytes)

SizeTest-2files.zip (727 Bytes)

The code for the 2-file version is guaranteed not to compile; I didn't look at the 1-file version. If you had the two examples in the same folder, then your results are not what you think.

Of course I didn't mix them together in the same folder. Each represents an independent example: you can unzip one, build it, note the object file size, delete the SizeTest folder, then unzip the other one, build that and note its object file size too. My numbers were for a Teensy 2.0 board, and both versions compile fine as is. I've noticed, however, that when you switch the board to, say, a Diecimila, both versions need to #include <string.h>. This is probably what you meant by "guaranteed not to compile". Teensyduino probably adds some stuff behind the scenes that makes it work.

The question still stands, however: with the #include added and a Diecimila as the target board, I get 466 bytes for the 1-file version and 530 for the 2-file version, a difference of 64 bytes.

What I'm saying is that your code will not compile.

No matter how many '#include <string.h>' lines you give it, it still will not compile; there is in fact no executable code there.

#include <WProgram.h> //Not 1.0 compatible.

extern char data[];                   // Great that this is declared extern, but where is its definition?
int mysum(int x, int y, int z);       // This is a function prototype with no definition; you will get an 'undefined reference' error.
int mylength(char* x);                // Ditto

void setup()
{
    int length = mylength(data);          // I know what to pass to mylength, but where is its code?
    int answer = mysum(length, 42, 69);   // Ditto
}

void loop()
{
}

Those two functions are implemented in the .cpp file that is included in the zip. What is missing, though, is anything that makes the IDE actually compile that file: there is no header file matching the source file, and no such header is included in the sketch.

hackdog:
While restructuring my first non-trivial Arduino program, I noticed that using multiple files increases the size of the binary. Attached is a very simple demo in two flavors. SizeTest-1file.zip has everything together in SizeTest.ino; it compiles to 2756 bytes. SizeTest-2files.zip moves the data and function bodies to a new file Module.cpp, and puts just the declarations in SizeTest.ino; it compiles to 2812 bytes. That's 56 bytes more for no functional difference. I wonder what causes the increase? I would expect that once everything is linked together, there would be no significant difference in object size (maybe a few bytes for alignment or so).

Without looking at your files, which others say don't compile at all, this doesn't totally surprise me. The compiler optimizes to quite a high level, and compiling in separate files makes it harder to do so. With everything in one file, the compiler might remove code that it sees isn't used. The linker, however, isn't quite as smart as that: it might omit whole files, or maybe complete functions inside a file, but it doesn't really have the same ability to optimize that the compiler has.

Besides, 56 bytes isn't a lot. That is about 30 instructions. If you lost 1000 bytes I would be more worried.

I find it quite amusing how everyone is inferring various things about my examples obviously without bothering to try them or even looking at them in full... It's not like I "made up" those file sizes. But remember, you don't have to use them. Feel free to make your own example and look at the binary size before and after splitting a file up in multiple parts. I was merely providing an example to clarify my point.

To conclude, I think Nick is right; the compiler can probably optimize a single complete file better than multiple partial files which are then linked together. It's not that I'm worried about it in any way, it just surprised me that there was a discrepancy at all: since the two versions are functionally equivalent, I assumed that the object file would be the sum of its constituent parts too, give or take a few bytes. Apparently that is not the case, but the difference is not large enough to be troublesome.

@PaulS: well, for simplicity, I didn't include a header file but inserted the forward declarations verbatim at the top of the .ino. That section would normally be in a .h, yes. The .cpp doesn't need them in this case; there are no interdependencies so the definitions are sufficient. To rule out the possibility that the IDE actively checks for the presence of a .h file, I took SizeTest-2files, put the 3 declarations in a Module.h and #included that in SizeTest.ino and Module.cpp. It compiles fine and has the same byte size as the "original" SizeTest-2files.

@pYro_65: thanks for annotating my code. Given that there are only two files inside SizeTest-2files.zip, isn't it obvious that the other file might contain the definitions you were looking for? :slight_smile: You're right about WProgram.h though, that should become Arduino.h. Searching through my Arduino installation directories reveals just one WProgram.h, used by the Teensy core. Since my target was originally a Teensy, that's why I didn't get any compilation errors.

hackdog:
@PaulS: well, for simplicity, I didn't include a header file but inserted the forward declarations verbatim at the top of the .ino. That section would normally be in a .h, yes. The .cpp doesn't need them in this case; there are no interdependencies so the definitions are sufficient. To rule out the possibility that the IDE actively checks for the presence of a .h file, I took SizeTest-2files, put the 3 declarations in a Module.h and #included that in SizeTest.ino and Module.cpp. It compiles fine and has the same byte size as the "original" SizeTest-2files.

OK. That was a necessary step in locating the file size discrepancy. I guess, then, that the IDE copies and compiles all code in the sketch directory, used or not.

Thank you for confirming that.

In both cases, the values returned by the function calls are stored in local variables. In the all-code-in-one-file scenario, the compiler can see that the local variable is never used anywhere, and that the function call itself does not modify any global variables, so, it is likely that the function call is optimized away.

In the multiple file situation, the compiler can not tell that there are no side-effects from the function calls, so they can not be optimized away.

hackdog:
@pYro_65: thanks for annotating my code. Given that there are only two files inside SizeTest-2files.zip, isn't it obvious that the other file might contain the definitions you were looking for? You're right about WProgram.h though, that should become Arduino.h. Searching through my Arduino installation directories reveals just one WProgram.h, used by the Teensy core. Since my target was originally a Teensy, that's why I didn't get any compilation errors.

No, the compiler is not a magician; I think you are misusing or misunderstanding the IDE you are using. How does Module.cpp get included???

You can put your code in a billion different files, but if you do not include them, they are not going to be magically added for you.
Multiple files can be optimised just as much as single files.

If your compiler is actually magic, Module.cpp must be included after the code in the .pde; then the definition is not available until after the function call and therefore cannot be optimised.

If you provide the definition at compile time, it can be optimised. If it is in a separate unit, it has to be linked in.

Each .cpp file creates its own translation unit; without a header specifying the linkage, its code cannot be pulled in, so something unusual is happening for you. If you rely on this mechanism, it will eventually cause pain.

pYro_65, I think I now understand the cause of the confusion here. Let me explain.

From the perspective of a simple command-line build with makefiles, the compiler is of course indeed not a magician. It must be told exactly which files should be compiled. When all files are compiled, avr-gcc can link them together and produce the final object file. Again this is something that must be told explicitly to avr-gcc. If there are no other processes going on, you are correct in saying that just adding extra files will make no difference, because they are not offered to avr-gcc for compilation and linking.

But although avr-gcc is not a magician, the standard Arduino IDE is, sort of: when compiling a sketch, it does a couple of transformations on the .ino (automatically inserting function prototypes where needed). The .ino is translated to a .cpp, and then the IDE takes every .S, .c and .cpp file in the sketch's directory and compiles them for you! It also automatically compiles the .cpp files of the libraries that you use. You can see this process for yourself in the Java source code that comes with the IDE: look in src\Compiler.java, around line 122 and beyond. And if you enable verbose logging, you'll see that all .cpp files are nicely compiled and linked to the processed sketch file.

Of course I don't know how you compile your sketches, but I use the Arduino IDE for compilation and uploading precisely because it automatically handles all those boring details for me.

pYro_65:
cpp files usually create translation units, without a header specifying linkage, it cannot be added in.

That is not correct: a header file is just a simple piece of text with forward declarations (and often macros or external variable declarations) that is pasted by the C preprocessor in place of the #include line. No magic there either. The reason header files are used is that C and C++ need to see a function's declaration before you can call it. The actual function body can be anywhere, as long as the linker can find it.

hackdog:
To conclude, I think Nick is right; the compiler can probably optimize a single complete file better than multiple partial files which are then linked together.

It is baffling. Use this test code:

#include <Arduino.h>

char data[] = "HELLO WORLD!";
int mysum(int x, int y, int z) { return x+y+z; }
int mylength(char* x) { return strlen(x); }

int length;
int answer;

void setup()
{
    length = mylength(data);
    answer = mysum(length, 42, 69);
}

void loop()
{
    length++;
    answer++;
}

...or the equivalent with an #include "Module.h" (declaring the items defined in Module.cpp, just to rule out any arguments in this area):

#include <Arduino.h>
#include "Module.h"

int length;
int answer;

void setup()
{
    length = mylength(data);
    answer = mysum(length, 42, 69);
}

void loop()
{
  length++;
  answer++;
}

...which now use globals and perform some operations on them, and the code sizes are still not the same.

This is not surprising. In the all-in-one case, the compiler can tell that there are no side-effects of the function calls, and can optimize away the function calls. After all, the value that the function will return is known at compile time.

If the functions depended on things that can not be known at compile time, I think you would see similar file sizes.

The only reason the code is larger is that the code is in a different translation unit and therefore not available to the optimiser. It is forced to be a function call by design, and such function calls cannot be optimised away.

Functions defined in a header file can be optimised, as their code is available to the current translation unit.

PaulS:
If the functions depended on things that can not be known at compile time, I think you would see similar file sizes.

Good point! This test code:

#include <Arduino.h>

long foo() {return random(1, 1000);}
long bar;

void setup() {
  bar = foo();
}

void loop() {
  bar++;
}

... and the corresponding case using #include (which I omit as it's obvious) produce the same code sizes on my machine (Arduino 1.0, Mega board).

pYro_65:
functions in a header file can be optimised as the code is available to the current translation unit.

Don't think so. The compiler does not know the code (definition) of a function that is only declared in a header file. It gets the locations of the header files (option -I), but not the corresponding implementations.

Indeed! So I actually stated my question the wrong way around: the binary isn't getting bigger when things are put in multiple files; that is the regular, expected size. And depending on the code, it can get even smaller when everything is put together in one file, because the compiler may then be able to optimize things even more. Neat!