Is there a way to determine the size of a function runtime? (solved?)

Hi,

I am optimizing some code and wondered if there is a way to determine the footprint of a function runtime.

I tried subtracting the addresses of the functions, assuming they are adjacent, but this seems not always true.

void setup()
{
  Serial.begin(115200);
  Serial.print("print0\t");
  Serial.println(((char *)print0) - ((char *)print1));
}

void loop(){}

size_t print0(float n)
{
  return n %10;  // dummy
}

size_t print1(float n)
{
  return n %100;  // dummy
}

Anyone ideas?

idea while writing this: put the addresses in an array and sort them

the idea seems to work if the code does not have too much functions.

  • printed the address of the functions, including setup and loop,
  • copy them in Excel ,
  • sorted them
  • derived the difference between adjacent numbers
  • size column seems to match the actual size. (not checked yet)
func	addr	size
========================
loop	3774	1
print11	3775	28	dummy function
print	3803	212
print2	4015	318
print4	4333	274
print5	4607	288
print6	4895	287
print7	5182	355
print8	5537	534
print9	6071	484
print10	6555	527
print3	7082	359
print1	7441	203
print0	7644	200
setup	7844	-7844
?????	????	-????	which pointer to check to determine size of setup?

The order of the functions in memory do not match the order in which they are called, although there seems some order.
In the source they are called "alphabetically" from setup(). and setup is called first and loop() as last.

code to print the address:

  Serial.print("setup\t");
  Serial.println((char *)setup - (char*)0, DEC);  // difference between two pointers is an int.

of course the adding to the list and sorting and printing should be wrapped in a nice Class => todo

to be continued...

loop() is very efficient :slight_smile:

I would say the order in the source code means nothing, placement in memory is up to the compiler/linker Gods.


Rob

Graynomad:
loop() is very efficient :slight_smile:

The fastest code is the code not written :wink:

I would say the order in the source code means nothing, placement in memory is up to the compiler/linker Gods.

Very true, even individual statement do not necessary need to be adjacent in theory..

but I could imagine that the compiler would place them in a list/memory in the order of appearance (either usage or definition) in the source code. But if the compiler optimizes the parse tree or uses some hashing table, order will be changed.

The trick mentioned above only works if all functions are in the list (or functions will be bigger)

Thought of scanning the memory for typical "save restore register" patterns, but that would be affected by optimizations too.

robtillaart:
but I could imagine that the compiler would place them in a list/memory in the order of appearance (either usage or definition) in the source code.

With a linear address space, every linker I've ever encountered does that; just plops them down in the order encountered.

If the the AVR linker is any good (and it appears to be) it puts things in an order the minimizes jump distances to eliminate "trampolines".

I'm afraid you are not going to find a foolproof solution so you may want to take a step back and describe why you need the runtime size of a set of functions. Maybe there is a reasonable alternative that still gets you what you ultimately need.

I know that I can use objdump.exe to find the size,

but it would be great to compare different implementations of an algorithm in a sketch and make a table with footprint and performance side by side.

Did some prototyping, code is still very experimental,
no documentation (yet) - code is quite self explaining

As always remarks and comments are welcome.

test sketch,

//
//    FILE: testFunctionSizeFinder.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: runtime determine size of functions
//    DATE: 
//     URL:
//
// Released to the public domain
//

int dunn;

#include "functionSizeFinder.h"

functionSizeFinder FSF;

void setup() 
{
  Serial.begin(115200);
  Serial.println("Start ");

  k();
}

void loop() 
{
}

int k()
{
  FSF.add("setup", (char*) setup);
  FSF.add("loop", (char*) loop);
  FSF.add("f", (char*) f);
  FSF.add("g", (char*) g);
  FSF.add("k", (char*) k);
  FSF.add("dunn", (char*) &dunn);
  FSF.add("Serial", (char*) &Serial);
  FSF.add("SF", (char*) &FSF);

  for (int i = 0; i < FSF.getCount(); i++)
  {
    Serial.print(FSF.getName(i));
    Serial.print("\t");
    Serial.print(FSF.getAddr(i));
    Serial.print("\t");
    Serial.println(FSF.getSize(i));
  }
}

int f() 
{
  return 25;
}

int g() 
{
  return f() * f() * 3;
}

functionSizeFinder.h

#ifndef FUNCTIONSIZEFINDER_H
#define FUNCTIONSIZEFINDER_H
//
//    FILE: FunctionSizeFinder.h
//  AUTHOR: Rob Tillaart
// PURPOSE: tries to find size of functions runtime
// VERSION: 0.1.00
// HISTORY: 
//     URL: 
//
// Released to the public domain
//

#include "Arduino.h"

#define FUNCTIONSIZEFINDER_VERSION "0.1.00"

class functionSizeFinder
{
private:
    uint8_t _cnt;
    uint8_t _size;
public:
    functionSizeFinder();
    void add(const char * name, const char * address);
    uint8_t getCount();
    char* getName(uint8_t i);
    uint16_t getAddr(uint8_t i);
    uint16_t getSize(uint8_t i);
};

#endif
// END OF FILE

functionSizeFinder.cpp

//
//    FILE: functionSizeFinder.cpp
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: .
//
// HISTORY:
// 0.1.00 - 2013-12-25 initial version (experimental)
//
// Released to the public domain
//

#include "functionSizeFinder.h"

struct 
{
  char name[8];
  uint16_t addr;
} ar[20];
  

functionSizeFinder::functionSizeFinder()
{
    _size = 20;
    _cnt = 0;
};

// add keeps the list sorted
void functionSizeFinder::add(const char * name, const char * address)
{
    if (_cnt < _size)
    {
        uint16_t tmp = address - (char *) 0;
        uint8_t idx = _cnt;
        while (idx > 0 && ar[idx-1].addr > tmp)
        {
            strncpy(ar[idx].name, ar[idx-1].name, 8);
            ar[idx].addr = ar[idx-1].addr;
            idx--;
        }
        strncpy(ar[idx].name, name, 8);
        ar[idx].addr = tmp;
        _cnt++;
    }
};

uint8_t functionSizeFinder::getCount()
{
    return _cnt;
};

char * functionSizeFinder::getName(uint8_t i)
{
    return ar[i].name;
};

uint16_t functionSizeFinder::getAddr(uint8_t i)
{
    return ar[i].addr;
};

uint16_t functionSizeFinder::getSize(uint8_t i)
{
    if (i < _cnt) return ar[i+1].addr - ar[i].addr;
    return 0;
};

// END OF FILE

output
name, address, size

Start 
loop	111	1
f	112	3
g	115	8
k	123	126
setup	249	65
dunn	314	2
SF	316	202
Serial	518	65018

sounds quite good.

compared to objdump:

000000de <loop>:
  de:   08 95           ret

000000e0 <_Z1fv>:
  e0:   89 e1           ldi     r24, 0x19       ; 25
  e2:   90 e0           ldi     r25, 0x00       ; 0
  e4:   08 95           ret

000000e6 <_Z1gv>:
  e6:   83 e5           ldi     r24, 0x53       ; 83
  e8:   97 e0           ldi     r25, 0x07       ; 7
  ea:   08 95           ret

000000ec <_GLOBAL__I_dunn>:
  ec:   8c e3           ldi     r24, 0x3C       ; 60
  ee:   91 e0           ldi     r25, 0x01       ; 1
  f0:   0e 94 0e 01     call    0x21c   ; 0x21c <_ZN18functionSizeFinderC1Ev>
  f4:   08 95           ret

There seems to be a factor 2 involved
loop
111(dec) == de(hex) / 2 0xDE = 13x 16 + 14 = 222

_Z1fv - f()
112 dec == e0(hex)/2 0XE0 = 14*16 = 224

will patch lib/sketch tomorrow.


update: tested on UNO only

Created a function that scans flash from a function address until it encounters a RET statement in assembly. The code is definitely not fool proof to determine the size but so far it seems to work pretty well. The values returned agree with objdump.exe for the functions tested (about 2 dozen). I expect this trick is portable to other processor architectures.

Description:
The function has one parameter, the address of a function.
This address needs to be casted to (char*) to be accepted by the function.
The function starts at the given address and gets 2 bytes from flash (opcode/operand) until the RETurn statement is encountered.
Then it returns the difference between the address of RET and the start address.

Notes:

  • I added a test for maxint, but that might be superfluous.
  • the code use 32 bit where 16 bit (UNO) probably is sufficient.
  • the AVR addresses - from functions - need to be multiplied by 2 to get the real address.
    (This allows the AVR to have a 16 bit Program Counter when using a 128K flash memory)

As always remarks and comments are always welcome.

//
//    FILE: getBytesUntilReturn.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: determine footprint of function runtime
//    DATE: 2013-12-26
//     URL:
//
// Released to the public domain
//

#include "Arduino.h"

// the magic number
#define RET    0x9508  

uint32_t getBytesUntilReturn( char* address )
{
  uint32_t start = (address - (char*)0) * 2;
  uint32_t ad = start;
  uint16_t opcode = 0;
  do
  {
    opcode = pgm_read_word_near(ad);
    ad += 2;
  }
  while ( (ad < 65535) && (opcode != RET) );

  return ad - start;
}

////////////////////////////////////////

void setup() 
{
  Serial.begin(115200);
  Serial.println("getBytesUntilReturn\n");

  uint32_t size = getBytesUntilReturn((char*) btz);
  Serial.print("btz:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) f);
  Serial.print("f:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) g);
  Serial.print("g:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) k);
  Serial.print("k:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) setup);
  Serial.print("setup:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) loop);
  Serial.print("loop:\t");
  Serial.println(size);

  size = getBytesUntilReturn((char*) getBytesUntilReturn);
  Serial.print("getBytesUntilReturn:\t");
  Serial.println(size);
}

void loop() 
{
}

int k()
{
  return 42;
}

int f() 
{
  return 25;
}

int g() 
{
  return f() * k() * 3;
}

int btz(int n)
{
  if (n > 0) return 1;
  return 0;
}

the library mentioned earlier still needs refactoring, the above function should be in it.

You're not actually parsing the code though, so there's a chance you'll encounter the RET opcode as a literal, for example if the value 0x9508 is used as immediate data.

Maybe try a func that does "int x = 0x9508" and see what happens, not that I know how the ASM would look like for that, presumably something like

ldi r30, $9508

where the opcode would be one word and the value the next.

Yeah I know, it's contrived and unlikely but it might happen and there may be other scenarios.


Rob

If you are using Linux it is easy to script this using avr-nm. The command below will generate the info you are looking for.

avr-nm -CS /tmp/build*/*.elf

This will actually list all builds. The option C generates demangled symbols. The option S gives the function size.

Cheers!

BW: I use the below script (avr-list) to generate an assembly listing and symbols for verification.

avr-size -C /tmp/build*.tmp/$1.cpp.elf > $1.lst
avr-objdump -d /tmp/build*.tmp/$1.cpp.elf >> $1.lst
avr-nm -CS /tmp/build*.tmp/$1.cpp.elf > $1.sym

Below is a snippet of "CosaRS485master.sym" which was generated with the command "avr-list CosaRS485master". The example sketch is part of the Cosa library. The output contains function start address and size. Additional filtering may be used to create a file for further analysis.

/tmp/build2072171476263976679.tmp/CosaRS485master.cpp.elf:
00000210 t .do_clear_bss_loop
00000212 t .do_clear_bss_start
00000220 t .do_global_ctors_loop
00000228 t .do_global_ctors_start
000005ca 000000ac t global constructors keyed to free_memory()
00001630 00000128 t global constructors keyed to UART::uart
00002058 00000022 t global constructors keyed to Event::queue
00000c64 00000012 t global constructors keyed to IOStream::Filter::Filter(IOStream::Device*)
00001f7a 00000020 t global constructors keyed to Watchdog::s_handler
0000207a 000000c4 t global constructors keyed to trace_log_mask
00000394 00000026 W endl(IOStream&)
00002214 00000016 W init()
000006a0 000002a8 T loop()
0000090c 00000120 T setup()
00800347 00000008 V guard variable for uart
00800483 00000008 V guard variable for trace
0080034f 00000045 b ibuf
00800394 00000045 b obuf
008003d9 00000001 B RTC::s_initiated
00001d76 0000003a T RTC::begin(void (*)(void*), void*)
008003e6 00000002 B RTC::s_env
008003e0 00000004 B RTC::s_sec
00001db0 0000004a T RTC::micros()
000019de 00000016 W RTC::millis()
008003de 00000002 B RTC::s_ticks
008003da 00000004 B RTC::s_uticks
008003e4 00000002 B RTC::s_handler
00002236 00000058 T Head::on_event(unsigned char, unsigned int)
000011b6 00000020 T UART::on_rx_interrupt()
000011d6 00000022 T UART::on_tx_interrupt()
00001174 00000042 T UART::on_udre_interrupt()
000010e2 00000002 W UART::on_transmit_completed()
000003d4 0000001e W UART::room()
0080033f 00000008 B UART::uart
000010e4 00000090 T UART::begin(unsigned long, unsigned char)
00000446 0000001e W UART::flush()
00000428 0000001e W UART::getchar()
00001758 0000006e T UART::putchar(char)
00000410 00000018 W UART::peekchar(char)
000003f2 0000001e W UART::peekchar()
000003ba 0000001a W UART::available()
0080042e 00000052 B Event::queue
000010ca 00000018 T Power::sleep(unsigned char)
00000464 00000022 W RS485::on_transmit_completed()
000019f0 00000398 T RS485::recv(void*, unsigned int, unsigned long)
000017c4 0000021a T RS485::send(void const*, unsigned int, unsigned char)
000021c6 0000004e T Trace::begin(IOStream::Device*, char const*)
00002160 0000006a T Trace::fatal_P(char const*, int, char const*)
00000562 00000068 W IOBuffer<(unsigned char)64>::gets(char*, unsigned int)
0000049e 00000018 W IOBuffer<(unsigned char)64>::room()
00000676 0000002c W IOBuffer<(unsigned char)64>::flush()
0000053c 00000026 W IOBuffer<(unsigned char)64>::getchar()
000004b6 0000002e W IOBuffer<(unsigned char)64>::putchar(char)
00000508 00000034 W IOBuffer<(unsigned char)64>::peekchar(char)
000004e4 00000024 W IOBuffer<(unsigned char)64>::peekchar()
00000486 00000018 W IOBuffer<(unsigned char)64>::available()
00000a34 00000016 T IOStream::set_device(IOStream::Device*)
00000a4a 00000064 T IOStream::print_prefix(IOStream::Base)
00000e96 000000b8 T IOStream::print(int, IOStream::Base)
00000d2c 00000072 T IOStream::print(unsigned int, IOStream::Base)
00000e18 00000082 T IOStream::print(long, IOStream::Base)
00000d9a 00000082 T IOStream::print(unsigned long, IOStream::Base)
00000c76 00000090 T IOStream::Device::gets(char*, unsigned int)
00800333 00000003 B IOStream::Device::null
00000d04 00000028 T IOStream::Device::puts(char const*)
00000c10 0000004e T IOStream::Device::read(iovec_t*)
00000bbc 00000054 T IOStream::Device::read(void*, unsigned int)
00000ab4 00000006 T IOStream::Device::room()
00000c5e 00000006 T IOStream::Device::flush()
00000b5c 0000004e T IOStream::Device::write(iovec_t const*)
00000b08 00000054 T IOStream::Device::write(void const*, unsigned int)
00000ac0 00000048 T IOStream::Device::puts_P(char const*)
00000bb6 00000006 T IOStream::Device::getchar()
00000aba 00000006 T IOStream::Device::putchar(char)
00000bb0 00000006 T IOStream::Device::peekchar(char)
00000baa 00000006 T IOStream::Device::peekchar()
00000aae 00000006 T IOStream::Device::available()
0000213e 00000022 W IOStream::printf_P(char const*, ...)
00000f4a 00000188 T IOStream::vprintf_P(char const*, void*)
00000a24 00000010 T IOStream::IOStream()
00000370 00000024 W IOStream::operator<<(char const*)
0080042c 00000001 B Watchdog::s_prescale
00001f9a 000000c2 T Watchdog::await(bool (*)(void*), void*, unsigned int)
00001e8e 00000064 T Watchdog::begin(unsigned int, unsigned char, void (*)(void*), void*)
008003ea 00000002 B Watchdog::s_env
0080042d 00000001 B Watchdog::s_mode
00800428 00000004 B Watchdog::s_ticks
008003ec 0000003c B Watchdog::s_timeq
008003e8 00000002 B Watchdog::s_handler
avr-nm -nS *.elf
  :
00000100 00000056 T loop
00000156 0000000c T setup
00000162 00000090 T __vector_16
000001f2 00000076 T init
00000268 000000d0 T delay
00000338 0000007e T pinMode
000003b6 000000a8 T digitalWrite
0000045e 0000001e T main
0000047c 00000002 t __empty
0000047c 00000002 W yield
0000047e T _exit
0000047e W exit
00000480 t __stop_program
00000482 A __data_load_start
00000482 T _etext
  :

@Graynomad
Yes, the code is very naive at best and indeed when the magic number is used as either an offset in the code or as operand the code will fail to deliver.

I had a look at the AVR instruction set - http://www.atmel.com/images/doc0856.pdf - to see how complex it would be to write a "disassembler" that would be smart enough to catch the operand scenario and that seems doable. 9508 as operand can only occur with the 32 bit instructions and there are less than 10 of those. E.g. CALL is one of them (CALL calls another function). Recognizing these will catch the operand scenario.

A far more difficult scenario to tackle is compiled functions with multiple RET in it. It is easy to imagine that the assembler code for a function like below (but more elaborate) has three places it returns. Did not encounter this yet.

uint32_t example(uint32_t n)
{
  if (n== 0) return 0;
  // ...lines of code 
  if (n%2)  return 3*n + 1;
  return n/2;
}

@kowalski, westfw
avr-nm is a great tool, and far superior to what I have written (no doubt), but it cannot be called on the UNO itself.
The reason why that is important for me is as follows:

I am often optimizing code both for fun and more serious needs. When I try different approaches I want to compare the effects on speed and space as quickly as possible. So if I can do it runtime and generate performance and footprint in the output of the sketch it would be optimal. Then I do not need to swap between different tools, parsing and merging their output.

With the getBytesUntilReturn function I can (within its limits) do :

start = micros();
func();  // or a loop 
duration =micros() - start;
size = getBytesUntilReturn((char*) func);

Serial.print("func\t");
Serial.print(duration);
Serial.print("\t");
Serial.print(size);
Serial.println();

This done for multiple functions will generate output like :

name    time    size
f1  960 234
f2  1024    324
f3  876 236
f4  400 1200
f5  404 896

in which I can see the effects of a new algorithm/variation directly and copy paste the numbers into a spreadsheet to graph them.

And yes, knowing the limits of getBytesUntilReturn() I must not accept the numbers blindly. (you never should :wink:
And yes, it should have a better name too, but at least now it just does what it says it does.

Thanks for the feedback,

I haven't delved into the bowels of the machine to the opcode level for a few years now, brings back good memories :slight_smile:

WRT the multiple RET scenario, you could make it a little more robust by identifying a function preamble immediately following. This might be as simple as finding a PUSH.

Alternatively only count a RET if it immediately follows a POP. I'm pretty sure the post amble would have a POP as the last instruction. Exactly what register is popped may change but it's possible the upper 7 and lower 4 bits of the POP opcode are unique so you don't care about the 5 register-select bits in the middle.

EDIT: Just thinking about that, as the 5 reg select bits can be any value from 0 to 31 the rest of the opcode has to be unique.


Rob

Graynomad:
WRT the multiple RET scenario, you could make it a little more robust by identifying a function preamble immediately following. This might be as simple as finding a PUSH.

A full stack frame (PUSHs and POPs) is somewhat rare with AVR processors. They don't seem to be needed very often due to the number of registers and they way the compiler manages them.

(In general it is a good idea. Just one that probably won't be reliable with our favourite Arduino processor.)

Some inline assembly and an illegal opcode (or rarely used opcode) may do the trick...

int f() 
{
  return 25;
  asm volatile(".dw 0xA5A5\n\t"::);
}

int g() 
{
  return f() * f() * 3;
  asm volatile(".dw 0xA5A5\n\t"::);
}

(I believe ".dw" is used to output a word sized value. 0xA5A5 is just an example; I have no idea if it is or is not an illegal opcode.)

@Graynomad
That idea had crossed my mind too, could be just a mask & compare: POP = 1001 000d dddd 1111
would improve the quality IF a POP is present. A small function like
int f()
{
return 42;
}
does not push / pop,

so the algo should count the PUSHes at the begin and match them with POPs at the end,
optionally take into account the reverse order of registers.

maybe the runtime disassembler idea is not a bad one after all. Needs: MEGA + graduation project + student...

@Coding Badly
injecting code/tags this way is fun, however it might affect the optimized code under test :wink:

robtillaart:
@Coding Badly
injecting code/tags this way is fun, however it might affect the optimized code under test :wink:

Yes, it will. It will make the function one word larger but no slower nor faster. :wink:

what I meant was it will affect its behaviour

How? The eye-catcher is after the return.