Function() vs Speed

DuaneB:
The ISR post suggested that there was a standard prologue and epilogue to calling an ISR, which added about 55 cycles if I remember correctly.

There is no such thing as a "standard prologue and epilogue". The compiler outputs what is necessary and no more. Unless you're willing to delve into assembly, the minimum is 19 cycles, plus 4 for the interrupt and 2 for the relative jump, for a total of 25 cycles.

Am I missing something? Are ISRs fundamentally different in the way that registers are preserved?

Interrupt service routines have to preserve all registers that are or could be used. Normal functions only have to preserve a very specific subset of registers.
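
To make that concrete, here is a minimal sketch (my own illustration for an Uno-class ATmega328, not taken from the post being discussed). The same count++ compiles with no pushes or pops inside an ordinary function, because r24 is one of the registers a function is allowed to clobber, but inside the ISR the compiler has to save SREG and every register it touches, because the interrupt can strike in the middle of anything:

volatile byte count;

void bump ()                 // ordinary function: r24 is fair game,
  {                          // so nothing is pushed or popped here
  count++;
  }

ISR (PCINT0_vect)            // ISR: same work, but SREG and every register
  {                          // it touches must be saved and restored
  count++;
  }

void setup ()
  {
  PCICR  |= bit (PCIE0);     // enable pin-change interrupts on port B
  PCMSK0 |= bit (PCINT0);    // ... for pin 8 (PB0)
  }

void loop () { }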

Found it, here is the original post to which I am referring...

Bear in mind that article is not about interrupt service routines in general but specifically about the Arduino interrupt API.

Am I missing something? Are ISRs fundamentally different in the way that registers are preserved?

Yes.

Let's not just guess, shall we?

Here's a test sketch:

volatile byte a, b, c;

void foo ()
  {
  a++;
  b++;
  c++;  
  }
  
void setup ()
  {
  Serial.begin (115200);
  }
  
void loop ()
  {
  a = 5;
  foo ();
  b = 6;
  foo ();
  c = 7;
  foo ();
  }

This was designed to show the overhead of calling foo three times. However, interestingly, the compiler inlined the lot:

void loop ()
  {
  a = 5;
  foo ();
  b = 6;
  e2:	86 e0       	ldi	r24, 0x06	; 6
  e4:	80 93 13 01 	sts	0x0113, r24
void loop ();
volatile byte a, b, c;

void foo ()
  {
  a++;
  e8:	80 91 12 01 	lds	r24, 0x0112
  ec:	8f 5f       	subi	r24, 0xFF	; 255
  ee:	80 93 12 01 	sts	0x0112, r24
  b++;
  f2:	80 91 13 01 	lds	r24, 0x0113
  f6:	8f 5f       	subi	r24, 0xFF	; 255
  f8:	80 93 13 01 	sts	0x0113, r24
  c++;  
  fc:	80 91 14 01 	lds	r24, 0x0114
 100:	8f 5f       	subi	r24, 0xFF	; 255
 102:	80 93 14 01 	sts	0x0114, r24
  {
  a = 5;
  foo ();
  b = 6;
  foo ();
  c = 7;
 106:	87 e0       	ldi	r24, 0x07	; 7
 108:	80 93 14 01 	sts	0x0114, r24
void loop ();
volatile byte a, b, c;

void foo ()
  {
  a++;
 10c:	80 91 12 01 	lds	r24, 0x0112
 110:	8f 5f       	subi	r24, 0xFF	; 255
 112:	80 93 12 01 	sts	0x0112, r24
  b++;
 116:	80 91 13 01 	lds	r24, 0x0113
 11a:	8f 5f       	subi	r24, 0xFF	; 255
 11c:	80 93 13 01 	sts	0x0113, r24
  c++;  
 120:	80 91 14 01 	lds	r24, 0x0114
 124:	8f 5f       	subi	r24, 0xFF	; 255
 126:	80 93 14 01 	sts	0x0114, r24
  foo ();
  b = 6;
  foo ();
  c = 7;
  foo ();
  }
 12a:	08 95       	ret

0000012c <setup>:
  c++;  
  }

Notice how the code for "foo" is actually inlined into loop in two places? (The compiler worked out that the first call was completely redundant.) So you can save yourself the trouble of inlining by hand: the compiler is smart enough to do it for you.

And if I add a call to millis(), the code generated for millis() is this:

000001d2 <millis>:

unsigned long millis()
{
	unsigned long m;
	uint8_t oldSREG = SREG;
 1d2:	8f b7       	in	r24, 0x3f	; 63

	// disable interrupts while we read timer0_millis or we might get an
	// inconsistent value (e.g. in the middle of a write to timer0_millis)
	cli();
 1d4:	f8 94       	cli
	m = timer0_millis;
 1d6:	20 91 19 01 	lds	r18, 0x0119
 1da:	30 91 1a 01 	lds	r19, 0x011A
 1de:	40 91 1b 01 	lds	r20, 0x011B
 1e2:	50 91 1c 01 	lds	r21, 0x011C
	SREG = oldSREG;
 1e6:	8f bf       	out	0x3f, r24	; 63

	return m;
}
 1e8:	b9 01       	movw	r22, r18
 1ea:	ca 01       	movw	r24, r20
 1ec:	08 95       	ret

I can't see a single register there being pushed or popped.

My suggestion still is: don't try to outsmart the compiler. Write simple, readable code.

When a function is called, the program counter and any parameters to pass to the function are pushed onto the stack. The parameters are then popped off the stack inside the function.

When you exit the function the program counter is popped off the stack. The overhead depends on how many parameters you are passing to the function; you can improve this by passing parameters as pointers: wrap all your parameters up into an object or structure and pass a pointer to that.

Using inline definitely has its uses; it effectively turns your function into a macro and removes all the stack operations. However, if the function is called from many different places I wouldn't use inline; just try to optimize the function and its parameters.

Much as I hate to contradict people, the evidence doesn't support this claim.

This sketch:

void setup ()
  {
  }
  
void loop ()
  {
  delay (1000);
  }

Generates, for loop:

000000a8 <loop>:
  
void loop ()
  {
  delay (1000);
  a8:	68 ee       	ldi	r22, 0xE8	; 232
  aa:	73 e0       	ldi	r23, 0x03	; 3
  ac:	80 e0       	ldi	r24, 0x00	; 0
  ae:	90 e0       	ldi	r25, 0x00	; 0
  b0:	0e 94 a3 00 	call	0x146	; 0x146 <delay>
  }
  b4:	08 95       	ret

Nothing is being pushed onto the stack there. Certainly, the number 1000 (unsigned long), which is 0x000003E8, is set up in 4 registers. But nothing is pushed, and nothing is popped. The compiler is doing the minimum (and therefore the fastest) it needs to do.

I really don't see how you could pass (unsigned long) 1000 to a function in any faster or more efficient way.

Once again, don't try to outsmart the compiler by writing obscure code.

Hi Chaps,

I am still a little confused. In the case of a non-trivial function, let's assume that it's sufficiently complex to require half of the available registers:

  1. Is the compiler sophisticated enough to only push/pop the required registers?

and

  2. If yes to 1), why is it not smart enough to do the same for an ISR?

Duane B

rcarduino.blogspot.com

For a simple ISR it is. Take a look here:

For an SPI interrupt it saves and restores 7 registers (not all of them).

But once an ISR calls another function, which might call another function, it bails out and saves the lot. There is a limit to what the compiler can deduce might happen if function A calls function B, which might call function C, and so on.

So it's smart. But there comes a point where the problem is too undefined to be certain. Like, for example, if an ISR had an "if" in it, where a function was called if the "if" was true, and otherwise not.
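
A hedged sketch of that situation (logPulse() is a hypothetical function assumed to be defined in another file, so the compiler cannot see inside it). Even though the call only happens sometimes, the mere possibility of it forces the ISR prologue to save the full set of call-used registers, not just the ones the ISR itself touches:

extern void logPulse (unsigned int count);  // opaque to this translation unit

volatile unsigned int pulses;

ISR (TIMER1_OVF_vect)
  {
  pulses++;
  if ((pulses & 0x0F) == 0)   // only every 16th overflow ...
    logPulse (pulses);        // ... but this possible call forces the full save
  }

void setup ()
  {
  TCCR1A = 0;
  TCCR1B = bit (CS10);        // timer 1 running at the system clock rate
  TIMSK1 = bit (TOIE1);       // enable the timer 1 overflow interrupt
  }

void loop () { }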

  1. Is the compiler sophisticated enough to only push/pop the required registers?

I believe it is, again bearing in mind the above. A simple function, where the compiler can see what the function does, will almost certainly have limited pushing/popping. But with ifs and branches and sub-function calls it can't be too certain. However, from what I am seeing, it makes some working assumptions: for example, each function only has to push/pop what it knows it uses.

Plus I think the compiler has some "scratch" registers which it assumes are available for every function. But for an ISR, which might interrupt half-way through a function, it can't assume that, so it has to save them too.
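
Those "scratch" registers are indeed part of the avr-gcc convention. A rough summary, paraphrased from the avr-libc FAQ rather than from anything in this thread, written out as comments:

// avr-gcc register usage, roughly (see the avr-libc FAQ for the details):
//   r0            temporary; r1 is assumed to always hold zero
//   r18-r27,      "call-used" scratch: a function may clobber these freely,
//   r30-r31       which is exactly why an ISR must save any of them it touches
//                 (along with r0, r1 and SREG)
//   r2-r17,       "call-saved": an ordinary function pushes/pops only the ones
//   r28-r29       it actually uses (r28:r29 doubles as the frame pointer)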

The way the stack is used is pretty basic stuff; looking at the prototype for delay, it's simply:

void delay(unsigned long);

Which, as I said, would imply the parameter to delay is pushed onto the stack along with the program counter. The context of the application has to be stored somewhere before it jumps into the function.

This is a well documented fact.

I cannot speak for the code the Arduino compiler generates, however this is exactly how every other C compiler I've used works.

Hi,
Thanks, seems reasonable enough and further reinforces the point - 'make it so you can read it and let the compiler worry about the rest'

Duane B

rcarduino.blogspot.com

Which, as I said, would imply the parameter to delay is pushed onto the stack along with the program counter. The context of the application has to be stored somewhere before it jumps into the function.

To be fair, I used to believe that too. However, compiler writers have recently (Lua is an example) worked out that careful tracking of registers is actually faster than blindly pushing things onto a stack. Conceptually, there is a stack. In practice, the compiler can replace pushing/popping with reasoning like "hey, I know I have 32 registers, why not just allocate 4 of them for the first 4 arguments that get sent to a function?".

Certainly there will come a time (e.g. with 30 arguments) when the compiler just has to push them all. But you must admit, for functions that take just a few bytes of arguments, a smart compiler-writer will allocate a register or four for the initial arguments. This saves loading them into a register AND pushing them AND popping them. You just load them into a register. If they hadn't done this by now, people would have complained.

Having an understanding of how things work at the assembly level is always useful. Compilers can introduce problems, especially when optimizations are enabled.

Also, when talking about the compiler, remember the job of the compiler is to produce native machine code that the processor understands. The way a function is called, and the way parameters are pushed and popped onto and off the stack, are not compiler issues; it comes down to the way the processor works.

as I said, would imply the parameter to delay is pushed onto the stack along with the program counter.

But you're wrong, and Nick is right. MANY "RISC" processors (which have lots of registers, and generally "slow" access to memory) have a calling convention that places the first several arguments in registers, rather than pushing them on the stack. (If the function then calls other functions, or recurses, it will end up saving those on the stack, if necessary.)

Interestingly (?), this information is hard to find. The Frequently Asked Questions mentions it, but a FAQ is hardly a specification!

This does mean that if the function you are calling is relatively simple, the overhead is pretty low. Register allocation has gotten smart. Usually there isn't even any overhead of moving intermediate results into the proper "argument" registers. (So, for example, "delay(1000);" does NOT result in (ldi32 tmp32,1000; mov32 args32,tmp32; call delay;), just (ldi32 args32,1000; call delay), where xxx32 means whatever is necessary for 32 bits: usually 4 8-bit moves into 4 registers.)
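
Spelling that out for delay() on avr-gcc (my reading of the convention; the actual instructions are the ones already shown in the loop listing above): the argument registers run from r25 down to r8 in pairs, so the constant lands straight in its final home:

//   delay (1000);    // 1000 = 0x000003E8, one unsigned long
//   ldi  r22, 0xE8   // low byte goes directly into the argument registers
//   ldi  r23, 0x03
//   ldi  r24, 0x00
//   ldi  r25, 0x00   // high byte
//   call delay       // no temporary registers, no pushes, no pops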

It puzzles me why -g is used together with -Os...

Why? -g controls debugging info generated; it doesn't turn off optimization or add code. Optimized code can sometimes get re-ordered, with local variables eliminated or reused, making debugging a bit more "exciting" than usual, but it's not awful. I like the quote on the page you reference:

Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have bugs.

Registers are only used when available, which depends entirely on what else the CPU is doing at the time; at all other times the stack is used. So you should not rely on the use of registers, but you should factor the stack into your operation, as this is the worst-case scenario.

westfw:
(...snip...)

It puzzles me why -g is used together with -Os...

Why? -g controls debugging info generated; it doesn't turn off optimization or add code. Optimized code can sometimes get re-ordered, with local variables eliminated or reused, making debugging a bit more "exciting" than usual, but it's not awful. I like the quote on the page you reference:

Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have bugs.

My (I guess wrong) assumption was that debug information has to be stored into the final executable, thus making it bigger. But we are optimizing for size as we are on small devices... What am I missing here ?

SPlatten:
Registers are only used when available, which depends entirely on what else the CPU is doing at the time; at all other times the stack is used. So you should not rely on the use of registers, but you should factor the stack into your operation, as this is the worst-case scenario.

Yes, but if you dump the .elf file and you find that registers are being used, then that won't change. Which, it appears, is what happens for simple functions, which I believe the original question was about.

tuxduino:
My (I guess wrong) assumption was that debug information has to be stored into the final executable, thus making it bigger. But we are optimizing for size as we are on small devices... What am I missing here ?

I'm not sure what debug information would be stuck in the executable. Some optimisation options just make the object easier to follow, that's all (by not moving instructions around, like out of loops).

assumption was that debug information has to be stored into the final executable, thus making it bigger.

The .elf file becomes significantly swollen with debugging information, but it is all in separate linker sections that are easily stripped out when making the .hex files that actually load on the Arduino hardware.

There's some elucidation.
I was figuring to snap things up a bit, but it looks like the returns for the effort would be negligible.
Thanks.
"Don't Try to Outsmart The Compiler" - that's long for a bumper sticker, but it has potential.

SPlatten:
Registers are only used when available, which depends entirely on what else the CPU is doing at the time; at all other times the stack is used. So you should not rely on the use of registers, but you should factor the stack into your operation, as this is the worst-case scenario.

My experience and the FAQ state the opposite of what you state. Can you provide evidence to support your statement?

The evidence is in numerous books. My experience working with assembler dates back to the early 80's with 6502 and Z80.

You only have a limited number of registers. When you run out of registers, this doesn't mean you can't call any functions; it simply switches to the stack. Go read some books or search online.

Likewise, if you nest too many function calls you will run out of stack space and encounter a stack overflow.

If there are too many, those that don't fit in registers are passed on the stack.

This might be useful: Calling convention - Wikipedia
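
A quick made-up example of where that kicks in with avr-gcc, if I have the convention right: arguments are handed over in r25 down to r8, which is 18 bytes' worth, so a call like the one below runs out of argument registers and the last parameter really does travel on the stack:

// spill() is purely illustrative - five longs need 20 bytes of arguments,
// more than fits in r25..r8, so 'e' ends up being passed on the stack
long spill (long a, long b, long c, long d, long e)
  {
  return a + b + c + d + e;
  }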

SPlatten:
The evidence is in numerous books. My experience working with assembler dates back to the early 80's with 6502 and Z80.

You only have a limited number of registers. When you run out of registers, this doesn't mean you can't call any functions; it simply switches to the stack. Go read some books or search online.

Likewise, if you nest too many function calls you will run out of stack space and encounter a stack overflow.

If there are too many, those that don't fit in registers are passed on the stack.

As suspected, your statements are generalized comments about compiler functionality and not specific to gcc, avr and arduino. I would suggest using the disassembler in AVRStudio to evaluate the specifics of a particular function call.

My experience also dates back to the 1980's and the 6502. A copy of Principles of Compiler Design is in my library.