Function() vs Speed

The way the stack is used is pretty basic stuff. Looking at the prototype for delay, it's simply:

void delay(unsigned long);

Which, as I said, would imply that the parameter to delay is pushed onto the stack along with the program counter (the return address). The context of the application has to be stored somewhere before it jumps into the function.

This is a well documented fact.

I cannot speak for the code the Arduino compiler generates; however, this is exactly how every other C compiler I've used works.
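
To make the call sequence concrete, here is a rough sketch in plain C. fake_delay and blink_once are made-up names for illustration only, not Arduino functions; fake_delay has the same prototype shape as delay but just adds up the requested time instead of waiting. The comments describe what happens conceptually. Whether the argument actually ends up in a stack slot or in a register depends on the compiler and its calling convention, so treat this as an illustration rather than a statement about any particular toolchain.

```c
/* Hypothetical stand-in for delay(): same prototype shape, but it only
   records the requested time instead of waiting. */
unsigned long total_ms = 0;

void fake_delay(unsigned long ms)
{
    /* 'ms' is the callee's own copy of the argument. Where that copy lives
       (stack slot or register) is decided by the calling convention. */
    total_ms += ms;
}

unsigned long blink_once(unsigned long on_ms, unsigned long off_ms)
{
    /* For each call the argument is handed over and a return address is
       saved, so that execution can resume on the next line afterwards. */
    fake_delay(on_ms);
    fake_delay(off_ms);
    return total_ms;
}
```

Calling blink_once(100, 400) makes two calls to fake_delay, and each one pays the call and return overhead discussed above. Compiling with the -S option on a GCC-style compiler will show what the compiler really emits for each call.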