as I said would imply the parameter delay is pushed onto the stack along with the program counter.
But you're wrong, and Nick is right. MANY "RISC" processors (which have lots of registers, and generally "slow" access to memory) have a calling convention that places the first several arguments in registers, rather than pushing them on the stack. (If the function then calls other functions, or recurses, it will end up saving those on the stack, if necessary.)
Interestingly (?), this information is hard to find. Frequently Asked Questions mentions it, but a FAQ is hardly a specification!
This does mean that if the function you are calling is relatively simple, the overhead is pretty low. Register allocation has gotten smart; usually there isn't even any overhead for moving intermediate results into the proper "argument" registers. So, for example, "delay(1000);" does NOT result in (ldi32 tmp32,1000; mov32 args32,tmp32; call delay), just (ldi32 args32,1000; call delay), where xxx32 means whatever is necessary for 32 bits: usually four 8-bit moves into four registers.
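If you want to check this yourself, here is a minimal sketch. my_delay, blink_once and elapsed_ms are made-up names for illustration; my_delay is only a stand-in and is NOT the real Arduino delay(). As far as I know, avr-gcc passes a single 32-bit first argument in r22..r25, so compiling this with something like "avr-gcc -Os -S" should show the constant being loaded straight into those registers, followed by the call (or nothing at all, if the compiler decides to inline such a small function).

```c
#include <stdint.h>

/* Hypothetical stand-in for delay(): it just accumulates the requested
   time so the effect of the call can be observed. */
static volatile uint32_t elapsed_ms;

void my_delay(uint32_t ms)
{
    elapsed_ms += ms;
}

/* The caller passes a 32-bit constant. With a register-based calling
   convention the constant is loaded directly into the argument
   registers; nothing is pushed on the stack for it. */
void blink_once(void)
{
    my_delay(1000);
}
```

The exact register numbers are a detail of the compiler's ABI, so treat the generated assembly as the authority rather than this post.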
It puzzles me why -g is used together with -Os...
Why? -g controls debugging info generated; it doesn't turn off optimization or add code. Optimized code can sometimes get re-ordered, with local variables eliminated or reused, making debugging a bit more "exciting" than usual, but it's not awful. I like the quote on the page you reference:
Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have bugs.