Code is executed from flash, yes; that's in the manual. It should also be common sense: ATmega chips have much less RAM than flash, and the Arduino bootloader alone is 2K, twice the RAM present in the ATmega168.
However, when you do something like
int i = 6;
two bytes in memory are allocated and filled
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
and thus the six is stored.
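If you want to see that bit pattern for yourself, here is a minimal sketch. The helper name bits16 is made up for illustration, and int16_t stands in for the 16-bit int you get on the ATmega so the result is the same on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the 16 bits of value into out as '0'/'1'
   characters, most significant bit first, then a terminating NUL.
   out must have room for 17 chars. */
void bits16(int16_t value, char out[17])
{
    uint16_t v = (uint16_t)value;
    for (size_t i = 0; i < 16; i++) {
        /* Test one bit at a time, starting from bit 15. */
        out[i] = (v & (uint16_t)(0x8000u >> i)) ? '1' : '0';
    }
    out[16] = '\0';
}
```

Calling bits16(6, buf) should leave the same sixteen digits in buf as the row shown above.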
If you then do something like
int *a = &i;
another 16 bits of memory are allocated (the size of a pointer, which on the ATmega is the same as the size of size_t) and written with the address at which the value of i is stored. Some other sizes that may be of interest:
sizeof(int)=2
sizeof(size_t)=2
sizeof(uint8_t)=1
sizeof(uint8_t*)=2
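A rough sketch of the pointer case, in case it helps. The function names are invented for the example, and int16_t is used in place of int so the variable is two bytes everywhere. Note that the pointer size is only 2 on the ATmega; on a PC the same code will typically report 4 or 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a value back through a pointer, as in `int *a = &i;`. */
int16_t read_through_pointer(void)
{
    int16_t i = 6;    /* two bytes holding the value 6 */
    int16_t *a = &i;  /* holds the address of i, not a copy of the value */
    return *a;        /* dereferencing follows the address back to the 6 */
}

/* Size of a data pointer on whatever target this is compiled for. */
size_t data_pointer_size(void)
{
    return sizeof(uint8_t *);
}
```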
When you do something like:
typedef struct _LCD {
    int datapin[8];
    int rs_pin;
    int rw_pin;
    int enable_pin;
    int bitmode;
} LCD;
every time you do something like:
LCD l;
you are allocating 8 × 16 + 16 + 16 + 16 + 16 = 12 × 16 = 192 bits = 24 bytes of memory. In C, you don't get OOP as directly as you would with C++. You use pointers to functions and things like that; pointing to a function from within a struct will cost you 2 bytes (the size of a pointer, same as size_t here), and probably a bit more for the arguments.
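Here is a small sketch of both points, under some assumptions: int16_t replaces int so the 24-byte figure also holds when you compile on a PC, and the names LCD16, LCDWithMethod and lcd_set_bitmode are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as the LCD struct, with int16_t standing in for the
   16-bit int of the ATmega. */
typedef struct LCD16 {
    int16_t datapin[8];
    int16_t rs_pin;
    int16_t rw_pin;
    int16_t enable_pin;
    int16_t bitmode;
} LCD16;

/* Poor man's OOP in C: the struct carries a pointer to a function.
   Only the pointer lives in the struct; the function's code does not. */
typedef struct LCDWithMethod {
    LCD16 pins;
    void (*set_bitmode)(struct LCDWithMethod *self, int16_t mode);
} LCDWithMethod;

/* Hypothetical "method": stores the requested bit mode. */
void lcd_set_bitmode(LCDWithMethod *self, int16_t mode)
{
    self->pins.bitmode = mode;
}

/* 12 fields of 2 bytes each, no padding needed. */
size_t lcd16_size(void)
{
    return sizeof(LCD16);
}
```

You would use it by assigning l.set_bitmode = lcd_set_bitmode once and then calling l.set_bitmode(&l, 4). The struct grows only by the size of the function pointer, however large the function itself is.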
When you create an object in C++, that object is loaded into memory. If a function is part of your object, where does that function go? Perhaps the C++ compiler works differently here than it does on x86; I don't know. Or I just have a poor understanding of how C++ actually handles objects at a low level. Enlighten me.
Ok, for the inline, someone wrote about it a bit higher, so I used it... and I need the smallest binary possible! What's the speed difference?
As for this, it depends on how many times you use that function and how big it is. If it is a big function and you call it often, inlining it will make your binary quite large, because the body is copied into every place it is called. A quick glance at the ATmega48/88/168 manual says that the branching instructions involved in function handling (relative jump, direct jump, relative subroutine call, subroutine return...) take 2 to 4 clocks to complete. If your CPU is running at 16MHz, the slowest of these instructions will take 4/16000000 = 1/4000000 seconds (0.25 microseconds) to complete. Your call on this one; you're the designer.
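The arithmetic above can be written as a tiny helper if you want to try other clock speeds. The name cycles_to_ns is just something I made up for the example.

```c
#include <stdint.h>

/* Time taken by `cycles` clock cycles at a CPU frequency of f_cpu_hz,
   in whole nanoseconds (rounded down). The multiplication is done in
   64 bits so it cannot overflow. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / f_cpu_hz);
}
```

With a 16MHz clock, cycles_to_ns(4, 16000000UL) gives 250, which is the 1/4000000 of a second from the calculation above, and a 2-clock jump comes out at 125.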