Here are the results of my "low power experiments". The goal was to decrease power consumption as much as possible while retaining precise timing. Needless to say, I had to do this without an Arduino board.
Here is what I did:
1) I used an ATmega48PV, basically an Arduino chip but with less memory, in the 10 MHz version (because I intend to decrease the voltage as well). I used it without the standard 16 MHz crystal; instead I fused it to the internal 8 MHz RC oscillator divided by 8, so the clock runs at ~1 MHz.
2) I connected a 32kHz watch crystal to XTAL1 and XTAL2 (on this chip the same pins serve as TOSC1 and TOSC2, the timer2 oscillator pins).
3) I shut down all peripherals except timer2:
PRR = ~(1<<PRTIM2);
4) Just in case somebody wonders: I did not use the bootloader, I used an ISP. To be more precise, an AVRISPmkII. My cheap ISP from eBay was useless because I still cannot figure out how to decrease its SCK frequency.
5) I set the timer2 prescaler to 1024 and interrupt after 1 tick --> every 1/16s. On each wakeup I enable the pullup for the button pin, wait 5 cycles, poll whether the pin is high or low, and then disable the pullup again. Timer2 wakes the device up; in the meantime it sleeps in power-save mode.
6) After the button is pushed and released, I switch to sleeping (PWR_SAVE) for 8s at a time: wake up, count, and go back to sleep. I also enable pin change interrupts for the button that was polled before.
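For anyone who wants to check the timing in steps 5 and 6, here is a minimal sketch of the arithmetic in plain C (it runs on a PC, not on the AVR). It assumes that "interrupt after 1 tick" means a compare value of 1 in CTC mode, so the timer counts 0, 1 and then fires, and that the 8s interval is the plain 8-bit overflow after 256 counts. The function names are made up for this example.

```c
#include <stdint.h>

/* Values from the steps above: 32.768 kHz watch crystal, prescaler 1024. */
#define CRYSTAL_HZ 32768ULL
#define PRESCALER  1024ULL

/* Time in microseconds for timer2 to advance by `counts` timer ticks. */
uint64_t timer2_period_us(uint32_t counts)
{
    return (uint64_t)counts * PRESCALER * 1000000ULL / CRYSTAL_HZ;
}

/* Assumption: CTC mode, where a compare value N gives N + 1 counts per interrupt. */
uint64_t compare_period_us(uint8_t compare_value)
{
    return timer2_period_us((uint32_t)compare_value + 1U);
}

/* Assumption: free-running 8-bit timer, overflowing every 256 counts. */
uint64_t overflow_period_us(void)
{
    return timer2_period_us(256U);
}
```

With these assumptions compare_period_us(1) gives 62500 us (1/16s) and overflow_period_us() gives 8000000 us (8s), which matches the two intervals above.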
Here are my measurements running at 5V, 8MHz RC, CKDIV8, external 32kHz 7pF crystal, no additional load (but the breadboard), 20°C:
Polling every 1/16s, button not pushed --> 4-4.5 uA
Polling every 1/16s, button pushed --> 4.5uA
Counting every 8s --> 1.6uA
Conclusion: it is possible to achieve the advertised performance. Actually the device performed a little better (though of course at 5°C below the advertised temperature). Moreover, it achieved this while running the 32kHz oscillator. I am satisfied.

The biggest pitfall I found was that updates to the timer2 registers must be finished before going to sleep. The magic incantation that did the trick was:
while (ASSR & ((1<<TCN2UB)|(1<<OCR2AUB)|(1<<OCR2BUB)|(1<<TCR2AUB)|(1<<TCR2BUB))) { nop(); }
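To make the mask in that line easier to read, here is a small sketch that spells out the five "update busy" flags. The bit positions are the ones listed for ASSR in the ATmega48/88/168 datasheet; on the device avr/io.h already defines these names, so the fallback defines only matter when compiling on a PC. The helper timer2_update_pending() is a made-up name for this example, not something from avr-libc.

```c
#include <stdint.h>

/* Fallback bit positions for host builds; on the AVR, <avr/io.h> provides them. */
#ifndef TCN2UB
#define TCN2UB  4   /* TCNT2 update busy  */
#define OCR2AUB 3   /* OCR2A update busy  */
#define OCR2BUB 2   /* OCR2B update busy  */
#define TCR2AUB 1   /* TCCR2A update busy */
#define TCR2BUB 0   /* TCCR2B update busy */
#endif

#define TIMER2_BUSY_MASK ((1 << TCN2UB) | (1 << OCR2AUB) | (1 << OCR2BUB) | \
                          (1 << TCR2AUB) | (1 << TCR2BUB))

/* Returns 1 while any asynchronous timer2 register write is still pending,
 * given the current value of ASSR. Hypothetical helper for illustration. */
int timer2_update_pending(uint8_t assr)
{
    return (assr & TIMER2_BUSY_MASK) != 0;
}
```

On the chip this would be used as while (timer2_update_pending(ASSR)) { } right before going to sleep. Other ASSR bits such as AS2 are deliberately not part of the mask, since AS2 stays set for as long as the watch crystal clocks the timer.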
Additional insight: it is most probably impossible to reach such low power consumption by retrofitting a design that has already been conceived. Either you aim for <10uA from the start or you had better forget it. The point is that any processing must happen as fast as possible so that the device gets back to sleep ASAP. Even pullups become an issue, because they easily leak several 100uA while a button is pushed. So you have to control everything to get power consumption really down.
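A rough back-of-the-envelope sketch of the pullup point, again in plain C for a PC. The 20 kOhm pullup value is an assumption (the low end of the range the datasheet gives for the internal pullups), and the 5 us on-time assumes the pullup is enabled for about 5 cycles at 1 MHz; the real window is a little longer. The function names are made up for this example.

```c
/* Current in uA through a pullup while the button shorts the pin to ground. */
double pullup_current_ua(double vcc_volts, double pullup_ohms)
{
    return vcc_volts / pullup_ohms * 1e6;
}

/* Average current in uA if that current only flows for on_s out of every period_s. */
double average_current_ua(double peak_ua, double on_s, double period_s)
{
    return peak_ua * on_s / period_s;
}
```

With 5V and the assumed 20 kOhm, pullup_current_ua(5.0, 20000.0) comes out at about 250uA if the pullup were left on permanently. Enabled for only about 5 us every 1/16s, average_current_ua(250.0, 5e-6, 0.0625) is about 0.02uA, which would fit the observation that pushing the button hardly changes the measured current.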
The next experiment will focus on frequency stability at low power.
Udo