Thanks Peter, that makes sense. In the 8-bit version, maybe the compiler can't find ways to use as many registers as it can in the 16/32-bit versions, hence it's a little slower. In the 64-bit version, it doesn't have enough registers and has to start spilling partial results to the stack (i.e. RAM), causing a dramatic slowdown. All those extra load/store instructions could explain the sudden increase in program size too.
Of course, this result is going to be specific to the ATmega328. Other 8-bit processors might have more register space and could deal better with the 64-bit operations. Others with less register space might not cope as well with the 32-bit version. Certainly a 32-bit platform like the Due/Teensy 3/Spark Core/RPi would be a very different story.
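
For anyone wanting to repeat the comparison on another board, something like the following might be a starting point. It is only a sketch: the function name mixLoop and the multiply-accumulate workload are my own assumptions for illustration, not the code from the original test, so the absolute numbers will differ.

```cpp
#include <cstdint>

// Hypothetical workload (an assumption, not the original benchmark):
// a multiply-accumulate loop whose arithmetic width is set by T.
// The same source is instantiated for 8, 16, 32 and 64 bits, so any
// difference in speed or code size comes from how the compiler
// handles each width on the target.
template <typename T>
T mixLoop(uint16_t iterations) {
    T acc = 1;
    for (uint16_t i = 0; i < iterations; ++i) {
        // Cast back to T so the result wraps at the chosen width
        // even though narrow types are promoted for the arithmetic.
        acc = static_cast<T>(acc * 31 + i);
    }
    return acc;
}
```

On an Arduino you could read micros() before and after each call, e.g. mixLoop<uint8_t>(1000) versus mixLoop<uint64_t>(1000), and store the result in a volatile variable so the compiler can't optimise the loop away. Comparing the compiled sketch size for each width would show whether the program size jumps the same way.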
Thanks!
Paul