A 256 byte wavetable is worth doing for speed. If you force the waves to be aligned to a 256 byte boundary, then taking an index into the wavetable can be done by just copying a byte into the low byte of the wavetable pointer, which saved me 3 instructions per sample per oscillator (quite a lot, given each oscillator takes about 15 instructions total). 256 bytes also saves needing to do any maths on the wavetable lookup variable: you can just use its high byte as the table index directly. These are all small improvements, but compared to the original wavetable code I played with off the internet, it is something like five times as efficient with all the optimisation.
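To make the pointer trick concrete, here is a rough sketch in portable C rather than AVR assembler. It is not my actual synth code: the names `wave`, `osc_t`, `lookup_aligned` and `osc_next` are made up for illustration, and it assumes a 16-bit lookup variable whose high byte is the table index. It also assumes a compiler that honours `alignas(256)` and a flat address space where casting a pointer to `uintptr_t` and back behaves the obvious way.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* One 256-sample, 8-bit waveform forced onto a 256-byte boundary,
   so the low byte of its address is always zero. */
static alignas(256) uint8_t wave[256];

static_assert(sizeof wave == 256, "table must be exactly 256 bytes");

/* Fill the table with a rising sawtooth, so wave[i] == i. */
static void wave_init(void) {
    for (int i = 0; i < 256; i++) {
        wave[i] = (uint8_t)i;
    }
}

/* Hypothetical oscillator state. */
typedef struct {
    uint16_t phase; /* high byte = table index, low byte = fraction */
    uint16_t step;  /* amount added to phase on every sample */
} osc_t;

/* Aligned lookup: instead of adding the index to the base address,
   overwrite the low byte of the address with the index. On an AVR this
   corresponds to copying one register into the low half of the pointer. */
static uint8_t lookup_aligned(const uint8_t *base, uint8_t index) {
    uintptr_t addr = ((uintptr_t)base & ~(uintptr_t)0xFF) | index;
    return *(const uint8_t *)addr;
}

/* Produce one sample and advance the oscillator. The 16-bit phase wraps
   around on overflow, which wraps the table index for free. */
static uint8_t osc_next(osc_t *o) {
    uint8_t sample = lookup_aligned(wave, (uint8_t)(o->phase >> 8));
    o->phase = (uint16_t)(o->phase + o->step);
    return sample;
}
```

With `step = 0x0180` (one and a half table entries per sample), successive calls to `osc_next` read table indices 0, 1, 3, 4 and so on. A C compiler will not necessarily generate the single register copy for this, which is part of why doing it by hand in assembler pays off.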
AVR assembler is dead easy, especially if you use inline assembler in C code. I learnt a lot of it from reading the MeeBlip code, which is a whole synth written in AVR assembler; the rest I got from the Atmel instruction set reference. It's well worth doing: my C code could only just scrape 8 voices with no filter, whereas in assembler the 8 voices take something like 10-20% of the processor (I'm not sure how much time the filter takes, but I still seem to have a fair bit of time left for other stuff).