I think you could do it much faster with SPI.
There's a little trick for calculating the bit values:
you can calculate every bit in only 5 cycles and without branching.
Just take the carry flag left by the cp instruction and shift it into a register, like this:
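cp value, counter    ; compare; the outcome lands in the carry flag
rol result           ; assumed shift step: rotate the carry into bit 0 ("result" is a placeholder register name)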
(I think the carry flag is set when the counter is greater than the value, so this is exactly what you need for PWM.)
So after executing this 8 times you have 8 values packed into one byte, which you can now send over SPI.
If you put your PWM values in an array, you can generate a whole byte with this code:
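ldi r20, 8           ; loop counter: 8 bits per byte
loop: ld r21, X+     ; load the next PWM value from the array
cp r21, counter      ; carry = 1 if counter > value
rol r22              ; assumed completion: rotate the carry into r22
dec r20              ; assumed completion: one bit done (dec leaves the carry alone)
brne loop            ; assumed completion: repeat until all 8 values are processed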
>> at this point you have calculated a whole byte (in r22) in 1+8*(2+2+1+1)+7*2+1 = 1+8*6+14+1 = 64 cycles by my rough count
(I have not tested it, so this might not be completely right, but I think you get the idea.)
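
If you want to check the bit order before trying it on the chip, here is a small C model of the same idea that you can compile on a PC. It is only a sketch, not AVR code: pack_pwm_byte is a made-up helper name, and it assumes the loop shifts with rol, so the first array entry ends up in bit 7 and the last one in bit 0.

```c
#include <stdint.h>

/* Host-side model of the cp/rol trick (hypothetical helper, not AVR code).
 * values[]: eight PWM compare values, counter: current PWM counter. */
static uint8_t pack_pwm_byte(const uint8_t values[8], uint8_t counter)
{
    uint8_t packed = 0;
    for (int i = 0; i < 8; i++) {
        /* cp value, counter: carry is set when counter > value (unsigned) */
        uint8_t carry = (uint8_t)(counter > values[i]);
        /* rol: shift left by one, the carry enters at bit 0 */
        packed = (uint8_t)((packed << 1) | carry);
    }
    return packed;
}
```

For example, with the values {0, 32, 64, 96, 128, 160, 192, 255} and counter = 100, the first four compares set the carry and the last four do not, so the function returns 0xF0. Note that a bit is 1 while the counter is above the value, so a larger value means the bit stays 0 for longer.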