Just for interest I decided to look at the assembler for the speed test above. You can clearly see it performs an AND immediate (andi) with 127 to do the mod 128.
The strange thing is that it looks like it has promoted the uint8_t to 16 bits, possibly because of the 0xA5 assignment.
Ah yes, it's adding the int i that causes the promotion to 16 bits.
1be:  0e 94 8f 01  call  0x31e       ; 0x31e <micros>
1c2:  1b 01        movw  r2, r22
1c4:  2c 01        movw  r4, r24
1c6:  85 ea        ldi   r24, 0xA5   ; 165
1c8:  90 e0        ldi   r25, 0x00   ; 0
1ca:  fc 01        movw  r30, r24
1cc:  ef 77        andi  r30, 0x7F   ; 127
1ce:  f0 70        andi  r31, 0x00   ; 0
1d0:  fb 83        std   Y+3, r31    ; 0x03
1d2:  ea 83        std   Y+2, r30    ; 0x02
1d4:  01 96        adiw  r24, 0x01   ; 1
1d6:  f2 e0        ldi   r31, 0x02   ; 2
1d8:  89 39        cpi   r24, 0x99   ; 153
1da:  9f 07        cpc   r25, r31
1dc:  b1 f7        brne  .-20        ; 0x1ca <setup+0x7e>
1de:  0e 94 8f 01  call  0x31e       ; 0x31e <micros>
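
To make the promotion point concrete, here is a rough sketch in C. The function names are made up for illustration, and this is only a guess at the shape of the expression, not the actual speed test code. It assumes the test computes something like (value + i) % 128 with a uint8_t value and an int counter. In C the uint8_t operand is promoted to int before the addition, and on AVR int is 16 bits, which would explain the register pair r24:r25 and the two andi instructions in the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: size of the type that uint8_t + int evaluates to.
   Integer promotion makes this sizeof(int): 2 on AVR, usually 4 on a PC. */
size_t promoted_size(void) {
    uint8_t v = 0xA5;
    int i = 0;
    return sizeof(v + i);
}

/* Hypothetical guess at the expression in the speed test:
   v is promoted to int, so the add and the mask are done at int width. */
int mod128_promoted(uint8_t v, int i) {
    return (v + i) % 128;
}

/* Assumed alternative: truncate the sum to 8 bits before masking.
   Because 256 is a multiple of 128, the low 7 bits are unchanged,
   so the result matches the promoted version for non-negative i. */
uint8_t mod128_byte(uint8_t v, uint8_t i) {
    return (uint8_t)(v + i) & 0x7F;
}
```

With the cast in mod128_byte the compiler should be free to keep the whole calculation in a single 8-bit register, although I have not checked what avr-gcc actually emits for it.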