The code has a "DebugPulse()" call in the inner loop. It looks like the intent was to fine-tune the constants with an oscilloscope... but once the "#define _DEBUG 1" is taken out, the timing changes!
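For anyone who hasn't looked at the source lately, the pattern is roughly this (paraphrased from memory, not verbatim from the library):

    #include <Arduino.h>

    // With "#define _DEBUG 1" in effect, the first branch is compiled and
    // every pass through the RX inner loop spends extra cycles toggling a
    // pin for the scope.  With _DEBUG removed, those cycles vanish and the
    // constants tuned with the pulses present no longer match.
    #if _DEBUG
    inline void DebugPulse(uint8_t pin, uint8_t count)
    {
      volatile uint8_t *port = portOutputRegister(digitalPinToPort(pin));
      uint8_t val = *port;
      while (count--)
      {
        *port = val | digitalPinToBitMask(pin);   // pin high
        *port = val;                              // pin low again
      }
    }
    #else
    inline void DebugPulse(uint8_t, uint8_t) {}   // compiles to nothing
    #endif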
My calculations show:
8 MHz: subtract 8.5 cycles and divide the remainder by 3.5
16 MHz: subtract 65 cycles and divide the remainder by 14.0
20 MHz: subtract 13 cycles and divide the remainder by 7.0
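Put as code, those three rules look like this (the helper and its names are mine, just restating the arithmetic above; nothing here is taken from SoftwareSerial):

    #include <stdint.h>

    // Convert clock frequency and baud rate into a delay-table value using
    // the averaged subtract/divide factors worked out above.
    static uint16_t delayCount(uint32_t clockHz, uint32_t baud)
    {
      float cyclesPerBit = (float)clockHz / (float)baud;
      float overhead, divisor;
      switch (clockHz)
      {
        case  8000000UL: overhead =  8.5f; divisor =  3.5f; break;
        case 16000000UL: overhead = 65.0f; divisor = 14.0f; break;
        case 20000000UL: overhead = 13.0f; divisor =  7.0f; break;
        default:         return 0;   // no rule worked out for other clocks
      }
      return (uint16_t)((cyclesPerBit - overhead) / divisor + 0.5f);  // round
    }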
For example, 7800 baud:
8,000,000/7800 = 1025.64; 1025.64 - 8.5 = 1017.14; 1017.14 / 3.5 ≈ 291 (9600->236, 4800->474)
16,000,000/7800 = 2051.28; 2051.28 - 65 = 1986.28; 1986.28 / 14 ≈ 142 (9600->114, 4800->233)
20,000,000/7800 = 2564.10; 2564.10 - 13 = 2551.10; 2551.10 / 7 ≈ 364 (9600->297, 4800->595)
The results seem to fall within the expected range for each clock: greater than the 9600-baud constant and less than the 4800-baud constant.
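Here's that check as a quick PC-side program (the 9600/4800 numbers are just the table values quoted above, repeated here as data):

    #include <cstdio>

    int main()
    {
      // clock, subtract, divide, and the quoted 9600/4800 table values
      struct Row { double clk, overhead, divisor; unsigned at9600, at4800; };
      const Row rows[] = {
        {  8e6,  8.5,  3.5, 236, 474 },
        { 16e6, 65.0, 14.0, 114, 233 },
        { 20e6, 13.0,  7.0, 297, 595 },
      };
      for (const Row &r : rows)
      {
        // same arithmetic as above, rounded to the nearest count
        unsigned d = (unsigned)((r.clk / 7800.0 - r.overhead) / r.divisor + 0.5);
        std::printf("%.0f MHz: 7800 baud -> %u (between %u and %u: %s)\n",
                    r.clk / 1e6, d, r.at9600, r.at4800,
                    (d > r.at9600 && d < r.at4800) ? "yes" : "no");
      }
      return 0;
    }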
The various factors are averages. For 8 MHz the subtracted factor ranges from 7.167 to 11.0; for 16 MHz it's 58.667 to 68.889; for 20 MHz it's the worst, 1.667 to 46.222. Fortunately, at that clock speed the spread is only a couple of microseconds. What I'm wondering is how the same loop, run at different clock speeds, can have such vastly different numbers of clock cycles for overhead and loop time.
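For reference, this is how I'm backing the per-entry overhead out of the table, holding the divisor fixed (only the two 8 MHz entries quoted above are shown; the other rows are handled the same way, which is where the 7.167 to 11.0 spread and the 8.5 average come from):

    #include <cstdio>

    // Implied overhead for one table entry:
    //   overhead = F_CPU/baud - divisor * tableValue
    static double impliedOverhead(double clockHz, double baud,
                                  double divisor, unsigned tableValue)
    {
      return clockHz / baud - divisor * tableValue;
    }

    int main()
    {
      std::printf("8 MHz, 9600 baud: %.3f cycles\n",
                  impliedOverhead(8e6, 9600, 3.5, 236));
      std::printf("8 MHz, 4800 baud: %.3f cycles\n",
                  impliedOverhead(8e6, 4800, 3.5, 474));
      return 0;
    }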