I sort of figured the table values would scale linearly with baud rate (the 9600 value would be 1/2 of the 4800 value), but that doesn't seem to be the case.

They probably have to account for some fixed processing overhead.

I tried to reverse-engineer the timings, but my results look odd. I calculated (clockrate / baudrate) / loopcount to see how many instruction cycles each loop count accounts for. For the slower baud rates the loop should make up a significant portion of the total delay, so the estimate should be fairly good there. For the 8 MHz table there seem to be about 3.5 instruction cycles per loop count. For the 16 MHz table it's 14 instruction cycles per loop count, and for the 20 MHz table it's 7. Since the code is the same for all three, it seems odd that the results aren't even close to each other across the three clock rates.
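For reference, the calculation above can be written out as below, next to a variant that compares two entries from the same table so that any fixed overhead cancels out. This is only a sketch: the function names are mine, and the sample entries (407 at 4800 baud, 198 at 9600 baud, 8 MHz clock) are hypothetical values chosen for illustration, not numbers from the real tables.

```python
def cycles_per_count_naive(clock_hz, baud, count):
    """The calculation from the post: whole bit time divided by loop count."""
    return clock_hz / baud / count


def cycles_per_count_diff(clock_hz, baud_a, count_a, baud_b, count_b):
    """Slope between two table entries; a constant overhead drops out."""
    cycles_a = clock_hz / baud_a
    cycles_b = clock_hz / baud_b
    return (cycles_a - cycles_b) / (count_a - count_b)


# Hypothetical table entries for an 8 MHz clock (not real table data).
CLOCK = 8_000_000
naive_4800 = cycles_per_count_naive(CLOCK, 4800, 407)
naive_9600 = cycles_per_count_naive(CLOCK, 9600, 198)
slope = cycles_per_count_diff(CLOCK, 4800, 407, 9600, 198)

print(naive_4800, naive_9600, slope)
```

If the naive figure drifts with baud rate while the slope stays put, that would point to fixed overhead. It would not by itself explain a 3.5 / 14 / 7 spread across clock rates, which suggests the tables may not all feed the same delay loop or use the same units.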