Thank you, AWOL! I was thinking so hard about my problem that I forgot the most basic way to find out.
I ran some tests, sending n bytes at a time and measuring how long each send takes, and this is what I found:
Average over 121 tests is 12us (87us calculated) for 1 byte.
Average over 94 tests is 16us (174us calculated) for 2 bytes.
Average over 69 tests is 101us (260us calculated) for 3 bytes.
Average over 37 tests is 865us (1042us calculated) for 12 bytes.
I was surprised by how short the 12us delay is, but your explanation makes almost perfect sense, since my tests leave plenty of delay between iterations. There is only one point where I disagree with you: the UART TX register may be a word register rather than a byte register. Each byte takes 10 bits on the wire (start bit, 8 data bits, stop bit), so one byte time is 10/115200 s, or about 87us. If it's a byte register, then sending 2 bytes at a time should take about 10/115200 s + 12us = 99us instead of 16us, sending 3 bytes at a time should take 20/115200 s + 12us = 186us, not the 101us I measured, and sending 12 bytes should take 110/115200 s + 12us = 967us instead of 865us.
If instead the register is a word and can take 2 bytes at a time, then everything checks out.
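To make the comparison easier to check, here is a rough sketch of the arithmetic behind both guesses. It assumes 8N1 framing (10 bits per byte) and a fixed 12us overhead per call, taken from the 1-byte measurement. The names `byteTimeUs`, `predictedDelayUs`, `printPrediction` and the `bufferDepth` parameter are made up for illustration and are not part of the Arduino core. It is plain C++, so it can be run on a PC rather than on the board.

```cpp
#include <cmath>
#include <cstdio>

// Time one byte occupies the wire, in microseconds, assuming 8N1 framing
// (1 start bit + 8 data bits + 1 stop bit = 10 bits per byte).
double byteTimeUs(double baud) {
    return 10.0 * 1e6 / baud;
}

// Predicted blocking time of an n-byte send, in microseconds.
// bufferDepth is a hypothetical parameter: how many bytes the hardware
// accepts right away before the caller has to wait (1 = byte register,
// 2 = the "word register" guess).
// overheadUs is the fixed per-call cost (12us measured for 1 byte).
double predictedDelayUs(int n, int bufferDepth, double baud, double overheadUs) {
    int waited = (n > bufferDepth) ? (n - bufferDepth) : 0;
    return waited * byteTimeUs(baud) + overheadUs;
}

// Print the prediction for both guesses next to a measured value.
void printPrediction(int n, double measuredUs, double baud, double overheadUs) {
    std::printf("%2d bytes: depth 1 -> %ldus, depth 2 -> %ldus, measured %.0fus\n",
                n,
                std::lround(predictedDelayUs(n, 1, baud, overheadUs)),
                std::lround(predictedDelayUs(n, 2, baud, overheadUs)),
                measuredUs);
}
```

For example, `predictedDelayUs(12, 2, 115200.0, 12.0)` comes out at about 880us, against about 967us for a depth of 1 and the 865us I measured. For 3 bytes, a depth of 2 gives about 99us, against the 101us I measured.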
Is the UART separate from the processing unit in the ATmega328, so that while the UART does the sending and receiving, the ATmega can carry on with its own work in parallel? Thanks.