There is another speed optimization available when you use divmod10 to convert a long into base-10 digits: only the first few digits of a 10-digit number need full 32-bit precision. Once the high byte is 0 you only need 24-bit precision (saves at least 10 cycles), and once the two highest bytes are 0 a 16-bit divmod10 is enough (saves over a microsecond). Once the three highest bytes are 0, the 8-bit divmod10 takes only a microsecond, and the very last digit needs no division at all.

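// Top byte non-zero: full 32-bit divmod10.
while (n & 0xff000000) { divmod10_asm32(n, digit, tmp8); *--p = digit + '0'; }
// Top byte clear: 24-bit divmod10 is enough.
while (n & 0xff0000) { divmod10_asm24(n, digit, tmp8); *--p = digit + '0'; }
// Top two bytes clear: 16-bit divmod10.
while (n & 0xff00) { divmod10_asm16(n, digit, tmp8); *--p = digit + '0'; }
// Only the low byte left: 8-bit divmod10 until a single digit remains.
while ((n & 0xff) > 9) { divmod10_asm8(n, digit, tmp8); *--p = digit + '0'; }
// Last digit: n is now 0..9, so no division is needed.
*--p = n + '0';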

The downside is an increase in code size of about 200 bytes. I've attached a modified Print.cpp that includes all the assembler macros. My profiling indicates it's 0.5us-1.5us faster per digit on average, which adds up to 7us faster for 5-10 digit numbers. I have not fully tested it for accuracy yet.