There was a minor glitch in all of the functions, which is now fixed. The fix costs only 2 clock cycles, so they are still very fast. I have updated the attachment in my previous post with the new version.
I have also added divu7() for division by 7. It takes slightly longer, averaging 40.5 clock cycles: 40 for inputs <= 0x7FFF and 41 for larger inputs.
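
For anyone wondering how a divide-by-7 can run in a fixed, small number of cycles, here is a rough C sketch of one common technique, reciprocal multiplication. It is not the code from the attachment, and nothing here claims divu7() works this way. The name divu7_sketch and the 16-bit operand width are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical sketch: unsigned 16-bit division by 7 without a divide.
 * Uses the 17-bit reciprocal ceil(2^19 / 7) = 0x12493. The top bit is
 * handled implicitly by adding n, so only the low 16 bits (0x2493) are
 * multiplied. */
static uint16_t divu7_sketch(uint16_t n)
{
    /* t = high 16 bits of n * 0x2493; the product fits in 32 bits */
    uint16_t t = (uint16_t)(((uint32_t)n * 0x2493u) >> 16);

    /* (n + t) >> 3 is the quotient, but n + t can overflow 16 bits.
     * Averaging as t + ((n - t) >> 1) gives the same value shifted
     * right by 1 with no overflow, since t <= n always holds. */
    uint16_t avg = (uint16_t)(t + ((uint16_t)(n - t) >> 1));

    return (uint16_t)(avg >> 2);
}
```

Because 7 * 0x12493 exceeds 2^19 by only 5, the result matches n / 7 for every 16-bit input, which can be confirmed by looping over all 65536 values and comparing against the built-in division.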