It's much faster to do in assembly.
Lots of things that don't work are faster than those that do, so I'm not buying this argument (yet). Look at the code the compiler generates for the 8-bit * 16-bit multiplication, figure out what is wrong with your attempt, and fix your attempt. If it is still faster after that, then I'll listen.
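One way to settle the "does it work" half before timing anything is to check the hand-written routine against the compiler's own multiply for every possible input. The sketch below is only an illustration: `mul8x16_ref`, `mul8x16_candidate`, and `count_mismatches` are made-up names, and since the assembly from this thread isn't shown, the candidate is a plain C shift-and-add stand-in that you would swap for a call to your own routine.

```c
#include <stdint.h>

/* Reference: let the compiler generate the 8-bit * 16-bit multiply.
   The cast makes the arithmetic 32 bits wide, so the 24-bit result
   cannot overflow even where int is only 16 bits. */
static uint32_t mul8x16_ref(uint8_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* Hypothetical stand-in for the hand-written routine (assumption:
   the real one would be assembly). Classic shift-and-add multiply. */
static uint32_t mul8x16_candidate(uint8_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t addend = b;

    while (a != 0) {
        if (a & 1u) {
            acc += addend;      /* add b shifted to this bit position */
        }
        addend <<= 1;
        a >>= 1;
    }
    return acc;
}

/* Compare both versions over all 256 * 65536 input pairs and
   return how many results differ. */
static unsigned long count_mismatches(void)
{
    unsigned long bad = 0;

    for (uint32_t a = 0; a <= 0xFFu; a++) {
        for (uint32_t b = 0; b <= 0xFFFFu; b++) {
            if (mul8x16_ref((uint8_t)a, (uint16_t)b) !=
                mul8x16_candidate((uint8_t)a, (uint16_t)b)) {
                bad++;
            }
        }
    }
    return bad;
}
```

Call `count_mismatches()` from your test program; only once it returns 0 for your routine is a speed comparison against the compiler's output worth making.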