Arduino inline assembly: 16 bit x 8 bit multiplication!

The book "Beginners Introduction to the Assembly Language of ATMEL-AVR-Microprocessors" discusses multiplication.

http://www.avr-asm-tutorial.net/

The author gives an example of 16 bit x 8 bit multiplication and says it takes 10 clock cycles. However, that is for two variables, not a variable and a constant, so I suppose 10 cycles isn't too bad, even though it is more than the 7 cycles shown above for multiplying by a constant 5.

The code generated by the C compiler seems to me to take more than 10 cycles, but it isn't clear exactly what the example on that web page is counting (in other words, does the figure include loading and storing all the variables?). In fact, glancing at it, it seems to me that the code shown there takes more than 10 cycles as well.

FWIW this is what I got from the C compiler:

 c = a * b;
  d6:	20 91 00 01 	lds	r18, 0x0100
  da:	30 91 01 01 	lds	r19, 0x0101
  de:	80 91 02 01 	lds	r24, 0x0102
  e2:	90 e0       	ldi	r25, 0x00	; 0
  e4:	ac 01       	movw	r20, r24
  e6:	42 9f       	mul	r20, r18
  e8:	c0 01       	movw	r24, r0
  ea:	43 9f       	mul	r20, r19
  ec:	90 0d       	add	r25, r0
  ee:	52 9f       	mul	r21, r18
  f0:	90 0d       	add	r25, r0
  f2:	11 24       	eor	r1, r1
  f4:	90 93 17 01 	sts	0x0117, r25
  f8:	80 93 16 01 	sts	0x0116, r24

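LDS, STS and MUL are 2 cycles. LDI, MOVW, ADD and EOR are 1 cycle. So I count 22 cycles there, of which 10 are the three loads and two stores. But again, in ordinary code the compiler may not need to do some of those loads and stores if it knows it already has the variable in a register (the variables in the test sketch are volatile, which forces every one of them here).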
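For anyone who wants to check the tally, here is a minimal sketch in plain C (it runs on a PC, nothing Arduino-specific) that adds up the listing using the per-instruction costs quoted above. The enum, array and function names are made up for this illustration.

```c
#include <stddef.h>

/* Instruction kinds that appear in the compiler listing
   (these names are illustrative only). */
enum op { LDS, LDI, MOVW, MUL, ADD, EOR, STS };

/* Cycle cost per instruction, using the figures quoted above:
   LDS, STS and MUL take 2 cycles, everything else here takes 1. */
unsigned cycles_for(enum op o)
{
    switch (o) {
    case LDS:
    case STS:
    case MUL:
        return 2;
    default:            /* LDI, MOVW, ADD, EOR */
        return 1;
    }
}

/* The 14 instructions of the listing, in order. */
const enum op listing[] = {
    LDS, LDS, LDS, LDI, MOVW,
    MUL, MOVW, MUL, ADD, MUL, ADD,
    EOR, STS, STS
};

/* Sum the cycle cost of a sequence of instructions. */
unsigned total_cycles(const enum op *ops, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += cycles_for(ops[i]);
    return total;
}
```

Calling total_cycles (listing, sizeof listing / sizeof listing[0]) gives the 22 cycles counted by hand.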

It looks to me, from the LDI of zero, that the compiler is extending the byte variable to an int, which is probably why the code above is a bit longer than it needs to be: the third MUL (mul r21, r18) multiplies by that zero high byte, so it always adds zero.
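To make the three MULs easier to follow, here is a rough sketch in portable C of what the listing appears to be doing: the low 16 bits of a 16 x 16 product built from three 8 x 8 partial products. It is followed by a variant that leaves b as a byte and so needs only two. The function names are hypothetical, and I have used unsigned types to keep the arithmetic well defined (the low 16 bits come out the same for a signed int).

```c
#include <stdint.h>

/* Low 16 bits of a 16 x 16 multiply from three 8 x 8 products,
   mirroring the three MUL instructions in the listing. */
uint16_t mul16x16_low(uint16_t a, uint16_t b)
{
    uint8_t a_lo = (uint8_t)(a & 0xFF), a_hi = (uint8_t)(a >> 8);
    uint8_t b_lo = (uint8_t)(b & 0xFF), b_hi = (uint8_t)(b >> 8);

    uint16_t result = (uint16_t)a_lo * b_lo;              /* mul r20, r18 ; movw r24, r0 */
    uint8_t high = (uint8_t)(result >> 8);
    high = (uint8_t)(high + (uint8_t)((uint16_t)b_lo * a_hi)); /* mul r20, r19 ; add r25, r0 */
    high = (uint8_t)(high + (uint8_t)((uint16_t)b_hi * a_lo)); /* mul r21, r18 ; add r25, r0 */

    return (uint16_t)(((uint16_t)high << 8) | (result & 0xFF));
}

/* Same result when b really is a byte: the partial product with the
   (zero) high byte of b is dropped, leaving two multiplies. */
uint16_t mul16x8_low(uint16_t a, uint8_t b)
{
    uint16_t result = (uint16_t)((uint16_t)(a & 0xFF) * b);
    uint8_t high = (uint8_t)(result >> 8);
    high = (uint8_t)(high + (uint8_t)((uint16_t)(a >> 8) * b));

    return (uint16_t)(((uint16_t)high << 8) | (result & 0xFF));
}
```

Both functions return 672 for the values in the test sketch (42 and 16). This only shows the arithmetic; whether the compiler or hand-written assembly can actually save the third MUL in a given program is a separate question.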

Test sketch:

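// volatile stops the compiler from working out the product at compile time
volatile int a = 42;
volatile byte b = 16;
volatile int c;
void setup ()
 {
 Serial.begin (115200);
 c = a * b;   // the multiplication shown in the listing above
 Serial.println (c);
 }
void loop () {}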