Arduino inline assembly: 16 bit x 8 bit multiplication!

Hi all,
How to multiply 8 bit unsigned int with a 16 bit int in Arduino inline assembly?

I'm trying to do very basic task of: a*5 (where is a is 16 bit unsigned int)
the code is as follow but doesn't work;

.
unsigned int a = 511;

void setup() {  
 asm volatile(  
              "mov r24, %1"       "\n\t"
              "ldi   r26, 5"         "\n\t"
              "mul r24, r26"      "\n\t"    // doesn't work... how to split r24 and r25 as HIGH and LOW byte ?
              
              "mov %0, r0"        "\n\t"
              :"+r"(a)       
               );

  Serial.begin(19200);
  Serial.print("a = ");
  Serial.println(a);
}

Thanks in advance

Why don't you look at the code generated when you perform the multiplication the usual way? Why are you trying to do this in assembler?

PaulS:
Why don't you look at the code generated when you perform the multiplication the usual way? Why are you trying to do this in assembler?

It's much faster to do in assembly.. :smiley:

Where can I find the reference for In-line Assembly in GCC? Only resource I can find is:
http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
Which is not sufficient... :frowning:

It's much faster to do in assembly.

Lot's of things that don't work are faster than those that do work. So, I'm not buying this argument (yet). If you look at the code generated by the 8 bit * 16 bit multiplication operation, and figure out what is wrong with your attempt, fix you attempt, and then find that it is still faster, then, I'll listen.

Have you read AVR201? There are 16 x 16 = 24 multiplication, which is close to what you like to accomplish, probably you can skip couple lines:

;******************************************************************************
;*
;* FUNCTION
;*	mul16x16_24
;* DECRIPTION
;*	Unsigned multiply of two 16bits numbers with 24bits result.
;* USAGE
;*	r18:r17:r16 = r23:r22 * r21:r20
;* STATISTICS
;*	Cycles :	14 + ret
;*	Words :		10 + ret
;*	Register usage: r0 to r1, r16 to r18 and r20 to r23 (9 registers)
;* NOTE
;*	Full orthogonality i.e. any register pair can be used as long as
;*	the 24bit result and the two operands does not share register pairs.
;*	The routine is non-destructive to the operands.
;*
;******************************************************************************

mul16x16_24:
	mul		r23, r21		; ah * bh
	mov		r18, r0
	mul		r22, r20		; al * bl
	movw	r17:r16, r1:r0
	mul		r23, r20		; ah * bl
	add		r17, r0
	adc		r18, r1
	mul		r21, r22		; bh * al
	add		r17, r0
	adc		r18, r1
	ret

Magician:
Have you read AVR201? There are 16 x 16 = 24 multiplication, which is close to what you like to accomplish, probably you can skip couple lines:

mul16x16_24:
mul		r23, r21		; ah * bh

Thanks a lot for your help.
This is exactly what I'm trying to do but in In-line Assembly.
The main hurdle is I'm not finding a way to to Load the 16 bit value for an integer(a) in to two different rigesters;
i.e: AVR assembly it is;

.EQU a = 511;
mov r24 = HIGH(a) ; Stores Higher 8-bit of integer 'a'
mov r25 = LOW(a) ; Stores Lower 8-bit of integer 'a'

but how to do this in In-line assembly?

Don't know assembler, but a whale ago I come across this blog
Arduino – AVR GCC multiplication | Mekonikuv blog and here macro implementation:

#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
                      //<<  Removed for this post, look on a blog full version
: \
"=&r" (longRes) \
: \
"a" (intIn1),    \<< This is you looking for?
"a" (intIn2) \
: \
"r26" \
)

Magician:
Don't know assembler, but a whale ago I come across this blog
Arduino – AVR GCC multiplication | Mekonikuv blog and here macro implementation:

#define MultiU16X16to32(longRes, intIn1, intIn2) \

asm volatile (
"clr r26 \n\t"
                      //<<  Removed for this post, look on a blog full version
:
"=&r" (longRes)
:
"a" (intIn1),    << This is you looking for?
"a" (intIn2)
:
"r26"
)

Thanks for the link! I'm going through it

PaulS:
It's much faster to do in assembly..

Are you sure now? i *= 5 compiles to:

12e:   9c 01           movw    r18, r24
 130:   22 0f           add     r18, r18
 132:   33 1f           adc     r19, r19
 134:   22 0f           add     r18, r18
 136:   33 1f           adc     r19, r19
 138:   28 0f           add     r18, r24
 13a:   39 1f           adc     r19, r25

(that's (i+i)+(i+i)+i )

Where can I find the reference for In-line Assembly in GCC? Only resource I can find is:
http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
Which is not sufficient... :([/color]
[/quote]
You probably need the gnu assembler manual (to cover all the non-cpu-specific features of the assembler), and the Atmel AVR instruction set manual (which you have to apply some salt to, since it uses a different syntax for some things than the gnu assembler.) And the device data sheet to tell you which instructions are present on the specific chip you are using. (for example, there is an 8x8 multiply instruction, but it is not present on ATtiny cpus.)

There's quite a bit of stuff on AVR assembler at http://www.avr-asm-tutorial.net/avr_en/

westfw:
Are you sure now? i *= 5 compiles to:

12e:   9c 01           movw    r18, r24

130:   22 0f           add     r18, r18
132:   33 1f           adc     r19, r19
134:   22 0f           add     r18, r18
136:   33 1f           adc     r19, r19
138:   28 0f           add     r18, r24
13a:   39 1f           adc     r19, r25



(that's (i+i)+(i+i)+i )

I can't see a way of beating that. Just for fun I tried a few other factors too, and GCC always comes up with extremely efficient code.

Ah, how I don't miss the days of writing assembler. :slight_smile:

(OTOH, for an 8-bit variable times a 16bit variable, writing assembler might very well be beneficial, because that's the sort of thing that C is defined NOT to do. It will almost certainly convert the 8bit number to a 16bit number and then do a 16x16 multiply.)

(On the third hand, the AVR multiply instruction is rather inconvenient to use from C, with more that the usual number of restrictions on which registers are used...)

I would code x = x* 5 as x = x + x <<2; if I wanted to optimize

don't know the assembly for that but it uses no multiply at all..

1 Like

DirtyBits:
It's much faster to do in assembly.. :smiley:

That is very unlikely to be true. The compiler writers know the processor backwards. As others have pointed out earlier the compiler already optimizes your requirements into six adds and a move. That's only 7 clock cycles. So even looking up "multiplication" in the assembler manual is going to lead you down the path of something that is potentially "much slower".

robtillaart:
I would code x = x* 5 as x = x + x <<2; if I wanted to optimize

Rob's idea is what I would have suggested myself. Bit shift by 2 to multiply by 4, and then add in the last one.

The other problem with assembly is that you will probably generate code to load the variables from memory, into registers. What else could you do? But the compiler may know the variables are already in certain registers and can skip that step.

This thread is starting to sound like a lot of other ones, like "how do I break out of an interrupt?". Let's step back ... WHY do you need to multiply something by 5 "much faster" than can be done in C?

WHY do you need to multiply something by 5 "much faster" than can be done in C?

because it can be done ?

1 Like

The book "Beginners Introduction to the Assembly Language of ATMEL-AVR-Microprocessors" discusses multiplication.

http://www.avr-asm-tutorial.net/

He gives an example of 16 bit x 8 bit multiplication. He says it takes 10 clock cycles. However this is for two variables, not a variable and a constant. So I suppose 10 cycles isn't too bad. But that is more than the 7 cycles shown above for multiplying by a constant 5.

The code generated by C seems to me to be more than 10 cycles, but there is a bit of a crossover from the example on that web page as to what he is counting (in other words, is he counting loading and storing all variables?). In fact glancing at it, it seems to me that the code he is showing takes more than 10 cycles.

FWIW this is what I got from the C compiler:

 c = a * b;
  d6:	20 91 00 01 	lds	r18, 0x0100
  da:	30 91 01 01 	lds	r19, 0x0101
  de:	80 91 02 01 	lds	r24, 0x0102
  e2:	90 e0       	ldi	r25, 0x00	; 0
  e4:	ac 01       	movw	r20, r24
  e6:	42 9f       	mul	r20, r18
  e8:	c0 01       	movw	r24, r0
  ea:	43 9f       	mul	r20, r19
  ec:	90 0d       	add	r25, r0
  ee:	52 9f       	mul	r21, r18
  f0:	90 0d       	add	r25, r0
  f2:	11 24       	eor	r1, r1
  f4:	90 93 17 01 	sts	0x0117, r25
  f8:	80 93 16 01 	sts	0x0116, r24

LDS, STS and MUL are 2 cycles. LDI, MOVW, ADD and EOR are 1 cycle. So I count 22 cycles there. But again, if you let the compiler do it, it may not need to do some of those loads and stores, if it knows it has the variable in a register already.

It looks to me from the LDI of zero, that the compiler is extending the byte variable to an int, which is probably why the code above is a bit longer than it needs to be.

Test sketch:

volatile int a = 42;
volatile byte b = 16;
volatile int c;
void setup ()
 {
 Serial.begin (115200);
 c = a * b;
 Serial.println (c);
 }
void loop () {}

robtillaart:
I would code x = x* 5 as x = x + x <<2; if I wanted to optimize

don't know the assembly for that but it uses no multiply at all..

Superb!

Are you sure now?  i *= 5 compiles to:
 
Code:
12e:   9c 01           movw    r18, r24
 130:   22 0f           add     r18, r18
 132:   33 1f           adc     r19, r19
 134:   22 0f           add     r18, r18
 136:   33 1f           adc     r19, r19
 138:   28 0f           add     r18, r24
 13a:   39 1f           adc     r19, r25
(that's (i+i)+(i+i)+i )

Nice one!!

Really appreciate for your help!! :smiley:

robtillaart:
I would code x = x* 5 as x = x + x <<2; if I wanted to optimize

don't know the assembly for that but it uses no multiply at all..

Actually I tried that too, but in fact it works out to be less efficient than the "double, double again, and add" sequence Bill posted. It appears that GCC knows plenty of multiplication tricks, because it generates quite different and specialised code for different multiplication factors.

Hmm - I was impressed with tim7's code until I considered that a*5 could be wider than 16 bits, in which case the algorithm would produce incorrect results. Interestingly, avr-gcc was willing to do the unsigned int computation and return the (possibly incorrect) result as an unsigned long. To force a correct result, I had to cast a to unsigned long before the multiplication:

unsigned long mul5m(unsigned a)
{  return (unsigned long) a * 5;
}

which produced:

mul5m:
  12               	/* prologue: function */
  13               	/* frame size = 0 */
  14 0000 A0E0      		ldi r26,lo8(0)
  15 0002 B0E0      		ldi r27,hi8(0)
  16 0004 282F      		mov r18,r24
  17 0006 392F      		mov r19,r25
  18 0008 4A2F      		mov r20,r26
  19 000a 5B2F      		mov r21,r27
  20 000c 220F      		lsl r18
  21 000e 331F      		rol r19
  22 0010 441F      		rol r20
  23 0012 551F      		rol r21
  24 0014 220F      		lsl r18
  25 0016 331F      		rol r19
  26 0018 441F      		rol r20
  27 001a 551F      		rol r21
  28 001c 280F      		add r18,r24
  29 001e 391F      		adc r19,r25
  30 0020 4A1F      		adc r20,r26
  31 0022 5B1F      		adc r21,r27
  32 0024 622F      		mov r22,r18
  33 0026 732F      		mov r23,r19
  34 0028 842F      		mov r24,r20
  35 002a 952F      		mov r25,r21
  36               	/* epilogue start */
  37 002c 0895      		ret

which is longer, but doesn't discard any high-order result bits.

until I considered that a*5 could be wider than 16 bits

Um, yes. That would be part of the language definition. You (normally) multiply 16bits by 16bits and get 16bits, with no error handling on overflow. I can't think of a reason for homebrew ASM to try to do better unless that was specifically part of the goal.

It looks to me from the LDI of zero, that the compiler is extending the byte variable to an int

Also required by the language definition.

I would code x = x* 5 as x = x + x <<2; if I wanted to optimize

I would say that that is as much of a mistake as coding assembler. Without evidence to the contrary, you should just trust the compiler to do reasonable optimization of a clear piece of source code: "x = x * 5;"

(that's (i+i)+(i+i)+i ) [what gcc generates.]

In this case, since the AVR has neither a 16byte shift, nor a multibit shift, the best possible implementation using shifts is going to be the same as the implementation using adds. "i+i" is left shift once. "(i+i)+(i+i)" is left shift twice. In fact, on the AVR, the "LSL Rd" (Logical Shift Left) instruction is just an alias for "ADD Rd, Rd"

The nice thing about using a compiler, and the reason that 'large' programs tend to be smaller and faster when written in C rather than ASM, is that this sort of optimization will be applied ALL OVER the program, and not just in the places where you remember to optimize by hand. (actually, the nice part about using the compiler is that you don't have to figure out how to do an 8x16 multiply given an 8x8 multiply instruction (or nothing) in the first place!)

westfw:
In this case, since the AVR has neither a 16byte shift, nor a multibit shift, the best possible implementation using shifts is going to be the same as the implementation using adds. "i+i" is left shift once. "(i+i)+(i+i)" is left shift twice. In fact, on the AVR, the "LSL Rd" (Logical Shift Left) instruction is just an alias for "ADD Rd, Rd"

The nice thing about using a compiler, and the reason that 'large' programs tend to be smaller and faster when written in C rather than ASM, is that this sort of optimization will be applied ALL OVER the program, and not just in the places where you remember to optimize by hand. (actually, the nice part about using the compiler is that you don't have to figure out how to do an 8x16 multiply given an 8x8 multiply instruction (or nothing) in the first place!)

Thanks for the valuable info... :smiley: