Go Down

Topic: Arduino inline assembly: 16 bit x 8 bit multiplication! (Read 3 times) previous topic - next topic

DirtyBits

Hi all,
How to multiply 8 bit unsigned int with a 16 bit int in Arduino inline assembly?

I'm trying to do very basic task of: a*5   (where is a is 16 bit unsigned int)
the code is as follow but doesn't work;

Code: [Select]

.
unsigned int a = 511;

void setup() { 
asm volatile( 
              "mov r24, %1"       "\n\t"
              "ldi   r26, 5"         "\n\t"
              "mul r24, r26"      "\n\t"    // doesn't work... how to split r24 and r25 as HIGH and LOW byte ?
             
              "mov %0, r0"        "\n\t"
              :"+r"(a)       
               );

  Serial.begin(19200);
  Serial.print("a = ");
  Serial.println(a);
}


Thanks in advance

PaulS

Why don't you look at the code generated when you perform the multiplication the usual way? Why are you trying to do this in assembler?

DirtyBits

#2
Mar 21, 2012, 01:10 pm Last Edit: Mar 21, 2012, 01:12 pm by DirtyBits Reason: 1

Why don't you look at the code generated when you perform the multiplication the usual way? Why are you trying to do this in assembler?


It's much faster to do in assembly.. :D

Where can I find the reference for In-line Assembly in GCC? Only resource I can find is:
http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
Which is not sufficient... :([/color]

PaulS

Quote
It's much faster to do in assembly.

Lot's of things that don't work are faster than those that do work. So, I'm not buying this argument (yet). If you look at the code generated by the 8 bit * 16 bit multiplication operation, and figure out what is wrong with your attempt, fix you attempt, and then find that it is still faster, then, I'll listen.

Magician

Have you read AVR201? There are 16 x 16 = 24 multiplication, which is close to what you like to accomplish, probably you can skip couple lines:
Code: [Select]
;******************************************************************************
;*
;* FUNCTION
;* mul16x16_24
;* DECRIPTION
;* Unsigned multiply of two 16bits numbers with 24bits result.
;* USAGE
;* r18:r17:r16 = r23:r22 * r21:r20
;* STATISTICS
;* Cycles : 14 + ret
;* Words : 10 + ret
;* Register usage: r0 to r1, r16 to r18 and r20 to r23 (9 registers)
;* NOTE
;* Full orthogonality i.e. any register pair can be used as long as
;* the 24bit result and the two operands does not share register pairs.
;* The routine is non-destructive to the operands.
;*
;******************************************************************************

mul16x16_24:
mul r23, r21 ; ah * bh
mov r18, r0
mul r22, r20 ; al * bl
movw r17:r16, r1:r0
mul r23, r20 ; ah * bl
add r17, r0
adc r18, r1
mul r21, r22 ; bh * al
add r17, r0
adc r18, r1
ret


DirtyBits


Have you read AVR201? There are 16 x 16 = 24 multiplication, which is close to what you like to accomplish, probably you can skip couple lines:
Code: [Select]

mul16x16_24:
mul r23, r21 ; ah * bh



Thanks a lot for your help.
This is exactly what I'm trying to do but in In-line Assembly.
The main hurdle is I'm not finding a way to to Load the 16 bit value for an integer(a) in to two different rigesters;
i.e: AVR assembly it is;

.EQU  a = 511;
mov r24 = HIGH(a) ; Stores Higher 8-bit of integer 'a'
mov r25 = LOW(a) ; Stores Lower 8-bit of integer 'a'

but how to do this in In-line assembly?


Magician

Don't know assembler, but a whale ago I come across this blog
http://mekonik.wordpress.com/2009/03/18/arduino-avr-gcc-multiplication/ and here macro implementation:
Code: [Select]
#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
                      //<<  Removed for this post, look on a blog full version
: \
"=&r" (longRes) \
: \
"a" (intIn1),    \<< This is you looking for?
"a" (intIn2) \
: \
"r26" \
)

DirtyBits


Don't know assembler, but a whale ago I come across this blog
http://mekonik.wordpress.com/2009/03/18/arduino-avr-gcc-multiplication/ and here macro implementation:
Code: [Select]
#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
                      //<<  Removed for this post, look on a blog full version
: \
"=&r" (longRes) \
: \
"a" (intIn1),    \<< This is you looking for?
"a" (intIn2) \
: \
"r26" \
)



Thanks for the link! I'm going through it

westfw


It's much faster to do in assembly..



Are you sure now?  i *= 5 compiles to:
Code: [Select]
12e:   9c 01           movw    r18, r24
130:   22 0f           add     r18, r18
132:   33 1f           adc     r19, r19
134:   22 0f           add     r18, r18
136:   33 1f           adc     r19, r19
138:   28 0f           add     r18, r24
13a:   39 1f           adc     r19, r25

(that's (i+i)+(i+i)+i )

Where can I find the reference for In-line Assembly in GCC? Only resource I can find is:
http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
Which is not sufficient... :([/color]
[/quote]
You probably need the gnu assembler manual (to cover all the non-cpu-specific features of the assembler), and the Atmel AVR instruction set manual (which you have to apply some salt to, since it uses a different syntax for some things than the gnu assembler.)  And the device data sheet to tell you which instructions are present on the specific chip you are using.  (for example, there is an 8x8 multiply instruction, but it is not present on ATtiny cpus.)

tim7

There's quite a bit of stuff on AVR assembler at http://www.avr-asm-tutorial.net/avr_en/


Are you sure now?  i *= 5 compiles to:
Code: [Select]
12e:   9c 01           movw    r18, r24
130:   22 0f           add     r18, r18
132:   33 1f           adc     r19, r19
134:   22 0f           add     r18, r18
136:   33 1f           adc     r19, r19
138:   28 0f           add     r18, r24
13a:   39 1f           adc     r19, r25

(that's (i+i)+(i+i)+i )


I can't see a way of beating that.  Just for fun I tried a few other factors too, and GCC always comes up with extremely efficient code.

Ah, how I don't miss the days of writing assembler.  :)

westfw

(OTOH, for an 8-bit variable times a 16bit variable, writing assembler might very well be beneficial, because that's the sort of thing that C is defined NOT to do.  It will almost certainly convert the 8bit number to a 16bit number and then do a 16x16 multiply.)

(On the third hand, the AVR multiply instruction is rather inconvenient to use from C, with more that the usual number of restrictions on which registers are used...)

robtillaart


I would code x = x* 5  as   x = x + x <<2; if I wanted to optimize

don't know the assembly for that but it uses no multiply at all..
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Nick Gammon


It's much faster to do in assembly.. :D


That is very unlikely to be true. The compiler writers know the processor backwards. As others have pointed out earlier the compiler already optimizes your requirements into six adds and a move. That's only 7 clock cycles. So even looking up "multiplication" in the assembler manual is going to lead you down the path of something that is potentially "much slower".


I would code x = x* 5  as   x = x + x <<2; if I wanted to optimize


Rob's idea is what I would have suggested myself. Bit shift by 2 to multiply by 4, and then add in the last one.

The other problem with assembly is that you will probably generate code to load the variables from memory, into registers. What else could you do? But the compiler may know the variables are already in certain registers and can skip that step.

This thread is starting to sound like a lot of other ones, like "how do I break out of an interrupt?". Let's step back ... WHY do you need to multiply something by 5 "much faster" than can be done in C?

robtillaart

Quote
WHY do you need to multiply something by 5 "much faster" than can be done in C?

because it can be done ?
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Nick Gammon

The book "Beginners Introduction to the Assembly Language of ATMEL-AVR-Microprocessors" discusses multiplication.

http://www.avr-asm-tutorial.net/

He gives an example of 16 bit x 8 bit multiplication. He says it takes 10 clock cycles. However this is for two variables, not a variable and a constant. So I suppose 10 cycles isn't too bad. But that is more than the 7 cycles shown above for multiplying by a constant 5.

The code generated by C seems to me to be more than 10 cycles, but there is a bit of a crossover from the example on that web page as to what he is counting (in other words, is he counting loading and storing all variables?). In fact glancing at it, it seems to me that the code he is showing takes more than 10 cycles.

FWIW this is what I got from the C compiler:

Code: [Select]
c = a * b;
  d6: 20 91 00 01 lds r18, 0x0100
  da: 30 91 01 01 lds r19, 0x0101
  de: 80 91 02 01 lds r24, 0x0102
  e2: 90 e0        ldi r25, 0x00 ; 0
  e4: ac 01        movw r20, r24
  e6: 42 9f        mul r20, r18
  e8: c0 01        movw r24, r0
  ea: 43 9f        mul r20, r19
  ec: 90 0d        add r25, r0
  ee: 52 9f        mul r21, r18
  f0: 90 0d        add r25, r0
  f2: 11 24        eor r1, r1
  f4: 90 93 17 01 sts 0x0117, r25
  f8: 80 93 16 01 sts 0x0116, r24


LDS, STS and MUL are 2 cycles. LDI, MOVW, ADD and EOR are 1 cycle. So I count 22 cycles there. But again, if you let the compiler do it, it may not need to do some of those loads and stores, if it knows it has the variable in a register already.

It looks to me from the LDI of zero, that the compiler is extending the byte variable to an int, which is probably why the code above is a bit longer than it needs to be.

Test sketch:

Code: [Select]
volatile int a = 42;
volatile byte b = 16;
volatile int c;
void setup ()
{
Serial.begin (115200);
c = a * b;
Serial.println (c);
}
void loop () {}

Go Up