AVR Math Optimization

Continuation of http://forum.arduino.cc/index.php?topic=335733.0 focusing on math optimizations.

[quote author=Coding Badly link=msg=2315254 date=1436838848] f1 = f1 * 1.10; f1 is float. 6.099 µs.

l1 = l1 / 10; l1 is long. 37.665 µs.

s1 = s1 / 10; s1 is short. 13.771 µs.

Tested on an Uno R3. Times adjusted for one load and one store.

Heed @MorganS's advice. And, don't "optimize" unless you know where the problem actually lies.

[/quote]

There's a guy called AddOhms with a channel on YouTube. He's got a song named "Post Your Code". Hint, hint.

Not going to happen (yet). For one simple reason. Before the numbers I posted can have any merit someone else needs to independently reproduce them. Someone else rerunning my code is hardly independent.

I got similar results:

``````float (*1.10): 14.592 µs
long  (/10)  : 77.467 µs
short (/10)  : 28.926 µs
``````

This is on an 8 MHz Pro Mini.

I was going to look at places where compiler optimizer shortcuts might come into it.

Instead of multiplying by 110 and then dividing by 100, would multiplying by 141 and dividing by 128 work better? You can then cheat on the division, since dividing by 128 is just a right shift by 7.
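A minimal sketch of the 141/128 idea, assuming an unsigned 32-bit value and that the small error from approximating 1.10 with 141/128 = 1.1015625 (about 0.14% high) is acceptable. The function name `scale_1_10_approx` is made up for illustration and does not come from this thread.

```cpp
#include <stdint.h>

// Hypothetical helper: approximates x * 1.10 as (x * 141) / 128.
// 141/128 = 1.1015625, so the result runs about 0.14% high.
// Assumes x * 141 fits in 32 bits (x below roughly 30 million).
static inline uint32_t scale_1_10_approx(uint32_t x)
{
  return (x * 141u) >> 7;   // dividing by 128 is a right shift by 7
}
```

For example, an input of 1000 gives 1101 rather than the exact 1100. Whether that trade is worth it depends on how much error the application tolerates, and the timing would still need to be measured on the target.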

But really I like MorganS' answer better.

GoForSmoke:
I was going to look at places where compiler optimizer shortcuts might come into it.

Floating-point multiplication…

``````float f1;

static void tf2( void )
{
f1 = f1 * 1.10;
}
``````
``````00000158 <_ZL3tf2v>:
158:	2d ec       	ldi	r18, 0xCD	; 205
15a:	3c ec       	ldi	r19, 0xCC	; 204
15c:	4c e8       	ldi	r20, 0x8C	; 140
15e:	5f e3       	ldi	r21, 0x3F	; 63
160:	60 91 28 01 	lds	r22, 0x0128
164:	70 91 29 01 	lds	r23, 0x0129
168:	80 91 2a 01 	lds	r24, 0x012A
16c:	90 91 2b 01 	lds	r25, 0x012B
170:	0e 94 0a 08 	call	0x1014	; 0x1014 <__mulsf3>
174:	60 93 28 01 	sts	0x0128, r22
178:	70 93 29 01 	sts	0x0129, r23
17c:	80 93 2a 01 	sts	0x012A, r24
180:	90 93 2b 01 	sts	0x012B, r25
184:	08 95       	ret
``````

Two 32-bit loads (the constant 1.10 and f1), a call to __mulsf3, one 32-bit store. No optimization.

long division…

``````long l1;

static void tl4( void )
{
l1 = l1 / 10;
}
``````
``````000001a0 <_ZL3tl4v>:
1a0:	60 91 24 01 	lds	r22, 0x0124
1a4:	70 91 25 01 	lds	r23, 0x0125
1a8:	80 91 26 01 	lds	r24, 0x0126
1ac:	90 91 27 01 	lds	r25, 0x0127
1b0:	2a e0       	ldi	r18, 0x0A	; 10
1b2:	30 e0       	ldi	r19, 0x00	; 0
1b4:	40 e0       	ldi	r20, 0x00	; 0
1b6:	50 e0       	ldi	r21, 0x00	; 0
1b8:	0e 94 a3 08 	call	0x1146	; 0x1146 <__divmodsi4>
1bc:	20 93 24 01 	sts	0x0124, r18
1c0:	30 93 25 01 	sts	0x0125, r19
1c4:	40 93 26 01 	sts	0x0126, r20
1c8:	50 93 27 01 	sts	0x0127, r21
1cc:	08 95       	ret
``````

Two 32-bit loads (l1 and the constant 10), a call to __divmodsi4, one 32-bit store. No optimization.

short division…

``````short s1;

static void ts7( void )
{
s1 = s1 / 10;
}
``````
``````00000212 <_ZL3ts7v>:
212:	80 91 22 01 	lds	r24, 0x0122
216:	90 91 23 01 	lds	r25, 0x0123
21a:	6a e0       	ldi	r22, 0x0A	; 10
21c:	70 e0       	ldi	r23, 0x00	; 0
21e:	0e 94 6d 08 	call	0x10da	; 0x10da <__divmodhi4>
222:	70 93 23 01 	sts	0x0123, r23
226:	60 93 22 01 	sts	0x0122, r22
22a:	08 95       	ret
``````

Two 16-bit loads (s1 and the constant 10), a call to __divmodhi4, one 16-bit store. No optimization.

[quote author=Coding Badly link=msg=2315407 date=1436853299] long division... ... No optimization. [/quote] I wonder why the compiler is so dumb.

Indeed, dumber than a box of rocks.

Who do I complain to?

This is a contrived example to show a particular point. I suspect the compiler is detecting this construct and it 'knows' that you're testing it. Most real-world usage will be optimised in surprising ways.

``````#define MY_CONST 34
int OneTenth = MY_CONST/10;
``````

This will just load 3 into the variable OneTenth and, depending on how it's used in the program, it may not even load a SRAM memory location at all.
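To make the contrast concrete, here is a small sketch of the two cases, assuming a C++11-capable compiler. The names `OneTenthConst` and `tenth_at_runtime` are invented for this illustration.

```cpp
#define MY_CONST 34

// Both operands are compile-time constants, so the compiler folds the
// division to 3. The static_assert only compiles because the value is
// known at compile time.
constexpr int OneTenthConst = MY_CONST / 10;
static_assert(OneTenthConst == 3, "34 / 10 folds to 3 at compile time");

// Here the operand is only known at run time, so the compiler generally
// has to emit a real division (on AVR, a library call such as the
// __divmodhi4 seen in the listings earlier in the thread).
int tenth_at_runtime(int x)
{
  return x / 10;
}
```

The timings quoted earlier apply to the second case only. The first case costs nothing at run time.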

MorganS: This is a contrived example to show a particular point. I suspect the compiler is detecting this construct and it 'knows' that you're testing it.

Suppose I use division by 10, 100, 1000, etc., to get at a number's digits. How would the compiler deal with that, I wonder?

Oh @MorganS, I fear you are trying to explain to someone with 4OI.

@odometer, please try to make some effort to stay on-topic.

In the past here I've posted examples that compared calculations run thousands of times, and it took a couple of tries to get the compiler to not optimize away what I was testing out.

Maybe we need to use these?

GoForSmoke: In the past here I've posted examples that compared calculations run thousands of times, and it took a couple of tries to get the compiler to not optimize away what I was testing out.

Do these two things to force the compiler's hand...

1. Put the code to be tested in a separate function.
2. Call the function using a pointer.

As an added bonus, it then becomes trivial to measure the overhead so that can be subtracted from the result.
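The two steps and the overhead subtraction could look roughly like the sketch below. This is not the test code used for the numbers in this thread (that has not been posted). It is written as portable C++ with `std::chrono` so it can be tried on a desktop; on an Arduino, `micros()` would take the place of the clock calls. The names `time_calls`, `net_microseconds`, `test_long_div` and `test_nothing` are made up, and marking `l1` as `volatile` is this sketch's own addition.

```cpp
#include <stdint.h>
#include <chrono>

// volatile is this sketch's addition, so the load and store cannot be dropped.
volatile int32_t l1 = 123456789L;

static void test_long_div(void) { l1 = l1 / 10; }  // step 1: code under test in its own function
static void test_nothing(void)  { }                // empty function: measures call overhead only

typedef void (*test_fn)(void);

// Step 2: call fn through a pointer, n times, and return elapsed microseconds.
static double time_calls(test_fn fn, uint32_t n)
{
  test_fn volatile p = fn;  // volatile pointer: the call cannot be inlined or folded
  auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < n; ++i) {
    p();
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(stop - start).count();
}

// Per-call cost of fn with the empty-function overhead subtracted.
// n must be greater than zero.
static double net_microseconds(test_fn fn, uint32_t n)
{
  return (time_calls(fn, n) - time_calls(test_nothing, n)) / n;
}
```

A call such as `net_microseconds(test_long_div, 1000)` then gives an estimate of the per-call cost with loop and call overhead removed. Note that repeated division drives `l1` to zero quickly, so a real test would probably reset the operand or use varied inputs.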

GoForSmoke: Maybe we need to use these?

I will try to make time this week to give those a spin and post the code I use for testing.

Feel free to "bump" if I forget / you get impatient.

Related rabbit hole, especially the link to "Hacker's Delight".

[quote author=Coding Badly link=msg=2316597 date=1436911310] Do these two things to force the compiler's hand... 1. Put the code to be tested in a separate function. 2. Call the function using a pointer.

As an added bonus, it then becomes trivial to measure the overhead so that can be subtracted from the result.

[/quote]

Cool! +1 to you.

[quote author=Coding Badly link=msg=2316600 date=1436911443] I will try to make time this week to give those a spin and post the code I use for testing.

Feel free to "bump" if I forget / you get impatient.

[/quote]

I'm not impatient. I'm using what I know while trying to relearn what I lost. These things are all "future things to deal with" while hoping the compiler catches them first!

Things have changed a lot since I was using edline to write my code, and that was a huge step from punch cards.

For the /10, we had the divmod10() discussion two years ago: http://forum.arduino.cc/index.php?topic=167414.0
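For readers who have not followed that link, here is a sketch of what a shift-and-add divide-by-10 can look like. It follows the well-known `divu10` routine from Hacker's Delight and is not necessarily the `divmod10()` that the linked thread settled on. The wrapper name `divmod10_sketch` is hypothetical.

```cpp
#include <stdint.h>

// Unsigned divide-by-10 using only shifts and adds,
// after the divu10 routine in Hacker's Delight.
static uint32_t divu10(uint32_t n)
{
  uint32_t q = (n >> 1) + (n >> 2);      // q ~ n * 0.75
  q += q >> 4;                           // refine toward n * 0.8
  q += q >> 8;
  q += q >> 16;
  q >>= 3;                               // q ~ n / 10, possibly one too small
  uint32_t r = n - ((q << 3) + (q << 1)); // remainder estimate: n - q * 10
  return q + ((r + 6) >> 4);             // add 1 when the estimate is 10 or more
}

// Hypothetical wrapper returning quotient and remainder together.
static void divmod10_sketch(uint32_t n, uint32_t &quot, uint8_t &rem)
{
  quot = divu10(n);
  rem  = (uint8_t)(n - ((quot << 3) + (quot << 1)));
}
```

AVR has no barrel shifter, so multi-bit shifts are not free there. Whether this beats the library division on an Uno would have to be measured, ideally with the function-pointer approach described earlier in the thread.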