optimizing avr-gcc

Hi,

I consider myself an experienced programmer, but not mainly in C. I have some C experience, mainly on Arduinos, but I’m not a super C guru. This is the first time I’m really trying to optimize C code to squeece out the last couple of microseconds.

I’m using an Arduino Mega to read 24 pins as frequently as possible. I have 24 S0 counters connected to those pins. I’ve put the 24 pins in INPUT_PULLUP mode. The S0 counters will short the PINs to GND every now and again, and I want to count those occurances. The pulse width can be as little as 50us, so the code needs to be reasonably fast.

While optimizing the code, a couple of questions came up:

  1. Is there any way of telling avr-gcc to unroll a loop? I found that manually unrolling the for loop reduces my runtime by almost 50%, but I’m unable to tell the compiler to do it. attribute((optimize(“unroll-loops”))) seems to be ignored.

  2. Replacing count[i * 8 + pos]++; with count[(i << 3) + pos]++; gives a speed improvement. At university, I was taught to NOT do that optimization by hand because the compiler would understand that multiplying an int by a constant power of 2 can be replaced with a bitshift, and hence it would do it for me. It seems that avr-gcc doesn’t do this, though?

For reference, this is the relevant snipped of my sketch. (there are some other parts that send the counters somewhere and set them back to 0 before they can overflow etc). If you see any improvements, I’m of course happy to hear them :slight_smile:

#define readRegisters PINA, PINC, PINL
#define noRegisters 3

uint16_t count[noRegisters * 8];
uint8_t lastStateRegs[noRegisters];
uint8_t pinNoTranslation[noRegisters * 8] = { 22,23,24,25,26,27,28,29,    37,36,35,34,33,32,31,30,   49,48,47,46,45,44,43,42 };

void setup() {
	for (byte i = 0; i < noRegisters * 8; i++) {
		pinMode(pinNoTranslation[i], INPUT_PULLUP);
	}
}

void measure() {
	uint8_t regs[noRegisters] = { readRegisters };
	for (uint8_t i = 0; i < noRegisters; i++) {
		if (regs[i] != lastStateRegs[i]) {
			uint8_t delta = (lastStateRegs[i] ^ regs[i]) & lastStateRegs[i];
			uint8_t pos = 0;
			while (delta) {
				uint8_t bit = (delta & -delta);
				while (!(bit & (1 << pos))) {
					pos++;
				}
				count[i * 8 + pos]++;
				delta = delta & (~bit);
			}
			lastStateRegs[i] = regs[i];
		}
	}
}

void loop() {
	meaasure();
}

In your case you might want to try something like this, which I would think is much faster:

uint16_t *cnt=count+i*8;
while(delta)
{
  if(delta&1)
    ++*cnt;
  delta>>=1;
  ++cnt;
}

In general, one thing you can do is to check the generated assembly if you can see there some obvious room for improvement. That can guide you to change your C/C++ code to perform better, or even write the code in inline assembly if it's very performance intensive. Compilers do good optimization job in general but are not perfect by any means, so sometimes you need to massage the code where it matters.

And like you have done, of course profile if possible before jumping to conclusions (: For example, IIRC, AVR doesn't support bitshift with random literal (only by 1) so it may emit multiple bit shifts for integer literal (or worse, loop if it's not literal), so it may actually be slower than mul. I know you profiled this so it appears to perform better anyway, but just wanted to point that out.

The pulse width can be as little as 50us, so the code

Bin your code and use the pin change interrupts! This is the kind of job interrupts are intended for!

As a general point - any time you find your self in this kind of situation the you are doing something very wrong!

Mark

  1. Is there any way of telling avr-gcc to unroll a loop?

It would be in contradiction to the overall -Os compile directive. I don’t know if it will do that (you’d think it ought to…) Your inner loop looks a bit complicated for the compiler to unroll, I think.

  1. Replacing count[i * 8 + pos]++; with count[(i << 3) + pos]++; gives a speed improvement.

Really? Because the code IS doing a shift:

        count[i * 8 + pos]++;
 416:   e2 0f           add     r30, r18
 418:   f3 1f           adc     r31, r19
 41a:   ee 0f           add     r30, r30
 41c:   ff 1f           adc     r31, r31

You can eliminate the question by maintaining a pointer and just adding 8…
I came up with (untested):

void measure2() {
  uint8_t regs[noRegisters] = { readRegisters };
  uint16_t *countp = &count[0];
  for (uint8_t i = 0; i < noRegisters; i++) {
    uint8_t delta = (lastStateRegs[i] ^ regs[i]) & lastStateRegs[i];
    uint8_t pos = 0;
    while (delta) {
      if (delta & 1) {
        countp[pos++];
      }
      delta >>= 1;
    }
    lastStateRegs[i] = regs[i];
    countp += 8;
  }
}

I’m a little worried at the “small optimizations” (like “while (delta)”); the code needs to be fast enough even in the “worst case” scenario, right? Going faster in the average case may not be useful, and may lead to a false sense of security…

Modify Arduino Optimization Levels on the Fly

holmes4:

The pulse width can be as little as 50us, so the code

Bin your code and use the pin change interrupts! This is the kind of job interrupts are intended for!

As a general point - any time you find your self in this kind of situation the you are doing something very wrong!

Mark

I don't think the atmega2560 has 24 interrupt pins? Otherwise I agree, interrupt would usually be the way to handle this.

JarkkoL:
In your case you might want to try something like this, which I would think is much faster:

uint16_t *cnt=count+i*8;

while(delta)
{
  if(delta&1)
    ++*cnt;
  delta>>=1;
  ++cnt;
}



In general, one thing you can do is to check the generated assembly if you can see there some obvious room for improvement. That can guide you to change your C/C++ code to perform better, or even write the code in inline assembly if it's very performance intensive. Compilers do good optimization job in general but are not perfect by any means, so sometimes you need to massage the code where it matters.

And like you have done, of course profile if possible before jumping to conclusions (: For example, IIRC, AVR doesn't support bitshift with random literal (only by 1) so it may emit multiple bit shifts for integer literal (or worse, loop if it's not literal), so it may actually be slower than mul. I know you profiled this so it appears to perform better anyway, but just wanted to point that out.

This is a cool way, thanks. I tried it, but it's pretty much the same speed.

I should probably mention that I expect something between 100 and 2000 falling edges within 2 seconds, so the actual code to increase the counter does not run THAT often. I guess this is why your method does not give a larger speed improvement. None the less, it's much cleaner code, so I'll use it.

final:
I don't think the atmega2560 has 24 interrupt pins?

Emphasis added...

holmes4:
use the pin change interrupts

westfw:
It would be in contradiction to the overall -Os compile directive. I don't know if it will do that (you'd think it ought to...) Your inner loop looks a bit complicated for the compiler to unroll, I think.

I think its understandable that avr-gcc does not unroll loops at all by default because any kind of storage is such a premium on AVRs. so I guess using 1 byte of ram for probably half a KB of eeprom is usually the right thing to do. In my case however, because my sketch is so small, I have plenty of both.

Really? Because the code IS doing a shift:

        count[i * 8 + pos]++;

416:   e2 0f           add     r30, r18
418:   f3 1f           adc     r31, r19
41a:   ee 0f           add     r30, r30
41c:   ff 1f           adc     r31, r31



You can eliminate the question by maintaining a pointer and just adding 8...

That's funny, because I can very much reproduce the difference in speed. I will have to look at the assembly myself and see what it does differently in my case. For the record, I'm using avr-gcc (GCC) 5.4.0 on debian 9. This seems to be the last version of the compiler in a debian package, though it's from 2015.

I'm a little worried at the "small optimizations" (like "while (delta)"); the code needs to be fast enough even in the "worst case" scenario, right? Going faster in the average case may not be useful, and may lead to a false sense of security...

I found that in over 99% of cases, only one bit changes per register. In most cases, no bits changed.
I can execute this loop about 100'000 times per second, and I have roughly a maximum of 2'000 falling edges per second, but usually it's around 100 falling edges per second.

Can you give me an example how to get a pin change interrupt on all PINs of PINA, PINC and PINL registers on the mega2560? I'm unable to find any resources on this :frowning:

final:
Can you give me an example how to get a pin change interrupt ... I'm unable to find any resources on this

Try harder...

Google Search:
About 79,100 results (0.73 seconds)

I did that. And all I found was results saying that the UNO supports pin change on all ports, but not much information on the mega2560.

But then I found this:

Theoretically, there are exactly 24 pins that support pin change interrupts, but I need a bunch of those for serial communication.