GCC not too smart with register-assigned variables?

I am telling GCC to store a byte variable in a register to speed things up:

register byte b asm ("r3");

Then in my code I have:

b <<= 1;

After compiling I would expect that to turn into the following code:

lsl r3

Instead GCC emits the following instructions:

458:       83 2d           mov     r24, r3
45a:       88 0f           add     r24, r24
45c:       38 2e           mov     r3, r24

Which is of course 300% slower (add r24, r24 is basically the same as lsl r24, but why do that in another register?). Isn’t that silly? Doesn’t that pretty much defy part of the usefulness of using fixed registers for variables? Another example:

++cnt;

Since cnt is assigned to r4, that could just be:

inc r4

Instead I get:

466:       81 e0           ldi     r24, 0x01       ; 1
468:       84 0d           add     r24, r4
46a:       48 2e           mov     r4, r24

Is there any reason behind this? Is there any GCC flag that tells it to be “smarter” in cases like these?

EDIT: I have just discovered that some instructions only work with registers >= r16 (e.g.: ori), but this is not the case with those mentioned above!

What if you just declare it register and let gcc decide which register to use. Does that help?

I believe the Arduino optimises space over speed.

I'd just love that, but unfortunately with global variables you must specify the register yourself. I am using those variables to exchange data with an ISR.

Oddly enough, I have just assigned another variable (b) to r28, and:

b |= 0x01;

turned into:

ori     r28, 0x01

So I guess GCC knows how to be smart!

(EDIT: Tried to assign the other vars to r28, in case GCC was just thinking inc and lsl wanted registers >= r16, but still it insisted to go through r24, so it's not the case)

About space vs. speed yes, by default Arduino passes the -Os flag to GCC which makes it optimize for space, but emitting 3 instructions instead of one doesn't look very smart space-wise either :(.

Don’t try to out-guess the compiler.

This test:

volatile byte b = 1;

void setup ()
  {
  b <<= 1;
  }  // end of setup

void loop () { }

Generates:

000000a6 <setup>:
  a6: 80 91 00 01 lds r24, 0x0100
  aa: 88 0f       add r24, r24
  ac: 80 93 00 01 sts 0x0100, r24
  b0: 08 95       ret

Adding r24 to itself effectively doubles it, and only takes one cycle.

… to speed things up …

Well, it slowed things down. Let the compiler-writers do their thing.


Your second test, adding 1:

volatile byte b = 1;

void setup ()
  {
  ++b;
  }  // end of setup

void loop () { }

Generates:

 aa: 8f 5f       subi r24, 0xFF ; 255

OK, it subtracted -1 to add 1. But it got there in one cycle.


Is there any GCC flag that tells it to be “smarter” in cases like these?

Yes. Do nothing. By trying to out-guess the compiler you actually defeat it. It might have made smarter choices but you have to insist:

No, no! Use register R3, I insist!

The compiler replies:

Well, damn! I could have generated better code, but if you really want me to use R3, I will.

Remember the old saying:

Why buy a dog, and then bark yourself?

SukkoPera:
I'd just love that, but unfortunately with global variables you must specify the register yourself. I am using those variables to exchange data with an ISR.

You don't need to do that to exchange data with an ISR. Just use volatile.

I know I don’t need to do that just because of the ISR itself, but the ISR has to be fast!

About the other post, if I just use volatile, I get:

  b <<= 1;
     458:       80 91 de 01     lds     r24, 0x01DE
     45c:       88 0f           add     r24, r24
     45e:       80 93 de 01     sts     0x01DE, r24
  ++cnt;
     462:       80 91 dd 01     lds     r24, 0x01DD
     466:       8f 5f           subi    r24, 0xFF       ; 255
     468:       80 93 dd 01     sts     0x01DD, r24

Which is even slower… Maybe in your case memory usage was just optimized out since your variable wasn’t really used anywhere else? EDIT: Or maybe you just didn’t post the lds/sts part?

Or maybe you just didn’t post the lds/sts part

I didn’t post it because it wasn’t germane to how fast the add was.

Adding volatile is a bit of a cheat here, because it forces the compiler to load and store. Without it the compiler might optimize away the whole add/shift because it knows the result at compile-time.

... but the ISR has to be fast!

Why? You may be optimizing the wrong thing.

I am using an interrupt to detect falling edges on an input signal. I have to sample that signal right after that and be done with the ISR before 4 us (from the falling edge), since the signal will change by then.

I bet you know that "an ISR using the ISR define will take you 2.625 µS to execute, plus whatever the code itself does" ;D, so that leaves me time for 22 instructions in the worst case, which means I can't really waste time accessing RAM. I guess I'll have to write the whole ISR in assembler, so that it's not compiler-dependent, since if I even get GCC to generate the code I need, that might no longer be the case in a future/past version.

I already have the signal sampled correctly with a tight loop without using interrupts, but I am trying to come up with an interrupt-enabled version to see if I can do other things while the signal is being sampled. To get to this, the ISR must even be smaller than 22 instructions, so that it leaves some time for the main code to run between two successive invocations.

SukkoPera:
I am using an interrupt to detect falling edges on an input signal. I have to sample that signal right after that and be done with the ISR before 4 us (from the falling edge), since the signal will change by then.

If you sample a signal after a falling edge and before it changes won't it always be zero?

Eh eh eh, no, it will change inbetween:

http://afermiano.com/index.php/n64-controller-protocol

I also have a mostly working version using a 4 us timer to do the sampling, but I have some issues finding the right moment to start the timer. This is a good alternative anyway,

The Arduino sets the compiler to -Os for size optimization. You can change it with platform.local.txt for speed (-O3) or global optimization (-flto) or more use of register in loops (-fira-loop-pressure).

The 1us/3us signal will be hard or impossible to do with a normal ATmega microcontroller in software. I'm thinking about a trick by using a timer input or so, but I can't come up with something that might work.

Someone managed to make it work with assembly code : Gamecube-N64-Controller/gamecube.ino at master · brownan/Gamecube-N64-Controller · GitHub
Here is more explanation about the code : http://www.instructables.com/id/Use-an-Arduino-with-an-N64-controller/?ALLSTEPS

SukkoPera:
Eh eh eh, no, it will change inbetween:

afermiano.com

Hmm, I did a N64 decoder a while back. This is the copy I found on my hard disk:

/*
  Nintendo N64 game controller interface.
  Author: Nick Gammon
  Date: 14 October 2014
  
  Helpful information:
    http://afermiano.com/index.php/n64-controller-protocol
    https://code.google.com/p/micro-64-controller/wiki/Protocol
  
  Note: Connect the controller data line to D2 via a pullup resistor 
        (I used 4.7 k) to the 3.3 V pin.
        
  Released for public use.
*/  

#include <avr/wdt.h>
#include <avr/builtins.h>

#define CONTROLLER false   // true to emulator controller, false to emulate console

#define N64_PIN 2
#define N64_PIN_DIR DDRD
// these two macros set arduino pin 2 to input or output, which with an
// external 1K pull-up resistor to the 3.3V rail, is like pulling it high or
// low.  These operations translate to 1 op code, which takes 2 cycles
#define N64_high()    N64_PIN_DIR &= ~0x04
#define N64_low()     N64_PIN_DIR |= 0x04
#define N64_query()  (PIND & 0x04)

#define nop asm volatile ("nop\n\t")

#define WANT_8MHZ false

// status data1:
const byte BUTTON_D_RIGHT = 0x01;
const byte BUTTON_D_LEFT = 0x02;
const byte BUTTON_D_DOWN = 0x04;
const byte BUTTON_D_UP = 0x08;
const byte BUTTON_START = 0x10;
const byte BUTTON_Z = 0x20;
const byte BUTTON_B = 0x40;
const byte BUTTON_A = 0x80;

// status data2:
const byte BUTTON_C_RIGHT = 0x01;
const byte BUTTON_C_LEFT = 0x02;
const byte BUTTON_C_DOWN = 0x04;
const byte BUTTON_C_UP = 0x08;
const byte BUTTON_R = 0x10;
const byte BUTTON_L = 0x20;

// 8 bytes of data that we get from the controller
typedef struct {
    byte data1;
    byte data2;
    char stick_x;
    char stick_y;
} N64_status;

void N64_send (byte * output, byte length)
{
  byte bitOn;
  byte bits;
 
   while (length--)
    {
    PINB = bit (5); // toggle D13 
    byte currentByte = *output++;
    
    for (bits = 0; bits < 8; bits++)  // 5 cycles if staying in loop
      {
      bitOn = currentByte & 0x80;  // 2 cycles
      currentByte <<= 1;   // 1 cycle
      
      // ------- low for 1 uS
      PINB = bit (3); // toggle D11
      N64_low(); // start bit
      
      // we want 14 but subtract 2 for toggling the LED
      // and another 2 for the decision below
  #if WANT_8MHZ
      nop; nop; nop; // 3
  #else
      nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; // 11
  #endif
      // -------------------
      
      // -------low or high for 2 uS
      if (bitOn)  // test bit, 1 if high, 2 if low
        {
        N64_high(); // go high for one  - 2 cycles
        }
      else          
        {
        N64_low(); //  stay low for zero - 2 cycles
        nop;  // compensate for branch  - 1 cycle
        }
  #if WANT_8MHZ
      nop; nop; nop; nop; nop; nop; nop; nop;   // 8
      nop; nop; // 10
  #else
      nop; nop; nop; nop; nop; nop; nop; nop;   // 8
      nop; nop; nop; nop; nop; nop; nop; nop;   // 16
      nop; nop; nop; nop; nop; nop; nop; nop;   // 24
      nop; nop; // 26
  #endif
  
      // -------------------
  
      // ------- high for 1 uS
      N64_high(); //  stop bit must be high
      
      // we want 14 but subtract 7 because of loop overhead
  #if WANT_8MHZ
  #else    
      nop; nop; nop; nop; nop; nop; nop; nop;   // 7
  #endif
  
      // -------------------
      }  // end of each bit

    }
    
  // console stop bit - low for 1 uS then high
  N64_low(); // start bit (2 cycles)

#if WANT_8MHZ
  nop; nop; nop; nop; nop; nop;   // 6
#else    
  nop; nop; nop; nop; nop; nop; nop; nop;   // 8
  nop; nop; nop; nop; nop; nop;   // 14 
#endif

  N64_high(); 

}  // end of N64_send

void N64_get(byte * output, byte length)
{
  byte bits;

// controller will time out more slowly
  wdt_reset();  // pat the dog
#if CONTROLLER
  wdt_enable(WDTO_4S);
#else
 // wdt_enable(WDTO_30MS);
  wdt_enable(WDTO_8S);  // bootloader problems, arrgh!
#endif

  while (length--)
    {
    PINB = bit (5); // toggle D13 
    for (bits = 0; bits < 8; bits++)
      {
      // wait for start bit
      while (N64_query()) { }
        
      PINB = bit (3); // toggle D11
      
      // wait for 2 uS then check the line
#if WANT_8MHZ
      nop; nop; nop; nop; nop; nop; nop; nop;   // 8
      nop; nop;   // 10
#else
      nop; nop; nop; nop; nop; nop; nop; nop;   // 8
      nop; nop; nop; nop; nop; nop; nop; nop;   // 16 
      nop; nop; nop; nop; nop; nop; nop; nop;   // 24 
      nop; nop; nop; nop;  // 28
#endif
      PINB = bit (4); // toggle D12
        
      // shift left as the most significant bit comes first
      *output <<= 1;
      // or this bit in
      *output |= N64_query() != 0;
      
      // wait for line to go high again
      while (!N64_query()) { }
        
      }  // end of for each bit
    output++;
    }  // end of while each byte
 
  // wait for stop bit
  while (N64_query()) { }
  // then other end should let line go high
  while (!N64_query()) { }

  wdt_disable();  // disable watchdog
}  // end of N64_get

byte buf [4];

void setup ()
  {
  pinMode (N64_PIN, INPUT);  // do not make OUTPUT or INPUT_PULLUP!
  Serial.begin (115200);
  Serial.println ("Starting ...");

// debugging
  pinMode (10, OUTPUT);    
  pinMode (11, OUTPUT);    
  pinMode (12, OUTPUT);    
  pinMode (13, OUTPUT);    
  }  // end of setup

void loop ()
  {
  PINB = bit (2); // toggle D10
  
#if CONTROLLER
  noInterrupts ();
  N64_get (buf, 1);  // wait for query
  byte sendBuf [] = { 0x05, 0x00, 0x02 };
  N64_send (sendBuf, sizeof sendBuf);
  interrupts ();
#else
  byte sendBuf [1] = { 0x00 };  // identify controller
  /*
  noInterrupts ();
  N64_send (sendBuf, sizeof sendBuf);
  N64_get (buf, 3);  // get ID
  interrupts ();
  char printBuf [30];
  sprintf (printBuf, "ID: %02X %02X %02X", buf [0], buf [1], buf [2]);
  Serial.println (printBuf);
  Serial.flush ();
  */
  noInterrupts ();
  sendBuf [0] = 0x01;  // poll controller
  N64_send (sendBuf, sizeof sendBuf);
  N64_status status;
  N64_get ((byte *) &status, sizeof status);  // get ID
  interrupts ();
  
  static N64_status oldStatus;
 
  if (memcmp (&status, &oldStatus, sizeof status) != 0)
    {
    if (status.stick_x || oldStatus.stick_x)
      {
      Serial.print ("X: ");
      Serial.print ((int) status.stick_x);
      Serial.print (" ");
      }
    if (status.stick_y || oldStatus.stick_y)
      {
      Serial.print ("Y: ");
      Serial.print ((int) status.stick_y);
      Serial.print (" ");
      }     
    oldStatus = status;
    if (status.data2 &  BUTTON_C_RIGHT)
      Serial.print (F("C right "));
    if (status.data2 &  BUTTON_C_LEFT)
      Serial.print (F("C left "));
    if (status.data2 &  BUTTON_C_UP)
      Serial.print (F("C up "));
    if (status.data2 &  BUTTON_C_DOWN)
      Serial.print (F("C down "));
    if (status.data2 &  BUTTON_R)
      Serial.print (F("R "));
    if (status.data2 &  BUTTON_L)
      Serial.print (F("L "));
    
    if (status.data1 &  BUTTON_D_RIGHT)
      Serial.print (F("D right "));
    if (status.data1 &  BUTTON_D_LEFT)
      Serial.print (F("D left "));
    if (status.data1 &  BUTTON_D_UP)
      Serial.print (F("D up "));
    if (status.data1 &  BUTTON_D_DOWN)
      Serial.print (F("D down "));
    if (status.data1 &  BUTTON_START)
      Serial.print (F("Start "));
    if (status.data1 &  BUTTON_Z)
      Serial.print (F("Z "));
    if (status.data1 &  BUTTON_B)
      Serial.print (F("B "));
    if (status.data1 &  BUTTON_A)
      Serial.print (F("A "));
    
    if (status.data1 == 0 && status.data2 == 0)
      Serial.print (F("(no buttons)"));
    Serial.println ();
    Serial.flush ();
    }
    
  delay (200);
#endif
  
  }  // end of loop

I don’t think there is much point using an ISR here, because if you are trying to sample in 4 µS you may as well just loop around reading the blasted thing. What else can your code be doing in the one or two clock cycles you will have over?

Thanks for the link and code but, as said above, I already have two working versions: one with a tight loop (in C, using oversampling, works very well but hogs the CPU) and one with a 4 us timer (working mostly fine, but has some "interference" prolly due to starting the timer at a fixed moment, I should probably wait for the first falling edge).

The former one is published here, but don't spread the word yet :): GitHub - SukkoPera/N64PadForArduino: Use a Nintendo64 controller with Arduino. Or turn it into a USB joypad for your PC! Or into a pad for your Sega MegaDrive :).

I guess it could also be done with the input capture unit, but I'm not interested in that since I might want to put the code on an ATtiny85 in the end, and it doesn't have that.

So I'm just trying to come up with a version using interrupts. For a couple reasons I won't probably be able to use it anyway in the end (interrupt pins might already be in use) but now it's turned into a personal challenge :).

I see. I think it is not possible (I'm 50% certain of it). You need an Arduino Due or other faster processor. Or with extra hardware.

As I said on http://www.gammon.com.au/interrupts:

... an ISR using the ISR define will take you 2.625 µS to execute, plus whatever the code itself does ...

I already have two working versions: one with a tight loop (in C, using oversampling, works very well but hogs the CPU)

Yes, but your 4 µs timer also hogs the CPU, if it takes 2.625 µs of overhead. How many µs are over to do useful work? Let me see: 1.375 µs.

... so that leaves me time for 22 instructions in the worst case, which means I can't really waste time accessing RAM ...

You already admit you barely have time to do anything inside the ISR, so I'm afraid the interrupt-driven idea also hogs the CPU.

As a personal challenge that's fine, have fun.

All I'm saying is, you are basically using registers/assembler etc. to get the code done in 22 clock cycles (that isn't 22 instructions BTW). So how many will be over? One or two? And you can't afford to have timer 0 running because if that kicks in, it will throw your timing out. So you have to stop all other interrupts, and then tightly code your ISR, and make sure nothing else is running. So what is the point?

It's 22 instructions if you avoid RAM access and jumps :). Of course it's (almost?) impossible to not use jumps at all, but I think I'm already down to 18 clock cycles. It still isn't working correctly but the reason must be lying elsewhere now.

Anyway you're right: all things considered, there is no point. By now I guess it's only a personal challenge. And my first attempt at writing AVR assembler code. And actually my first contact with assembler code at all in years! :frowning:

What's the "least invasive" way to stop timer 0 BTW?

Thanks for all the help anyway!

avr-gcc seems to do a particularly poor job of using the (2nd-class) low-numbered registers.
I'm not sure why; especially for your examples. It's probably a question for avrfreaks.net rather than the arduino forums.
(Do you see the same behavior with both the old (Arduino 1.0.6) and new (Arduino 1.6.5) compilers?

(and it turns out that there isn't really an "lsl" instruction; it's just an alias for "add r,r" (it should still work, but the disassembly will probably always show "add"))

What's the "least invasive" way to stop timer 0 BTW?

TCCR0B = 0;

Although that changes WGM02 as well.

This would be less invasive:

TCCR0B &= ~7;