can't find a register in class 'SIMPLE_LD_REGS' while reloading 'asm'

Hi!

I'm trying to rewrite one of my functions using inline asm, however I have a trouble here. This is my first time to try assembler, actually.
I've prepared a minimal example:

void setup(void)
{
  Serial.begin(9600);
  Serial.println((uint32_t)asmfoo(1234, 5678));  //this code is just to call asmfoo() 
}

void loop(void)
{
}

uint64_t asmfoo(uint32_t x, uint32_t y)
{
  uint64_t z;
  asm volatile (
  "nop \n\t" //I'm going to make use of fmul instructions here
  "nop \n\t" //which only accept r16-r23 registers
  "nop \n\t"
  : [Z]"=&r"(z)           //"r" is any general register (r1-r32)
  : [X]"a"(x), [Y]"a"(y)  //"a" is a simple upper register (r16-r23)
  );
  return z;
}

This code should pass two 32 bit variables into r16-r23 registers (i.e. 8 registers for two 4-byte variables, which is enough) and get back a 64-bit variable on any other registers (there should be enough of them in the lower registers, for example).
When I try to compile the code on Arduino IDE 1.0.1 (Windows) for Mega2560 target I get the following error:

sketch_feb25b.cpp: In function 'uint64_t asmfoo(uint32_t, uint32_t)':
sketch_feb25b:19: error: can't find a register in class 'SIMPLE_LD_REGS' while reloading 'asm'
sketch_feb25b:19: error: 'asm' operand has impossible constraints

Does anybody have any idea?

My guess is the compiler uses some upper registers for it's internal usage. In your asm operand you're trying to use all upper registers for your asm command. Have you tried this first with uint16_t variables? For me that compiles, it even compiles with only one of the two variables declared as uint16_t.

Yes, it does compile with one of the variables declared as uint16_t, However I need them all, because what I'm trying to do is multiply them using fmul instruction which only accepts simple upper registers. Of cause I could use only two of the upper registers and store the values in different registers and move them from there, but that would be a waste of registers and cycles (two more additional registers and 16 more cycles (1 movw per each of 16 fmuls I need), not counting the overhead that the compiler would probably need to free up the two additional registers).
I think the compiler should be able to push its internal data in stack or move them away to other registers if I need all simple upper registers.
What is more strange is that it does compile if I pass it a single variable of 64 bit:

uint64_t asmfoo(uint64_t x, uint32_t y)    //     <- note the uint64_t
{
  uint64_t z;
  asm volatile (
  "nop \n\t" //I'm going to make use of fmul instructions here
  "nop \n\t" //which only accept r16-r23 registers
  "nop \n\t"
  : [Z]"=&r"(z)           //"r" is any general register (r1-r32)
  : [X]"a"(x)//, [Y]"a"(y)  //"a" is a simple upper register (r16-r23)     //   <-note the commented out y
  );
  return z;
}

Where did you find the fmul instruction? The AVR Assembler User Guide (http://www.atmel.com/Images/doc1022.pdf) does not list it.

I found it on AVR Instruction Set document (http://www.atmel.com/Images/doc0856.pdf). The instruction multiplies two registers and performs a left shift on the result one bit in just two cycles - perfect for my fixed-point library that I'm doing for my quadrocopter project!
It's also mentioned on other Atmel docs such as AVR200: Multiply and Divide Routines (http://www.atmel.com/Images/doc0936.pdf) and AVR201: Using the 8-bit AVR Hardware Multiplier (http://www.atmel.com/Images/doc1631.pdf).

Any ideas?

I would either pack the two variables into a uint64_t before calling the function or store them into any register (not just the high ones) and copy it to the high registers in the assembler code before you use them.

I don't suppose we can talk you out of using assembler?

This code:

uint64_t asmfoo(uint64_t x, uint32_t y)    //     <- note the uint64_t
{
  uint64_t z;
  asm volatile (
  "nop \n\t" //I'm going to make use of fmul instructions here
  "nop \n\t" //which only accept r16-r23 registers
  "nop \n\t"
  : [Z]"=&r"(z)           //"r" is any general register (r1-r32)
  : [X]"a"(x)//, [Y]"a"(y)  //"a" is a simple upper register (r16-r23)     //   <-note the commented out y
  );
  return z;
}

void setup ()
  {
  Serial.begin (115200);
  asmfoo (42, 56);
  }  // end of setup

void loop () { }

Generates:

000000c0 <setup>:
  c0:	8f 92       	push	r8
  c2:	9f 92       	push	r9
  c4:	af 92       	push	r10
  c6:	bf 92       	push	r11
  c8:	cf 92       	push	r12
  ca:	df 92       	push	r13
  cc:	ef 92       	push	r14
  ce:	ff 92       	push	r15
  d0:	0f 93       	push	r16
  d2:	1f 93       	push	r17
  d4:	88 e9       	ldi	r24, 0x98	; 152
  d6:	91 e0       	ldi	r25, 0x01	; 1
  d8:	40 e0       	ldi	r20, 0x00	; 0
  da:	52 ec       	ldi	r21, 0xC2	; 194
  dc:	61 e0       	ldi	r22, 0x01	; 1
  de:	70 e0       	ldi	r23, 0x00	; 0
  e0:	0e 94 12 01 	call	0x224	; 0x224 <_ZN14HardwareSerial5beginEm>
  e4:	0a e2       	ldi	r16, 0x2A	; 42   <------ asmfoo starts here
  e6:	10 e0       	ldi	r17, 0x00	; 0
  e8:	20 e0       	ldi	r18, 0x00	; 0
  ea:	30 e0       	ldi	r19, 0x00	; 0
  ec:	40 e0       	ldi	r20, 0x00	; 0
  ee:	50 e0       	ldi	r21, 0x00	; 0
  f0:	60 e0       	ldi	r22, 0x00	; 0
  f2:	70 e0       	ldi	r23, 0x00	; 0
  f4:	00 00       	nop
  f6:	00 00       	nop
  f8:	00 00       	nop
  fa:	1f 91       	pop	r17
  fc:	0f 91       	pop	r16
  fe:	ff 90       	pop	r15
 100:	ef 90       	pop	r14
 102:	df 90       	pop	r13
 104:	cf 90       	pop	r12
 106:	bf 90       	pop	r11
 108:	af 90       	pop	r10
 10a:	9f 90       	pop	r9
 10c:	8f 90       	pop	r8
 10e:	08 95       	ret

If we comment-out the call to asmfoo notice how much smaller the code is:

000000c0 <setup>:
  c0:	88 e9       	ldi	r24, 0x98	; 152
  c2:	91 e0       	ldi	r25, 0x01	; 1
  c4:	40 e0       	ldi	r20, 0x00	; 0
  c6:	52 ec       	ldi	r21, 0xC2	; 194
  c8:	61 e0       	ldi	r22, 0x01	; 1
  ca:	70 e0       	ldi	r23, 0x00	; 0
  cc:	0e 94 f3 00 	call	0x1e6	; 0x1e6 <_ZN14HardwareSerial5beginEm>
  d0:	08 95       	ret

We saved a whole lot of pushing and popping. Plus passing uint64_t variables requires 8 x LDI instructions (per argument).

Remember, compiler-writers know how to maximize the use of instructions. The compiler itself keeps track of what registers need to be saved and restored. You have to work pretty hard to do a better job than it does.

Hi Nick,

I don't suppose we can talk you out of using assembler?

I don't think so... What I'm trying to do is to multiply two signed 0.31 fixed-point variables stared as int32_t (long int). The best I can do in C is to write

z.value=(x.value>0) ? ((long long)(y.value)*((unsigned long)(x.value)<<1)+0x80000000)>>32 : -(((long long)(y.value)*((unsigned long)(-x.value)<<1)+0x80000000)>>32);

The 0x80000000 is just for rounding.
As you can see, it's far from optimal as the compiler has to perform 64X64->64 bit multiplication and then drop the lower 31 bits (actually I'm first shifting one bit to the right and then drop the lower 32 bits because AVR can't shift more than one bit per cycle, while I believe the compiler will shift 32 bits for me for free).
Using assembler, I could do it much faster by performing 32X32->64 bit multiplication and shifting at the same time utilizing fmul* instructions like this:

    asm volatile (
    "clr %[Z] \n\t"
    "fmuls %D[X], %D[Y] \n\t"
    "movw %C[R], r0 \n\t"
    "fmulsu %D[Y], %B[X]  \n\t"
    "sbc %C[R], %[Z]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "movw %A[R], r0  \n\t"
    "fmulsu %D[Y], %C[X]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "add %B[R], r0  \n\t"
    "adc %C[R], r1  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmulsu %D[Y], %A[X]  \n\t"
    "sbc %B[R], %[Z]  \n\t"
    "sbc %C[R], %[Z]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "mov %D[T], r0  \n\t"
    "add %A[R], r1  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmulsu %D[X], %C[Y]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "add %B[R], r0  \n\t"
    "adc %C[R], r1  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %C[X], %C[Y]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %A[R], r0  \n\t"
    "adc %B[R], r1  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %B[X], %C[Y]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %D[T], r0  \n\t"
    "adc %A[R], r1  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %A[X], %C[Y]   \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "mov %C[T], r0  \n\t"
    "add %D[T], r1  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmulsu %D[X], %B[Y]  \n\t"
    "sbc %C[R], %[Z]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "add %A[R], r0  \n\t"
    "adc %B[R], r1  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %C[X], %B[Y]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %D[T], r0  \n\t"
    "adc %A[R], r1  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %B[X], %B[Y]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %C[T], r0  \n\t"
    "adc %D[T], r1  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %A[X], %B[Y]  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "mov %B[T], r0   \n\t"
    "add %C[T], r1  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmulsu %D[X], %A[Y]  \n\t"
    "sbc %B[R], %[Z]  \n\t"
    "sbc %C[R], %[Z]  \n\t"
    "sbc %D[R], %[Z]  \n\t"
    "add %D[T], r0  \n\t"
    "adc %A[R], r1  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %C[X], %A[Y]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %C[T], r0  \n\t"
    "adc %D[T], r1  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %B[X], %A[Y]  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %B[T], r0   \n\t"
    "adc %C[T], r1  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "fmul %A[X], %A[Y]  \n\t"
    "adc %C[T], %[Z]  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "add %B[T], r1  \n\t"
    "adc %C[T], %[Z]  \n\t"
    "adc %D[T], %[Z]  \n\t"
    "adc %A[R], %[Z]  \n\t"
    "adc %B[R], %[Z]  \n\t"
    "adc %C[R], %[Z]  \n\t"
    "adc %D[R], %[Z]  \n\t"
    "clr r1  \n\t"
    : [R]"=&r"(z.value), [T]"=&r"(tmp), [Z]"=&r"(zero)  // (long int)z.value is the result, (long int)tmp is a temporary variable for storing lower 32 bits of the multiplication result
    : [X]"a"(x.value), [Y]"a"(y.value) //(long int)x.value and (long int)y.value are the input variables
    );

If the above code compiles, it would (provided it's bug-free) do the fixed point multiplication in less than 160 cycles. It could even do it about twice faster with loosing precision of 1-2 bits by doing 32X32->32 multiplication and not calculating the lower 4 bytes internally.

I don't believe that compiler-writers can do it much better, to be honest.
You see, 10 push's and pop's are nothing comparing to the overhead to perform 64X64 multiplication.

If I may enquire, what is the real problem?

You are stating that you need to multiply two variables, blah blah, but what is the actual problem? Every now and then we get enquiries about "how do I break out of an interrupt?", or "how do I do X?" where, after looking at the problem, they are looking at things at too low a level.

Will floats do it? How about big numbers?

Maybe not, but I'm curious what the end application is.

Sorry, I didn't realize how high the high level of problem description should be.
I'm making a flight controller for quadrocopter that features a full IMU, which implies that it will inquiry gyro and accel sensors and update its position and orientation information at the highest possible rate.
The real challenge here is performance, because both accel and gyro sensors don't measure absolute values, but only derivatives, and the more frequent I'll read them, the less accumulative error would be.

Floats have the following two main disadvantages:

  1. They are much slower on AVR than a proper integer math implementation
  2. The precision is lower than a 0.31 fixed-point variable

I think the big numbers library is even more slow than floats.

Some test timings, example sketch:

// long test: multiplication

void setup ()
  {
  Serial.begin (115200);
  
  unsigned long start = micros ();
  volatile long a (3123456);
  volatile long b (82654321);
  long c;

  for (int i = 0; i < 1000; i++)
    c = a * b;

  unsigned long time = micros () - start;
    
  Serial.println (c);
  Serial.print (time);
  Serial.println (" uS");  
  }  // end of setup

void loop () { }

Results:

float:

258.17
9560 uS

big numbers:

258.167134
670424 uS

long:

945658112
4592 uS

64-bit long:

945658112
55960 uS

OK, big numbers are slow, as expected. But floats aren't much slower than longs and are much faster than 64-bit longs.

Well, longs are more than twice faster and have more precision, for me that does matter enough to start rewriting something on asm...