Inline assembly dec/brne loop question

I'm very new to this, I don't really know the ARM assembly and don't usually do inline assembly.
I want to make a tiny (code wise) delay loop that waits N cycles.
Does this syntax look right?

// %= expands to a number unique to each asm statement, handy for jump labels
// __tmp_reg__ is a compiler provided scratch register
// brne takes 2 cycles except for last iter.
// I don't get why I need to use \n\t, instead of a cleaner semicolon ";"
asm volatile (
		"ldi __tmp_reg__, %[loop] \n\t"
	"H_%=: \n\t"
		"dec __tmp_reg__ \n\t"
		"brne H_%= \n\t"
	:
	:
		[loop]		"I" ((cycles+2)/3)
);

FYI, I plan to code my own WS2812B LED driver library...

I can't get it to compile. I get a complaint that a register number above 15 is required. I think that refers to the LDI instruction, which is applicable to registers r16 through r31 only. When I changed the register to r16, it compiled.

How did you get it to compile?

I can't see what kind of object cycles is. If it's a variable, I don't see how an LDI instruction can be generated. The value to be loaded is hard-coded into an LDI instruction, so that value has to be unambiguous at compile time. If cycles is a constant, or you're using it as a placeholder for a constant that you'll add later, it might work. I didn't test it further.

[Edit: add this] I tested this on an Uno, but I see that you reference an ARM. Which platform are you developing it for?

If you are looking to learn more about the assembler, stay on your path. For the length of delay you need for the 2812s, you might consider just a string of nops. If you want to get fancy and make it a function, add entry point labels along the way and end it with a ret. But really, by the time you make a call and a ret, your delay is off. Inline nops are the way to go for you, IMO.

tmd3:

Indeed, I did not get it to compile, and I was not 100% sure the syntax is correct, like I said this is all new hence me coming here for advice.

Why would the compiler not provide a default register that can be used for a load immediate? That sounds silly. So I have to pick a register and declare it as used?

FYI, cycles is generated from #define constants (I was going to pass it as parameter of a macro that hides this ASM loop hence why it is a lower case variable). However if I read right, immediate must fit on 6 bits, and I'm not sure I have enough of 6 bits for ALL frequencies to support. It's OK for 8MHz and 16MHz. Looks like I can't support 1MHz per the specs.

I am coding for Arduino (uno, nano, tiny), which I thought was an ARM M3 or similar. No?

kvwilson:

I'd like to make the library work for any CPU frequency hence wanting to use a spin loop. The loop takes 3 cycles. For 16MHz I believe I need 8 cycles, so 3 iterations. However for higher frequency CPUs I'd need more
cycles, in case I get a fancier 32MHz or 66MHz controller.

Most LED libraries are ugly because they ifdef a million options. I want to keep this one using minimum FLASH and RAM memory to leave the space for other stuff.

By the way...

  • I want to support encoding LED colors on 1, 2 or 3 or 4 bytes, with some color palette capability for 1B pixels.
  • I want the library to be resilient to interrupts, as well as avoiding masking interrupts for most of the work.
  • I want to support APA LEDs next, and have the library simple enough that others looking at it know how to
    add support for other LED controllers.
  • I hope to make a shim class that will allow me to drop in my library in lieu of adafruit's library.

Thanks to both of you for responding! :slight_smile:

If you are talking about an Uno/Nano etc, they are AVRs - completely different to ARM processors and completely different instruction set.


Your questions:
(1) I don't get why I need to use \n\t, instead of a cleaner semicolon ";"

This is because inline assembler is basically just pasted into the compiler's assembly output as it appears in the asm() block. That output is what the assembler sees, and it expects each instruction to start on a new line. Interestingly, in avr-gcc inline assembly a semicolon starts a comment, so this would work: "nop ; do nothing for one cycle\n\t".

(2) Why doesn't the assembly compile

Well, there are a couple of things. The first issue manifests itself as the error 'register number above 15 required'. This is because the 'ldi' instruction (load immediate) can only load a constant value into r16-r31. Your __tmp_reg__ is r0, so it cannot be done.

What you can do instead is declare an uninitialized variable just before the ASM loop and then use that as the loop register - you can let the compiler decide which register is available at the time and use that one.

The second thing is not really an issue per se, but it isn't ideal. You do this: "H_%=: \n\t". Now while you can do that, this creates a named point in the assembly code which could conflict with something else - it probably won't, but you never know. Instead what you can do is simply use a number, e.g. "1:\n\t" - the numbers don't actually make it into the final assembly, they are just used to calculate jump offsets.

That gets us to this point:

  #define cycles 100
  unsigned char counter;
  asm volatile (
    "ldi %[counter], %[loop] \n\t"
    "1: \n\t"
    "dec %[counter] \n\t"
    "brne 1b ; Branch to marker 1 backwards\n\t"
    : [counter]  "=d" (counter)   /* "d" = r16-r31, which ldi requires */
    : [loop]     "I"  ((cycles+2)/3)
  );

Which compiles to:

 12a:	82 e2       	ldi	r24, 0x22	; 34
 12c:	8a 95       	dec	r24
 12e:	f1 f7       	brne	.-4      	;
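
To check the arithmetic behind that loop, here is a small host-side C sketch (it runs on a PC, not on the AVR). It assumes the usual AVR timings of 1 cycle for ldi and dec, 2 cycles for a taken brne and 1 for a not-taken one, and that the "I" constraint accepts 0 to 63. The function names are made up for illustration.

```c
#include <stdint.h>

/* Iteration count used in the asm above: round the request up to a
   whole number of 3-cycle loop passes. */
uint32_t loop_iterations(uint32_t cycles)
{
    return (cycles + 2u) / 3u;
}

/* Cycles actually spent by ldi + (dec, brne) x n, for n in 1..255.
   Assumed timings: ldi = 1, dec = 1, brne = 2 when taken, 1 when not. */
uint32_t loop_cycles(uint32_t n)
{
    uint32_t ldi = 1u;
    uint32_t taken_passes = (n - 1u) * (1u + 2u); /* dec + taken brne */
    uint32_t last_pass = 1u + 1u;                 /* dec + fall-through brne */
    return ldi + taken_passes + last_pass;        /* simplifies to 3 * n */
}

/* Assumption: the "I" constraint only accepts constants 0..63. */
int fits_constraint_I(uint32_t n)
{
    return n <= 63u;
}
```

With cycles = 100 this gives 34 iterations (the 0x22 in the listing) and 102 cycles in total, so the delay can overshoot the request by up to 2 cycles.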

An important question which you don't say is what 'cycles' is. I am assuming it is a #define, but if it is a variable, then you run into an issue - a variable is not a constant, even if it is declared 'const'. So you can't use ldi, and you can't use "I". Instead you have to do something like:

  unsigned char cycles = 100;
  //...

  asm volatile (
    "mov __tmp_reg__, %[loop] ; We use the mov as the value is already in a register. \n\t"
    "1: \n\t"
    "dec __tmp_reg__ \n\t"
    "brne 1b ; Branch to marker 1 backwards\n\t"
    : 
    : [loop]     "r"  ((cycles+2)/3)
  );

In this case we can do the loop using tmp_reg as we are using the mov instruction not ldi. mov can access all registers r0-r31.

Tom,

Thanks for your very instructive and didactic explanation. It actually helps a lot and is the best help I've had so far on this forum!

It also explains why I have a hard time finding good resources to understand how this works and put it to good use (my last real assembly programming was 6809 :stuck_out_tongue: ). I was almost certain AVR was an atmel spin of Arm which somehow merged their older controller ISA. With this template I can try to find resources that go in and out explaining inline assembly declarations (I found some docs obviously, but was still half baffled), and specific AVR ASM.

Note I assume I can't write

 asm volatile ("
    dec %[counter]
     brne
");

Instead of :

 asm volatile (
    "dec %[counter] \n\t"
    "brne 1b ; Branch to marker 1 backwards\n\t"
);

And why would I need the \t? The assembler expects text to be tabulated or is it for the unfortunate soul that wants to look at auto-gen ASM files?
Lastly I did not quite get the meaning of "volatile" in this context. It seems to not be the same as
standard use in C to avoid optimizing memory accesses for I/O registers and threaded apps.

Declaring an asm block as volatile means the compiler won't try to optimise it away. Anything that is in the asm volatile block will be in the final compilation.


With the "brne 1b", you will note there is also the label "1:". These go together. It basically says jump backwards to label 1. It doesn't just have to be 1, you can have any single digit number (maybe multiple digits, can't remember). There are also cases where the label might be forwards (i.e. the label after the statement). Here is a convoluted example of why the b and f are used:

asm volatile (
  "1:     \n\t"
  "breq 1f   \n\t"
  "dec r1  \n\t"
  "rjmp 1b  \n\t"
  "1:       \n\t"
  "nop     \n\t"
);

That code would perform a 'while' behaviour, rather than a 'do-while'. Essentially it says if initially the number was zero, then jump over the dec and rjmp instructions to the label 1 which appears forward (below) this line. Otherwise it will decrement r1 and then do a jump to the label that appears behind (above) the line. As there are two 1's you have to identify which to pick.


You couldn't do just 'brne' without anything else as you haven't specified where to branch off to. Now instead of adding labels, you can actually be explicit:

asm volatile (
    "dec %[counter]  \n\t"
    "brne .-4          ;Note the '.'     \n\t"
);

This would say branch 4 "bytes" backwards. Why 4 bytes? Well the branch is done relative to the end of the brne instruction, and we need to get back to the start of the dec instruction - each one is 2 bytes, so a jump totalling -4 is required. To go forwards you would do "brne .+16" or whatever.

Now you could instead jump in terms of number of words (2 bytes) by doing "brne -2" which would compile to be the same as above - notice that in the second option there is no "."!

This is useful for very short loops, but it is a pain if you suddenly decide to add an instruction in the loop as you have to recalculate all of the jumps manually.


As for the tab characters, that one is mostly just for if you start looking at the assembly before compilation. In fact for the most part it doesn't make any real difference - I just do it because that's how I've seen it done! It works fine without the tab, but the new line is a must!

Also you have to wrap each line in its own "", as otherwise it won't work.


Oh, and this is a very useful reference from Atmel: AVR Instruction Set. It basically lists the full instruction set of the AVR CPU including small examples.

And then this one has some useful inline assembly hints: Inline Assembler Cookbook


Also, a correction to the example in my previous post. Really "I" shouldn't be used for passing the [loop] value into the inline assembler. This is because technically "I" represents a 6-bit constant (0 to 63), when in fact you should be using "M", which is an 8-bit constant (0 to 255) - though for small values such as the 34 in that example it makes no difference.

Also if you happen to want to pass a 16bit constant, you can actually use a method which I haven't seen documented - found it by mistake.

asm volatile (
   "ldi r24, lo8(%0) \n"
   "ldi r25, hi8(%0) \n"
   ...
   : 
   : "i" (65535)
);

Notice in this case the use of the lower case "i" to specify a 16bit integer, and then the use of hi8() and lo8() to get the upper and lower 8bit chunks of the constant.
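
As a plain-C picture of what lo8() and hi8() select, here is a host-side sketch. It assumes they return bits 0-7 and bits 8-15 of the expression respectively; the helper names are hypothetical and this runs on a PC, it is not AVR code.

```c
#include <stdint.h>

/* Host-side stand-ins for the assembler's lo8() and hi8() operators
   (assumed behaviour: low byte and second-lowest byte of the value). */
uint8_t lo8_model(uint16_t value) { return (uint8_t)(value & 0xFFu); }
uint8_t hi8_model(uint16_t value) { return (uint8_t)(value >> 8); }

/* Rebuild the 16-bit value the way the r25:r24 register pair holds it. */
uint16_t join_bytes(uint8_t high, uint8_t low)
{
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

For 0x1234, lo8_model() gives 0x34 (the byte that ends up in r24) and hi8_model() gives 0x12 (the byte for r25).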

I couldn't get inline assembler to work for me until I read the Inline Assembler Cookbook, referenced in Tom Carpenter's post, above, and this post in AVRFreaks. There's more there than I grasp, and the same will probably be true for you. But the answers to some of your questions will be in there.

The reason for "\n\t" seems to be this: The "\n", a newline character, is required so that the assembler sees a new line, as it demands that each assembler instruction appear on a separate line. Without it, the assembler sees whatever follows as "garbage at end of line." The "\t" just makes the listing look tidier. You may be able to do without it.

The "volatile" keyword tells the compiler not to remove the code, even though it may appear to do nothing.

Your application for this code seems to be for generating short, precise delays for interfacing with WS2812B's. Here's a fairly detailed description of a way to do that. From the information on that page, the longest delay you have to generate is 900 ns, around 14 or 15 cycles of the system clock. It may be preferable to just use an appropriate number of nop's, as Tom Carpenter suggests, rather than try to come up with some generic routine for multiple delays. Don't forget that you'll need other code - asm code - to determine whether to send a one or a zero, and that will take time, too, reducing your requirement for additional cycles of delay.
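
The "900 ns is around 14 or 15 cycles" figure assumes a 16 MHz clock. A minimal host-side sketch of that conversion, with hypothetical function names, might look like this:

```c
#include <stdint.h>

/* Whole clock cycles that fit inside a duration, rounded down.
   The 64-bit intermediate avoids overflow of ns * f_cpu_hz. */
uint32_t ns_to_cycles_floor(uint32_t ns, uint32_t f_cpu_hz)
{
    return (uint32_t)(((uint64_t)ns * f_cpu_hz) / 1000000000ull);
}

/* Same conversion rounded up, for delays that must not come out short. */
uint32_t ns_to_cycles_ceil(uint32_t ns, uint32_t f_cpu_hz)
{
    return (uint32_t)(((uint64_t)ns * f_cpu_hz + 999999999ull) / 1000000000ull);
}
```

At 16 MHz, 900 ns is 14.4 cycles, so the two functions return 14 and 15. At 8 MHz the same 900 ns is only 7.2 cycles, which is why the budget gets tight on slower clocks.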

You should consider that the compiler may insert other code before and after the asm code, in order to get the registers into an appropriate state. If, for example, you list r16 in the clobber list, the compiler will have to preserve the contents of r16 before it can execute your asm routine, and maybe restore it afterward before it starts running compiled code again. If it adds that code, it will affect the delay time. To be sure that your timing is just right, you may have to manage all of the data transmission inside the same block of asm code.

Finally, be wary of using a variable for cycles. When I looked at the asm code for that, I saw that the compiler loaded cycles from memory - two clock ticks - added 2 - another tick - loaded 3 into a different register pair - two more ticks - and called a subroutine to do the division - a lot of ticks. The delay period you were hoping for would likely have elapsed before the division was complete, and before your own delay routine started.

Recent versions of the compiler support:

__builtin_avr_delay_cycles(n);

Where "n" is the number of cycles you want to delay. The compiler knows how long instructions take, and generates an optimal piece of code. For example, for one or two cycles, it will give you a couple of NOPs. For more, it generates a short loop, carefully calculated to give you the target delay.

Thanks for all this knowledge.

I really need to digest it while at the same time I build my library.

The __builtin_avr_delay_cycles() will come in handy, even though I do want to become
comfortable in writing short ASM segments for AVR (and later ARM I suppose). That'll handle
all AVR chips from 4MHz up to the 60+MHz out there I think.

I know the timings of the WS2812B and understand them pretty well. I'm quite seasoned
in CS skills and grok how chips are designed, assemblers do their work and what code does.

I'm just very new to AVR and Arduino, and bare bones ASM (I usually at most use C built ins
for the few special atomic instructions). I picked it up on a whim (a friend gave me an uno
and it opened the floodgates) and now I wanna contribute and build a few cool things for
hum the next burning man :P.

Question is there room for yet another WS2812 library out there? XD

in asm, a semicolon is a comment, so this would work: "nop ; do nothing for one cycle\n\t".

It varies by assembler. The ARM gcc assembler, for instance, DOES use ";" as a line terminator, and "@" as the comment character. This makes actual assembler code look really crappy. (Fortunately, it's common to use the C pre-processor with asm files when using gas, so you can use C-style comments instead.)

"__tmp_reg__" doesn't mean "pick a temporary register to use here", it is A PARTICULAR register used as a temporary by gcc as a whole. It happens to be R0, which a fair number of AVR instructions can't use. Registers supported by ALL instructions are valuable, and the compiler would rather use them for general purposes.

The AVR gcc assembler probably doesn't need a tab at the start, but it's pretty common for assemblers to attach special meaning to symbols that start in column 1, and so it's become "good practice" to make sure that opcodes don't end up there.

westfw,

That's one reason I could never accept learning fortran XD
... and one major Python flaw imfo

but off topic. :stuck_out_tongue:

It's useful to have learned several assembly languages.
IIRC, DEC assembly languages had "a label is a symbol followed by a colon" (Hmm. comma on PDP-1 and PDP-8, apparently!), and IBM assemblers used the "a label is a symbol that starts in column 1. Instruction op-codes never start in column 1." Different microprocessor vendors and/or assembler authors copied from different places...

Okay, I've made some progress but I'm still far from something decent.
Looks like I might not need inline ASM now that I have a built in delay for this specific task.

Some issues:

  • I'd like to see the output in ASM. I only find the HEX output. How can I generate ASM with arduino 1.5.8?
  • sei/cli seem to totally mess up delay(). I do want to protect my bit 0 signal by holding off interrupts in the critical section. :confused:
  • does bitSet() translate to an inline sbi, or does it do more, like being an actual function? If so maybe I do need to inline ASM here.
  • the Zero bit high-level max delay cycles is much lower than I expect. At 16.5MHz, a max of 563ns means less than 9 cycles. I have to put 2 or 3 cycles to make it work in my program. That sucks because it means this can't work at 8MHz or below.

 inline void
 sendBit(bool bit)
 {
   if (bit) {
     // Send a ONE
     bitSet(PORTB, 1); 
     __builtin_avr_delay_cycles(15);
     bitClear(PORTB, 1); 
     __builtin_avr_delay_cycles(30);
   } else {
     // Send a ZERO
     __builtin_avr_sei();
     bitSet(PORTB, 1); 
     __builtin_avr_cli();
     __builtin_avr_delay_cycles(2);
     bitClear(PORTB, 1); 
     __builtin_avr_delay_cycles(30);
   }
 }
        
 void
 sendBytes(int count, uint8_t * bytes)
 {
   for(int c=0; c < count; c++) {
     for(uint8_t b=8; b!=0; b--) {
       sendBit(bytes[c]>>(b-1) & 0x1);
     }
   }
 }

// demo

void setup() {
  bitSet(DDRB, 1); // Use port B, pin 1
}

uint8_t   GbrPixels[3*4] ={0xA0,0xA0,0xA0, 0xF0,0x00,0x00, 0x00,0xF0,0x00, 0x00,0x00,0xF0};
uint8_t BlackPixels[3*4] ={0x00,0x10,0x00, 0x10,0x00,0x00, 0x00,0x10,0x00, 0x00,0x00,0x10};

void loop() {
 sendBytes(3*4, GbrPixels);
 delay(1000);
 sendBytes(3*4, BlackPixels);
 delay(1000);
}

It is possible to see the assembler generated from the C++ code. I have seen it, but I forgot how. I'm sure someone who knows will come along soon.

Well while I proofread my post, I've managed to fix a lot of it:

  • don't use bitSet().
  • I had reversed cli/sei!
  • Need to read ASM output for the following:

My max working delay is 6 cycles, so there's 3 lost cycles I must account for:
sbi is 2, where's the other cycle?

60ns per cycle at 16.5MHz means it works at 545ns, and fails at 606ns delay.

  • At 8MHz, I still use 4 cycles to run 480ns, that's OK, if 5 cycles, it's risky.
  • At 4MHz, I can run the set/clear, and it's 2 cycles, that'd be OK
  • At 1MHz, it's not possible?

	inline void
	sendBit(bool bit)
	{
		if (bit) {
                        PORTB |= 1U<<1;
			__builtin_avr_delay_cycles(15);
                        PORTB &= ~(1U<<1);
			__builtin_avr_delay_cycles(30);
		} else {
			__builtin_avr_cli();
                        PORTB |= 1U<<1;
			__builtin_avr_delay_cycles(5);
                        PORTB &= ~(1U<<1);
			__builtin_avr_sei();
			__builtin_avr_delay_cycles(30);
		}
	}

If you enable verbose mode during compilation, the last but one line will tell you where the .elf file has been saved to (it appears in an objcopy command when the hex is made).

Once you have the location of this file, open a command prompt and change directory to:

<Arduino IDE Directory>\hardware\tools\avr\bin

The path may be slightly different for 1.5 of the IDE, I can't remember, but basically it will be the same directory from the objcopy command in the verbose output.

Now run the following command:

avr-objdump -S "path\to\your.elf" > "where\to\save\asm.lst"

This will produce an assembly listing of the compiled code, including intermixed source code (which is sometimes duplicated, so you need to hunt for the correct place).


What I will say is that there is no need for the U suffix when you are working with 8-bit numbers (1U is an unsigned int, which is 16 bits on AVR, not 32); "1<<1" will suffice. Not that it should affect the compiled output, as it will be optimised as a constant; it's just worth noting.


If I compile your function (without the inline to begin with), it results in the following assembly listing:

  if (bit) {
 12a:	88 23       	and	r24, r24
 12c:	31 f0       	breq	.+12     	; 0x13a <_Z7sendBitb+0x10>
    PORTB |= 1U<<1;
 12e:	29 9a       	sbi	0x05, 1	; 5
    __builtin_avr_delay_cycles(15);
 130:	85 e0       	ldi	r24, 0x05	; 5
 132:	8a 95       	dec	r24
 134:	f1 f7       	brne	.-4      	; 0x132 <_Z7sendBitb+0x8>
    PORTB &= ~(1U<<1);
 136:	29 98       	cbi	0x05, 1	; 5
 138:	07 c0       	rjmp	.+14     	; 0x148 <_Z7sendBitb+0x1e>
    __builtin_avr_delay_cycles(30);
  } 
  else {
    __builtin_avr_cli();
 13a:	f8 94       	cli
    PORTB |= 1U<<1;
 13c:	29 9a       	sbi	0x05, 1	; 5
    __builtin_avr_delay_cycles(5);
 13e:	00 c0       	rjmp	.+0      	; 0x140 <_Z7sendBitb+0x16>
 140:	00 c0       	rjmp	.+0      	; 0x142 <_Z7sendBitb+0x18>
 142:	00 00       	nop
    PORTB &= ~(1U<<1);
 144:	29 98       	cbi	0x05, 1	; 5
    __builtin_avr_sei();
 146:	78 94       	sei
    __builtin_avr_delay_cycles(30);
 148:	8a e0       	ldi	r24, 0x0A	; 10
 14a:	8a 95       	dec	r24
 14c:	f1 f7       	brne	.-4      	; 0x14a <_Z7sendBitb+0x20>
 14e:	08 95       	ret

If I restore the 'inline' keyword, and then call sendBit(true) followed by sendBit(false) in the loop(), you get the following (I've stripped it down and added the two comments):

//First sendBit(true)

    PORTB |= 1U<<1;
 12a:	29 9a       	sbi	0x05, 1	; 5
    __builtin_avr_delay_cycles(15);
 12c:	85 e0       	ldi	r24, 0x05	; 5
 12e:	8a 95       	dec	r24
 130:	f1 f7       	brne	.-4      	; 0x12e <loop+0x4>
    PORTB &= ~(1U<<1);
 132:	29 98       	cbi	0x05, 1	; 5
    __builtin_avr_delay_cycles(30);
 134:	8a e0       	ldi	r24, 0x0A	; 10
 136:	8a 95       	dec	r24
 138:	f1 f7       	brne	.-4      	; 0x136 <loop+0xc>

//Then we do the sendBit(false)

    __builtin_avr_cli();
 13a:	f8 94       	cli
    PORTB |= 1U<<1;
 13c:	29 9a       	sbi	0x05, 1	; 5
    __builtin_avr_delay_cycles(5);
 13e:	00 c0       	rjmp	.+0      	; 0x140 <loop+0x16>
 140:	00 c0       	rjmp	.+0      	; 0x142 <loop+0x18>
 142:	00 00       	nop
    PORTB &= ~(1U<<1);
 144:	29 98       	cbi	0x05, 1	; 5
    __builtin_avr_sei();
 146:	78 94       	sei
    __builtin_avr_delay_cycles(30);
 148:	8a e0       	ldi	r24, 0x0A	; 10
 14a:	8a 95       	dec	r24
 14c:	f1 f7       	brne	.-4      	; 0x14a <loop+0x20>

The ASM you present is as expected, taking 7 cycles from HIGH to LOW front.
So indeed no need for ASM inlined.

The one cycle I am missing is not accounted for in here, to have a +3 overhead
unless sbi's rise is in the middle of the instruction, not at the end, or cbi's
drop happens past the instruction's cycles.

       	cli
       	sbi	0x05, 1	; 0 high
       	rjmp	.+0      	; 2 nops
       	rjmp	.+0      	; 4 nops
      	nop                   ; 5 nop
       	cbi	0x05, 1	; 7 low
       	sei
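
The 7-cycle count can be reproduced with a small host-side tally. This is only a sketch under stated assumptions: classic ATmega timings (sbi/cbi = 2, rjmp = 2, nop = 1, cli/sei = 1), and each pin change landing on the last cycle of its sbi or cbi. The struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

struct insn { const char *name; uint8_t cycles; };

/* The ZERO-bit sequence from the listing, with the assumed timings. */
static const struct insn zero_bit[] = {
    { "cli", 1 }, { "sbi", 2 }, { "rjmp", 2 }, { "rjmp", 2 },
    { "nop", 1 }, { "cbi", 2 }, { "sei", 1 },
};

/* Cycles from the end of instruction `first` to the end of instruction
   `last` (both are indices into seq). */
uint32_t cycles_between(const struct insn *seq, size_t first, size_t last)
{
    uint32_t total = 0;
    for (size_t i = first + 1; i <= last; i++)
        total += seq[i].cycles;
    return total;
}

/* High time of the ZERO bit: from the end of sbi to the end of cbi. */
uint32_t zero_bit_high_cycles(void)
{
    return cycles_between(zero_bit, 1, 5);
}
```

Under those assumptions the pin stays high for 2 + 2 + 1 + 2 = 7 cycles, about 437 ns at 16 MHz. If the observed limit implies more than that, the assumption about where within sbi/cbi the pin actually changes is the first thing to question.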

michinyon:
It is possible to see the assembler generated from the C++ code. I have seen it, but I forgot how. I'm sure someone who knows will come along soon.

Now run the following command: ...

Ach, didn't notice the second page of this thread.