Topic: Inline assembly dec/brne loop question

sonyhome

Well, while proofreading my post, I managed to fix a lot of it:

* don't use setBit().
* I had reversed cli/sei!
* Need to read ASM output for the following:

My maximum working delay is 6 cycles, so there are 3 extra cycles I must account for:
sbi is 2, so where's the other cycle?

About 60.6 ns per cycle at 16.5 MHz means it works at 545 ns (9 cycles) and fails at 606 ns (10 cycles).
* At 8 MHz, I still use 4 cycles to run 480 ns, which is OK; 5 cycles would be risky.
* At 4 MHz, I can run the set/clear back to back; that's 2 cycles, which would be OK.
* At 1 MHz, it's not possible?
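
The cycle-to-time arithmetic above can be checked on a PC. This is only a minimal sketch, assuming one cycle lasts exactly 1/F_CPU; the helper name `cyclesToNs` is made up for illustration and is not part of any AVR library:

```cpp
#include <cstdint>

// Hypothetical helper: duration in nanoseconds of `cycles` CPU cycles
// at a clock of `hz` hertz, rounded to the nearest nanosecond.
constexpr uint32_t cyclesToNs(uint32_t cycles, uint32_t hz) {
    return static_cast<uint32_t>((cycles * 1000000000ULL + hz / 2) / hz);
}
```

For example, `cyclesToNs(9, 16500000)` gives 545 and `cyclesToNs(10, 16500000)` gives 606, matching the working and failing delays quoted above. At 1 MHz a single cycle is already 1000 ns, which is why the 1 MHz case looks out of reach.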

Code: [Select]

inline void
sendBit(bool bit)
{
    if (bit) {
        // One bit: long high pulse, interrupts stay enabled.
        PORTB |= 1U<<1;
        __builtin_avr_delay_cycles(15);
        PORTB &= ~(1U<<1);
        __builtin_avr_delay_cycles(30);
    } else {
        // Zero bit: short high pulse, so mask interrupts while the pin is high.
        __builtin_avr_cli();
        PORTB |= 1U<<1;
        __builtin_avr_delay_cycles(5);
        PORTB &= ~(1U<<1);
        __builtin_avr_sei();
        __builtin_avr_delay_cycles(30);
    }
}

Tom Carpenter

#16
Jun 09, 2015, 02:16 pm Last Edit: Jun 09, 2015, 02:25 pm by Tom Carpenter
If you enable verbose mode during compilation, the last but one line will tell you where the .elf file has been saved to (it appears in an objcopy command when the hex is made).

Once you have the location of this file, open a command prompt and change directory to:

Code: [Select]
<Arduino IDE Directory>\hardware\tools\avr\bin

The path may be slightly different for version 1.5 of the IDE, I can't remember, but basically it will be the same directory as the one in the objcopy command in the verbose output.

Now run the following command:

Code: [Select]
avr-objdump -S "path\to\your.elf" > "where\to\save\asm.lst"

This will produce an assembly listing of the compiled code, including intermixed source code (which is sometimes duplicated, so you need to hunt for the correct place).




What I will say is that there is no need to shift an unsigned int constant (16 bits wide on AVR) when you are working with 8-bit numbers; "1<<1" will suffice (without the U). Not that it should affect the compiled output, as it will be optimised to a constant; it's just worth noting.




If I compile your function (without the inline to begin with), it results in the following assembly listing:

Code: [Select]

  if (bit) {
 12a: 88 23        and r24, r24
 12c: 31 f0        breq .+12      ; 0x13a <_Z7sendBitb+0x10>
    PORTB |= 1U<<1;
 12e: 29 9a        sbi 0x05, 1 ; 5
    __builtin_avr_delay_cycles(15);
 130: 85 e0        ldi r24, 0x05 ; 5
 132: 8a 95        dec r24
 134: f1 f7        brne .-4      ; 0x132 <_Z7sendBitb+0x8>
    PORTB &= ~(1U<<1);
 136: 29 98        cbi 0x05, 1 ; 5
 138: 07 c0        rjmp .+14      ; 0x148 <_Z7sendBitb+0x1e>
    __builtin_avr_delay_cycles(30);
  }
  else {
    __builtin_avr_cli();
 13a: f8 94        cli
    PORTB |= 1U<<1;
 13c: 29 9a        sbi 0x05, 1 ; 5
    __builtin_avr_delay_cycles(5);
 13e: 00 c0        rjmp .+0      ; 0x140 <_Z7sendBitb+0x16>
 140: 00 c0        rjmp .+0      ; 0x142 <_Z7sendBitb+0x18>
 142: 00 00        nop
    PORTB &= ~(1U<<1);
 144: 29 98        cbi 0x05, 1 ; 5
    __builtin_avr_sei();
 146: 78 94        sei
    __builtin_avr_delay_cycles(30);
 148: 8a e0        ldi r24, 0x0A ; 10
 14a: 8a 95        dec r24
 14c: f1 f7        brne .-4      ; 0x14a <_Z7sendBitb+0x20>
 14e: 08 95        ret
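
The loop counts in the listing can be sanity-checked by hand. A minimal sketch, assuming the classic AVR timings (ldi = 1 cycle, dec = 1, brne = 2 when taken and 1 when it falls through); the function name `delayLoopCycles` is hypothetical and the code is meant to run on a PC, not on the AVR:

```cpp
#include <cstdint>

// Cycle cost of the ldi / dec / brne delay loop from the listing.
// `count` is the value loaded by ldi (1..255; 0 is not modelled here).
constexpr uint32_t delayLoopCycles(uint8_t count) {
    return 1u                     // ldi
         + count * (1u + 2u)      // dec + taken brne on every pass
         - 1u;                    // the final brne falls through: 1 cycle, not 2
}
```

With these assumptions `delayLoopCycles(5)` is 15 and `delayLoopCycles(10)` is 30, which is why the compiler loads 5 for the 15-cycle delay and 0x0A for the 30-cycle one.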


If I restore the 'inline' keyword, and then call sendBit(true) followed by sendBit(false) in the loop(), you get the following (I've stripped it down and added the two comments):

Code: [Select]


//First sendBit(true)

    PORTB |= 1U<<1;
 12a: 29 9a        sbi 0x05, 1 ; 5
    __builtin_avr_delay_cycles(15);
 12c: 85 e0        ldi r24, 0x05 ; 5
 12e: 8a 95        dec r24
 130: f1 f7        brne .-4      ; 0x12e <loop+0x4>
    PORTB &= ~(1U<<1);
 132: 29 98        cbi 0x05, 1 ; 5
    __builtin_avr_delay_cycles(30);
 134: 8a e0        ldi r24, 0x0A ; 10
 136: 8a 95        dec r24
 138: f1 f7        brne .-4      ; 0x136 <loop+0xc>

//Then we do the sendBit(false)

    __builtin_avr_cli();
 13a: f8 94        cli
    PORTB |= 1U<<1;
 13c: 29 9a        sbi 0x05, 1 ; 5
    __builtin_avr_delay_cycles(5);
 13e: 00 c0        rjmp .+0      ; 0x140 <loop+0x16>
 140: 00 c0        rjmp .+0      ; 0x142 <loop+0x18>
 142: 00 00        nop
    PORTB &= ~(1U<<1);
 144: 29 98        cbi 0x05, 1 ; 5
    __builtin_avr_sei();
 146: 78 94        sei
    __builtin_avr_delay_cycles(30);
 148: 8a e0        ldi r24, 0x0A ; 10
 14a: 8a 95        dec r24
 14c: f1 f7        brne .-4      ; 0x14a <loop+0x20>
~Tom~

sonyhome

The ASM you present is as expected, taking 7 cycles from the rising edge to the falling edge.
So indeed there is no need for inline ASM.

The one cycle I am missing is not accounted for here. To get a +3 overhead, either
sbi's rise would have to happen in the middle of the instruction rather than at the end,
or cbi's drop would have to happen after the instruction's cycles.

Code: [Select]

        cli
        sbi 0x05, 1   ; 0 high
        rjmp .+0      ; 2 nops
        rjmp .+0      ; 4 nops
        nop           ; 5 nop
        cbi 0x05, 1   ; 7 low
        sei


nickgammon

#18
Jun 09, 2015, 10:44 pm Last Edit: Jun 09, 2015, 10:45 pm by Nick Gammon
It is possible to see the assembler generated from the C++ code. I have seen it, but I forgot how. I'm sure someone who knows will come along soon.
http://www.gammon.com.au/tips#info1
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

nickgammon

Quote
Now run the following command: ...
Ach, didn't notice the second page of this thread.

nickgammon

Quote
sei/cli seem to totally mess up delay(). I do want to protect my bit 0 signal by holding off interrupts in the critical section. :/
Of course it does, delay uses interrupts.


Quote
The ASM you present is as expected, taking 7 cycles from the rising edge to the falling edge.
So indeed there is no need for inline ASM.

The one cycle I am missing is not accounted for here. To get a +3 overhead, either
sbi's rise would have to happen in the middle of the instruction rather than at the end,
or cbi's drop would have to happen after the instruction's cycles.

Code: [Select]

        cli
        sbi 0x05, 1   ; 0 high  (2)
        rjmp .+0      ; 2 nops  (2)
        rjmp .+0      ; 4 nops  (2)
        nop           ; 5 nop   (1)
        cbi 0x05, 1   ; 7 low   (2)
        sei


With this code I measure 187 ns between high and low (3 cycles at 16 MHz):

Code: [Select]

    PORTD |= bit (2);
    __asm__ __volatile__ ("nop");   // one-cycle no-op (the original used a "nop" shorthand)
    PORTD &= ~bit (2);


So effectively you can assume that the SBI/CBI work in the middle of the instruction (which makes sense).

sonyhome

How did you measure? You have an oscilloscope? Uh you must have one I bet :P

What you have demonstrated is that sbi and cbi have the same latency to the port output;
otherwise we'd be off by tens of ns from the 187.5 ns mark.

Personally I would assume the instruction's effect to be visible at the end of the instruction
cycle, because that's when you latch the results.

So I have an unaccounted-for 1-cycle mystery, which I can't solve without both an oscilloscope
and the LED strip... :)

sonyhome

So the upside is my tiny code is working and stable. I could make a nifty LED library now.

The question is should I bother? There's already:
* Adafruit
* FastLED

Both however are bulky and require arrays of pixels, while I was thinking of making an ultra light
library...

Goals:
* smallest code and data footprint possible
* minimal time in bitbanging code
** resilient to interrupts during bitbanging for other devices
** smallest number of nops and nop cycles
** support all AVRs 4+MHz
* inline update of LEDs:
** No need to allocate arrays
** Ability to push smaller arrays one after the other on the strip
* support of 1/2/4/16/256 colors palettes to make effects and save SRAM
** For example simple blink patterns can happen with a simple bitmap, saving 8X SRAM space
* poll for LED strip refresh or spin for it
* maybe even fetch data from Flash
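
The bitmap idea in the goals could look roughly like the sketch below. It is only an illustration of a 2-entry palette indexed by one bit per LED; the `Rgb` struct, the `ledColor` name, and the bit ordering (bit 0 of byte 0 is LED 0) are assumptions, not part of any existing library:

```cpp
#include <cstdint>

struct Rgb { uint8_t r, g, b; };

// Hypothetical lookup: LED `i` takes palette[0] or palette[1] depending on
// bit `i` of `bitmap`, so eight LEDs share one byte of bitmap.
constexpr Rgb ledColor(const uint8_t* bitmap, const Rgb* palette, uint16_t i) {
    return palette[(bitmap[i >> 3] >> (i & 7)) & 1];
}
```

With a bitmap byte of 0x05, LEDs 0 and 2 take palette[1] and LED 1 takes palette[0]. Each color could then be computed on the fly while pushing bits to the strip, instead of being stored in a full pixel array.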


I believe FastLED supports palettes; the library is pretty big and comprehensive.
The STL-style C++ code is a bit off-putting to browse through. I could not find the bitbanging code
since it supports ARM and DMA LED drivers.

nickgammon

Quote
How did you measure? You have an oscilloscope? Uh you must have one I bet :P

...

Personally I would assume the instruction effect to be visible at the end of the instruction
cycle because that's when you latch the results.

So I have an unaccounted-for 1-cycle mystery, which I can't solve without both oscilloscope
and the LED strip... :)
Not really. Assuming the instructions latch at the end, we still expect 3 cycles: 0 for the SBI (if it latches at the end), 1 for the NOP, and 2 for the CBI. Total of 3, which is what I measured.
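
That accounting can be written down as a tiny PC-side sketch. It assumes sbi and cbi each take 2 cycles and change the pin at the same point within the instruction, so the width of the high pulse equals the distance between the starts of the two instructions; the names `kSbiCycles` and `highCycles` are hypothetical:

```cpp
#include <cstdint>

// sbi takes 2 cycles on the classic AVR core.
constexpr uint32_t kSbiCycles = 2;

// Width of the high pulse in cycles, given the cycles spent between
// the sbi and the cbi. Where inside the instruction the pin changes
// cancels out as long as it is the same for sbi and cbi.
constexpr uint32_t highCycles(uint32_t cyclesBetween) {
    return kSbiCycles + cyclesBetween;
}
```

`highCycles(1)` gives 3 for the sbi / nop / cbi test, and `highCycles(5)` gives 7 for the sequence with two `rjmp .+0` and one `nop`.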

Quote
You have an oscilloscope?
Yes, always handy for this sort of thing.

westfw

Isn't the WS timing sensitive for the whole string of bits that you send, for however long your string of LEDs is?
Right now, you only have the cli() protecting the duration of the zero bit; shouldn't you turn off interrupts before you send the first bit of the first byte (1 or 0), and not turn them on until the last bit for the whole string has been sent?


sonyhome

westfw,

No, it's the only piece that's super critical:
a zero is a single pulse followed by low.
If it is latched twice, it becomes a one.

Hence that's a very tiny critical section in which to turn off interrupts.
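
A byte-level wrapper around a per-bit sender might look like the sketch below. It assumes the strip expects the most significant bit first (the usual WS2812 convention, not stated in this thread), and both `bitOnWire` and `sendByte` are hypothetical names. Interrupts stay enabled in the loop; only the zero-bit high pulse inside the per-bit sender masks them:

```cpp
#include <cstdint>

// Bit number `i` on the wire (0 = sent first) of `value`, assuming MSB first.
constexpr bool bitOnWire(uint8_t value, uint8_t i) {
    return ((value >> (7 - i)) & 1) != 0;
}

// Hypothetical wrapper: feeds one byte, bit by bit, to a per-bit sender
// such as sendBit(). `send` is any callable taking a bool.
template <typename SendBit>
void sendByte(uint8_t value, SendBit send) {
    for (uint8_t i = 0; i < 8; ++i) {
        send(bitOnWire(value, i));
    }
}
```

Under that assumption, `sendByte(0x80, sendBit)` would emit a one followed by seven zeros.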


Nick,

You are right, but the specs indicate I should be able to add one more cycle than what actually works, hence the mystery cycle...

sonyhome

I'm still actively thinking about this project, but I'm evaluating the FastLED library.

Here's a small two-weeknight project running FastLED that I made with broken LEDs from an LED strip.

 https://youtu.be/yhDU20cZM0c
