
Topic: Maximum pin toggle speed

@condemned

I'm a bit surprised you come up with 2 cycles for
Code: [Select]
PORTD |= B1000; and 2 cycles for
Code: [Select]
PORTD &= B11110111; because I have benchmarked this in real time with a big N:
Code: [Select]
for(i=0;i<N;i++){
   PORTD |= B1000;
   PORTD &= B11110111;
 }  
in a standard Arduino sketch, and I come up with one cycle for each instruction. Of course, in that case the iteration overhead consumes 6 extra cycles whatever is inside the loop, hence a total of 8 cycles.

Please note my benchmarking method cannot trace execution time when using
Code: [Select]
while (true) {
   .....
 }
which was my initial question.

westfw

Quote
That's an inexcusable loss of speed.

It's a lot more complicated than that.  The digitalWrite code does a lot of things that the raw C code we're looking at here does not do:
  • Pin number is a variable rather than a constant
  • New pin value is a variable rather than a constant
  • Convert "arduino pin number" to AVR port address and bit
  • If the pin is capable of hardware PWM, make sure PWM is off

I mean, it may SOUND awful that a digitalWrite loop is 20x slower than the "fastest possible" loop, but that "fastest possible" pin change operation is a single instruction, which means that digitalWrite is only using on the order of 20 instructions (including function call overhead). There is not a lot of room for improvement while maintaining the same functionality. (There is room for some, but then you hit the fact that people wanting ultimate speed still won't be happy, and blinding speed is not particularly one of the goals of the Arduino environment anyway.) (I think digitalWrite got slower in Arduino 0015, too...)

westfw

@selfonlypath: read the whole thread. The cbi/sbi instructions are two cycles according to the datasheet. The extra instruction you removed to make things faster was added to make the output a square wave rather than asymmetric.

Your for loop produces:
Code: [Select]

void loop()                     // run over and over again
{
 int i;
#define N 10000
 for(i=0;i<N;i++){
11c:   80 e0           ldi     r24, 0x00       ; 0  (initialization)
11e:   90 e0           ldi     r25, 0x00       ; 0
   PORTD |= B1000;
120:   5b 9a           sbi     0x0b, 3 ; 2 cycles
   PORTD &= B11110111;
122:   5b 98           cbi     0x0b, 3 ; 2 cycles
for(i=0;i<N;i++){
124:   01 96           adiw    r24, 0x01       ; 2 cycles
126:   27 e2           ldi     r18, 0x27       ;  1 cycle
128:   80 31           cpi     r24, 0x10       ; 1 cycle
12a:   92 07           cpc     r25, r18         ;1 cycle
12c:   c9 f7           brne    .-14            ; 0x120 <loop+0x8> (2 cycles if taken)
 }
or 11 cycles for the whole loop by "analysis."  How are you measuring 8 cycles?
Note that your code has the pin set to 1 for 1 or 2 cycles and set to zero for 9 or 10 cycles; I hope you didn't expect a square wave...
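The "analysis" figure is just the sum of the per-instruction cycle counts in the listing. A minimal sketch in plain C (meant to run on a PC, not on the AVR), with the counts copied from the comments in the listing and the branch counted as taken; the array and function names are made up for illustration:

```c
#include <stddef.h>

/* Cycle counts for one pass through the loop, copied from the listing
   (brne counted as taken, i.e. every pass except the last one). */
static const unsigned for_loop_cycles[] = {
    2, /* sbi  0x0b, 3   */
    2, /* cbi  0x0b, 3   */
    2, /* adiw r24, 0x01 */
    1, /* ldi  r18, 0x27 */
    1, /* cpi  r24, 0x10 */
    1, /* cpc  r25, r18  */
    2  /* brne (taken)   */
};

/* Sum the cycle counts of the first `count` instructions. */
unsigned total_cycles(const unsigned *cycles, size_t count)
{
    unsigned sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += cycles[i];
    }
    return sum;
}
```

Summing all seven entries gives the 11 cycles quoted above. The first two entries (sbi and cbi) account for 4 of them, which leaves 7 cycles of loop overhead.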

condemned

@selfonlypath:
The following code

   for(uint8_t i=0;i<250; i++){
       PORTD |= 0b00001000;
       PORTD &= 0b11110111;
   }

compiles (WinAVR-20090313, -Os) to the following instructions:

   
   sbi      0x0b, 3      ; 11
   cbi      0x0b, 3      ; 11
   subi     r24, 0xFF    ; 255
   cpi      r24, 0xFA    ; 250
   brne     .-10


sbi and cbi are 2 cycle instructions, according to my Mega88 datasheet.
This should therefore give a loop of 2 ticks ON, 6 ticks OFF, so (at 16 MHz) a 2 MHz output at 25% duty cycle.
I'd test this output, but the only test equipment I own is an ohmmeter, an LED and a damp finger (none of which help here).

Your original code (using while(1)) is slightly tighter, and should produce a 2.667 MHz output at 33.3% duty cycle (2 ticks ON, 4 ticks OFF).

I think that the best that is possible (without resorting to filling the chip with instructions) is the following code:
   while(1){
       PIND = 0b00001000 ;
   }

which results in a 2.667 MHz output at 50% duty cycle (3 ticks ON, 3 ticks OFF).
The main caveat with this approach is it will only work on M88s and above.


If you know the state of the other pins on the port, then you could do something like:

   while(1){
       PORTD = 0b00001000;
       PORTD = 0b00000000;
   }

for a 4 MHz 25% duty cycle signal (1 tick ON, 3 ticks OFF)
or

   while(1){
       PORTD = 0b00001000;
       asm("nop");
       asm("nop");
       PORTD = 0b00000000;
   }

for a 2.667 MHz 50% duty cycle (3 ticks ON, 3 ticks OFF) (i.e. similar to the earlier PIND example, except that in this case the state of the other pins must be known at compile time)
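The frequency and duty-cycle figures in this post follow from simple arithmetic on the ON/OFF tick counts. A minimal sketch in plain C (host-side, not AVR code), assuming the 16 MHz clock mentioned elsewhere in this thread; the function names are made up for illustration:

```c
#include <assert.h>

/* Output frequency in Hz for a pin that is high for on_ticks CPU cycles
   and low for off_ticks CPU cycles in each period. */
double toggle_frequency_hz(double f_cpu_hz, unsigned on_ticks, unsigned off_ticks)
{
    assert(on_ticks + off_ticks > 0);
    return f_cpu_hz / (double)(on_ticks + off_ticks);
}

/* Duty cycle in percent for the same ON/OFF tick counts. */
double duty_cycle_percent(unsigned on_ticks, unsigned off_ticks)
{
    assert(on_ticks + off_ticks > 0);
    return 100.0 * (double)on_ticks / (double)(on_ticks + off_ticks);
}
```

For example, toggle_frequency_hz(16e6, 2, 6) gives 2 MHz with duty_cycle_percent(2, 6) at 25%, and the 1 tick ON / 3 ticks OFF case gives 4 MHz at 25%.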

It would probably be useful to see the resulting assembly listing of sketches - does anyone know a simple way of getting a .lss file or similar from the arduino IDE?

Was there a real-world need for this square wave, or was it just a 'how fast can it go'?

westfw

(westfw is finding the idea of using a write to PINx to implement toggle to be rather ... distasteful.)
:-(

#20
Jun 15, 2009, 05:13 am Last Edit: Jun 15, 2009, 06:14 am by selfonlypath Reason: 1
OK, I'm totally new to Arduino, plus I don't have a debugger, compiler or any other tool for the moment, so I just play with my Macintosh, an ATmega1280 board & the official Arduino 0015 SDK downloaded from the Arduino website.

As you may have seen in the other thread, I'm working on very fast PWM for my power electronics inverters...

The poor man's averaging method I use to measure real CPU cycles is as follows:
Code: [Select]

#define NOP __asm__("nop\n\t")

int N = 0;
// long N=0;
unsigned long time, time1, time2;

void setup()
{
 Serial.begin(9600);
}

void loop() {
 if (Serial.available()) {
   int val = Serial.read();  // incoming character: '+' or '-' adjusts N
   if (val == '+') {
     N += 1000;
   }
   if (val == '-') {
     N -= 1000;
     if (N<0) N = 0;
   }
 }
 
 time1 = micros();
 
 for (int i=0; i < N; i++){
//    __asm__("nop\n\t");
   PORTD |= B1000;
   PORTD &= B11110111;
//   if (UCSR0A & _BV(RXC0)) {
//    }
 }
 
 time2 = micros();

 Serial.print("N: ");
 time=time2-time1;
 Serial.print(time);
 Serial.print(" / ");
 Serial.println(N);
 delay(1000);
}

So with a big N value, the printed ratio converges to the average time in ns per iteration for whatever set of instructions I have inside for(i=0;i<N;i++). For example, a printed output of 444/1000 means 444 ns, or roughly 7 cycles at 16 MHz.

This is how I observed that a NOP was really 1 cycle, that PORTD |= B1000 (setting bit 3 of PORTD) was 1 cycle, ... and that the surrounding management overhead of for(i=0;i<N;i++) was 6 cycles for an int i and 10 cycles for a long i.
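The conversion from the printed "microseconds / N" pair to cycles per iteration can be written out explicitly. A minimal sketch in plain C (to run on a PC, not on the board), assuming a 16 MHz clock; the function name is made up for illustration:

```c
#include <assert.h>

/* Average CPU cycles per loop iteration, given the elapsed time in
   microseconds (time2 - time1 in the sketch), the iteration count N,
   and the CPU clock in MHz (cycles per microsecond). */
double cycles_per_iteration(unsigned long elapsed_us, unsigned long iterations,
                            double f_cpu_mhz)
{
    assert(iterations > 0);
    return (double)elapsed_us * f_cpu_mhz / (double)iterations;
}
```

With the example above, cycles_per_iteration(444, 1000, 16.0) comes out at about 7.1 cycles. As far as I know micros() only has a resolution of 4 microseconds on 16 MHz boards, so a big N is needed for the average to be meaningful.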

About my project, here is what I'm doing to generate high-speed PWM:
Code: [Select]
void loop()
{
cli();  // turn off interrupts
while (true) {
// Turns ON coil charging opto-coupler #1
   PORTH |= B10000;
   for(i=0;i<charge_on;i++) NOP;

// Turns OFF coil charging opto-coupler #1
   PORTH &= B11101111;
   for(i=0;i<charge_off;i++) NOP;
 
// Turns ON coil FE extracting opto-coupler #2
   PORTA |= B1;
   for(i=0;i<extract_on;i++) NOP;

// Turns OFF coil FE extracting opto-coupler #2
   PORTA &= B11111110;
   for(i=0;i<extract_off;i++) NOP;

   if (UCSR0A & _BV(RXC0)) { // check uart  (register name changes per port)
     break;  // looks like there is data.  Break out of loop to handle it
   }
 } // end of time critical loop
 sei();  // interrupts back on
 delay(10); // wait for some characters to arrive
 while (Serial.available()) {
// Macintosh serial monitor parameter management to update 4 loops
 }
}

This is why I need to know the while(true) overhead: to compensate for it when computing the charge_on, charge_off, extract_on & extract_off values and get a precise duty cycle. Please note I have already compensated by including 1 cycle for each PORTX write, along with 2 cycles for the USB RX check from if (UCSR0A & _BV(RXC0)) and 6+1 cycles for each local for(i=0;i<...) NOP loop.

westfw

Quote
I don't have a debugger, compiler or any other tool for the moment

Sure you do; some of them are hiding inside the arduino distribution.  Off in <arduino-install-dir>/hardware/tools/avr/bin/*, you'll find a bunch of the standard gcc tools:

Code: [Select]
BillW-MacOSX-2<1011> ls /Applications/arduino/arduino-0014/hardware/tools/avr/bin/
avarice*      avr-gcc*      avr-gprof*      avr-project*      ice-insight*
avr-addr2line*      avr-gcc-3.4.6*      avr-help*      avr-ranlib*      kill-avarice*
avr-ar*            avr-gcc-4.3.2*      avr-info*      avr-readelf*      libusb-config*
avr-as*            avr-gcc-select*      avr-ld*            avr-size*      make*
avr-c++*      avr-gccbug*      avr-man*      avr-strings*      simulavr*
avr-c++filt*      avr-gcov*      avr-nm*            avr-strip*      simulavr-disp*
avr-cpp*      avr-gdb*      avr-objcopy*      avrdude*      simulavr-vcd*
avr-g++*      avr-gdbtui*      avr-objdump*      ice-gdb*      start-avarice*


I'm not entirely sure which ones work without special hardware support (JTAG ICE/etc) that isn't on Arduino, but I make pretty extensive use of avr-size and avr-objdump (which does disassembly.)  After you download a sketch to your arduino, you'll have an "applet" subdirectory of the sketch directory that will contain standard format binaries and stuff:
Code: [Select]
BillW-MacOSX-2<1012>  pwd
/Users/billw/Documents/Arduino/-test-/Blink_double/applet
BillW-MacOSX-2<1013> ls
Blink_double.cpp      Print.cpp.o            wiring_analog.c.o
Blink_double.cpp.o      WInterrupts.c.o            wiring_digital.c.o
Blink_double.eep      WMath.cpp.o            wiring_pulse.c.o
Blink_double.elf*      core.a                  wiring_shift.c.o
Blink_double.hex      pins_arduino.c.o
HardwareSerial.cpp.o      wiring.c.o


westfw

I tried your test code with N=1000 on my duemilanove (and other numbers with similar results):

Code: [Select]
 time1 = micros();
 
 for (int i=0; i < N; i++){
   PORTD |= B1000;
   PORTD &= B11110111;
 }
 
 time2 = micros();
As shown, I get "628/1000" or 10.048 cycles per loop.
If I remove the first PORTD line, I get "504/1000" or 8.064 cycles per loop, which sure looks like 2 cycles for the bit set to me...
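The cost of a single statement can be estimated by timing the loop with and without it and taking the difference. A minimal sketch in plain C (host-side), assuming a 16 MHz clock; the function name is made up for illustration:

```c
#include <assert.h>

/* Estimated CPU cycles spent in one statement, from two timings of the
   same loop: one with the statement present and one with it removed. */
double statement_cycles(unsigned long with_us, unsigned long without_us,
                        unsigned long iterations, double f_cpu_mhz)
{
    assert(iterations > 0);
    assert(with_us >= without_us);
    return (double)(with_us - without_us) * f_cpu_mhz / (double)iterations;
}
```

With the two measurements above, statement_cycles(628, 504, 1000, 16.0) comes out at about 1.98, i.e. 2 cycles for the bit set.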

Oops, you're entirely right ::)

I had benchmarked with only
Code: [Select]
 for (int i=0; i < N; i++){
   PORTD = B1000;
   PORTD = B11110111;
 }

Many thanks, you helped me find a bug ;)

So about my code, in particular after the last PORT call (extract_off), how many cycles will the while(true) use?

Please note that I TDMA frame my pulses, so I'm not using while() to toggle but rather tuning parameters with cycle offset corrections to get proper timing.

It is worth noting that if 2 pins belong to the same port, we can set or clear both outputs within the same cycle.

Again, many thanks for all your support on this thread & the other one about scanning USB activity in 2 cycles.

condemned

I've just used:

       avr-objdump -S mysketch.elf > mysketch.lss

in the applet directory of my sketch to produce a nice assembly listing (the '>' output redirection works the same way in a Mac OS X or Linux shell as it does on Windows).
Selfonlypath, I think you'll find it useful to pick through the .lss file. It should save you having to time everything to find your best solution. I have some sympathy with you doing cycle-counting - I've been cycle-counting video code recently to ensure jitter-free video output.

Re: using PIND to toggle bit-states.
It doesn't leave the cleanest of aftertastes. I think using this method has 2 downsides: a) It's the opposite of self-documenting code (self-obfuscating?) and b) It's not available on all Arduino hardware (e.g. Mega8). Hopefully the very smart people who develop avr-gcc will work out a way of optimising PORTD ^= 0b00001000 to the PIND equivalent one day (but I can see that being rather tricky to do).

#25
Jun 16, 2009, 05:15 am Last Edit: Jun 16, 2009, 05:59 am by selfonlypath Reason: 1
hello westfw and condemned,

OK, from my Mac, I can see & go to /Applications/arduino-0015/hardware/tools/avr/bin/avr-objdump but how do I launch avr-objdump -S on my sketch.elf, which is stored in another directory?

About max toggling, does goto instead of while(true) give the same assembly code & overhead / duty cycle time on each pass?
Code: [Select]
 while (true) {
   PORTD |= B1000;
   PORTD &= B11110111;
 }

Code: [Select]

m:  PORTD |= B1000;
   PORTD &= B11110111;
   goto m;
 }





retrolefty

Caution ahead: GOTO usage.    ;D

Lefty

westfw

The "while (1)" compiles to a single rjmp instruction at the end of loop (2 cycles.)  You can't do any better than that on an AVR.

The FOR loop looks pretty optimal for a counted loop as well.  The AVR doesn't have any of those fancy "decrement and loop if non-zero" instructions, so the choices of looping code are pretty limited.  It looks like you can save ONE cycle of an int-based for loop by counting down to zero instead of up to N, since it has to load a non-zero compare value into a register to do the "double precision" compare.

Have you looked at the video output code some people are doing (http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1240539968 - "teleMate shield")?  While the end application is quite different than yours, the low level coding issues - coming up with a high-accuracy and high-speed bit stream on an AVR pin - are very similar!

#28
Jun 16, 2009, 07:31 am Last Edit: Jun 16, 2009, 07:37 am by selfonlypath Reason: 1
Hey westfw,

I'm still having difficulties running AVR MacPack:
http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1245125011

so can you confirm this math model of how many extra cycles (overhead, surroundings,...) are used? For the moment I can only inspect by benchmarking or cycle-counting, as I explained yesterday  :'(

Q1: While(true) only uses 2 cycles ?

Q2: For(i:=0;i<n;i++) uses 2 cycles to initialize then 6 cycles to either jump loop or leave the loop once completed ?

Q3: About your suggestion to decrement loop, do you confirm initialization will also take 2 cycles (starting from i=0 might not require same cycle as starting from i= non zero value).

Sorry for all these questions but I feel you understand what i'm trying to do: have precise math model of every cycle, every instructions to generate high speed multi PWM requiring reverse-engineer to set up parameters of for(), while(),...

westfw

Quote
Q1: While(true) only uses 2 cycles ?
Yes.

Quote
Q2: For(i:=0;i<n;i++) uses 2 cycles to initialize then 6 cycles to either jump loop or leave the loop once completed ?
Hmm.  I haven't been paying much attention to initialization.  It looks like it's got additional dependencies on whether n is constant or variable, and the exact value if it's a constant?  I still think you should use delay_loop_1() for your inner timing loops (or just strings of nops, depending on just how fast you need to go.)  (example of delay_loop_1() posted in the other thread.)  Each loop computation and jump is 7 cycles (for a 16bit loop variable) EXCEPT for the last one that drops out the bottom (6 cycles.)
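Taking the figures in this answer at face value (7 cycles of loop computation and jump per pass, 6 on the final pass that falls through), the total cost of a counted loop can be modelled as below. This is a minimal sketch in plain C (host-side); it leaves out initialization, it only holds for the particular code this compiler version generated, and the names are made up for illustration:

```c
#include <assert.h>

#define LOOP_OVERHEAD_TAKEN 7UL /* loop computation + jump back, 16-bit counter */
#define LOOP_OVERHEAD_EXIT  6UL /* final pass, branch not taken */

/* Total cycles for n passes of a counted loop whose body costs
   body_cycles per pass, excluding loop initialization. */
unsigned long counted_loop_cycles(unsigned long n, unsigned long body_cycles)
{
    assert(n > 0);
    return (n - 1UL) * (body_cycles + LOOP_OVERHEAD_TAKEN)
         + (body_cycles + LOOP_OVERHEAD_EXIT);
}
```

For example, 1000 passes over a 4-cycle body (one sbi plus one cbi) come to 10999 cycles under this model.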

Quote
Q3: About your suggestion to decrement loop, do you confirm initialization will also take 2 cycles (starting from i=0 might not require same cycle as starting from i= non zero value).
Good point.  I think initialization is same for zero or non-zero, since there is no carry involved.  The end-of-loop test for zero is shorter because there is a "known zero" register to compare against for zero check, but it has to load a register (or at any rate, the sample code DID load a register) to do the equivalent of "compare with borrow" against a non-zero 16bit constant.  (Hmm.  Why didn't it move that register load outside of the actual loop, eh?   Perhaps because arduino compilation gives the compiler the "optimize for size" switch and they're the same size?)

Quote
I feel you understand what i'm trying to do: have precise math model of every cycle, every instructions to generate high speed multi PWM requiring reverse-engineer to set up parameters of for(), while(),...
Yes, but you SHOULDN'T be counting on the compiler to produce the same code from version to version.  That's why there are those inline asm functions like delay_loop_1() designed to look like C code but actually use CONSTANT assembler structure for this sort of timing.  If you really need things accurate down to single cycles, you should bite the bullet and write pure assembler, very carefully.  If you need things down to (say) +/- 4 cycles, I'd feel pretty confident using the delay macros inside of C constructs.  If you can withstand +/- 10 cycles you can probably get away with pure C as long as you pay attention each time the compiler changes...
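To pick an argument for such a delay function, the target cycle count has to be converted to loop iterations. A minimal sketch in plain C (host-side), assuming that delay_loop_1() here means avr-libc's _delay_loop_1() from <util/delay_basic.h>, which is documented as taking 3 cycles per iteration with an 8-bit counter where 0 means 256 iterations. The helper names are made up for illustration, and the setup overhead of loading the counter is not modelled:

```c
#include <assert.h>
#include <stdint.h>

#define DELAY_LOOP_1_CYCLES 3u /* assumed cycles per iteration */

/* Number of iterations (1..256) closest to the requested cycle count. */
unsigned delay_loop_1_iterations(unsigned target_cycles)
{
    unsigned iters = (target_cycles + 1u) / DELAY_LOOP_1_CYCLES; /* round to nearest */
    if (iters < 1u) {
        iters = 1u;
    }
    if (iters > 256u) {
        iters = 256u;
    }
    return iters;
}

/* 8-bit argument to pass for a given iteration count; 256 is encoded as 0. */
uint8_t delay_loop_1_argument(unsigned iterations)
{
    assert(iterations >= 1u && iterations <= 256u);
    return (uint8_t)(iterations & 0xFFu);
}
```

For example, a 30-cycle delay maps to 10 iterations, and anything above roughly 768 cycles is clamped to the 256-iteration maximum (argument 0).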
