Maximum pin toggle speed

Hmm. This was asked over on AVRFreaks, and it's FREQUENTLY a Frequently asked question about CPUs/etc, though I don't recall ever seeing it asked here. Since I actually did the experiment, I'll post the answer anyway!

  while (1) {
    digitalWrite(3, 1);
    digitalWrite(3, 0);
  }

produces a 106.8kHz square wave on digital pin 3 in Arduino 0010, 0011, and 0012, though it would probably be foolish to count on exactly that speed; library functions are subject to change.

  cli();
  while (1) {
    PORTD |= 0x8;
    PORTD &= ~0x8;
  }

on the same board runs at 2.667MHz. (This does produce the minimal sbi/cbi/rjmp loop that you'd expect, BTW.)
So that's about a 20x penalty for the Arduino library code, which sounds about right. The overhead of abstracting I/O to a "pin number" is pretty substantial: a subroutine call, a lookup table to get the port, another lookup table to get the bit, a third to check whether analogWrite is in use, and then less efficient instructions to access the port "indirectly".

Speaking of the hefty digitalWrite() overhead, I noticed two things when I went poking into that code.

One, some pins are slower than others, because they have PWM timers that have to be disengaged.

Two, the "function" that turns off those timers has a really ugly chain of if/if/if/if statements that should really be a switch, or at least if/else if/else if/else statements. I did a timing analysis like you, westfw, and found that gcc really does optimize the code to the same thing, since it's all forced inline and compared against static const data. But it bugs me to see code that relies on the optimizer to rephrase it so drastically to do the right thing: one accidental tweak or compiler update and boom, digitalWrite() could be twice as slow, because the compiler decides to implement what is written, not what might possibly be implied.

That's interesting.

re: 2.667MHz example. Was that at 8MHz? I have been operating under the assumption that a 16MHz ATmega did 16,000,000 instructions per second, which would be 5.333 million loops per second with only 3 instructions.

Also the output isn't exactly "square"; you would need another SBI (or NOP) to hold the pin high for the same time as low, so it would be a bit slower if half on/half off was a requirement.

I would have thought it would do 4 million loops per second at 16MHz. The while statement will be optimized to a relative jump, which is a two-clock instruction, so the whole loop should be 4 clock cycles long.

"some pins are slower than others, because they have PWM timers that have to be disengaged."

Yes. The code reads:

  timer = digitalPinToTimer(pin);
    :
  if (timer != NOT_ON_TIMER) turnOffPWM(timer);

I had assumed digitalPinToTimer() would return NOT_ON_TIMER if analogWrite wasn't active, but it's actually a ROM-based lookup that says which timer MIGHT be associated with the pin. Pin 3, used in my example, IS a PWM output...

The examples are with a 16MHz MDC Bare Bones Board, as shown by the little frequency readout on my Tek TDS210 scope; you're right that that's not what I'd expect based on instruction timings. This may require more investigation!

Ah. SBI and CBI are 2-cycle instructions (I guess that makes sense, since they're read-modify-write), as is the rjmp, so the loop is 6 cycles and 2.66666 MHz is exactly as expected after all!

This implies that I can make a faster loop using OUT and pre-loaded registers...

  while (1) {
    PORTD = ones;
    PORTD = zeros;
  }

Gives me 4MHz (and noticeably not a "square wave"). Adding a nop makes it more square but changes the max freq to about 3.2MHz. Adding two nops makes it very square, but takes it back to 2.667MHz.
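For anyone who wants to check the arithmetic, here is a small host-side sketch (it runs on a PC, not on the AVR). The helper name loop_frequency_hz is made up for illustration, the clock is assumed to be 16MHz, and the cycle counts are the ones quoted in this thread: 2 each for sbi, cbi and rjmp, 1 each for out and nop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any Arduino or avr-libc API):
 * output frequency of a loop that produces one full high/low period
 * per pass, given the cycle cost of each instruction in the loop. */
double loop_frequency_hz(double f_cpu_hz, const unsigned *cycles, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += cycles[i];          /* add up the cycles of one pass */
    assert(total > 0);               /* an empty loop has no period */
    return f_cpu_hz / total;
}
```

Feeding it {2, 2, 2} (sbi/cbi/rjmp) at 16e6 gives about 2.667MHz, {1, 1, 2} (out/out/rjmp) gives 4MHz, and {1, 1, 1, 2} (one added nop) gives 3.2MHz, which lines up with the scope readings above.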

Using digitalWrite() on a non-PWM pin (4 instead of 3) runs at about 148.4kHz instead of 106.8kHz:

  while (1) {
    digitalWrite(4, 1);
    digitalWrite(4, 0);
  }

OK, it makes sense now :)

Here's a square wave version (resulting assembler confirmed). It should be 8 cycles per loop, which would be a 2MHz output on a 16MHz CPU (1MHz on an 8MHz CPU).

  cli();
  while (1) {
    PORTD |= 0x8;
    PORTD |= 0x8;
    PORTD &= ~0x8;
  }

Is changing fuses allowed? :D It seems you can also program the CKOUT fuse and get the system clock echoed on CLKO (digital 8), which could be useful, I reckon. So that would be a 16MHz toggle speed, and it is conceivable that you could control it with an external gate with high precision.

You can usually suck the system clock off of one of the oscillator pins (XTAL2 is the output of an inverter-based circuit), especially if you're willing to add a gate to square things off. I guess it depends on whether the fuse bits are set for the "low power" oscillator or the "full swing" oscillator. See Sections 7.2 through 7.4 of the ATmega168 data sheet.

You can save one cycle from

  cli();
  while (1) {
    PORTD |= 0x8;
    PORTD |= 0x8;
    PORTD &= ~0x8;
  }

and get the higher square wave frequency with

  cli();
  while (1) {
    PORTD |= B1000;
    PORTD &= B11110111;
  }

The newer devices (168/328, IIRC) have an instruction to toggle a pin - too bad I can't recall the name of it right now... That would reduce it by one more instruction.

-j

I think the new 'toggle' functionality for pins works by writing a 1 to the pin's bit in the PIN register.
E.g. to toggle bit 3 of port B, you can do:

PINB = B1000 ;

[yup - M88 datasheet, section 13.1 : "writing a logic one to a bit in the PINx Register, will result in a toggle in the corresponding bit in the Data Register"]

This does NOT work on Mega8's.

I think this means a 3 cycle loop is possible (1 cycle for the OUT instruction, 2 cycles for the RJMP).
Each pass toggles the pin once, so a full period takes two passes (6 cycles), which results in a 2.666MHz square signal.
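To make the toggle behaviour concrete, here is a host-side model in plain C. It is only a sketch of the datasheet sentence quoted above, not real register access: struct port_model and write_pin_register are names invented for this example, and the model assumes a part that has the toggle feature (so not a Mega8).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for one I/O port (illustrative, not real hardware). */
struct port_model {
    uint8_t port;   /* plays the role of the PORTx data register */
};

/* Model of "PINx = mask" on a part with the toggle feature:
 * every 1 bit in mask flips the matching PORTx bit,
 * every 0 bit leaves its pin alone. */
void write_pin_register(struct port_model *p, uint8_t mask)
{
    assert(p != NULL);
    p->port ^= mask;
}
```

Two writes of 0x08 in a row take bit 3 high and then low again without disturbing the other bits, which is why a single OUT per pass is enough for a 50% duty cycle.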

In theory you can beat this by filling the memory space with that OUT instruction and letting the program counter roll over at the end of flash, but there are problems with this solution, despite a theoretical 8MHz output...

Could someone tell me exactly how many cycles the while(true) itself uses, I mean the overhead of the surrounding loop?

  cli();
  while (true) {
    PORTD |= B1000;
    PORTD &= B11110111;
  }

For example, each pass through the while(true) loop takes 2 cycles to execute the two PORTD writes, but how many cycles does the while(true) itself add on each pass, whatever the instructions inside the loop?

That's an inexcusable loss of speed. The compiler should be converting the digitalWrite code into the same ASM that you get by doing it in raw C. This should be fixed in the next IDE if possible.

@selfonlypath:
Your code (or at least a variant using PORTB on my M8) compiles to:

sbi      0x18, 3
cbi      0x18, 3
rjmp      .-6

Each of these instructions is 2 cycles, hence this is a 6 cycle loop.
Note that it's not uniform though. The output will be on for 2 cycles, and off for 4 cycles.

@condemmed

I'm a bit surprised you come up with 2 cycles for PORTD |= B1000; and 2 cycles for PORTD &= B11110111; because I benchmarked and ran this in real time with a big N

for(i=0;i<N;i++){
    PORTD |= B1000;
    PORTD &= B11110111;
  }

in the standard Arduino sketch, and I come up with one cycle per instruction. Of course, in that case the iteration overhead consumes 6 extra cycles whatever is inside the loop, hence a total of 8 cycles.

Please note my benchmarking method cannot trace execution time when using

while (true) {
    .....
  }

which was my initial question.

"That's an inexcusable loss of speed."

It's a lot more complicated than that. The digitalWrite code does a lot of things that the raw C code we're looking at here does not do:

  • Pin number is a variable rather than a constant
  • New pin value is a variable rather than a constant
  • Convert "arduino pin number" to AVR port address and bit
  • If the port is capable of hardware PWM, make sure PWM is off

I mean, it may SOUND awful that a digitalWrite loop is 20x slower than the "fastest possible" loop, but that "fastest possible" pin change operation is a single instruction, which means that digitalWrite is only using on the order of 20 instructions (including function call overhead). There is not a lot of room for improvement while maintaining the same functionality. There is room for some, but then you hit the fact that people wanting ultimate speed still won't be happy, and blinding speed is not particularly one of the goals of the Arduino environment anyway. (I think digitalWrite got slower in Arduino 0015, too...)
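Roughly, the steps in that list look like the following host-side model. To be clear, this is NOT the Arduino source: the names (model_mcu, model_digital_write and the two tables) are invented for illustration, it only models eight pins on a single port (so the pin-to-port lookup is left out), and the PWM-off step is reduced to a counter. It is just meant to show where the extra work comes from when both pin and value are run-time variables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the chip state (not real registers). */
struct model_mcu {
    uint8_t  portd;         /* plays the role of PORTD */
    unsigned pwm_disables;  /* how many times the PWM-off path ran */
};

/* Assumed lookup tables: pin number -> bit mask, pin number -> "might be
 * on a timer". Pins 3, 5 and 6 are marked as PWM-capable here. */
static const uint8_t model_pin_mask[8]  = {0x01, 0x02, 0x04, 0x08,
                                           0x10, 0x20, 0x40, 0x80};
static const uint8_t model_pin_timer[8] = {0, 0, 0, 1, 0, 1, 1, 0};

void model_digital_write(struct model_mcu *m, uint8_t pin, uint8_t value)
{
    assert(m != NULL && pin < 8);
    uint8_t mask = model_pin_mask[pin];   /* lookup: pin -> bit */
    if (model_pin_timer[pin] != 0)        /* lookup: pin -> timer */
        m->pwm_disables++;                /* real code would turn PWM off here */
    if (value)
        m->portd |= mask;                 /* variable mask, so no single sbi */
    else
        m->portd &= (uint8_t)~mask;       /* variable mask, so no single cbi */
}
```

Even in this stripped-down form, every call does two table reads, a test, and a read-modify-write with a mask that isn't known at compile time, on top of the call itself. The PWM-capable pins also take the extra branch on every write, which fits with pin 3 measuring slower than pin 4.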

@selfonlypath: read the whole thread. The cbi/sbi instructions are two cycles each according to the datasheet. The extra instruction you removed to make things faster was added to make the output a square wave rather than asymmetric.

Your for loop produces:

void loop()                     // run over and over again
{
  int i;
#define N 10000
  for(i=0;i<N;i++){
 11c:   80 e0           ldi     r24, 0x00       ; 0  (initialization)
 11e:   90 e0           ldi     r25, 0x00       ; 0
    PORTD |= B1000;
 120:   5b 9a           sbi     0x0b, 3         ; 2 cycles
    PORTD &= B11110111;
 122:   5b 98           cbi     0x0b, 3         ; 2 cycles
  for(i=0;i<N;i++){
 124:   01 96           adiw    r24, 0x01       ; 2 cycles
 126:   27 e2           ldi     r18, 0x27       ; 1 cycle
 128:   80 31           cpi     r24, 0x10       ; 1 cycle
 12a:   92 07           cpc     r25, r18        ; 1 cycle
 12c:   c9 f7           brne    .-14            ; 0x120 <loop+0x8> (2 cycles if taken)
  }

or 11 cycles for the whole loop by "analysis." How are you measuring 8 cycles?
Note that your code has the pin set to 1 for 1 or two cycles and set to zero for 9 or 10 cycles; I hope you didn't expect a square wave...

@selfonlypath:
The following code

    for(uint8_t i=0;i<250; i++){
        PORTD |= 0b00001000;
        PORTD &= 0b11110111;
    }

compiles (WinAVR-20090313, -Os) to the following instructions:


    sbi       0x0b, 3        ; 11
    cbi       0x0b, 3        ; 11
    subi      r24, 0xFF      ; 255
    cpi       r24, 0xFA      ; 250
    brne      .-10

sbi and cbi are 2 cycle instructions, according to my Mega88 datasheet.
This should therefore give a loop of 2 ticks ON, 6 ticks OFF, so a 2MHz output at 25% duty cycle.
I'd test this output, but the only test equipment I own is an ohmmeter, an LED and a damp finger (none of which help here).

Your original code (using while(1)) is slightly tighter, and should produce a 2.6666MHz output at 33.3333% duty cycle (2 ticks ON, 4 ticks OFF).

I think that the best that is possible (without resorting to filling the chip with instructions) is the following code:

    while(1){
        PIND = 0b00001000 ;
    }

which results in a 2.6666MHz output at 50% duty cycle (3 ticks ON, 3 ticks OFF).
The main caveat with this approach is that it will only work on M88s and above.

If you know the state of the other pins on the port, then you could do something like:

    while(1){
        PORTD = 0b00001000;
        PORTD = 0b00000000;
    }

for a 4MHz 25% duty cycle signal (1 tick ON, 3 ticks OFF)
or

    while(1){
        PORTD = 0b00001000;
        asm("nop");
        asm("nop");
        PORTD = 0b00000000;
    }

for a 2.6666MHz 50% duty cycle (3 ticks ON, 3 ticks OFF). This is similar to the earlier PIND example, except that in this case the state of the other pins must be known at compile time.
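The frequency and duty cycle figures in this post all come from the same two numbers, ticks ON and ticks OFF. A minimal host-side sketch of that calculation follows; struct waveform and waveform_from_ticks are names made up for this example, and a 16MHz clock is assumed when plugging in the figures.

```c
#include <assert.h>

/* Illustrative result type: what a scope would show. */
struct waveform {
    double frequency_hz;
    double duty_percent;
};

/* Hypothetical helper: derive frequency and duty cycle from how many
 * CPU clock ticks the pin spends high and low in one period. */
struct waveform waveform_from_ticks(double f_cpu_hz,
                                    unsigned ticks_high,
                                    unsigned ticks_low)
{
    unsigned period = ticks_high + ticks_low;
    assert(period > 0);                       /* need at least one tick */
    struct waveform w;
    w.frequency_hz = f_cpu_hz / period;
    w.duty_percent = 100.0 * ticks_high / period;
    return w;
}
```

With 16e6 as the clock, (2, 6) gives 2MHz at 25%, (2, 4) gives about 2.667MHz at about 33.3%, (1, 3) gives 4MHz at 25%, and (3, 3) gives about 2.667MHz at 50%.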

It would probably be useful to see the resulting assembly listing of sketches. Does anyone know a simple way of getting a .lss file or similar from the Arduino IDE?

Was there a real-world need for this square wave, or was it just a 'how fast can it go'?

(westfw is finding the idea of using a write to PINx to implement toggle to be rather ... distasteful.)
:(