Why is there a delay at the end of void loop()?

I hope this is the right place to ask this.

I was testing the speed of pin toggling and discovered that relying on void loop() {} to repeat is actually slower than jumping back to the start with a goto. Here are the two versions:

#define L_PINDEF 0  // latch on PB0 = digital pin 8 (PORTB)
#define LATCH_ON() (PORTB |= _BV(L_PINDEF))
#define LATCH_OFF() (PORTB &= ~_BV(L_PINDEF))

void loop() {
start:
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  goto start;
} // with goto

void loop() {
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
} // without

The scope shows 880 kHz for the repeat cycle without goto:

And 1.6 MHz for the repeat cycle with goto:

Any ideas why this may be? :~

In hardware->arduino->cores->arduino lies the ‘main.cpp’ file:

#define ARDUINO_MAIN
#include <Arduino.h>

int main(void)
{
	init();

	setup();
    
	for (;;)
		loop();
        
	return 0;
}

So after the function ‘void loop()’ has finished, the control returns to the above file. The above file then calls void loop() again, and so on. This takes time.

Onions.

Nothing wrong with adding a goto then to squeeze out an extra ± 500 ns? :)

Nothing wrong with adding a goto then to squeeze out an extra ± 500 ns?

Yes, there is. There is no reason you can't use a more readable structure:

void loop() {
  while(1)
  {
    LATCH_ON();
    LATCH_OFF();
    LATCH_ON();
    LATCH_OFF();
  }
} // without goto

@KE7GKP: Doing some research into POV so I am looking for every extra Hz I can find.

While is definitely prettier :)

Although finding a label is easier in long code than tracing a {} pair? After several copy,paste procedures my indentation quickly goes to the dogs :(

JBMetal: After several copy,paste procedures my indentation quickly goes to the dogs :(

Auto-format will make short work of that for you ;)

I suspect that there is internal "housekeeping" to handle things like updating timers, running counters, pulsing PWM outputs, etc. Arduino must have SOME of the "time slot" to handle those things.

There is no "time slot" - updating "micros" and "millis" is handled in interrupts or on-the-fly, and PWM outputs are handled by hardware.

To answer the original question, a simple "goto" will nearly always be faster than a "return from function" + "call to (same) function", but the latter won't get you despised by half the users who think that people who write "goto" in a C program should be drowned at birth. :P

@KE7GKP: Indeed there is a sensor to determine speed. I'm trying to get the maximum refresh rate as a math exercise, which in turn would tell me how many LEDs at what speed can be considered. An eventual POV calculator, if you will. In the end project there will probably be no need to speed up void loop() {}. Perhaps only to get the data read from the EEPROM a tad faster .. ;)

@AWOL: I was surprised to find goto in the reference actually. Is it a more recent addition or perhaps have I been in denial? :)

The goto statement has been part of C from the beginning, added to satisfy those people that barely got beyond BASIC programming, in my opinion. In 25 years of C/C++ coding, I’ve used goto exactly once, and that could have been avoided if I’d been thinking.

I see there is a performance hit for using functions though, them not being compiled inline and all that.. :~ I wonder if goto can give a speed advantage there?

them not being compiled inline and all that..

Have you examined the output from the compiler? What evidence do you have to support your assertion?

I have had some interesting results posted on the POV math thread but there may have been some other factors involved. I'll quickly rewrite the code for both scenarios and see what the scope says ..

Ah! Time to defend the lowly goto...

I was recently working on EtherCard::packetLoop, see tcpip.cpp, line 516. Note the vast number of return statements. Now I needed to add something in just before the method returned. Doh! So there are a few options at this point:

  1. Put the code in before every return.
  2. Refactor the whole thing to be a giant tangled mess of if/elses, even more than it already is.
  3. Refactor even further to separate out into smaller methods.
  4. Use a 'goto' in place of the returns, jumping to the new code just before the single return.

Generally, a goto is a good solution when there are a lot of error cases that need to halt further execution and you don't have exceptions available.
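
Option 4 might look roughly like the sketch below. This is not the EtherCard code: handlePacket, the return codes, and the g_calls_finished counter are made-up names for illustration only. Every early exit jumps to one label, so the added code is written once, just before the single return.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for "the code that must run just before returning".
static int g_calls_finished = 0;

// Hypothetical handler with several early exits (NOT EtherCard::packetLoop).
int handlePacket(const std::uint8_t *buf, int len) {
  int result = 0;  // declared up front so no goto jumps over an initialization

  if (buf == nullptr) { result = -1; goto done; }  // no buffer
  if (len <= 0)       { result = -2; goto done; }  // nothing received
  if (buf[0] != 0x45) { result = -3; goto done; }  // made-up header check

  result = len;  // normal path: pretend the whole packet was handled

done:
  ++g_calls_finished;  // the newly added code, written exactly once
  return result;
}

// Host-side usage example: one call bumps the counter once.
void exampleUsage() {
  const std::uint8_t ok[] = {0x45, 0x00};
  const int before = g_calls_finished;
  assert(handlePacket(ok, 2) == 2);
  assert(g_calls_finished == before + 1);
}
```

The same shape works in plain C. In C++ the one thing to watch is that a goto may not jump past a variable's initialization, which is why result is declared before the first jump.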

The results are in! :)

With goto statement, 68 kHz:

Using a function, 63 kHz:

Code snippet for goto:

void loop() {
  while (1)
  {
    goto ShiftNow;
  starthere:
    datab = 4;
  }

ShiftNow:
  LATCH_OFF();
  SPI.transfer(scopedata);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(dataa);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datab);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datac);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datad);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datae);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(dataf);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datag);
  LATCH_ON();

  goto starthere;
}

Code snippet for function:

void ShiftNow()
{
  LATCH_OFF();
  SPI.transfer(scopedata);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(dataa);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datab);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datac);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datad);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datae);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(dataf);
  LATCH_ON();

  LATCH_OFF();
  SPI.transfer(datag);
  LATCH_ON();
}

void loop() {
  while (1)
  {
    ShiftNow();
  }
}

For more info on the testing parameters see http://arduino.cc/forum/index.php/topic,64615.msg471258.html#msg471258

PS: I personally wouldn't use goto unless I really needed to ..


All of the typical infinite loop constructs ("while (1)", "for (;;)", goto, etc) end up producing a single branch instruction. The delay "at the end of loop" in the original posting is the function return and call overhead.

See also: http://arduino.cc/forum/index.php/topic,4324.0.html for lots of discussion on generating the fastest possible square wave...

Thanks westfw, I have bookmarked that thread; it looks like excellent reading! :)

What I am really looking for, however, is the fastest way to latch 595 registers. Any ideas there? My current implementation is in the thread linked above. What you are seeing in the scope pictures is 8 bytes being sent via SPI, along with the associated latchings.

I got to asking this question because these factors influence my readings and calculations.

westfw: All of the typical infinite loop constructs ("while (1)", "for (;;)", goto, etc) end up producing a single branch instruction. The delay "at the end of loop" in the original posting is the function return and call overhead.


To corroborate this, I ran a simple test: I printed out main.cpp and the Blink sketch after the Arduino preprocessing step but before compilation:

http://arduino.cc/forum/index.php/topic,64615.msg475057.html#msg475057

There's nothing extra at the end of loop() or in main(), so it must be call overhead. I expect several registers need to be updated (stack pointer, program counter, etc.).

void loop() {
  LATCH_ON();
  a8:   28 9a           sbi     0x05, 0 ; 5
  LATCH_OFF();
  aa:   28 98           cbi     0x05, 0 ; 5
  LATCH_ON();
  ac:   28 9a           sbi     0x05, 0 ; 5
  LATCH_OFF();
  ae:   28 98           cbi     0x05, 0 ; 5
} // without
  b0:   08 95           ret

000000b2 <main>:
#include <WProgram.h>

int main(void)
{
        init();
  b2:   0e 94 a8 00     call    0x150   ; 0x150 <init>
        setup();
  b6:   0e 94 53 00     call    0xa6    ; 0xa6 <setup>
         for (;;)
                loop();
  ba:   0e 94 54 00     call    0xa8    ; 0xa8 <loop>
  be:   fd cf           rjmp    .-6             ; 0xba <main+0x8>

The LATCH_ON and LATCH_OFF all end up as single (2-cycle) instructions. The end/resumption of loop is three instructions (return, jmp, call) and both return and call take 4 cycles. So I’d expect the gap between the last bitset in the loop and the first one after the loop resumes to be about 5 times longer than the gap between consecutive bitsets inside the loop, which is just about what the scope trace shows.
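
The scope readings from the first post can be checked against those cycle counts with some quick arithmetic. This is only a back-of-envelope sketch: the 16 MHz clock is an assumption (the thread never states it), rjmp is counted as 2 cycles, and the function names are made up for the example.

```cpp
#include <cassert>

// Cycle counts for the instructions in the listing (classic 8-bit AVR with a
// 16-bit program counter, such as the ATmega328P).
constexpr unsigned kSbiCbiCycles = 2;  // sbi / cbi
constexpr unsigned kRetCycles    = 4;  // ret
constexpr unsigned kCallCycles   = 4;  // call
constexpr unsigned kRjmpCycles   = 2;  // rjmp

// ASSUMPTION: 16 MHz system clock; not stated anywhere in the thread.
constexpr unsigned long kClockHz = 16000000UL;

// One pass when loop() returns to main(): body + ret + rjmp + call.
constexpr unsigned cyclesViaMain(unsigned toggles) {
  return toggles * kSbiCbiCycles + kRetCycles + kRjmpCycles + kCallCycles;
}

// One pass when the body jumps straight back to its own start: body + rjmp.
constexpr unsigned cyclesViaGoto(unsigned toggles) {
  return toggles * kSbiCbiCycles + kRjmpCycles;
}

// Passes per second (integer division, so the result is rounded down).
constexpr unsigned long repeatRateHz(unsigned cyclesPerPass) {
  return kClockHz / cyclesPerPass;
}

// Host-side check against the four-toggle loop from the original post.
void checkAgainstScope() {
  assert(cyclesViaMain(4) == 18);
  assert(cyclesViaGoto(4) == 10);
  assert(repeatRateHz(cyclesViaMain(4)) == 888888UL);   // scope: ~880 kHz
  assert(repeatRateHz(cyclesViaGoto(4)) == 1600000UL);  // scope: 1.6 MHz
}
```

Under those assumptions, four toggles per pass give 18 cycles (about 889 kHz) through main() and 10 cycles (1.6 MHz) with the goto, which lines up with the 880 kHz and 1.6 MHz scope readings.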

I wouldn’t call 10 cpu cycles a “delay”; when you optimize your code down to single instructions, you have to start being aware that EVERYTHING takes at least a little bit of time!

AWOL: To answer the original question, a simple "goto" will nearly always be faster than a "return from function" + "call to (same) function", but the latter won't get you despised by half the users who think that people who write "goto" in a C program should be drowned at birth. :P

If you are trying to generate an exact square wave at an exact frequency, I suggest the 555 chip (or is it the 666 chip? I can never remember).

As for "despise", it's simply a case of using the right tool for the job. The goto statement has its uses, in possibly 0.01% of cases. In the example given:

 start:
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
 goto start;

... there is still going to be a slight discrepancy between the end of the first OFF and the start of the second ON, and the next one. The goto just makes it smaller (the extra instruction, whatever it does). The timer interrupts firing will also delay the code slightly. It will never be a perfect square wave.

Let hardware do it for you.

Agreed. Goto is unduly demonized when in fact it's the author who should be on the receiving end of the ire. There is nothing wrong with goto in general. Having said that, it's very frequently abused and misused. It's the classic case of a poor carpenter blaming his tools.

I completely agree with your comment. But I do want to offer that the error can be reduced further by unrolling the loop by hand. This is a little-used optimization technique. With the above, the error affects 1 out of every 2 pulses. Not so good.

 start:
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
  LATCH_ON();
  LATCH_OFF();
 goto start;

And so on: unroll it until the error becomes acceptable, if possible. With the above, the error now affects 1 out of 64 pulses. Still not great, but considerably better, being 32x more precise. It's the classic size vs. speed trade-off.
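
If typing out 64 pairs by hand is unappealing, the preprocessor can do the pasting. A rough sketch with some loud assumptions: PULSE, REPEAT4, REPEAT16 and REPEAT64 are hypothetical helper macros, and LATCH_ON/LATCH_OFF are replaced by counters here so the example compiles on a PC. On the AVR you would keep the PORTB versions from the first post. cyclesPerPass assumes 2 cycles per sbi/cbi plus 2 for the jump back.

```cpp
#include <cassert>

// Host-side stand-ins (ASSUMPTION): count toggles instead of touching PORTB.
static unsigned g_on = 0;
static unsigned g_off = 0;
#define LATCH_ON()  (++g_on)
#define LATCH_OFF() (++g_off)

// Hypothetical helpers: each level pastes its argument four times.
#define PULSE()     LATCH_ON(); LATCH_OFF();
#define REPEAT4(x)  x x x x
#define REPEAT16(x) REPEAT4(x) REPEAT4(x) REPEAT4(x) REPEAT4(x)
#define REPEAT64(x) REPEAT16(x) REPEAT16(x) REPEAT16(x) REPEAT16(x)

// One pass of the unrolled body: 64 ON/OFF pairs, no branches in between.
// On the target this would sit inside while (1) { ... } or before a goto.
void unrolledPass() {
  REPEAT64(PULSE())
}

// Cycles for one pass of `pulses` ON/OFF pairs plus the jump back to the top.
constexpr unsigned cyclesPerPass(unsigned pulses) {
  return pulses * 4u + 2u;
}

// Host-side check: one pass produces exactly 64 pulses.
void checkUnrolledPass() {
  g_on = 0;
  g_off = 0;
  unrolledPass();
  assert(g_on == 64 && g_off == 64);
}
```

With those cycle counts, the 2-pulse loop takes 10 cycles per pass and the 64-pulse version takes 258, so the jump overhead drops from 20% of the pass to under 1%, at the cost of flash space.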