Comments on counting faster

Finally I finished my "counting really fast" series on super aggressive performance optimization. And of course for a really pointless use case. The whole thing should be implemented in hardware, but I was just trying for the sake of trying. After the final post on this http://blog.blinkenlight.net/2012/07/01/counting-revisited/ the whole thing got covered by Hackaday. Now here comes the thing. Most of the comments on the implementation http://blog.blinkenlight.net/experiments/counting/faster-counter/ objected to my use of goto. I do not mind these comments too much, as the commenters seemingly did not read the article at all.

Although I like the Hackaday coverage a lot - it drives a lot of traffic to my blog - I always wonder about their readers. My overall impression is that the comments from the Arduino forum are much more insightful.

So here is my question: what do you think about these examples? Do they get the point through that excessive performance optimization is not worth the hassle? Do they deliver any insights? How could I do better?

My impression is that the hackaday audience covers a broader spectrum of the human race; from the truly pathetic lazy angry troll to the highly enlightened hacker. Unfortunately, the noise and stench from the former seems to drown out the wisdom and insight from the latter. (This forum leans much more towards the enlightened hacker side. Trolls don’t seem to last long. Probably because of Nick’s new toy.)

So here is my question: what do you think about these examples?

I love em! The reader has to “think outside the box” just to understand them!

Cool toy. Looks like it is 3D printed. Maybe there is a market for more massive CNC-milled versions ;)

[quote author=Coding Badly link=topic=114210.msg859331#msg859331 date=1342254659] Probably because of Nick's new toy.) [/quote]

Lol. The tone seemed a bit lighter after I showed that. ;)

Do you have a link to the hackaday comments?

My opposition to GOTO is well known, but it is not a mindless opposition. If you can demonstrate a particular circumstance where it makes a project work that would not work without it, well, go for it.

Back to the real world, though. In general, the trouble caused by GOTO exceeds the benefits. Especially in the hands of beginners.

I am reminded here of another recent thread (the one in which the Ban Hammer appeared) where the OP was concerned to get the maximum performance from the ADC hardware. At no stage that I can recall was the need for this demonstrated. It was all a bit theoretical.

... excessive performance optimization is not worth the hassle ...

I haven't read the example code, I admit. But my general answer would be: yes, excessive optimization is generally not worth it. Future versions of the compiler may negate any benefits, and in virtually every case, readability, maintainability and portability suffer.

My test case of generating VGA signals, which I think is the most timing-dependent code I have ever done, did not need to descend to assembler code. Not because I couldn't, or wouldn't. I just didn't think I could do better than the compiler had already done.

Careful analysis of the generated code led to insights for how to make the C++ code more efficient. Such as reworking the order of rows/columns in the font data. But this would apply whether or not you end up with assembler (or GOTO).

Here is the link http://hackaday.com/2012/07/11/thinking-outside-the-ide-to-make-a-fast-counting-arduino/#comments.

In general, the trouble caused by GOTO exceeds the benefits.

Fully agreed. But as you also mentioned, under specific circumstances a GOTO can be more appropriate. I claim that my example is such a case. There is the original article by Dijkstra, "Go To Statement Considered Harmful". I fully agree with it. Still I claim that my case is an example even Dijkstra would consider harmless.

Btw: in today's code I consider the "Come From" statement a plague similar to, or even worse than, what the "Go To" statement was. It leads to all kinds of missing or poor error handling. The "Come From" statement usually takes a syntax like

try { ... } catch (...) {...}

The issue is that a lot of developers catch too eagerly and then throw away the exception or throw a different one. Both of which make subsequent error analysis harder. Also, a lot of these constructs are missing proper cleanup / finally blocks.

Code with the "Go To" plague is something I have not seen in the last 10 years. Probably because it got hammered into everyone's heads so much that nowadays "Go To" is considered way more evil than it ever was.

I am reminded here of another recent thread (the one in which the Ban Hammer appeared) where the OP was concerned to get the maximum performance from the ADC hardware.

Unlike in the Hammer thread, I do not have to ask. My goal was to demonstrate what happens and how far I could push it, just for the sake of pushing it. It is definitely not intended as a guideline on how one should implement anything. As I already said: a counter is best "implemented" by buying it as a cheap hardware part.

Very occasionally you do need to do this. However, I find as a compiler developer that oftentimes premature optimization is misplaced, and what people think are the hot spots of the code are not. Often I find that code that has been 'optimized' slows down when you change processors or even move to a faster processor. In fact, when I was talking to some fellow developers last week, I wished I had a 'de-optimizer' that would sense when the user tried to help out and manually unrolled a loop, convert it back into the simpler loop, and then unroll it again with the proper scheduling to avoid memory latencies.

In larger systems, you generally should profile the application to find where the true hot-spots are and work on them, rather than assuming where the hot spot is. However, I don’t think the Arduino has such an infrastructure on the real hardware. Perhaps using a simulator on a host system would help you find where the hot-spots are.

Particularly with caches, careful data placement can speed things up.

Back to my example. The most aggressive optimization had <1% overhead for counting PLUS output. >99% of the cycles give an increase in the visible output. Just think for a minute about how much overhead you would expect from counting in a loop (even without output). Since I manipulate a 20-bit counter, the most natural type would be uint32_t - a 4-byte integer. I would expect that each increment would thus take at least 4 cycles. Checking the loop condition would cost another 4 cycles if a "==" is used, 1 cycle if tricks are applied. Unless the compiler unrolls the loop, at least 2 but probably 3 cycles would go into a jump. So 8 cycles instead of 1, and we have not even manipulated the output yet. I am pretty sure that any attempt to implement a counter without my specific optimizations would result in code that is at least 10 times slower.

Having said that, I completely agree that in 99% of the cases optimizations should come from the compiler.
However, I do not agree that in really large systems profiling for hot spots is sufficient. Sometimes the issue is deeper, like a flawed architecture or a messed-up type system. In such cases things get much more challenging than just removing the hot spots.

Note, the ATmega is an 8-bit microprocessor. That means only 8-bit adds are done in one instruction, so a 32-bit add is done as 4 separate adds, ditto with compares. Also, in the ATmega environment, you really don't want the compiler unrolling the loop, since the environment is so memory constrained.

[quote author=Udo Klein link=topic=114210.msg860230#msg860230 date=1342347019]However I do not agree that in really large systems profiling for hot spots is sufficient. Sometimes the issue is deeper, like flawed architecture or messed up type system. In such cases things get much more challenging than just to remove the hot spots.[/quote]

I agree. Selecting an appropriate algorithm is by far the most effective optimization.

I also agree that the best way is to optimize the algorithm at a high level. Though in performance-critical code, once your algorithm is polished, you do need to look at the details to uncover hot spots, etc.

MichaelMeissner: Also, in the ATmega environment, you really don't want the compiler unrolling the loop, since the environment is so memory constrained.

Correct me if I'm wrong, but doesn't unrolling loops only take up extra program memory? With the Harvard architecture, the amount of progmem used doesn't affect anything as long as it fits.

Also, can't the goto be replaced with a while(1) loop? It'll turn out to be the same.

WizenedEE:

MichaelMeissner: Also, in the ATmega environment, you really don't want the compiler unrolling the loop, since the environment is so memory constrained.

Correct me if I'm wrong, but doesn't unrolling loops only take up extra program memory? With the Harvard architecture, the amount of progmem used doesn't affect anything as long as it fits.

Yes, the issue is whether the program memory will fit. An Uno only has 31.5K of flash available. If you use floating point, the emulation routines take a fair bit of space (the 32-bit long support takes some amount of space as well).

WizenedEE: Also, can't the goto be replaced with a while(1) loop? It'll turn out to be the same.

Well, in the early parts of the compiler things are represented as high-level constructs, but as the optimization phases run, the compiler represents loops in terms of jumps.

Typically a loop and a GOTO are NOT the same. The reason is that loop constructs are usually followed by curly braces, so most loops give rise to a new scope. GOTOs jump just about anywhere, especially without regard for the braces. That is, GOTO is usually not tied to a new scope.

BTW: this is exactly the reason why I used GOTO instead of a loop. I know that in theory the optimizer should take care of that. However, my experience showed that if the block inside is large enough, the optimizer will not be perfect. Hence I chose GOTO.

Still, I agree that 99% of the time GOTO can and should be avoided. But remember, my example was not about structured programming but about aggressive optimization. Try to implement it in a straightforward manner, let the compiler optimize, and measure. You will be surprised how much slower the code ends up.

Unfortunately the blame was put on GOTO instead of on poor programming. Also consider that GOTO is just the same as a JMP (jump) instruction in almost any assembly language. And some of the other constructs that have been developed to get rid of GOTO really just tend to obfuscate the code even more.

If you find you are using GOTOs it might be time to evaluate whether you are just coding sloppily and whether other constructs would be cleaner and have better logic. But the reason GOTO is still there is that on occasion it is the best or only way to do it.

When optimizing code you often have to operate by a different set of rules. Your constraint is rapid execution, so you have to work with a set of rules that favor that objective. It might require many more comments (they don't affect execution...) and the result may be hard to understand. And you probably won't need to do it for very large hunks of code.