Reducing Overhead with Variable Length Timers?

Wrote an electronic fuel injection simulation where events are queued up to control the on-off states of injectors and coils. The timer interrupt intervals vary, and because of changes in RPM, the next event time is not calculated until the event reaches the head of the queue. The main loop is busy collecting sensor inputs and calculating a new schedule.

My goal has been 1 us precision at 22k RPM, since that is the maximum RPM I could find for an engine. With 8 cylinders, that works out to 32 events within a 2.73 ms period. At 1000 RPM the same period is 60 ms, to give an idea of the range. The error rates do appear acceptable up to the typical requirement of 7000 RPM, but the error increases significantly above that. Event times vary with the inputs and can sometimes stack up on their own.

I subtract a slop time from the interval to account for CPU overhead, the ISR execution time, and recent ISR contention. The slop is adjusted based on how frequently events run late. Timers.cpp contains implementations for the Due and the Mega here:

Defining TIMER_TEST will cause the timers to wake up, compute a fixed interval, and sleep for that time in handleTimer. The minimum slop I get for the Due is 6 us; the Mega runs between 30-40 us if I recall correctly. It is pretty easy to set up the timer test mode: set serial to 115200 and type 'i' to see what the timer interrupts are up to.
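In pseudocode, the slop handling boils down to something like this (the names here are simplified for illustration, it is not the actual Timers.cpp):

#include <stdint.h>

extern void startTimer(uint32_t us);      // stand-in for the real timer start call

static uint32_t slop = 6;                 // us of assumed overhead (CPU + ISR + contention)

void scheduleNext(uint32_t intervalUs) {
    // lead the requested interval by the slop so the event lands on time
    uint32_t lead = (intervalUs > slop) ? intervalUs - slop : 1;
    startTimer(lead);
}

void onTimerFired(int32_t lateUs) {       // actual - requested, measured in the ISR
    if (lateUs > 0)       slop++;         // late: lead a little more next time
    else if (lateUs < -2) slop--;         // consistently early: give some margin back
}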

Here are some stats from the program in test mode:

timer:{id=     4	idx=    0	slop=   7	sleep=  897	asleep= 895	-awake= 2	+awake= 4	-late=  -3	+late=  7	retest= 207	late=   0.38	hist:{0=2110	5=2110	6=2111	7=2054	8=8}}
timer:{id=     5	idx=    1	slop=   14	sleep=  897	asleep= 887	-awake= 2	+awake= 4	-late=  -10	+late=  -1	retest= 177	hist:{0=2118	5=2118	6=2118	7=2074}}

This is timers 4 & 5, the sleep interval was 900 us, and they are pretty much always early. I believe what is going on is that #4 gets a higher priority, and is not late very often, in fact it was late only 7 times out of 2110 events, probably because #5 got ahead. #4 gets in front of #5 enough that it has adjusted the slop factor to compensate and was never late in the last 2118 events. Early is a marginal improvement over late, and in order to never be late, the timers are more than 1 us early 2054 and 2074 times respectively. The awake time is 2-4 us, and some of that is definitely overhead from the stats I am tracking.

When enabled, the two EFI-related ISRs take 10 and 20 us on the Due, but events don't overlap as much at lower RPM, so the slop times often start at 6 us. The longer execution times when interrupts queue up are my principal source of worst-case jitter, so I have tried to reduce the ISR times and call overhead as much as possible.

The OEM computer runs at 16 MHz, so it seems like I should be able to match it with a Mega, and the Due should be a no-brainer. Unfortunately I do not know what level of accuracy it achieved, so I may already be ahead.

Can I get less jitter out of the timers, other than by removing my metrics?
Are there performance advantages at all to using a prescaler when possible?

The DueTimer library always tries to pick the largest prescaler, which turns out to have a downside: the slop required to keep DueTimer on time went to 120 us, likely because of all the floating point recalculation each time you start a timer. That could be substantially optimized.
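For comparison, if the prescaler is pinned to TIMER_CLOCK1 (MCK/2 = 42 MHz on the Due) and the channel is configured once at init for waveform mode with UP_RC, setting a new period is just an integer multiply and two register writes, roughly:

#include <Arduino.h>

// channel assumed configured once at init, e.g.
// TC_Configure(tc, ch, TC_CMR_WAVE | TC_CMR_WAVSEL_UP_RC | TC_CMR_TCCLKS_TIMER_CLOCK1);
static inline void tcSetPeriodUs(Tc *tc, uint32_t ch, uint32_t us) {
    tc->TC_CHANNEL[ch].TC_RC  = us * 42u;                      // 42 counts per us at MCK/2
    tc->TC_CHANNEL[ch].TC_CCR = TC_CCR_CLKEN | TC_CCR_SWTRG;   // (re)start the channel
}

With the SAM3X's 32-bit counters, TIMER_CLOCK1 still covers intervals out to roughly 100 seconds, so there is little reason to vary the prescaler at all over the 2.73 ms to 60 ms range here.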

If you're seriously running a V8 at 22,000 RPM then I hope your budget for individual bolts is more than the cost of an Arduino Due. A V8 at 7000 RPM is much more practical and realistic.

It's hard to compare different processors. The OEM computer may seem slow at 16MHz but it has a lot of dedicated hardware timers that are specifically constructed to synchronise with the engine RPM. The amount of general-purpose programming code running in a revolution is very little. My laptop is over 3GHz clock speed but it can't run an engine because it's got other things to do, like update Windows.

To understand more about the jitter, you are going to have to dig into the machine instructions that the compiler generates from your code. Little things like comparing a value to zero or less than zero will make significant differences in the instructions that the microprocessor is given. Have a look at the comments in pulseIn() (on the AVR Arduinos) to see how the original author worked out how many instructions his code was taking.

Floating-point is generally slow. Usually your goal is to replace floating-point with fixed-point, using integers representing (say) 1000 times the actual quantities. However, I have found that floating-point division is faster than integer division on the Mega, so you could actually make some things slower with fixed-point arithmetic. You should develop a profile of how long every single operation takes - for instance, is a division faster when replaced with a multiply by a scaled reciprocal and a bit-shift?
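For example, a rough way to time one operation (the volatile variables keep the compiler from folding the work away, the empty op gives the loop and call overhead to subtract, and the names are just for illustration):

#include <Arduino.h>

volatile uint32_t sinkU = 123456, denU = 7;
volatile float    sinkF = 123456.0f, denF = 7.0f;

void opNothing()  { }
void opDivInt()   { sinkU = sinkU / denU + 1u; }
void opDivFloat() { sinkF = sinkF / denF + 1.0f; }

uint32_t timeOp(void (*op)(), uint32_t n) {
    uint32_t t0 = micros();
    for (uint32_t i = 0; i < n; i++) op();
    return micros() - t0;                 // us for n calls, loop overhead included
}

void setup() {
    Serial.begin(115200);
    Serial.print("empty:     "); Serial.println(timeOp(opNothing, 10000));
    Serial.print("int div:   "); Serial.println(timeOp(opDivInt, 10000));
    Serial.print("float div: "); Serial.println(timeOp(opDivFloat, 10000));
}

void loop() { }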

22k is a contrived number, but there are some V8s that rev into that range, and anything above 7k is more about pushing the boundaries to find issues whose fixes improve on-time rates across the board. I posted times for a single crank revolution and forgot to state that the events are spread across 2 revolutions, but the issues are the same.

The OEM processor is no doubt optimized in assembly for the task at hand, but the best I have been able to get on a Mega has been about a 98% on-time rate, and that tanked at not much above 1k RPM. It should improve some with the changes I describe below.

All floating point calculations are in the main loop, so there aren't any in the ISRs, nor any division operations that the compiler shouldn't be able to optimize into a bit shift. Even the final time calculations are done with a multiplication and a bit-shift.
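As an illustration of the multiply-and-shift idea, a divide by 84 (the 84 ticks per us mentioned further down) can be done as a multiply by a scaled reciprocal, since 1/84 is roughly 49932/2^22; the constants here are for illustration, not lifted from my code:

#include <stdint.h>

static inline uint32_t microsToCycles(uint32_t us)     { return us * 84u; }
static inline uint32_t cyclesToMicros(uint32_t cycles) {
    // 49932 / 2^22 ~= 1/84, accurate to a few parts per million
    return (uint32_t)(((uint64_t)cycles * 49932u) >> 22);
}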

I looked at pulseIn(); nothing new for me, but good stuff for the person who wrote DueTimer!

I posted the code for the CycleCount class separately instead of letting it get buried here.

My initial wake period ranged from 540 to 620 ticks. I found the default optimization flag is -Os, and for the sake of size it wasn't inlining functions in some or all cases. By switching to -Ofast I was able to drop the tick range for the awake time down to 387-450! I would also recommend adding the -Wdouble-promotion compiler flag, and removing the -v from the bossac upload command to disable verification, since the image is about 20% larger.
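For anyone wanting to reproduce the flag changes without editing the core's platform.txt directly, a platform.local.txt placed next to it is one option; the property names below come from the Arduino platform spec, and the later -Ofast should win over the core's -Os since GCC honors the last -O it sees (check the recipe ordering in your core, and dropping bossac's -v does mean editing the upload pattern in platform.txt itself):

# platform.local.txt, next to the SAM core's platform.txt
compiler.c.extra_flags=-Ofast -Wdouble-promotion
compiler.cpp.extra_flags=-Ofast -Wdouble-promotion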

I then switched from my microsecond time reference to cycle counts, which works because everything I measure is less than half the rollover period. The inline ticks call gives 84 ticks/us and takes only about 5% of the time of a micros() call.
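For reference, the SAM3X8E is a Cortex-M3, and its CMSIS DWT cycle counter runs at the 84 MHz core clock and wraps about every 51 seconds; a minimal read (not necessarily what the CycleCount class does) looks like this:

#include <Arduino.h>

static inline void cycleCounterInit() {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the trace/debug block
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start counting core cycles
}

static inline uint32_t cycleCount() { return DWT->CYCCNT; }   // 84 ticks per us

Deltas from unsigned subtraction stay valid as long as the measured interval is well under the wrap, which is the rollover constraint above.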

The added resolution and performance gains tightened things up quite a bit. Next I duplicated TC_Start/TC_Stop so they were inline in my start/stop methods, since each is essentially a one-line function. The NVIC calls were already written inline, but they only need to be executed once during init.
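Inlined, the start and stop calls reduce to single register writes; for example, for TC1 channel 1 (timer id 4 in DueTimer's numbering) they look roughly like this:

#include <Arduino.h>

static inline void tc4Start() {
    TC1->TC_CHANNEL[1].TC_CCR = TC_CCR_CLKEN | TC_CCR_SWTRG;   // enable clock + software trigger
}

static inline void tc4Stop() {
    TC1->TC_CHANNEL[1].TC_CCR = TC_CCR_CLKDIS;                 // gate the channel clock off
}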

My two actual ISRs went from 20 and 30 us down to under 8 and just over 14 us. My late error rate at 22k dropped from 24% down to 6%, and down to 0.01% for errors of 3 us or more. The cycles for a sleep/wakeup are now down to 146-174, about 28% of the starting numbers!

Still want it to be faster..