Pages: 1 [2] 3   Go Down
Author Topic: Optimization of code for maximum speed, very specific project.  (Read 2480 times)
0 Members and 1 Guest are viewing this topic.
Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Thank you, I hadn't run through most of this code yet.  That probably would have had me scratching my head for a bit.  smiley-red

Attached is the new version.

* XmasLightControllerForVixenSSRs.ino (59.81 KB - downloaded 8 times.)
Logged

SF Bay Area (USA)
Offline Offline
Tesla Member
***
Karma: 106
Posts: 6367
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

How do you know it's too slow?  And how much too slow IS it?  Can you do the sort of dimming you're looking for, with your existing code, on 16 pins?  2 pins?

Hmm.  It looks like the biggest problem you have at the moment is trying to run the timer interrupt too often.  Trying to interrupt in fewer cycles than your service routine takes to execute is a recipe for disaster; you'll essentially end up executing one main program instruction, doing the entire ISR, one more main program instruction, etc...

Here are some details and other comments.

There's an intermediate-speed digitalWrite that might be useful.  Have arrays indexed by your channel_pin (whatever is sent over the serial port) for Port and bitmask, just like the real digitalWrite does.  Except populate them at initialization time instead of every time you want to write to the pin:
Code:
void populate_fdw_tab (char myPinNum, arduinoPinNum) {
   fdw_port[myPinNum] = digitalPinToPort(arduinoPinNum);
   fdw_ONbitmask = digitalPinToBitMask(arduinoPinNum);
   fdw_OFFbitmask = ~digitalPinToBitMask(arduinoPinNum);
   pinMode(arduinoPinNum, OUTPUT);
   digitalWrite(arduinoPinNum, LOW);
}
   :
void setup() {
  populate_fdw_tab( 0, 22);  // lights are on pins 22 -52
  populate_fdw_tab( 1, 31);  // and can be mapped "however"
    :
  populate_fdw_tab(31, 53);
}
Then, turning a pin off is just:
   *fdw_port[pin] &= fdw_OFFbitmask[pin];
and turning it on is just:
   *fdw_port[pin] |= fdw_ONbitmask[pin];

(Hmm.  You can probably scratch all that, since you're always dealing with all the pins, and the current fastdigitalWrite is doing single instructions.  It would give you faster random/variable access to your pins, but I guess that's unnecessary.)

Similarly, since serial messages are only occasional and not more than every 50ms, while AC cycles happen every 1/120 s (8.3ms),
get rid your calculations of powerDelay[] in ZeroCrossDetected and move them into the serial receive code.

How many brightness levels are you trying to get?  Make sure that you're not running your timer interrupt too often, or it will suck up extra cycles in overhead that might be needed elsewhere.  (It looks like this is really bad at the moment!  This is probably the biggest change that needs to happen!)   tickCounter should probably be MUCH smaller than 32bits; ideally you should divide the 8.3mS half-cycle time into exactly N ticks by adjusting the timer prescaling and count, which will give you up to N brightness levels and simplify a LOT of the  math that is happening, assuming that N < 256.

This code is interesting:
Code:
 if(powerDelay[CHANNEL_PIN_1] <= tickCounter) customDigitalWrite(CHANNEL_PIN_1, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_2] <= tickCounter) customDigitalWrite(CHANNEL_PIN_2, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_3] <= tickCounter) customDigitalWrite(CHANNEL_PIN_3, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_4] <= tickCounter) customDigitalWrite(CHANNEL_PIN_4, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_5] <= tickCounter) customDigitalWrite(CHANNEL_PIN_5, HIGH); // turn the channel on here
Because all the array calculations are done at compile time, so it's load memory, load memory, compare, portwrite.
However, the load memory operations are currently 32bits, and since tickCounter is volatile, it is loaded each time (at significant expense.)  Replace with something like:
Code:
 register uint8_t tick = tickCounter;  // get a local copy!
  if(powerDelay[CHANNEL_PIN_1] <= tick) customDigitalWrite(CHANNEL_PIN_1, HIGH);
  if(powerDelay[CHANNEL_PIN_2] <= tick) customDigitalWrite(CHANNEL_PIN_2, HIGH);
  if(powerDelay[CHANNEL_PIN_3] <= tick) customDigitalWrite(CHANNEL_PIN_3,
HIGH);
  
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

You guys have really made my night.  I can't tell you how much this helps.  I've got to get some sleep but I had to say thanks again.  smiley
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

How do you know it's too slow?  And how much too slow IS it?  Can you do the sort of dimming you're looking for, with your existing code, on 16 pins?  2 pins?

Well, I don't yet, to tell the truth. smiley  I just know that the faster I can get the better looking the dimming is going to be.

Hmm.  It looks like the biggest problem you have at the moment is trying to run the timer interrupt too often.  Trying to interrupt in fewer cycles than your service routine takes to execute is a recipe for disaster; you'll essentially end up executing one main program instruction, doing the entire ISR, one more main program instruction, etc...

Yep, I'm going to have to play with this but I have changed it to count to 500 for right now.

Similarly, since serial messages are only occasional and not more than every 50ms, while AC cycles happen every 1/120 s (8.3ms),
get rid your calculations of powerDelay[] in ZeroCrossDetected and move them into the serial receive code.

Those calcs must have be in an earlier version of the code, right now the ZeroCrossDetected function just turns all of the channels off and then assigns the correct powerDelay values from the pre-computed list. So should be good to go here.

How many brightness levels are you trying to get?  Make sure that you're not running your timer interrupt too often, or it will suck up extra cycles in overhead that might be needed elsewhere.  (It looks like this is really bad at the moment!  This is probably the biggest change that needs to happen!)   tickCounter should probably be MUCH smaller than 32bits; ideally you should divide the 8.3mS half-cycle time into exactly N ticks by adjusting the timer prescaling and count, which will give you up to N brightness levels and simplify a LOT of the  math that is happening, assuming that N < 256.

I'm trying to get 256 levels of brightness, but frankly I'm not sure that level is needed.  64 would probably be enough but I want to shoot for 256. I've changed tickCounter to an unsigned 16bit int. I don't think I'm going to be able to run the IRS fast enough to get down to a byte, but you never know. smiley

This code is interesting:
Code:
 if(powerDelay[CHANNEL_PIN_1] <= tickCounter) customDigitalWrite(CHANNEL_PIN_1, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_2] <= tickCounter) customDigitalWrite(CHANNEL_PIN_2, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_3] <= tickCounter) customDigitalWrite(CHANNEL_PIN_3, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_4] <= tickCounter) customDigitalWrite(CHANNEL_PIN_4, HIGH); // turn the channel on here
  if(powerDelay[CHANNEL_PIN_5] <= tickCounter) customDigitalWrite(CHANNEL_PIN_5, HIGH); // turn the channel on here
Because all the array calculations are done at compile time, so it's load memory, load memory, compare, portwrite.
However, the load memory operations are currently 32bits, and since tickCounter is volatile, it is loaded each time (at significant expense.)  Replace with something like:
Code:
 register uint8_t tick = tickCounter;  // get a local copy!
  if(powerDelay[CHANNEL_PIN_1] <= tick) customDigitalWrite(CHANNEL_PIN_1, HIGH);
  if(powerDelay[CHANNEL_PIN_2] <= tick) customDigitalWrite(CHANNEL_PIN_2, HIGH);
  if(powerDelay[CHANNEL_PIN_3] <= tick) customDigitalWrite(CHANNEL_PIN_3,
HIGH);

This was VERY helpful, it has been a LONG time since I've had to think about this kind of stuff. I'm now grabbing a local copy.

Part of the reason I'm so focused on speed here is that I have a number of folks that may be building similar projects by following mine.  While I'm only going to be using 32 channels with 256 dimming levels and a 50ms frame rate from Vixen, these other folks may be using other settings.  The faster I can make this now the more likely the other people will be to not run into major issues, I hope.

Attached is the latest code.

* XmasLightControllerForVixenSSRs.ino (59.29 KB - downloaded 7 times.)
Logged

Ayer, Massachusetts, USA
Offline Offline
Edison Member
*
Karma: 50
Posts: 1764
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

If speed is very critical, I imagine moving the part that does the interrupt handling and turning on specific lights at specific times off to an outboard processor that does not do interrupts, instead it is dedicated, and just reads the status constantly in the loop.  It should do nothing but just a busy poll, and turn on/off lights.  It would need some way of getting control updates from the brain processor (another pin perhaps).  Otherwise you are going to be spending a lot of time having the interrupt set up the call stack and then undo it.  And if that is not fast enough, consider getting a faster processor like an ARM embedded board.

Perhaps you could use a standard 555 type circuit to do the timing, perhaps not.
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

If speed is very critical, I imagine moving the part that does the interrupt handling and turning on specific lights at specific times off to an outboard processor that does not do interrupts, instead it is dedicated, and just reads the status constantly in the loop.  It should do nothing but just a busy poll, and turn on/off lights.  It would need some way of getting control updates from the brain processor (another pin perhaps).  Otherwise you are going to be spending a lot of time having the interrupt set up the call stack and then undo it.  And if that is not fast enough, consider getting a faster processor like an ARM embedded board.

Perhaps you could use a standard 555 type circuit to do the timing, perhaps not.

I guessing that is beyond my current skill level, lol.  Maybe that will be Version 3 of the controller.
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Given that the powerDelay array is only assigned values inside one of the ISRs and only read from the other ISR and the ISRs are not nested:  Can I remove the volatile attribute from the declaration?  That should give me a few extra cycles in speed shouldn't it?
« Last Edit: April 04, 2013, 12:00:05 pm by zparticle » Logged

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 176
Posts: 12285
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Given that the powerDelay array is only assigned values inside one of the ISRs and only read from the other ISR and the ISRs are not nested:  Can I remove the volatile attribute from the declaration?

You can.  You shouldn't.  It won't make any difference...

Quote
That should give me a few extra cycles in speed shouldn't it?

No.  In each ISR each element of powerDelay is accessed just once (read in one ISR; written in the other).  Removing volatile only helps when there are multiple accesses.  Removing volatile eliminates redundant loads from memory but there aren't any.
Logged

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 176
Posts: 12285
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
I'm trying to get 256 levels of brightness, but frankly I'm not sure that level is needed.

Very unlikely.  In my opinion, what works best are fewer levels with levels that reflect how a human perceives brightness.  I typically use 100 levels with each level such that a person would say there is about a 1% difference in brightness between each level.
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
I'm trying to get 256 levels of brightness, but frankly I'm not sure that level is needed.

Very unlikely.  In my opinion, what works best are fewer levels with levels that reflect how a human perceives brightness.  I typically use 100 levels with each level such that a person would say there is about a 1% difference in brightness between each level.


That makes a lot of sense.  I guess I'll see how the 256 works out and then maybe back off and give 100 a try.  I'm betting you are right.
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Okay sorry to keep pestering everyone but this has me baffled.  Attached is the version of the code that generated the following output.  As you will see I have set the timer/counter 1 TOP value to 100 so I should get the interrupt every 100 cycles.  I've commented out everything in both ISRs except:

1> the reset of tickCounter=0 on the detection of AC zero cross
2> the addition of 100 to tickCounter every time the timer/coutner interrupt is triggered

I'm not reading anything from Vixen in this test.  This is just to see if I get a stable zero cross detection and a stable tick count.

What I am seeing is:

Code:
Power on self test running.
Channel: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
POST Done.
Setting up Vixen mode.
millis=8256 tickCounter=13200
millis=8257 tickCounter=35600
millis=8258 tickCounter=6364
millis=8261 tickCounter=33000
millis=8263 tickCounter=9164
millis=8266 tickCounter=49564
millis=8269 tickCounter=25000
millis=8271 tickCounter=1264
millis=8274 tickCounter=41664
millis=8276 tickCounter=16700
millis=8280 tickCounter=58500
millis=8282 tickCounter=34664
millis=8285 tickCounter=10300
millis=8287 tickCounter=52000
millis=8290 tickCounter=28064
millis=8293 tickCounter=3300
millis=8295 tickCounter=43700
millis=8298 tickCounter=19864
millis=8300 tickCounter=500
millis=8303 tickCounter=34500
millis=8305 tickCounter=10764
millis=8308 tickCounter=52464
millis=8310 tickCounter=27500
millis=8313 tickCounter=3664
millis=8315 tickCounter=44064
millis=8318 tickCounter=19600
millis=8322 tickCounter=61400
millis=8324 tickCounter=37564

Are these number more stable than they look and it is just because the serial printing is so slow compared to the interrupts?

* XmasLightControllerForVixenSSRs.ino (55.41 KB - downloaded 7 times.)
Logged

SF Bay Area (USA)
Offline Offline
Tesla Member
***
Karma: 106
Posts: 6367
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
right now the ZeroCrossDetected function just turns all of the channels off and then assigns the correct powerDelay values from the pre-computed list. So should be good to go here.
The assignments are what I meant should be removed.  They're relatively expensive:
Code:
 else powerDelay[CHANNEL_PIN_3] = powerDelayPreCalc[channelValue[CHANNEL_PIN_3]]; // turn it back on at N ticks past zero cross
     60c:       e0 91 37 07     lds     r30, 0x0737
     610:       f0 e0           ldi     r31, 0x00       ; 0
     612:       ee 0f           add     r30, r30
     614:       ff 1f           adc     r31, r31
     616:       ee 0f           add     r30, r30
     618:       ff 1f           adc     r31, r31
     61a:       ed 5f           subi    r30, 0xFD       ; 253
     61c:       fc 4f           sbci    r31, 0xFC       ; 252
     61e:       80 81           ld      r24, Z
     620:       91 81           ldd     r25, Z+1        ; 0x01
     622:       a2 81           ldd     r26, Z+2        ; 0x02
     624:       b3 81           ldd     r27, Z+3        ; 0x03
     626:       80 93 b4 07     sts     0x07B4, r24
     62a:       90 93 b5 07     sts     0x07B5, r25
     62e:       a0 93 b6 07     sts     0x07B6, r26
     632:       b0 93 b7 07     sts     0x07B7, r27
This would get better if powerDelay isn't volatile, and if it's smaller, but it still doesn't need to happen every half-cycle.
Or you could look at making the copy into a memcpy() call, which would "short circuit" the (correct) volatile declarations.

Quote
I don't think I'm going to be able to run the IRS fast enough to get down to a byte,
Huh?  If ticks were a byte, there is LESS pressure on the computational complexity.  What you lose is brightness levels.  If you can decide that you'll be OK with 256 possible on-times within you 8.3ms half-cycle, powerDelay and tickCounter all become byte-wide operations instead of muti-precision math.  Since at the core of your problem, you have to compare 32 powerDelay values against tickCounter and flip some bits (120 times a second), those are the most important pieces to optimize.  (actually, that's only sort-of true.  I'd say the MOST important thing is to NOT do things 120 times/second that don't need done 120 times/second.)
Note that if you allow 100 brightness levels, but have 256 possible ON-times, you have some ability to tune the "distance" between on-times to allow brightness values to map to perceptions, rather than times.  Neither the power delivered to your load, nor human brightness perception, are exact match-ups with the time during the power cycle that you turn on the relay.

Looking at the assembly language produced, there are some other interesting parts:
Here is the way we'd like the fast digitalWrite to work:
Code:
 customDigitalWrite(CHANNEL_PIN_1, LOW);
     32e:       10 98           cbi     0x02, 0 ; 2
  customDigitalWrite(CHANNEL_PIN_2, LOW);
     330:       11 98           cbi     0x02, 1 ; 2

However, not all of the IO ports of the MEGA are reachable via the sbi/sbi instructions, so some times we get:
Code:
 customDigitalWrite(CHANNEL_PIN_21, LOW);
     356:       9f b7           in      r25, 0x3f       ; 63
     358:       f8 94           cli
     35a:       80 91 0b 01     lds     r24, 0x010B
     35e:       8f 77           andi    r24, 0x7F       ; 127
     360:       80 93 0b 01     sts     0x010B, r24
     364:       9f bf           out     0x3f, r25       ; 63
  customDigitalWrite(CHANNEL_PIN_22, LOW);
     366:       9f b7           in      r25, 0x3f       ; 63
     368:       f8 94           cli
     36a:       80 91 0b 01     lds     r24, 0x010B
     36e:       8f 7b           andi    r24, 0xBF       ; 191
     370:       80 93 0b 01     sts     0x010B, r24
     374:       9f bf           out     0x3f, r25       ; 63

This is about what you expect for those "distant" IO ports, but there is an opportunity for optimization.  The author of the fast digital writes has gone to some trouble to make the bit operations "atomic", which adds three instructions and three cycles to each bit written.  And as someone else said, if you can do byte-wise operations on the IO ports (especially in the "clear", where it's easier), you can reduce the time by a factor of 8.

Another thing you should think about is how to measure and/or test the performance.  Something as simple as turning on the PIN13 LED during your ISR can give you a useful indication of how much time is being spent there.  And you can theoretically check how long there is till the next Interrupt before returning from the ISR as well.  LEDs can replace your SSRs, and any periodic (or even one-shot) interrupt can replace your zero-crossing detector, for purposes of measurement and debugging.
Logged

Offline Offline
Newbie
*
Karma: 1
Posts: 41
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Man I'm waaay to beat and brain fried, I need to step away, I can't even read your post, lol.

I have determined that the numbers simply look erratic, by adding a counter for the ZC ISR and using it to add the overhead in cycles for the ZC ISR and massively increasing the timer interrupt interval the numbers even out a lot.  Basically it is all happening so fast there is just no way to use serial output to measure it.  I wish I a scope and a logic analyzer.

Oh well.  Thanks again, I'll take a look at your post again after I get some sleep.
Logged

SF Bay Area (USA)
Offline Offline
Tesla Member
***
Karma: 106
Posts: 6367
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
Are these number more stable than they look and it is just because the serial printing is so slow compared to the interrupts?
The latter, I think.  You're essentially sampling a counter that runs at 16MHz every 2ms or so, by which time it ought to have incremented by about 32000, which is pretty much the full range of the integer in question...

Quote
 tickCounter += TIMER_CYCLE_COUNT;
You are chasing a wrong direction here, I think.   You should want to redefined a "tick" so that tick = CycleTime/BrightnessLevels (or, for 8.3ms and 256 brightness levels, about 32 us.)  That means that the clock always ticks by 1 in your ISR.   The actual clock or instruction rate of the cpu ought to be irrelevant.  Sure, at some level it's the limiting factor, but you want to have your code running in a way that it's so much faster than the problem, that there's no actual dependency...

Looking at the code generated when tick and powerDelay are both uint8_t, it looks to be about 480 instructions.  32 us is about 512 instructions.  Now, every ISR execution doesn't execute all the instructions, and all the instructions are not one cycle, so it's hard to tell whether this would actually be fast enough.  It does indicate (to me) a couple of things:
1) It would be pretty close; certainly close enough to worry about.
2) wider math (32bit tick and powerDelay) probably isn't going to work at all. (1100 instructions!  (odd that it isn't more.))
3) going to, say, 100 ticks/halfcycle instead of 256 ticks/halfcycle, would probably give you all the breathing room you need.
Logged

Offline Offline
Edison Member
*
Karma: 27
Posts: 2033
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

If your sketch can't keep up,   you could always try moving to a country with 50 Hz electricity.    Like, for example,  New Zealand.    Or one half of Japan ( I forget which half ).
Logged

Pages: 1 [2] 3   Go Up
Jump to: