Optimization of code for maximum speed, very specific project.

Given that the powerDelay array is only assigned values inside one of the ISRs and only read from the other ISR and the ISRs are not nested: Can I remove the volatile attribute from the declaration? That should give me a few extra cycles in speed shouldn't it?

zparticle:
Given that the powerDelay array is only assigned values inside one of the ISRs and only read from the other ISR and the ISRs are not nested: Can I remove the volatile attribute from the declaration?

You can. You shouldn't. It won't make any difference...

That should give me a few extra cycles in speed shouldn't it?

No. In each ISR each element of powerDelay is accessed just once (read in one ISR; written in the other). Removing volatile only helps when there are multiple accesses. Removing volatile eliminates redundant loads from memory but there aren't any.

I'm trying to get 256 levels of brightness, but frankly I'm not sure that level is needed.

Very unlikely. In my opinion, what works best are fewer levels with levels that reflect how a human perceives brightness. I typically use 100 levels with each level such that a person would say there is about a 1% difference in brightness between each level.

That makes a lot of sense. I guess I'll see how the 256 works out and then maybe back off and give 100 a try. I'm betting you are right.

Okay sorry to keep pestering everyone but this has me baffled. Attached is the version of the code that generated the following output. As you will see I have set the timer/counter 1 TOP value to 100 so I should get the interrupt every 100 cycles. I've commented out everything in both ISRs except:

1> the reset of tickCounter=0 on the detection of AC zero cross
2> the addition of 100 to tickCounter every time the timer/coutner interrupt is triggered

I'm not reading anything from Vixen in this test. This is just to see if I get a stable zero cross detection and a stable tick count.

What I am seeing is:

Power on self test running.
Channel: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
POST Done.
Setting up Vixen mode.
millis=8256 tickCounter=13200
millis=8257 tickCounter=35600
millis=8258 tickCounter=6364
millis=8261 tickCounter=33000
millis=8263 tickCounter=9164
millis=8266 tickCounter=49564
millis=8269 tickCounter=25000
millis=8271 tickCounter=1264
millis=8274 tickCounter=41664
millis=8276 tickCounter=16700
millis=8280 tickCounter=58500
millis=8282 tickCounter=34664
millis=8285 tickCounter=10300
millis=8287 tickCounter=52000
millis=8290 tickCounter=28064
millis=8293 tickCounter=3300
millis=8295 tickCounter=43700
millis=8298 tickCounter=19864
millis=8300 tickCounter=500
millis=8303 tickCounter=34500
millis=8305 tickCounter=10764
millis=8308 tickCounter=52464
millis=8310 tickCounter=27500
millis=8313 tickCounter=3664
millis=8315 tickCounter=44064
millis=8318 tickCounter=19600
millis=8322 tickCounter=61400
millis=8324 tickCounter=37564

Are these number more stable than they look and it is just because the serial printing is so slow compared to the interrupts?

XmasLightControllerForVixenSSRs.ino (55.4 KB)

right now the ZeroCrossDetected function just turns all of the channels off and then assigns the correct powerDelay values from the pre-computed list. So should be good to go here.

The assignments are what I meant should be removed. They're relatively expensive:

  else powerDelay[CHANNEL_PIN_3] = powerDelayPreCalc[channelValue[CHANNEL_PIN_3]]; // turn it back on at N ticks past zero cross
     60c:       e0 91 37 07     lds     r30, 0x0737
     610:       f0 e0           ldi     r31, 0x00       ; 0
     612:       ee 0f           add     r30, r30
     614:       ff 1f           adc     r31, r31
     616:       ee 0f           add     r30, r30
     618:       ff 1f           adc     r31, r31
     61a:       ed 5f           subi    r30, 0xFD       ; 253
     61c:       fc 4f           sbci    r31, 0xFC       ; 252
     61e:       80 81           ld      r24, Z
     620:       91 81           ldd     r25, Z+1        ; 0x01
     622:       a2 81           ldd     r26, Z+2        ; 0x02
     624:       b3 81           ldd     r27, Z+3        ; 0x03
     626:       80 93 b4 07     sts     0x07B4, r24
     62a:       90 93 b5 07     sts     0x07B5, r25
     62e:       a0 93 b6 07     sts     0x07B6, r26
     632:       b0 93 b7 07     sts     0x07B7, r27

This would get better if powerDelay isn't volatile, and if it's smaller, but it still doesn't need to happen every half-cycle.
Or you could look at making the copy into a memcpy() call, which would "short circuit" the (correct) volatile declarations.

I don't think I'm going to be able to run the IRS fast enough to get down to a byte,

Huh? If ticks were a byte, there is LESS pressure on the computational complexity. What you lose is brightness levels. If you can decide that you'll be OK with 256 possible on-times within you 8.3ms half-cycle, powerDelay and tickCounter all become byte-wide operations instead of muti-precision math. Since at the core of your problem, you have to compare 32 powerDelay values against tickCounter and flip some bits (120 times a second), those are the most important pieces to optimize. (actually, that's only sort-of true. I'd say the MOST important thing is to NOT do things 120 times/second that don't need done 120 times/second.)
Note that if you allow 100 brightness levels, but have 256 possible ON-times, you have some ability to tune the "distance" between on-times to allow brightness values to map to perceptions, rather than times. Neither the power delivered to your load, nor human brightness perception, are exact match-ups with the time during the power cycle that you turn on the relay.

Looking at the assembly language produced, there are some other interesting parts:
Here is the way we'd like the fast digitalWrite to work:

  customDigitalWrite(CHANNEL_PIN_1, LOW);
     32e:       10 98           cbi     0x02, 0 ; 2
  customDigitalWrite(CHANNEL_PIN_2, LOW);
     330:       11 98           cbi     0x02, 1 ; 2

However, not all of the IO ports of the MEGA are reachable via the sbi/sbi instructions, so some times we get:

  customDigitalWrite(CHANNEL_PIN_21, LOW);
     356:       9f b7           in      r25, 0x3f       ; 63
     358:       f8 94           cli
     35a:       80 91 0b 01     lds     r24, 0x010B
     35e:       8f 77           andi    r24, 0x7F       ; 127
     360:       80 93 0b 01     sts     0x010B, r24
     364:       9f bf           out     0x3f, r25       ; 63
  customDigitalWrite(CHANNEL_PIN_22, LOW);
     366:       9f b7           in      r25, 0x3f       ; 63
     368:       f8 94           cli
     36a:       80 91 0b 01     lds     r24, 0x010B
     36e:       8f 7b           andi    r24, 0xBF       ; 191
     370:       80 93 0b 01     sts     0x010B, r24
     374:       9f bf           out     0x3f, r25       ; 63

This is about what you expect for those "distant" IO ports, but there is an opportunity for optimization. The author of the fast digital writes has gone to some trouble to make the bit operations "atomic", which adds three instructions and three cycles to each bit written. And as someone else said, if you can do byte-wise operations on the IO ports (especially in the "clear", where it's easier), you can reduce the time by a factor of 8.

Another thing you should think about is how to measure and/or test the performance. Something as simple as turning on the PIN13 LED during your ISR can give you a useful indication of how much time is being spent there. And you can theoretically check how long there is till the next Interrupt before returning from the ISR as well. LEDs can replace your SSRs, and any periodic (or even one-shot) interrupt can replace your zero-crossing detector, for purposes of measurement and debugging.

Man I'm waaay to beat and brain fried, I need to step away, I can't even read your post, lol.

I have determined that the numbers simply look erratic, by adding a counter for the ZC ISR and using it to add the overhead in cycles for the ZC ISR and massively increasing the timer interrupt interval the numbers even out a lot. Basically it is all happening so fast there is just no way to use serial output to measure it. I wish I a scope and a logic analyzer.

Oh well. Thanks again, I'll take a look at your post again after I get some sleep.

Are these number more stable than they look and it is just because the serial printing is so slow compared to the interrupts?

The latter, I think. You're essentially sampling a counter that runs at 16MHz every 2ms or so, by which time it ought to have incremented by about 32000, which is pretty much the full range of the integer in question...

tickCounter += TIMER_CYCLE_COUNT;

You are chasing a wrong direction here, I think. You should want to redefined a "tick" so that tick = CycleTime/BrightnessLevels (or, for 8.3ms and 256 brightness levels, about 32 us.) That means that the clock always ticks by 1 in your ISR. The actual clock or instruction rate of the cpu ought to be irrelevant. Sure, at some level it's the limiting factor, but you want to have your code running in a way that it's so much faster than the problem, that there's no actual dependency...

Looking at the code generated when tick and powerDelay are both uint8_t, it looks to be about 480 instructions. 32 us is about 512 instructions. Now, every ISR execution doesn't execute all the instructions, and all the instructions are not one cycle, so it's hard to tell whether this would actually be fast enough. It does indicate (to me) a couple of things:

  1. It would be pretty close; certainly close enough to worry about.
  2. wider math (32bit tick and powerDelay) probably isn't going to work at all. (1100 instructions! (odd that it isn't more.))
  3. going to, say, 100 ticks/halfcycle instead of 256 ticks/halfcycle, would probably give you all the breathing room you need.

If your sketch can't keep up, you could always try moving to a country with 50 Hz electricity. Like, for example, New Zealand. Or one half of Japan ( I forget which half ).

michinyon:
If your sketch can't keep up, you could always try moving to a country with 50 Hz electricity. Like, for example, New Zealand. Or one half of Japan ( I forget which half ).

That is thinking outside the box. :slight_smile:

westfw:

tickCounter += TIMER_CYCLE_COUNT;

You are chasing a wrong direction here, I think. You should want to redefined a "tick" so that tick = CycleTime/BrightnessLevels (or, for 8.3ms and 256 brightness levels, about 32 us.) That means that the clock always ticks by 1 in your ISR. The actual clock or instruction rate of the cpu ought to be irrelevant. Sure, at some level it's the limiting factor, but you want to have your code running in a way that it's so much faster than the problem, that there's no actual dependency...

Looking at the code generated when tick and powerDelay are both uint8_t, it looks to be about 480 instructions. 32 us is about 512 instructions. Now, every ISR execution doesn't execute all the instructions, and all the instructions are not one cycle, so it's hard to tell whether this would actually be fast enough. It does indicate (to me) a couple of things:

  1. It would be pretty close; certainly close enough to worry about.
  2. wider math (32bit tick and powerDelay) probably isn't going to work at all. (1100 instructions! (odd that it isn't more.))
  3. going to, say, 100 ticks/halfcycle instead of 256 ticks/halfcycle, would probably give you all the breathing room you need.

I see what you are getting at now, thanks. I didn't understand how you expected me to put values of 256 or more into a byte. I'll have to see what I can to rearrange this. The latest code is attached.

XmasLightControllerForVixenSSRs.ino (55.8 KB)

Similarly, since serial messages are only occasional and not more than every 50ms, while AC cycles happen every 1/120 s (8.3ms),
get rid your calculations of powerDelay[] in ZeroCrossDetected and move them into the serial receive code.

Done. I only update the powerDelay values after I have read in a new set of channel values, see the updatePowerDelayValues() in the code. I'm turning the interrupts off during this process and I'm not sure if that is a good or bad idea.

I asked this next question earlier and I think I've seen to conflicting "answers" can someone give me the straight poop (I've changed the code this is no longer true but I would like to know)? "Given that the powerDelay array is only assigned values inside one of the ISRs and only read from the other ISR and the ISRs are not nested: Can I remove the volatile attribute from the declaration? That should give me a few extra cycles in speed shouldn't it?"

You can. You shouldn't. It won't make any difference...

In each ISR each element of powerDelay is accessed just once (read in one ISR; written in the other). Removing volatile only helps when there are multiple accesses. Removing volatile eliminates redundant loads from memory but there aren't any.

and

This would get better if powerDelay isn't volatile, and if it's smaller, but it still doesn't need to happen every half-cycle.
Or you could look at making the copy into a memcpy() call, which would "short circuit" the (correct) volatile declarations.

Maybe I'm just reading more into the second one than is really there.

Okay, random mode, reading data from Vixen, zero cross detection and setting the powerDelay values all work as long as I set the prescaler to 8 and TIMER_CYCLE_COUNT to 100. However in Vixen mode the channels don't turn on until the channel values reach about 128 out of the 255 possible. I had a lot of really stupid bugs in there. Working on this stuff when I'm very tired has proven to be a bad choice, lol.

Now I just have to figure out why the light don't come on until around 128 and they don't seem to change intensity. Attached is the latest code.

XmasLightControllerForVixenSSRs.ino (54.7 KB)

Okay the code is now working! Thanks for the help everyone. Boy was I going about this the wrong way. Code is attached.

XmasLightControllerForVixenSSRs.ino (33.5 KB)

Working?! Already! Excellent!

I was fun to help with a well-defined, well-expressed problem with existing well-commented code, and someone who was responsive to the suggestions made. MUCH better than all the "Is a MEGA too slow to handle 32 channels of dimming? Do I need a Due?" questions that result in an extended "discussion" of 8 vs 32 bit cpus, during which the original poster disappears...

westfw:
Working?! Already! Excellent!

I was fun to help with a well-defined, well-expressed problem with existing well-commented code, and someone who was responsive to the suggestions made. MUCH better than all the "Is a MEGA too slow to handle 32 channels of dimming? Do I need a Due?" questions that result in an extended "discussion" of 8 vs 32 bit cpus, during which the original poster disappears...

LOL, yeah that kind of topic can be a bummer. :slight_smile:

Here is a new video. It may take a little while for youtube to process it and make it live. This shows the project basically completed and running through some effects fed to it by Vixen. Finding some interesting side effects of the relay boards. Most interesting is that they seem to be leaking enough power to dimly light a string of LEDs. It isn't a code problem, there is definitely power leaking through the relays. You don't notice it until you connect a string of LEDs, incandescent lights don't have the issue. I wonder if replacing the relays with random cross relays is showing a flaw in the sainsmart board design.

More likely there are resistors on the boards that aren't large enough to keep the power from leaking or the photo decouplers are getting triggered somehow. I've also found an interesting issue with Vixen that I didn't run into before. If none of the channel values change from one frame to the next frame, no data is sent. So my suggestion is to use one of the channels to make sure something always changes so all of the channel data will get sent every frame.

zparticle:
Most interesting is that they seem to be leaking enough power to dimly light a string of LEDs. It isn't a code problem, there is definitely power leaking through the relays. You don't notice it until you connect a string of LEDs, incandescent lights don't have the issue. I wonder if replacing the relays with random cross relays is showing a flaw in the sainsmart board design.

SSRs leak a small amount of current, you will find how much on the datasheet for the SSR. For switching low currents (0.9A or less), there are power opto triacs available with virtually no leakage.

btw if you ever want to increase the number of channels or get more levels of brightness, there is a way to make the critical code (in particular the ISR) faster. Use two arrays of (channel, ticks) pairs. Each entry in the array says which channel should be turned on next, and how many ticks to wait since the last channel. So the ISR only needs to consider one or a few channels each time, except when you have a large number of channels set to the same brightness. The reason for having 2 arrays is that the code in loop() can be setting up one of them while the ISR works from the other.

Ah, okay. Thanks I didn't realize that. It gives me an idea for how to "fix" it though.