right now the ZeroCrossDetected function just turns all of the channels off and then assigns the correct powerDelay values from the pre-computed list. So should be good to go here.
The assignments are what I meant should be removed. They're relatively expensive:
else powerDelay[CHANNEL_PIN_3] = powerDelayPreCalc[channelValue[CHANNEL_PIN_3]]; // turn it back on at N ticks past zero cross
60c: e0 91 37 07 lds r30, 0x0737
610: f0 e0 ldi r31, 0x00 ; 0
612: ee 0f add r30, r30
614: ff 1f adc r31, r31
616: ee 0f add r30, r30
618: ff 1f adc r31, r31
61a: ed 5f subi r30, 0xFD ; 253
61c: fc 4f sbci r31, 0xFC ; 252
61e: 80 81 ld r24, Z
620: 91 81 ldd r25, Z+1 ; 0x01
622: a2 81 ldd r26, Z+2 ; 0x02
624: b3 81 ldd r27, Z+3 ; 0x03
626: 80 93 b4 07 sts 0x07B4, r24
62a: 90 93 b5 07 sts 0x07B5, r25
62e: a0 93 b6 07 sts 0x07B6, r26
632: b0 93 b7 07 sts 0x07B7, r27
This would get better if powerDelay weren't volatile, and if it were smaller (it's a 4-byte value here, hence the four loads and four stores), but either way it doesn't need to happen every half-cycle.
Or you could look at making the copy into a memcpy() call, which would "short circuit" the (correct) volatile declarations.
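A rough sketch of what "not every half-cycle" could look like: do the table lookup once, at the moment a channel's level changes, and leave powerDelay alone in the zero-cross handler. The array sizes, the 0..100 level range, and the setChannelLevel() helper are my assumptions, not your code. It also assumes the tick ISR does not overwrite powerDelay after a channel fires; if it does, this doesn't apply as-is.

```cpp
#include <stdint.h>

#define NUM_CHANNELS 32    /* from the discussion: 32 powerDelay values */
#define NUM_LEVELS   101   /* assumed: brightness levels 0..100 */

/* Assumed layouts; the declarations in the real sketch may differ. */
static uint32_t          powerDelayPreCalc[NUM_LEVELS];
static uint8_t           channelValue[NUM_CHANNELS];
static volatile uint32_t powerDelay[NUM_CHANNELS];

/* Hypothetical helper, called from the main loop when a new level arrives.
   Returns 1 if powerDelay was rewritten, 0 if nothing needed to change. */
int setChannelLevel(uint8_t channel, uint8_t level)
{
    if (channel >= NUM_CHANNELS || level >= NUM_LEVELS) {
        return 0;                       /* out of range: ignore */
    }
    if (channelValue[channel] == level) {
        return 0;                       /* unchanged: skip the expensive copy */
    }
    channelValue[channel] = level;
    /* On a real AVR this multi-byte write should be protected from the ISR,
       for example with ATOMIC_BLOCK from <util/atomic.h>. */
    powerDelay[channel] = powerDelayPreCalc[level];
    return 1;
}
```

The 16-instruction copy then runs only when a value actually changes, instead of 32 channels times 120 times a second.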
I don't think I'm going to be able to run the ISR fast enough to get down to a byte,
Huh? If ticks were a byte, there is LESS pressure on the computational complexity. What you lose is brightness levels. If you can decide that you'll be OK with 256 possible on-times within your 8.3ms half-cycle, powerDelay and tickCounter all become byte-wide operations instead of multi-precision math. Since at the core of your problem, you have to compare 32 powerDelay values against tickCounter and flip some bits (120 half-cycles a second), those are the most important pieces to optimize. (Actually, that's only sort-of true. I'd say the MOST important thing is to NOT do things 120 times/second that don't need to be done 120 times/second.)
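To make the byte-wide idea concrete, here is a minimal sketch of the compare loop with everything held in uint8_t. The channelsDue() name and the mapping of channel n to port n/8, bit n%8 are made up for illustration; the real Mega pin-to-port mapping is different.

```cpp
#include <stdint.h>

#define NUM_CHANNELS 32
#define NUM_PORTS    4     /* assumed: 8 channels per 8-bit port */

/* On-time for each channel in ticks past zero cross, 0..255. */
static uint8_t powerDelay[NUM_CHANNELS];

/* Hypothetical helper: for the current tick, build one bit mask per port
   of the channels that are due to switch on. */
void channelsDue(uint8_t tickCounter, uint8_t portMask[NUM_PORTS])
{
    uint8_t ch = 0;
    for (uint8_t port = 0; port < NUM_PORTS; port++) {
        uint8_t mask = 0;
        /* bit walks 0x01, 0x02, ... 0x80, then shifts out to 0 and ends the loop */
        for (uint8_t bit = 1; bit != 0; bit = (uint8_t)(bit << 1)) {
            if (powerDelay[ch++] == tickCounter) {
                mask |= bit;
            }
        }
        portMask[port] = mask;
    }
}
```

Each comparison is a single byte compare, and each port ends up with one mask that can be written in one go rather than bit by bit.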
Note that if you allow 100 brightness levels, but have 256 possible ON-times, you have some ability to tune the "distance" between on-times to allow brightness values to map to perceptions, rather than times. Neither the power delivered to your load, nor human brightness perception, are exact match-ups with the time during the power cycle that you turn on the relay.
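As one example of that kind of tuning, the table below maps brightness levels to on-time ticks so that each level step is an equal step in delivered power rather than an equal step in time. It assumes a purely resistive load on sinusoidal mains that conducts from the firing tick until the next zero cross, in which case firing at fraction x of the half-cycle delivers 1 - x + sin(2*pi*x)/(2*pi) of full power. The names and the 0..100 level range are mine. It only covers power; matching perceived brightness would need a further correction on top. I'd run this on a PC and paste the resulting table into the sketch rather than compute it on the AVR.

```cpp
#include <stdint.h>
#include <math.h>

#define NUM_LEVELS 101   /* assumed: brightness levels 0..100 */
#define NUM_TICKS  256   /* byte-wide on-times within one half-cycle */

static uint8_t levelToTick[NUM_LEVELS];

/* Fraction of full power delivered to a resistive load when the switch
   fires at fraction x (0..1) of the half-cycle. */
static double powerFraction(double x)
{
    const double PI = 3.14159265358979323846;
    return 1.0 - x + sin(2.0 * PI * x) / (2.0 * PI);
}

/* For each level, pick the tick whose delivered power is closest to level/100. */
void buildLevelToTick(void)
{
    for (int level = 0; level < NUM_LEVELS; level++) {
        double target = level / 100.0;
        int best = 0;
        double bestErr = 2.0;            /* larger than any possible error */
        for (int tick = 0; tick < NUM_TICKS; tick++) {
            double err = fabs(powerFraction((double)tick / NUM_TICKS) - target);
            if (err < bestErr) {
                bestErr = err;
                best = tick;
            }
        }
        levelToTick[level] = (uint8_t)best;
    }
}
```

Full brightness maps to tick 0 (fire right at the zero cross) and half power maps to the middle of the half-cycle, with the ticks bunched toward the middle where power changes fastest.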
Looking at the assembly language produced, there are some other interesting parts:
Here is the way we'd like the fast digitalWrite to work:
32e: 10 98 cbi 0x02, 0 ; 2
330: 11 98 cbi 0x02, 1 ; 2
However, not all of the IO ports of the MEGA are reachable via the sbi/cbi instructions (they only reach the lowest 32 I/O addresses), so sometimes we get:
356: 9f b7 in r25, 0x3f ; 63
358: f8 94 cli
35a: 80 91 0b 01 lds r24, 0x010B
35e: 8f 77 andi r24, 0x7F ; 127
360: 80 93 0b 01 sts 0x010B, r24
364: 9f bf out 0x3f, r25 ; 63
366: 9f b7 in r25, 0x3f ; 63
368: f8 94 cli
36a: 80 91 0b 01 lds r24, 0x010B
36e: 8f 7b andi r24, 0xBF ; 191
370: 80 93 0b 01 sts 0x010B, r24
374: 9f bf out 0x3f, r25 ; 63
This is about what you expect for those "distant" IO ports, but there is an opportunity for optimization. The author of the fast digital writes has gone to some trouble to make the bit operations "atomic", which adds three instructions and three cycles to each bit written. And as someone else said, if you can do byte-wise operations on the IO ports (especially in the "clear", where it's easier), you can reduce the time by a factor of 8.
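A sketch of the byte-wise clear, using a plain variable as a stand-in for the port register so it can be tried on a PC. On the real chip the pointer would be the address of the PORTx register; clearPortBits() is a hypothetical helper, not part of the fast digital write library.

```cpp
#include <stdint.h>

/* Stand-in for a memory-mapped port register such as the one at 0x010B
   in the listing above. */
static volatile uint8_t fakePort = 0xFF;

/* Hypothetical helper: clear every bit that is set in 'mask' with a single
   read-modify-write, instead of one read-modify-write per pin. */
static inline void clearPortBits(volatile uint8_t *port, uint8_t mask)
{
    *port = (uint8_t)(*port & (uint8_t)~mask);
}
```

Calling clearPortBits(&fakePort, (1 << 7) | (1 << 6)) does the work of both sequences in the listing with one load, one AND, and one store. Inside an ISR, interrupts are normally already disabled on the AVR, so the save/cli/restore wrapper buys nothing there; code outside the ISR that touches the same port would still need it.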
Another thing you should think about is how to measure and/or test the performance. Something as simple as turning on the PIN13 LED during your ISR can give you a useful indication of how much time is being spent there. And you can theoretically check how long there is till the next Interrupt before returning from the ISR as well. LEDs can replace your SSRs, and any periodic (or even one-shot) interrupt can replace your zero-crossing detector, for purposes of measurement and debugging.