Optimization of code for maximum speed, very specific project.

Are these number more stable than they look and it is just because the serial printing is so slow compared to the interrupts?

The latter, I think. You're essentially sampling a counter that runs at 16MHz every 2ms or so, by which time it ought to have incremented by about 32000, which is pretty much the full range of the integer in question...

tickCounter += TIMER_CYCLE_COUNT;

You are chasing a wrong direction here, I think. You should want to redefined a "tick" so that tick = CycleTime/BrightnessLevels (or, for 8.3ms and 256 brightness levels, about 32 us.) That means that the clock always ticks by 1 in your ISR. The actual clock or instruction rate of the cpu ought to be irrelevant. Sure, at some level it's the limiting factor, but you want to have your code running in a way that it's so much faster than the problem, that there's no actual dependency...

Looking at the code generated when tick and powerDelay are both uint8_t, it looks to be about 480 instructions. 32 us is about 512 instructions. Now, every ISR execution doesn't execute all the instructions, and all the instructions are not one cycle, so it's hard to tell whether this would actually be fast enough. It does indicate (to me) a couple of things:

  1. It would be pretty close; certainly close enough to worry about.
  2. wider math (32bit tick and powerDelay) probably isn't going to work at all. (1100 instructions! (odd that it isn't more.))
  3. going to, say, 100 ticks/halfcycle instead of 256 ticks/halfcycle, would probably give you all the breathing room you need.