@grumpy_mike:
You're right that the frequency can be lower. But don't forget that old-style CRT TVs had a phosphor that glowed for longer than the beam illuminated it. So maybe the duty cycle was quite large. In fact, given that when the TV is off you can STILL see the phosphor glowing a little bit, the duty cycle on a CRT TV is very analog: it's never completely off. I think that this would make a big difference. From my experiments, as the duty cycle goes down, the Hz must go up, because it's a lot easier to perceive blinking at 10% duty cycle than at 90%.
BTW, I heard that 200 number as a minimum from an architect specializing in LED lighting... so it's just hearsay.
WRT cycle counting... nope, I didn't count them. It's pretty hard nowadays with pipelining and all... and it's boring. But because you spend so many hours helping us all out, I'm going to reciprocate and give a little rough counting a try!
Here's digitalWrite:
void digitalWrite(uint8_t pin, uint8_t val) // Fn call + 2 vars = 3
{
    uint8_t timer = digitalPinToTimer(pin); // macro so only push,add, and mem ref = 3
    uint8_t bit = digitalPinToBitMask(pin); // = 3
    uint8_t port = digitalPinToPort(pin); // = 3
    volatile uint8_t *out; // push = 1
    if (port == NOT_A_PIN) return; // test and jump = 2
    // If the pin that support PWM output, we need to turn it off
    // before doing a digital write.
    if (timer != NOT_ON_TIMER) turnOffPWM(timer); // test not taken = 2
    out = portOutputRegister(port); // add and assign = 2
    if (val == LOW) *out &= ~bit; // if, deref, read, not, and, assign = 6
    else *out |= bit;
} // return: pop, jump = 2
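So adding all of this up we get a count of about 26 (strictly, the per-line numbers above total 27, but at this level of roughness the difference doesn't matter).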
My guess is that a single clock and data cycle is CLK_HIGH, DATA high or low, CLK_LOW, i.e. three digitalWrite calls. So that is a total of 26*3 = 78 counts. I feel that this is very conservative because I am assuming that conditions, jumps, memory access, etc. are all one cycle AND because I didn't dig into all those macros that carefully and gave the compiler the benefit of the doubt. For example, there is some data type casting in there which could result in an unnecessary copy.
So if the OP uses register access the cycle count is 3 instead of 78 to clock one bit in (Actually if the clock is in the same register as the data, you could reduce the count to 2).
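Here is a minimal sketch of what I mean by register access with clock and data in the same register. It is written so it also compiles on a desktop: `fake_port` is a hypothetical stand-in for an AVR output register such as PORTB, and the `DATA_BIT` and `CLK_BIT` positions are assumptions, not the OP's wiring. Each bit costs two register writes: data with the clock low, then the same data with the clock high.

```c
#include <stdint.h>

/* Hypothetical stand-in for an AVR output register such as PORTB,
   so the sketch runs on a desktop compiler. */
static volatile uint8_t fake_port = 0;
static unsigned write_count = 0;   /* register writes performed so far */

#define DATA_BIT 0   /* assumed pin positions, not the OP's wiring */
#define CLK_BIT  1

/* Note: this overwrites the whole port. A real sketch would have to
   preserve whatever else is wired to the other six pins. */
static void port_write(uint8_t v) { fake_port = v; write_count++; }

/* Clock one bit out with two register writes. */
static void clock_bit(uint8_t bit)
{
    uint8_t d = (uint8_t)((bit & 1u) << DATA_BIT);
    port_write(d);                               /* data valid, clock low   */
    port_write((uint8_t)(d | (1u << CLK_BIT)));  /* rising edge, data held  */
}

/* Shift a whole byte MSB first; returns the number of register writes. */
static unsigned shift_byte(uint8_t value)
{
    write_count = 0;
    for (uint8_t mask = 0x80; mask != 0; mask >>= 1)
        clock_bit((value & mask) ? 1 : 0);
    return write_count;
}
```

Calling `shift_byte(0xA5)` performs 16 register writes for 8 bits, which is where the "2 per bit" figure comes from.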
Now, if the OP switches from chained 595s to parallel then he's clocking in 8 of these at a time. So instead of 78*8 = 624 counts he is doing 3.
Let's say the OP uses all 16 digital outputs. 15 for the data and 1 for the clock. So instead of 78*15 or 1170 counts, he is doing 4.
Now let's unroll the final outer loop. So I'm guessing it looks something like:
for (i=0;i<NUMSHIFTS;i++)
{
    ClockItOut();
}
So that loop itself does a test, add, and jump (say count 3), and NUMSHIFTS for a 64 by 16 matrix is 64 + 16 = 80. So that's another 3*80 = 240 counts. Adding that to 80 bits * 78 counts = 6240 gives 6480 clocks.
Now that 64 x 16 matrix is 8 + 2 chips (8 chips for the 64 side, 2 for the 16 side). So we can simultaneously clock 10 chips. And we need to do that 8 times, once per bit in each chip. So that would look something like:
CLK_HIGH and write 2 registers = 2 counts
CLK_LOW = 1 count
(and cut and paste that 8 times)
So you get a total of 24 counts.
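To make that concrete, here is a rough desktop simulation of the parallel scheme. Everything in it is an assumption for illustration: `port_a` and `port_b` stand in for two AVR output registers, chips 0-7 hang off port A bits 0-7, chips 8-9 off port B bits 0-1, and the shared clock sits on port B bit 2. The `for` loop stands in for the 8 pasted copies, and on real hardware the per-step port values would be precomputed rather than sliced out of the bytes on the fly.

```c
#include <stdint.h>

#define NUM_CHIPS 10
#define CLK_MASK  0x04u            /* assumed: clock on bit 2 of port B */

/* Hypothetical stand-ins for two AVR output registers. */
static uint8_t port_a, port_b;
static unsigned writes;            /* register writes so far */
static uint8_t chip[NUM_CHIPS];    /* simulated 8-bit shift registers */

static void write_a(uint8_t v) { port_a = v; writes++; }

/* On a rising clock edge every simulated chip shifts in its data line. */
static void write_b(uint8_t v)
{
    int rising = !(port_b & CLK_MASK) && (v & CLK_MASK);
    port_b = v;
    writes++;
    if (!rising)
        return;
    for (int c = 0; c < NUM_CHIPS; c++) {
        uint8_t line = (c < 8) ? (uint8_t)((port_a >> c) & 1u)
                               : (uint8_t)((port_b >> (c - 8)) & 1u);
        chip[c] = (uint8_t)((chip[c] << 1) | line);
    }
}

/* Shift one byte into each of the 10 chips, MSB first.
   Returns the number of register writes used. */
static unsigned shift_all(const uint8_t data[NUM_CHIPS])
{
    writes = 0;
    for (int bit = 7; bit >= 0; bit--) {
        uint8_t a = 0, b = 0;
        for (int c = 0; c < 8; c++)
            a |= (uint8_t)(((data[c] >> bit) & 1u) << c);
        for (int c = 8; c < NUM_CHIPS; c++)
            b |= (uint8_t)(((data[c] >> bit) & 1u) << (c - 8));
        write_a(a);                        /* 8 data lines             */
        write_b((uint8_t)(b | CLK_MASK));  /* 2 data lines + CLK_HIGH  */
        write_b(b);                        /* CLK_LOW                  */
    }
    return writes;
}
```

One caveat with this layout: the two data lines on port B change in the same write that raises the clock, so a real design would need to check that against the shift register's data setup time, or put the clock on its own write.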
So the back of the envelope calculation shows a speed up of 270 times.
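For anyone who wants to check my arithmetic, here it is as a few lines of C. The constants are just the rough counts from this post, none of them measured, and the names are made up for this sketch.

```c
/* Rough counts from the estimate above; none of these are measured. */
enum {
    COUNT_PER_DIGITALWRITE = 26,  /* rough tally for one digitalWrite  */
    WRITES_PER_BIT         = 3,   /* CLK_HIGH, DATA, CLK_LOW           */
    LOOP_OVERHEAD          = 3,   /* test, add, jump per loop pass     */
    NUM_SHIFTS             = 80,  /* 64 + 16 bits                      */
    PARALLEL_STEP          = 3,   /* 2 register writes + CLK_LOW       */
    BITS_PER_CHIP          = 8
};

/* Serial version: 80 loop passes, each 3 digitalWrites plus loop overhead. */
static int serial_count(void)
{
    return NUM_SHIFTS * (COUNT_PER_DIGITALWRITE * WRITES_PER_BIT + LOOP_OVERHEAD);
}

/* Parallel, unrolled version: 8 steps of 3 register writes. */
static int parallel_count(void)
{
    return PARALLEL_STEP * BITS_PER_CHIP;
}

static int speedup(void)
{
    return serial_count() / parallel_count();
}
```

With these numbers `serial_count()` is 6480, `parallel_count()` is 24, and `speedup()` is 270.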
Sure, that's on the low side of my estimate. But we haven't even really dug into the ugliness... for example, a quick look at shiftOut() shows this gem within the 0-7 for loop:
if (bitOrder == LSBFIRST)
    digitalWrite(dataPin, !!(val & (1 << i)));
else
    digitalWrite(dataPin, !!(val & (1 << (7 - i))));
Now, I don't know about the AVR, but back when I was counting cycles a decade ago, lots of uCs handled the bit shift operation at 1 bit shifted per clock (so 1<<7 takes 7 clocks). Ergo, this if statement and all that val munging add a LOT more cycles than my estimate.
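If a variable shift really does cost one step per bit position, then computing 1 << i for i = 0..7 is 0+1+...+7 = 28 single-bit shifts per byte. One way around it, sketched here as an assumption and not as what the Arduino core does, is to keep a mask and move it one position per pass, and to decide the bit order once outside the loop. `emit_bit` is a hypothetical placeholder for whatever drives the data pin; here it just records the bits so the sketch runs on a desktop.

```c
#include <stdint.h>

static uint8_t seen[8];    /* bits in the order they were emitted */
static int seen_len = 0;

/* Hypothetical placeholder for the actual pin write. */
static void emit_bit(uint8_t bit) { seen[seen_len++] = bit; }

/* MSB-first output: the mask moves one position per pass instead of
   being rebuilt with 1 << (7 - i) every time. */
static void shift_msb_first(uint8_t val)
{
    seen_len = 0;
    for (uint8_t mask = 0x80; mask != 0; mask >>= 1)
        emit_bit((val & mask) != 0);
}

/* LSB-first output: same idea, mask walks upward and falls off the
   top of the 8-bit variable after the eighth pass. */
static void shift_lsb_first(uint8_t val)
{
    seen_len = 0;
    for (uint8_t mask = 0x01; mask != 0; mask <<= 1)
        emit_bit((val & mask) != 0);
}
```

That is 8 single-bit shifts per byte either way, and the bitOrder test is paid once per byte instead of once per bit.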
But, you know, I didn't think all of this thru before posting. It was simply apparent by comparing what the OP said his matrix was doing with what I'm getting out of my M5451 library.