More speed from ATMEGA328 Internal Clock

I have a board that is designed and produced. It was designed to use an ATmega328 with the internal clock. There is no space on the board to place an external clock. After completing the boards and testing them, I have found that 8 MHz is too slow to complete my main loop actions seamlessly. The main loop involves shifting data out to 20 registers. Is there any way to speed this up without going back to the drawing board on the PCB design? Can another AVR with a faster internal clock be substituted as a drop-in replacement? Can the internal clock be made to run faster, or doubled to achieve 16 MHz? Or can code optimization double the speed of my display loop?

The application is a matrix display that has 50 columns and 7 rows. The registers are the display buffers and represent the current state of the display. Because of the multiplexing in the matrix, only 1/5 of the pixels are illuminated at any given time. To trick the eye into seeing them all lit at the same time, the refresh needs to be faster. The display flickers with the internal clock. If I connect an UNO board instead of using the on-board ATMEGA328 (QFP), it works well. So the difference between 8 MHz and 16 MHz is visible. If I can double the speed of the display loop, then that is a solution. Below is the loop, for reference; latchPin = 8, clockPin = 12, dataPin = 11.

Is there any way to speed this up without going back to the drawing board on the PCB design?

You can't speed up the clock, but you might be able to rewrite the code for more speed.

I see a lot of shiftOut(). You could speed that up A LOT by using the hardware SPI.

And you could reduce your code A LOT by using arrays (more) :wink:

I looked into hardware SPI, but I am afraid my PCB has the wrong pins utilized for it? My registers are on latchPin = 8, clockPin = 12, dataPin = 11. Hardware SPI on the Atmega328 is MOSI = 11, MISO = 12, SCK = 13. Can this be made to work?

Nope (or maybe), that's why you prototype :wink: shiftOut() is all in software and just not fast.

And the maybe is, make wire links to fix your error :wink:

Yeah, that's what I thought. The initial prototype worked, but it was using an Uno as the controller and therefore had a 16 MHz clock. Now we have a production run of boards that are the way they are. Too many to manually correct with 'wires'. I am hoping software optimization might swoop in and save the day. Any ways to make the software faster would be helpful. If I can eliminate half the instructions, or double the speed another way, I should be good to go.

Yeah, that's kind of stupid... Make a prototype but change it without testing for the product...

And bodge wires were very normal up until like 10 years ago, even in big production runs :wink:

But without the use of arrays, changing the code alone is already a big task, damn. And without the full code, even impossible.

The only small improvement I can think of is making a port manipulation based shiftOut().

The atmega328 has provisions for calibrating the 8 MHz internal oscillator and typically can be tuned over a rather large range. Figure 1-1 in the application note linked below suggests it can be tuned to something like 14 MHz at the high end. This, of course, will have to be accounted for in the setup for any peripherals you might be using.

Atmel-2555-Internal-RC-Oscillator-Calibration-for-tinyAVR-and-megaAVR-Devices_ApplicationNote_AVR053.pdf
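
For anyone who wants to try this, here is a minimal sketch of nudging the calibration register from run-time code. The helper name stepOsccalTo and the example values are made up for illustration, and the real target value would have to be found by measuring the resulting frequency on each chip. Stepping one count at a time is a cautious assumption on my part, not something quoted from the app note. The host stand-in for OSCCAL is only there so the logic can be checked off-target.

```cpp
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>                     // real OSCCAL register on the ATmega328
#else
static volatile uint8_t OSCCAL = 0x9C;  // host stand-in; the factory value differs per chip
#endif

// Walk OSCCAL toward a target one count at a time instead of jumping.
// Assumption: start and target lie in the same calibration range (same bit 7),
// because the two ranges overlap and crossing between them is not monotonic.
static void stepOsccalTo(uint8_t target) {
  while (OSCCAL != target) {
    if (OSCCAL < target) {
      OSCCAL = (uint8_t)(OSCCAL + 1);
    } else {
      OSCCAL = (uint8_t)(OSCCAL - 1);
    }
  }
}
```

It would be called once early in setup(), for example stepOsccalTo(0xF0), where 0xF0 is a placeholder and not a recommended value.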

septillion:
But without the use of arrays, changing the code alone is already a big task, damn. And without the full code, even impossible.

The only small improvement I can think of is making a port manipulation based shiftOut().

How do you mean without the full code? I have the full code. I can make any changes necessary.... How slow is pin lookup anyway?

Yeah, but we don't...

MrMark:
The atmega328 has provisions for calibrating the 8 MHz internal oscillator and typically can be tuned over a rather large range. Figure 1-1 in the application note linked below suggests it can be tuned to something like 14 MHz at the high end. This, of course, will have to be accounted for in the setup for any peripherals you might be using.

Atmel-2555-Internal-RC-Oscillator-Calibration-for-tinyAVR-and-megaAVR-Devices_ApplicationNote_AVR053.pdf

Hmm this is interesting, when you say accounted for in peripherals. The shift registers are slave devices and get the clock from the ATMEGA. The only other peripheral I need is UART at 9600 baud. Can that still work with a high OSCCAL value?

septillion:
Yeah, but we don't...

I am attaching it here.

DisplayCode.ino (11.8 KB)

twing207:
Hmm this is interesting, when you say accounted for in peripherals. The shift registers are slave devices and get the clock from the ATMEGA. The only other peripheral I need is UART at 9600 baud. Can that still work with a high OSCCAL value?

As I understand it, Arduino has a compile-time parameter (F_CPU) that the peripheral libraries use to derive parameters for peripheral setup. I've not used this myself, so perhaps someone with the relevant experience will comment on the procedure and limitations.

Typically this would be used to configure a custom board with a non-standard oscillator and the bootloader would have to be corrected as well. I think the model here is that you are still at a nominal 8 MHz for the internal oscillator in the bootloader, and the calibration register would be modified in the run time code, so the bootloader is ok without modification. This presupposes that peripheral (re)configuration happens as part of the run time code which may not be the case.
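
To make the UART concern concrete, here is a rough sketch of the baud-rate arithmetic. It uses the normal-speed divisor formula UBRR = f / (16 * baud) - 1; the Arduino core may use the double-speed variant instead, so treat this as an illustration of the effect, not the core's exact code. The function names are made up, and 14 MHz is only the figure read off the app note above, not a measured clock.

```cpp
#include <stdint.h>

// Normal-speed (U2X = 0) divisor, rounded to the nearest integer:
// UBRR = f / (16 * baud) - 1
static uint16_t ubrrFor(uint32_t clockHz, uint32_t baud) {
  return (uint16_t)((clockHz + 8UL * baud) / (16UL * baud) - 1UL);
}

// Baud rate the UART actually produces for a given real clock and divisor
// (integer division, so the result is truncated).
static uint32_t actualBaud(uint32_t clockHz, uint16_t ubrr) {
  return clockHz / (16UL * ((uint32_t)ubrr + 1UL));
}
```

With a divisor computed for 8 MHz (ubrrFor(8000000, 9600) gives 51), a chip really running at 14 MHz would talk at about 16826 baud instead of 9600. Computing the divisor for the real clock instead (ubrrFor(14000000, 9600) gives 90) brings it back to about 9615 baud, which is the same small error as at a true 8 MHz.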

I am not sure how good the optimization in Arduino is or what code you are using for shiftOut. But if it is the same as mine:

void shiftOut(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, uint8_t val){
	uint8_t i;

	for (i = 0; i < 8; i++)  {
		if (bitOrder == LSBFIRST)
			digitalWrite(dataPin, !!(val & (1 << i)));
		else	
			digitalWrite(dataPin, !!(val & (1 << (7 - i))));
			
		digitalWrite(clockPin, HIGH);
		digitalWrite(clockPin, LOW);		
	}
}

and if it is the part slowing your code most you should be able to make it much (10 times?) faster. I would remove the if statement and toggle pins by writing to registers. Also I am not sure if the compiler is able to optimize the "val&(1<<(7-i))" statement.
It MUST be possible to shift out 1 bit in 20 CK, that is 160CK per shift register, 3200 to shift out everything. If you sacrifice 50% of CPU time for shifting you will get 1 kHz refresh rate.

EDIT: 10CK per bit is probably too optimistic, fixed to realistic values.
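
A minimal sketch of what "remove the if and write to registers" could look like, assuming the pins from the first post and the usual Uno/ATmega328 mapping (D8 = PB0, D11 = PB3, D12 = PB4). The names fastShiftOut and sendFrame are made up, the bit order is fixed to MSB first, and I have not timed this on hardware. The host stand-in for PORTB only exists so the logic can be checked without a board.

```cpp
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>             // real PORTB on the ATmega328
#else
static volatile uint8_t PORTB;  // host stand-in so the logic can be checked off-target
#endif

// Assumed wiring from the thread: data = D11 (PB3), clock = D12 (PB4), latch = D8 (PB0).
static const uint8_t DATA_MASK  = 1 << 3;
static const uint8_t CLOCK_MASK = 1 << 4;
static const uint8_t LATCH_MASK = 1 << 0;

// Shift one byte out MSB first with direct port writes instead of digitalWrite().
static inline void fastShiftOut(uint8_t val) {
  for (uint8_t i = 0; i < 8; i++) {
    if (val & 0x80) {
      PORTB |= DATA_MASK;             // present the current top bit
    } else {
      PORTB &= (uint8_t)~DATA_MASK;
    }
    PORTB |= CLOCK_MASK;              // rising edge clocks the bit into the register
    PORTB &= (uint8_t)~CLOCK_MASK;
    val <<= 1;                        // shift as we go, no (1 << (7 - i)) per bit
  }
}

// Send a whole frame, then pulse the latch once at the end.
static void sendFrame(const uint8_t *buf, uint8_t len) {
  for (uint8_t i = 0; i < len; i++) {
    fastShiftOut(buf[i]);
  }
  PORTB |= LATCH_MASK;
  PORTB &= (uint8_t)~LATCH_MASK;
}
```

Because the masks are compile-time constants, avr-gcc can typically turn each port write into a single SBI or CBI instruction, which is where most of the gain over digitalWrite() would come from. The pins still need to be set as outputs with pinMode() in setup().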

Yeah, that's what I said.

And now that you mention the "val&(1<<(7-i))" statement: be sure to use LSB first. That's way faster because you can shift as you go. Just flip the logic/variables you use to match. That would speed things up a bit. Not SPI levels, but faster :slight_smile:

"I looked into hardware SPI, but I am afraid my PCB has the wrong pins utilized for it? My registers are on latchPin = 8, clockPin = 12, dataPin = 11. Hardware SPI on the Atmega328 is MOSI = 11, MISO = 12, SCK = 13. "

Pity, another waste of the fast internal hardware.
Did you know that this takes just 17 clocks?

SPDR = dataArray[x]; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; // wait out the transfer; assumes #define nop asm volatile("nop")

That's just over 1 µs at 16 MHz, just over 2 µs at 8 MHz.

Forgive my ignorance, but what exactly is happening here in 17 clocks?

Waiting for SPI hardware to finish. Nothing you could use.

shiftOut() is pretty sucky code; there is lots of room for improvement.
Here's one discussion
If your clock and data pins are constants, you can speed it up even more.

Smajdalf:
It MUST be possible to shift out 1 bit in 20 CK, that is 160CK per shift register, 3200 to shift out everything. If you sacrifice 50% of CPU time for shifting you will get 1 kHz refresh rate.

EDIT: 10CK per bit is probably too optimistic, fixed to realistic values.

In fact it is possible to send 1 bit in 7 CK, something like:
SBRC rx, 7
SBI PORTB, DATAPIN
SBRS rx, 7
CBI PORTB, DATAPIN
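OUT PINB, rmask ; rmask is a register holding (1 << CLOCKPIN); writing a 1 to a PINB bit toggles that pin
OUT PINB, rmask ; toggle the clock back
It is not 100% correct syntax, but I hope the idea is clear. I doubt C is able to generate such code; inline assembly would be needed to reach this speed.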
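
To put the per-bit figures from this thread side by side, here is a small sketch of the arithmetic, assuming the 20 registers and 8 MHz clock from the first post. The function names are made up, and loop and latch overhead are ignored, so real numbers would be somewhat worse.

```cpp
#include <stdint.h>

// Figures from the thread: 20 shift registers of 8 bits each, 8 MHz internal clock.
static const uint32_t CLOCK_HZ  = 8000000UL;
static const uint32_t REGISTERS = 20;

// CPU clocks needed to shift one full frame at a given cost per bit.
static uint32_t frameClocks(uint32_t clocksPerBit) {
  return clocksPerBit * 8UL * REGISTERS;
}

// The same figure in microseconds at CLOCK_HZ.
static uint32_t frameMicros(uint32_t clocksPerBit) {
  return frameClocks(clocksPerBit) / (CLOCK_HZ / 1000000UL);
}
```

At 20 CK per bit a frame costs 3200 clocks, or 400 µs; at 7 CK per bit it costs 1120 clocks, or 140 µs.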