How quickly can I change pin states on Arduino?

Hi,

I'm planning a project that needs to push quite a lot of data to shift registers, and I'm trying to work out what performance limits to expect from an Arduino so I can plan it properly. Instead of using digitalWrite(), which sets pins one at a time, my understanding is that I can write directly to the PORTA/B/C/D registers, which sets the state of all 8 pins of a port at once. If I understood correctly, a memory write on the ATmega takes 2 cycles, so at 16MHz that would mean I can change the state of 8 pins at 8MHz. Is that right?

Thanks, Jarkko

Well, you can write at that rate, but you also have to do something else, like fetch the data, and that will slow you down.
For transferring data to a shift register, look at using the SPI interface. I think that will run at 1MHz.

change the state of 8 pins at 8MHz, is that right?

Only if the writes are inline; if you have a loop you have to add the JMP time.

Also, I would assume you can use the IN and OUT instructions, which execute in 1 cycle.

But you are getting data from somewhere and having to shift the bits etc, that will slow things down again.

Depending on the nature of the transfers you can organise the data in RAM such that all the clocking and data information is decoded into an array, then blat that array as fast as possible to a port.

Is there any reason for not using the SPI hardware?


Rob

SPI is quickest - 8 MHz clock, can send out a byte about every 1 uS if you do it right.
This test spat out 41 bytes in 46uS I think, 58uS for the loop, pulling data from an array to send out.

simpletestR2.ino (6.13 KB)

JarkkoL:
I'm thinking of implementing a project which would require quite a lot of data to be transferred to shift registers

Read this post about clocking out pixel data to a VGA monitor:

I don't think you can get much faster than that. Using SPI (effectively a shift register), it clocks out one bit every two clock cycles (i.e. every 125 ns); however, there is a one clock cycle gap between bytes.

In the 17 clock cycles that the hardware is clocking out that byte (8 bits) you just have time to load up the next byte from the array in memory where you have them waiting.

there is a one clock cycle gap between bytes.

I've heard that you can eliminate this by using "the USART in SPI mode" on those devices that have the more broadly "universal" USART.
(I think that includes the 328. It's definitely getting attention on the Xmega chips, which have many USARTs AND DMA; people over on avrfreaks are doing "interesting stuff.")

Interesting point. I don't believe my testing confirmed that, but I'm willing to try again. It was (strangely) as if the hardware needed one extra clock cycle to "recover" from sending that byte.

DMA would be another kettle of fish, of course. I don't recall seeing reference to it on the Atmega chip lines.

JarkkoL:
require quite a lot of data to be transferred to shift registers

How much data are you talking about here? Where is it coming from?

My limited experience with shift registers is as serial in / parallel out devices i.e. one shift register per pin. Are you writing to multiple shift registers simultaneously?

Did we lose the OP?

Looks like it.


Rob

Back to the original subject.

I can change the state of 8 pins at 8MHz

Theoretically. Note that the ATmega328 on the Uno only has one IO port (PORTD) with a full 8 pins, and those pins include the serial rx/tx pins.
If you want to maintain serial connectivity as well, you end up with several possibilities for writing 6 pins at 8MHz...

I suppose if you had:

A:
PORTD = (PORTD & B00000011) | B10101000; // low two bits of the constant stay clear so Rx/Tx are preserved
PORTD = (PORTD & B00000011) | B01010100;
goto A;

you could flip all but the Rx/Tx bits pretty quickly.
Hard to do much else, though. Anything like pulling the data from an array would slow it down:
// clear all but Rx/Tx, then set bits 7:2 from the array
PORTD = (PORTD & B00000011) | (dataArray[0] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[1] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[2] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[3] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[4] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[5] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[6] & B11111100); // set the 6 bits per whats in the array
PORTD = (PORTD & B00000011) | (dataArray[7] & B11111100); // set the 6 bits per whats in the array

Don't do this in a for loop; the counter update and branch add overhead on every pass.
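For reference, the mask-and-merge expression used above can be wrapped in a small helper. This is only a sketch: `mergePortD` and `kSerialMask` are made-up names, and it is written as plain C++ with no hardware access so the bit logic can be checked off-target. It assumes bits 0 and 1 of PORTD are the Rx/Tx pins, as on the Uno.

```cpp
#include <stdint.h>

// Assumption: PORTD bits 0 and 1 are the serial Rx/Tx pins (as on the Uno).
constexpr uint8_t kSerialMask = 0x03;

// Hypothetical helper: keep the Rx/Tx bits of `current` and take
// bits 7:2 from `data`. Pure bit logic, no register access.
constexpr uint8_t mergePortD(uint8_t current, uint8_t data) {
    return static_cast<uint8_t>((current & kSerialMask) | (data & ~kSerialMask));
}
```

On the board it would be used as `PORTD = mergePortD(PORTD, dataArray[i]);`. Because this reads the port, masks and writes it back, it takes more cycles than a bare write to the port.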

westfw:
Theoretically. Note that the ATmega328 on the Uno only has one IO port (PORTD) with a full 8 pins, and those pins include the serial rx/tx pins.
If you want to maintain serial connectivity as well, you end up with several possibilities for writing 6 pins at 8MHz...

In any case I don't see what this is to do with the shift registers the OP mentioned.

Thanks for the replies and sorry for the delay!

Graynomad:
Is there any reason for not using the SPI hardware?

I didn't know about SPI, so thanks for pointing it out. If I got it right, there's a single SPI on the Uno, and it's able to push data at 8Mbits/s. I believe the transfer() call to SPI is asynchronous, so that data loads and looping can be done in parallel on the CPU while SPI feeds the data to the pins?

PeterH:
How much data are you talking about here? Where is it coming from? My limited experience with shift registers is as serial in / parallel out devices i.e. one shift register per pin. Are you writing to multiple shift registers simultaneously?

I'm looking to control ~300 RGB LEDs, where each RGB channel takes 8-bit PWM data. This is a rotating LED cylinder with a radius of ~15cm and I'm looking for ~5mm resolution, so the cylindrical "display" size is 28200 pixels. I would like to have a refresh rate of 20fps, so that would require 28200 * 20 * 3 * 8 bits/sec = ~13.5Mbits/sec of data pushed from the main board. Of course the specs are adjustable to whatever is feasible to implement, but the higher the better. While I would like to push the data directly to shift registers, I think I need additional microcontrollers in between, because the cylinder probably needs to rotate faster than the actual update rate to maintain a steady image (e.g. 100fps).
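The bandwidth estimate is just pixels x frames per second x channels x bits per channel. A minimal sketch of that arithmetic follows; `requiredBitsPerSecond` and `kDisplayBitsPerSecond` are illustrative names, and the figures are the ones quoted in the post.

```cpp
#include <stdint.h>

// Raw bit rate for an uncompressed frame stream.
constexpr uint32_t requiredBitsPerSecond(uint32_t pixels, uint32_t framesPerSecond,
                                         uint32_t channels, uint32_t bitsPerChannel) {
    return pixels * framesPerSecond * channels * bitsPerChannel;
}

// Figures from the post: 28200 pixels, 20 fps, 3 channels (R, G, B), 8 bits each.
constexpr uint32_t kDisplayBitsPerSecond = requiredBitsPerSecond(28200, 20, 3, 8);
```

With those inputs the result is 13,536,000 bits/sec, i.e. roughly 13.5 Mbit/s.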

westfw:
Theoretically. Note that the ATmega328 on the Uno only has one IO port (PORTD) with a full 8 pins, and those pins include the serial rx/tx pins.

I checked, and it should take 8 cycles per byte to move data from an array to PORTD and clock it:

  1. read data from memory to a register (2 cycles)
  2. out the register to port D (1 cycle)
  3. out 1 to clock pin (1 cycle)
  4. out 0 to clock pin (1 cycle)
  5. increment data address counter (1 cycle)
  6. compare and loop (2 cycles)

So at 16MHz that should give 16Mbits/sec transfer speed. If I unroll the loop a few times, I could shave off a couple of cycles for a higher transfer rate.
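The cycle budget above can be turned into a throughput figure with a couple of lines of arithmetic. This is a rough sketch under the post's own assumptions (the per-step cycle counts as listed, 8 bits written per pass, 16MHz clock); the names are made up for illustration.

```cpp
#include <stdint.h>

// Cycle counts as listed in the post:
// load (2) + out data (1) + clock high (1) + clock low (1) + increment (1) + compare/loop (2)
constexpr uint32_t kCyclesPerByte = 2 + 1 + 1 + 1 + 1 + 2;

// Bits per second when `bitsPerIteration` bits go out every `cycles` CPU cycles.
constexpr uint64_t bitsPerSecond(uint32_t cpuHz, uint32_t bitsPerIteration, uint32_t cycles) {
    return static_cast<uint64_t>(cpuHz) * bitsPerIteration / cycles;
}

// 8 bits every 8 cycles at 16MHz.
constexpr uint64_t kEstimatedRate = bitsPerSecond(16000000UL, 8, kCyclesPerByte);
```

Changing the third argument shows how much each cycle saved by unrolling would be worth.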

Jarkko

I don't see what this is to do with the shift registers

Load the input of 8 (6) parallel shift registers with one write, fiddle the clock with the next write.
6x the throughput of single-bit-at-a-time...
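To make the idea concrete, here is a sketch of pre-decoding the data so that each port write carries one bit for every shift register. Everything about the wiring is an assumption for illustration: six registers, register k fed from PORTD bit k + 2 (so Rx/Tx on bits 0 and 1 stay clear), bits sent most significant first, and the shift clock on some other pin. The function names are hypothetical.

```cpp
#include <stdint.h>

// Move bit `bit` of `value` to position `pin` of the port byte.
constexpr uint8_t bitToPin(uint8_t value, int bit, int pin) {
    return static_cast<uint8_t>(((value >> bit) & 1u) << pin);
}

// Port byte that presents bit `bit` of each of the six data bytes at once.
// Assumed wiring: register k (0..5) is fed from PORTD bit k + 2.
constexpr uint8_t packPortByte(const uint8_t *regs, int bit) {
    return static_cast<uint8_t>(
        bitToPin(regs[0], bit, 2) | bitToPin(regs[1], bit, 3) |
        bitToPin(regs[2], bit, 4) | bitToPin(regs[3], bit, 5) |
        bitToPin(regs[4], bit, 6) | bitToPin(regs[5], bit, 7));
}

// Pre-decode six data bytes into eight port bytes, most significant bit first.
inline void packFrame(const uint8_t *regs, uint8_t *frame) {
    for (int i = 0; i < 8; ++i) {
        frame[i] = packPortByte(regs, 7 - i);
    }
}
```

The packed array can then be written to the port one byte at a time, with a clock pulse after each write, so the transposition cost is paid before the time-critical output loop rather than inside it.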

In my post above about sending data to VGA I managed to get a byte out in 6 cycles:

while (i--)
    PORTD = * messagePtr++;

Generated code:

  while (i--)
    PORTD = * messagePtr++;
(2) 194:	89 91       	ld	r24, Y+
(1) 196:	8b b9       	out	0x0b, r24	; 11
(1) 198:	91 50       	subi	r25, 0x01	; 1
(2) 19a:	e0 f7       	brcc	.-8      	; 0x194 

-------
6 cycles in loop = 375 nS

If you unrolled the loop I suppose you could get a byte out in 3 cycles (you wouldn't need to subtract 1 from i, nor do a branch).

If you unrolled the loop and put the data in immediate mode, you could get two cycles:

  PORTD = 0x24;
  PORTD = 0x35;
...

(1)  ldi   r24, 0x24
(1)  out  0x0b, r24
(1)  ldi   r24, 0x35
(1)  out  0x0b, r24
...

Nice to see gcc optimizes the loop so well and there's no need for inline asm. I didn't know there was an instruction that does both the load and the post-increment. For shift registers you need to add two out instructions to clock the data into the register, so it comes to 8 cycles. However, since I might need to route the data through several microcontrollers which redirect it to shift registers, maybe it's possible to optimize the clocking (e.g. pass data on both the rising and falling clock edges).

TanHadron:
If you unrolled the loop and put the data in immediate mode, you could get two cycles:

I need this data to be read from memory because it's supposed to be streamed in via USB or something.

JarkkoL:
Nice to see gcc optimizes the loop so well and there's no need for inline asm. I didn't know there was an instruction that does both the load and the post-increment.

This is one of the reasons I recommend against using asm unless you absolutely have to (which is practically never).

The compiler generates good code, and unless you are very, very familiar with the underlying hardware (as the compiler-writers happen to be) you may choose sub-optimal ways of solving the problem.

By all means disassemble the output and see what is generated. That can give hints about ways of optimizing (for example) how you store data in arrays. But ultimately you practically never need to out-guess the compiler.