Due fast SPI with port manipulation

Hi, I am trying to improve the SPI speed of an Arduino Due (with a SAM3X8E MCU) by using direct port manipulation, similar to what the ATmega-based Arduinos allow:

SPDR = byteToSend;
while(!(SPSR & (1<<SPIF)));

to replace:

SPI.transfer((byte) byteToSend);

However, I cannot find any information on how to do this using the Arduino IDE 1.5.2. If anyone could illustrate how this would be done, or just point me towards some documentation, blog post, or tutorial, that would be awesome; perhaps I am just googling the wrong keywords.

I have looked through the SAM3X datasheet http://www.atmel.com/Images/doc11057.pdf (section 33) and found the relevant register names, but I am unable to access them in the IDE. Additionally, the datasheet alludes to SPI transfers of anywhere from 8 to 16 bits, which I would love to use (at 12 bits), but I have no idea how to go about this. Any info or just a relevant link would be greatly appreciated, thanks! Let me know if any additional, project-specific info would be useful and I’ll post it.
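As a rough sketch of the 12-bit idea: the datasheet describes a BITS field in the SPI chip select register (SPI_CSRx) that selects the word size. The helper names below (csrBitsField, pack12) are hypothetical, and the field layout (bits 7:4, value = word size minus 8) is taken from the datasheet description, so please check it against your header files before relying on it.

```cpp
#include <cstdint>

// Value for the BITS field of SPI_CSRx (assumed to sit in bits 7:4):
// 0 selects 8-bit transfers, 4 selects 12-bit, 8 selects 16-bit.
constexpr uint32_t csrBitsField(unsigned bitsPerTransfer) {
  return (bitsPerTransfer - 8u) << 4;
}

// Combine a 4-bit high part and an 8-bit low part into one 12-bit word,
// so a single 12-bit transfer could replace two 8-bit ones.
constexpr uint16_t pack12(uint8_t high4, uint8_t low8) {
  return static_cast<uint16_t>(((high4 & 0x0Fu) << 8) | low8);
}
```

On the Due the field value would be merged into the chip select register used by your slave, and the slave of course has to accept 12-bit frames.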

You can also speed up SPI by changing the clock divider. On a 16 MHz Uno, a divider of 16 gives 1 MHz, 8 gives 2 MHz, and 4 (the default there) gives 4 MHz. The command is, for example, "SPI.setClockDivider(SPI_CLOCK_DIV8);" to set the SPI clock to 2 MHz.

Thanks for the response, let me clarify a little further. The Due has an 84 MHz system clock, and I am using a divisor of 21 to achieve 4 MHz. Unfortunately this is the limit of my slave device's bit rate, so I cannot turn it up any higher.

When I measure the duration of a single byte transfer using a scope I get ~8 us per byte, i.e. a 125 kHz byte rate or a 1 MHz bit rate. I assume the loss in speed is related to the SPI library calls involved in SPI.transfer, and I would therefore like to eliminate its use.
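The arithmetic behind those numbers can be written down as a small sketch. The function names are made up for illustration, and integer nanoseconds are used to keep the results exact.

```cpp
#include <cstdint>

// SPI clock obtained by dividing the master clock (84 MHz on the Due).
constexpr uint32_t spiClockHz(uint32_t mckHz, uint32_t divider) {
  return mckHz / divider;
}

// Effective bit rate when `bits` bits are observed over `durationNs` nanoseconds.
constexpr uint64_t effectiveBitRate(uint32_t bits, uint64_t durationNs) {
  return bits * 1000000000ULL / durationNs;
}
```

With a divisor of 21 the clock is 4 MHz, so one byte should ideally take 2 us; a measured 8 us per byte corresponds to an effective 1 Mbit/s.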

This is what the implementation of the transfer() method looks like for the SAM:

byte SPIClass::transfer(byte _pin, uint8_t _data, SPITransferMode _mode) {
    uint32_t ch = BOARD_PIN_TO_SPI_CHANNEL(_pin);
    // Reverse bit order
    if (bitOrder[ch] == LSBFIRST)
        _data = __REV(__RBIT(_data));
    uint32_t d = _data | SPI_PCS(ch);
    if (_mode == SPI_LAST)
        d |= SPI_TDR_LASTXFER;

    // SPI_Write(spi, _channel, _data);
    while ((spi->SPI_SR & SPI_SR_TDRE) == 0)
        ;
    spi->SPI_TDR = d;

    // return SPI_Read(spi);
    while ((spi->SPI_SR & SPI_SR_RDRF) == 0)
        ;
    d = spi->SPI_RDR;
    // Reverse bit order
    if (bitOrder[ch] == LSBFIRST)
        d = __REV(__RBIT(d));
    return d & 0xFF;
}

As you can see, the loss of speed is not in the library call. Show us the code you’re using; maybe we can find something there.
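For anyone who still wants to bypass the library, the core of transfer() reduces to two polling loops around the data registers. The following is only a sketch: FakeSpiRegs is a hypothetical stand-in so the code can run on a desktop, the status bit positions are assumed to match the SAM3X header (RDRF = bit 0, TDRE = bit 1), and rawTransfer is not a library function. On the Due the same template would presumably be called with the SPI0 register block and with the chip select bits that the library computes via SPI_PCS(ch).

```cpp
#include <cstdint>

// Status bits, assumed to mirror SPI_SR_RDRF and SPI_SR_TDRE on the SAM3X.
constexpr uint32_t kSrRdrf = 1u << 0;  // receive data register full
constexpr uint32_t kSrTdre = 1u << 1;  // transmit data register empty

// Hypothetical host-side stand-in for the SPI register block.
struct FakeSpiRegs {
  volatile uint32_t SPI_SR = kSrRdrf | kSrTdre;  // always reports "ready"
  volatile uint32_t SPI_TDR = 0;
  volatile uint32_t SPI_RDR = 0;
};

// Works with any register block that exposes SPI_SR, SPI_TDR and SPI_RDR.
// pcsBits is the pre-computed peripheral chip select field for SPI_TDR.
template <typename Regs>
uint16_t rawTransfer(Regs* spi, uint32_t pcsBits, uint16_t data) {
  while ((spi->SPI_SR & kSrTdre) == 0) {
    // wait until the transmit register can take a new word
  }
  spi->SPI_TDR = data | pcsBits;
  while ((spi->SPI_SR & kSrRdrf) == 0) {
    // wait until the word clocked in during this transfer is available
  }
  return static_cast<uint16_t>(spi->SPI_RDR & 0xFFFFu);
}
```

This skips the bit-order handling and the SPI_LAST flag, so it is not a drop-in replacement for the library method.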

The general performance problem with the SPI class is that it writes a byte and then waits for that write to finish before returning. This generally caps the maximum transfer rate you can get out of the chip, because a lot of time is wasted: instead of doing CPU-bound work (e.g. moving bytes in and out of memory) while the SPI hardware is busy shifting bits out, it just waits.

In my code, what I generally do is write - and then don't do anything regarding waiting until the -next- byte is ready to be written.

Also, the ARM chip on the Teensy 3 has a 4-item FIFO that is fantastic for getting ridiculously high data rates: I just need to keep that FIFO filled and it pushes data out at nearly 22 Mbps on a 48 MHz CPU clock. I'm a little disappointed that the Due doesn't have any FIFO-based SPI output options, but I'm still trying to track down the SAM3X reference manual for the full low-level game.

(Also, Atmel's USARTs can run SPI as well as serial. I need to add hardware SPI support using the USARTs, and it appears that the SAM3X chips also have SPI-mode-capable USARTs; whether or not the Due exposes those is another question entirely...)

But your example replacement code for the 8-bit AVRs also waits doing nothing during the transfer...

dgarcia42: This generally ends up capping the maximum transfer rate you can get out the chip because of a lot of stupidly spent time, instead of doing cpu bound work (e.g. moving bytes in and out of memory) while the SPI hardware is busy writing bits to SPI it just waits.

If this is responsible for the degradation from 4 MHz to 1 MHz, your data preparation is too calculation-intensive and you might have to change other things. But you have the code now to play with if you think you can gain a lot of performance by setting the registers directly.

Sorry for the slow response, I have spent most of my time the past few days moving.

I made a rather dumb computational error in my previous post, which the following code and explanation will show:

portCData = REG_PIOC_PDSR;      // get all 32 bits representing port C pin status
if ( portCData & (1 << 15) ) {  // check if data valid pin is high
    
  // REG_PIOC_ODSR ^= 65536;  // toggle timing pin HIGH

  val1 = 0;              // the 8 least significant bits of the val stored here
  val2 = 0;              // the 4 most significant bits of the val + 0000 

  if ( portCData & (1 << 1) ) val1 |= 1;            // if pin 33 is high set bit 0 of val1 
  if ( portCData & (1 << 2) ) val1 |= ( 1 << 1 );   // if pin 34 is high set bit 1 of val1 
  if ( portCData & (1 << 3) ) val1 |= ( 1 << 2 );   // etc
  if ( portCData & (1 << 4) ) val1 |= ( 1 << 3 );   // etc
  if ( portCData & (1 << 5) ) val1 |= ( 1 << 4 );
  if ( portCData & (1 << 6) ) val1 |= ( 1 << 5 );
  if ( portCData & (1 << 7) ) val1 |= ( 1 << 6 );
  if ( portCData & (1 << 8) ) val1 |= ( 1 << 7 );
  if ( portCData & (1 << 9) ) val2 |= 1;            // if pin 41 is high set bit 0 of val2
  if ( portCData & (1 << 12) ) val2 |= ( 1 << 1 );  // if pin 51 is high set bit 1 of val2
  if ( portCData & (1 << 13) ) val2 |= ( 1 << 2 );  // etc
  if ( portCData & (1 << 14) ) val2 |= ( 1 << 3 );  // etc
    
  SPI.transfer(SS, val2, SPI_CONTINUE);  // transfer most significant bits
  SPI.transfer(SS, val1, SPI_CONTINUE);  // transfer least significant bits
    
  // REG_PIOC_ODSR ^= 65536;  // toggle timing pin Low
}

What is happening here is that I am reading a 12-bit ADC that is connected to digital pins on port C (off of an interrupt) and then transferring the value out via SPI.
By toggling what I have marked as the “timing pin” and watching it with an oscilloscope I can ascertain roughly how long things are taking. Toggling the pin high and immediately low via this method takes ~100ns.

When I do this I find:
the whole function takes ~9.3us
reading the Ports and filling my vals takes 3us
the SPI transfers take the remaining 6.3us

The bit rate for 2 bytes over 6.3us is => 16/(6.3 × 10^-6) ~= 2.5MHz, which is not so far off from 4MHz.
However when one calculates the rate for the whole operation (2 bytes in 9.3us) it comes out => 16/(9.3 × 10^-6) ~= 1.7MHz.
As dgarcia42 points out I could recover that 0.8MHz by loading data in while transferring, but I don’t see what accounts for the remaining 4 - 2.5 = 1.5MHz of speed that is not being utilized.
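One way to look at the unused speed is as per-transfer overhead rather than as a lower clock. The sketch below (hypothetical helper names, integer nanoseconds) compares the measured 6.3 us against the ideal time for 16 bits at 4 MHz and converts the difference into CPU cycles at 84 MHz.

```cpp
#include <cstdint>

// Time the bits alone would need on the wire at the given SPI clock.
constexpr uint64_t idealTransferNs(uint32_t bits, uint32_t sckHz) {
  return bits * 1000000000ULL / sckHz;
}

// CPU cycles spent on everything other than shifting bits.
constexpr uint64_t overheadCycles(uint64_t measuredNs, uint64_t idealNs, uint32_t mckHz) {
  return (measuredNs - idealNs) * mckHz / 1000000000ULL;
}
```

Sixteen bits at 4 MHz need 4 us, so roughly 2.3 us, or on the order of 190 CPU cycles for the two calls, goes to gaps around and between the bytes. That would be consistent with per-call overhead plus inter-byte delay, though the scope trace of SCK is the only way to confirm where it sits.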

After reading through some other threads, I think I am going to attempt to program the SAM3X directly using Atmel Studio. As I have no experience with doing that, any advice would be appreciated. Thanks again for any info.

Your complex calculation:

  if ( portCData & (1 << 1) ) val1 |= 1;            // if pin 33 is high set bit 0 of val1 
  if ( portCData & (1 << 2) ) val1 |= ( 1 << 1 );   // if pin 34 is high set bit 1 of val1 
  if ( portCData & (1 << 3) ) val1 |= ( 1 << 2 );   // etc
  if ( portCData & (1 << 4) ) val1 |= ( 1 << 3 );   // etc
  if ( portCData & (1 << 5) ) val1 |= ( 1 << 4 );
  if ( portCData & (1 << 6) ) val1 |= ( 1 << 5 );
  if ( portCData & (1 << 7) ) val1 |= ( 1 << 6 );
  if ( portCData & (1 << 8) ) val1 |= ( 1 << 7 );
  if ( portCData & (1 << 9) ) val2 |= 1;            // if pin 41 is high set bit 0 of val2
  if ( portCData & (1 << 12) ) val2 |= ( 1 << 1 );  // if pin 51 is high set bit 1 of val2
  if ( portCData & (1 << 13) ) val2 |= ( 1 << 2 );  // etc
  if ( portCData & (1 << 14) ) val2 |= ( 1 << 3 );  // etc

can be much simplified:

val1 = (portCData >> 1) & 0xFF;
val2 = ((portCData >> 9) & 0x01) | ((portCData >> 11) & 0x0E);
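A quick host-side check that the shift-and-mask form matches the bit-by-bit form, assuming the intended mapping is port C bits 1-8 to val1 bits 0-7 and port C bits 9, 12, 13, 14 to val2 bits 0-3. The struct and function names are made up for this sketch.

```cpp
#include <cstdint>

struct AdcBytes {
  uint8_t val1;  // low 8 bits of the ADC value
  uint8_t val2;  // high 4 bits of the ADC value
};

// Shift-and-mask form from the post above.
AdcBytes unpackShift(uint32_t portCData) {
  AdcBytes out;
  out.val1 = static_cast<uint8_t>((portCData >> 1) & 0xFF);
  out.val2 = static_cast<uint8_t>(((portCData >> 9) & 0x01) | ((portCData >> 11) & 0x0E));
  return out;
}

// Bit-by-bit reference: source bit srcBits[i] of port C becomes bit i of the value.
AdcBytes unpackBitwise(uint32_t portCData) {
  static const uint8_t srcBits[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14};
  uint16_t value = 0;
  for (int i = 0; i < 12; ++i) {
    if (portCData & (1u << srcBits[i])) value |= static_cast<uint16_t>(1u << i);
  }
  return AdcBytes{static_cast<uint8_t>(value & 0xFF), static_cast<uint8_t>(value >> 8)};
}

// Compare both forms over every combination of the low 16 port bits.
bool formsAgree() {
  for (uint32_t port = 0; port < (1u << 16); ++port) {
    AdcBytes a = unpackShift(port);
    AdcBytes b = unpackBitwise(port);
    if (a.val1 != b.val1 || a.val2 != b.val2) return false;
  }
  return true;
}
```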

What does your scope show for the SPI pins? Is the SCK pin clocked at 4 MHz? How much delay do you have between the two bytes being sent?

pylon: But your example replacement code for the 8-bit AVRs also waits doing nothing during the transfer...

No, it's usually loading up the next byte and incrementing the pointer/counter that's used for it. It is a small amount of work, but it makes a significant difference when moving hundreds of bytes, especially if you are reordering bytes on the fly. Also, in the next version of the LED library, that 'gap' time is going to be used for things like on-the-fly HSV to RGB conversion and global brightness leveling. It's the difference between doing:

write -> wait -> check if end of loop -> jump back to beginning of loop -> load byte -> increment pointer -> write

and

write -> check if end of loop -> jump back to beginning of loop -> load byte -> increment pointer -> wait -> write

basically overlapping the time spent waiting and the loading/memory juggling.
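The second ordering can be sketched as a loop in which the wait sits directly before the write instead of after it. The port interface here (txReady, put) and RecordingPort are hypothetical, used only so the loop can run off-target; on real hardware they would wrap the status flag check and the data register write.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SPI port; records what would be sent.
struct RecordingPort {
  std::vector<uint8_t> sent;
  bool txReady() const { return true; }       // hardware: test the "ready" flag
  void put(uint8_t b) { sent.push_back(b); }  // hardware: write the data register
};

// Load (and optionally process) the next byte first, and only then wait for
// the previous byte to finish, so the memory work overlaps the shifting.
template <typename Port>
void sendOverlapped(Port& port, const uint8_t* buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    uint8_t next = buf[i];       // work done while the previous byte is on the wire
    while (!port.txReady()) {
      // wait as late as possible
    }
    port.put(next);
  }
  // The caller still has to wait for the last byte to complete
  // before releasing chip select.
}
```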

dgarcia42: This generally ends up capping the maximum transfer rate you can get out the chip because of a lot of stupidly spent time, instead of doing cpu bound work (e.g. moving bytes in and out of memory) while the SPI hardware is busy writing bits to SPI it just waits.

pylon: If this is responsible for the degradation from 4MHz to 1MHz your data preparation is too calculation intensive and you might have to change other things. But you have the code now to play with if you think you can gain a lot of performance by setting the registers directly.

Not all of it, but it hurts. I don't have the numbers handy but I have seen it impact timing up to 5-10% at higher spi clock rates (osc/2).

Also, on the Teensy 3, the SPI hardware introduces 1-3 clocks of delay between each byte, so each byte ends up taking 9-11 clock cycles, not 8. That is another chunk of time lost. I haven't gone back to see if the AVR's hardware SPI introduces a similar inter-byte delay. However, between the end of one byte being written and the next byte beginning you have the flag check (register read, bit test, jump, possible loop: at least 3-4 clocks in the best case, where you only go through that loop once). At osc/2 that means a few more clock-cycle spans where you aren't sending anything, so you aren't going to see a steady stream of clock and data unless you hand-time the code to the point where you write SPDR on the very clock where SPIF is set.

Hello, I am having trouble accessing the ports directly on the sam3x8c. I have tried using this:

REG_PIOB_ODSR ^= 0b00000000000000010000000000000000;

But did not have any success. Is there anything else that I need to define before this works?
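One thing that might be missing: according to the SAM3X datasheet, writes to PIO_ODSR only affect pins whose bit is set in the output write status register, which is enabled through PIO_OWER and is cleared after reset. The sketch below shows the sequence under that assumption. FakePio is a hypothetical host-side stand-in with the same field names as the PIO register block, and the helper names are made up; on the chip the functions would be called with PIOB.

```cpp
#include <cstdint>

// Hypothetical stand-in for a PIO controller's register block.
struct FakePio {
  volatile uint32_t PIO_PER = 0;   // PIO enable
  volatile uint32_t PIO_OER = 0;   // output enable
  volatile uint32_t PIO_OWER = 0;  // output write enable
  volatile uint32_t PIO_ODSR = 0;  // output data status
};

// Prepare the masked pins so that direct writes to PIO_ODSR take effect.
template <typename PioRegs>
void enableDirectWrite(PioRegs* pio, uint32_t mask) {
  pio->PIO_PER = mask;   // give the PIO controller control of the pins
  pio->PIO_OER = mask;   // drive them as outputs
  pio->PIO_OWER = mask;  // allow PIO_ODSR writes to reach them
}

// Toggle the masked pins with a read-modify-write of PIO_ODSR.
template <typename PioRegs>
void togglePins(PioRegs* pio, uint32_t mask) {
  pio->PIO_ODSR = pio->PIO_ODSR ^ mask;
}
```

If the pin still does not move, it is worth checking that the pin is not assigned to a peripheral and that the mask matches the port bit you expect.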

Regards, Owais.