Have I made this hardware SPI transfer as fast as possible?

Hey guys,

I’m transferring a LOT of data out to some TLC5947 LED drivers, and it’s interfering with my sound output because my DAC is on the same bus and I can’t transfer two bytes to the DAC in the middle of the LED data transfer. It’s too late to modify my hardware, so I’m trying to minimize the issue with software tricks.

I don’t know exactly how long this transfer should take, but my first benchmarked attempt gave me microsecond timings of around 460us to update 6 LED modules with 24 LEDs each having 12 bits of brightness data. I have since optimized the function and with the following I’m now getting speeds of around 340us:

const int LEDMODULES = 6; 

uint8_t LEDData[LEDMODULES*36] = {0}; // LED data is stored as a string of 12 bit LED values.  There are 3 bytes for every two LEDs.  We store it this way because the data does not need to be altered when we are shifting it out to the TLC5947s.  Modifying the data with bit shifts would be very expensive and there is a lot of data to move.  

*
This function shifts all the LED data out to the TLC5947's. 
*/
void updateLEDs() {
  
  const int LEDFrameTime = 1000 / 30; // The number of milliseconds between each update of the LEDs.  1000 / 24 = 24fps.
  
  static unsigned long nextLEDFrameTime = 0;   
  
  uint8_t *thisLED;
  uint8_t *lastLED;

  if (time >= nextLEDFrameTime) { // If it is time for an LED update...  
  
    //unsigned int time = micros();
      
    thisLED = &LEDData[LEDMODULES*36-1];
    lastLED = &LEDData[0]-1; // LastLED is actually one prior to first LED so the while loop works properly.
   
    cli(); // Halt interrupts.
 
    SPDR = *thisLED--; // Initiate first byte transfer and decrement address of *thisLED.
 
    do {         
      while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED. 
    } while (thisLED != lastLED); 
    
    while (!(SPSR & _BV(SPIF))); // Wait for last byte to finish transfer.
         
    // 340 microseconds
    // The trick which makes the above so fast is that we decrement the pointer and do the conditional jump while we are waiting for the previous byte transfer to complete.

    // Move data from shift register into latch register:
    
      LATPORT &= ~(1 << LATPIN); // Set LATCH pin low.  It is important to do this first, just before transitioning high!  I don't know why, but the LED modules won't work right otherwise!
      LATPORT |= (1 << LATPIN); // Set LATCH pin high.  Transitioning from low to high moves all data from the shift register into the latch registers. 
      ENAPORT |= (1 << ENAPIN); // Set BLANK pin high. Turn all outputs off, and reset grayscale timer.  Putting this after the the latch pin change seems to result in the least flicker.
      ENAPORT &= ~(1 << ENAPIN); // Set BLANK pin low. Turn all outputs on.
  
    sei(); // Reenable interrupts.
    
    //time = micros() - time; 
    //Serial1.println(time,DEC); 
    
    nextLEDFrameTime = time + LEDFrameTime; // Set the time of the next update.
    
  }
}

I’m pretty proud of how much I’ve optimized this since my first attempt where I was naively storing the LED brightness data in an array with one element per LED. My current setup is not only more memory efficient, but since I don’t need to change LED states super fast I can have helper functions to twiddle the appropriate bits in the compact array, and the animaton stuff can go at its own pace, while the DAC interrupt does its thing unimpeded until I need to send out an LED update, at which point I can disable the DAC interrupt and just blast the data out as one continuous bitstream. (Which I need to do because as previously mentioned the two chips are on the same bus and there’s no way I have been able to come up with to send 16 bits to the DAC in the middle of sending data out to all the LED modules.)

Anyway, if you have any suggestions for how to optimize this further please let me know. Otherwise, feel free to use it if you need to send a large amount of data as fast as possible.

Have you made the SPI clock faster?

setClockDivider (SPI_CLOCK_DIV2);

Yep.

 SPI.begin();
  SPI.setClockDivider(SPI_CLOCK_DIV2);
  SPI.setDataMode(SPI_MODE0);
  SPI.setBitOrder(MSBFIRST);

So far my only complaint about your code is it sent my IDE into a loop when I tried to auto-format it. Then it did this, after I restarted:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuffer.append(StringBuffer.java:224)
    at processing.app.tools.AutoFormat.indent_puts(AutoFormat.java:170)
    at processing.app.tools.AutoFormat.run(AutoFormat.java:652)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:702)
    at java.awt.EventQueue.access$400(EventQueue.java:82)
    at java.awt.EventQueue$2.run(EventQueue.java:663)
    at java.awt.EventQueue$2.run(EventQueue.java:661)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:672)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)

I've raised an bug report about that.

I agree with your idea of doing some extra work during the SPI transfer. Not sure why you turned interrupts off though.

I turned interrupts off because my DAC is on the same SPI bus as my LED modules and it is updated in an interrupt. If I allow the DAC interrupt to fire, it will clock an integer out onto the SPI bus that is intended for the DAC, and the LED modules will pick up on it and transfer it merrily along it's way down the chain of modules and somewhere along the line two LEDs will end up with the wrong brightness. The only way to avoid this is to disable the DAC interrupt while it's updating the LED modules. There is no chip select pin on the TLC5947, so there's no way to tell it to ignore data coming down the SPI bus.

The TLC5947 does have a latch pin, but I cannot latch the data in for just one module at a time. All the modules are on the same latch line. So if I'm going to shift any data out to the LED modules I have to shift it ALL out, and then latch, and then let my DAC update.

Not exactly ideal, and if I had it to do over I'd put everything on its own bus, but it's too late for that now. Thankfully it works well enough for what I need it to do.

The maximum SPI data rate is 8Mbits/sec on a 16MHz Arduino. Your TLC5947s need 288 bits of data each, so the minimum time possible to write to all of them is 288 * 6 / 8 = 216us. You are achieving 340us, which implies an overhead of around 50%, or 0.5us per byte. I can think of two ways in which you could speed it up:

  1. Try inserting various numbers of NOP instructions immediately before the statement "while (!(SPSR & _BV(SPIF)));". The idea is to synchronise the loop so that the SPIF flag is detected immediately after it goes active, avoiding the situation in which it goes active just after it has been read. The ideal number will probably be between 0 and 4.

  2. Even better, replace the statement "while (!(SPSR & _BV(SPIF)));" by just enough NOP instructions to allow the previous byte to be sent. The SPI clock is synchronised with the CPU clock, so this is a safe thing to do. Try 16 NOP instructions to start with, make sure it works, then reduce the number until you find the minimum number for which the LED drivers receive the correct data.

This how NOP instructions are added in the Arduino core library (two of them in this instance):

    __asm__ __volatile__ (
        "nop" "\n\t"
        "nop");

dc42:
Thanks, I’ll try that out and report back on my results!

scswift:   do {              while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED.    } while (thisLED != lastLED);

// The trick which makes the above so fast is that we decrement the pointer and do the conditional jump while we are waiting for the previous byte transfer to complete.

Problem is, you do some work between SPIF going HIGH and sending the next byte.

Try changing it to:

   do {
      const byte b = *thisLED--;
      while (!(SPSR & _BV(SPIF))); SPDR = b;
    } while (thisLED != lastLED);

dc42: 2. Even better, replace the statement "while (!(SPSR & _BV(SPIF)));" by just enough NOP instructions to allow the previous byte to be sent. The SPI clock is synchronised with the CPU clock, so this is a safe thing to do. Try 16 NOP instructions to start with, make sure it works, then reduce the number until you find the minimum number for which the LED drivers receive the correct data.

This how NOP instructions are added in the Arduino core library (two of them in this instance):

  __asm__ __volatile__ (
        "nop" "\n\t"
        "nop");

Try this too. It might not do anything because you could already be perfectly synced to the SPI ... but it's worth a try.

Fungus,

I don't understand. What work do you refer to?

Here, it: 1. Waits till it can send another byte 2. Sends it 3. Decrements the pointer 4. Jumps

 do {         
      while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED. 
    } while (thisLED != lastLED);

Here, it: 1. Decrements the pointer and sticks it in a temorary variable 2. Waits until it can send another byte 3. Sends it 4. Jumps

   do {
      const byte b = *thisLED--;
      while (!(SPSR & _BV(SPIF))); SPDR = b;
    } while (thisLED != lastLED);

The only difference I can see is that the order of the decrement and the jump are reversed. Either way, both are done after it's waited to send a byte and has sent it.

scswift: Fungus,

I don't understand. What work do you refer to?

Here, it: 1. Waits till it can send another byte 2. Sends it 3. Decrements the pointer 4. Jumps

 do {         
      while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--;
    } while (thisLED != lastLED);

I'm betting it's more like this:

  1. Waits till it can send another byte
  2. Reads memory pointed to by pointer variable
  3. Decrements pointer variable
  4. Writes data from (2) to SPI
  5. Jumps

Maybe 3 and 4 are the other way around but 2 will almost certainly be in there and it's a couple of clock cycles.

Okay, here is the new version of the function, using NOPs. The minimum number of NOPs I could use without glitching was 15.

I decided to go with a 12x unroll since 24x provided little additional speed. That gives 256us, which is quite a bit better than the 340us I was getting with the standard method, so thanks for the suggestion. :slight_smile:

My sound is still warbling somewhat with 4-6 LED drivers, but it’s mainly an issue with music that has singing. With most instruments and sound effects it would be completely unnoticeable. I still can’t get 7 or 8 LED drivers to work even when using short runs for all the cables. Must be reflection in the data lines.

I was also getting some glitches when I had a long ribbon cable that I’d bunched up but as soon as I spread it out across the table it started working fine again. (This was regardless of the NOP code, it was just crosstalk in the lines.)

/*
This function shifts all the LED data out to the TLC5947's. 
*/

#define NOP __asm__ __volatile__ ("nop\n");
#define WAIT NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP // 15 NOPs 

void updateLEDs() {
 
  const int LEDFrameTime = 1000 / 30; // The number of milliseconds between each update of the LEDs.  1000 / 24 = 24fps.
  
  static unsigned long nextLEDFrameTime = 0;   
  
  uint8_t *thisLED;
  uint8_t *lastLED;

  if (time >= nextLEDFrameTime) { // If it is time for an LED update...  
  
    //unsigned int time = micros();
      
    thisLED = &LEDData[(LEDMODULES*36)-1];
    lastLED = &LEDData[0]-1; // LastLED is LED -1 so the while loop works properly.
   
    cli(); // Halt interrupts.
 
    SPDR = *thisLED--; // Initiate first byte transfer and decrement address of *thisLED.
    WAIT; SPDR = *thisLED--;
    WAIT; SPDR = *thisLED--;
    WAIT; SPDR = *thisLED--;
    WAIT; SPDR = *thisLED--;
    WAIT; SPDR = *thisLED--;
        
    do {       
      
      WAIT; SPDR = *thisLED--;
      WAIT; SPDR = *thisLED--;
      WAIT; SPDR = *thisLED--;
      WAIT; SPDR = *thisLED--;
      WAIT; SPDR = *thisLED--;
      WAIT; SPDR = *thisLED--;
      
      //while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED. 
          
    } while (thisLED != lastLED); // lastLED is LED -1, so when thisLED = lastLED, we know the last decrement operation was after LED 0 was output.

    WAIT; // Wait for last byte to finish transfer.
    SPSR = SPSR & ~_BV(SPIF); // Clear transfer flag.
  
    // 6x unroll = 276us
    // 12x unroll = 256us
    // 24x unroll = 252us

    // Move data from shift register into latch register:
    
      LATPORT &= ~(1 << LATPIN); // Set LATCH pin low.  It is important to do this first, just before transitioning high!  I don't know why, but the LED modules won't work right otherwise!
      LATPORT |= (1 << LATPIN); // Set LATCH pin high.  Transitioning from low to high moves all data from the shift register into the latch registers. 

      ENAPORT |= (1 << ENAPIN); // Set BLANK pin high. Turn all outputs off, and reset grayscale timer.  Blanking reduces flicker, but it does not seem to matter if it is before, after, or in the middle of latching.
      ENAPORT &= ~(1 << ENAPIN); // Set BLANK pin low. Turn all outputs on.
      
    sei(); // Reenable interrupts.
    
    //time = micros() - time; 
    //Serial1.println(time,DEC); 
    
    nextLEDFrameTime = time + LEDFrameTime; // Set the time of the next update.
    
  }
}

fungus: I'm betting it's more like this:

  1. Waits till it can send another byte
  2. Reads memory pointed to by pointer variable
  3. Decrements pointer variable
  4. Writes data from (2) to SPI
  5. Jumps

Maybe 3 and 4 are the other way around but 2 will almost certainly be in there and it's a couple of clock cycles.

Yep. I just disassembled the code and it's:

  1. Wait 'til it can send another byte
  2. Read memory pointed to by pointer variable
  3. Write SPI data
  4. Decrement pointer variable
  5. Jump

So you've definitely wasted two clock cycles per byte sent compared to my optimization.

Also: if you write "*--thisLED;" instead of "*thisLED--;" the decrement can be done for free because the CPU has a special instruction for that. The "*x--" form wastes two clock cycles (although it might not speed this loop up because you're waiting for SPI).

It's also good style because you don't have the "-1"s in this code:

   thisLED = &LEDData[LEDMODULES*36-1];
   lastLED = &LEDData[0]-1; // LastLED is actually one prior to first LED so the while loop works properly.

scswift: Okay, here is the new version of the function, using NOPs. The minimum number of NOPs I could use without glitching was 15.

Ok, now change all those to:

  byte b;
  ...
  b = *thisLED--;   WAIT;   SPDR = b;

I went one better and implemented your pre-decrement as well, but still getting 256us as with my last version, and using only 14 NOPs still glitches, so it seems like I already hit the limit.

    thisLED = &LEDData[LEDMODULES*36]; // The first operation is to decrement this pointer, so the first time we write thisLED it will be array index LEDMODULES*36-1
    lastLED = &LEDData[0]; 
    
    byte b;
       
    cli(); // Halt interrupts.

    b = *--thisLED; SPDR = b; // No need to wait for the first byte.
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
           
    do {       
    
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      b = *--thisLED; WAIT; SPDR = b;
      
      //while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED. 
          
    } while (thisLED != lastLED); 

    WAIT; // Wait for last byte to finish transfer.
    SPSR = SPSR & ~_BV(SPIF); // Clear transfer flag.

Just for the sake of completeness, here’s the final loop I decided to go with. I used your pre-decrement trick to get rid of those -1’s. Good call. :slight_smile:

This version runs in around 256us.

/*
This function shifts all the LED data out to the TLC5947's. 
*/

#define NOP __asm__ __volatile__ ("nop\n");
#define WAIT NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP // 15 NOPs 

void updateLEDs() {
 
  const int LEDFrameTime = 1000 / 30; // The number of milliseconds between each update of the LEDs.  1000 / 24 = 24fps.
  
  static unsigned long nextLEDFrameTime = 0;   
  
  uint8_t *thisLED;
  uint8_t *lastLED;

  if (time >= nextLEDFrameTime) { // If it is time for an LED update...  
  
    //unsigned int time = micros();
      
    thisLED = &LEDData[LEDMODULES*36]; // The first operation is to decrement this pointer, so the first time we write thisLED it will be array index LEDMODULES*36-1
    lastLED = &LEDData[0]; 
   
    cli(); // Halt interrupts.
 
    SPDR = *--thisLED; // Initiate first byte transfer and decrement address of *thisLED.
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
    WAIT; SPDR = *--thisLED;
       
    do {       
    
      WAIT; SPDR = *--thisLED;  
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      WAIT; SPDR = *--thisLED;
      
      //while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED. 
          
    } while (thisLED != lastLED); // thisLED is decremented one last time before we hit the end of the loop, so after byte 0 transfers, the loop exits.

    WAIT; // Wait for last byte to finish transfer.
    SPSR = SPSR & ~_BV(SPIF); // Clear transfer flag.
    
    //while (!(SPSR & _BV(SPIF))); // Wait for last byte to finish transfer.
    
    // 15 nops and:
    // 6x unroll = 276us
    // 12x unroll = 256us
    // 24x unroll = 252us
    
    // This may be leaving the SPIF flag set.  
    
    // Clear SPIF flag:
    //while (!(SPSR & _BV(SPIF))); // Wait for last byte to finish transfer.

    // 340 microseconds
    // The trick which makes the above so fast is that we decrement the pointer and do the conditional jump while we are waiting for the previous byte transfer to complete.

    // Move data from shift register into latch register:
    
      LATPORT &= ~(1 << LATPIN); // Set LATCH pin low.  It is important to do this first, just before transitioning high!  I don't know why, but the LED modules won't work right otherwise!
      LATPORT |= (1 << LATPIN); // Set LATCH pin high.  Transitioning from low to high moves all data from the shift register into the latch registers. 

      ENAPORT |= (1 << ENAPIN); // Set BLANK pin high. Turn all outputs off, and reset grayscale timer.  Blanking reduces flicker, but it does not seem to matter if it is before, after, or in the middle of latching.
      ENAPORT &= ~(1 << ENAPIN); // Set BLANK pin low. Turn all outputs on.
      
    sei(); // Reenable interrupts.
    
    //time = micros() - time; 
    //Serial1.println(time,DEC); 
    
    nextLEDFrameTime = time + LEDFrameTime; // Set the time of the next update.
    
  }
}

scswift: I went one better and implemented your pre-decrement as well, but still getting 256us as with my last version, and using only 14 NOPs still glitches, so it seems like I already hit the limit.

Oh, duh! My bad...

The position of the decrement in the code doesn't matter when you're not waiting for SPIF. All that matters is the total number of cycles per byte sent.

On this page I am using SPI to send video output:

http://www.gammon.com.au/forum/?id=11608

Since it takes 16 clock cycles to send one SPI byte at the highest speed (clock div 2, times 8 bits) I had a loop that happened to take 17 cycles to read the pixel data from program memory and output it, so no waiting on the hardware was necessary.

My sending loop was just:

// blit pixel data to screen    
  while (i--)
    UDR0 = pgm_read_byte (linePtr + (* messagePtr++));

  // wait till done    
  while (!(UCSR0A & _BV(TXC0))) 
    {}

However it just happened that getting the data from program memory slowed it down the right amount. ;)


I'm inclined to agree with dc42 that you should be able to do it in 216 uS so maybe the LEDs can't keep up.


    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;
    b = *--thisLED; WAIT; SPDR = b;

Bear in mind that you have 16 clock cycles to for each SPI byte to be sent, so unravelling the loop doesn't achieve much, except obfuscation.

[quote author=Nick Gammon link=topic=129824.msg977510#msg977510 date=1351630840]

I'm inclined to agree with dc42 that you should be able to do it in 216 uS so maybe the LEDs can't keep up. [/quote]

I think the SPI clock always has to end in a LOW state after sending a byte (it has to send complete clock pulses, it can't end on HIGH).

In that case it would take 17 cycles per byte, minimum.

Bear in mind that you have 16 clock cycles to for each SPI byte to be sent, so unravelling the loop doesn't achieve much, except obfuscation.

Hm, that's true, assuming the jump doesn't take 16 cycles. I'll try another version that isn't unrolled but has fewer NOP's per WAIT.