SPI speed, why the dead time?

I’ve read the very interesting discussion here
http://forum.arduino.cc/index.php/topic,129824.0.html

I was going to reply but got told not to. I don’t see why not but if that’s the rules, fine, here goes with a new thread.

I’m driving SPI configured stepper motor drivers in combination with GRBL on a Uno, so I need to get it as fast a possible so as to add a minimum amount of jitter, for example when a thermal warning raises the error flag, I need to interrogate a daisy chain of three devices to get the status bytes an see who bitchin’ and about what?

This means one byte out and two back; repeated three times to push the response down the chain. It seems to work fine in principal but it’s damned slow, even using direct pin addressing to interface with the SPI registers. ( No silly digitalWrites etc.)

The classic while (!(SPSR & _BV(SPIF))) {;} method was taking about 27us for the nine byte sequence running the SPI at max speed ( CK/2 ).

On the 'scope I see three groups of three bursts on SCK. Each burst of 8 pulses is just under a microsecond as expected. But then there’s about 1.5us dead time before the next tx.

I picked up the idea of using NOPs and it improved a bit but it’s still essentially the same. The burst of three now takes 4.5us instead of 6us.The dead time between each burst is thus a little under 1us rather than 1.5us. The 9bytes sequence is down to 20us.

 uint8_t *pByteToTransmit;
 uint8_t *pReceivedByte;
 uint16_t sh ;

  cli(); 
  for (i = L6474_CMD_ARG_MAX_NB_BYTES-1-maxReturnBytes; i < L6474_CMD_ARG_MAX_NB_BYTES; i++)
  {

    pByteToTransmit=&spiTxBursts[i][0];
    pReceivedByte=&spiRxBursts[i][0];

    L6474_SPI_PORT &= ~_BV(SPI_CS);  // SPI port CS' enable, starts clock pushes bytes through daisy-chain
  
    for (sh = 0; sh < numberOfShields; sh++)
    {
      SPDR = pByteToTransmit++[0];   // write to SPDR <Arduino.h>  initiates h/w transfer
//  while (!(SPSR & _BV(SPIF))) {;}  // SPIF set when tx complete and SCK stops

// use NOPs instead of while on SPIF
                
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");  
__asm__ __volatile__ ("nop\n\t"
                      "nop");   // 16 NOPS to mark time while SPI byte goes out
                     
  // this is a blocking (16 clock cycles @ sysclk/2 = 1us) all but ISRs ; set up WDT to catch electrical errors ???
    pReceivedByte++[0]= SPDR; 
    } // end for sh

    L6474_SPI_PORT |= _BV(SPI_CS);  // SPI CS' hi: stops SCK: triggers decode in chip input registers.
  }  // end for i
  
  spiRxData += (spiRxBursts[1][spiIndex] << 16)|
              (spiRxBursts[2][spiIndex] << 8) |
              (spiRxBursts[3][spiIndex]);
  
  /* re-enable interrupts after SPI transfers*/
  sei();

The disassembly’s frankly a bit of a mess, so I’m not sure whether I got the number of NOPs I intended. I think it’s right because even one less returns a zero which is not a valid result: the data is not yet in.

    SPDR = pByteToTransmit++[0];   // write to SPDR <Arduino.h>  initiates h/w transfer
//  while (!(SPSR & _BV(SPIF))) {;}  // SPIF set when tx complete and SCK stops
     636: d8 01         movw  r26, r16
     638: ed 91         ld  r30, X+
     63a: 8d 01         movw  r16, r26
     63c: ee bd         out 0x2e, r30 ; 46
  ...
                      "nop");
__asm__ __volatile__ ("nop\n\t"
                      "nop");
__asm__ __volatile__ ("nop\n\t"
                      "nop");
__asm__ __volatile__ ("nop\n\t"
     64e: 00 00         nop
                      "nop");
__asm__ __volatile__ ("nop\n\t"
                      "nop");
     650: ee b5         in  r30, 0x2e ; 46
     652: e9 93         st  Y+, r30
     654: e7 cf         rjmp  .-50      ; 0x624 <CmdGetParam+0x98>
__asm__ __volatile__ ("nop\n\t"
                      "nop");

 // 16 NOPS to mark time while SPI byte goes out

     656: 2a 9a         sbi 0x05, 2 ; 5
    // full 3x3 takes 27us
 uint8_t *pByteToTransmit;
 uint8_t *pReceivedByte;
 uint16_t sh ;
     63a: 8d 01         movw  r16, r26
     63c: ee bd         out 0x2e, r30 ; 46
  ...
     64e: 00 00         nop
     650: ee b5         in  r30, 0x2e ; 46

It seems to be as fast as it can go, is that extra microsecond just the for-loop overhead? Seems a lot.

It looks like running the SPI stack on top of GRBL may be asking a bit too much. I may need to move it to STM32.

 uint16_t sh ;
...
    for (sh = 0; sh < numberOfShields; sh++)

Are you expecting up to 65535 shields? How about making that a byte?

The disassembly's frankly a bit of a mess, so I'm not sure whether I got the number of NOPs I intended.

Use the -Z flag to see NOPs in the dissassembly.

Can you post a complete sketch that demonstrates the issue? Then I can test it here.

Thanks for the tip Nick, avr-objdump -d -z does the trick. As it appeared from the -d snippet above there are 18 NOPs not the 16 I asked for :? Some boundary alignment I guess.

avr-objdump seems to have serious issues with conditional compilation, the straight -S output is pretty much unintelligible.

Are you expecting up to 65535 shields?

LOL
For the moment I'm padding with NOPs and duping code to get see how fast I can make it run. Byte level economy comes next.
I'm adapting code provided in a c++ library for these chips. In the lib is a was uint32_t !

You asked why the looping part took so long. Comparing an int would take longer than a byte, when deciding whether to loop or not.

there are 18 NOPs not the 16 I asked for :? Some boundary alignment I guess.

I doubt that, and when you post code that I can reproduce, we'll see for ourselves.

avr-objdump seems to have serious issues with conditional compilation,

I don't see any conditional compilation in what you posted, nor do I see 18 NOPs.

A separate sketch may take some time. This is embedded into 31k worth of GRBL.The uncommented avr-objdump -d dump for the timed loops looks like this.

     5f6: f8 94         cli
     5f8: 43 e0         ldi r20, 0x03 ; 3
     5fa: 50 e0         ldi r21, 0x00 ; 0
     5fc: ba 01         movw  r22, r20
     5fe: 63 1b         sub r22, r19
     600: 71 09         sbc r23, r1
     602: 9b 01         movw  r18, r22
     604: ab 01         movw  r20, r22
     606: 44 0f         add r20, r20
     608: 55 1f         adc r21, r21
     60a: 42 0f         add r20, r18
     60c: 53 1f         adc r21, r19
     60e: 60 91 00 01   lds r22, 0x0100
     612: 70 91 01 01   lds r23, 0x0101
     616: 64 0f         add r22, r20
     618: 75 1f         adc r23, r21
     61a: 2a 98         cbi 0x05, 2 ; 5
     61c: 8a 01         movw  r16, r20
     61e: 05 55         subi  r16, 0x55 ; 85
     620: 1e 4f         sbci  r17, 0xFE ; 254
     622: eb 01         movw  r28, r22
     624: e0 91 03 01   lds r30, 0x0103
     628: f0 e0         ldi r31, 0x00 ; 0
     62a: de 01         movw  r26, r28
     62c: a6 1b         sub r26, r22
     62e: b7 0b         sbc r27, r23
     630: ae 17         cp  r26, r30
     632: bf 07         cpc r27, r31
     634: 80 f4         brcc  .+32      ; 0x656 <CmdGetParam+0xca>
     636: d8 01         movw  r26, r16
     638: ed 91         ld  r30, X+
     63a: 8d 01         movw  r16, r26
     63c: ee bd         out 0x2e, r30 ; 46
     63e: 00 00         nop
     640: 00 00         nop
     642: 00 00         nop
     644: 00 00         nop
     646: 00 00         nop
     648: 00 00         nop
     64a: 00 00         nop
     64c: 00 00         nop
     64e: 00 00         nop
     650: ee b5         in  r30, 0x2e ; 46
     652: e9 93         st  Y+, r30
     654: e7 cf         rjmp  .-50      ; 0x624 <CmdGetParam+0x98>
     656: 2a 9a         sbi 0x05, 2 ; 5
     658: 2f 5f         subi  r18, 0xFF ; 255
     65a: 3f 4f         sbci  r19, 0xFF ; 255
     65c: 4d 5f         subi  r20, 0xFD ; 253
     65e: 5f 4f         sbci  r21, 0xFF ; 255
     660: 24 30         cpi r18, 0x04 ; 4
     662: 31 05         cpc r19, r1
     664: 09 f0         breq  .+2       ; 0x668 <CmdGetParam+0xdc>
     666: d3 cf         rjmp  .-90      ; 0x60e <CmdGetParam+0x82>
     668: e0 91 00 01   lds r30, 0x0100
     66c: f0 91 01 01   lds r31, 0x0101
     670: e8 0f         add r30, r24
     672: f9 1f         adc r31, r25
     674: 66 81         ldd r22, Z+6  ; 0x06
     676: 70 e0         ldi r23, 0x00 ; 0
     678: 76 2f         mov r23, r22
     67a: 66 27         eor r22, r22
     67c: 81 85         ldd r24, Z+9  ; 0x09
     67e: 68 2b         or  r22, r24
     680: 88 27         eor r24, r24
     682: 77 fd         sbrc  r23, 7
     684: 80 95         com r24
     686: 98 2f         mov r25, r24
     688: 78 94         sei

Making both the for-loop counters single byte saves 24bytes :slight_smile: but does not seem to affect the timing noticeably. I do now get 16 NOPs.

     5ee: f8 94         cli
     5f0: 23 e0         ldi r18, 0x03 ; 3
     5f2: 23 1b         sub r18, r19
     5f4: e0 91 00 01   lds r30, 0x0100
     5f8: f0 91 01 01   lds r31, 0x0101
     5fc: 2a 98         cbi 0x05, 2 ; 5
     5fe: 42 2f         mov r20, r18
     600: 50 e0         ldi r21, 0x00 ; 0
     602: ba 01         movw  r22, r20
     604: 66 0f         add r22, r22
     606: 77 1f         adc r23, r23
     608: 46 0f         add r20, r22
     60a: 57 1f         adc r21, r23
     60c: 45 55         subi  r20, 0x55 ; 85
     60e: 5e 4f         sbci  r21, 0xFE ; 254
     610: b3 e0         ldi r27, 0x03 ; 3
     612: 2b 9f         mul r18, r27
     614: e0 0d         add r30, r0
     616: f1 1d         adc r31, r1
     618: 11 24         eor r1, r1
     61a: da 01         movw  r26, r20
     61c: 30 91 03 01   lds r19, 0x0103
     620: 6a 2f         mov r22, r26
     622: 64 1b         sub r22, r20
     624: 63 17         cp  r22, r19
     626: b0 f4         brcc  .+44      ; 0x654 <CmdGetParam+0xc8>
     628: 3d 91         ld  r19, X+
     62a: 3e bd         out 0x2e, r19 ; 46
     62c: 00 00         nop
     62e: 00 00         nop
     630: 00 00         nop
     632: 00 00         nop
     634: 00 00         nop
     636: 00 00         nop
     638: 00 00         nop
     63a: 00 00         nop
     63c: 00 00         nop
     63e: 00 00         nop
     640: 00 00         nop
     642: 00 00         nop
     644: 00 00         nop
     646: 00 00         nop
     648: 00 00         nop
     64a: 00 00         nop
     64c: 3d b5         in  r19, 0x2d ; 45
     64e: 3e b5         in  r19, 0x2e ; 46
     650: 31 93         st  Z+, r19
     652: e4 cf         rjmp  .-56      ; 0x61c <CmdGetParam+0x90>
     654: 2a 9a         sbi 0x05, 2 ; 5
     656: 2f 5f         subi  r18, 0xFF ; 255
     658: 24 30         cpi r18, 0x04 ; 4
     65a: 09 f0         breq  .+2       ; 0x65e <CmdGetParam+0xd2>
     65c: cb cf         rjmp  .-106     ; 0x5f4 <CmdGetParam+0x68>
     65e: e0 91 00 01   lds r30, 0x0100
     662: f0 91 01 01   lds r31, 0x0101
     666: e8 0f         add r30, r24
     668: f9 1f         adc r31, r25
     66a: 66 81         ldd r22, Z+6  ; 0x06
     66c: 70 e0         ldi r23, 0x00 ; 0
     66e: 76 2f         mov r23, r22
     670: 66 27         eor r22, r22
     672: 81 85         ldd r24, Z+9  ; 0x09
     674: 68 2b         or  r22, r24
     676: 88 27         eor r24, r24
     678: 77 fd         sbrc  r23, 7
     67a: 80 95         com r24
     67c: 98 2f         mov r25, r24
     67e: 78 94         sei
     680: 08 95         ret

Less of a detail than I thought , thanks.

oops NOP is a two byte instruction.

uint16_t vars gave me 9 lines of nops; uint8_t gives me 16 lines : 16 NOPs. How did that happen??

I think this is due to my volatiling each pair of nops , not the whole block. I did that in an attempt to get the disassemble to show me the NOPs explicitly before Nick pointed out the -z switch.

I'll tidy up code below ...

  cli(); 
  for (i = L6474_CMD_ARG_MAX_NB_BYTES-1-maxReturnBytes; i < L6474_CMD_ARG_MAX_NB_BYTES; i++)
  {

    pByteToTransmit=&spiTxBursts[i][0];
    pReceivedByte=&spiRxBursts[i][0];

    L6474_SPI_PORT &= ~_BV(SPI_CS);  // SPI port CS' enable, starts clock pushes bytes through daisy-chain
 
    for (sh = 0; sh < numberOfShields; sh++)
    {
      SPDR = pByteToTransmit++[0];   // write to SPDR <Arduino.h>  initiates h/w transfer
//  while (!(SPSR & _BV(SPIF))) {;}  // SPIF set when tx complete and SCK stops

// use NOPs instead of while on SPIF
                
__asm__ __volatile__ ("nop\n\t"
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop\n\t" 
                      "nop");   // 16 NOPS to mark time while SPI byte goes out
                     
  // this is a blocking (16 clock cycles @ sysclk/2 = 1us) all but ISRs ; set up WDT to catch electrical errors ???
      tmp=SPSR;  // dummy read, preceding data access clears SPIF
      pReceivedByte++[0]= SPDR; 
    } // end for sh
    L6474_SPI_PORT |= _BV(SPI_CS);  // SPI CS' hi: stops SCK: triggers decode in chip input registers.
  }  // end for i
  
  spiRxData += (spiRxBursts[1][spiIndex] << 16)|
              (spiRxBursts[2][spiIndex] << 8) |
              (spiRxBursts[3][spiIndex]);
  
  /* re-enable interrupts after SPI transfers*/
  sei();  
  
#ifdef debug_L6474
  printPgmString(PSTR("getparam out "));    printPgmString(PSTR("sh"));  
  print_uint8_base10(shieldId );
  printPgmString(PSTR(":param="));  
 // max three byte returned
  print_uint8_base16( spiRxBursts[1][spiIndex]  );
  print_uint8_base16( spiRxBursts[2][spiIndex]  ); 
  print_uint8_base16( spiRxBursts[3][spiIndex]  );
  printPgmString(PSTR("\r\n"));
//  printPgmString(PSTR(":spiRxData=")); 
//  print_uint16_base16((uint16_t) spiRxData);
//  printPgmString(PSTR("\r\n"));            
#endif
    
  return (spiRxData);
}

debug_L6474 is not defined in what I’ve posted here.

I've just realised that I can use a constant for the number shields in the inner for-loop, since this will not change on this hardware. That helps a lot.

The 3 x 3 byte interaction is down to just over 16us from 27us. It's now spending more time actually sending data than thinking about doing it. Much more credible.

A couple of key tips from Nick there were a great help. Many thanks. A good nights sleep also helped :wink:

The challenge now it to stop locking interrupts, let GRBL do it's stuff and hope it all still works.

for (sh = 0; sh < numberOfShields; sh++)

you might reverse the loop as comparing with zero is slightly faster on AVR

for (sh = numberOfShields; sh > 0 ; --sh)
SPDR = pByteToTransmit++[0];

for readability you might split it in 2 lines (more clear in which order the thing is evaluated)

SPDR = pByteToTransmit[0];
pByteToTransmit++;

Thanks, counting down is a good idea, but that would mean reversing the order of the data to be sent out. That may make the code better because it actually backwards already. Sounds like a bit work to test it.

I'd just combined what were two separate statements to make it more efficient :wink:

I think this is due to my volatiling each pair of nops...

Doubt it. The most likely culprit is that the lines are not properly terminated.

Not correct...

__asm__ __volatile__ ("nop\n\t"
                      "nop");

Correct...

__asm__ __volatile__ ("nop\n\t"
                     "nop\n\t");

Many thanks, I was not too happy with my explanation but haven't had time to go into it. That was a mistake I picked up on the other thread. There were many variants posted, looks like I picked the wrong person to listen to :frowning:

I did not have the courage to start ploughing through opaque documentation to find out what was correct.

Thanks for the correction.

ardnut:
Thanks, counting down is a good idea, but that would mean reversing the order of the data to be sent out. That may make the code better because it actually backwards already. Sounds like a bit work to test it.

I'd just combined what were two separate statements to make it more efficient :wink:

  1. the sh is nowhere used as index in the code if I see correctly so it is just a loop counter.

  2. Give the compiler separate statements, and it can compile/optimize quite well.

Ah, you are correct, it's simple repetition. It is the outer loop that orders the data.

However, no perceptible difference on the 'scope, though I'm sure you're right in principal. I always try to compare to zero if possible.

thx