SPI Problem between Mega 2560 and Raspberry Pi

I have an issue I have been fighting for two days now... I set up SPI communication between a Raspberry Pi and an Arduino Mega 2560, and it basically works. The lines are very short (total length about 5 cm) and run through pin headers; there is a 3.3 V to 5.0 V converter in between.

On the Raspberry side, the Pi sends blocks of about 500 bytes 2-3 times a second. On the Arduino side, the Arduino stores the incoming data in one buffer and sends back data from a different buffer.

The data the Pi has to send is shorter than the data I need to receive from the Arduino, so after the meaningful data, filler bytes (containing "X") are sent to pad the record to 500 bytes.
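For reference, the record layout used by SPI_Transmit further down can be sketched as a pure function. build_packet is a hypothetical helper, not part of the original program; it only assembles the byte values that SPI_Transmit sends one at a time, and the sample string and counter value are made up for illustration.

```python
FILLER = ord("X")  # filler byte used to pad the record


def build_packet(datastring, packetcounter, packetlength):
    """Return the byte values of one record (hypothetical helper)."""
    packet = [1, packetcounter, len(datastring) + 32]  # start marker, counter, length + 32
    packet += [ord(ch) for ch in datastring]           # the useful data
    packet.append(13)                                  # CR marks the end of the useful data
    packet += [FILLER] * (packetlength - len(datastring) - 5)  # padding
    packet.append(0)                                   # 0 marks the end of the record
    return packet


packet = build_packet("A:1/B:2", 33, 480)
print(len(packet), packet[:4], packet[-1])
```

With this layout the record always comes out exactly packetlength bytes long, whatever the length of the useful data.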

Now the problem: from time to time, one byte in the data received from the Arduino is replaced, and its value corresponds to the filler byte. In other words, the byte sent from the Pi to the Arduino is "mirrored" back to the Pi, replacing the correct byte that should have been sent.

The location of the incorrectly returned byte in the record varies and seems to be random. The error frequency also varies, but is between 2% and 20% (i.e. 2% to 20% of the 500-byte blocks contain one swapped byte).
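To pin down where the swapped bytes sit, the sent and received records can be compared transfer by transfer. The sketch below assumes that a "mirrored" byte is the value the Pi sent in the previous transfer (i.e. the slave shifted out whatever was left in its data register); find_mirrored and the sample lists are hypothetical and only illustrate the comparison.

```python
def find_mirrored(tx, rx, expected):
    """Return the indices where rx is wrong and equals the byte
    the master sent one transfer earlier (assumed mirroring rule)."""
    hits = []
    for i in range(1, len(rx)):
        if rx[i] != expected[i] and rx[i] == tx[i - 1]:
            hits.append(i)
    return hits


# Made-up example: the fourth reply byte is replaced by the filler 'X' (88).
tx = [1, 33, 88, 88, 0]
expected = [0, 65, 66, 67, 68]
rx = [0, 65, 66, 88, 68]
print(find_mirrored(tx, rx, expected))
```

Logging these indices over many records would show whether the positions are really random or cluster somewhere in the record.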

I tried different speeds, from 4.000.000 down to 40.000. There is no clear pattern, but the error seems to occur much more frequently at LOWER speeds. At 4.000.000 I had sequences of up to 50 transmissions without a glitch... But then the errors start, and several can appear in a row...

I am not sure if it is a hardware issue or some problem with interrupts... Considering that a lower bit rate seems to cause more problems, I am inclined to think that it has to do with the program. I have several interrupts in use, but from my understanding, if there were a delay in reading the byte, the byte would be lost, both for receiving and for sending. But this does not happen: the byte is not lost, it is replaced.

Any help would be really appreciated, because I am stuck...

The complete code on both the Raspberry and the Arduino is pretty long, more than 1000 lines, so I am not sure how to post it.

On the Arduino side, I have several other interrupt routines: Timer3, Timer4 and Timer5 (COMPA, COMPB, COMPC) and the ADC interrupt routine.

The SPI routine:

ISR (SPI_STC_vect) // SPI interrupt routine
{

byte c = SPDR;  // grab byte from SPI Data Register 
if (c==1) {rasp_ndx=0;}  // reset the ndx to start

if (rasp_active==0) 
 {if (rasp_ndx < sizeof (rasp_in_0)) 
           { rasp_in_0 [rasp_ndx] = c;  // add to buffer 0 if room, including the 0 char
             SPDR=rasp_out_0[rasp_ndx++]; // send from buffer 0              
           }
 }         
else
 {if (rasp_ndx < sizeof (rasp_in_1)) 
           { rasp_in_1 [rasp_ndx] = c;  // add to buffer 1 if room, including the 0 char
             SPDR=rasp_out_1[rasp_ndx++]; // send from buffer 1              
           }
 }

 if (c==0) {rasp_active++;rasp_active=rasp_active & 1; rasp_new=true;}  // toggle rasp_active
}

Here are the SPI routines on Raspberry in Python:

def SPI_Transmit (datastring):
  
   RX=""
   a=spi.xfer ([1]) # first char  = 1 to start sentence
   b=spi.xfer ([packetcounter])  # sequential counter of packets
   c=spi.xfer ([len(datastring)+32])
   RX=chr(b[0])+chr(c[0])
 
   for j in range (0,len(datastring)):
              k=ord(datastring[j])
              z=spi.xfer ([k])
              RX=RX+chr(z[0])
   z=spi.xfer ([13])  # CR 13 - to indicate the end of useful data end of string
   RX=RX+chr(z[0])
   for j in range (0,packetlength-len(datastring)-5):
              k=ord("X")    #  Filler ************************************* !!!
              z=spi.xfer ([k])
              RX=RX+chr(z[0])
   z=spi.xfer ([0])  # 0- to indicate the end of sentence
   RX=RX+chr(z[0])
   return RX;
# ---------------------------
def SPI_Setup ():
   
   global packetcounter
   global spi
   print ("starting...")
   packetcounter = 32
   spi=spidev.SpiDev()
   spi.open (0,  0)
   spi.mode=0
   spi.max_speed_hz = 4000000
   return;
# ----------------------------
def SPI_Communicate ():

   global packetcounter
   global commands
   RX=""
   packetcounter=packetcounter+1
   if packetcounter>127:
      packetcounter=32

   commands=commands[:maxcommandlength] # truncate so as not exceed the buffer length
   k=commands.rfind("/")
   commands=commands[:k]  # adjust to the nearest fully formulated command

   
   RXdata=SPI_Transmit (commands)  # raw data from mega
   # print (commands)
   commands="" # clear all the commands entered

   k=RXdata.rfind("/")  # checksum: first, cut off last field
   temp=RXdata[:k]
   rightside=RXdata[k+1:] # obtain value of last field
   k=rightside.rfind(":")
   fieldname=rightside[:k]   # must be "CHECK"
   ke=rightside.find(chr(0))
   checksumreceived=rightside[k+1:ke]
   checksum=0
   for i in range (0,len(temp)):
      checksum=checksum+ord (RXdata[i])
   checksum=str(checksum)
   if not (checksum==checksumreceived):
      print ("Communication Breakdown: checksum:",checksum,checksumreceived)
      print (RXdata,chr(13))
   else:
      print ("CHECKSUM CORRECT")
   Logger_write (RXdata)
   t=threading.Timer(SPI_Frequency,SPI_Communicate)
   t.start()
   return;
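The checksum test inside SPI_Communicate can be pulled out into a pure function so that it can be exercised without any hardware. This is only a sketch: checksum_ok is a hypothetical name, and it assumes (from the code above) that the record ends with a field CHECK:<decimal sum> terminated by chr(0), and that the sum covers every character before the last "/". Unlike the original, it also rejects records whose last field is not named CHECK.

```python
def checksum_ok(rxdata):
    """Return True if the trailing CHECK field matches the sum of the
    character codes before the last '/' (hypothetical helper)."""
    k = rxdata.rfind("/")
    if k < 0:
        return False                      # no field separator at all
    body, last = rxdata[:k], rxdata[k + 1:]
    name, sep, rest = last.partition(":")
    if not sep or name != "CHECK":
        return False                      # last field is not the checksum
    end = rest.find(chr(0))
    received = rest if end < 0 else rest[:end]
    return str(sum(ord(ch) for ch in body)) == received


body = "A:1/B:2"                          # made-up payload
record = body + "/CHECK:" + str(sum(ord(ch) for ch in body)) + chr(0)
print(checksum_ok(record))                    # intact record
print(checksum_ok(record.replace("B", "X")))  # one byte swapped
```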

Please see the post 'How to use this forum - please read.' for details on how to post code so that it's easier for everyone to read.

The lines are very short (total length about 5 cm) through pin headers; there is a 3.3 V to 5.0 V converter in between.

Are the Arduino and the Raspi the only devices on the bus?

Now the problem: from time to time, one byte in the data received from the Arduino is replaced, and its value corresponds to the filler byte. In other words, the byte sent from the Pi to the Arduino is "mirrored" back to the Pi, replacing the correct byte that should have been sent.

This usually happens if the SPI interrupt is delayed too long. Please post the complete code; probably one of the other interrupts blocks for longer than the time you have to fill the SPI data register.
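A rough way to see how tight that timing is: the sketch below computes how long one byte takes on the wire at a few of the clock rates mentioned in this thread and how many CPU cycles pass in that time. It assumes the Mega 2560 runs at 16 MHz, and it deliberately ignores the pause the Pi inserts between separate spi.xfer calls, which is not known here. The result is an order-of-magnitude picture, not a measurement; byte_budget is a hypothetical helper.

```python
F_CPU = 16_000_000  # assumption: Mega 2560 clocked at 16 MHz


def byte_budget(sck_hz, f_cpu=F_CPU):
    """Return (duration of one SPI byte in microseconds,
    CPU cycles that elapse during that byte)."""
    byte_time_us = 8 / sck_hz * 1e6
    cycles = 8 * f_cpu // sck_hz
    return byte_time_us, cycles


for hz in (40_000, 500_000, 4_000_000):
    us, cycles = byte_budget(hz)
    print(f"{hz:>9} Hz: {us:7.1f} us per byte, {cycles:5d} CPU cycles")
```

Any other interrupt routine that runs longer than the idle time between two bytes can keep the SPI routine from reloading the data register in time, which would match the symptom described above.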

Yes, the Arduino and the Raspberry are the only devices connected, and there are no loose cables: the connections are through pin headers and the lines are etched on boards.

I have blocked all the other interrupts on the Arduino and done more testing...

BUT:
I now get close to 0 errors at 40.000, 250.000, 1.000.000, 2.000.000 and 4.000.000 (set on the Raspberry).
When I set the speed to 500.000, the following happens: for about 20-40 cycles I get total garbage 50% of the time (not a single byte replaced incorrectly, but the Arduino returns the whole 500-byte record filled with "X"s), and then it suddenly starts operating without errors.

I attached both the Raspberry and the Arduino files.

The programs are for an extension of an autopilot system for FPV use.

Airsupport_SBUS_11.ino (66.1 KB)

GPSIO_06.txt (40.5 KB)

I am really, really lost now: I have deleted all extra parts of both programs, leaving them bare-bones, and I got the craziest response:

The SPI works perfectly at speeds above 700.000 up to 4.000.000, and below 200.000 down to at least 10.000.
It begins to accumulate errors in the speed range 300.000 - 650.000, reaching the maximum error rate at around 500.000.

I cannot really believe it or explain it. Please help!

Below are the bare-bones programs:

#include <SPI.h>

// -------------------------------------------------------------------------
ISR (SPI_STC_vect) // SPI interrupt routine
{
byte c = SPDR;  // grab byte from SPI Data Register 
SPDR=byte ('Z');
}

// --------------------------------------------------------------------------

void setup() {
  
  Serial.begin (115200);  // Debugging 
 
// --- SPI SETUP  ----    

  pinMode(MISO, OUTPUT);  // have to send on master in, *slave out* 
  SPCR |= _BV(SPE);  // turn on SPI in slave mode  
  SPCR |= _BV(SPIE); // get ready for an interrupt 
  SPI.attachInterrupt(); // now turn on interrupts
}
// --------------------------------------------------------------
void loop() {
        
}

Python:

from math import sin, cos, sqrt, atan2, radians
import math
import sys
import os
import shutil
import re
import spidev
import time
import threading


global packetlength
global packetcounter
global commands

packetlength = 480    # length of the SPI packet
SPI_Frequency=0.2       # send SPI every XX seconds
maxcommandlength=350  # maximum length of commands variable


spierrors=0
spicnt=0


#  ---------------------------------------------------------------------
# SPI PROTOCOL:
# A sentence must begin with ASCII 1 and terminate with ASCII 0


def SPI_Transmit (datastring):
   
    RX=""
    a=spi.xfer ([1]) # first char  = 1 to start sentence
    b=spi.xfer ([packetcounter])  # sequential counter of packets
    c=spi.xfer ([len(datastring)+32])
    RX=chr(b[0])+chr(c[0])
  
    for j in range (0,len(datastring)):
               k=ord(datastring[j])
               z=spi.xfer ([k])
               RX=RX+chr(z[0])
    z=spi.xfer ([13])  # CR 13 - to indicate the end of useful data end of string
    RX=RX+chr(z[0])
    for j in range (0,packetlength-len(datastring)-5):
               k=ord("X")    #  Filler
               z=spi.xfer ([k])
               RX=RX+chr(z[0])
    z=spi.xfer ([0])  # 0- to indicate the end of sentence
    RX=RX+chr(z[0])
    return RX;
# ---------------------------
def SPI_Setup ():
    
    global packetcounter
    global spi
    print ("starting...")
    packetcounter = 32
    spi=spidev.SpiDev()
    spi.open (0,  0)
    spi.mode=0
    spi.max_speed_hz = 200000
    return;
# ----------------------------
def SPI_Communicate ():

    global packetcounter
    global commands
    global spicnt,spierrors
    RX=""
    packetcounter=packetcounter+1
    if packetcounter>127:
       packetcounter=32
    commands=commands[:maxcommandlength] # truncate so as not exceed the buffer length
    k=commands.rfind("/")
    commands=commands[:k]  # adjust to the nearest fully formulated command
    RXdata=SPI_Transmit (commands)  # raw data from mega

    commands="" # clear all the commands entered

    k=RXdata.rfind("/")  # checksum: first, cut off last field
    temp=RXdata[:k]
    rightside=RXdata[k+1:] # obtain value of last field
    k=rightside.rfind(":")
    fieldname=rightside[:k]   # must be "CHECK"
    ke=rightside.find(chr(0))
    checksumreceived=rightside[k+1:ke]
    checksum=0
    for i in range (0,len(temp)):
       checksum=checksum+ord (RXdata[i])
    checksum=str(checksum)
    if not (checksum==checksumreceived):
       print ("Communication Breakdown: checksum:",checksum,checksumreceived)
       print (RXdata,chr(13))
       spierrors=spierrors+1
    else:
       print ("CHECKSUM CORRECT", spicnt,spierrors)
       spicnt=spicnt+1

    t=threading.Timer(SPI_Frequency,SPI_Communicate)
    t.start()
    return;

# ==============================================================================================
commands=""  

SPI_Setup()
SPI_Communicate ()  # First communication session; from there, timer event will automatically call this function

I added two screen images: the first one at a speed of 455.000, the second one at a speed of 4.000.000.

The second one does not have any errors at all (everything is "Z"); the first one has a mix of Z and X...

The testing I did was with this setup.

All this test does is echo the bytes back on the SPI port; I use the R-Pi program from the wget link at the bottom of this page.

Thanks for the answer, but I am not sure if this really applies here... I know the SPI communication is working, I know SPI seems to have close to zero errors at certain speeds (either slow or fast), and I know that at medium speeds I get a 50% error rate...

But WHY?

I wonder if it could be a memory corruption issue? I see String a few times in your program.

Have a look at this and decide if you want to try using "C strings" instead.

I thought about it, but then I completely eliminated all extra features of the Arduino program, so it just receives a byte and sends one back. NO other processing whatsoever...

If this is true, I need to know. In a few weeks I should have some time to do some testing; I think I will modify the test program from the R-Pi documentation to loop a few thousand times and count errors at different speeds (or some such thing).

https://raw.githubusercontent.com/raspberrypi/linux/rpi-3.10.y/Documentation/spi/spidev_test.c
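The same loop-and-count idea can be sketched in Python so that it matches the programs posted earlier in this thread. Everything here is hypothetical: count_bad_packets and fake_transfer_factory are made-up names, and the fake transfer is only a stand-in for the hardware so the sketch runs anywhere. On the real setup, transfer would wrap the spidev calls instead, for example by sending each byte with spi.xfer([b]) after setting spi.max_speed_hz.

```python
import random


def count_bad_packets(transfer, packet, expected, repeats):
    """Send the same packet `repeats` times and count replies that differ."""
    bad = 0
    for _ in range(repeats):
        if list(transfer(packet)) != list(expected):
            bad += 1
    return bad


def fake_transfer_factory(error_rate, seed=0):
    """Stand-in for the hardware: replies 'Z' to every byte, but with
    probability error_rate mirrors one previously sent byte back."""
    rng = random.Random(seed)

    def transfer(packet):
        rx = [ord("Z")] * len(packet)
        if rng.random() < error_rate:
            i = rng.randrange(1, len(packet))
            rx[i] = packet[i - 1]      # mirrored byte replaces the 'Z'
        return rx

    return transfer


packet = [1] + [ord("X")] * 478 + [0]
expected = [ord("Z")] * len(packet)
print(count_bad_packets(fake_transfer_factory(0.0), packet, expected, 100))
print(count_bad_packets(fake_transfer_factory(1.0), packet, expected, 100))
```

Running the counter once per clock rate and tabulating the results would give a direct picture of the error band between the slow and the fast settings.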

I have partially found the culprit, but not the real cause: I exchanged the Pi 3B for a Pi Zero, and the error seems to have disappeared. I have tested it at 250.000, 500.000, 1.000.000, 2.000.000 and 4.000.000, and so far there are no errors. I will do more testing to see whether it was a hardware failure or whether it has something to do with timing (the Zero is much slower)...