SPI master/slave issue

I have an UNO R3 as SPI master and an ATmega 1284P as SPI slave. I'm trying to get some communication going between them. I have started with simple memory read/write. I have run into a strange timing issue that I can not explain. When reading memory from the slave, the first byte of memory returned is the last address byte sent and not the desired memory. However if I insert a small 10 microsecond delay between writing the last address byte and reading the first byte of memory, it works. I am hoping you can take a look and tell me what is going on.

Here is the master code...

#include <SPI.h>
#define READ_CMD 1
#define WRITE_CMD 2

#define SSpin 6

void setup() {
  Serial.begin(19200);
  pinMode(SSpin,OUTPUT);
  digitalWrite(SSpin,HIGH);
  SPI.begin();
  SPI.setClockDivider(SPI_CLOCK_DIV32);
}

void loop() {
  char c;
  
  while (Serial.available() > 0) {
    c = Serial.read();
    if (c=='w') {
      Serial.println("write");
      writeData();
    }
    else if (c=='r') {
      Serial.println("read");
      readData();
    }
  }
}

void writeData() {
  byte d;
  unsigned int address=1;
  
  digitalWrite(SSpin,LOW);
  SPI.transfer(WRITE_CMD);
  SPI.transfer((address >> 8) & 0xff);
  SPI.transfer(address & 0xff);
  for (int i=10;i<20;i++) {
    SPI.transfer(i);
  }
  digitalWrite(SSpin,HIGH);
}

void readData() {
  byte d;
  unsigned int address=1;
  
  digitalWrite(SSpin,LOW);
  SPI.transfer(READ_CMD);
  SPI.transfer((address >> 8) & 0xff);
  d = SPI.transfer(address & 0xff);
  //delayMicroseconds(10);
  for (int i=0;i<10;i++) {
    d = SPI.transfer(0);
    Serial.println(d);
  }
  digitalWrite(SSpin,HIGH);
}

Here is the slave code...

#define READ_CMD 1
#define WRITE_CMD 2

// states
#define COMMAND 1
#define ADDR1 2
#define ADDR2 3
#define DATA 4
#define ERROR 5

#define DATA_SIZE 4096

volatile byte command = 0;
volatile byte state=COMMAND;
volatile unsigned int address;

volatile byte eeprom[DATA_SIZE];

void setup (void) {
  pinMode(MISO, OUTPUT);
  
  // turn on SPI in slave mode
  SPCR |= _BV(SPE);

  // turn on interrupts
  SPCR |= _BV(SPIE);

}

ISR (SPI_STC_vect) {
  byte c = SPDR;
 
  switch (state) {
  case COMMAND:
    command = c;
    SPDR = 0;
    if (command == READ_CMD || command == WRITE_CMD)
      state = ADDR1;
    else
      state = ERROR;
    break;
  case ADDR1:
    address = c;
    SPDR = 0;
    state = ADDR2;
    break;
  case ADDR2:
    address = address << 8 | c;
    address %= DATA_SIZE;
    SPDR = eeprom[address];
    state = DATA;
    if (command == WRITE_CMD)
      break;
  case DATA:
    if (command == READ_CMD) {
      SPDR = eeprom[address];
    }
    else if (command == WRITE_CMD) {
      eeprom[address] = c;
      SPDR = 0;
    }
    ++address %= DATA_SIZE;
    break;
  default:
    SPDR = 0;
    break;
  }
}

void loop (void)
{
  if (digitalRead (SS) == HIGH)
    state = COMMAND;
}

And here is the serial output...

read
1
11
12
13
14
15
16
17
18
19

I can duplicate the issue, but can't explain it. Switched it around so the 1284P is the master and the 328P is the slave (reducing the size of the data array) and it behaves the same.

The write collision bit is getting set in SPSR. Not sure why it happens, or why it doesn't happen on subsequent bytes.

Here is a similar example. The master delays 20µs after each byte.

I also tried SPI.setClockDivider(SPI_CLOCK_DIV128) but it made no difference.

Seems non-standard to require a delay after every byte. Although maybe it is only needed after the first byte as you discovered. Can't figure out what is different with the first byte.

In searching the web, I found one comment that said polling might be faster than using interrupts.

SPI has no handshake, so the master has to allow enough time between transfers for
the slave ISR to complete its work - if you don't leave enough time and the previous
value is still in SPDR of the slave when the master triggers the next transfer, then
that's the value the master gets...

MarkT:
SPI has no handshake, so the master has to allow enough time between transfers for
the slave ISR to complete its work - if you don't leave enough time and the previous
value is still in SPDR of the slave when the master triggers the next transfer, then
that's the value the master gets...

Thanks, Tim, that was sort of the conclusion that I was coming to. Odd in this case that the slave code is fast enough to return nine of ten values and that it only fails on the first. I don't see much difference in the two paths, the code is quite short. Maybe it's just a very close race and hard to call without looking at the assembler code and counting cycles.

Jack, MarkT, thank you for your replies.

I tried polling, but that was worse. I tried writing SPDR sooner...

  case ADDR2:
    //address = address << 8 | c;
    //address %= DATA_SIZE;
    SPDR = eeprom[address];

And that didn't work either. My best guess is that on the master side, the loop overhead gives the slave enough time to respond. Thus the first time into the loop the SPI.transfer() is right on the heels of the previous one. I guess I need to keep the delay in the code. Sigh.

DavidOConnor:
My best guess is that on the master side, the loop overhead gives the slave enough time to respond. Thus the first time into the loop the SPI.transfer() is right on the heels of the previous one. I guess I need to keep the delay in the code. Sigh.

Could definitely be. I might try unrolling the loop, should be a good test of the theory. I also noticed that WCOL was set when the master does a write, but doesn't matter as much maybe. I agree, sigh