[solved, kind of...]: SPI issue with shift registers on a Udoo board

I'm facing a problem with a Udoo board, which has an embedded Arduino Due. I'm using SPI to read 16 push buttons through two shift registers. The code I'm working with works perfectly on an Arduino Uno. Here it is:

#include <SPI.h>

const byte buttonLatch = 9;
const byte numOfButChips = 2; // This is the only modification needed if number of button chips used is changed

byte butArray[numOfButChips + 1];

void setup()
{
  SPI.begin();
  Serial.begin(115200);
  pinMode(buttonLatch, OUTPUT);
  digitalWrite(buttonLatch, HIGH);
}

void loop ()
{
  butArray[0] = 0xc0; // denote the beginning of data stream, necessary for Pd to understand where the data starts
  int index = 1;
  digitalWrite (buttonLatch, LOW);    // pulse the parallel load latch
  digitalWrite (buttonLatch, HIGH);
  for(int i = 0; i < numOfButChips; i++)
    butArray[index++] = SPI.transfer(0);
  Serial.write(butArray, numOfButChips + 1);
  
  delay (10);   // debounce
}  // end of loop

In Pure Data I collect the data as two bytes, which are split into their bits to represent the 16 push buttons.
But with the Due (actually the Udoo) the bytes transferred are bit-shifted by one and wrapped around, so that the last push button of the second chip is read in the first bit of the first byte. There also seems to be a short circuit between the first and last push buttons of the second chip. I've tested the very same circuit, completely unchanged, with an Arduino Uno and there was no short circuit, so the wiring is correct.
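For reference, the frame format used above (one 0xc0 marker byte followed by one byte per chip) can be unpacked into 16 button states roughly as follows. This is a minimal sketch in plain C++ of what the receiving side does, not the actual Pd patch; the helper name unpackButtons and the bit order (bit 0 of the first data byte is button 0) are assumptions made for illustration.

```cpp
#include <stdint.h>

// Start-of-frame marker written by the sketch into butArray[0].
const uint8_t FRAME_MARKER = 0xC0;

// Hypothetical helper: split a 3-byte frame (marker + 2 data bytes) into
// 16 button states. Assumes bit 0 of the first data byte is button 0.
// Returns false if the frame does not start with the marker.
bool unpackButtons(const uint8_t frame[3], bool buttons[16])
{
  if (frame[0] != FRAME_MARKER)
    return false;

  for (int chip = 0; chip < 2; chip++)
    for (int bit = 0; bit < 8; bit++)
      buttons[chip * 8 + bit] = (frame[chip + 1] >> bit) & 1;

  return true;
}
```

Note that a data byte can itself take the value 0xc0, so the marker alone does not uniquely identify the start of a frame.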
Looking for answers, I found the post "Continuous SPI for multiple bytes - Arduino Due" on the Arduino Forum, which mentions and uses DMA SPI (I have no idea what this is), so I tried to use it. I copied the code, changed the loop function, and ended up with this:

// Code used to read two shift register chips, each with 8 push buttons, and transfer the data to Pure Data over serial

#define USE_ARDUINO_SPI_LIBRARY 0
#define USE_NATIVE_SAM3X_SPI 1

#define CS 10
#define SPI_RATE 21

#define SPI_BUFF_SIZE 2
uint8_t rx_buffer[SPI_BUFF_SIZE];
uint8_t tx_buffer[SPI_BUFF_SIZE];

byte butArray[3];

void setup() {
  Serial.begin(115200);
  pinMode(CS,OUTPUT);
  digitalWrite(CS,HIGH);
  spiBegin();
  spiInit(SPI_RATE);
}

void loop() {
  butArray[0] = 0xc0; // denote the beginning of data stream, necessary for Pd to understand where the data starts
  int index = 1;
  digitalWrite(CS,LOW);
  digitalWrite(CS,HIGH);
  spiRec(rx_buffer, SPI_BUFF_SIZE);
  for(int i = 0 ; i < SPI_BUFF_SIZE; i++)
    butArray[index++] = rx_buffer[i];
  
  Serial.write(butArray, 3);
}

After this code I've included the SPI functions from DUEZoo/dmaspi.ino at master · manitou48/DUEZoo · GitHub (taken from the above-mentioned post, starting on line 37). The code compiles and uploads, but when I try to open the serial port in Pure Data (all on the Udoo) it hangs and I get this message in the terminal:

watchdog: signaling pd...

As I don't really understand how to use this DMA SPI, I'm sure I'm doing something (or several things) wrong here. But if there's an issue with Arduino's SPI library on the Arduino Due and this DMA SPI is the solution, I'd be grateful if someone could shed some light on it.

P.S. This forum is much more active than Udoo's, and since this is about the Arduino Due, I thought I'd post here.

Experimenting further with this issue I'm now having some results (both good and bad). I managed to use this code that uses the SAM3X SPI instead of Arduino's SPI (if I got this right) to write some bytes to a couple of shift registers to light up some LEDs. This is the code that works

#define USE_ARDUINO_SPI_LIBRARY 0
#define USE_NATIVE_SAM3X_SPI 1

#define CS 10
#define SPI_RATE 21

#define SPI_BUFF_SIZE 2
uint8_t rx_buffer[SPI_BUFF_SIZE];
uint8_t tx_buffer[SPI_BUFF_SIZE];

void setup() {
  pinMode(CS,OUTPUT);
  digitalWrite(CS,HIGH);
  spiBegin();
  spiInit(SPI_RATE);
  for(int i = 0; i < SPI_BUFF_SIZE; i++)
    tx_buffer[i] = 0x00;
}

void loop() {
  for(int i = 0; i < SPI_BUFF_SIZE; i ++){
    for(int j = 0; j < 8; j++){
      tx_buffer[i] ^= (byte)(1 << j); // toggle bit j (integer shift instead of floating-point pow)
      digitalWrite(CS, LOW);
      spiSend(tx_buffer, SPI_BUFF_SIZE);
      digitalWrite(CS,HIGH);
      delay(200);
      tx_buffer[i] = 0x00;
    }
  }
}

I even managed to send data from Pure Data to the Arduino and control the LEDs by combining the code above with code from Nick Gammon. This is the code:

#define USE_ARDUINO_SPI_LIBRARY 0
#define USE_NATIVE_SAM3X_SPI 1

#define CS 10
#define SPI_RATE 21

#define SPI_BUFF_SIZE 2
uint8_t rx_buffer[SPI_BUFF_SIZE];
uint8_t tx_buffer[SPI_BUFF_SIZE];

const unsigned int MAX_INPUT = 10;
const byte maxLEDs = SPI_BUFF_SIZE * 8;

void refreshLEDs()
{
  digitalWrite(CS, LOW);
  spiSend(tx_buffer, SPI_BUFF_SIZE);
  digitalWrite(CS,HIGH);
}

void process_data (char * data)
  {
  
  // C: clear all bits
  switch (toupper (data [0]))
    {
     case 'C':
        {
        for (int i = 0; i < SPI_BUFF_SIZE; i++) 
          tx_buffer[i] = 0;
        refreshLEDs ();
        return;
        }
  
    // S: set all bits
    case 'S':
        {
        for (int i = 0; i < SPI_BUFF_SIZE; i++) 
          tx_buffer[i] = 0xFF;
        refreshLEDs ();
        return;
        }
    
    // I: invert all bits
    case 'I':
        {
        for (int i = 0; i < SPI_BUFF_SIZE; i++) 
          tx_buffer[i] ^= 0xFF;
        refreshLEDs ();
        return;
        }
    } // end of switch
  
  // otherwise: nnx 
  //   where nn is 01 to maxLEDs (two digits) and x is 0 for off, or 1 for on
      
  // convert first 2 digits to the LED number
  byte led = (data[0] - '0') * 10 + (data[1] - '0');
  
  // convert third digit to state (0 = off)
  byte state = data[2] - '0';  // 0 = off, otherwise on
  
  if (led < 1 || led > maxLEDs) // ignore LED numbers outside 1..maxLEDs (0 would wrap around at led-- below)
      {
      return;
      }
   
   led--;  // make zero relative
   
   // divide by 8 to work out which chip
   byte chip = led / 8;  // which chip
   
   // remainder is bit number
   byte bit = 1 << (led % 8);
   
   // turn bit on or off
   if (state)
     tx_buffer[chip] |= bit;
   else
     tx_buffer[chip] &= ~ bit;
  
  refreshLEDs ();
}  // end of process_data

void setup() {
  pinMode(CS,OUTPUT);
  digitalWrite(CS,HIGH);
  spiBegin();
  spiInit(SPI_RATE);
  for(int i = 0; i < SPI_BUFF_SIZE; i++)
    tx_buffer[i] = 0x00;
  refreshLEDs();
  Serial.begin(115200);
}

void loop() {
  static char input_line [MAX_INPUT];
  static unsigned int input_pos = 0;

  if (Serial.available () > 0) 
    {
    char inByte = Serial.read ();

    switch (inByte)
      {

      case '\n':   // end of text
        input_line [input_pos] = 0;  // terminating null byte
        
        // terminator reached! process input_line here ...
        process_data (input_line);
        
        // reset buffer for next time
        input_pos = 0;  
        break;
  
      case '\r':   // discard carriage return
        break;
  
      default:
        // keep adding if not full ... allow for terminating null byte
        if (input_pos < (MAX_INPUT - 1))
          input_line [input_pos++] = inByte;
        break;

      }  // end of switch

  }  // end of incoming data
}

But trying to read buttons from shift registers simply doesn't work. I came up with this code

#define USE_ARDUINO_SPI_LIBRARY 0
#define USE_NATIVE_SAM3X_SPI 1

#define CS 10
#define SPI_RATE 21

#define SPI_BUFF_SIZE 2
uint8_t rx_buffer[SPI_BUFF_SIZE];
uint8_t tx_buffer[SPI_BUFF_SIZE];

byte butArray[3] = { 0 };

void refreshInput()
{
  digitalWrite(CS, LOW);
  digitalWrite(CS, HIGH);
  spiRec(rx_buffer, SPI_BUFF_SIZE);
}

void setup() {
  pinMode(CS,OUTPUT);
  digitalWrite(CS,HIGH);
  spiBegin();
  spiInit(SPI_RATE);
  Serial.begin(115200);
}

void loop() {
  butArray[0] = 0xc0; // this byte is necessary so that Pd understands that this is the beginning of the data stream
  refreshInput();
  for(int i = 0; i < SPI_BUFF_SIZE; i ++)
    butArray[i + 1] = rx_buffer[i];
  
  Serial.write(butArray, 3);
}

which to my understanding has the same philosophy as the first code of this post. The code compiles and uploads, but again, when I try to open the serial port, Pure Data hangs and I get the message "watchdog: signaling pd..." repeatedly.
Can someone point out what I'm doing wrong here?
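One visible difference from the first sketch in this post: that one ends its loop with delay(10), while the loop above writes a frame on every pass with no pause. Whether this is what makes Pd hang is an assumption, not something established here. A minimal sketch of limiting the frame rate, using a hypothetical helper frameDue that would be called with millis():

```cpp
#include <stdint.h>

// Hypothetical helper: true when at least intervalMs has elapsed since lastMs.
// The unsigned subtraction keeps the comparison valid when the millisecond
// counter wraps around.
bool frameDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs)
{
  return (uint32_t)(nowMs - lastMs) >= intervalMs;
}

// Intended use inside loop() (Arduino-specific, shown as a comment only):
//   static uint32_t lastSent = 0;
//   uint32_t now = millis();
//   if (frameDue(now, lastSent, 10)) {
//     lastSent = now;
//     refreshInput();
//     ... fill butArray and call Serial.write(butArray, 3) ...
//   }
```

With a 10 ms interval this sends at most about 100 frames per second, the same rate as the Uno version.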

Another thing I don't understand is how to manage the CS pin, which in both cases is set to 10. With the Arduino Uno, CS (referred to as SS, Slave Select) is indeed pin 10, but that's used for the shift-out registers, whereas for the shift-in registers I was using pin 9 (taken from Gammon's tutorials as well), which is independent of the SPI (right?). Can I use pin 9 as the latch pin for shifting bits in?
Just a reminder, all this happens on a Udoo board (which has an embedded Arduino Due), and the functions used in the pieces of code above are taken from here DUEZoo/dmaspi.ino at master · manitou48/DUEZoo · GitHub copied from line 37 and below.

Thinking out loud, I'm posting how this issue has evolved. Something quite strange is happening. I've used code that uses Arduino's SPI library to read data from the shift registers and the bit shift problem remained, but I accidentally wired a switch to pin 10 of the shift register (DS, the serial input that normally receives data from the next daisy-chained chip), and that switch showed up in the first bit of that byte. So what actually happens is that the chip's pins have effectively been shifted by one... You can check the chip's datasheet here: http://www.ti.com/lit/ds/symlink/cd74hc165.pdf.
So now pin 10 (DS) works in place of pin 11 (D0), pin 11 in place of pin 12 (D1), D1 in place of D2, etc., and D7 doesn't do anything if it's on the first chip of a daisy chain. If the chip is the second one, its D7 affects the first bit of the first chip's byte; if the chip is the third, its D7 affects the first bit of the second chip's byte, etc. It's some kind of strange wrap-around.
Also, when using a switch on pin 10, the chips that follow drop all their pins low (with pull-up resistors)! If using a switch on pin 10 of the last chip, nothing weird happens in the other chips.
Anyway, I have no idea at all why all this is happening, but I can compensate for my project's needs, so more or less, problem solved.
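One way such a compensation could look in software is sketched below. It assumes the mapping is exactly as described above (every bit arrives one position early, with the next chip's D7 landing in bit 0 of the previous byte) and that the first byte received belongs to the chip wired to MISO. The helper name realignButtons is hypothetical, and this is not necessarily what was done in the project.

```cpp
#include <stdint.h>

// Hypothetical helper: realign two bytes read from the chain, assuming every
// bit arrives one position early (DS read where D0 is expected, the second
// chip's D7 landing in bit 0 of the first byte).
// Bit 15 of the result (D7 of the first chip) is never received, so it stays
// 0; the DS bit of the last chip is dropped.
uint16_t realignButtons(uint8_t first, uint8_t second)
{
  uint16_t raw = (uint16_t)(((uint16_t)first << 8) | second);
  return (uint16_t)(raw >> 1);
}
```

The result holds the first chip's D6..D0 in bits 14..8 and the second chip's D7..D0 in bits 7..0, so only 15 of the 16 inputs are usable this way.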
BTW, is there any chance I'm damaging the shift registers by using pin 10 (DS) this way?
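The pattern described above (DS read in place of D0, the first chip's D7 lost, the second chip's D7 showing up in the first byte) can be reproduced on paper with a small model of the chain. The sketch below is plain C++ that runs on a PC, not on the board. It assumes a 74HC165-style chain that shifts towards the output on every clock edge, and it assumes a master that takes its first sample only after one clock edge has already happened. That second assumption is a hypothesis for illustration, not a measured fact about the Due. The function name readChain is hypothetical.

```cpp
#include <stdint.h>

// Minimal model of two daisy-chained 74HC165-style registers.
// state: the 16 latched inputs; bits 15..8 = D7..D0 of the chip wired to
//        MISO, bits 7..0 = D7..D0 of the chip behind it.
// ds:    level on the DS pin of the last chip in the chain.
// skipFirst: true models a master whose first sample is taken after the
//        chain has already been clocked once (assumption for illustration).
uint16_t readChain(uint16_t state, bool ds, bool skipFirst)
{
  uint16_t result = 0;

  if (skipFirst)
    state = (uint16_t)((state << 1) | (ds ? 1 : 0)); // one clock edge before sampling

  for (int i = 0; i < 16; i++) {
    result = (uint16_t)((result << 1) | ((state >> 15) & 1)); // sample the serial output
    state = (uint16_t)((state << 1) | (ds ? 1 : 0));          // clock edge shifts the chain
  }
  return result;
}
```

In this model, readChain(0x0080, false, true) returns 0x0100 (the second chip's D7 appears in bit 0 of the first byte), readChain(0x8000, false, true) returns 0 (the first chip's D7 is lost), and readChain(0, true, true) returns 0x0001 (DS appears where D0 of the last chip is expected).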