SPI transmit with DMA does not work correctly unless DMAC_CHSR is accessed?

I have encountered a rather peculiar problem while attempting to use DMA to transmit data using the SPI module. I have the following code:

void setup()
{
  // Disable interrupt handling.
  __disable_irq();
  
  
  /* PIN SETUP
    SPI0 MOSI = PA26 peripheral A. */

  // Enable PIOA internal clock supply.
  pmc_enable_periph_clk(ID_PIOA);

  // Set PA26 to peripheral A mode.
  PIO_SetPeripheral(PIOA, EPioType::PIO_PERIPH_A, 1u << 26);

  // Disable pull-up resistor on PA26.
  REG_PIOA_PUDR = 1u << 26;

  // SPI CS0.
  PIO_SetPeripheral(PIOA, EPioType::PIO_PERIPH_A, 1u << 28);
  REG_PIOA_PUDR = 1u << 28;


  /* SPI SETUP
    Using SPI module 0. */
    
  // Enable SPI0 clock supply.
  pmc_enable_periph_clk(ID_SPI0);

  // Disable SPI0 for now, will be enabled once DMAC is set up.
  REG_SPI0_CR = SPI_CR_SPIDIS;

  // Disable SPI0 interrupts.
  REG_SPI0_IDR = 0x70Fu;

  // Set SPI0 to master mode, using CS0, fault protection disabled.
  REG_SPI0_MR = SPI_MR_MSTR | SPI_MR_MODFDIS;

  // Set SPI0 to data on rising clock, 16 bits per transfer, clock = MCK/105, CS-SCK delay is minimum.
  REG_SPI0_CSR = SPI_CSR_NCPHA | SPI_CSR_BITS_16_BIT | 105u << 8 | 1u << 16;


  /* DMAC SETUP
    Using DMAC channel 5 (highest priority), with hardware handshaking interface 1 (SPI0 transmit). */

  // Enable DMAC clock supply.
  pmc_enable_periph_clk(ID_DMAC);

  // Enable DMAC.
  REG_DMAC_EN = DMAC_EN_ENABLE;

  // Disable DMAC channel while we modify settings.
  REG_DMAC_CHDR = DMAC_CHDR_DIS5;

  // Set DMAC to fixed channel priority.
  REG_DMAC_GCFG = DMAC_GCFG_ARB_CFG_FIXED;

  // Disable DMAC interrupts.
  REG_DMAC_EBCIDR = 0x3F3F3F;

  // Data to send over SPI.
  uint32_t data = 0x1117;

  // Set transfer source address.
  REG_DMAC_SADDR5 = reinterpret_cast<uint32_t>(&data);

  // Set transfer destination address.
  REG_DMAC_DADDR5 = reinterpret_cast<uint32_t>(&REG_SPI0_TDR);

  // Set descriptor address to 0 (linked-list mode disabled).
  REG_DMAC_DSCR5 = 0;

  /* Transfer size = 1 (1 word to be transferred in total)
     Source and destination widths = word = 32 bits (SPI_TDR register is 32-bit)
     Source and destination chunk sizes = 1 (1 word is transferred per handshake (request from SPI module)) */
  REG_DMAC_CTRLA5 = 1u | DMAC_CTRLA_SRC_WIDTH_WORD | DMAC_CTRLA_DST_WIDTH_WORD | DMAC_CTRLA_SCSIZE_CHK_1 | DMAC_CTRLA_DCSIZE_CHK_1;

  /* Set linked-list mode disabled, flow controller to memory-to-peripheral controller,
     transfer source address fixed, transfer destination address fixed. */
  REG_DMAC_CTRLB5 = DMAC_CTRLB_SRC_DSCR_FETCH_DISABLE | DMAC_CTRLB_DST_DSCR_FETCH_DISABLE | DMAC_CTRLB_FC_MEM2PER_DMA_FC
                    | DMAC_CTRLB_SRC_INCR_FIXED | DMAC_CTRLB_DST_INCR_FIXED;

  // Set destination handshaking to hardware, peripheral 1 (SPI0 transmit).
  REG_DMAC_CFG5 = DMAC_CFG_DST_H2SEL_HW | (1u << 4);

  // Enable channel.
  REG_DMAC_CHER = DMAC_CHER_ENA5;
  

  delayMicroseconds(1000);

  // Enable SPI0 and trigger DMAC request.
  REG_SPI0_CR = SPI_CR_SPIEN;
}

void loop()
{}

As far as I can tell, this should work fine. Yet while the DMAC transfer does occur and the SPI module transmits data, it transmits the wrong data: always something like 0x0100 or 0x0000, no matter what I put in the source buffer. If I replace the DMA destination address with a normal memory address instead of the SPI_TDR register, I can see that the DMA controller does in fact transfer the correct data. Yet it refuses to work with SPI_TDR.

What's stranger is that the code works exactly as expected if I add a read of the DMAC_CHSR register at the end of setup(). I don't understand how this can possibly make any difference. Nothing in the datasheet says that reading DMAC_CHSR has any effect on the behaviour.

The actual execution of the read doesn't seem to be the cause, though. Adding a considerable delay (e.g. 10 ms) after enabling the SPI and before the read of DMAC_CHSR still makes it work.
It also isn't anything odd at the end of setup() or the beginning of loop(): adding an infinite while loop after the SPI enable still results in the incorrect transmission.
It seems as though the mere presence of a read from DMAC_CHSR somewhere in the code affects the behaviour, which makes me think it could be a compiler effect. But I don't see what the compiler could do to produce this specific result. Alternatively, perhaps it's something similar at the CPU level? Again, I don't see what it could be.

If anyone has any idea what's going on here, that'd be great.
Thanks.

If you follow the programming steps as per 22.4.5.1 Programming Examples for Single-buffer Transfer page 346:

  1. Read the Channel Handler Status Register DMAC_CHSR.ENAx Field to choose a free (disabled)
    channel. (and maybe clear some register)
const uint32_t check_bit = DMAC_CHSR_ENA0 << DMAC_CH;
while( (DMAC->DMAC_CHSR & check_bit) != 0);
  2. Clear any pending interrupts on the channel from the previous DMAC transfer by reading the interrupt status
    register, DMAC_EBCISR.
DMAC->DMAC_EBCISR;
  3. Program the following channel registers:
    …….

Actually, doing exactly what the datasheet says there apparently doesn't affect it.

But I just noticed something even stranger. Adding the read of DMAC_CHSR before my declaration of the source data array results in the incorrect transmission, but adding it after results in the correct transmission. And qualifying that array with volatile results in the correct transmission even with no reads of DMAC_CHSR.
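
The volatile qualification described above can be sketched as follows. This is a minimal host-runnable illustration, not the code from the thread: fake_saddr and fake_dma_read are hypothetical stand-ins for REG_DMAC_SADDR5 and the DMA controller, and making the buffer static is an additional assumption so that its storage outlives setup().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for REG_DMAC_SADDR5. uintptr_t is used so the sketch
// also runs on a 64-bit host; on the SAM3X the register is 32 bits wide.
static volatile std::uintptr_t fake_saddr = 0;

// volatile: the compiler must emit the store of the value to memory.
// static (assumption): the storage stays valid after setup() returns.
static volatile std::uint32_t dma_source = 0;

void setup_source()
{
  dma_source = 0x1117u;
  fake_saddr = reinterpret_cast<std::uintptr_t>(&dma_source);
}

// Models what the DMA controller does later: it reads memory through the
// programmed address, with no knowledge of what the compiler assumed.
std::uint32_t fake_dma_read()
{
  assert(fake_saddr != 0);  // a source address must have been programmed
  return *reinterpret_cast<volatile std::uint32_t*>(fake_saddr);
}
```

On the real part, setup_source() would correspond to initialising the buffer and writing its address to REG_DMAC_SADDR5 before enabling the channel.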

I think we know what's going on here now... The compiler is probably reordering or omitting memory accesses to the source data array. It just so happens that reading DMAC_CHSR acts as some sort of barrier, so the compiler is no longer allowed to mess around with the data array. And the volatile forces the compiler to allocate, initialise and deallocate the array where a human would expect.
Although, adding a data memory barrier (__DMB()) and/or data synchronisation barrier (__DSB()) doesn't fix it... Interesting.
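
One possible explanation for the barrier result (an assumption, not verified against the CMSIS headers used here): __DMB() and __DSB() emit CPU barrier instructions, and unless their definitions also carry a "memory" clobber they do not necessarily stop the compiler itself from dropping or moving the store. A compiler-level barrier is a different tool. The sketch below uses the GCC/Clang inline-asm form. fake_saddr, tx_word, program_source and fake_dma_read are hypothetical names for a host-runnable illustration, and whether this alone fixes the original sketch is untested.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for REG_DMAC_SADDR5 (uintptr_t so it runs on a 64-bit host).
static volatile std::uintptr_t fake_saddr = 0;

// Deliberately not volatile: the compiler barrier below is what pins the store.
static std::uint32_t tx_word = 0;

// GCC/Clang compiler barrier: emits no instruction. The "memory" clobber tells
// the compiler that memory may be read or written here, so it may not keep
// values only in registers across this point or move memory accesses over it.
static inline void compiler_barrier()
{
  __asm__ __volatile__("" ::: "memory");
}

void program_source(std::uint32_t value)
{
  tx_word = value;
  fake_saddr = reinterpret_cast<std::uintptr_t>(&tx_word);
  compiler_barrier();  // the store to tx_word must have been emitted by here
  // On hardware, the channel enable (the REG_DMAC_CHER write) would follow.
}

// Models the DMA controller reading memory through the programmed address.
std::uint32_t fake_dma_read()
{
  assert(fake_saddr != 0);  // a source address must have been programmed
  return *reinterpret_cast<volatile std::uint32_t*>(fake_saddr);
}
```

This barrier only constrains the compiler. On a core with caches or write buffers, a hardware barrier may still be needed in addition.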