Drive DAC with .wav data from an SD card

Hey, I am trying to figure something out, and I have run into the limits of what I can cobble together myself.

I have set up an Arduino R4 and attached an SD card reader module to pins 10 to 13. I started driving the DAC on A0 directly, and used a small breakout amplifier to connect a speaker to the DAC output to get some audio going.
I do not have an oscilloscope or a lot of knowledge about microprocessors. I normally code in C#, and my electronics studies ended over 14 years ago, so forgive some ignorance.

First I had to drive the DAC at 44.1 kHz. I have managed this with a hardware timer, and from what I can tell it is pretty consistent. The timer then drives a small, tight callback that either feeds a sine wave at a fixed frequency or, if the SD card file is found, reads samples from the audio buffer instead.

At first I used analogWrite(A0, DACValue); to try to drive the DAC directly. This was audible, but had a lot of crackling. I then stumbled upon a post from Grumpy_Mike and a discussion there with susan-parker: https://forum.arduino.cc/t/hairy-output-on-r4-dac/1159716/13
I made a few attempts to drive the DAC directly through void CDac::set(uint16_t value), which seems to just outright not work even if you manage to call it (you can directly call void CDac::analogWrite(int value) if you want to, though, for what that is worth).
So that meant I had to decipher the cryptic direct drive code from the topic above.
I can honestly say I don't understand a word of it. I understand to some extent that this is speaking directly to the DAC, and that mishandling it could be dangerous. I also don't understand how to set it to 8-bit mode this way, so I am stuck converting my values to roughly match 12-bit mode instead.
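
On the 8-bit question, the scaling can be done in software without touching the DAC's format register. What follows is a minimal sketch, assuming the DAC stays in the right-justified 12-bit format (codes 0 to 4095) that setupDac() selects; the helper name toDac12 is made up for illustration. Shifting left by 4 (multiplying by 16) spreads an unsigned 8-bit sample over the full 12-bit range, whereas multiplying by 8 only reaches about half of it.

```cpp
#include <stdint.h>

// Hypothetical helper: scale an unsigned 8-bit sample (0..255) to a
// right-justified 12-bit DAC code. Shifting by 4 maps
// 0 -> 0, 128 -> 2048 (mid-scale) and 255 -> 4080.
constexpr uint16_t toDac12(uint8_t sample) {
  return static_cast<uint16_t>(sample << 4);
}
```

In the timer callback this would stand in for audioBuffer[readIndex]*8. It roughly doubles the output swing, so whether that suits the amplifier is something to try rather than assume.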

This all works for outputting a waveform. It sounds very nice, no crackles or pops.
However, audio from SD is a different story.
Loading the buffer seems quick enough. A bigger buffer does not fit in the Arduino's memory, and a smaller one makes the popping slower. I tried using a second buffer to read 512 bytes at once and then memcpy them into the shared audio buffer, but this just seemed to slow the loop down and make the audio artifacts worse.
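
The two-buffer idea can be sketched on its own, away from the hardware. This is a minimal sketch in plain C++, assuming a single producer (the main loop) and a single consumer (the timer callback); the names RingBuffer, filled, freeSpace, push, and pop are made up for illustration. It shows two details that are easy to get wrong: refilling only when the free space (not the amount of unplayed data) is at least one chunk, and splitting the memcpy in two when a chunk crosses the end of the buffer.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical single-producer/single-consumer byte ring buffer.
// One slot is always left empty so "full" and "empty" look different.
template <size_t N>
struct RingBuffer {
  uint8_t data[N];
  volatile size_t writeIndex = 0;  // advanced by the producer only
  volatile size_t readIndex = 0;   // advanced by the consumer only

  // Bytes written but not yet consumed.
  size_t filled() const {
    size_t w = writeIndex, r = readIndex;
    return (w >= r) ? (w - r) : (N - r + w);
  }

  // Bytes that can be written without overwriting unread data.
  size_t freeSpace() const { return N - 1 - filled(); }

  // Copy len bytes in; returns false (and copies nothing) if they do not fit.
  bool push(const uint8_t* src, size_t len) {
    if (len > freeSpace()) return false;
    size_t w = writeIndex;
    size_t first = (len < N - w) ? len : (N - w);  // part before the wrap
    memcpy(data + w, src, first);
    memcpy(data, src + first, len - first);        // remainder, possibly 0 bytes
    writeIndex = (w + len) % N;
    return true;
  }

  // Take one byte out; returns false if the buffer is empty.
  bool pop(uint8_t& out) {
    size_t r = readIndex;
    if (r == writeIndex) return false;
    out = data[r];
    readIndex = (r + 1) % N;
    return true;
  }
};
```

On the board, this would mean reading a block from the file into a small array (for example with fs.read(buf, 512)) and calling push only when freeSpace() is at least that block size.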

At this point, I do not know how to optimize this further, or how to confirm, test, or profile whether what I am doing is too slow, or what I should be looking at.
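
One way to check whether the SD reads keep up, without an oscilloscope, is to count underruns: sample periods where the timer callback wants data but the buffer is empty. For scale, 8-bit mono at 44.1 kHz needs 44,100 bytes per second, so an 8192-byte buffer holds roughly 186 ms of audio. The following is a minimal sketch in plain C++, assuming a ring buffer with separate read and write indices; the names Player, nextSample, and underruns are made up for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical playback-side state for spotting buffer underruns.
struct Player {
  static constexpr size_t kSize = 8;  // small for the demo; 8192 in the sketch
  uint8_t buffer[kSize];
  volatile size_t writeIndex = 0;
  volatile size_t readIndex = 0;
  volatile uint32_t underruns = 0;    // sample periods with no data ready
  uint8_t lastSample = 128;           // mid-scale, i.e. silence for unsigned 8-bit

  // Called once per sample period (from the timer callback on the board).
  uint8_t nextSample() {
    if (readIndex == writeIndex) {    // nothing buffered: the reader caught up
      underruns = underruns + 1;
      return lastSample;              // hold the level instead of replaying old data
    }
    lastSample = buffer[readIndex];
    readIndex = (readIndex + 1) % kSize;
    return lastSample;
  }
};
```

Printing underruns once per second next to counter would show whether the pops line up with an empty buffer. A count that stays at zero while the audio still pops would point away from SD speed and toward the buffer logic.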

The audio format, for anyone curious, is a plain unsigned 8-bit .wav file, meaning you can quite literally read it as is and put it into the DAC if it were in 8-bit mode. I don't even bother to skip the header at this point; it would be trivial to actually parse the format, so I left that for now.
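
Until the header is skipped, its bytes get played as samples at the start of the file. Here is a minimal sketch of locating the sample data, assuming the start of a RIFF/WAVE file is already in memory; the names WavInfo, readU16, readU32, and parseWav are made up for illustration. It walks the chunk list instead of assuming a fixed 44-byte header, since some files carry extra chunks before "data".

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical result of parsing a WAV header.
struct WavInfo {
  uint16_t channels;
  uint32_t sampleRate;
  uint16_t bitsPerSample;
  size_t dataOffset;   // index of the first sample byte
  uint32_t dataSize;   // number of sample bytes
};

// WAV fields are little-endian.
static uint16_t readU16(const uint8_t* p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t readU32(const uint8_t* p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Walk the RIFF chunks in buf[0..len) and fill info from "fmt " and "data".
// Returns false if this is not a WAV file or "data" is not reached.
bool parseWav(const uint8_t* buf, size_t len, WavInfo& info) {
  if (len < 12 || memcmp(buf, "RIFF", 4) != 0 || memcmp(buf + 8, "WAVE", 4) != 0) {
    return false;
  }
  bool haveFmt = false;
  size_t pos = 12;                       // first chunk follows the RIFF header
  while (pos + 8 <= len) {
    const uint8_t* id = buf + pos;
    uint32_t size = readU32(buf + pos + 4);
    size_t body = pos + 8;
    if (memcmp(id, "fmt ", 4) == 0) {
      if (size < 16 || body + 16 > len) return false;
      info.channels = readU16(buf + body + 2);
      info.sampleRate = readU32(buf + body + 4);
      info.bitsPerSample = readU16(buf + body + 14);
      haveFmt = true;
    } else if (memcmp(id, "data", 4) == 0) {
      if (!haveFmt) return false;
      info.dataOffset = body;
      info.dataSize = size;
      return true;
    }
    pos = body + size + (size & 1);      // chunks are padded to an even size
  }
  return false;
}
```

On the board, the first few hundred bytes of the file could be read into a small array, passed to parseWav, and followed by fs.seek(info.dataOffset) before the audio buffer is filled.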

Does anyone have any idea where I should go to try to optimize this further? What tools are there, and what am I missing? Thanks in advance!

I am not liable for the horrors in this code snippet:

#include <FspTimer.h>
#include <SPI.h>
#include <SD.h>

// Special direct registers to drive the DAC. I don't understand any of this.

// 19.2.5 Port mn Pin Function Select Register (PmnPFS/PmnPFS_HA/PmnPFS_BY) (m = 0 to 9; n = 00 to 15)
#define PORTBASE 0x40040000 /* Port Base */
#define P000PFS 0x0800  // Port 0 Pin Function Select Register
#define PFS_P014PFS ((volatile unsigned int *)(PORTBASE + P000PFS + (14 * 4))) // A0 / DAC12
// 12-Bit D/A Converter
#define DACBASE 0x40050000          // DAC Base - DAC output on A0 (P014 AN09 DAC)
#define DAC12_DADR0    ((volatile unsigned short *)(DACBASE + 0xE000))      // D/A Data Register 0 
#define DAC12_DACR     ((volatile unsigned char  *)(DACBASE + 0xE004))      // D/A Control Register
#define DAC12_DADPR    ((volatile unsigned char  *)(DACBASE + 0xE005))      // DADR0 Format Select Register
#define DAC12_DAADSCR  ((volatile unsigned char  *)(DACBASE + 0xE006))      // D/A A/D Synchronous Start Control Register
#define DAC12_DAVREFCR ((volatile unsigned char  *)(DACBASE + 0xE007))      // D/A VREF Control Register
// Low Power Mode Control - See datasheet section 10
#define MSTP 0x40040000 // Module Registers
#define MSTP_MSTPCRD   ((volatile unsigned int   *)(MSTP + 0x7008))        // Module Stop Control Register D
#define MSTPD20 20  // DAC12  - 12-Bit D/A Converter Module

// My defines
#define BUFFER_SIZE 8192
#define CHUNK_SIZE 4092

static FspTimer audio_timer;

const float sampleRate = 44100.0f; // 44.1 kHz
const uint8_t amplitude = 10; // Amplitude of sine wave (0-127 for 8-bit resolution)

// Sine wave variables
const float twoPi = 2.0 * PI;
const float frequency = 400.0f;
const float increment = twoPi * frequency / sampleRate; // Increment per sample
float phase = 0.0;

// File reading variables
File fs;
volatile uint8_t audioBuffer[BUFFER_SIZE];
volatile uint16_t writeIndex;
volatile uint16_t readIndex;

// Debug
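volatile int counter = 0;   // incremented in the timer callback, read in loop()
unsigned long lastTime = 0; // micros() returns unsigned long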

void produceSineWave(){
  // Calculate the sine value for the current phase
  float sineValue = sin(phase);
  // Map the sine value to an 8-bit DAC range (0-255)
  uint8_t dacValue = (uint8_t)((sineValue * amplitude) + 128); // Centered at 128
  // Write the value to the DAC
  *DAC12_DADR0 = dacValue*8;
  // Increment the phase
  phase += increment;
  // Wrap the phase to stay within 0 to 2*PI
  if (phase >= twoPi) {
    phase -= twoPi;
  }
}

void generateAudio(timer_callback_args_t *args) {
  if(fs){
    *DAC12_DADR0 = audioBuffer[readIndex]*8;
    readIndex = (readIndex + 1) % BUFFER_SIZE;
  }
  else{
    produceSineWave();
  }
  counter++;
}

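bool startAudioTimer(float sampleRate) {
  uint8_t timer_type = GPT_TIMER;
  int8_t tindex = FspTimer::get_available_timer(timer_type);
  if (tindex < 0){
    // No free timer: fall back to one normally reserved for PWM
    tindex = FspTimer::get_available_timer(timer_type, true);
  }
  if (tindex < 0){
    return false;
  }
  FspTimer::force_use_of_pwm_reserved_timer();
  // Each step reports failure through its return value
  if (!audio_timer.begin(TIMER_MODE_PERIODIC, timer_type, tindex, sampleRate, 0.0f, generateAudio)) return false;
  if (!audio_timer.setup_overflow_irq()) return false;
  if (!audio_timer.open()) return false;
  return audio_timer.start();
}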

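void openFile(const char* fileName){
  // SD.begin() takes the chip-select pin, not a baud rate.
  // Pin 10 is an assumption based on the module being wired to pins 10 to 13.
  if (!SD.begin(10)) {
    Serial.println("SD init failed");
    return;
  }
  fs = SD.open(fileName, FILE_READ);
  if (!fs) Serial.println("File not found");
}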

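void fillAudioBuffer(){
  if (!fs) return; // no file open: the sine wave path is used instead
  // Snapshot readIndex once; the timer callback can change it at any time
  uint16_t r = readIndex;
  // Bytes written but not yet played
  uint16_t filled = (writeIndex >= r)
                      ? (writeIndex - r)
                      : (BUFFER_SIZE - r + writeIndex);
  // Keep one slot empty so a full buffer looks different from an empty one
  uint16_t freeSpace = BUFFER_SIZE - 1 - filled;
  // Only refill when a whole chunk fits in the free space,
  // so unplayed data is never overwritten
  if (freeSpace >= CHUNK_SIZE) {
    /*
    // Bulk-read attempt; does not handle wrap-around at the end of the buffer
    uint8_t buffer[CHUNK_SIZE];
    int copiedBytes = fs.readBytes(buffer, CHUNK_SIZE);
    memcpy((void*)(audioBuffer + writeIndex), buffer, copiedBytes);
    writeIndex = (writeIndex + copiedBytes) % BUFFER_SIZE;
    */
    for (uint16_t i = 0; i < CHUNK_SIZE; i++) {
        int b = fs.read();
        if (b < 0) break; // end of file or read error
        audioBuffer[writeIndex] = (uint8_t)b;
        writeIndex = (writeIndex + 1) % BUFFER_SIZE;
    }
  }
}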

void setup() {
  Serial.begin(115200);
  openFile("page1.wav");
  fillAudioBuffer();
  setupDac();
  startAudioTimer(sampleRate);
}

void loop() {
  if(micros() - lastTime > 1000000){
    lastTime = micros();
    Serial.println(counter); // Visualize timing
    counter = 0;
  }
  fillAudioBuffer();
}

void setupDac()       // Note make sure ADC is stopped before setup DAC
{
  *MSTP_MSTPCRD &= ~(0x01 << MSTPD20);  // Enable DAC12 module
  *DAC12_DADPR    = 0x00;               // DADR0 Format Select Register - Set right-justified format
  *DAC12_DAADSCR  = 0x00;               // D/A A/D Synchronous Start Control Register - Default
  *DAC12_DAVREFCR = 0x00;               // D/A VREF Control Register - Write 0x00 first - see 36.2.5
  *DAC12_DADR0    = 0x0000;             // D/A Data Register 0 
  delayMicroseconds(10);                // Needed delay - see datasheet
  *DAC12_DAVREFCR = 0x01;               // D/A VREF Control Register - Select AVCC0/AVSS0 for Vref
  *DAC12_DACR     = 0x5F;               // D/A Control Register - 
  delayMicroseconds(5);                 // 
  *DAC12_DADR0    = 0x0800;             // D/A Data Register 0 
  *PFS_P014PFS   = 0x00000000;          // Port Mode Control - Make sure all bits cleared
  *PFS_P014PFS  |= (0x1 << 15);         // ... use as an analog pin
}