Sound lagging problem for wav player with SD card (Mega 2560)

Hi, everybody.
Before I start writing, I will let you know that my English is not good enough.
Maybe some vocabulraries or grammar can be wrong.

By the way, I'm making wav player with SD card not using any sound related libraries.
Using PCM data in wav to make PWM, and convert the PWM to music using LPF & amp. & speaker.

Until now, It can make sound but has some problems.
Tone of the sound is same to original wav file, but only tempo is lagging.

Timer interrrupt happens by 44.1kHz and PCM data is uploaded to OCRnx in every interrupt.
Then OCRnx make PWM(Timer1&3, 62.5kHz, 8-bit fast PWM.)

  • The wav files also have 44.1kHz sampling rate.
    ** I checked the PWM and interrupt frquency with an osciloscope, it was fine(62.5/44.1kHz for each.)

Before I explain more about the problem, here is the code.

#include <SD.h>
#include <SPI.h>
#include <LiquidCrystal.h>

 /* pin number */
#define CS 53
#define SW4 21
#define SW2 20
#define SW1 19
#define PH_B 18
#define PH_A 17
#define lPWM_H 11
#define lPWM_L 12
#define rPWM_H 5
#define rPWM_L 2

#define F_SAMPLE 44.1e3
#define VOL_MAX 1 // unsettled
#define VOL_MIN -5 // unsettled

LiquidCrystal lcd(9,8,7,6,4,3);
File root, wav;
char wavName[20][12];
byte buf[2][3072];
volatile uint32_t bufCnt = 0;
volatile uint32_t bytCnt = 0;
volatile int8_t encCnt = 0;
volatile bool bufFlag = false;
volatile bool bufSelc = false;
uint32_t datSize;
uint16_t timeEnd;
uint16_t timeElp;
uint8_t wavNum;
uint8_t volCtr = 2;
uint16_t bufSum = 0;

void setup() {
  SD.begin(CS);
  lcdInit();
  wavInfo();
  wavInit(0);
  pinInit();
  interruptInit();
}

void loop() {
  // timeUpdate();
  if(bufFlag) bufUpdate();
}

void lcdInit() {
  lcd.begin(16,2);
  lcd.setCursor(5,1);
  lcd.print(":   /   :");
}

void wavInfo() {
  root = SD.open("/");
  wav = root.openNextFile();
  wav.close();
  for(uint8_t i = 0; ; i++) {
    wav = root.openNextFile();
    if(!wav) {
      wav.close();
      wavNum = i;
      return;
    }
    strcpy(wavName[i],wav.name());
    wav.close();
  }
}

void wavInit(int8_t n) {
  bufCnt = 0;
  bytCnt = 0;
  datSize = 0;

  wav = SD.open(wavName[n]);
  wav.seek(40);
  for(uint8_t i = 0; i < 4; i++)
    datSize += (uint32_t)wav.read() << (8 * i);
  timeEnd = datSize / (4 * F_SAMPLE);

  wav.read(buf[0], 3072);
  wav.read(buf[1], 3072);

  lcd.home();
  lcd.print("            ");
  lcd.home();
  lcd.print(wavName[n]);
  lcd.setCursor(11,1);
  lcd.print((timeEnd / 60) / 10, DEC);
  lcd.setCursor(12,1);
  lcd.print((timeEnd / 60) % 10, DEC);
  lcd.setCursor(14,1);
  lcd.print((timeEnd % 60) / 10, DEC);
  lcd.setCursor(15,1);
  lcd.print((timeEnd % 60) % 10, DEC);
}

void pinInit() {
  pinMode(SW4,INPUT);
  pinMode(SW2,INPUT);
  pinMode(SW1,INPUT);
  pinMode(PH_B,INPUT);
  pinMode(PH_A,INPUT);
  pinMode(lPWM_H,OUTPUT);
  pinMode(lPWM_L,OUTPUT);
  pinMode(rPWM_H,OUTPUT);
  pinMode(rPWM_L,OUTPUT);
}

void interruptInit() {
  // external interrupt
  EICRA = 0b10111111; // PH_A: falling edge, SW: rising edge
  EIFR = 0b00001111; // Clear INT3, 2, 1, 0 Flag
  EIMSK = 0b00001111; // Enable INT3, 2, 1, 0

   /* Timer1,3 */
  TCCR1A = 0b10100001;
  TCCR1B = 0b00001001;
  TCCR3A = 0b10100001;
  TCCR3B = 0b00001001;
      // Clear OCnx on compare match
      // 8-bit fast PWM mode (62.5kHz), prescaler = 1
   /* Timer4 */
  TCCR4A = 0b00000000;
  TCCR4B = 0b00001001;
      // CTC mode (TOP = OCR4A) , prescaler = 1
  OCR4A = F_CPU / (1 + F_SAMPLE);
  TCNT4 = 0b00000000; // counter initializtion
  TIFR4 = 0b00000010; // Enable output compare A Match Flag
  TIMSK4 = 0b00000000; // Disable compare match A interrupt

  sei();
}

void timeUpdate() {
  timeElp = (bytCnt + 3072 * bufCnt) / (4 * F_SAMPLE);
  if(timeElp == timeEnd) {
    TIMSK4 &= ~_BV(OCIE4A);
    wav.close();
    wavInit(encCnt);
  }

  lcd.setCursor(3,1);
  lcd.print((timeElp / 60) / 10, DEC);
  lcd.setCursor(4,1);
  lcd.print((timeElp / 60) % 10, DEC);
  lcd.setCursor(6,1);
  lcd.print((timeElp % 60) / 10, DEC);
  lcd.setCursor(7,1);
  lcd.print((timeElp % 60) % 10, DEC);
}

void bufUpdate() {
  bufFlag = false;
  wav.read(buf[bufSelc],3072);
}

ISR(TIMER4_COMPA_vect) {    
  bufSum = buf[bufSelc][bytCnt] + (buf[bufSelc][bytCnt+1] << 8) + 0x8000;
  bufSum >>= volCtr;
  OCR1B = bufSum & 0x00FF;
  OCR1A = (bufSum & 0xFF00) >> 8;
  bytCnt += 2;
    
  bufSum = buf[bufSelc][bytCnt] + (buf[bufSelc][bytCnt+1] << 8) + 0x8000;
  bufSum >>= volCtr;
  OCR3B = bufSum & 0x00FF;
  OCR3A = (bufSum & 0xFF00) >> 8;
  bytCnt += 2;
  
  if( bytCnt == 3072 ) {
    bufFlag = true;
    bytCnt = 0;
    bufCnt++;
    bufSelc = !bufSelc;
  }
}

ISR(INT3_vect) {
  TIMSK4 &= ~_BV(OCIE4A);
  
  switch(digitalRead(PH_B)) {
  case LOW: encCnt++; // CW
    break;
  case HIGH: encCnt--; // CCW
  }
  if(encCnt > wavNum - 1) encCnt = 0;
  else if(encCnt < 0) encCnt = wavNum - 1;

  wav.close();
  wavInit(encCnt);
}

ISR(INT2_vect) {
  TIMSK4 ^= _BV(OCIE4A);
  // Remove noise during pause
  OCR1B = 0; 
  OCR1A = 0;
  OCR3B = 0;
  OCR3A = 0;
}

ISR(INT1_vect) {
  if(volCtr < VOL_MAX) volCtr++;
}

ISR(INT0_vect) {
  if(volCtr > VOL_MIN) volCtr--;
}

After buf[0] is finished, it starts read wav file from SD card in loop().
And interrupt start to read buf[1] to update OCRnx.

As I thought, interrupt keeps happening 44.1kHz during reading 3072 byte from SD.
But now I think that wav.read(buf[bufSelc],3072) stops timer4 interrupt until it is finished.
Every wav.read(,) inserts subtle delay between every 3072 bytes.
So the tone of the sound is OK, but only tempo is lagging.

If my guess is right, how can I fix this code?
Or not, what is the real reason and how can I fix it?

Please give me enlightenment...


P.S.

I found that timer4's one interrupt cycle spends 16~20us,
wav.read(buf[bufSelc],3072) spends 5.55ms, which is 7us per one sample(4-byte.)
One wav sample should spend no more than 22.6ms(44.1kHz.)

If read function and timer interrupt are compatible,
wav.read(buf[bufSelc],3072) should spend time lower than (5.55ms -> )2ms.

Then what is a solution to read wav file much faster then now?
read() in SD.h is not fast enough to play 44.1kHz 16-bit stereo wav file..

I don't know you, but I think 44 KHz 16 bits stereo might be too much for a poor 8-bit CPU running at 16 MHz. However, what I'm gonna propose might make the player interrupt execute a bit faster:

ISR(TIMER4_COMPA_vect) {
  unsigned int sample16 = toUnsignedSample(*((int*)&buf[bufSelc][bytCnt])) >> volCtr;
  byte* s16P = (byte*)&sample16;
  OCR1B = s16P[0];
  OCR1A = s16P[1];
  bytCnt += 2;
   
  sample16 = toUnsignedSample(*((int*)&buf[bufSelc][bytCnt])) >> volCtr;
  OCR3B = s16P[0];
  OCR3A = s16P[1];
  bytCnt += 2;
 
  if( bytCnt == 3072 ) {
    bufFlag = true;
    bytCnt = 0;
    bufCnt++; 
    bufSelc = !bufSelc;
  }
}

unsigned int toUnsignedSample(int s) {
  if (s & 0x8000) return 0x7FFF - (~s); // Is negative
  return 0x8000 | s; // Is positive
}

Why I think this might help?

  • Using local variables inside an ISR should make the compiler place them in some of the CPU's general purpose registers, instead of the RAM's "stack".
  • Uses part of the buffer to directly "reinterpret" its content as a signed 16-bit integer (aka int type); this should eliminate the involved overhead of copying byte by byte, and doing additions/bitwise ORs + bit shifts. This is possible thanks to the "memory pointers", which also do the other way around: split a 16-bit value into two bytes without any bitmath stuff in between.

I don't know if my method of signed to unsigned conversion is any faster than yours, but I think this is as optimized (and simplified) as I can do it (I've always believed that normal numeric comparators are slower than bit operators). Unfortunately, this process (overhead) is unavoidable; so that's why it has to be as quick as possible.

PD: I wish you luck on that, because I've also tried to replicate that way, what TMRpcm does. Even with 8-bit 22 KHz files I'm still having that "drag" on what you call the "tempo".
Nevertheless, after some troubleshooting I've realized the culprit was the buffer filling process: it takes like double or triple of what the player's interrupt takes to go through the other buffer.

I suspect the SD library doesn't configure the SPI port in its fastest possible speed; because I've seen that TMRpcm messes around with a couple of SPI-related registers, that apparently makes it work pretty well even with buffers as small as 64 bytes (128 actually due to being a double buffer).

ISR(TIMER4_COMPA_vect) {
  uint16_t sample16 = (*((uint16_t*)&buf[bufSelc][bytCnt]) + 0x8000) >> volCtr;
  byte* s16P = (byte*)&sample16;
  OCR1B = s16P[0];
  OCR1A = s16P[1];
  bytCnt += 2;
  
  sample16 = (*((uint16_t*)&buf[bufSelc][bytCnt]) + 0x8000) >> volCtr;
  OCR3B = s16P[0];
  OCR3A = s16P[1];
  bytCnt += 2;
 
  if( bytCnt == 3072 ) {
    bufFlag = true;
    bytCnt = 0;
    bufCnt++;
    bufSelc = !bufSelc;
  }
}

Thank you for your help, Lucario488.
Sorry for my late reply.

I changed a little of your code.
toUnsignedSample() is integrated in the interrupt for reducing function overhead.

Before I changed my code, one interrupt routine spends 16~20us.
Now it spends 12~16us.
...But drag still exsists.

According to my calculation, one interrupt cycle should spend time shorter than 4.5us.
Or wav.read(buf[][3072]) shorter than 4600us(now 13400us).

I can't do any better than this...

ajg1202:
toUnsignedSample() is integrated in the interrupt for reducing function overhead.

You also could place that code directly into the ISR, but as I said, I'm not sure if that's any faster than just reinterpret as unsigned and add 0x8000 (32768).

ajg1202:
Before I changed my code, one interrupt routine spends 16~20us.
Now it spends 12~16us.
...But drag still exsists.

I knew it would help, but not that much. You could squeeze 4 us more with assembly, but doesn't worth the hassle.

You can try this in the setup() though:

SPSR |= (1 << SPI2X);
SPCR &= ~((1 <<SPR1) | (1 << SPR0));

Because...

Lucario448:
I suspect the SD library doesn't configure the SPI port in its fastest possible speed; because I've seen that TMRpcm messes around with a couple of SPI-related registers, that apparently makes it work pretty well even with buffers as small as 64 bytes (128 actually due to being a double buffer).

This is what I'm talking about.

ajg1202:
According to my calculation, one interrupt cycle should spend time shorter than 4.5us.
Or wav.read(buf[][3072]) shorter than 4600us(now 13400us).

It's most likely the buffer filler the culprit, so give my new suggestion a try.

Actually, after I post reply, I set the SPI register like you said.

  SPSR |= (1 << SPI2X);
  SPCR &= ~((1 <<SPR1) | (1 << SPR0));

As a result, sound is exactly the same as before..
Also I tried SPI.SPISettings & setClockDriver but nothing changed.

Actually there is a working version which has the same H/W condition.
No drag, little noise.

The one who made that working player advised me, yesterday, that he just used the default setting.
He filled buffers with file.read() in <SD.h> as I did, and did nothing with SPI things.
He said optimizing the interrupt and loop cycle is enough to make it work properly.

I think I can't optimize my code anymore.
So I wonder if my wav.read(buf, 3072) is fast as usual.

That function spends 13.4ms and it is 229KB/s.
But to update a buffer before another buffer finished,
I need at least 667KB/s reading speed.

Is 229KB/s SD reading speed OK?
And is 667KB/s feasible?

ajg1202:
The one who made that working player advised me, yesterday, that he just used the default setting.
He filled buffers with file.read() in <SD.h> as I did, and did nothing with SPI things.
He said optimizing the interrupt and loop cycle is enough to make it work properly.

But... was he working with CD-quality files? You know: 16-bit stereo samples at 44100 Hz.

ajg1202:
I think I can't optimize my code anymore.

No you can't go any further looks like. You could try assembly if you're adventurous enough; but I think it will still make no big difference because the C++ compiler optimizations are actually pretty good.

ajg1202:
So I wonder if my wav.read(buf, 3072) is fast as usual.

It should, or at least we hope that the creator made it as efficient as possible.

ajg1202:
That function spends 13.4ms and it is 229KB/s.

For real? Tests I've made long ago, proved me that the read speed is more like 70 KB/s; being a for loop and the timer0 overflow interrupt (triggered every 1024 us or 1.02 ms) the only known overheads. Data was read 512 bytes at a time.

ajg1202:
But to update a buffer before another buffer finished,
I need at least 667KB/s reading speed.

Is 229KB/s SD reading speed OK?
And is 667KB/s feasible?

First of all, you don't need that much; for CD-quality files you need at least 175 KB/s.
Secondly, read speeds usually are below 100 KB/s; so you may need to cut down the depth to 8 bits anyways.
Thirdly, 667 KB/s is only feasible on microcontrollers with higher clock rate and probably DMA capability. For the AVRs, maybe with sequential "low-level" (no filesystem or partition table handling) reads.

I might be wrong.

microt = micros();

wav.read(buf[0], 3072);

micrott = micros() - microt
Serial.println(micrott);

This is my code for find out the time for read.
It takes 13,420us, 13.4ms for 3072 bytes.
One byte takes 4.36us(13.4m/3072), so 229KB/s

Of course function overhead is there, but 3072-byte reading is long enough to omit the effect, I think.

And 22us{for 1 sample} * (3072 / 4) = 16,896us (whole time for using one buffer.)

Reading and interrupt can't be done simultaneosly, aren't they?
(16,896us - 13,420us) / (3072 / 4) = 4.52us per interrupt.

But I can't make it 16us to 4.52us.
So if wav.read(buf[0], 3072) is 667KB/s, every task is done in 22us.

Did I mistake something in doing calculation?

By the way, he use 4 PWM as output,
left lower byte, left higher byte, right lower byte, right higher byte.
It is 16-bit, stereo, 44.1kHz (which he said, but I don't he lied to me. :confused: )

And really thank you for keep replying my post, Lucario448. :slight_smile:

ajg1202:
It takes 13,420us, 13.4ms for 3072 bytes.
One byte takes 4.36us(13.4m/3072), so 229KB/s

Well, looks like I've made so long ago that test that I no longer remeber where exactly the 70 KB/s came from (or maybe the overhead was bigger than I recalled?).

ajg1202:
Of course function overhead is there, but 3072-byte reading is long enough to omit the effect, I think.

In fact, 3072 bytes should be an overkill (in theory).

ajg1202:
Reading and interrupt can't be done simultaneosly, aren't they?

It's not possible, unless DMA were available.

ajg1202:
(16,896us - 13,420us) / (3072 / 4) = 4.52us per interrupt.

But I can't make it 16us to 4.52us.

Welcome to the limit of an AVR at 16 MHz my friend...

ajg1202:
So if wav.read(buf[0], 3072) is 667KB/s, every task is done in 22us.

For that, you have no other choice than a higher clocked microcontroller.

SD cards in SPI (MMC backwards compatibility aka "legacy") mode, support clock (SCK) rates of up to 20 MHz, but an AVR only can output up to half of the main clock's frequency (so 8 MHz for 16 MHz systems, that's not even 50% of an SD card maximum speed!). 20 MHz is roughly 2.5 MB/s of "raw" speed*!

As you can see, our biggest constraint is the clock frequency (and/or lack of key features); so you may have to abandon the idea of playing back CD-quality audio on an Arduino Mega. Or at least not without trying our really last hope: disabling all timer0 interrupts (it will halt millis() and micros() counters, but is not a big deal since you don't use them actually).

ajg1202:
Did I mistake something in doing calculation?

Nope, you even made me realize the real culprit: the player interrupt itself.

Since there's no DMA, the poor CPU (which isn't multi-cored either) has to handle everything, including I/O transfers (bytewise on hardware serial ports e.g. Serial, bitwise on software implementations e.g. SoftwareSerial).
Due to this previous reason, any interrupt will essentially pause the buffer-filling process; and thus slowing it down overall.

ajg1202:
By the way, he use 4 PWM as output,
left lower byte, left higher byte, right lower byte, right higher byte.
It is 16-bit, stereo, 44.1kHz (which he said, but I don't he lied to me. :confused: )

And it's right: a pair of 8-bit outputs makes a 16-bit one.

Where I'm not convinced is in the 44.1 KHz part; not because you can't spit out samples that often, but because the carrier (PWM in this case) frequency is insufficient to modulate the entire expected spectrum (it's also most of the reason why AM radio stations sound like a telephone call).

ajg1202:
And really thank you for keep replying my post, Lucario448. :slight_smile:

And thanks to you sir, for that praise! :smiley:

*PD: for "raw speed" I mean DMA-driven or with no overhead at all. The effective speed will greatly depend on how much juggling the library has to do.