Arduino Due DSP shield timing question

hirschmensch · September 2, 2015, 9:46am

So I built this codec shield for the Due. It can read 2 channels at 44.1 kHz, 32 bit, send them to the Cortex-M3 via SSC (I²S), read back 2 channels of digital audio data and convert them to analog audio again.
Works perfectly!

But now I want to process the samples I get from the codec on the Arduino Due. Changing volume is no big deal and works perfectly too, but I tried a sample pipeline: I read one sample, put it in the pipe, move every sample in the pipe one position further and then write out the last sample in the pipe. Simple as that. It works perfectly again, but only for about 16 samples. If I try 32 samples I already get digital artifacts on the output, because the loop that moves the samples takes too long (I suppose).

What do you think, is the Cortex M3 THAT SLOW, that it can't even handle a while loop between two sample interrupts?

If I sample at 44.1 kHz, the Due should have about 1 / 44100 ≈ 22.7 µs to process anything until the next sample arrives. Is that too little time for an 84 MHz processor? How can I work around this? Offline processing?
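As a rough sanity check on that budget, the sketch below turns the two figures from the post (84 MHz core clock, 44.1 kHz sample rate) into a cycle count per sample period. The constant and function names are made up for illustration, and the numbers ignore interrupt entry/exit, flash wait states and anything else the core has to do.

```cpp
#include <cstdint>

// Figures quoted in the thread: 84 MHz core clock, 44.1 kHz sample rate.
constexpr uint32_t kCpuHz = 84000000;
constexpr uint32_t kSampleRateHz = 44100;

// Whole CPU cycles available between two consecutive sample frames.
constexpr uint32_t cyclesPerSample() {
  return kCpuHz / kSampleRateHz;
}

// Cycles left per element if a loop has to touch n elements
// inside every single sample period.
constexpr uint32_t cyclesPerElement(uint32_t n) {
  return cyclesPerSample() / n;
}
```

With these figures there are roughly 1900 cycles per sample period, so a loop over 32 elements gets a bit under 60 cycles per element before anything else is accounted for.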

DVDdoug · September 2, 2015, 3:08pm

I really don't know, but I think the Due should be fast enough for that. I assume you're doing 32-bit floating-point multiplication, and that's going to take quite a few CPU cycles (although a volume change is about the simplest DSP-related thing you can do).

I suggest you write some diagnostics. For example, make a loop and count how many simulated volume-changes you can make in 1 second or 10 seconds. Check with just the Due, then add the read/writes to your shield. If you are reading/writing the audio data from/to memory, do a similar read/write test to see if that's the bottleneck.

It might be the way the I²S works, and that might be interfering with the audio sample rate. In that case, a buffer (or a read buffer and a write buffer) may help. (Audio on a multitasking computer only works if you have buffers.) Or, if you are using an I²S library, maybe you can write your own function that synchronizes better with the audio sampling.

hirschmensch:
Works perfectly!

Are you 100% sure that it's working? Are you sure the digital audio stream is actually going through the Due?

P.S.
There is one thing that's simpler than a volume change.... Try a delay.

hirschmensch · September 2, 2015, 4:52pm

Thank you for your response!
I did some diagnostics in the meantime... Volume change works fine, yes. The problem was with the linear sample buffer I implemented (basically a delay). It turns out the Due really is too slow for a linear buffer. I measured the time (using the function "micros()") to see how long it takes to shift every sample of a 512-sample buffer to the next position, and it's about 242 µs. Well, that's too long...

So I switched to a circular buffer (which took some work) and the delay is working fine now.
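For readers who want to see the difference, here is a minimal sketch of a circular-buffer delay line. It is not the code from the shield project; the class name, the 32-bit sample type and the power-of-two length are assumptions made to keep the example short. The point is that each sample costs one read, one write and one index update, regardless of the buffer length.

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-length delay line. N must be a power of two so the index
// can wrap with a bit mask instead of a division or a branch.
template <size_t N>
class DelayLine {
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
  // Store the newest sample and return the one written N samples ago.
  int32_t process(int32_t in) {
    int32_t out = buf_[pos_];     // oldest sample sits at the write position
    buf_[pos_] = in;              // overwrite it with the newest sample
    pos_ = (pos_ + 1) & (N - 1);  // advance and wrap the index
    return out;
  }

private:
  int32_t buf_[N] = {};
  size_t pos_ = 0;
};
```

A `DelayLine<512>` used once per channel per sample replaces the 512-element shift with a constant amount of work.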

hirschmensch · September 2, 2015, 4:57pm

The thing is, I'd like to implement a couple of FIR filters. I might try floating point multiplication but I'm sure it's going to be way too slow.

I noticed that the Cortex M3 supports 32 bit 1 cycle fixed point hardware multiplication. Do you know if any fixed point multiplication in the Arduino IDE will be optimized to use the multiplication unit of the Cortex M3 or do I have to write them in Assembler?
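A minimal sketch of a fixed-point multiply in plain C++, assuming samples and coefficients are in Q31 format (the format and the name `mulQ31` are illustrative, not from the thread). Compilers targeting ARM generally map a widening 32x32 to 64-bit multiply like this onto a long-multiply instruction (SMULL), so assembler is usually not required, but it is worth checking the generated code for your toolchain and optimization level.

```cpp
#include <cstdint>

// Multiply two Q31 values (range -1.0 up to just under +1.0) and
// return the product in Q31.
// Note: -1.0 * -1.0 does not fit in Q31 and is not handled here.
// The right shift of a negative value is arithmetic on GCC/Clang,
// which is assumed here.
inline int32_t mulQ31(int32_t a, int32_t b) {
  int64_t wide = static_cast<int64_t>(a) * b;  // 32x32 -> 64-bit product (Q62)
  return static_cast<int32_t>(wide >> 31);     // back to Q31
}
```

For example, `mulQ31(0x40000000, 0x40000000)` multiplies 0.5 by 0.5 and returns `0x20000000`, which is 0.25 in Q31.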

hirschmensch · September 2, 2015, 5:51pm

So I tried putting multiplications between two samples, and it seems the Due can handle about 55 32-bit multiplications between two 44.1 kHz samples (STEREO!) before audible digital artifacts begin to happen.

Has someone ever tried to do the same and can confirm this to me? Or am I doing something wrong?

pjrc · September 3, 2015, 1:49am

hirschmensch:
What do you think, is the Cortex M3 THAT SLOW, that it can't even handle a while loop between two sample interrupts?

44 kHz audio works great on Teensy 3.1. Some of the audio library features use the Cortex-M4 DSP extensions, but the input and output of audio and stuff like the delay line uses only the same features as Cortex-M3.

Here's a video I made recently, showing a delay line using external RAM.

In this video, you can see the Cortex-M4 is able to simultaneously read 16 bit I2S audio from the mic input, write to the SPI RAM chip, then read 4 copies from the RAM chip, implement 2 software mixers, and write the audio to the I2S output. That uses approx 30% CPU time... which is mostly overhead of the SPI communication to/from the 23LC1024 RAM chip.

Cortex-M3 would struggle to implement some of the computationally heavy stuff, like the filters and FFTs, but for just moving data, if you're properly leveraging DMA transfers, it should be plenty fast enough to handle stereo 44 kHz audio.

hirschmensch · September 3, 2015, 9:33am

Thank you for your answer! That's great input!
I've already watched the video! That's good work!

I've been thinking about using the Teensy 3.1... what kept me from using it is that I'd have to redesign the whole shield I already made for the Due. But you are absolutely right, it would be much better suited for this application, mostly because of its DSP features. It also has a floating point unit, right?

So you are using the built-in ADC/DAC of the Teensy? I'd like to use my Wolfson Codec chip. Do you know if there's an I²S library available for the Teensy3.1? I couldn't find one in a quick search. I just noticed that the Cortex M4 has a built-in I²S interface.

About DMA, can you give me a quick hint about how to use DMA transfer on the Due? I thought this might be done automatically.

hirschmensch · September 3, 2015, 9:35am

Well there is this experimental I²S library for the teensy3.1 but it seems like it only supports receiving at the moment...

hirschmensch · September 3, 2015, 10:32am

omg... why haven't I seen the PJRC Store and the Audio Library (high quality sound input, processing and output on Teensy 3) before... now I really have to reconsider!

pjrc · September 3, 2015, 10:49am

Ah, good... you found the I2S code in the library!

There's actually 2 sets of I2S code, one for master mode, where Teensy creates MCLK, BCLK and LRCLK; and another for slave mode, where the codec chip creates the clocks. I must confess, the slave mode code is probably broken. It's had very little testing, since the SGTL5000 chip on that inexpensive shield is the most commonly used part.

If you haven't used the GUI yet, definitely give this a try:

http://www.pjrc.com/teensy/gui/

The library has so many features, with so many different objects (and more being added as we go) that a graphical tool was needed to visualize how to use them. The GUI also provides the documentation for each object.

Hardware-wise, Teensy 3.1 has the fixed point DSP extensions, but not the FPU for hardware floating point. The audio code doesn't use any floating point in the high speed processing of audio data. It does use floats for the functions presented to the Arduino sketch. Most of those are designed with inline functions, so if you use a float constant (for example, to set the gain on a software mixer channel), the inline function allows the compiler to do the float conversion at compile time.
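To illustrate the idea of a float-facing API with integer-only sample processing, here is a small sketch. `GainStage` is a hypothetical class written for this example and is not the Teensy Audio Library's actual code; the 16.16 fixed-point gain format is an assumption.

```cpp
#include <cstdint>

// Hypothetical gain stage: float API for the sketch author,
// integer math in the per-sample path.
class GainStage {
public:
  // Convert the float gain to 16.16 fixed point once, when it is set.
  // Because the function is inline, a constant argument gives the
  // compiler the chance to fold the conversion at compile time.
  void gain(float g) {
    if (g > 32767.0f) g = 32767.0f;
    if (g < -32767.0f) g = -32767.0f;
    mult_ = static_cast<int32_t>(g * 65536.0f);
  }

  // Per-sample path: integer multiply, shift, clamp to 16 bits.
  int16_t apply(int16_t s) const {
    int32_t v = static_cast<int32_t>((static_cast<int64_t>(s) * mult_) >> 16);
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return static_cast<int16_t>(v);
  }

private:
  int32_t mult_ = 65536;  // 1.0 in 16.16 fixed point
};
```

Calling `gain(0.5f)` stores 32768, after which `apply(1000)` returns 500 without any floating-point work per sample.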

Next year a more powerful Teensy is coming, with FPU and roughly double the clock speed. The audio code will remain fixed point with the DSP extensions, because even with the FPU present, integers with the DSP extensions (and very careful coding) are much faster. That kind of high performance will open up really advanced things like real-time pitch shifting, which requires 4 overlapping FFTs and 4 overlapping inverse FFTs, and phase correction code to match them up, all computed in real time at the audio rate.

But for ordinary audio processing, even with filters and effects, Teensy 3.1 is plenty fast enough.

pjrc · September 3, 2015, 10:58am

For Arduino Due, obviously I have much less experience with its hardware details. I know it has DMA, but Atmel's hardware is different from Freescale's.

I believe the audio library Arduino publishes for Due uses DMA for the built-in DAC. I looked at it briefly about a year ago, but it didn't seem to make effective use of interrupts. To really do audio well, where you allow normal Arduino sketch programming with delays and blocking functions like the Wire library, you need both DMA and interrupts.

hirschmensch · September 6, 2015, 12:40pm

I'm really impressed by the Teensy's DSP capabilities! But I think I'm going to complete this project with the Due for now, since I've already put so much effort into this shield. I'm currently writing my own audio library using interrupts and DMA. I'll see how far this takes me and later migrate to the Teensy.

I'll keep the forum posted with progress using the Due as a DSP.

Thanks again for the great input, Mr. Stoffregen. Your name sounds so German that I'll just say it the German way: thanks for the help and continued success! Best regards from Graz!

Grumpy_Mike · September 9, 2015, 5:33pm

hirschmensch:
... to see how long it takes to shift every sample of a 512-sample buffer to the next position, and it's about 242 µs.

You don't do that; you do it with pointers. You never actually need to move the samples, just move the points where you take a sample out and put one in.

pjrc · September 9, 2015, 8:53pm

One other really nice feature you get with Cortex-M4 (but not on M3) is a special memory burst access mode.

When you write code like this:

  uint32_t *ptr;
  uint32_t a, b, c, d;

  ptr = sample_buffer;
  a = *ptr++;
  b = *ptr++;
  c = *ptr++;
  d = *ptr++;

Normally a load takes 2 cycles.

The Cortex-M4 processor recognizes you're performing 4 similar load instructions in a row. The first load takes 2 cycles, but the following 3 take only 1 cycle each. The result is you can bring 8 audio samples into the CPU registers in only 5 cycles, which is only 52 ns when running at 96 MHz.

Of course, Cortex-M3 does it in 8 cycles instead of 5, which still isn't bad. But then you've got to spend more instructions dealing with the fact you've got pairs of samples in each 32 bit variable.

Cortex-M4 has the DSP extensions, which give you instructions that operate on 16 bit packed data (so each instruction comes in 2 flavors... one that takes input from the top half of a register, the other from the bottom half). Many of those instructions do multiplies that produce a 48 bit result, but discard (optionally rounding off) the low 16 bits. That also really helps for efficiency, since you get lots of internal resolution if you've planned coefficients and other things well, but it doesn't hog pairs of 32 bit registers for results. The key to speed is having 4 to 8 registers free for bringing in an 8 to 16 sample chunk of audio all at once and leveraging those instructions to write fast, non-looping, non-branching code. Then if you loop that code, you only suffer the loop overhead and pointer setup once for each 8-16 sample chunk. Actually writing such code takes a lot of careful thought about how the ARM registers are allocated and which DSP extension instructions to use.... but you can achieve really good performance. For example, most of the objects in the Teensy Audio Library use between 1% and 5% of the CPU.

But even if you're coding on Cortex-M3, similar techniques can at least help. If you write a simple loop that moves 1 audio sample per loop iteration, you'll waste almost all the CPU time in loop overhead. Even though Cortex-M3 doesn't optimize successive load/store instructions, you still reduce looping overhead by using 4 or 8 instructions in a row to fill up the ARM register set on each iteration.
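A small sketch of that unrolling idea in portable C++: four loads followed by four stores per loop iteration, so the loop counter and branch are paid once per four words. The function name and the multiple-of-4 requirement are assumptions for this example, and an optimizing compiler may already unroll a simple loop on its own, so measure before and after.

```cpp
#include <cstddef>
#include <cstdint>

// Copy n 32-bit words, four per loop iteration.
// n is assumed to be a multiple of 4 (e.g. a fixed audio block size).
inline void copyBlock4(uint32_t *dst, const uint32_t *src, size_t n) {
  for (size_t i = 0; i < n; i += 4) {
    // Group the loads so four registers are filled back to back...
    uint32_t a = src[i];
    uint32_t b = src[i + 1];
    uint32_t c = src[i + 2];
    uint32_t d = src[i + 3];
    // ...then group the stores.
    dst[i] = a;
    dst[i + 1] = b;
    dst[i + 2] = c;
    dst[i + 3] = d;
  }
}
```

The same pattern applies when the body does real work (gain, mixing, filtering) instead of a plain copy.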

Grumpy Mike is right, that you shouldn't need to read and write every sample just to shift them in a buffer. Pointer arithmetic can usually accomplish that. But if you're implementing pretty much any sort of actual processing of the audio sample, even just a mixer that adds sets of them with each other (hopefully dealing properly with overflow/clipping), you do need to read and write every single sample. These ARM chips do have pretty amazing performance, but careful design is needed, so you don't squander it with looping and branching overhead!
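As an example of "dealing properly with overflow/clipping", here is a minimal two-input mixer sketch for 16-bit samples. The function names are illustrative; on Cortex-M4 a saturating-add instruction could do the clamp in hardware, while this portable version does it with a wider intermediate and two comparisons.

```cpp
#include <cstddef>
#include <cstdint>

// Add two 16-bit samples in 32 bits, then clamp to the 16-bit range
// so that loud signals clip instead of wrapping around.
inline int16_t mixSaturate(int16_t a, int16_t b) {
  int32_t sum = static_cast<int32_t>(a) + b;
  if (sum > INT16_MAX) sum = INT16_MAX;
  if (sum < INT16_MIN) sum = INT16_MIN;
  return static_cast<int16_t>(sum);
}

// Mix two blocks of n samples into out.
inline void mixBlock(int16_t *out, const int16_t *a, const int16_t *b, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = mixSaturate(a[i], b[i]);
  }
}
```

Without the clamp, 30000 + 30000 would wrap to a large negative value, which is heard as a harsh click rather than ordinary clipping.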

hirschmensch · September 18, 2015, 1:42pm

Grumpy_Mike:
You don't do that, you do it with pointers, you never actually need to move the samples, just move the point where you take it out and put it in.

Obviously, yes. It's called a circular buffer. I was doing a linear pipeline to test the processor.

hirschmensch · September 18, 2015, 1:47pm

pjrc:
One other really nice feature you get with Cortex-M4 (but not on M3) is a special memory burst access mode. [...]

That's amazing! Good to know, thank you!
I am already using DMA for data movement now. If I make these really short functions "inline", it should reduce function call overhead too, at least a bit, I suppose.

pjrc · September 18, 2015, 4:49pm

I still think you ought to give Teensy / Cortex-M4 a try.

You'd save a lot of time starting from a mature & well-tested library.... and you could be working on actual sounds, rather than so much tedious low-level data movement.

hirschmensch · September 19, 2015, 4:33pm

Yeah, for sure! I already have the Teensy 3.1 at home. I'm just gonna finish this project with the Due now.

benbiles · July 17, 2016, 3:05pm

Hi hirschmensch & Paul Stoffregen

I am also using the I2S library with the Arduino Due, but in my case I'm implementing a kind of wireless mic.

I managed to make a circular buffer for samples coming in from I2S, but it doesn't change the fact that the I2S I/O is interrupt-driven and doesn't seem to allow much time for any kind of processing?
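One common way to get more processing time out of an interrupt-driven input is to collect samples into blocks and process a whole block at once in the main loop. The sketch below shows a ping-pong (double) buffer under that assumption. `PingPong`, `kBlock`, `push` and `take` are hypothetical names for this example and are not part of any Due I2S library; on real hardware `push` would be called from the receive interrupt, or the filling would be handed to DMA entirely.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kBlock = 64;  // samples per block (assumed size)

struct PingPong {
  int16_t buf[2][kBlock] = {};
  volatile size_t fill = 0;     // next write index in the active half
  volatile int active = 0;      // half currently being filled
  volatile bool ready = false;  // true when the other half holds a full block

  // Called once per incoming sample, e.g. from the receive interrupt.
  void push(int16_t s) {
    size_t i = fill;
    buf[active][i] = s;
    if (i + 1 == kBlock) {
      fill = 0;
      active = 1 - active;  // swap halves
      ready = true;         // hand the full half to the main loop
    } else {
      fill = i + 1;
    }
  }

  // Called from the main loop. Returns the full block, or nullptr if
  // none is pending. The block must be consumed before the other half
  // fills up, i.e. within kBlock sample periods.
  const int16_t *take() {
    if (!ready) return nullptr;
    ready = false;
    return buf[1 - active];
  }
};
```

With a 64-sample block at 44.1 kHz the main loop gets about 1.45 ms per block instead of about 22.7 µs per sample, at the cost of that much added latency.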

I am trying to encode samples in IMA ADPCM, which isn't very CPU-intensive, and send the processed data over RF (same in reverse on the receiver).

Multithreaded application code is way above my head, and I'm not sure it's possible on an M3!

Do you have any pointers (excuse the pun) on how to use DMA, or is it being used by default along with SSC? Do you have a GitHub repo with any code you're working on, or is it closed source?

Reading the SAM3X8E datasheet section on SSC here
http://www.atmel.com/Images/Atmel-11057-32-bit-Cortex-M3-Microcontroller-SAM3X-SAM3A_Datasheet.pdf

it seems like the SSC is already using DMA at a low level?

"The SSC’s high-level of programmability and its use of DMA permit a continuous high bit rate data transfer without processor intervention. Featuring connection to the DMA, the SSC permits interfacing with low processor overhead to the following: CODEC’s in master or slave mode; DAC through dedicated serial interface, particularly I2S; Magnetic card reader"

I can process chunks of data from the circular buffer using pointers, but is there anything else I can do to improve things? The I2S library is pulling in 32 bits per channel as far as I can see, which is overkill, but I'm chucking 16 bits per channel away directly.

I'm sure the Teensy is way ahead in many ways and it would be easier to use that, but understanding how DMA and SSC work and writing the lower-level code would be useful for me to learn. For now I'll try to skim through Atmel's docs on the SAM3X8E ARM Cortex-M3 and see what I can find. I'm guessing, though, that I would have to entirely rewrite the I2S SSC driver to work with DMA if the current HiFi I2S library is not already utilising it?
