Topic: arduino due DSP shield timing question

hirschmensch

One other really nice feature you get with Cortex-M4 (but not on M3) is a special memory burst access mode.

When you write code like this:

Code: [Select]

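  // sample_buffer holds 16 bit audio samples, packed two per 32 bit word
  uint32_t *ptr;
  uint32_t a, b, c, d;

  ptr = sample_buffer;
  a = *ptr++;   // four loads in a row
  b = *ptr++;
  c = *ptr++;
  d = *ptr++;   // a, b, c, d now hold 8 samples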


Normally a load takes 2 cycles.

The Cortex-M4 processor recognizes you're performing 4 similar load instructions in a row.  The first load takes 2 cycles, but the following 3 take only 1 cycle each.  The result is you can bring 8 audio samples into the CPU registers in only 5 cycles, which is only 52 ns when running at 96 MHz.

Of course, Cortex-M3 does it in 8 cycles instead of 5, which still isn't bad.  But then you've got to spend more instructions dealing with the fact you've got pairs of samples in each 32 bit variable.

Cortex-M4 has the DSP extensions, which give you instructions that operate on 16 bit packed data (so each instruction comes in 2 flavors... one that takes input from the top half of a register, the other from the bottom half).  Many of those instructions do multiplies that produce a 48 bit result, but discard (optionally rounding off) the low 16 bits.  That also really helps for efficiency, since you get lots of internal resolution if you've planned coefficients and other things well, but it doesn't hog pairs of 32 bit registers for results.

The key to speed is having 4 to 8 registers free for bringing in an 8 to 16 sample chunk of audio all at once, and leveraging those instructions to write fast, non-looping, non-branching code.  Then if you loop that code, you only suffer the loop overhead and pointer setup once for each 8-16 sample chunk.  Actually writing such code takes a lot of careful thought about how the ARM registers are allocated and which DSP extension instructions to use.... but you can achieve really good performance.  For example, most of the objects in the Teensy Audio Library use between 1% and 5% of the CPU.
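
To make the "2 flavors" idea a bit more concrete, here is a rough model in plain C of what a 32 bit by 16 bit multiply of that kind computes: the full product is 48 bits wide and only the top 32 bits are kept. This is just a sketch that runs on any compiler, not the real instructions (on Cortex-M4 I believe SMULWB and SMULWT are the ones that behave this way). The function names are made up for illustration, and the code assumes the compiler does an arithmetic right shift on negative values, which GCC and Clang do.

```c
#include <stdint.h>

/* Hypothetical helper names: a plain C model, not the actual DSP instructions. */

/* Multiply a 32 bit value by the signed BOTTOM half of a packed word.
   The product is 48 bits wide; the low 16 bits are discarded. */
static int32_t mul_32x16_bottom(int32_t a, uint32_t packed)
{
    int16_t lo = (int16_t)(packed & 0xFFFFu);   /* bottom sample */
    return (int32_t)(((int64_t)a * lo) >> 16);  /* assumes arithmetic shift */
}

/* Same thing, but using the signed TOP half of the packed word. */
static int32_t mul_32x16_top(int32_t a, uint32_t packed)
{
    int16_t hi = (int16_t)(packed >> 16);       /* top sample */
    return (int32_t)(((int64_t)a * hi) >> 16);  /* assumes arithmetic shift */
}
```

With a coefficient of 65536 (1.0 if you treat it as 16.16 fixed point), each function simply returns its half of the packed word, so one 32 bit load feeds two multiplies without any separate unpacking step.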

But even if you're coding on Cortex-M3, similar techniques can at least help.  If you write a simple loop that moves 1 audio sample per loop iteration, you'll waste almost all the CPU time in loop overhead.  Even though Cortex-M3 doesn't optimize successive load/store instructions, you still reduce looping overhead by using 4 or 8 instructions in a row to fill up the ARM register set on each iteration.
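
As a rough illustration of the difference, here are two ways to move a block of audio in plain C. It is only a sketch: the function names are invented, the unrolled version assumes the word count is a multiple of 4, and a good optimizing compiler may unroll the simple loop on its own, so measure before trusting either.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* One 16 bit sample per iteration: the compare and branch run for every sample. */
static void copy_simple(int16_t *dst, const int16_t *src, size_t samples)
{
    for (size_t i = 0; i < samples; i++) {
        dst[i] = src[i];
    }
}

/* Four packed words (8 samples) per iteration: the loop overhead is paid
   once per 8 samples. Assumes 'words' is a multiple of 4. */
static void copy_unrolled(uint32_t *dst, const uint32_t *src, size_t words)
{
    const uint32_t *end = src + words;
    while (src < end) {
        uint32_t a = *src++;   /* four loads in a row */
        uint32_t b = *src++;
        uint32_t c = *src++;
        uint32_t d = *src++;
        *dst++ = a;            /* four stores in a row */
        *dst++ = b;
        *dst++ = c;
        *dst++ = d;
    }
}
```

Real processing (gain, mixing, filtering) would go between the loads and the stores, working on a, b, c and d while they sit in registers.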

That's amazing! Good to know, thank you!
I am already using DMA for data movement now. If I make these really short functions "inline", it should reduce the function call overhead too, at least a bit I suppose.

Paul Stoffregen

I still think you ought to give Teensy / Cortex-M4 a try.  ;)

You'd save a *lot* of time starting from a mature & well-tested library.... and you could be working on actual sounds, rather than so much tedious low-level data movement.

hirschmensch

Yeah, for sure! I already have the Teensy 3.1 at home. I'm just gonna finish this project with the Due now.

benbiles

#18
Jul 17, 2016, 05:05 pm Last Edit: Jul 31, 2016, 08:39 am by benbiles
Hi hirschmensch & Paul Stoffregen

I am also using the I2S library with the Arduino Due, but in my case implementing a kind of wireless mic.

I managed to make a circular buffer for samples coming in from I2S, but it doesn't change the fact that the I2S I/O is interrupt driven and doesn't seem to allow much time for any kind of processing?

I am trying to encode samples in IMA ADPCM, which isn't very CPU intensive, and send the processed data over RF (same in reverse on the receiver).

Multithreaded application code is way above my head, and I'm not sure it's even possible on an M3!

Do you have any pointers (excuse the pun) on how to use DMA, or is it being used by default along with SSC? Do you have a GitHub with any code you're working on, or is it closed source?

Reading the SAM3X8E datasheet section on SSC here
http://www.atmel.com/Images/Atmel-11057-32-bit-Cortex-M3-Microcontroller-SAM3X-SAM3A_Datasheet.pdf

It seems like the SSC is already using DMA at a low level?

"The SSC's high-level of programmability and its use of DMA permit a continuous high bit rate data transfer without processor intervention. Featuring connection to the DMA, the SSC permits interfacing with low processor overhead to the following: CODEC's in master or slave mode; DAC through dedicated serial interface, particularly I2S; Magnetic card reader"

I can process chunks of data from the circular buffer using pointers, but is there anything else I can do to improve things? The I2S library is pulling in 32 bits per channel as far as I can see, which is overkill, but I'm chucking 16 bits per channel away directly.
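
For reference, the kind of circular buffer described here could look roughly like the sketch below. It is not taken from the I2S library: all the names are invented, it assumes a single producer (the receive interrupt) and a single consumer (the main loop), and it assumes the useful 16 bits sit in the top half of each 32 bit word, which depends on the codec and how the library aligns the data.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256u  /* hypothetical size; must be a power of two */

typedef struct {
    int16_t data[RING_SIZE];
    volatile uint32_t head;  /* only written by the producer (interrupt) */
    volatile uint32_t tail;  /* only written by the consumer (main loop) */
} ring_t;

/* Producer side: keep the top 16 bits of a 32 bit sample (assumption).
   Returns 1 on success, 0 if the buffer is full and the sample was dropped. */
static int ring_push(ring_t *r, int32_t raw32)
{
    uint32_t head = r->head;
    if (head - r->tail >= RING_SIZE) {
        return 0;
    }
    r->data[head & (RING_SIZE - 1u)] = (int16_t)(raw32 >> 16);
    r->head = head + 1u;
    return 1;
}

/* Consumer side: copy out up to 'max' samples in one go.
   Returns the number of samples actually copied. */
static size_t ring_pop_chunk(ring_t *r, int16_t *out, size_t max)
{
    uint32_t tail = r->tail;
    size_t n = (size_t)(r->head - tail);
    if (n > max) {
        n = max;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = r->data[(tail + i) & (RING_SIZE - 1u)];
    }
    r->tail = tail + (uint32_t)n;
    return n;
}
```

The main loop would call ring_pop_chunk() with a block size that suits the encoder and only run the encoder once a whole block is available, so the interrupt itself stays short.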

I'm sure the Teensy is way ahead in many ways and it would be easier to use that, but understanding how DMA and SSC work and writing the lower-level code would be useful for me to learn. For now I'll try to skim through Atmel's docs on the SAM3X8E ARM Cortex-M3 and see what I can find. I'm guessing, though, that I would have to entirely rewrite the I2S SSC driver to work with DMA if the current HIFI I2S library is not already utilising it?








