One other really nice feature you get with Cortex-M4 (but not on M3) is a special memory burst access mode. When you write code like this:

Code:
    uint32_t *ptr;
    uint32_t a, b, c, d;

    ptr = sample_buffer;   /* buffer of packed 16 bit samples */
    a = *ptr++;
    b = *ptr++;
    c = *ptr++;
    d = *ptr++;

Normally a load takes 2 cycles. The Cortex-M4 processor recognizes you're performing 4 similar load instructions in a row. The first load takes 2 cycles, but the following 3 take only 1 cycle each. The result is you can bring 8 audio samples (two 16 bit samples in each 32 bit word) into the CPU registers in only 5 cycles, which is only 52 ns when running at 96 MHz.

Of course, Cortex-M3 does it in 8 cycles instead of 5, which still isn't bad. But then you've got to spend more instructions dealing with the fact you've got pairs of samples in each 32 bit variable.

Cortex-M4 has the DSP extensions, which give you instructions that operate on 16 bit packed data (so each instruction comes in 2 flavors... one that takes input from the top half of a register, the other from the bottom half). Many of those instructions do multiplies that produce a 48 bit result, but discard (optionally rounding off) the low 16 bits. That also really helps for efficiency, since you get lots of internal resolution if you've planned coefficients and other things well, but it doesn't hog pairs of 32 bit registers for results.

The key to speed is having 4 to 8 registers free for bringing in an 8 to 16 sample chunk of audio all at once and leveraging those instructions to write fast, non-looping, non-branching code. Then if you loop that code, you only suffer the loop overhead and pointer setup once for each 8-16 sample chunk. Actually writing such code takes a lot of careful thought about how the ARM registers are allocated and which DSP extension instructions to use.... but you can achieve really good performance. For example, most of the objects in the Teensy Audio Library use between 1% and 5% of the CPU.

But even if you're coding on Cortex-M3, similar techniques can at least help.
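To give a rough feel for the "48 bit result, keep the top 32 bits" multiplies mentioned above, here's a portable C sketch. It only models the arithmetic of the 32x16 multiplies (SMULWB / SMULWT on Cortex-M4); it isn't code from the Teensy Audio Library. The names mul_32x16_bottom, mul_32x16_top and pack_samples are made up for this example, and it assumes the usual two's complement behavior with an arithmetic right shift on signed values.

```c
#include <stdint.h>

/* Hypothetical portable stand-ins for the 32x16 DSP multiplies.
   Each forms the full 48 bit product and keeps only the top 32 bits.
   Assumes >> on a negative value is an arithmetic shift (true for
   GCC and Clang, but implementation-defined in standard C). */

/* Multiply a by the sample held in the bottom half of a packed word. */
static inline int32_t mul_32x16_bottom(int32_t a, uint32_t packed)
{
    int16_t b = (int16_t)(packed & 0xFFFF);       /* low-half sample   */
    return (int32_t)(((int64_t)a * b) >> 16);     /* drop low 16 bits  */
}

/* Multiply a by the sample held in the top half of a packed word. */
static inline int32_t mul_32x16_top(int32_t a, uint32_t packed)
{
    int16_t t = (int16_t)(packed >> 16);          /* high-half sample  */
    return (int32_t)(((int64_t)a * t) >> 16);     /* drop low 16 bits  */
}

/* Pack two 16 bit samples into one 32 bit word:
   lo goes in bits 0-15, hi goes in bits 16-31. */
static inline uint32_t pack_samples(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

For example, a gain of 0x8000 acts as 0.5 because of the 16 bit shift, so it turns a sample of 1000 into 500 and a sample of -1000 into -500, and each result fits in a single 32 bit register.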
If you write a simple loop that moves 1 audio sample per loop iteration, you'll waste almost all the CPU time in loop overhead. Even though Cortex-M3 doesn't optimize successive load/store instructions, you still reduce looping overhead by using 4 or 8 instructions in a row to fill up the ARM register set on each iteration.
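Here's a minimal sketch of that kind of unrolling in plain C, so it applies to either core. The function sum_samples_unrolled is a made-up example (it just adds up every 16 bit sample), and it assumes the buffer length is a multiple of 4 words, i.e. 8 samples per pass. Whether the compiler really keeps a, b, c and d in registers and issues the loads back to back depends on the compiler and optimization settings, so check the generated assembly if it matters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum every 16 bit sample in a buffer of packed
   32 bit words. n_words must be a multiple of 4, so each pass through
   the loop handles 8 samples and pays the loop overhead only once. */
int32_t sum_samples_unrolled(const uint32_t *buf, size_t n_words)
{
    const uint32_t *ptr = buf;
    const uint32_t *end = buf + n_words;
    int32_t sum = 0;

    while (ptr < end) {
        /* Four loads in a row bring 8 samples into local variables. */
        uint32_t a = *ptr++;
        uint32_t b = *ptr++;
        uint32_t c = *ptr++;
        uint32_t d = *ptr++;

        /* Straight-line work on the chunk: split each word into its
           bottom and top halves and accumulate, with no branches. */
        sum += (int16_t)(a & 0xFFFF) + (int16_t)(a >> 16);
        sum += (int16_t)(b & 0xFFFF) + (int16_t)(b >> 16);
        sum += (int16_t)(c & 0xFFFF) + (int16_t)(c >> 16);
        sum += (int16_t)(d & 0xFFFF) + (int16_t)(d >> 16);
    }
    return sum;
}
```

The same shape works for real processing: replace the accumulate lines with whatever per-sample work you need, and keep the four loads grouped at the top of the loop body.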
I still think you ought to give Teensy / Cortex-M4 a try. You'd save a *lot* of time starting from a mature & well-tested library.... and you could be working on actual sounds, rather than so much tedious low-level data movement.