One other really nice feature you get with Cortex-M4 (but not on M3) is a special memory burst access mode.
When you write code like this:
uint32_t *ptr;
uint32_t a, b, c, d;
ptr = sample_buffer;
a = *ptr++;
b = *ptr++;
c = *ptr++;
d = *ptr++;
Normally a load takes 2 cycles. But the Cortex-M4 processor recognizes that you're performing 4 similar load instructions in a row. The first load takes 2 cycles, and the following 3 take only 1 cycle each. Since each 32 bit word holds a pair of 16 bit samples, the result is you can bring 8 audio samples into the CPU registers in only 5 cycles, which is about 52 ns when running at 96 MHz.
Of course, Cortex-M3 does it in 8 cycles instead of 5, which still isn't bad. But then you've got to spend more instructions dealing with the fact that you've got pairs of 16 bit samples packed into each 32 bit variable.
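To make the "pairs of samples" point concrete, here's a rough sketch of the extra unpacking work in plain C. The helper names (sample_lo, sample_hi, unpack_chunk) are made up for illustration, and it assumes a little-endian core where the first sample in memory sits in the low half of each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low half of a packed word (assumed: first sample in memory, little-endian). */
static inline int16_t sample_lo(uint32_t packed) {
    return (int16_t)(packed & 0xFFFFu);
}

/* High half of a packed word (assumed: second sample in memory). */
static inline int16_t sample_hi(uint32_t packed) {
    return (int16_t)(packed >> 16);
}

/* Load 4 words back to back, then split them into 8 separate samples.
   Every split below is an extra instruction a core without packed 16 bit
   arithmetic has to spend before it can do any real work. */
void unpack_chunk(const uint32_t *src, int16_t *dst) {
    assert(src != NULL && dst != NULL);
    uint32_t a = src[0];
    uint32_t b = src[1];
    uint32_t c = src[2];
    uint32_t d = src[3];
    dst[0] = sample_lo(a); dst[1] = sample_hi(a);
    dst[2] = sample_lo(b); dst[3] = sample_hi(b);
    dst[4] = sample_lo(c); dst[5] = sample_hi(c);
    dst[6] = sample_lo(d); dst[7] = sample_hi(d);
}
```

The casts to int16_t rely on the usual two's complement wraparound, which is what GCC and Clang do on ARM.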
Cortex-M4 has the DSP extensions, which give you instructions that operate on 16 bit packed data (so each instruction comes in 2 flavors... one that takes input from the top half of a register, the other from the bottom half). Many of those instructions do multiplies that produce a 48 bit result, but discard (optionally rounding off) the low 16 bits. That also really helps for efficiency, since you get lots of internal resolution if you've planned coefficients and other things well, but it doesn't hog pairs of 32 bit registers for results.
The key to speed is having 4 to 8 registers free for bringing in an 8 to 16 sample chunk of audio all at once, then leveraging those instructions to write fast, non-looping, non-branching code. Then if you loop that code, you only suffer the loop overhead and pointer setup once for each 8 to 16 sample chunk. Actually writing such code takes a lot of careful thought about how the ARM registers are allocated and which DSP extension instructions to use.... but you can achieve really good performance. For example, most of the objects in the Teensy Audio Library use between 1% and 5% of the CPU.
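Here's a rough portable C sketch of that idea, so it can be tried on a PC. The two multiply helpers emulate what I understand the SMULWB and SMULWT instructions to do (32 x 16 multiply, keep the top 32 bits of the 48 bit product). All of the function names, and the 16.16 fixed point gain format (65536 = unity), are assumptions for illustration, not anything taken from the Teensy Audio Library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32 x 16 multiply using the bottom half of `packed`; the low 16 bits of
   the 48 bit product are discarded. Relies on arithmetic right shift of
   negative values, which GCC and Clang provide. */
static inline int32_t mul_32x16b(int32_t a, uint32_t packed) {
    return (int32_t)(((int64_t)a * (int16_t)(packed & 0xFFFFu)) >> 16);
}

/* Same multiply, but using the top half of `packed`. */
static inline int32_t mul_32x16t(int32_t a, uint32_t packed) {
    return (int32_t)(((int64_t)a * (int16_t)(packed >> 16)) >> 16);
}

/* Clamp a 32 bit value into the 16 bit sample range. */
static inline int32_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return x;
}

/* Scale both samples of one packed word by a 16.16 fixed point gain. */
static inline uint32_t scale_pair(int32_t gain, uint32_t packed) {
    int32_t lo = sat16(mul_32x16b(gain, packed));
    int32_t hi = sat16(mul_32x16t(gain, packed));
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Process 8 samples (4 words) per iteration: four loads in a row, then
   straight-line work, so loop overhead is paid once per chunk. */
void scale_buffer(uint32_t *buf, size_t n_words, int32_t gain) {
    assert(n_words % 4 == 0);
    uint32_t *end = buf + n_words;
    while (buf < end) {
        uint32_t a = buf[0];
        uint32_t b = buf[1];
        uint32_t c = buf[2];
        uint32_t d = buf[3];
        buf[0] = scale_pair(gain, a);
        buf[1] = scale_pair(gain, b);
        buf[2] = scale_pair(gain, c);
        buf[3] = scale_pair(gain, d);
        buf += 4;
    }
}
```

On a real Cortex-M4 the multiplies and the clamp would each be a single instruction, so the body of that loop shrinks dramatically compared to what this emulation compiles to on a PC.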
But even if you're coding on Cortex-M3, similar techniques can at least help. If you write a simple loop that moves 1 audio sample per loop iteration, you'll waste almost all the CPU time in loop overhead. Even though Cortex-M3 doesn't optimize successive load/store instructions, you still reduce looping overhead by using 4 or 8 instructions in a row to fill up the ARM register set on each iteration.
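As a sketch of the difference, here are the two loop shapes side by side in plain C. The function names are hypothetical, and the second version assumes the buffer length is a multiple of 4 words (8 samples).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16 bit sample per iteration: the compare, branch and index update
   are paid for every single sample. */
void move_simple(int16_t *dst, const int16_t *src, size_t n_samples) {
    for (size_t i = 0; i < n_samples; i++) {
        dst[i] = src[i];
    }
}

/* Four 32 bit words (8 samples) per iteration: the same loop overhead is
   now spread across 8 samples. */
void move_unrolled(uint32_t *dst, const uint32_t *src, size_t n_words) {
    assert(n_words % 4 == 0);
    const uint32_t *end = src + n_words;
    while (src < end) {
        /* Four loads in a row fill four registers... */
        uint32_t a = *src++;
        uint32_t b = *src++;
        uint32_t c = *src++;
        uint32_t d = *src++;
        /* ...then four stores in a row write them back out. */
        *dst++ = a;
        *dst++ = b;
        *dst++ = c;
        *dst++ = d;
    }
}
```

An optimizing compiler may unroll the simple loop on its own, so it's worth checking the generated assembly before assuming either version is faster.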
Grumpy Mike is right that you shouldn't need to read and write every sample just to shift them in a buffer. Pointer arithmetic can usually accomplish that. But if you're implementing pretty much any sort of actual processing of the audio samples, even just a mixer that adds sets of them with each other (hopefully dealing properly with overflow/clipping), you do need to read and write every single sample. These ARM chips do have pretty amazing performance, but careful design is needed, so you don't squander it with looping and branching overhead!
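For the mixer case, a minimal sketch in portable C might look like this. It clips to the 16 bit range instead of letting the sum wrap around, and it handles 8 samples per loop iteration. The function names are hypothetical, and the buffer length is assumed to be a multiple of 4 words. On Cortex-M4, I believe the QADD16 instruction does the whole saturating add on a packed pair in one go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two samples, clipping to the 16 bit range instead of wrapping. */
static inline int16_t sat_add16(int16_t x, int16_t y) {
    int32_t s = (int32_t)x + y;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Saturating add of both halves of two packed words. */
static inline uint32_t sat_add_pair(uint32_t p, uint32_t q) {
    int16_t lo = sat_add16((int16_t)(p & 0xFFFFu), (int16_t)(q & 0xFFFFu));
    int16_t hi = sat_add16((int16_t)(p >> 16), (int16_t)(q >> 16));
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Mix src into dst, 4 words (8 samples) per iteration. */
void mix_into(uint32_t *dst, const uint32_t *src, size_t n_words) {
    assert(n_words % 4 == 0);
    const uint32_t *end = src + n_words;
    while (src < end) {
        uint32_t a = *src++;
        uint32_t b = *src++;
        uint32_t c = *src++;
        uint32_t d = *src++;
        dst[0] = sat_add_pair(dst[0], a);
        dst[1] = sat_add_pair(dst[1], b);
        dst[2] = sat_add_pair(dst[2], c);
        dst[3] = sat_add_pair(dst[3], d);
        dst += 4;
    }
}
```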