DSP speed - thought experiment

Since I've built a little DSP shield for the Arduino Due and am now doing my first digital FIR filters and other things with it, I'm already running into the limits of its processing speed. Now I'm in the middle of a little thought experiment...

For real-time processing, it's quite clear to me that the speed of the ARM Cortex-M3 limits what I can do. But if things don't have to happen in real time and latency is acceptable, what are the limitations then?

In other words, would it be possible to read a couple of samples (a buffer), process this buffer in more complex ways and then write the buffer out? But still in a continuous flow of buffer blocks (resulting in continuous audio)?

The idea is that I can easily do offline DSP processing. For instance, read a whole music file, do the processing, and write the processed file out again, which obviously doesn't have to finish within the time the song lasts. So maybe I could split a song into buffer blocks, process them, and get digital processing of a higher quality?
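To make the block idea concrete, here's a minimal sketch of what I mean, processing a raw 16-bit file block by block (the file names and the simple gain "processing" are just placeholders; a real audio file would also need header handling):

```cpp
// Offline block processing: read a buffer, process it, write it out, repeat.
#include <cstdio>
#include <cstdint>
#include <cstddef>

const std::size_t BLOCK = 128;

int main() {
    std::FILE *in  = std::fopen("in.raw",  "rb");   // placeholder input file
    std::FILE *out = std::fopen("out.raw", "wb");   // placeholder output file
    if (!in || !out) return 1;

    std::int16_t buf[BLOCK];
    std::size_t n;
    while ((n = std::fread(buf, sizeof buf[0], BLOCK, in)) > 0) {
        for (std::size_t i = 0; i < n; ++i)
            buf[i] = buf[i] / 2;                    // placeholder DSP: -6 dB gain
        std::fwrite(buf, sizeof buf[0], n, out);
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}
```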

Or is it in any case critical to be able to process one sample in the time it takes to read the next one (i.e., within one sample period)?

For a 48 kHz sample rate, this would mean ~20.8 µs, which isn't really much compared to the M3's clock of 84 MHz! Maybe a few hundred MAC operations at most!?
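A quick sanity check of that budget (a minimal sketch; the ~5 cycles per MAC is my assumption for the M3 including load/store overhead, not a measured figure):

```cpp
// Back-of-envelope cycle budget per sample on the Due.
#include <cstdio>

int main() {
    const double f_cpu = 84e6;      // SAM3X8E clock on the Due
    const double f_s   = 48e3;      // sample rate
    const double cycles_per_sample = f_cpu / f_s;    // ~1750
    const double cycles_per_mac    = 5.0;            // assumed, not measured
    std::printf("cycles/sample: %.0f, rough MAC budget: %.0f\n",
                cycles_per_sample, cycles_per_sample / cycles_per_mac);
    return 0;
}
```

So roughly 1750 cycles per sample, and with a multi-cycle MAC that lands in the "few hundred" range.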

I'm eager to hear your thoughts on the subject! I couldn't really find satisfying information or methods on this on the internet. Maybe someone knows about books on this topic?

One more thing...

When you're using a USB audio interface and your PC can't manage to process everything in time, you can increase the interface's buffer size, which results in a larger latency, but everything works fine again! Can this be applied to embedded DSP too?

I'd estimate the Due's processing throughput at 250-400 ksps. Working with block-structured data is easier, so FFT/FHT may approach 400 ksps, or a 200 ksps sampling rate for a stereo signal, and close to 100 ksps for stereo with 50% overlap (required for windowing).
Doing it sample by sample with FIR/IIR: stereo gives 200 ksps, and with a 4-tap filter close to 50 ksps.

Thanks for the quick reply! Seems like a reasonable approximation! So the equation "more latency = more processing power" doesn't quite hold, right?

Btw, I really like your quote "per aspera ad astra"! I had to google it, I admit though. :wink:

hirschmensch:
So the equation "more latency = more processing power" doesn't quite hold, right?

Correct. The gain in performance comes from fewer interrupt calls: in real time it's one per sample, in the block case one per block. So the equation is hyperbolic:
T = m / (k + m), where k is the interrupt overhead and m is the useful calculation, both per sample.
For a block of N samples the overhead is paid once per block: T = (N × m) / (k + N × m), which approaches 1 as the block size N → ∞.
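A tiny sketch to show the shape of that curve (k and m are made-up cycle counts, purely illustrative):

```cpp
// Efficiency T = N*m / (k + N*m) for a few block sizes N.
#include <cstdio>

int main() {
    const double k = 100.0;   // assumed interrupt overhead per call, cycles
    const double m = 20.0;    // assumed useful work per sample, cycles
    const int sizes[] = {1, 8, 32, 128, 1024};
    for (int N : sizes)
        std::printf("N = %4d  ->  T = %.3f\n", N, (N * m) / (k + N * m));
    return 0;
}
```

T climbs from about 0.17 at N = 1 toward 1 as the blocks grow, so most of the win is already captured by modest block sizes.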

I'll try expanding into the frequency domain! Well, at least my software will... :wink:

hirschmensch:
In other words, would it be possible to read a couple of samples (a buffer), process this buffer in more complex ways and then write the buffer out? But still in a continuous flow of buffer blocks (resulting in continuous audio)?

That's exactly how I designed the Teensy Audio Library. It processes audio in 128-sample blocks, which is approximately 2.9 ms at a 44.1 kHz sample rate.

Unfortunately, it won't run directly on the Arduino Due, because it makes heavy use of the Cortex-M4 DSP extensions, which aren't present in the Due's Cortex-M3 processor. It's also built around Freescale's peripherals and DMA engine, which differ from Atmel's.

But as to your original question: most certainly yes, collecting samples into small blocks works very well. If you give this library a try, I believe you'll see it's extremely effective.
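If you want the flavor of the approach without the library, here's a generic ping-pong buffer sketch (this is not the Teensy Audio Library's actual API; adc_isr and process_block are hypothetical names):

```cpp
// Ping-pong blocks: an ADC interrupt fills one buffer while the main
// loop processes the other. Arduino-style, Cortex-M3 friendly.
#include <stdint.h>

#define BLOCK 128

static volatile int16_t bufA[BLOCK], bufB[BLOCK];
static volatile int16_t *capture = bufA;            // buffer being filled
static volatile int16_t *volatile ready = 0;        // full buffer, or 0
static volatile int idx = 0;

// Hypothetical: called once per sample from the ADC interrupt.
void adc_isr(int16_t sample) {
    capture[idx++] = sample;
    if (idx == BLOCK) {            // block full: hand it over and swap
        ready   = capture;
        capture = (capture == bufA) ? bufB : bufA;
        idx = 0;
    }
}

// Hypothetical user DSP; must finish within one block period on average.
static void process_block(volatile int16_t *blk) {
    for (int i = 0; i < BLOCK; ++i)
        blk[i] = blk[i] / 2;       // placeholder: -6 dB gain
}

void loop() {                      // Arduino-style main loop
    if (ready) {
        process_block(ready);
        ready = 0;
    }
}
```

The key constraint: processing one block must finish, on average, before the next block fills up, otherwise the input side overruns.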

Alright, thank you! I'll have to check that library out!
So do you do everything in the frequency domain, or just work with sample blocks in the time and/or frequency domain?

hirschmensch:
So do you do everything in the frequency domain, or just work with sample blocks in the time and/or frequency domain?

If you look at the source code, you'll see nearly all the objects manipulate blocks of time domain samples.

Of course, the FFT and tone detection objects collect groups of time domain blocks and make magnitude-only frequency domain data available to the sketch.

So far, I've not implemented any features by turning time domain data into complex frequency domain data. I might try some later in 2015, particularly real-time pitch shifting.

The latency on a computer is mostly related to the fact that you have a multitasking operating system. Even when you are running only one application, the operating system is multitasking in the background.

The audio input flows into a buffer at a constant rate. Then, when the operating system gets around to it, it reads the buffer in a quick burst. If the computer is busy doing something and the buffer doesn't get read in time, the buffer overflows and you get a glitch.

The output buffer works the opposite way: data is written into the buffer in a quick burst and flows out at a nice constant rate. If the buffer doesn't get refilled in time, you get a buffer underflow and a glitch.

Bigger buffers give the operating system more time to do other things before you get a glitch, but obviously this increases latency.
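The buffer-size/latency trade-off is plain arithmetic; a minimal sketch, assuming a 44.1 kHz sample rate:

```cpp
// Latency contributed by a FIFO of a given size at a fixed sample rate.
#include <cstdio>

int main() {
    const double f_s = 44100.0;                     // sample rate, Hz
    const int sizes[] = {64, 128, 256, 1024};
    for (int frames : sizes)
        std::printf("%5d frames -> %5.2f ms\n", frames, 1000.0 * frames / f_s);
    return 0;
}
```

At 128 frames that's the ~2.9 ms figure quoted above for the Teensy blocks; a 1024-frame buffer already costs over 23 ms per direction.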