Advice for a newbie in DSP: which board could work for me?

Hi all!

I am new to this forum, and to hardware implementation of DSP too (I am already familiar with the theory and a bit of software implementation). I am therefore seeking guidance for my first steps in this domain.

What I am seeking to create is a real time DSP processor for guitar using a development board. Here is the main goal:

  • the guitar signal is sampled at 44100 Hz (I can go a bit lower with the sampling frequency, but not below 32 kHz) and at 16 bits or more.
  • a 1024 tap FIR filter is applied. 1024 input signal values (x(n), x(n-1), ... x(n-1023)) and 1024 coefficients (b0, b1, ..., b1023) are multiplied and added for each output value. Hence the output becomes:
    output(n) = b0*x(n) + b1*x(n-1) + ... + b1023*x(n-1023)
  • as a bonus, it would be nice to be able to store different sets of coefficients, to change the response of the FIR filter.
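
For reference, the sum above can be prototyped on a PC before choosing any hardware. This is only a minimal sketch in Python, not target code: the class name `FirFilter`, the `select` method and the idea of passing several coefficient lists as presets are assumptions made for illustration. It keeps the last N inputs in a circular buffer, so nothing has to be shifted on each new sample.

```python
class FirFilter:
    """Direct-form FIR filter that processes one sample at a time."""

    def __init__(self, coeff_sets, active=0):
        # coeff_sets: list of equal-length coefficient lists (hypothetical presets)
        self.coeff_sets = coeff_sets
        self.active = active
        self.taps = len(coeff_sets[0])
        self.history = [0.0] * self.taps  # circular buffer of past inputs
        self.pos = 0                      # index where the next sample is written

    def select(self, index):
        """Switch to another stored coefficient set."""
        self.active = index

    def step(self, x):
        """Push one input sample x(n) and return one output sample."""
        self.history[self.pos] = x
        coeffs = self.coeff_sets[self.active]
        acc = 0.0
        idx = self.pos
        for b in coeffs:
            acc += b * self.history[idx]  # b[k] * x(n-k)
            # Step one sample back in time, wrapping around the buffer.
            idx = idx - 1 if idx > 0 else self.taps - 1
        self.pos = (self.pos + 1) % self.taps
        return acc


# Feeding a unit impulse returns the coefficients themselves.
fir = FirFilter([[1.0, 2.0, 3.0], [0.5, 0.0, 0.0]])
print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])
```

The inner loop runs once per tap for every output sample, which is why the tap count multiplied by the sampling rate drives the processing requirement.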

Despite searching for info here and there, I am at a loss with microcontrollers and many of the terms used to characterize their performance. Does a board exist which can meet the aforementioned requirements for both the sampling and the processing of the signal?

Thank you in advance,


An Arduino or any other generic microcontroller isn't going to do that.
You will need a processor which has instructions specifically designed for DSP.


If you're interested in staying in the Arduino world, at least as far as being able to utilize
Arduino shields, you might look at some of the boards that use 32-bit ARM processors instead
of ATmegas. I should imagine they will be powerful enough for your app.

Eg, Arduino Due, Digilent chipKit, etc.

Thanks for the answers.

I was looking at the Arduino Due. If I understood correctly, the built-in ADCs can sample the signal at up to 1000 ksample/s at 12 bits. Maybe that could be enough for my application.
Furthermore, this shield could be interesting to boost up the sampling process, even though I am not sure how the interface really works: audio codec sheild for arduino and maple | Open Music Labs
Or I could perform some oversampling if possible, and obtain 14 bits at 62.5kHz.
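
The 14 bits at 62.5 kHz figure follows from the usual rule of thumb that every factor of 4 in oversampling buys roughly one extra bit, provided the noise is random and the samples are averaged and decimated. A rough sketch of that arithmetic (the function name is hypothetical, and the rule is an idealization that a real ADC may not reach):

```python
import math


def oversampling_gain(adc_rate_hz, output_rate_hz, adc_bits):
    """Return (oversampling ratio, effective bits) for averaging-based oversampling."""
    ratio = adc_rate_hz / output_rate_hz
    extra_bits = 0.5 * math.log2(ratio)  # each factor of 4 adds about one bit
    return ratio, adc_bits + extra_bits


ratio, bits = oversampling_gain(1_000_000, 62_500, 12)
print(ratio, bits)  # 16.0 14.0
```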

Any other suggestion concerning the memory (for filter coefficients and sample storage) and processing power?
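
On the memory side, a quick estimate helps. The sketch below assumes 16-bit samples and 16-bit coefficients (2 bytes each); the function name and the number of presets are made up for illustration.

```python
def fir_memory_bytes(taps, bytes_per_value=2, coeff_sets=1):
    """Bytes needed for the sample history plus the stored coefficient sets."""
    history = taps * bytes_per_value
    coefficients = taps * bytes_per_value * coeff_sets
    return history + coefficients


print(fir_memory_bytes(1024))                # one coefficient set
print(fir_memory_bytes(1024, coeff_sets=8))  # eight hypothetical presets
```

Only the sample history and the active coefficient set need to sit in RAM; the other sets could stay in flash and be copied in when selected.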

According to UM0585 (STM, 72 MHz Cortex-M3 CPU; I'd estimate its performance is close to the Due's):

Table 16. 16-bit, 32-tap FIR filter (ASM filter)
cycle count    time
3727           51.76 µs

This is in ASM. 51 usec per output would imply <20 kHz sampling, with 32 taps only. 1024 taps and ~50 ksps sampling would ask for a 2.5 x 32 x 72 MHz = 5,760 MHz (about 5.8 GHz) core.
You probably don't need a 1024-tap FIR; check your math.
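
The scaling above can be reproduced with a few lines. This sketch takes the post's reading of the benchmark at face value, i.e. that the 3727 cycles are the cost of one output sample of a 32-tap filter, and assumes the cost grows linearly with the tap count; the function name is hypothetical.

```python
def required_clock_hz(bench_cycles, bench_taps, target_taps, sample_rate_hz):
    """Scale a measured FIR benchmark linearly to another tap count and sample rate."""
    cycles_per_tap = bench_cycles / bench_taps
    cycles_per_output = cycles_per_tap * target_taps
    return cycles_per_output * sample_rate_hz


clock = required_clock_hz(3727, 32, 1024, 50_000)
print(round(clock / 1e9, 2), "GHz")  # 5.96 GHz
```

The result is slightly higher than the hand estimate because the ratio between 50 ksps and the benchmark rate is not rounded down to 2.5.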

audio codec sheild for arduino and maple | Open Music Labs

Looks like a good board, also has Arduino libraries, so should be easy to interface. You can
use it for sampling, and the Arm controller to do the heavy-duty signal processing.

Have a look at the Teensy 3, which uses an ARM Cortex-M4.
It is easy to “overclock” it at 96MHz instead of the basic 48MHz clock.
The M4 has DSP instructions, such as multiply and accumulate. However, AFAIK the Teensy 3 doesn’t have any software to allow easy access to the DSP instructions so the only way to get at them is with assembler code.
As Magician says, you should check whether you really need a 1024 tap FIR filter. The largest ones I have seen and/or used were on the order of 120 taps.


The 1024 taps are pretty necessary, I could go down to 512 but that's pretty much the lowest limit (impulse responses for guitar cabs must represent a very irregular curve which covers the whole frequency spectrum!).

I found this document which may help in increasing the speed for FIRs and provides some info on the speed of Cortex 3 and 4:
Maybe it could be implemented in the Due, time for some reading!

I'll look into the teensy 3, it seems interesting ... though the arduino due with codec shield is as well!

The DSP concepts PDF that you linked to has a slide titled "Cortex-M4 FIR Performance" which shows how many cpu cycles are required per tap. When you sample at 44100Hz, a DSP chip that can do 1 cycle per tap would require a processor speed of at least 44100*1024 = 45.2MHz to handle 1024 taps. This eliminates the Teensy3 running at 48MHz (you need some head-room).
But the Teensy3 isn't a DSP chip and even if you could get their optimum claimed 1.6 cycles per tap it would require a minimum speed of 72MHz. The Teensy3 overclocking at 96MHz might be able to do that but only if you really know what you're doing. Realistically, I think you might be able to get 3 cycles per tap which is 135MHz. Neither a Teensy3 nor a Due are going to manage that.
If you drop the requirements to 32kHz sampling rate and 512 taps and find out how to use the DSP instructions in the Teensy 3 it should work. I don't think the Due would be able to do it. The PDF says standard C code for an FIR takes 12 cycles per tap so without DSP instructions it would require a 200MHz clock - the Due is 84MHz.
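
The clock estimates in this post all come from multiplying sample rate, tap count and cycles per tap. A small sketch to reproduce them (the function name is made up for illustration, and the result ignores ADC handling, interrupts and any other overhead outside the filter loop):

```python
def min_clock_mhz(sample_rate_hz, taps, cycles_per_tap):
    """Minimum core clock in MHz, counting only the per-tap work of the FIR."""
    return sample_rate_hz * taps * cycles_per_tap / 1e6


cases = [
    ("44.1 kHz, 1024 taps, 1 cycle/tap", 44_100, 1024, 1.0),
    ("44.1 kHz, 1024 taps, 1.6 cycles/tap", 44_100, 1024, 1.6),
    ("44.1 kHz, 1024 taps, 3 cycles/tap", 44_100, 1024, 3.0),
    ("32 kHz, 512 taps, 3 cycles/tap", 32_000, 512, 3.0),
    ("32 kHz, 512 taps, 12 cycles/tap", 32_000, 512, 12.0),
]
for label, fs, taps, cpt in cases:
    print(f"{label}: {min_clock_mhz(fs, taps, cpt):.1f} MHz")
```

Comparing each figure against the 96 MHz overclocked Teensy 3 and the 84 MHz Due shows which combinations leave any headroom.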

Good luck.


I'm probably missing something, but IMHO the linked doc is just absurd.

DSP assembly code = 1 cycle

I want to see this DSP. Real world CPU, doesn't matter is it DSP or not, would have to load one register with a coefficient, another with a data sample, then multiply and accumulate, store the result, then repeat. Shifting data back and forth between memory (best scenario RAM) and registers (a real-world CPU doesn't have 512 - 1024 registers to keep everything at hand, and there aren't 512 multipliers to do the math in parallel) alone would require ~10 instructions: referencing the address, getting the pointer, direct/indirect etc. The 3727 clock cycles per 32-tap FIR reported by STM give about 116 cycles per tap, more than my highly optimistic 10, and I think that's very good. 512 x 3727/32 = 59632 cycles; to fit those in 20 usec (50 ksps) you need a 2.98 GHz core.

doesn't matter is it DSP or not

It does matter, a lot. Processors which are specifically designed to do DSP have special instructions and memory architectures which allow them to overlap a lot of functions.
The DSP processor I used to use was an ADSP 2181. It has a 33MHz clock but with proper programming it could perform three operations (of the right type) simultaneously so that for many DSP operations it was effectively running at nearly 100MHz. It has an architecture which allows it to multiply and accumulate two operands while simultaneously fetching the next two operands. It also has a zero-overhead DO loop which means that once the count is initialized you can repeat an instruction or group of instructions without having the overhead of counting and testing a register. This is particularly useful for an operation like the multiply-accumulate (MAC) required by a FIR filter. The MAC and fetching of the next operands can be done in one instruction and the instruction can be repeated N times (the length of the filter) without any looping overhead that you would have in a normal processor. This means that one pass through a 100-tap FIR filter can be done in about 100 instruction cycles, which is roughly three microseconds at 33MHz.
The Cortex-M3 processor used by the Due does not have any special DSP instructions. The Cortex-M4 processor used by the Teensy3 has multiply and accumulate instructions (and some others) but it does not have the zero-overhead DO type of instruction.


AFAIK, cortex-m3 has MLA, not sure about DUE, but the STM32 referenced above has two, MLA and MLS. Do-while or For loop overhead isn't really a problem, data flow is. Fetching new operands while calculating with the present ones only works when the memory data bus is as fast as the CPU. From what I can see, the report specifically mentions "wait states": there are 2 of them at 72 MHz, 1 at 48 MHz and 0 at 24 MHz. Does the Teensy 3 also have wait states?

AFAIK, cortex-m3 has MLA, not sure about DUE,

The DUE is an M3. The M3 has MLA and MLS but they take one or two cycles. The M4 has those instructions plus a lot of other DSP-related instructions all of which complete in one cycle and which are not available on the M3.
See Page 6 of