Processor Selection for Audio Analysis

I'm looking for a processor to run FFT and digital filtering on music for beat detection, covering 40 Hz to 10 kHz, for a potential lighting product. The key requirement is fast response: no more than 10 ms per measurement loop, preferably around 5 ms. The plan is to combine Goertzel/IIR filters for specific frequencies or narrow ranges with a wide-bin FFT for the rest of the spectrum. Capturing and analyzing a 40 Hz signal takes at least one cycle (25 ms), but I'm mainly looking for signal presence, not amplitude accuracy, so I'm hoping to detect an amplitude change in less than a full cycle.
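For reference, here is a minimal sketch of a single-frequency Goertzel detector of the kind described above, written in plain C++ so it could be adapted into an Arduino sketch. The function names, the float sample format, and the block size and sample rate in the usage note are illustrative assumptions, not existing code or any library's API.

```cpp
#include <cmath>
#include <cstddef>

// Goertzel power estimate for one target frequency over a block of samples.
// Hypothetical helper: name and signature are assumptions for illustration.
float goertzelPower(const float* samples, std::size_t n,
                    float targetHz, float sampleRateHz) {
    const float kPi = 3.14159265358979f;
    const float omega = 2.0f * kPi * targetHz / sampleRateHz;
    const float coeff = 2.0f * std::cos(omega);

    float s1 = 0.0f;  // state from one sample ago
    float s2 = 0.0f;  // state from two samples ago
    for (std::size_t i = 0; i < n; ++i) {
        const float s0 = samples[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the response at targetHz (no square root needed
    // when only comparing against a presence threshold).
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

// Hypothetical test helper: fill a buffer with a unit-amplitude sine wave.
void fillSine(float* out, std::size_t n, float hz, float sampleRateHz) {
    const float kPi = 3.14159265358979f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sin(2.0f * kPi * hz * static_cast<float>(i) / sampleRateHz);
    }
}
```

As a usage example under assumed numbers: with an 8 kHz sample rate and an 80-sample block (10 ms), a unit-amplitude 1 kHz tone gives a power near (80/2)^2 = 1600 at 1 kHz and close to zero at 3 kHz. The cost is one multiply and two adds per sample per frequency, which is why it suits a handful of target frequencies better than a full FFT does.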

From experimentation, an Arduino/Nano just doesn't have the horsepower or memory for FFT or digital filters. I have used analog filters and peak detectors, but they take too much effort to implement in any quantity. Since it's familiar and I have existing code, I want to stay with the Arduino IDE, and I'm looking at either an "enhanced" Arduino (Due, M0) or a Teensy 3.2. I expect these processors can all run existing Goertzel/IIR and FFT algorithms, but I don't know how flexible they are with respect to sample rates and FFT sizes. The Teensy audio library seems tailored to the 44.1 kHz sample rate, so it may not be very flexible.
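To make the sample-rate and FFT-size question concrete, here is a rough sketch of the arithmetic that ties them to the timing budget: bin width is sample rate divided by FFT size, and the capture window is FFT size divided by sample rate. The function names are hypothetical, and the figures in the usage note assume the roughly 44 kHz rate mentioned above is 44.1 kHz.

```cpp
#include <cstddef>

// Width of one FFT bin in Hz for a given sample rate and FFT length.
constexpr double binWidthHz(double sampleRateHz, std::size_t fftSize) {
    return sampleRateHz / static_cast<double>(fftSize);
}

// Time needed to capture one FFT block, in milliseconds.
constexpr double windowMs(double sampleRateHz, std::size_t fftSize) {
    return 1000.0 * static_cast<double>(fftSize) / sampleRateHz;
}
```

Under those assumptions, a 256-point FFT at 44.1 kHz takes about 5.8 ms to capture and has bins about 172 Hz wide, while a 512-point FFT takes about 11.6 ms with bins about 86 Hz wide. A window short enough for the 5 to 10 ms target therefore yields wide bins, which is consistent with using the FFT only for the broad upper spectrum and separate filters for the low frequencies.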

Any suggestions on what processor would be better for these needs?