1. Record sound (sentences, not too long, maybe 5 seconds max)
I would like to stream this to the server.
That will really slow down the rate at which you can sample data. To the point where "forget it" rears its ugly head.
1. The quality of the audio is not of great importance. I can filter it a bit on the server side, but it needs to be good enough to push through voice recognition (speech-to-text).
The faster Due would be a better choice, since it also has a lot more SRAM, which is where you would store the sampled data before sending it.
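To put rough numbers on the SRAM point, here is a back-of-the-envelope sketch. The 8 kHz sample rate is my assumption (telephone-quality speech), not something stated in the thread, and the helper names are made up for illustration. The SRAM sizes are the datasheet totals (2 KB on the Uno's ATmega328P, 96 KB on the Due's SAM3X8E), so the real usable space is smaller once the stack and other variables are counted.

```cpp
#include <cassert>
#include <cstddef>

// Datasheet SRAM totals; usable space is less in practice.
constexpr std::size_t kUnoSramBytes = 2 * 1024;   // ATmega328P
constexpr std::size_t kDueSramBytes = 96 * 1024;  // SAM3X8E

// Assumed sample rate for speech (hypothetical choice, not from the thread).
constexpr std::size_t kSampleRateHz = 8000;

// Bytes needed to buffer `seconds` of mono audio.
constexpr std::size_t bufferBytes(std::size_t sampleRateHz,
                                  std::size_t bytesPerSample,
                                  std::size_t seconds) {
    return sampleRateHz * bytesPerSample * seconds;
}

// Whole seconds of audio that fit if the entire SRAM were one buffer.
inline std::size_t maxWholeSeconds(std::size_t sramBytes,
                                   std::size_t sampleRateHz,
                                   std::size_t bytesPerSample) {
    assert(sampleRateHz > 0 && bytesPerSample > 0);
    return sramBytes / (sampleRateHz * bytesPerSample);
}

// A 5 second clip with samples truncated to 8 bits, and with 16-bit storage.
constexpr std::size_t kClip8Bit  = bufferBytes(kSampleRateHz, 1, 5);  // 40000 bytes
constexpr std::size_t kClip16Bit = bufferBytes(kSampleRateHz, 2, 5);  // 80000 bytes
```

Under those assumptions a 5 second clip is about 40,000 bytes at 8 bits per sample, which is far beyond the Uno's 2 KB but fits on the Due. With 16-bit storage it is 80,000 bytes, which only just squeezes under the Due's 96 KB total and leaves very little for anything else.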
You might be better off starting with a computer that has the processing power to do the voice recognition locally.
Read the FFT reference on the Library page.