Speech/Voice Recognition

Hello everyone!

My name is André, and I need some help to set up a project, using Arduino, to recognize sounds, voices and, more specifically, speech. And I need to do this without any shields, or servers! Just a frequency analysis, using FFT for example.
For my class project, I have to do something like this and compare the efficiency with a system using a shield, like EasyVR 3.0.
I've been searching around and some members have already posted something like that, however the files on the links were no longer available.
I have to finish this by the end of the semester, so I'm in a hurry! If any of you could help me, or send me examples, algorithms to study, etc., I would be very grateful. Working sketches would help a lot too, since I could study the code line by line.

Thank you very much in advance! Please help me :frowning:

André

One more thing!

Without the shield or the server, it can be simple digital signal processing, just enough to light up an LED. For example, I say "green" and a green LED lights up; that would suffice with a stand-alone Arduino. Obviously, if a sketch could do something more it would be better, but this is just to make the job easier for you guys...

See this tutorial for speech/voice recognition:

https://create.arduino.cc/projecthub/msb4180/speech-recognition-and-synthesis-with-arduino-2f0363

Oh, sorry, I have to do this without servers or shields, only with the Arduino. I would like to do some FFT and spectrum analysis.
Do you have some example code with FFT or something like that?

The FFT won't help you with voice recognition, but if you want to learn something about it, check out ArduinoFFT - Open Music Labs Wiki.

Arduino is totally unsuited for voice recognition.

I have already seen something like that around here, but, as I said, the files were no longer available on the links... I just wanted to do a simple recognition system, to perform an analysis on frequency peaks of a voice signal.

More specifically on this link: Voice Recognition Techniques - Project Guidance - Arduino Forum

@andrecr03, do not cross-post. Threads merged.

You haven't even said which Arduino you're using. There are many different Arduino boards, and many more that are not officially Arduino but are Arduino compatible.

This is important, because the many boards vary greatly in capability. Arduino Uno can just barely manage even a small FFT, and probably can't do any significant data processing without considerable blind (or "deaf") time between each set of data collected. Boards like Arduino Due are much more powerful, but the software can still get tricky if you wish to do analysis while collecting the next incoming data, so that you don't have gaps between FFTs.
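One way to avoid the gaps described above is to keep collecting samples continuously and hand out overlapping frames for analysis. The following is a minimal sketch in portable C++, not code from any Arduino library: the class name `OverlapFramer` and its interface are hypothetical, and on a real board the samples would typically arrive from a timer interrupt or DMA, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collects samples one at a time and reports when a full frame is ready.
// Consecutive frames share half of their samples (50% overlap), so no
// input sample is skipped between analyses.
class OverlapFramer {
public:
    explicit OverlapFramer(std::size_t frameLen) : frameLen_(frameLen) {
        assert(frameLen >= 2 && frameLen % 2 == 0);
        buffer_.reserve(frameLen);
    }

    // Push one sample. Returns true when `frame` has been filled with a
    // complete frame that is ready for windowing and an FFT.
    bool push(double sample, std::vector<double>& frame) {
        buffer_.push_back(sample);
        if (buffer_.size() < frameLen_) {
            return false;
        }
        frame = buffer_;
        // Keep the second half: it becomes the first half of the next frame.
        buffer_.erase(buffer_.begin(),
                      buffer_.begin() + static_cast<std::ptrdiff_t>(frameLen_ / 2));
        return true;
    }

private:
    std::size_t frameLen_;
    std::vector<double> buffer_;
};
```

With a frame length of 4, pushing samples 0 through 7 yields three frames: {0,1,2,3}, {2,3,4,5} and {4,5,6,7}. Whether the processor can finish each analysis before the next frame arrives still depends on the board.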

I just wanted to do a simple recognition system,

Sorry there is no such thing. They are all complex.

I say "green" and a green LED lights up,

And what if you say "great", "greedy", "margarine" and the green light comes on?
Basically an FFT is only the very first step; you need to take lots of FFTs over the duration of the sound. Then you have to search all the template sounds you have stored and see which one matches most closely. Then you have to look at the probability that the close match you have is actually a word you want.
The whole thing works on probabilities. There is no such thing as a perfect speech recognition system.
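The search-and-reject step described above can be sketched roughly as follows. This is only an illustration under strong assumptions: every utterance has already been reduced to the same number of frames, each frame holds the same number of feature values (for example a few FFT band energies), and a plain distance threshold stands in for a real probability estimate. The names `Frames`, `utteranceDistance`, `bestMatch` and `rejectAbove` are all hypothetical. Real systems also have to align words spoken at different speeds, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One utterance: a sequence of frames, each frame a vector of features.
using Frames = std::vector<std::vector<double>>;

// Mean squared difference per frame between two utterances of equal shape.
double utteranceDistance(const Frames& a, const Frames& b) {
    assert(!a.empty() && a.size() == b.size());
    double sum = 0.0;
    for (std::size_t t = 0; t < a.size(); ++t) {
        assert(a[t].size() == b[t].size());
        for (std::size_t k = 0; k < a[t].size(); ++k) {
            const double d = a[t][k] - b[t][k];
            sum += d * d;
        }
    }
    return sum / static_cast<double>(a.size());
}

// Returns the index of the closest template, or -1 when no template is
// closer than `rejectAbove`. The rejection is what keeps "great" from
// lighting the "green" LED just because it is the least bad match.
int bestMatch(const std::vector<Frames>& templates, const Frames& input,
              double rejectAbove) {
    int best = -1;
    double bestDist = rejectAbove;
    for (std::size_t i = 0; i < templates.size(); ++i) {
        const double d = utteranceDistance(templates[i], input);
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The threshold would have to be tuned by hand from recordings of the wanted words and of similar-sounding unwanted ones.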

Sorry, my mistake. I'm using an Arduino Due.
I know there isn't a simple speech recognition, but what I mean is that I want to make a system that can analyze spectra and choose actions.

Of course it won't be perfect Mike. But Apple, Amazon and others have figured out how to do this pretty well, at least for English language with USA dialect. Admittedly they are using huge server farms, so perhaps the methods are impractical for microcontrollers? Or maybe not?

I'm particularly interested in this for my Teensy Audio Library.... since we already have continuous 50% overlapped windowed 1024 point FFTs running on Teensy with quite a lot of the CPU time still available to actually do that pattern matching (especially on the newer, faster Teensy 3.6). In the coming years, we're going to get more and more powerful chips, since today's fastest microcontrollers are still mostly only 90 nm silicon.

Are there good public references for how the FFTs are distilled to smaller data sets, and those then matched to patterns? Or is that sort of knowledge only existing as the "secret sauce" at Apple, Google & Amazon?

There are lots of template matching algorithms in the public domain, they are mainly based on correlation. Look at the gesture controlled stuff for a simple example.
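As a rough illustration of correlation-based matching, here is a normalized correlation between two equally long feature vectors, written as portable C++. It is a generic textbook measure, not taken from any particular gesture or speech library, and the function name `normalizedCorrelation` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized correlation of two equally long vectors.
// Returns a value in [-1, 1]: 1 means the same shape, -1 the inverted
// shape. Returns 0 when either vector is constant (no shape to compare).
double normalizedCorrelation(const std::vector<double>& x,
                             const std::vector<double>& y) {
    assert(!x.empty() && x.size() == y.size());
    const double n = static_cast<double>(x.size());

    double meanX = 0.0;
    double meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0.0;
    double sxx = 0.0;
    double syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) {
        return 0.0;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

Because the result ignores overall level and offset, a word spoken louder or quieter can still correlate well with its template, as long as the shape of the feature vector is similar.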

Apple, Amazon and others have figured out how to do this pretty well, at least for English language with USA dialect.

That is the problem. I am actually English, with a north Manchester accent. It is not at all strong, but voice recognition is annoyingly imprecise. Anyone with a strong accent doesn't stand a chance. We don't all sound like Dick Van Dyke in the movies; in fact none of us do.

Please, I just have to do a word recognition algorithm. It doesn't have to be speaker independent. It can be only for my voice!
I just want to know how I can take a voice signal from a microphone connected to the board (a KY-037 microphone) and do some signal processing on it.
To sum up, are there some example algorithms I can use, modifying the code?
I would like to know about and use libraries and functions, for example to do an FFT, show a spectrum, start listening to the microphone, etc.
I don't know any of that.
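For the signal-processing part of the question, a minimal sketch of "window, FFT, find the peak" in portable C++ might look like the following. It is an assumption-laden illustration, not library code: the function names `fft`, `magnitudeSpectrum` and `peakBin` are hypothetical, the samples are assumed to be already collected into a buffer (on the board they would come from something like `analogRead()` on the microphone's analog output, called at a steady rate), and `double` arithmetic is used for clarity rather than speed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place iterative radix-2 FFT. The length must be a power of two.
void fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }

    // Butterfly stages.
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / static_cast<double>(len);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

// Removes the DC offset, applies a Hann window, and returns the magnitude
// of the first n/2 FFT bins. Bin k corresponds to k * sampleRate / n Hz.
std::vector<double> magnitudeSpectrum(const std::vector<double>& samples) {
    const std::size_t n = samples.size();
    assert(n >= 2);
    const double pi = std::acos(-1.0);

    double mean = 0.0;
    for (double s : samples) {
        mean += s;
    }
    mean /= static_cast<double>(n);

    std::vector<std::complex<double>> buf(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double w = 0.5 * (1.0 - std::cos(2.0 * pi * static_cast<double>(i) /
                                               static_cast<double>(n - 1)));
        buf[i] = std::complex<double>((samples[i] - mean) * w, 0.0);
    }
    fft(buf);

    std::vector<double> mag(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        mag[k] = std::abs(buf[k]);
    }
    return mag;
}

// Index of the strongest bin in a magnitude spectrum.
std::size_t peakBin(const std::vector<double>& mag) {
    assert(!mag.empty());
    std::size_t best = 0;
    for (std::size_t k = 1; k < mag.size(); ++k) {
        if (mag[k] > mag[best]) {
            best = k;
        }
    }
    return best;
}
```

The DC removal matters because ADC readings sit around mid-scale rather than around zero; without it the strongest bin would usually be bin 0. A single peak frequency says little about which word was spoken, so this is only the first stage.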

Can you explain what is wrong with the link in reply #3?

See link in reply #5 for FFT code.

Grumpy_Mike:
Can you explain what is wrong with the link in reply #3?

Please correct me if I'm wrong. It suggests using a server (BitVoicer), but I thought that when you use a server, you don't really develop the voice recognition system/algorithm yourself. You just train words, and the programmer only decides how to use each recorded command, as with the EasyVR.
For this project to be graded, I have to write a voice recognition algorithm on the Arduino, even if it's simple. I really have to do the windowing, the FFT, etc., and produce the block diagram along with the code, in order to perform the recognition.
Please correct me if needed. As far as I know, the server only provides processing power, not the DSP techniques and libraries I asked about.

Thanks for all the help so far.

All simple voice recognition systems require training. The complex ones do as well, it is just that the training is different.
I think you have an impossible project. There are no easy answers, no magic algorithm.

Basically you have to input a sample of sound. Top and tail it, that is, remove the start and end bits, leaving just the word. Then extract parameters from the sound. I don't think that an FFT is a suitable parameter, but it depends on what sort of system you want; a system for unspecific voices would not depend in any way on the frequency profile.
Finally you need to compare what you have with a previously prepared template to see what probability you have for a match.
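The "top and tail" step can be sketched with a simple energy threshold. This is a minimal illustration, assuming the samples are already centered on zero (DC offset removed) and that background noise is well below the chosen threshold. The names `Span` and `topAndTail`, the frame length and the threshold are all hypothetical choices, not values from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of sample indices that contain the word.
struct Span {
    std::size_t begin;
    std::size_t end;
};

// Splits the recording into frames of `frameLen` samples and keeps
// everything from the first to the last frame whose mean energy exceeds
// `threshold`. Returns {0, 0} when no frame is loud enough.
Span topAndTail(const std::vector<double>& x, std::size_t frameLen,
                double threshold) {
    assert(frameLen > 0);
    const std::size_t frames = x.size() / frameLen;
    bool found = false;
    std::size_t first = 0;
    std::size_t last = 0;

    for (std::size_t f = 0; f < frames; ++f) {
        double energy = 0.0;
        for (std::size_t i = 0; i < frameLen; ++i) {
            const double s = x[f * frameLen + i];
            energy += s * s;
        }
        energy /= static_cast<double>(frameLen);

        if (energy > threshold) {
            if (!found) {
                first = f;
                found = true;
            }
            last = f;
        }
    }

    if (!found) {
        return Span{0, 0};
    }
    return Span{first * frameLen, (last + 1) * frameLen};
}
```

Quiet consonants at the start or end of a word can fall below the threshold and get cut off, so the threshold usually needs tuning against real recordings.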

The parameter extraction part defines what sort of system you have. As I understand it, the more complex systems attempt to extract fricative sections from the speech, and then look up the combinations of these in a dictionary.

andrecr03:
I would like to know and use libraries and functions, like to do a FFT, show a spectrum, start listening the microphone, etc.

About a year ago I wrote a very detailed tutorial and taught a couple hands-on workshops which included this exact subject. But I have some good news and some bad news for you.

First, the bad news: this tutorial and the library code only work on Teensy 3.x. They will not work on Arduino Due. The code makes heavy use of DMA and special peripherals not present on Due. It also uses the Cortex-M4 DSP extensions, which are not present in the Cortex-M3 processor on Arduino Due. Maybe this material can offer some very minor help even if you don't have compatible hardware, but to actually use it, you will need to get a Teensy board. (Full disclosure: my company makes these boards, so my opinion is biased, but I really do wish to help you if I can.)

I also do not know how to use the FFT output for speech recognition. I'm writing this message for you in hopes you might share whatever you learn. I can only help with the part you asked above... how to get the audio data and perform FFTs, and do so in a way that other code can easily work with the data while more audio is captured and analyzed by the FFT without gaps.

The good news: I wrote a very detailed 31 page tutorial manual. You can get it here:

https://www.pjrc.com/store/audio_tutorial_kit.html

Alysia and I also went to the trouble to shoot & edit a complete 48 minute walkthrough video. It shows all the tutorial steps. Scroll down at that page for the video. Of course, actually doing the tutorial yourself with the real hardware is essential, but if you get stuck and the words & pictures of the 31 page manual aren't clear, hopefully the video helps.

Again, this code only works on Teensy 3.2, Teensy 3.5 and Teensy 3.6. The tutorials use this audio shield. There are other ways to get signals in and out, and after you've read or watched the tutorial and if you look around the options offered in the design tool (part 2 of the tutorial) you'll see other ways to input & output signals. But as a practical matter, for a microphone that shield is probably the best way to get started. Notice the part of the tutorial where the mic gain is software adjustable....

If you read or watch part 3-2 of the tutorial, hopefully you can see how this works for FFT data. The audio library takes care of acquiring the signal and computing the FFT, which you can then use in your program.

Again, how you'll use the FFT data to recognize words is beyond my knowledge. I am curious. If this info helps you get started, and if you learn anything useful that could help me or others, I sincerely hope you'll share. Maybe you'll even participate here on the forum and occasionally help others, as we're trying to help you?