Speech to text in Arduino

Hello,
I'm currently making an Arduino voice recognition system that needs to know which words you are saying into a microphone. Modules like the DF-Robot one don't help me, because they have a limited number of commands (and basically, as I said before, I only want to translate speech to text, instead of doing something in particular).
I've been looking for a solution but I haven't found any useful device.
I'm wondering if I should establish communication between the Arduino and another board that would do this through Python, for example.

Thank you!

only :wink:

While it might seem simple today with smartphones and AI-driven virtual assistants you can talk to, implementing a speaker-independent system of ChatGPT, Amazon, Google Assistant, or Apple Siri quality / latency on a small Arduino board, even on a powerful ESP32 or Raspberry Pi, remains a significant engineering challenge and really just wishful thinking. Won't happen...

Commercial systems process user input by employing quality microphone arrays and digital signal processors, advanced AI and robust computational resources, either on board or in the cloud. For example, the Apple Neural Engine in the A15 Bionic chip can execute up to 15.8 trillion operations per second, whereas an ESP32 would at best manage 480 million operations per second using its two cores (which you'll never reach).
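To put those two figures side by side, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted above; the 240 MHz clock and "one operation per cycle per core" are assumptions chosen to reproduce the 480 million figure, and both values are theoretical peaks.

```python
# Peak figures quoted in the post above (theoretical, not measured).
ANE_OPS_PER_SEC = 15.8e12  # Apple Neural Engine in the A15 Bionic

# Assumption: two cores at 240 MHz, one operation per cycle each.
ESP32_CORES = 2
ESP32_CLOCK_HZ = 240e6
ESP32_OPS_PER_SEC = ESP32_CORES * ESP32_CLOCK_HZ

ratio = ANE_OPS_PER_SEC / ESP32_OPS_PER_SEC
print(f"ESP32 peak: {ESP32_OPS_PER_SEC:,.0f} ops/s")
print(f"On paper the Neural Engine is about {ratio:,.0f} times faster")
```

So the gap is roughly four and a half orders of magnitude, before even counting memory size or microphone quality.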

If you want that kind of quality and latency for voice recognition, you will have to capture the audio stream and route it to an online cloud service or (with lower quality) to a local computer, harnessing its built-in power and some code (like your Python stuff based on the SpeechRecognition package on PyPI).
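As a rough illustration of the "capture and route" idea, the sketch below wraps raw 16-bit samples in a WAV container and prepares an HTTP upload using only the Python standard library. The endpoint URL is hypothetical, the 16 kHz sample rate is an assumption, and a generated tone stands in for microphone data; a real service will have its own URL, authentication and audio format rules.

```python
import io
import math
import struct
import urllib.request
import wave

SAMPLE_RATE = 16_000  # Hz; assumed, a common rate for speech
TRANSCRIBE_URL = "http://example.invalid/transcribe"  # hypothetical endpoint


def pcm_to_wav(samples, sample_rate=SAMPLE_RATE):
    """Wrap signed 16-bit mono samples in an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def build_request(wav_bytes, url=TRANSCRIBE_URL):
    """Prepare (but do not send) an HTTP POST carrying the audio."""
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


# Stand-in for microphone data: one second of a 440 Hz tone.
tone = [
    int(10_000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]
wav_bytes = pcm_to_wav(tone)
request = build_request(wav_bytes)
print(len(wav_bytes), "bytes ready for", request.get_method(), request.full_url)
# Sending would be urllib.request.urlopen(request) against a real service.
```

On the microcontroller side the same two steps apply: collect samples into a buffer, then push the buffer over Wi-Fi to whatever machine does the recognition.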

Thank you for the fast answer!
I deduce from your message that I won't be able to implement it with an external module.
But I've found a library similar to the SpeechRecognition one that seems to work on MicroPython. (It uploads the speech audio file to an external server. I don't know how latency-efficient it is...)
Since the ESP32 can also be programmed in MicroPython, do you think I'm on the right track?
Thank you!

You would need to clarify your requirements, like speed, latency and precision, and whether it is OK to offload the speech recognition to the cloud (which means you need the cloud for it to work), etc.

If you are more at ease with Python than C++, then sure, MicroPython is an option. It's not as fast as C++ though, so it depends on how much computation you need to happen locally.

I want to talk into a microphone and have what I said appear on an LCD. Consequently, the less latency (assuming I'm going to upload the audio to a server to get the text) the better.
If I had to put a value, I would say that the maximum latency I would accept would be between 2 and 3 seconds.
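A latency budget like this can be sanity-checked with simple arithmetic before building anything. The sketch below assumes the audio is recorded first and uploaded as one file (not streamed), sampled at 16 kHz with 16-bit mono samples; the uplink speed and the service's processing time are made-up example values, not measurements.

```python
SAMPLE_RATE_HZ = 16_000  # assumed sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono audio


def upload_seconds(speech_seconds, uplink_bits_per_sec):
    """Time to push the uncompressed recording over the network."""
    payload_bytes = speech_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
    return payload_bytes * 8 / uplink_bits_per_sec


def total_latency(speech_seconds, uplink_bits_per_sec, service_seconds):
    """Delay between the end of speech and the text coming back."""
    return upload_seconds(speech_seconds, uplink_bits_per_sec) + service_seconds


# Example values (assumptions): 3 s of speech, 1 Mbit/s effective uplink,
# 1 s for the remote engine to respond.
latency = total_latency(speech_seconds=3, uplink_bits_per_sec=1_000_000,
                        service_seconds=1.0)
print(f"Estimated latency: {latency:.2f} s")
print("Within a 3 s budget" if latency <= 3.0 else "Over a 3 s budget")
```

With these example numbers the upload of 96,000 bytes takes about 0.77 s, so the total stays under 2 s. A slower uplink or a longer sentence eats into the budget quickly, which is why streaming the audio while the person is still talking helps.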

Actually I'm used to using C++ with Arduino. The only reason I would program in MicroPython is the library I mentioned.

Thank you again!

OK. They hide the hard work of capturing the incoming audio and streaming it to the server of choice.

I've never used it, so I don't know what to expect in terms of performance (it also depends on your Wi-Fi network and the latency of the remote engine you choose).

You must understand that Python is significantly slower than C++, often by orders of magnitude. This is because Python is an interpreted language whereas C++ is a compiled language. This means C++ is compiled down to machine code that is then run.

Python, on the other hand, has to go through each line one by one and then look up what to do with that line.

I also think you are trying to do far too much for the current state of knowledge you have.

Some libraries have a python API but the main part of the code could be in C++, so you could get ok results with such a lib.

(I don’t know the details of that specific one)

I have translated a C++ library into Python, but the two are fundamentally incompatible. It comes down to the way Python treats some fundamental operations. For example, modular arithmetic gives runtime errors, something that C++ can never give, mainly because there is nothing to give these errors to in an embedded system.

Variable types are also incompatible.
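A small illustration of the kind of mismatch described above, run from the Python side. The helper `c_style_mod` is a hypothetical name written for this example; it imitates the C++ rule that integer division truncates toward zero. The 8-bit wrap is imitated with a mask, since Python integers never overflow.

```python
def c_style_mod(a, b):
    """Remainder as C++ computes it: quotient truncated toward zero."""
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return a - b * quotient


# 1. The sign of the result differs for negative operands.
print(-7 % 3)              # Python: result takes the sign of the divisor
print(c_style_mod(-7, 3))  # C++ rule: result takes the sign of the dividend

# 2. Modulo by zero raises an exception in Python.
#    In C++ it is undefined behaviour; nothing is reported.
try:
    1 % 0
except ZeroDivisionError as err:
    print("Python raised:", err)

# 3. Variable types: a C++ uint8_t wraps at 256, a Python int just grows.
print(250 + 10)           # plain Python integer
print((250 + 10) & 0xFF)  # what an 8-bit unsigned variable would hold
```

When porting, each `%` on possibly negative values and each fixed-width variable needs this sort of explicit handling.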

What I meant is that you can use tools like pybind11 or Boost.Python, which create bindings between the two languages. These bindings allow Python to treat C++ functions as if they were native Python functions.
You can also use ctypes or cffi, which let Python interact with precompiled C++ shared libraries by directly calling their functions.
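For a feel of what that looks like with ctypes, here is a minimal sketch that calls two functions from the system C library. It assumes a Unix-like system (Linux or macOS) where the C library can be located; on other platforms the loading line would differ. Note that ctypes talks to functions with C linkage, so a C++ library normally exposes its entry points through `extern "C"` wrappers.

```python
import ctypes
import ctypes.util

# Assumption: a Unix-like system where the standard C library can be found.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signatures so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

# These calls run compiled code, not Python bytecode.
print(libc.strlen(b"speech to text"))  # 14
print(libc.abs(-42))                   # 42
```

The Python side stays short while the real work happens in compiled code, which is the pattern the big AI libraries follow.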

Under the hood, many Python libraries for AI, such as TensorFlow and PyTorch, use C++ at their core to achieve high performance while providing a simple Python interface.

I'm curious. Why is it that the BBC Micro could do this (after a fashion), and the Spectrum also had a microspeech, yet with all the megabytes or gigabytes and vastly more computing power this is still a problem? Even our GP used speech to text to record patient notes on a very basic PC clone.
~Confused

Simple: it couldn't.

What it did do was use linear predictive coding (LPC) to generate speech. This idea was taken up in the ThinkSpeach Arduino library, which copies this algorithm. There were also four extra ROMs that could be added in a cartridge in the lower right of the keyboard.

It could not recognise words, only produce them. I was the author of the Body Building Course in the Micro User Magazine.
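For anyone wondering why generating speech this way is so cheap, the core of LPC synthesis is just a small recursive filter driven by a buzz or noise source. The sketch below is a generic, simplified illustration and not the code of any particular chip or library: the coefficients are made up, and real LPC speech hardware typically uses around ten coefficients that are updated every frame.

```python
def lpc_synthesize(excitation, coeffs, gain=1.0):
    """All-pole filter: y[n] = gain * x[n] + sum(coeffs[k] * y[n-1-k])."""
    out = []
    for n, x in enumerate(excitation):
        y = gain * x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]  # feed back earlier outputs
        out.append(y)
    return out


# Voiced excitation: one pulse every 80 samples (a crude "vocal cord" buzz).
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]

# Made-up coefficients giving a single stable resonance (one "formant").
coeffs = [1.3, -0.8]

samples = lpc_synthesize(excitation, coeffs)
print([round(s, 3) for s in samples[:5]])
```

A handful of multiply-adds per sample is well within reach of an 8-bit machine. Recognition is the reverse problem and is vastly harder.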

I know Python is slower than C++!
As I said, I would only use Python because I haven't found any library or example in C++ that helps me with my purpose. Anyway, for this project I don't need extreme processing speed if I'm going to upload the audio file to a service like the Google Translate API.

That's exactly why I'm asking on this forum buddy! :wink:

And has anyone given you an answer in the affirmative? If it can be done, which many people don't think it can be, then it requires a lot more work than can be answered here, especially as you don't seem capable of following up on the hints people have given you.

Post#2 says it all.

I sort of remember those.

As well as being involved in the manufacture of the Beeb, I was a keen user; difficult not to be when we could get them for free.

I do remember doing some speech stuff with the SP0256AL2 speech synthesizer.

Mind you, if we had met back then, I doubt I could have spoken to you: too much inside info and too many secrets.

So did you work in Cambridge for Acorn then?

I do remember making the journey from Manchester down to Cambridge on several occasions for announcements and briefings. You probably didn't know but on several occasions we were told that one or another piece of information was confidential and not to use it. We always honoured that undertaking.

We also had a real ex-Fleet Street journalist working for the mag who used to make up stories. I was once quoted on a Radio One news bulletin over something I said, but with some made-up consequences added that were not true. Then The Sun rang me up wanting to know if I was sorry, but as this was all made up, that is what I told them. They seemed to understand how the press worked.

I was Senior Engineer at AB Electronic Systems in Abercarn, looking after the electronics side. We made the Beebs, Archimedes, Prestel, 2nd Processors, Atoms and Domesday, and did bits of Spectrums and other stuff.

Acorn's approach to manufacture was along the lines of: here is a working prototype, there are no design issues, now make us 5,000 a week.

The factory was only small and we just did not have the space to cope with hold-ups in production. We made, with Acorn's approval, design "suggestions" that cut the production failure rate to more acceptable levels.

Came across this video of the factory recently;

I don't appear, but I recognize several people in the video.

There were of course boards that were damaged in production and so, although working, could not be sold. So there were plenty of 'free' Beebs to be found.

It really saddens me to see how people like this treat those they consider "students" or "less knowledge people". You literally have no idea who I am or what knowledge I have.

Comments like these make the person asking for help on the forum feel denigrated and not taken seriously. If you don't feel like answering, just don't answer.

I'm writing this to say that I have developed my own solution using Deepgram as a transcription service. I have uploaded it to GitHub for anyone who wants to use it.
Thank you.

Comments like this are my assessment of your current state of knowledge, based on what you have shown us to date. It is not designed to be a putdown or to belittle you, although some people do take offense at this.

So well done at getting it working and proving me wrong.

But it did use an external computer to do the heavy lifting, something you seemed to reject at one stage.

I can't get it to work no matter how close I hold my ESP32 S3. What do you think the problem might be?