text to speech without TTS hardware engine

Hoek · October 24, 2015, 5:46am

For my daughter's roboclock project I implemented text to speech, sort of, by pre-rendering the speech for text and phrases into .wav files stored on an SD card. It works really well and is clear and simple. Also, because of the way it works, very little text is needed in the main sketch.

I have a program that renders speech for all numbers 0 - 1000, along with all minutes of the day and all other phrases needed. I was thinking I could record far fewer numbers and piece together an assortment of them to make any number I want. At the moment I piece together clips to make [minus][number][point][decimal1][decimal2] etc.
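
A minimal sketch of how that piecing together might look, assuming one clip each for 0-19, the tens, "hundred", "thousand", "minus" and "point" (roughly thirty clips in total). The clip names and both function names are made up for illustration, and it is written as plain C++ rather than against any particular audio library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical clip names; a real sketch would map these to .wav files.
static const char* const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"};
static const char* const TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"};

// Append the clips for an integer in 0..999999.
void appendInteger(uint32_t n, std::vector<std::string>& out) {
    assert(n <= 999999);
    if (n >= 1000) {
        appendInteger(n / 1000, out);
        out.push_back("thousand");
        n %= 1000;
        if (n == 0) return;
    }
    if (n >= 100) {
        out.push_back(ONES[n / 100]);
        out.push_back("hundred");
        n %= 100;
        if (n == 0) return;
    }
    if (n >= 20) {
        out.push_back(TENS[n / 10]);
        n %= 10;
        if (n == 0) return;
    }
    out.push_back(ONES[n]);
}

// Build [minus][number][point][decimal1][decimal2] from a value given in
// hundredths, e.g. -1234 means -12.34.
std::vector<std::string> clipsForHundredths(int32_t hundredths) {
    std::vector<std::string> out;
    if (hundredths < 0) out.push_back("minus");
    int64_t v = hundredths;
    uint32_t mag = static_cast<uint32_t>(v < 0 ? -v : v);
    appendInteger(mag / 100, out);
    uint32_t frac = mag % 100;
    if (frac != 0) {
        out.push_back("point");
        out.push_back(ONES[frac / 10]);  // decimal1
        out.push_back(ONES[frac % 10]);  // decimal2
    }
    return out;
}
```

Calling clipsForHundredths(-1234) gives the clip sequence minus, twelve, point, three, four, which the player would then queue one after another.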

The main project is on a Due, which IMO kicks butt: 32 bit, a nice amount of memory and not too shabby CPU speed.

However, I would really like to be able to read quotes etc. off the SD card and play them, e.g. "Good morning! How are you today?" I have a dictionary with ~38000 words in it and was thinking of rendering every word and creating a binary tree that's readable off the SD card to find the correct file for any word.

Wondering if anyone else has any other ideas on how to tackle this. Because of the volume of entries I doubt I could keep the entire index in memory, but I may be able to keep 1/100 of it in memory, which would give almost instant access for narrowing a search down to 100 entries.
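
One way the "keep 1/100 in memory" idea could work, sketched under the assumption that the full word list on the card is sorted: keep every 100th word in RAM, binary search those samples, and then read only the one block of up to 100 entries from the card. SparseIndex, buildSparseIndex and blockStart are hypothetical names, and the sorted list is passed in as a vector here in place of the file on the card.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Every `stride`-th word of a sorted dictionary, kept in RAM.
struct SparseIndex {
    std::size_t stride;
    std::vector<std::string> samples;  // samples[i] == dictionary[i * stride]
};

SparseIndex buildSparseIndex(const std::vector<std::string>& sortedWords,
                             std::size_t stride) {
    assert(stride > 0);
    SparseIndex idx{stride, {}};
    for (std::size_t i = 0; i < sortedWords.size(); i += stride) {
        idx.samples.push_back(sortedWords[i]);
    }
    return idx;
}

// Returns the first dictionary position of the block that could hold `word`.
// Only positions [result, result + stride) then need to be read from the card.
std::size_t blockStart(const SparseIndex& idx, const std::string& word) {
    // Find the first sample strictly greater than the word; the word, if
    // present, lives in the block that starts at the sample before it.
    auto it = std::upper_bound(idx.samples.begin(), idx.samples.end(), word);
    if (it == idx.samples.begin()) return 0;  // sorts before every sample
    return static_cast<std::size_t>(it - idx.samples.begin() - 1) * idx.stride;
}
```

With 38,000 words and a stride of 100 that is 380 samples held in RAM, and a lookup costs one in-memory binary search plus one short sequential read from the card.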

Johnny010 · October 24, 2015, 9:33am

38,000 words at an average of 5 letters each is about 190,000 bytes.

So 200 kB give or take.

Maybe pre-store the most commonly used 1000-2000 words in RAM and then the rest on an SD file?

So a phrase such as "I went to the shop" would already have I, went, to and the in RAM...leaving only "shop" to look up on slow storage.

You could then count the characters in the word you are looking for and have the SD file read from a pre-determined line where words of that length begin.
Sort them in size order. An alphabetical search would also work.
I know Python has some kind of sorting algorithm that is used to sort and to find things... I've forgotten what it is called, as I have little coding experience in Python :(.

https://gist.github.com/deekayen/4148741#file-1-1000-txt

1-1000.txt starts: the, of, to, and, a, in, is, it, you, that (truncated)

According to: How many words in the english language ? How many do i need to know?

In English, in everyday writing/speaking, about 90% of what we use comes from the 1000 most commonly used words... this will speed things up for you ;).

Johnny010 · October 24, 2015, 9:38am

PS. I lied.

Shop is in that list as well...so you can tick them all as being in RAM.

Oh, and sorry, but yes: have the 1000 words in RAM, each followed by 2 bytes that refer to an address in a file on the SD card which holds the 1000 pre-rendered clips.

E.g. byte array[6000] = {'A','P','P','L','E',0,1};

So when a search along the byte array comes across APPLE -> return the address -> look in the SD file at that address -> return the pre-rendered clip for that word.
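
A rough sketch of that search over the packed array, in plain C++. The post does not say how the end of a word is told apart from its address bytes, so this assumes uppercase-only words and a big-endian address whose high byte is below 'A'; that byte then doubles as the end-of-word marker and limits addresses to below 16640. lookupWord is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scan a packed table of entries: uppercase letters followed by a 2-byte
// big-endian address. Returns the address, or -1 if the word is absent.
// Assumption: the address high byte is below 'A', so it also marks where the
// letters of an entry end.
int32_t lookupWord(const uint8_t* table, std::size_t tableLen, const char* word) {
    assert(table != nullptr && word != nullptr);
    const std::size_t wordLen = std::strlen(word);
    std::size_t pos = 0;
    while (pos < tableLen) {
        const std::size_t start = pos;
        while (pos < tableLen && table[pos] >= 'A') ++pos;  // skip the letters
        if (pos + 2 > tableLen) break;                       // truncated entry
        const std::size_t len = pos - start;
        if (len == wordLen && std::memcmp(table + start, word, len) == 0) {
            return (static_cast<int32_t>(table[pos]) << 8) | table[pos + 1];
        }
        pos += 2;  // step over the address to reach the next entry
    }
    return -1;
}
```

This is a linear scan, so a miss walks the whole table. For about 1000 short entries held in RAM that is likely acceptable, and putting the most frequent words first would shorten the typical search.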

You could offload the LUT to EEPROM on an external IC... lots more words... 32 KB worth of space... average maybe 7 bytes per word (5 for letters, 2 for address) = 4500-ish words addressed.

http://www.hobbytronics.co.uk/arduino-external-eeprom

Hoek · October 24, 2015, 1:26pm

Hi, thanks for the input so far. I managed to render the 58K-odd words and, oddly, it took 5 minutes to render and > 1 hour to copy to the SD card because there are so many files.

I have some very old code (i.e. mid '80s) that implements a B-tree index that seems to work well. It's been a long time since I've used it, but I'm able to read 1 word at a time and index it. At the moment each word just gets a +1 on the index, so essentially the 1st file is "w1.wav", the next is "w2.wav" etc. The program that creates the .wav files trims silence off the end, as this gives me extra time to find the next word before it needs to be played.

Another approach is to convert the text to ARPAbet and then "mash" the sounds together on the fly. This would require a lot less space and would hopefully make it easier to add new words.

Example ARPAbet:
ABARE AA0 B AA1 R IY0
ABASCAL AE1 B AH0 S K AH0 L
ABASH AH0 B AE1 SH
ABASHED AH0 B AE1 SH T
ABASIA AH0 B EY1 ZH Y AH0

At the moment I need to test what I have: port the B-tree functions to work with the SD card library and see how it runs. The main problem I would have with ARPAbet is generating the wav files for "AA0", "IY0", "EY1" etc.
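
For what it's worth, a small sketch of reading dictionary lines in the format shown above. The trailing digit on a vowel is its stress marker (0, 1 or 2); dropping it, as done here, is a simplification that would let AA0 and AA1 share one clip at the cost of losing stress. Pronunciation and parseArpabetLine are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

struct Pronunciation {
    std::string word;
    std::vector<std::string> phonemes;  // stress digits removed
};

// Parse one dictionary line such as "ABASH AH0 B AE1 SH".
Pronunciation parseArpabetLine(const std::string& line) {
    std::istringstream in(line);
    Pronunciation p;
    in >> p.word;  // the first token is the word itself
    std::string symbol;
    while (in >> symbol) {
        assert(!symbol.empty());
        // Vowels carry a trailing stress digit; drop it so that every stress
        // variant of a vowel maps to the same clip name.
        if (std::isdigit(static_cast<unsigned char>(symbol.back()))) {
            symbol.pop_back();
        }
        p.phonemes.push_back(symbol);
    }
    return p;
}
```

For the line "ABASH AH0 B AE1 SH" this yields the word ABASH and the phonemes AH, B, AE, SH, so each phoneme name could map directly to one wav clip.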

I also have no idea how the SD library performs when there's 58K files in a single directory.

Hoek · October 27, 2015, 12:13am

Been busy experimenting with the 44 English phonemes and I can put them together and get ... something that could potentially sound like speech. I think the main problem I have atm is "cico" aka crap in = crap out. My sound sources are not really suited to the job at the moment, and the words don't flow even though the wav files are trimmed so there is no leading or trailing silence.

Regardless, I had a good idea about the binary search index that should be simple and fast.

The dictionary has 58,000+ words ranging from single letters to at least one word of 22 letters. Rather than having one array with everything, which needs either fixed-length entries or an index to each variable-length string, I would have 22 arrays (one for each word length), which means fixed-length searching.
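
A sketch of the fixed-length search for one of those 22 arrays, assuming the words of a given length are stored sorted and back to back with no terminators. findFixedWidth is a hypothetical name, and the table is an in-memory buffer here where the real one would sit in flash, EEPROM or a file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Binary search in a table of `count` sorted, fixed-width words stored back to
// back with no terminators. Returns the record number, or -1 if absent.
long findFixedWidth(const char* table, std::size_t count, std::size_t width,
                    const char* word) {
    assert(std::strlen(word) == width);  // the caller picks the table by length
    std::size_t lo = 0, hi = count;      // search the half-open range [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const int cmp = std::memcmp(table + mid * width, word, width);
        if (cmp == 0) return static_cast<long>(mid);
        if (cmp < 0) {
            lo = mid + 1;  // record at mid sorts before the word
        } else {
            hi = mid;      // record at mid sorts after the word
        }
    }
    return -1;
}
```

Because every record in a table has the same width, record number times width gives the byte position directly, so no per-word offset index is needed. Each lookup touches about log2(count) records, which is at most 16 even if all 58,000 words were in one table.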

I've discovered very nice sized SPI flash memory modules with 8, 16, 32 and even 64 megabits, which from what I can see should be more than adequate for storing and retrieving the data. (They're well suited to 3.3 V and some have 100 MHz+ clock speeds... the ones I looked at are all SOIC-8 packages, but I ordered some SOIC-8 breakout boards to cure that.)

On a side note... I noticed the 58,000+ wav files took a long time to copy to SD, so I might look at merging them into several larger files, removing all the redundant header information, and making the indexes point to an offset into a particular file. I have a feeling the SD card library would have an easier time seeking to an offset than finding a file name among 58,000+ files. I'll also have to change my PCM library to play from a particular file at an offset for a fixed length:

<length 1><wave info 1><length 2><wave info 2>....
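
A sketch of that layout, using an in-memory byte vector in place of the merged file. The 4-byte little-endian length field is an assumption (the post does not fix a width), and appendClip and readClip are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one clip as <4-byte little-endian length><samples> and return the
// offset at which its length field starts. That offset is what the word index
// would store.
uint32_t appendClip(std::vector<uint8_t>& blob, const std::vector<uint8_t>& samples) {
    const uint32_t offset = static_cast<uint32_t>(blob.size());
    const uint32_t len = static_cast<uint32_t>(samples.size());
    for (int shift = 0; shift < 32; shift += 8) {
        blob.push_back(static_cast<uint8_t>(len >> shift));
    }
    blob.insert(blob.end(), samples.begin(), samples.end());
    return offset;
}

// Read back the clip whose length field starts at `offset`.
std::vector<uint8_t> readClip(const std::vector<uint8_t>& blob, uint32_t offset) {
    assert(offset + 4 <= blob.size());
    uint32_t len = 0;
    for (int i = 0; i < 4; ++i) {
        len |= static_cast<uint32_t>(blob[offset + i]) << (8 * i);
    }
    assert(offset + 4 + len <= blob.size());
    const auto first = blob.begin() + offset + 4;
    return std::vector<uint8_t>(first, first + len);
}
```

On the card, the read side would presumably become a seek to the stored offset followed by streaming that many bytes to the PCM player, so the index only needs one offset per word.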

The best part will be... creating everything will be as simple as pushing a button to render the files and transferring a directory to an SD card. An optional function to recreate the EEPROM index from the SD card would also be included.
