Need to get an idea of the word-list size a text-match tool should cover

For small text interpreters, such as an AT command matcher and responder, this function+data tool runs as fast as serial: there is no pause after collecting text to find a match, and it can maintain a high baud rate both ways.

However -- I am looking at putting the keyword list and maybe all the links into EEPROM on 328P Arduinos, and I'm not sure if the links should go in RAM.

Links -- word count + 26 start-letter indexes will take 54 bytes,
then every keyword adds 4 bytes of overhead. What's left is room for the letters of all the keywords. With 100 keywords that leaves 570 chars for the whole list if it all fits in the 1024-byte EEPROM, an average of less than 6 chars per word. With 80 words the average rises to 8, and with 64 words to 11.

Without that, 2 bytes of word count and 1 terminating zero per word allow more or longer words.
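To make the budget concrete, here's a quick back-of-the-envelope check of both layouts in plain C, runnable on a PC. The function names are mine, not part of the tool:

```c
/* Chars left over for keyword text in the 1024-byte EEPROM under the
   linked layout: 54-byte header (word count + 26 start-letter indexes)
   plus 4 bytes of link overhead per keyword. */
int chars_left_linked(int n_words) {
    return 1024 - 54 - 4 * n_words;
}

/* Simpler layout: 2-byte word count, then each word stored with one
   terminating zero and no links. */
int chars_left_flat(int n_words) {
    return 1024 - 2 - 1 * n_words;
}

/* chars_left_linked(100) -> 570 chars (avg < 6 per word)
   chars_left_linked(80)  -> 650 chars (avg ~8)
   chars_left_linked(64)  -> 714 chars (avg ~11)
   chars_left_flat(100)   -> 922 chars (avg ~9) */
```

So the flat layout buys roughly 3 extra chars per word at 100 keywords, at the cost of losing the start-letter links.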

For AT commands and responses, or other Q&A-style text, are 80 or so short sequences more than enough?

Also, how many commands should a text interpreter support -- say, a user-made command interpreter? Is 80 enough?
I could fit just the words+terminators in 1022 bytes to gain name space, at a cost in RAM.

This tool can speed up and simplify text-based serial sketches, but I wonder about word-count limits. I'm going with EEPROM storage for now because it simplifies loading the list compared to putting loads of PROGMEM lines in the user's sketch source.

I have already begun a list-load-to-EEPROM sketch; what direction it takes depends on the responses here.

Haven't really got a clue what's going on here.
I think you need to explain a bit better what this project will do -- inputs? Display? Application, etc.

My first thought would be: how does it interpret text? That looks like the critical path in terms of speed.


Don't forget to change all letters to upper case before processing.

I put that in the process, but only to reduce the number of first words for each start char. I could require that first chars be caps, though, and use another 52 bytes of overhead.

Given a list of keywords, serial input is read 1 char at a time and submitted to a match function that returns a status -- more on that later.
The function "walks" through the list data trying to match input against it. If it finds a match, the keyword number is known on arrival of the final char for one kind of match (where a longer, superset word may still be matched), and a complete match is reported when the next input char is a delimiter. The lesser match lets the keyword be known in cases where a number (KEY##) or symbol (AT+) may follow, or even where the keywords are part of a concatenation of keys or syllables -- I keep the possible uses open.
The matches are status returns; another status is No Match Possible.
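A minimal sketch of that char-at-a-time idea, in plain C so it runs anywhere. The status names, the tiny keyword table, and the linear scan are mine for illustration only; the real tool walks packed list data using the start-letter links instead:

```c
#include <ctype.h>
#include <string.h>

/* Status names and the tiny table are illustrative, not the tool's. */
enum { NO_MATCH = -1, PARTIAL = 0, WORD_END = 1, COMPLETE = 2 };

static const char *keys[] = { "AT", "ATD", "OK" };
enum { N_KEYS = 3 };

static int pos = 0;       /* chars matched so far in this word       */
static int last_hit = -1; /* keyword fully spelled out, if any       */
static int dead = 0;      /* mismatch seen; wait for next delimiter  */

/* Feed one char; on WORD_END/COMPLETE, *which is the keyword index. */
int match_char(char c, int *which) {
    if (c == ' ' || c == '\r' || c == '\n') {        /* delimiter */
        int hit = dead ? -1 : last_hit;
        pos = 0; last_hit = -1; dead = 0;
        if (hit >= 0) { *which = hit; return COMPLETE; }
        return NO_MATCH;
    }
    if (dead) return NO_MATCH;
    c = (char)toupper((unsigned char)c);   /* fold case up front */
    int alive = 0;
    last_hit = -1;
    for (int k = 0; k < N_KEYS; k++)
        if ((int)strlen(keys[k]) > pos && keys[k][pos] == c) {
            alive = 1;
            if ((int)strlen(keys[k]) == pos + 1) last_hit = k;
        }
    pos++;
    if (!alive) { dead = 1; return NO_MATCH; }
    if (last_hit >= 0) { *which = last_hit; return WORD_END; }
    return PARTIAL;
}
```

Feeding it "ATD" plus a delimiter reports AT on the 'T', ATD on the 'D', and a complete ATD on the delimiter -- the superset behavior described above. A WORD_END followed by a non-delimiter (KEY##, AT+) still tells the caller which keyword was spelled.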

I have done this with data in flash, but it's such a Huge PITA coding to use it that no one even tried. That code is fast enough to match and report on the fly as file text is fed in at 200000 baud, without error and with enough cycles left over to do something else without stopping the input. What I wrote before was never so fast -- everything was a Rush Job -- but this is version 2 of what I took my sweet time doing, starting in 2015. V2 was released in Jan 2019.

Long ago I made versions of the same tool to handle user key entry on ANSI terminals, where errors got spotted, stopped with a beep, and could be finished correctly without re-entering from the start. The secretaries, engineers and business users liked that.

The way the function works, on an ANSI Terminal it can support auto-fill cleanly.

Where the usual procedures buffer chars and then attempt a match, this matches (or not) on the arrival of the final char or delimiter.
It doesn't include a numeric-evaluation function or an interpreter for symbols, and it shouldn't: it is a code tool.

One sketch will save the data to EEPROM, and the function in a demo is what you'd use in an interpreter or input-entry program.

Consider one or more MCUs on I2C or WiFi that do different things in a robot or greenhouse. Each has an interpreter and a set of keywords, so it can be commanded via serial. Each can be addressed by name or group -- perhaps LEFT ARM to begin with; the rest just watch for end of line while the left-arm parts continue to parse and lex what to do. There could be one module or many; the brain doesn't need to know which, just what to tell them. Ditto for pumps, mixers, fans, lines, tunnels, etc.
Want to add a new element? No big deal: it will need hardware, an interpreter with the new element's keyword(s), and integration in the brain -- which, since it only sends serial commands and reads inputs, can be a PC/RPi/internet host.
More capability won't need a bigger chip until it really does; put off buying Megas etc. just for one more feature, and each module has its own core to give attention to a process.
If all the modules use the same MCU, it can be bought at quantity discount (starts at 5 or 10) and inventory spares are easier to deal with.
It means a system with overhead, but that should be minimal, modular, and open to change. This is a tool for that.
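A sketch of the addressing idea, assuming (my assumption) that each command line starts with a module or group name and that input has already been upper-cased as suggested earlier; the names are hypothetical examples:

```c
#include <string.h>

/* Names this module answers to; input is assumed already upper-cased. */
static const char *my_names[] = { "LEFT ARM", "ARMS", "ALL" };

/* A line is for this module if it starts with one of our names
   followed by a space or end of line; otherwise we just watch for
   end of line while the addressed module parses the rest. */
int line_is_for_me(const char *line) {
    for (int i = 0; i < 3; i++) {
        size_t n = strlen(my_names[i]);
        if (strncmp(line, my_names[i], n) == 0 &&
            (line[n] == ' ' || line[n] == '\0'))
            return 1;
    }
    return 0;
}
```

A module that isn't addressed simply discards chars until '\n' and waits for the next line, so the brain can broadcast to one name, a group, or ALL without knowing how many modules are listening.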

I like the idea of putting the symbols in EEPROM because it means that you can change/add/remove symbols without reprogramming the system.

However, I also like the hard-coded flash idea,

and I'll offer that it might be useful to build a program in Python that can take the symbol list and build the header automagically. It sounds like the type of job a computer would be perfect for. That would remove the programming overhead, if you think that's something holding the idea back.

When I'm done making the EEPROM version work, I should be ready to try the same with writes to flash using omniboot.

I did have, and might still have, the sketch that generates PROGMEM source lines, but it didn't generate the support-data lines I had to use to make those demos. Back in 2018 I didn't try to write flash after boot! Then, with real-life problems frustrating things, COVID time came and I put the project away.

With large word sets it's possible to condense books to some degree, generate indexes, and do finds. How many unique words are in a novel, or a few of them? In enough books, though, it seems even 32-bit addressing isn't enough. I could break that down for sure, but it'd be a project I don't have time for... Looking at word-frequency charts of the Library of Congress, 65,000 words isn't enough, though a compromise allowing literals in the book data should work; the words around the 65,000th most frequent aren't all that rare!

If I broke up a huge word list to store on several matchers that all read the entry stream, so that one finds the match, it would still be a major task! Possible, but I'm not exactly young and funded, and IMO that's what it would take to try.

So for now, would 80 keywords be enough to be practical?
In flash, an Uno can be good for 2 or 3 thousand depending on average length, and with an ATmega1284 or Mega2560, many times that.
Imagine 8 or 10 of those on SPI!

Look at the AT command set. I think that is an obvious use case. Make sure you're at least big enough for that.


Oh, another cool use case would be taking input from a speech-to-text engine on some other device. The text stream goes to the Arduino and it can react to certain words or phrases.

I hadn't thought of that! It's well within what an Uno's flash can pack, though I can't say whether enough data for an assembler/disassembler plus the code would fit -- there might be others willing to try!
The Uno is junior-size anyway; larger and newer MCUs are already here.

My major peeve with PCs is OS tasking latency: too slow to run a motor.

That's already one of the possible uses I wanted addressed by those who know more than I do on the subject. If not the whole set, then subsets to cover specific devices. I remember doing AT commands with modems, but who uses those nowadays?

There's also HTML and its cousins.

These are what I mean when I write about interpreters (and single-pass data compilers; I've written that much using earlier versions of match tools).

Uno EEPROM is 1024 bytes. A word list with null terminators (print-function friendly) that fills it and leaves the support links and data in RAM works out to 256 three-letter commands, which doesn't seem practical.

I've had time to think about it. But it's not just for me.

It reads input and matches it to pre-compiled keywords with associated output, for starters. It worked really well with accounting data in the '90s; the keyword set was not large, under 1000.
It was at the heart of a single-pass compiler that looked for anomalies across a dozen data sources, and it pinned everything down within the required 1 second.