Keyword lists for tutorial requested

The lists should be <= 20 words.

The words should be all-alpha text in alpha-sorted list order.

Whole sets of words should start with the same few to several letters. Some of those should be subsets of those whole sets, reason being to show user entered chars “walk the data” through variables used by the match routine.

Because it will use Serial Monitor the tutorial may print the keyword list a lot. It should leave a few lines before the top of the list scrolls off the screen for feedback to the next char.

What I had before and might use again are sets of string.h function names like strchr and strlen.

As for the demo, everything will be kept in RAM. It is not meant to be built upon or used in something else, put the text in flash as there is code out for that able to support 2000+ keywords averaging 7 chars length on an Uno.

That stated, for short word lists in smallish sketches.. why not? I should put comments in what parts to remove for actual use.

Contemplating.

Not clear what You intend to start with. Is it a kind list being a database You want to search for a match with a given word?

Earlier a member wanted to search for a match with a rather large database in an SD. It took a lot of time reading the database. The solution was he created an index file pointing at the beginning of each line in the database. Using the interval half method via the index file the short time for a hit made him astonished….

No clue what the title of the topic was.

A large list could be divided between a string of active Mega2560’s on an SPI bus and all catch the serial stream at the same time knowing that if there’s a match, only one will match it to report on MISO.

SD is great but I’d rather keep up with highspeed text with cycles left over for a 1-pass compile or something like it. Been there with less, done it slower and still beat the competition. It might be interesting to know what baud rate an SD-based version could keep up with… starting at 1200. Might get to 9600!

Maybe a list of terms and commands to run a greenhouse via serial text? Or just for laffs, interesting close words.

This tool could be used in a multi-mcu machine to make each board able to recognize and run valid-to-it commands while other mcu’s respond to other commands. Projects that got too big for an Uno might run fine on 2 or more 328P boards and pass data/control between each other.

With an ANSI terminal, the word list can support auto-fill very well.

It can support user key entry especially to limited word sets.

I need to dig out my old-old number-entry code to finish this but for now, parse&lex aka keyword match is enuff.

May be look at commands with no parameters from the core 3GPP AT command specifications which are implemented by most cellular modules on the market.

AT
ATI
AT+CSQ
AT+CREG?
AT+CGREG?
AT+CEREG?
AT+CGATT?
AT+CGDCONT?
AT+CGACT?
AT+CGPADDR
AT+CGCLASS?
AT+CGSMS?
AT+CPIN?
AT+COPS?
AT+COPS=?
AT+CLCK?
AT+CPWD?
AT+CFUN?
AT+CFUN=?
AT+CPAS
AT+CMEE
AT+CMGF?
AT+CMGS=?
AT+CMGL
AT+CMGR
AT+CSCA?
AT+CNMI?
AT+CMUX?
AT+CGMI
AT+CGMM
…

Or look at some ASCII protocols like HTTP or FTP and find common commands with no parameters. For example for FTP you could pick


ABOR
ALLO
CDUP
EPSV
FEAT
HELP
MODE
NOOP
PASV
PWD
QUIT
REIN
STAT
STRU
SYST
TYPE
XPWD

Or for older vi fans some common commands with no params

:args
:checktime
:earlier
:exit
:help
:help!
:ls
:later
:next
:prev
:quit
:quit!
:version
:write
:writequit

That AT list looks good for showing branching when the entered char does not match the current candidate list word.

TKS!

Is that the return of A lex+yacc where the parser reads the input one character at a time, groups characters into tokens using lex, and then yacc applies grammar rules to these tokens to build the syntax structure ?

Aha .. A school project …

Oh no, NOT GRAMMAR!

No, it parses the text between delimiters one char at a time and when the closing delimiter gets read and recognized the routine knows the word matched or not matched.. and if the input text word has no match the routine likely spotted that before the closing delimiter is read. Lex is match entry to known words.

The output of this could likely feed a yacc of some sort.

What baud rates can lex match on ‘your’ PC?

In that case GoForSmoke might not have learned a lot in the 14 years that he is a member here :smiley:

Long time in prep school

:grinning_face:

I never tried - old stuff for sure but quite rich.

I suppose you rely on an alphabetically sorted list and maintain a min and max for the likely tokens, narrowing down the possibilities as each new character gets in or do you use a btree ?

this is one reason why I make the tutorial, so people can use the code.

I do keep a word list with an index list pointing to the start of each one. So the list can be in any order, the links are sorted.

I keep a first word for each starting-char, the 1st char read uses this to find the 1st candidate match word if there is one, else no-match status.

I keep a ‘magic number’ for every list word. It is the number of chars this word begins with that match previous word start chars. It’s just shorter to say ‘magic’.

One day I looked at sorted lists and saw a pattern in words that start with the same chars and what that means; when the next input char does not match the next candidate char, there’s no need to match the previous chars in the next list word if you know that many letters do match a priori. Comparing how many chars have been matched to that value shortens the search that happens whenever the new char doesn’t match the current candidate list word. That section of the original and later code are heavily commented. Short story for speed, it’s like a f__king miracle.

This is a kind of data-walker, it keeps track of itself and steps as short as necessary though the data could have many words to step through here and there but when I fed the speed test its own list 1000000 times it averaged 144 cycles, ~9 micros per matched char on an Uno fed as fast as it would take, not serial.

The goal is to have match or no-match (token) when the final char arrives in order to operate on-the-fly with cycles left over to do useful processing without interrupting the stream. Get it Real Time so to say.

LOL warning, the routine seems to block almost 10 micros!

I got here at the end of 10 years medical trauma after a series of operations in 2000. I come here to work on my recovery.

I learned a lot more before 2000, writing code to feed the wolf from 1980 till 1999. I was sharp then. I forgot more C++ than I knew since and some parts actually hurt to think of… there’s holes.

If the command vocabulary is fixed, you can sort it alphabetically once and use min/max indices to track the range of words that still match the characters received so far. Each new character narrows this range. When the separator arrives, you treat it as matching \0, so the same logic applies. If min surpasses max, the input cannot match any command.

A common mistake is forgetting backtracking. For example, if the valid command is AABC but the noisy stream delivers AAABC, a naive walker accepts AA, expects B, gets A, and rejects. Many implementations then reset the parser at the wrong position, skipping viable matches. They might continue with ABC or BC, both of which fail, even though AABC was present.

You can combine both approaches: use the min/max narrowing to save memory and speed up matching, and also maintain a buffer as long as the longest command to rescan when mismatches occur. This way you don’t sacrifice efficiency for robustness, and the parser can recover from noise without losing the benefits of the character-by-character walker.

Sorting is done manually by the developer when creating the array, which is acceptable as a one-time effort and then on modern 32-bit MCUs (unlike small AVRs), the maximum command length can be computed at compile time with constexpr variables and functions, so no dynamic allocation is needed.

It is much easier and leaner to just follow the first word that matches the input chars until one does not and then the magic numbers tell me which to check or not, if the match process continues or no match is possible.

Here is the code for when an input char != list word char. It looks for the next lowest in alpha-order word that so-far matches all of the so-far input chars. A no-match ends searching except for the word end.

      else // find an alternate match or no_match
      {
        current++;

        // we have chars matched and how many chars of the current word the next word matches
        // if that number is smaller than already matched chars then no_match
        // if that number is bigger then try the next list word
        // if that number is the same then try to match the next char to text,
        // ---- if they match then new current word and charsMatched++
        // ---- else try the next list word

        while ( current < keys )
        {
          charToMatch = pgm_read_byte( charsMatchingPrevWord + current );

          if ( charToMatch < charsMatched )
          {
            matchState = no_match;
            break; // break from the while loop
          }
          else if ( charToMatch > charsMatched )
          {
            current++;
          }
          else // check this one
          {
            curPtr = (PGM_P)pgm_read_word( key_list + current );
            charToMatch = pgm_read_byte( curPtr + charsMatched );

            if ( text == charToMatch )
            {
              charsMatched++;
              charToMatch = pgm_read_byte( curPtr + charsMatched );

              if ( charToMatch == 0 )
              {
                matchState = list_matched;
              }
              break;  // break from the while loop
            }
            else
            {
              current++;
            }
          }
        }

        if ( current >= keys )
        {
          matchState = no_match; // no_match
        }
      }
      break;

Here is the complete old speed test that comes from:

// serial_word_match_speed_test ver 1.0 Feb 11, 2019 by GoForSmoke on the Arduino.cc forum.

/*
  User defines a PROGMEM array to hold a sorted word list

  1-pass match/no_match keywords and evaluate numbers in text stream/data.
*/

void usage()
{
  Serial.println( F( "\nserial_word_match" ));
}


const int keys = 32; // put this on a 1284P, there could be > 256 words in the list

enum keywd { boffms, bonms, bpin, bstrtms, bstat, hlp, prnt, rst, st, strt, stp };

// the keys must be in alpha order, all start with lowercase alpha
const char key000[] PROGMEM  = "blinkOffMs";
const char key001[] PROGMEM  = "blinkOnMs";
const char key002[] PROGMEM  = "blinkPin";
const char key003[] PROGMEM  = "blinkStartMs";
const char key004[] PROGMEM  = "blinkState";
const char key005[] PROGMEM  = "help";
const char key006[] PROGMEM  = "memccpy";
const char key007[] PROGMEM  = "memchr";
const char key008[] PROGMEM  = "memcmp";
const char key009[] PROGMEM  = "memcpy";
const char key010[] PROGMEM  = "memmem";
const char key011[] PROGMEM  = "memmove";
const char key012[] PROGMEM  = "memrchr";
const char key013[] PROGMEM  = "memset";
const char key014[] PROGMEM  = "print";
const char key015[] PROGMEM  = "reset";
const char key016[] PROGMEM  = "set";
const char key017[] PROGMEM  = "start";
const char key018[] PROGMEM  = "stop";
const char key019[] PROGMEM  = "strcat";
const char key020[] PROGMEM  = "strchr";
const char key021[] PROGMEM  = "strcmp";
const char key022[] PROGMEM  = "strcpy";
const char key023[] PROGMEM  = "strlen";
const char key024[] PROGMEM  = "strlwr";
const char key025[] PROGMEM  = "strstr";
const char key026[] PROGMEM  = "strupr";
const char key027[] PROGMEM  = "time";
const char key028[] PROGMEM  = "wait";
const char key029[] PROGMEM  = "wander";
const char key030[] PROGMEM  = "watch";
const char key031[] PROGMEM  = "win";

const byte charsMatchingPrevWord[ keys ] PROGMEM =
  //   0-4,       5-9,      10-14,     15-19       20-24     25-29    31-32
{ 0, 6, 5, 5, 8, 0, 0, 4, 4, 4, 3, 4, 3, 3, 0, 0, 0, 1, 2, 2, 4, 4, 4, 3, 3, 3, 3, 0, 0, 1, 2, 1 };

const int a_z_starts[ 26 ] PROGMEM =
  // a,b, c, d, e, f, g,h, i, j, k, l,m, n, o, p, q, r, s, t, u, v, w, x, y, z
{   -1, 0, -1, -1, -1, -1, -1, 5, -1, -1, -1, -1, 6, -1, -1, 14, -1, 15, 16, 27, -1, -1, 29, -1, -1, -1 }; // a-z

PGM_P const key_list[ keys ] PROGMEM =
{ // these are pointers to the words in the list
  key000, key001, key002, key003, key004, key005, key006, key007,
  key008, key009, key010, key011, key012, key013, key014, key015,
  key016, key017, key018, key019, key020, key021, key022, key023,
  key024, key025, key026, key027, key028, key029, key030, key031
};

char text;
int current = -1;  // current match word index, may be -1 for none
PGM_P curPtr;  // pointer into PROGMEM to get word list chars
byte charsMatched, charToMatch;  // no key words over 255 chars....
byte matchState; // match_start, matching, number, no_match, list_matched, finished_match

byte maxDigits = 5;
unsigned long ulAccum;

enum matchType { match_start = 0, matching, number, no_match, list_matched, finished_match };

void listWords()
{
  int i;
  byte j;
  long listLen = 0;

  Serial.println( "\n\n\n\n\n\n\n\nword list" );
  for ( i = 0; i < keys; i++ )
  {
    Serial.print( "key " );
    Serial.print( i );
    Serial.print( " - " );
    curPtr = (PGM_P)pgm_read_word( key_list + i );
    j = 0;
    while (( charToMatch = pgm_read_byte( curPtr + j++ )) > 0 )
    {
      listLen++;
      Serial.write( charToMatch );
    }
    Serial.println();
  }

  Serial.print( "\nTotal list characters = " );
  Serial.println( listLen );

  Serial.println( "\na - z starts" );
  for ( j = 0; j < 26; j++ )
  {
    i = pgm_read_word( a_z_starts + j );
    if ( i >= 0 )
    {
      Serial.write( 'a' + j );
      Serial.print( " " );
      Serial.println( i );
    }
  }

  Serial.println( "\n\nchars that match previous word chars" );
  for ( i = 0; i < keys; i++ )
  {
    j = pgm_read_byte( charsMatchingPrevWord + i );
    Serial.print( "key " );
    Serial.print( i );
    Serial.print( " - " );
    Serial.println( j, DEC );
  }
}



/*
   need int indexes to current and alternate words and byte chars matched offset.

   states - match_start, matching, number, no_match, list_matched, finished_match
*/


void match() // uses globals, new char is in char text
{
  if ( matchState == finished_match )  matchState = match_start;

  switch ( matchState ) // match_start, matching, number, no_match, list_matched, finished_match
  {
    case match_start :
      if ( text >= 'a' && text <= 'z' )
      {
        current = int( pgm_read_word( a_z_starts + ( text - 'a' ) ));
        if ( current < 0 )  // no list word starts with the text letter
        {
          matchState = no_match; // no_match
        }
        else
        {
          matchState = matching;
          charsMatched = 1;
        }
      }
      else
      {
        matchState = no_match; // no_match
      }
      break;

    case matching : // matching
    case list_matched :
      if ( text == ' ' || text == '\n' )
      {
        if ( matchState == list_matched )      matchState = finished_match;
        else if ( text == '\n' )    matchState = match_start;
        else                        matchState = no_match;
        break; // exit the case
      }

      curPtr = (PGM_P)pgm_read_word( key_list + current );
      charToMatch = pgm_read_byte( curPtr + charsMatched );

      if ( text == charToMatch ) // matches the list word so far
      {
        charsMatched++;
        charToMatch = pgm_read_byte( curPtr + charsMatched );

        if ( charToMatch == 0 )
        {
          matchState = list_matched; // matched a list word
        }
      }
      else // find an alternate match or no_match
      {
        current++;

        // we have chars matched and how many chars of the current word the next word matches
        // if that number is smaller than already matched chars then no_match
        // if that number is bigger then try the next list word
        // if that number is the same then try to match the next char to text,
        // ---- if they match then new current word and charsMatched++
        // ---- else try the next list word

        while ( current < keys )
        {
          charToMatch = pgm_read_byte( charsMatchingPrevWord + current );

          if ( charToMatch < charsMatched )
          {
            matchState = no_match;
            break; // break from the while loop
          }
          else if ( charToMatch > charsMatched )
          {
            current++;
          }
          else // check this one
          {
            curPtr = (PGM_P)pgm_read_word( key_list + current );
            charToMatch = pgm_read_byte( curPtr + charsMatched );

            if ( text == charToMatch )
            {
              charsMatched++;
              charToMatch = pgm_read_byte( curPtr + charsMatched );

              if ( charToMatch == 0 )
              {
                matchState = list_matched;
              }
              break;  // break from the while loop
            }
            else
            {
              current++;
            }
          }
        }

        if ( current >= keys )
        {
          matchState = no_match; // no_match
        }
      }
      break;

    case no_match :
      if ( text == '\n' ) // reads and discards serial until newline
      {
        current = -1;
        charsMatched = 0;
        matchState = match_start;
        Serial.println( "\n error\n" );
      }
      break;

    case number :
      if ( text >= '0' && text <= '9' )
      {
        if ( charsMatched++ < maxDigits )
        {
          ulAccum *= 10UL;
          ulAccum += byte( text - '0' );
        }
        else  matchState = no_match; // exceeded max length
      }
      else if ( text == ' ' || text == '\n' )
      {
        matchState = finished_match;
      }
      break;
  }
}

void hitAny()
{
  while ( Serial.available() < 1 );
  while ( Serial.available() > 0 )
  {
    Serial.read();
    delay( 250 ); // in case autorepeat is held down
  }
}

void speedTest( int count )
{
  static int i;
  static byte j;

  while ( count-- > 0 )
  {
    for ( i = 0; i < keys; i++ )
    {
      curPtr = (PGM_P)pgm_read_word( key_list + i );
      j = 0;
      while (( text = pgm_read_byte( curPtr + j++ )) > 0 )
      {
        match();
        if ( matchState == no_match )
        {
          Serial.print( "test no_match fail at count " );
          Serial.print( count, DEC );
          Serial.print( "   word " );
          Serial.print( i );
          Serial.print( "   char " );
          Serial.println( j );
          return;
        }
      }
      text = '\n';
      match();
      if ( matchState != finished_match ) 
      {
          Serial.print( "test state " );
          Serial.print( matchState );
          Serial.print( " fail at count " );
          Serial.print( count, DEC );
          Serial.print( "   word " );
          Serial.print( i );
          Serial.print( "   char " );
          Serial.println( j );
        return;
      }
    }
  }

  return;
}
//===========================================================

void setup()
{
  Serial.begin( 115200 );
  Serial.flush();
//  listWords();
//  hitAny();
  usage();

  unsigned long startms = millis();

  speedTest( 1000 );

  if ( matchState == finished_match )
  {
    startms = millis() - startms; // startms now has elapsed millis
    Serial.print( "\n millis = " );
    Serial.println( startms );
  }
}


void loop()
{
}
 

I don’t want to deal with more of the data than I need to. Enough to match or no-match each char as it arrives. If it doesn’t match the list word currently being followed, it’s time to find the next word that can using the precompiled “how many chars of the current word the next word matches” ( pairs with next word).

use min/max indices to track the range of words that still match the characters received so far

The magic number coupled with chars-already-matched is my case by case guide to that with no more prep-work needed but I can see how ‘the magic’ could make that very quick work since the first word with magic number < chars-matched cannot match nor can any word after that.

In a sorted list with many words starting with the same base, the next word from each will have the most matching start chars. Past that, the data makes the pattern and the number of chars already matched limits next match.. is inside of that min-max already.

But you are welcome to upgrade the code. The last member to try left the precompiled data out which did pass the positive test I used for speed (worse-casing by forcing all-chars-matched with correct words, no-matches use way fewer cycles) so his passed false matches too. Except for that his hex file was a few bytes smaller than my spaghetti-hack and a tiny bit faster especially giving wrong answers. I got the feeling that making it work right would have resulted in bigger slower code than my mess, otherwise it would have been posted.

My answer is fix the serial even if by lowering the baud rate.

My routine does no backtrack. It does on-the-fly pretty fast.

Thanks for sharing.

This makes sense and seems indeed pretty effective.
The backtracking might not be needed, working with non garbled input where commands arrive clearly with delimiters can be an acceptable constrain.

I’ll try to implement something €probably later though as I’m busy with some home improvement work at the moment)

It will be an interesting tutorial if you get to write it up. Any intent to make that as a small word matching library ?

First I want to use a short list and have the user enter text via serial monitor and lay out the match process using variables from the code.

One of the list words will likely be ‘list’ and cause a list to be printed. That’d be a demo command. Others’d be necessary.

Each char processed needs to generate debug print info enough that with a copy of the source, the working should be bleedin obvious. If I do a good job, that happens.

Other apps… there’s so many. I started by 1981 just replacing Basic $INKEYS that let bad data in the company code, half or more of the crashes were due to bad data. That led down a trail….

In RAM? Sure, decide how many words (2 byte pointer each plus terminating zero each plus total words-chars plus 1 byte magic number each plus 2 bytes per legal start char.. say 26 alphas) might fit in 256 bytes? If the average length is 8 that’s ( 256 - 52 ) / ( 8 + 4 ) = 17 words. 400 bytes gets it up to 29.

One of the match modes is for when the current list word is matched but the delimiter has not arrived. There may be list words like:

rumple

rumpled

rumplestiltskin

the first two need the delimiter match but if your list has no superset words the provisional match can be used to build full words from a list of partial words/syllables. There’s many ways to go with that just off the top… and more overhead space and cycles.

If I put non-overlapping dictionaries on ESP boards, I could achieve a large vocabulary over wifi.

Another thing is once you have the word number/token, is it okay to have the associated data on SD or will it need to be on hand as quickly as words get identified, perhaps the associated data is to perform some task n-a-o-w.

Seems you have found something to keep you busy all the upcoming winter :slight_smile: