Referring to a word bank

Could anybody point me toward a technique i could use to store words in the arduino code to reference against in a twitter stream. For example in the code i want to have a bank of words that i would associate with a 1 and a bank of words i would associate with a -1. I would then read the twitter string and see if any of the words contained therein we're also in one of the word banks. If they were then i would either +1 or -1 to that twitter strings initial score of 0 , therefore declaring the twitter string as positive or negative.
Any help is much appreciated. Thanks .

How many words do you want these word banks to contain, and will these words be defined during compile time or do you want to be able to specify or change them dynamically?

If you want to be able to specify them dynamically you are going to be very limited on the size of these word banks as the Arduino has very little RAM. If you can define them statically, then you can use PROGMEM to handle larger word banks in flash memory (but there will still be a finite limitation on the size).

The Uno has 2k of RAM and ~30k of Flash to work with.

There are several ways you could do this. You need an array, obviously, but that array could be an array of char * or an array of String objects.

Which one to use depends to a large extent on what you have to compare to. Are you collecting the twit data in a String object or in a char array?

The ideal technique is to not store the actual words but hash values of the words. Your application will be able to recognize considerably more words and it will be able to recognize words using less CPU time.

Storing hash values for dictionaries can be done with a Bloom filter - Bloom filter - Wikipedia -
Some python code - http://www.coolsnap.net/kevin/?p=13 - that could serve as a basis for a C++ class

Furthermore: use external memory if more storage is needed for the (dynamic) bitmap or wordlist, EEPROM or an SD.

Some questions arise:

Wordlist size?
How much tweets per minute?
How fast must the wordmatch algorithm be?

Unless you have a small number of words, or you can just store them in PROGMEM, you are going to run out of storage space, as the others said.

With a small amount of extra hardware you could add 512 Kb of extra storage as I describe here:

Or you could add 32 Kb using a somewhat easier-to-work-with chip using I2C:

I'm inclined to agree with Coding Badly too, something like a 2 or 4-byte hash keeps the size down, and also simplifies the search.

I recently improved the Regular Expression library that I adapated from the Lua regexp search.

One of the functions in that is a search that iterates over a string, and calls a function for each match. So you could set up a regular expression to match a word (eg. "%a+") and run that over your Twitter feed. For each match the function is called. The function would then hash the word fed to it, and then lookup in PROGMEM or EEPROM for a matching hash. Reasonably fast and quite simple.

With 32 Kb of EEPROM, and using a 4-byte hash for each word, obviously you could store 8192 words, which may well be adequate. But if you need a positive/negative flag for each word, then you might need to store 5 bytes for each word (or maybe borrow the low-order hash bit).

I'm trying to remember my hash probability stuff. I vaguely recall that a 32-bit hash would reasonably reliably avoid collisions for 2^16 elements (65536). This may or may not be acceptable for you. If not you may want to increase the hash size. But then, if the hash reaches something like 8 bytes, you may as well just store the words. Or the Bloom filter might help there - could be a lot quicker too.

PaulS:
There are several ways you could do this. You need an array, obviously, but that array could be an array of char * or an array of String objects.

Which one to use depends to a large extent on what you have to compare to. Are you collecting the twit data in a String object or in a char array?

Im probably going to have a maximum of 50 words in each bank. I suppose using an array of strings would be best? I will probably compile the twitter data as a string object and string split it up.

If you're going to have 50 words/bank, hashing is definitely the way to go - 50 words * 10 bytes/word = 500 bytes per bank, and a full 1kb dedicated to storing strings is more than you can usually afford.

Now the question becomes a small and fast hashfunction?

As a word consist from letters which can be lowercased, so there are only 26 values => 5 bits/char

An interesting code might be Soundex - http://en.literateprograms.org/Soundex_(C) - extended to include more digits.

The 4 bytes as Nick proposed could be one byte for the first char, leaving 3 bytes for digits, BCD encoded thats room for 6 digits.

Hashing works somewhat ok with small dictionaries and long words. A more classic approach is to use a Trie preferably with suffix compression as you just care about presence of a word and don't have any payload associated with it. The data structures can be stored in a flat array sometimes even packing it to save some more space. This gives fairly compact directories, although they need a little processing to build.

Korman