Hashing function correctness and uniqueness.

Hi.
I am trying to find a good hashing function for Arduino and have found a couple. I would like to ask if they will work 100% with what I am doing. By that I mean that if they will generate a unique hash for the input I give it.

So here is the input:

unsigned int inputData[16]={ 
 0b0000001000000000,
 0b0000010000000000,
 0b0000000100000000,
 0b0000000010000000,
 0b0000000001000000,
 0b0000000000100000,
 0b0000000000010000,
 0b0000000000001000,
 0b0000000000000100,
 0b0100000000000000,
 0b0010000000000000,
 0b1110000000000000,
 0b0000000000000010,
 0b0000000000000001,
 0b0100000000000000,
 0b0010000000000000};

Then I convert it to a char array by first adding each integer into a string object and then using that to construct the char array. If there is a better way then please let me know:

String s;
  for(int i = 0; i < 16; i++)
  {
    s += String(inputData[i]);
  }
  
  char buf[s.length()+1];
  s.toCharArray(buf,s.length()+1);

Then buf array will hold these values:

5121024256128643216841638481925734421163848192

This conversion needs to be done because the algorithms that I have found require a char array.

The first one is an MD5 hash algorithm: MD5 hash algorithm for Arduino
This one will generate unique enough hash for my purpose but the hash it generates is a bit memory intensive since it returns 32 chars.

The second one is Jenkins hash function Jenkins one at a time hash function.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

This one returns an integer which is great for my purpose but I am not sure about the uniqueness of it.

So I have a few questions:
Does someone know how to make some tests to check for the uniqueness of the Jenkins function or to calculate how unique it is?
Does someone know an alternative hashing function that will provide me with uniqueness and a relatively space restricted output?
Is it possible to generate a shorter output from MD5?

Cheers!

Copying to a String and then hashing is not the best idea. As AWOLCoding Badly suggested you should type cast the array data from unsigned int to char and pass it directly.

I think it would be something like...

 uint32_t myHash = jenkins_one_at_a_time_hash((char*)inputData, sizeof(inputData) * 2);

The (char*) is type casting from unsigned int and the 'sizeof(inputData) * 2' is because there are double the number of bytes/chars (8 bits) compared to unsigned int's (16 bits).

Why do you need to create a hash? How much input data do you have? How much do collisions matter?

i.e. What are you trying to do?

MD5 is deprecated - use SHA256 instead if you need a cryptographically strong hash

But what actually is your requirement for the hash? Just for a hash-table?

wildbill: Why do you need to create a hash? How much input data do you have? How much do collisions matter?

i.e. What are you trying to do?

See this thread for why OP is interested in hashing.

Riva, yes. Thank you. I should have included the link in my post :)

Riva: I think it would be something like...

 uint32_t myHash = jenkins_one_at_a_time_hash((char*)inputData, sizeof(inputData) * 2);

I will try that out.

MarkT, yeah. I know that it is deprecated but I do not need cryptographically strong hash. It is more important that it is relatively fast and provides an output that does not take much memory. See the link provided in Riva's post to my other thread for why I need hashing function.

Riva: ...the 'sizeof(inputData) * 2' is because there are double the number of bytes/chars (8 bits) compared to unsigned int's (16 bits).

My Buggy Sense is going crazy.

Hashes are by definition not unique. No hash algorithm will guarantee that two different inputs won't result in identical outputs. A digital signature algorithm does try to make sure that there aren't two valid inputs which will result in the same signature but it's not impossible - the real strength of the algorithm is measured by "How hard is it to find the other input that has this signature?"

What is the real purpose? Compression? You want the 16-word array compressed because you want to store many of those arrays? A key or signature?

MorganS: What is the real purpose?

Asked and answered. Please try to keep up.

No hash algorithm will guarantee that two different inputs won't result in identical outputs.

You are wrong. There is an entire field in Computer Science that studies such hash functions. Bob Jenkins has some information and source code on his site... http://burtleburtle.net/bob/hash/perfect.html

For @simmisj, a minimal perfect hash function is not necessary. (However, it would be an interesting exercise to use one.)

But, given the small data space (512 states) finding a perfect 32 bit hash function is almost trivial.

The kind of has here is one specifically to reduce the size of the data, so perfect hashes are not relevant.

I am puzzled why the OP asked for a unique hash given that - this rather implied cryptographic hashes which are for all practical purposes unique.

But I've answered the underlying issue in the original thread on Game-of-Life - you don't hash, just use the live cell count.

MorganS: No hash algorithm will guarantee that two different inputs won't result in identical outputs.

[quote author=Coding Badly date=1426548970 link=msg=2143175] You are wrong. There is an entire field in Computer Science that studies such hash functions. Bob Jenkins has some information and source code on his site... http://burtleburtle.net/bob/hash/perfect.html

For @simmisj, a minimal perfect hash function is not necessary. (However, it would be an interesting exercise to use one.)

But, given the small data space (512 states) finding a perfect 32 bit hash function is almost trivial. [/quote] Given a larger input space than output space, a hash function will have collisions.

In the case at hand, you cannot compress an arbitrary 512-bit input into a 32-bit hash value without collisions (see the Pigeonhole Principle).

If it were possible to hash any 512-bit input to a 32-bit output with no collisions, we would use this imaginary hash function recursively to compress large data sets, such as a whole movie, down to a single 32-bit value. And then the MPAA would prosecute everyone who has ever transmitted a 32-bit value across the Internet because it matches a 32-bit value that they "own".

christop:
In the case at hand, you cannot compress an arbitrary 512-bit input…

Apparently, my use of the word “state” confused you.

A 512 bit value has 2512 states.

A 9 bit value has 29 states which is 512 states. (This would be a good point to reread my post.)

You are correct. It is impossible to find a minimal perfect hash function that can reduce a 512 bit value to a 32 bit value.

However, that is not even close to what we are trying to do.

It is trivial to find a minimal perfect hash function that can “reduce” a 9 bit value to a 32 bit value.

But, as I said, and, you apparently did not bother to read, it makes no difference for this application. Any good hash function will work.

MarkT: But I've answered the underlying issue in the original thread on Game-of-Life - you don't hash, just use the live cell count.

I believe that would involve a slightly more complicated pattern matching algorithm: has this sequence of live cells occurred in the last N iterations. But, it would probably use less memory.

With hashing the detection is simple: has the current hash value occurred in the last N iterations.

I suspect either would give good results.

Actually we’re both wrong. The OP has 256 bits of state (16 * 16). I don’t know where you got 512 states (29) from. If there were only 9 bits of state it would obviously be wasteful to use a 32-bit hash.

christop:
Actually we’re both wrong. The OP has 256 bits of state (16 * 16). I don’t know where you got 512 states (29) from.

Based on your “logic”, a 1 by 1 matrix of boolean values has 1*1 = 1 state. I guess that’s a degenerative “boolean” value.

If there were only 9 bits of state it would obviously be wasteful to use a 32-bit hash.

Uh huh. I will believe you when you can work out how I arrived at 512.

Actually we’re both wrong.

I agree. 512 is not correct. I did make a mistake.

No, based on my logic, 1 bit of state has 21 = 2 states, and 256 bits of state has 2256 = 115792089237316195423570985008687907853269984665640564039457584007913129639936 states. I’m using “N bits of state” to mean “2N states”. Perhaps that’s where the confusion lies.

I don’t know where you got either 512 or 9 from. Maybe you originally thought the array was 16 8-bit values (16*8 = 512)? I also know the OP mentioned in the Conway’s Game of Life thread that he wants to save 9 previous generations to detect stagnation. Could that be it?

And why wouldn’t you believe me about the wastefulness of using a 32-bit hash on a 9-bit input? The math is plain and simple. It wastes 32 - 9 = 23 bits. The simplest hash function h(x) = x needs only 9 bits of output. It is also fast and guaranteed to be unique for all 512 possible input values.

...the 'sizeof(inputData) * 2' is because there are double the number of bytes/chars (8 bits) compared to unsigned int's (16 bits).

[quote author=Coding Badly link=msg=2142967 date=1426539174] My Buggy Sense is going crazy. [/quote] To be complete and more portable should that have been sizeof(inputData) * sizeof(inputData[0])?

According to my Uno...

unsigned int inputData[16]={ 
 0b0000001000000000,
 0b0000010000000000,
 0b0000000100000000,
 0b0000000010000000,
 0b0000000001000000,
 0b0000000000100000,
 0b0000000000010000,
 0b0000000000001000,
 0b0000000000000100,
 0b0100000000000000,
 0b0010000000000000,
 0b1110000000000000,
 0b0000000000000010,
 0b0000000000000001,
 0b0100000000000000,
 0b0010000000000000}; 

void setup() 
{
  Serial.begin( 115200 );
  Serial.println( sizeof( inputData ) );
}

void loop() 
{
}

sizeof( inputData ) = 32. The size of the array in bytes. Just what we need for the call to jenkins_one_at_a_time_hash. :wink:

Silly me, getting confused (and wrong, should be / not *) the code to determine the number of elements in the array and not the number of bytes.
Thanks for pointing out my err

Just a thought. Isn't a crc much less computational intesive than a hash function?