I was surprised to discover that there isn’t a good cross-platform hash function defined for strings. MD5, SHA, FNV, etc., all define hash functions over bytes, meaning that the hash of a string is under-specified: the result depends on which encoding is used to turn the string into bytes.
So I set out to create a standard 32-bit string hash that would be well defined for implementation in all languages, have very high performance, and have very good hash properties such as distribution. After evaluating all the options, I settled on using Bob Jenkins’ lookup3 as a base. It’s a well-studied and very fast hash function, and the hashword variant can work with 32 bits at a time (perfect for hashing Unicode code points). It’s even faster on the latest JVMs, which can translate pairs of shifts into native rotate instructions.
The only problem with using lookup3 hashword is that it includes the length in the initial value. This would suck some performance out, since directly hashing a UTF-8 or UTF-16 string (Java) would require a pre-scan to get the actual number of Unicode code points. The solution was to simply remove the length factor, which is equivalent to biasing initval by -(numCodePoints*4). This slightly modified lookup3 I define as lookup3ycs.
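To make the pre-scan cost concrete, here is a small Java sketch (the class name, helper name, and sample string are made up for illustration). It shows that the number of UTF-16 code units in a `String` is not necessarily the number of code points, so a length-dependent initial value would need an extra pass over the text before hashing could start:

```java
final class CodePointCountSketch {
    // Counts Unicode code points by scanning the UTF-16 code units.
    // This is an O(n) pass, which is the pre-scan a length-seeded hash would need.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which UTF-16 encodes as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("code points:       " + countCodePoints(s));
    }
}
```

For pure ASCII text the two counts agree, but in general you can't know that without looking at every character.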
So the definition of the cross-platform string hash lookup3ycs is:
The hash value of a character sequence (a string) is defined to be the hash of its Unicode code points, according to lookup3 hashword, with the initval biased by -(length*4).
So by definition
lookup3ycs(k,offset,length,initval) == lookup3(k,offset,length,initval-(length*4))
and
lookup3ycs(k,offset,length,initval+(length*4)) == lookup3(k,offset,length,initval)
An obvious advantage of this relationship is that you can use lookup3 if you don’t have an implementation of lookup3ycs.
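As a rough illustration of that relationship, here is a minimal, unoptimized Java sketch. It is based on the hashword routine from Bob Jenkins’ public-domain lookup3.c and is not the optimized version. The class name `Lookup3Sketch` and the helper `hashWords` are my own names for this example, and the input is assumed to already be an array of code points.

```java
final class Lookup3Sketch {
    // Shared core (hypothetical helper): runs hashword's rounds from a given start state.
    private static int hashWords(int[] k, int offset, int length, int seed) {
        int a = seed, b = seed, c = seed;
        int i = offset;
        while (length > 3) {
            a += k[i]; b += k[i + 1]; c += k[i + 2];
            // mix(a,b,c) from lookup3.c
            a -= c; a ^= Integer.rotateLeft(c, 4);  c += b;
            b -= a; b ^= Integer.rotateLeft(a, 6);  a += c;
            c -= b; c ^= Integer.rotateLeft(b, 8);  b += a;
            a -= c; a ^= Integer.rotateLeft(c, 16); c += b;
            b -= a; b ^= Integer.rotateLeft(a, 19); a += c;
            c -= b; c ^= Integer.rotateLeft(b, 4);  b += a;
            length -= 3;
            i += 3;
        }
        switch (length) {
            case 3: c += k[i + 2]; // fall through
            case 2: b += k[i + 1]; // fall through
            case 1: a += k[i];
                // final(a,b,c) from lookup3.c
                c ^= b; c -= Integer.rotateLeft(b, 14);
                a ^= c; a -= Integer.rotateLeft(c, 11);
                b ^= a; b -= Integer.rotateLeft(a, 25);
                c ^= b; c -= Integer.rotateLeft(b, 16);
                a ^= c; a -= Integer.rotateLeft(c, 4);
                b ^= a; b -= Integer.rotateLeft(a, 14);
                c ^= b; c -= Integer.rotateLeft(b, 24);
                break;
            default: // length == 0: nothing left to mix
                break;
        }
        return c;
    }

    // Plain lookup3 hashword: the start state includes length*4.
    static int lookup3(int[] k, int offset, int length, int initval) {
        return hashWords(k, offset, length, 0xdeadbeef + (length << 2) + initval);
    }

    // lookup3ycs: same rounds, but the length factor is left out of the start state.
    static int lookup3ycs(int[] k, int offset, int length, int initval) {
        return hashWords(k, offset, length, 0xdeadbeef + initval);
    }

    public static void main(String[] args) {
        int[] cp = "hello".codePoints().toArray(); // code points of the string
        int n = cp.length;
        int initval = 0;
        // Both comparisons follow from the definition, using wrap-around int arithmetic.
        System.out.println(lookup3ycs(cp, 0, n, initval) == lookup3(cp, 0, n, initval - n * 4));
        System.out.println(lookup3ycs(cp, 0, n, initval + n * 4) == lookup3(cp, 0, n, initval));
    }
}
```

Since the two functions differ only in the starting state, a library that only ships lookup3 hashword can produce lookup3ycs values by subtracting length*4 from the initval it passes in.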
Here’s my optimized version for Java
Update: I’ve also included a 64-bit version called lookup3ycs64