Non-cryptographic hash functions are used where speed is important and the possibility of attacking the characteristics of a function is not so important. Recently, an attack on the algorithmic complexity of hash tables by creating multiple collisions of a hash function that can lead to DoS has been actively discussed. We will consider modern non-cryptographic hash functions, conditions for their use, possible methods of protection against attacks on hash tables, and why it turned out that this is not so easy to fix.

Non-cryptographic hash functions
If everyone has a lot of cryptographic hash functions, then little is known about non-cryptographic (general-purpose hash functions). Non-cryptographic functions are used where data is not affected by third parties (the attacker). For example, such functions can be used to build hash tables.
Criteria that are important for non-cryptographic hash functions:
- Output values ​​should have a uniform distribution (uniform distribution). This criterion is important for building hash tables so that the search speed is optimal.
- Each input bit must, in 50% of cases, affect each output bit (avalanche is an avalanche criterion ). This criterion is important for using a hash as a unique document identifier. Also, sometimes the hash can have a uniform distribution, but fail the avalanche criterion - in this case there will be keys for which the hash will not give a uniform distribution.
- There must be no correlation between the pairs of bits at the output. This will not be considered, because usually there is no problem with this criterion.
- Speed All of the listed hashes are much faster than cryptographic ones.

Since usually the input value space is larger than the output value space, collisions are inevitable. In an ideal function, the output values ​​have a uniform distribution, so the probability of collisions is minimal. This can be checked chi-square test (χ²).
')
FNV-1a
2003 Speed: 4.5 cycles per byte.
Very simple function [1] (RFC [5]).
1. Uniform distribution: poor, up to 14 bits can be used for a hash table (up to 2
14 hash table slots). Slightly better in the FNV-1a variation, which is recommended by default.
2. Avalanche test: bad. Also bad in FNV-1a.
Both criteria are significantly improved in modified FNV-1a [2], where at the end mixing is added to FNV-1a.
Bob Jenkins' Hash (lookup3)
2006, length 32 and 64 bits, 32-bit instructions, speed: 2.62 cycles per byte (for messages 16 bytes long).
1. Uniform distribution: excellent for all bits.
2. Avalanche test: good. There is a small problem with the upper bits.
The first version of the function was called One-at-a-Time.
CRC32
1975, length: 32, 64 and 128 bits, speed: 1.2–0.13 cycles per byte (for messages shorter than 64 bits, depending on the CPU).
1. Uniform distribution: poor distribution of lower bits. Use the upper bits for the hash table, then all tested bits can be used - up to 16 upper bits (up to 2
16 hash table slots).
2. Avalanche test: very bad. Avalanche effect is absent. The purpose of the hash is affected: its output bits are designed to detect errors in the data transmission channel. Also, the CRC has an important property: you can split the message into parts, count the hash for each part, then merge the message, and count the hash of the entire message from the hashes of its parts.
MurmurHash 2
2008, length 32 and 64 bits, 32 and 64-bit instructions.
There is a problem with the uniform distribution of some keys [3].
MurmurHash 3
2011, length 32 and 128 bits, 32 and 64-bit instructions, speed: 1.87 cycles per byte (for messages 16 bytes long).
1. Uniform distribution: good (in one case out of three “almost suspicious”).
2. Avalanche test: excellent.

CityHash
2011, length 32, 64, 128 and 256 bits, 64-bit instructions. Rather difficult function.
For the fastest option, you need a CRC32 processor instruction (SSE 4.2 - Intel Nehalem).
1. Uniform distribution: good (in one case out of three “almost suspicious”).
2. Avalanche test: good. There is a small problem with the lower bits.
Spookyhash
2011, length 32, 64 and 128 bits, speed: 0.33 cycles per byte. By Bob Jenkins
1. Uniform distribution: average (in one case out of three “suspicious”).
2. Avalanche test: excellent.
Statistical analysis of hash functions can be done using the tools [17], using the results of the analysis [2], [4], [18].
CityHash is more complex and has more bandwidth (speed on long messages) than MurmurHash. MurmurHash is simpler, has a low latency for short messages (<20 bytes).
CityHash uses CRC instructions for new processors, providing up to 0.17 cycles per byte, which is faster than all other hashes; for comparison, SHA-1 - 4.6 cycles per byte (Ivy Bridge). Without the support of these instructions, CityHash is two times slower than SpookyHash (according to the author of the latter).
For the 32-bit platform, MurmurHash 3 can be faster than SpookyHash and CityHash.
Perfect hash function
If the array of messages is known in advance, then an ideal hash function can be used: no collisions and a search in one comparison operation. Example: gperf.
If the hash function has a normal distribution, then the probability of a collision is as follows:
- for 32 bits: with a probability of 1 in 10 000 there will be a collision among 927 hashes, with a probability of 1 in 100 there will be a collision among 9 292 hashes,
- for 64 bits: with a probability of 1 in 10 000 there will be a collision among 60.7 million hashes, with a probability of 1 in 100 there will be a collision among 609 million hashes.

How real are the collisions? For 32-bit hashes, they are met regularly. For example, CRC32 coincides with “codding” and “gnu”, “exhibitors” and “schlager”, “curl” and “natural”, etc. This is not the fault of the statistical features of the hash function, but a consequence of the limited size of the hash - only 4 bytes.
All listed functions, except for CRC32 and FNV-1, have good statistics. For a 32-bit system, MurmurHash may be the fastest. If the system cannot do unaligned memory read, for example, ARM or SPARC, then MurmurHash may also work faster than anyone. CityHash speed is highly dependent on the compiler, since a lot of complex code is used, it can be either more or less than the rest.
Recommendations
You can use MurmurHash 3, CityHash, SpookyHash V2, gperf as a general-purpose hash function.

DoS attack hash table
You can imagine a DoS attack, when an attacker feeds data into a hash table so that they all fall into one slot, and you get the worst speed for finding one element - O (n), inserting each new collision - O (n²), the memory size increases ( hash-flooding). To launch an attack, you need to be able to place arbitrary data in a hash table, and find hash function collisions.

The attack is also called hash DoS or, more generally, an attack on algorithmic complexity attack.
Collisions can be created in several ways:
- if the hash function is known in advance and the hash length is small, then use the search method: compute and search in memory meet-in-the-middle.
- if the hash function is not resistant to collisions, then the collisions can be calculated mathematically. Non-cryptographic functions are fairly simple, and it is often enough to search for collisions in the reverse order ([6], oCERT [7]).
We list the
properties of cryptographic hash functions (not applicable to general-purpose hash functions):
- resistance to restoration of the pre-image (given m , it is impossible to find X so that H ( X ) = m ),
- resistance to the restoration of the second pre-image (given m 1 , it is impossible to find m 2 so that H ( m 1 ) = H ( m 2 )),
- resistance to the search for collisions (it is impossible to find m 1 and m 2 so that H ( m 1 ) = H ( m 2 )).
Sometimes the last two criteria are called resistance to collisions of the first and second kind, respectively. We will call the collision only the case when neither the message nor the hash was originally specified.
Usually the “easiest” attack is the search for collisions. Due to the “birthdays” paradox, the complexity of this attack is at least two times less than the length of the hash. This attack exists for SHA-1: with a theoretical collision complexity O (2
80 ), the attack reduces the difficulty to O (2
61 ). There is no recovery attack of the first or second preimage for SHA-1.
To increase collision resistance, you can:
- use a cryptographic hash function to compute a hash of sufficient length. It will be resistant to collisions, but if collisions are once calculated, they can be used to attack any system.
- use HMAC (MAC based cryptographic hash function).
- use MAC based on PRF or universal one-way hash function.
A MAC based on a cryptographic function has the following properties:
- you cannot find the key, having the ability to get a hash for any message,
- You can not find the correct hash for the message without knowing the key.
- Greater resistance to collisions, since the family of hash functions is not known in advance to the attacker.
If we decide to use a cryptographic hash function (for example, SHA-3) or a MAC based on it, we will encounter the following disadvantages:
- too slow for hash tables: the algorithm is designed for long messages, and it works slowly for short messages,
- although the full length of the hash is collision-resistant, but if we take only the first n bits, it will simplify the search for collisions,
- in the case of HMAC, cryptographic hash functions are used that do extra work to ensure collision resistance, although this is not necessary because the secret key is used.
A universal hash function besides a message also receives a key, using which selects a hash function from a finite family of functions. With a key known to the attacker, the function must be resistant to the restoration of the second prototype, but there is no requirement for collision resistance. If the key is unknown to the attacker, then the function will be resistant to collisions. Considering our speed requirements, it is advisable to do without cryptographic functions.
To complicate the search for collisions, they sometimes try to build a MAC based on general-purpose hash functions, using seed as the key. But the resulting “MAC” will not have the required properties, since it is not built on the basis of a cryptographic function. The following attacks are possible:
- create collisions without knowledge of the seed, using mathematical analysis of the hash function,
- to figure out the seed, having access to the result of the hash function (for example, the program gives the data, ordered by hash by default, this provides data leakage about the hash value),
- get data via timing attack.
Therefore, although this approach may make life more difficult for the casual attacker, it often does not withstand the simplest cryptanalysis (oCERT [8]). It has been shown that most general-purpose hash functions are not resistant to cryptographic attacks:
- xxHash is not resistant to restoring the second prototype even without knowing the key, knowing the hash [12].
- MurmurHash 2, MurmurHash 3, CityHash 1.0.3 not resistant to collisions without knowing the key and hash: a way was found to quickly find collisions that do not depend on the choice of the seed (“universal” multicollisions) [13].
- Python hash is not resistant to key recovery, knowing the result of the hash function, after which you can attack against the 64-bit state using meet-in-the-middle [13].
- Python hash is not resistant to collisions without knowing the key and hash [14].
Until 2008-2010, most languages ​​and frameworks used fairly simple hash functions:
- PHP 4, ASP.NET and JavaScript: DJBX33X,
- PHP 5, Ruby 1.8 and Java: DJBX33A,
- Rust: DJB (till 2012),
- Python: modified FNV,
- JRuby: MurmurHash2,
- Haskell: FNV-1,
- Redis: djbhash, later MurmurHash 2.
Exception: Perl 4, used One-at-a-Time with a random seed since 2003 (this year they discussed this attack at the conference), later (5.17.6) used MurmurHash 3.
To reduce the number of collisions and complicate the DoS attack, many languages ​​over the last year switched to modern hash functions with a random seed. And when it turned out that this does not solve the problem with the attack, many have changed the hash and the second time.
An example of a
collision-resistant MAC :
- VMAC, UMAC, Poly1305-AES - use AES-128, optimal for long messages.
- SipHash - is a PRF, does not use cryptographic algorithms, is optimal for short messages.
- “Strongly universal string hashing is fast” [10].
Siphash
2012, length 64 bits, key 128 bits.
Speed: 15.50 cycles per byte for 8 bytes of a message, 3.66 cycles per byte for 64 bytes of a message, up to 1.96 cycles per byte for long messages. For comparison, the fastest finalist SHA-3 BLAKE-512: 134 cycles per byte for 8 bytes, 20 cycles per byte for 64 bytes.
You can use any number of compression and finalization rounds. For example, SipHash-2-4 - 2 and 4 rounds, respectively.
It was developed as a collision-resistant hash function based on a cryptographic pseudo-random function (PRF), fast for short messages. Since it is intended to be used with a random key, there is no requirement for collision resistance. It differs from non-cryptographic hash functions in that it works correctly and safely with the key, providing both the MAC properties and the properties of the universal hash function.

The situation by language currently looks like this:
- Rust: SipHash 2-4,
- Rails: SipHash 2-4,
- Perl (5.8.1): SipHash 2-4 (64 bits) and One-at-a-time (32 bits),
- Ruby (1.9.3-p327), JRuby: SipHash 2-4,
- Python (2.7.3, 3.2.3): own hash function, with the inclusion of -R a random seed is added [11],
- Haskell (1.2): SipHash (for strings),
- Go: aeshash (supported by AES-NI) or a simple eigenfunction (variation FNV-1),
- PHP (5): DJBX33A, there is a limit on the number of GET / POST elements (max_input_vars),
- Java SE (7u6 and 8): when jdk.map.althashing.threshold is turned on, MurmurHash is used with a random seed (alternative string hashing),
- Scala: Byteswap32 and MurmurHash 3 or hash from Java,
- .NET (4.5): when enabled, UseRandomizedStringHashAlgorithm uses its own development Marvin32 [9].
Open tasks:
- Java: there is no response to MurmurHash cryptanalysis [16],
- Python: recognized that the key "-R" is useless, and ponder how best to fix the attack on the hash table [14]
- Go: implemented aeshash, but there is no cryptanalysis yet [15].
Other methods of protection
It is believed that the fight against this attack is not the task that the hash function should perform. You can instead:
- use data structures with less complexity at worst. For example, a balanced tree - AVL-tree or red-black tree - guarantees in any case O (log n ). Here it is important to take into account new attacks, for example, cascade rebalance of a tree with specially formed data, but even in this case we will get a better result - O ( n log n ),
- limit the number of processed items (for example, POST / GET),
- raise an exception if the number of collisions is greater than the limit (or just discard data, if possible),
- raise an exception if the CPU is busy above the normal for search operations (timeout).
Recommendations
For data structures that use untrusted data, apply algorithms with complexity O (log
n ), for example, a balanced tree. If it is not possible, then use a hash function randomly selected from the family of functions (MAC), for example, VMAC or SipHash.