Slightly less than the fastest, portable, 64-bit hash function, with decent quality.
Yes, and in the ladies, something like this. Read on?
Russian version is here .
We omit the definition of hash functions along with a detailed listing of the properties and requirements for their cryptographic application, assuming that the reader either owns the necessary minimums of knowledge or fills them up . We also agree that hereinafter we mean non-cryptographic (cryptographically unstable) hash functions, unless explicitly stated otherwise.
Hashing is applied in the mass of algorithms, while almost always the most efficient (fast) data processing is required, along with a certain minimum level of hashing quality. Moreover, the term “quality” means, first of all, “conditional randomness” (stochasticity) of the result with respect to the initial data. Somewhat less often additional requirements are imposed: resistance to the deliberate generation of collisions or irreversibility.
For a slender presentation, it is necessary to define the concept of “quality” hash functions and the rest of the requirements in a little more detail:
Omitting citing evidence and other calculations, we can state:
So, we can say that t1ha appeared as a result of searching for a compromise between quality and speed, at the same time taking into account the capabilities of modern processors and the already found methods (arithmetic-logical combinations) of mixing and spreading dependencies (avalanche effect).
The basic version of t1ha is the (most) fast portable hash function for constructing hash tables and other related applications. Therefore, the basic version of t1ha is focused on 64-bit little-endian architectures, accepts a 64-bit primitive value (seed) and produces a 64-bit result, which includes a key-length gain and a seed. It is worth noting that t1ha is intentionally designed to return 0 for zero input data (a key of zero size and zero seed).
64-bit operations : It may be worthwhile to clarify that it is 64-bit operations that give speed and quality without selecting portability. Actually, the wider the digit capacity of arithmetic operations, the more they produce an avalanche effect and better mix the data. Well, data processing, all other things being equal, is certainly faster by 8 bytes than by 4. On the other hand, exactly 64-bit operations are natively available on many modern processors, and can be more or less tolerably translated into 32-bit ones. All other options, including SIMD operations, force us to greatly sacrifice portability and / or speed on non-native platforms.
64-bit result : To construct hash tables, in many cases, a smaller width result is really enough. Even 32 bits may be more than enough. However, when using 64-bit operations, the 64-bit result is actually obtained by itself. At the same time, a sufficiently high-quality 64-bit hash result allows you to quickly perform a comparison for non-equality, and with good accuracy to compare for equality.
The above-mentioned "magic" of replacing comparisons can be incomprehensible and unclaimed, and can increase the speed of work by an order of magnitude only due to the locality of the data, i.e. Smaller CPU cache flushing. Simply put, you can build a hash table structure so that the calculated hash values ​​lie side by side (packed in cache lines). And for real data the processor had to go only if the hash values ​​coincided. And in this case, the 64-bits from t1ha allow you to get the ultimate result . And 128 bits will not give a win anymore, but taking less from 64 bits is always possible.
Comparison with HighwayHash : I have a twofold attitude to this unofficial project of Google employees .
If all goes well, then additional options will appear in the t1ha suite (primarily for the width of the result) and optimized for E2K. On this topic comparison with HighwayHash, I would like to close.
To evaluate the quality of the hash function in all aspects is quite difficult. You can go analytically, or conduct various statistical tests. Unfortunately, the analytical approach is not very effective for evaluating hash functions with a compromise between quality and speed. Moreover, a comparative analytical evaluation of such functions tends to be subjective.
In contrast, for statistical tests it is easy to obtain transparent quantitative estimates. At the same time there are well-proven test packages, such as SMHasher . For t1ha, the results are simple - all t1ha variants pass all tests without any comments. On the other hand, one should not assume that t1ha has any properties in excess of those that are necessary for the target application (building hash tables).
The number of collisions at all levels of t1ha variants corresponds to the birthday paradox . In a stricter formulation, the probability of collisions t1ha corresponds to the probability of coincidence of random discrete values ​​of the corresponding capacity.
A similar probability of collisions for all more or less high-quality hash functions. But this is exactly the probability, so a literal number of collisions can vary on a specific data set.
After the first publication of this article, Yves Orton discovered that the first level t1ha1()
variant does not always meet the strict avalanche criterion . This drawback is insignificant for the targeted applications of t1ha1()
and imperceptible from a practical point of view. However, this disadvantage is eliminated in the next t1ha2()
level, which was originally planned to provide a slightly higher quality. On new processors, with the use of current versions of compilers, t1ha2()
on average, one cycle faster than t1ha1()
, and in other cases it can be one cycle slower. It is worth noting that t1ha2()
additionally offers streaming hashing mode and a 128-bit result.
Readers would certainly appreciate a thorough and in-depth analysis of the quality and / or stamina of t1ha . However, based on the target t1ha application areas, this seems redundant. Simply put, speed was more important to us, including for short keys. Therefore, multi-round mixing was not considered. The present t1ha saves on matches cycles and gives a 64-bit result - it is practically meaningless to measure the compromise found otherwise than statistically, and these results are simply good.
I just take an example in statistical prove with colleagues from Google ;)
It is necessary to clarify the presence of the phrase “ the fastest ” in the title. Indeed, it is extremely unlikely that there is a hash function that will be useful and at the same time the fastest on all platforms / architectures. Different sets of instructions are available on different processors, and similar instructions are executed with different efficiencies. Obviously, the “ universal fastest ” function most likely cannot be created. However, it seems acceptable to use the “fastest” function that is portable and at the same time the fastest, at least on the most common platform (x86_64), while having little chance of losing on any modern processor with a decent optimizing compiler.
The source code of the project includes a test that checks both the correctness of the result and measures the speed of each implemented version. At the same time, on x86, depending on the capabilities of the processor (and compiler), additional variants of functions can be checked, and measurements are made in processor cycles.
In addition, the project website contains tables with the results of performance measurements through a modified version of SMHasher from Reini Urban . Accordingly, all figures can be double-checked and / or get results on a specific processor using a specific compiler.
Here you can make a comparison with some closest competitors t1ha.
Hashing short keys (average for 1..31 bytes).
We look at the right column "Cycles / Hash" (the smaller the value, the better) :
Function | MiB / Second | Cycles / Hash |
---|---|---|
t1ha | 12228.80 | 35.55 |
Fasthash64 | 5578.06 | 43.42 |
CityHash64 | 11041.72 | 51.77 |
xxHash64 | 11123.15 | 56.17 |
Metrohash | 11808.92 | 46.33 |
Hashing long keys (256 Kb).
We look at the middle column “MiB / Second” (the higher the value, the better) :
Function | MiB / Second | Cycles / Hash |
---|---|---|
t1ha | 12228.80 | 35.55 |
FarmHash64 | 12145.36 | 60.12 |
CityHash64 | 11041.72 | 51.77 |
xxHash64 | 11123.15 | 56.17 |
Spooky64 | 11820.20 | 60.39 |
Development t1ha pursued purely practical purposes. The first such goal was to obtain a fast portable and sufficiently high-quality function for constructing hash tables.
Then the fastest version of the hash function was required, which would give a result of comparable quality, but was adapted to the target platform as much as possible. For example, the basic t1ha version works with little-endian byte order, which is why on big-endian architectures conversion is necessary with inevitable loss of performance. So why not get rid of unnecessary operations on a specific target platform? In the same way, several more options were added:
A little later, it became clear that more options would be needed that were designed for various applications, including different results, the quality and durability requirements. Such diversity required restoring order. What was expressed in the change of the naming scheme, in which the numeric suffix denotes the “level” of the function:
t1ha0()
is the fastest option for the current processor.t1ha1()
is the basic portable 64-bit version of t1ha.t1ha2()
is a portable 64-bit version with a little more concern for quality.t1ha3()
is a fast portable 128-bit version for fingerprinting.In this scheme, it is assumed that t1ha0()
is a dispatcher that implements redirection depending on the platform and capabilities of the current processor. In addition, the use of the suffixes "_le" and "_be" for an explicit choice between the little-endian and big-endian variants is not excluded. Thus, under the “t1ha” signboard now there are several hash functions, and this family will be replenished, including with an eye on the domestic E2K “Elbrus”.
An idea of ​​the current set of functions and their properties can be obtained from the output of the embedded test ( make check
). It is worth noting that all functions pass all SMHasher tests, and the performance of the AES-NI variants varies greatly depending on the processor model:
Intel(R) Core(TM) i7-6700K CPU @ 3.00GHz Build by GNU C/C++ compiler 8.2 [...] - use RDPMC_40000001 as clock source - measure granularity and overhead: 53 cycles, 0.0188679 iteration/cycle Bench for tiny keys (7 bytes): t1ha0 : 13.14 cycle/hash, 1.877 cycle/byte, 1.598 Gb/s @3GHz t1ha1_64le : 15.14 cycle/hash, 2.163 cycle/byte, 1.387 Gb/s @3GHz t1ha2_atonce : 15.50 cycle/hash, 2.163 cycle/byte, 1.387 Gb/s @3GHz t1ha1_64be : 16.78 cycle/hash, 2.397 cycle/byte, 1.251 Gb/s @3GHz xxhash32 : 17.17 cycle/hash, 2.453 cycle/byte, 1.223 Gb/s @3GHz StadtX : 17.59 cycle/hash, 2.513 cycle/byte, 1.194 Gb/s @3GHz t1ha0_32le : 18.28 cycle/hash, 2.612 cycle/byte, 1.149 Gb/s @3GHz t1ha0_32be : 20.24 cycle/hash, 2.892 cycle/byte, 1.037 Gb/s @3GHz xxhash64 : 22.17 cycle/hash, 3.167 cycle/byte, 0.947 Gb/s @3GHz t1ha2_atonce128* : 29.93 cycle/hash, 4.277 cycle/byte, 0.701 Gb/s @3GHz t1ha2_stream* : 79.81 cycle/hash, 11.402 cycle/byte, 0.263 Gb/s @3GHz HighwayHash64_avx2 : 83.75 cycle/hash, 11.964 cycle/byte, 0.251 Gb/s @3GHz HighwayHash64_sse41 : 85.25 cycle/hash, 12.179 cycle/byte, 0.246 Gb/s @3GHz t1ha2_stream128* : 99.06 cycle/hash, 14.152 cycle/byte, 0.212 Gb/s @3GHz HighwayHash64_portable: 480.75 cycle/hash, 68.679 cycle/byte, 0.044 Gb/s @3GHz HighwayHash64_pure_c : 652.58 cycle/hash, 93.226 cycle/byte, 0.032 Gb/s @3GHz Bench for large keys (16384 bytes): t1ha0 : 1185.00 cycle/hash, 0.072 cycle/byte, 41.478 Gb/s @3GHz t1ha2_atonce : 3436.00 cycle/hash, 0.210 cycle/byte, 14.305 Gb/s @3GHz t1ha2_atonce128* : 3440.00 cycle/hash, 0.210 cycle/byte, 14.288 Gb/s @3GHz t1ha1_64le : 3449.00 cycle/hash, 0.211 cycle/byte, 14.251 Gb/s @3GHz t1ha2_stream* : 3479.00 cycle/hash, 0.212 cycle/byte, 14.128 Gb/s @3GHz t1ha2_stream128* : 3508.00 cycle/hash, 0.214 cycle/byte, 14.011 Gb/s @3GHz StadtX : 3550.00 cycle/hash, 0.217 cycle/byte, 13.846 Gb/s @3GHz xxhash64 : 4121.00 cycle/hash, 0.252 cycle/byte, 11.927 Gb/s @3GHz t1ha1_64be : 4567.00 cycle/hash, 0.279 cycle/byte, 10.762 Gb/s @3GHz HighwayHash64_avx2 : 4580.00 cycle/hash, 0.280 cycle/byte, 10.732 Gb/s @3GHz HighwayHash64_sse41 : 6412.00 cycle/hash, 0.391 cycle/byte, 7.666 Gb/s @3GHz t1ha0_32le : 7191.00 cycle/hash, 0.439 cycle/byte, 6.835 Gb/s @3GHz t1ha0_32be : 7928.00 cycle/hash, 0.484 cycle/byte, 6.200 Gb/s @3GHz xxhash32 : 8197.00 cycle/hash, 0.500 cycle/byte, 5.996 Gb/s @3GHz HighwayHash64_portable: 41895.27 cycle/hash, 2.557 cycle/byte, 1.173 Gb/s @3GHz HighwayHash64_pure_c : 53296.11 cycle/hash, 3.253 cycle/byte, 0.922 Gb/s @3GHz
If we talk in a little more detail, then t1ha is built according to the Merkle-Damgård scheme (in the “wipe-pipe” version) with hardening of the data size and the suval value. Inside the main compression cycle, a 256-bit state is used, with the same size of the input block. Moreover, for each data operand there are two injection points with cross pollination. Upon completion of the compression cycle, a 256-bit state is compressed to 128 bits.
When performing the described actions, 64-bit operations are used, which are combined into mixers ARX (Add-Rotate-Xor) and MUX / MRX (Mul-Rotate-Xor). It is important that all these calculations are built in such a way as to ensure the possibility of parallel execution of most operations and tight packing of u-ops both into a pipeline and into x86_64 performing devices. Due to this, a sufficiently good quality is achieved with an almost maximum hash rate of long keys.
It is worth noting that the compression cycle runs only for blocks of sufficient size. If there is less data, then the intermediate 128-bit state will consist only of the key size and the priming value.
Further, the remaining tail of the data in portions of 64 bits is mixed alternately to the halves of the 128-bit state. Finally, the state is mixed at the same time with compression to a 64-bit result. An important feature of t1ha here is the use of a mixer based on wide multiplication (128-bit product of two 64-bit multipliers). This allows for qualitative mixing with a good avalanche effect in fewer operations. Despite the fact that wide multiplication is a relatively expensive operation, fewer such operations allow t1ha to process short keys in a record-low number of processor cycles.
It should be noted that the mixer used based on wide multiplication and exclusive OR is not perfect. Although t1ha passes all SMHasher tests, the author has an idea of ​​the consequences of non-injectivity . Nevertheless, the resulting quality seems to be rationally sufficient, and the development plans for the t1ha line already reflect the intention to provide a slightly better option.
Thanks for attention. All good.
Russian version is here .
Source: https://habr.com/ru/post/339160/
All Articles