I will talk about such a problem as password hashing in web services. At first glance, it seems that everything is “clear” and you just need to take a normal algorithm, which have already been thought out a lot, write a little bit of code and roll everything into production. But as usual, when you start working on a problem, there is a bunch of pitfalls that must be taken into account. Which ones? The first of them is, perhaps, the choice of the algorithm: although there are many of them, each has its own characteristics. The second is how to choose the parameters? More and better? How to deal with the response time to the user? How much memory, CPU, threads? And the third is what to do with computational DoS? In this article I want to share some of my thoughts about these three problems, the experience of implementing a new password hashing algorithm in Yandex and a small amount of code.

Attacker & Defender
Before proceeding to the algorithms and the construction of a hashing scheme, we must generally understand what we are protecting ourselves from and what role the password hashing should play in the security of a web service. Usually, the script is such that an attacker breaks a web service (or several web services) through a chain of vulnerabilities, gets access to a user database, sees password hashes there, dumps a base and goes to have fun with the
GPU (and, in rare cases, with
FPGA and
ASIC ).
What does the defender do in this case? First of all, it sends users whose password hashes have been compromised, forcing a password change and additional authentication using tethered phone, mail, etc. In this case, a good hashing algorithm gives time to understand the scale of the disaster, to run all the necessary procedures and, as a result, save user accounts from being captured by an attacker.
Hardware
As I mentioned above, an attacker can use such equipment as GPU, FPGA, ASIC to speed up the calculations. In my opinion, the most dangerous thing on this list is the GPU, because so many people are fond of mining cryptocurrencies, three-dimensional games, etc. The GPUs are ready to start sorting password hashes. FPGAs, in turn, are not widespread, costly, often the capabilities of the debugging board are not enough, and it’s usually unrealistic to solder something at home (at high frequencies, increased requirements for soldering quality, etc.). And finally, ASIC requires quite substantial investments, design, start-up of the production cycle. Often it will be more expensive than the cost of information that an attacker can get by hacking your service.
')
Well, the defending party usually uses server CPUs that run web applications. Server CPUs have a lot of cores (but less than in a GPU), a large L3 cache, etc. Therefore, an obvious idea when developing hashing algorithms is to optimize them for CPU capabilities and slow them down as much as possible on GPU, FPGA and ASIC. Among these methods of slowing down the following can be identified:
1. The use of a large amount of RAM (GPU shared memory is limited, and going to global memory is very “expensive”; external memory must be soldered to FPGA and ASIC, which increases the cost of the whole circuit, delays in access, etc.) and stability to the Time-Memory Tradeoff (the so-called Memory Hard algorithms).
2. Using readings / records of small amounts of data at random addresses in a small region of memory that fits in the L1 cache (if you try to read 4 bytes from the global memory GPU, then 32 bytes will actually be read, which causes the GPU to use the memory bus ).
3. Using the multiplication operation MUL. On the CPU, it is performed as quickly as normal operations like shift and ADD, but on FPGA and ASIC this leads to a more complex configuration, more delays in data processing, etc.
4. Using the so-called instruction-level parallelism and the hashing algorithm SIMD-friendly design. Modern CPUs carry on board various advanced instruction sets such as SSE2, SSSE3, AVX2, etc., which allow you to perform several operations at a time and significantly speed up the calculations.
The listed techniques are not used in all algorithms. So, in
Argon2 (the winner of the
Password Hashing Competition ) all the listed techniques are used, except the second one. In
Yescrypt , which received special recognition in the PHC competition, all four techniques are used (but you must include a special mode of operation).
For ourselves, we chose Argon2, since this algorithm is well researched, easy to understand and implement, and also well optimized for x64 and SIMD.
Computational DoS problem
If the algorithm in the process of work uses a lot of memory and is guaranteed to take a certain amount of CPU time, then a mindless choice of parameters can lead to a computational DoS situation, when small RPS can damage the web service or significantly slow down the answers to the user. The reality is that there are not very many ways out of the situation. Here are some of them:
1. “Fill up the problem with iron” - add a lot of additional north, which is running a web service. This in some sense will help to cope with the computational DoS with proper load balancing, but this method does not solve the problem of long answers to the user. In other words, adding a large amount of resources does not reduce the response time, but only increases the peak RPS per cluster. As you can guess, this is not our way.
2. Maximum optimization of the hash algorithm code, use of
SIMD instructions , etc.
3. Reduction of the algorithm parameters used to an acceptable level - with the addition of various mitigations at the level of the entire password hashing scheme.
Obviously, it is worth doing the last two points. If high performance is not so important for you, then using an optimized version of the algorithm, you can afford greater parameters, which will make it even harder to search for passwords on the GPU, FPGA, ASIC. And adding mitigations at the password hashing scheme level can also make it impossible (or, at least, difficult to execute) offline attacks on the hash base and allow brute force hashes only if the attacker is in your network, which makes it easy to detect and investigate this situation. incident.
Protocol level mitigations
What mitigations at the level of the hashing scheme currently exist:
1. Using local parameters (local parameters). This idea is quite simple - you need to add a secret parameter to the algorithm, which is stored not in the database, but in the application (for example, in environment variables). Thus, an attacker will not be able to gain access to the database with password hashes - you will have to break the application as well. Also, the database administrator will not be able to dump the hashes and have fun with them at home with the GPU.
2. Using a large ROM (Read-only memory) to mix in values from it when hashing passwords. This idea was proposed by the authors of Yescrypt as one of the adaptations of the algorithm for large scale password hashing. In fact, if you use ROM of about 100 GB, then it will be difficult to steal it - you need to have a server with a minimum memory of 100 GB for a quick bust on the CPU. On GPU, FPGA and ASIC, everything will be slow too, since we not only use a large ROM, but also optimize the hashing algorithm so that it is slow on these types of equipment, etc. The disadvantages of the idea are that you have to time to live with this great ROM and you most likely will never get rid of it.
3. Using
CryptoAnchors - small microservices that perform only one operation: they use PRF with a secret key, for example,
HMAC . The secret key is stored in microservice and never leaves it. The essence of the idea is that microservice is small, simple. Auditing is easy, and hacking is very difficult, so using it allows you to turn an offline attack into an online attack. In other words, to attack a hash base, an attacker must remain inside the network and send requests to this microservice.

The CryptoAnchors idea is used in a Facebook password hashing scheme called
Passwords Onion , but it can also be used in other
parts of the infrastructure .
4. The use of the so-called
Partially-Oblivious PRF (in fact, this is part of paragraph 3). If you use CryptoAnchors with something like HMAC, then the problem arises of changing the secret key when it is compromised. One way to solve the problem is to create another layer of HMAC, which leads to various inconveniences. In addition, in the case of ordinary CryptoAnchors, this microservice sees all the hashes that the application sends. In other words, if the service is hacked, the attacker will be able to collect hashes in a pure form and conduct an offline attack. To solve these two problems, researchers from CornellTech have proposed using Partially-Oblivious PRF, based on
billinear pairing . This design allows you to change the secret key, organize logging and restrictions on the number of requests for each user. At the same time microservice does not see the hash in the clear. You can read more
in their article .

In short, the idea is that the application hashes the password, uses blinding to disguise it, and passes the masked password to the microservice along with the user ID t. Microservice applies billinear pairing to these values, using the key k known only to it, and transfers the result to the application, which, in turn, can apply unblinding (remove the masking) and compare the result with what is stored in the database. At the same time, the linearity of billinear pairing allows microservice with POPRF to issue to the application a token for updating the key, and the application can go through the database and update the records.

Performance Optimization. Argonische is our implementation of Argon2
GitHub has the
official implementation of the Argon2 algorithm, however it uses
‑march=native
, which for us is fraught with service
‑march=native
with the exception of
Illegal instruction
, because we use different processor models, and this setting forces the compiler to optimize the code for the processor model that builds . If we create the most portable configuration of the assembly, we will lose about 15–20 percent of the possible performance (and in the case of AVX2 - up to 65 percent). Therefore, we have created our own implementation of the Argon2 algorithm, which allows you to maximize the capabilities of the CPU on which the code runs.
Our implementation uses a technique called Runtime CPU dispatching. Its essence is that during the initialization of the library with the algorithm code, the
cpuid
instruction is executed, which determines which of the advanced instructions are supported by the current CPU, and selects the code branch with the corresponding optimizations. Our library contains code optimized for SSE2, SSSE3, SSE4.1 and AVX2 instruction sets. The difference in performance on Argon2d with parameters
p=1,m=2048,t=1
can be seen in the graph below.

And since Argon2 uses
Blake2B , we received Blake2B as a bonus, optimized for the above instruction sets. Inside the company, we recommend using Blake2B as a quick and safe replacement for
SHA-1
and
HMAC-SHA-1
. So, the differences from the official implementation of Argon2 are as follows:
1. C ++ 14 and cmake as an assembly system.
2. Runtime CPU dispatching.
3. Blake2B, optimized for SSE2, SSSE3, SSE4.1, AVX2.
4. OpenMP instead of pthread, and if OpenMP is not present, then all calculations will be performed sequentially.
Also in the process, we wrote from scratch the version of Argon2 for the AVX2 instruction set and
sent PR to the official repository, where our changes were accepted by the maintainers.
In general, the problem of password hashing in large and heavily loaded services is solved. To solve it you need:
• accelerate the implementation of the hashing algorithm,
• select the algorithm parameters based on the KPI for the response time,
• make changes to the hash scheme to protect against offline attacks.
Thanks
We thank
Solar Designer (aka Alexander Peslyak) for a huge amount of thoughts, ideas, experiments, and useful discussions of the problem of password hashing in large Internet companies. We also want to thank
Dmitry Khovratovich for various ideas, joint discussion and analysis of different approaches to password hashing. Many thanks to Igor
cerevra Klevantsu, who, in the intervals between making corrections to the C ++ standard, found time to review the code of our implementation of Argon2.
useful links
•
Project Argonische on GitHub•
Official repository Argon2•
Article about Pythia and Partially-Oblivious PRF•
Intel Intrinsics Guide•
PASS: Strengthening and Democratizing Enterprise Password Hardening