πŸ“œ ⬆️ ⬇️

Encryption GOST 28147-89 on x86 and GPU-processors

The article presents the results of testing optimized GOST encryption algorithms obtained in September and March 2014 by the Security Code company on new Intel server processors, as well as on graphics processors from various manufacturers.

Acceleration of encryption GOST 28147–89


With the development of IT technologies, the volume of data transmitted over the global Internet, stored in network storages and processed in the "clouds", has sharply increased. Some of this data is confidential, so it is necessary to ensure their protection from unauthorized access. Encryption is traditionally used to protect confidential data, and when encrypting large volumes, symmetric encryption algorithms are used, such as the well-known block algorithm, AES. In order to comply with Russian legislation, when encrypting such information as personal data, it is necessary to use the domestic algorithm of symmetric block encryption GOST 28147–89. The data encryption operation is quite expensive and requires additional time for data processing, resulting in reduced performance and increased delays. To reduce this negative effect in data protection, it is necessary to increase the encryption speed. In general, encryption algorithms are implemented in software, but hardware methods are used to achieve high speeds. Unfortunately, in modern x86 processors, hardware acceleration of encryption is implemented only for the AES standard (AES-NI instruction set). This standard is based on a special algebraic structure, and other encryption standards can be accelerated using AES-NI instructions only if their structure is the same as AES (for example, Camellia).
When implementing GOST 28147–89, you cannot use AES-NI instructions, but you can use other approaches to speed up encryption. For example, multi-block encryption - one encryption software stream processes several blocks in parallel. But parallelize the processing of a single data array (Fig. 1) only in those encryption modes GOST 28147–89, where there is no feedback between the processed blocks (gamming, ECB). For modes that have feedback between blocks (CFB, MAC), you can use parallel processing of several blocks from different encryption streams (Fig. 2). With this approach, the encryption speed GOST 28147–89 can be measured in ECB mode without loss of generality. It is worth noting the size of the buffer for each such encryption stream - it should not exceed 4 KB (sector size on the HDD). In the results below, we limited ourselves to 512 bytes, and each stream was encrypted in its own key.

Acceleration of GOST 28147–89 on the central processor (CPU) using SIMD technologies


Modern x86 architecture processors (as well as ARM, PowerPC, and others) contain a vector computing block for parallel processing of multiple data streams using SIMD (single instruction, multiple data) technology. This CPU block can be effectively used for multi-block encryption GOST 28147–89. The greatest effect in this case is achieved due to the PSHUFB hardware data shuffling instruction (Fig. 3), which allows you to significantly speed up the nonlinear transformation (hereinafter S-box) in the algorithm. In combination with AVX (or AVX2) command extensions and the ability of x86 processors to execute several commands in parallel (out-of-order execution), multi-block encryption provides high speed even for one processor core.
The implementation of the GOST 28147–89 algorithm using AVX instructions (Intel processor architecture Sandy Bridge / Ivy Bridge) allows processing data at a speed of 8.5 clock cycles / bytes for a single processor core (the same speed for the AES-256 algorithm without using AES instructions NI). For Intel Haswell architecture with AVX2 support, the performance increases to 6.7 clock / byte. If the target encryption system uses only the CPU, then its performance will increase linearly with the total number of CPU cores (Fig. 4) in the system and their frequencies. On the Intel Xeon 2 x E5-2697 v3 platform, when processing 16 encryption streams of GOST 28147–89 on one CPU core (AVX, Fig. 4), the maximum speed was 9425 MB / s (10.5 MB / s per 1 stream). When using AVX2 instructions (32 encryption streams) on a test platform, the speed was 13133 MB / s (7.3 MB / s per 1 stream) or more than 100 Gbit / s. The low encryption speed of one stream needs to be explained. For example, we need to encrypt 128 sectors on a hard disk (sector size 512 bytes). Since the sectors can be encrypted in parallel, the encryption speed on 1 CPU core in this case will be 10.5 MB / s Γ— 16 streams = 168 MB / s and 7.3 MB / s Γ— 32 streams = 233.3 MB / s for AVX and AVX2 respectively. It is worth noting that Intel Hyper-threading technology allows you to increase the overall speed of the platform by 30% with a 50% reduction in speed by one thread (Fig. 4).

')

Acceleration of GOST 28147–89 on the central processor (CPU) on general-purpose registers


When implementing GOST 28147–89 on general-purpose registers for an S-box operation, a pre-calculation table of 4 KB is compiled (first proposed by Vinokurov A.). Such an approach requires a lot of non-linear memory access operations, and the speed of execution of such an implementation of the encryption algorithm depends on the CPU memory subsystem. For the Intel Sandy Bridge / Ivy Bridge architecture, the encryption performance in this case is 60 clock / byte. In this architecture, there are 2 data load ports (LD - load data) in each core of the CPU, which allows the use of multi-block encryption and, in this case, encrypt 2 blocks in parallel.
Then the encryption speed increases to 30 clocks / byte. For the platform under test, the total encryption speed of GOST 28147–89 on general-purpose registers was 2682 MB / s (23.9 MB / s per stream) or 21.97 Gbit / s (GPR, Fig. 4). Multi-block encryption using SIMD provides maximum performance when processing from 4 encryption streams. At the same time, the encryption speed of one stream in SIMD is lower than the speed of one stream in general registers. The parallelization ratio of encryption per CPU was 0.99 without the use of Intel Hyper-threading technology. When using this technology, the coefficient has dropped to 0.65.

Acceleration of GOST 28147–89 on a graphics processor (GPU)


To further accelerate encryption according to GOST 28147–89, we investigated heterogeneous systems (CPU + GPU). The development of the GPU architecture has led to the fact that they are closer in their capabilities to the CPU, both in programming methods and in hardware. But using the GPU as an encryption accelerator is associated with a number of technical problems. Typically, the graphics accelerator is a peripheral device that connects to the CPU via the PCI Express data bus. This introduces into the implementation of any algorithm using common graphics accelerator technology (GPGPU) additional operations on copying data from CPU memory to GPU memory and back, but this effect can be reduced by processing data processing (Fig. 5). The cores in modern GPUs are structurally similar to the vector computing blocks in the CPU.
Architecturally, GPUs are aimed at executing a large number of parallel streams of programs that perform many arithmetic operations. A modern graphics accelerator incorporates a large number of arithmetic logic units (ALUs) and a memory architecture oriented towards the transfer of large data blocks. In the GOST 28147–89 algorithm, the S-box operation for general-purpose processors requires a multitude of non-linear memory access operations, where replacement tables are stored. Nonlinear memory read operations for GPUs are performed with longer delays than for CPUs. Therefore, we implemented the S-box operation in the GOST 28147–89 algorithm using a Boolean function in order to optimally use the computational capabilities provided by the GPU. At the same time, we managed to get the encryption speed on the Nvidia GeForce GTX 750 GPU (Maxwell architecture) 2588 MB / s (161.8 Kb / s per stream) or 21.2 Gbit / s (Fig. 6). For AMD HD 7790 GPU (GCN 1.1 architecture), the encryption speed is ~ 1700 MB / s. Although the graphics accelerator from Intel showed no outstanding results (295 MB / s), it has no equal in terms of prevalence. Its speed will be enough to encrypt a single hard disk. It should be noted that not the most productive GPUs were tested, Nvidia and AMD have more productive solutions on similar architectures: GeForce GTX 980 and FirePro W9100. We assume that the speed in this case will increase in proportion to the number of ALUs of these GPUs. 4 times for GTX 980 and 3 times for FirePro W9100. At the same time, the performance of encryption on a single GPU cannot exceed the data transfer rate on the PCI express v3 bus in one direction (12 GB / s).


Conclusion


By speed, encryption GOST 28147–89 approaches AES and can be a good alternative to it. If you combine encryption on the CPU and GPU, you can achieve an encryption rate of 1 node at 53 GB / s (platform 2 CPU Intel Xeon E5-2697 v3 + 4 GPU Nvidia GeForce GTX 980). Let us briefly list the areas where such encryption speeds can be claimed. Firstly, to encrypt the standard 40 Gbit / s and 80 Gbit / s networks, which will be implemented in the next versions of APK Continent. Secondly, in distributed network disk storage. Currently, the Security Code is developing a pass-encoder for the iSCSI protocol. Thirdly, the encryption operation itself can be sold as a service in cloud services - the cloud client pays extra for encrypting its data or connections. You can sacrifice high speed to increase energy efficiency and reduce the cost of encryption equipment for networks of 1 Gbit / s and 10 Gbit / s.

Source: https://habr.com/ru/post/255863/


All Articles