How we accelerated AES on an FPGA: driver development

Recently, we used the Ethond board as a mini router and launched OpenVPN on it.

However, we found that the processor was often loaded at 100% while throughput did not rise above 15-16 Mbit/s. On a 100 Mbit/s link this is very little, so we decided to accelerate encryption in hardware.

Our colleagues from the FPGA development group made an open IP core for the Altera Cyclone V that implements the AES-128 cipher and can encrypt at 8 Gbit/s and decrypt at 700 Mbit/s. For comparison, the openssl program running on the CPU (ARM Cortex-A9) of the same Cyclone V manages only about 160 Mbit/s.

This article focuses on our research into using hardware AES encryption. We briefly describe the cryptographic infrastructure in Linux and then the driver (its source code is open and available on github) that connects the FPGA to the kernel. The implementation of encryption on the FPGA is outside the scope of this article: we describe only the interface through which the processor interacts with the accelerator.

In hindsight, we should have first determined the main factor limiting throughput: what if most of the time is spent not on encryption itself but on passing through the software stack? We did not do this, and the resulting performance gain was not at all what we expected. Still, the work paid off: it was interesting to learn how to register a hardware acceleration mechanism in Linux, how to access it from user programs, and finally how to get popular tools such as openssl and openvpn to choose the accelerated algorithm rather than the standard software implementation.

In the future, we are going to accelerate OpenVPN on Ethond - then expect another article from us!

Introduction

Why are we so interested in cryptography? At first glance it has little to do with the equipment we build. However, we can find a use for it too. A large share of traffic is encrypted, and every Internet user regularly encounters cryptography, often without knowing it. For example, VPNs are growing in popularity: according to a study, at the beginning of 2017 three out of ten people were using this technology.

We had the idea of making our own small router that could encrypt and decrypt at high speed. The idea is that the VPN connection is established not by the user's machine but by the router itself. Besides, we were simply curious to try something new.

What can speed up cryptographic algorithms? When they are expressed through the CPU's general-purpose arithmetic and logical operations, they run slowly. Implementing them as specialized digital circuits can give much better performance. A good set of related links can be found in the Wikipedia article on the AES instruction set.

The algorithm can be implemented in hardware inside the CPU and invoked through special instructions. A well-known example is Intel's AES-NI. However, embedded processors often lack such functionality or are subject to export/import restrictions. In that case, you can add peripheral devices that process the data themselves. And, of course, such a peripheral can be implemented on an FPGA.

Having decided to turn the possible into the real, we began researching hardware-accelerated encryption on an FPGA.

Theoretical information

While researching ways to connect hardware encryption on the FPGA with user programs, we spent quite a while studying the fundamentals, so we think it makes sense to present some basic points here.

Linux kernel

The Linux kernel implements many cryptographic algorithms: symmetric ciphers, hashes, and block cipher modes of operation. The kernel itself can use all of this: for example, to encrypt disks (dm-crypt) or to run VPNs (IPsec). Unified access to cryptographic functions is provided by the kernel CryptoAPI, which also allows drivers to register hardware implementations of the corresponding algorithms.

User programs can also access CryptoAPI. One of the interfaces providing this is the AF_ALG socket address family, together with a wrapper library built on top of it. The main competitor of AF_ALG is the cryptodev kernel module (access to CryptoAPI via the character device /dev/crypto).

According to the authors of cryptodev, their solution is much faster than AF_ALG (see comparison).
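To make the AF_ALG path more concrete, here is a minimal sketch of how a user program might ask the kernel to encrypt a buffer with cbc(aes). It is not taken from our driver or from openssl: the function name afalg_cbc_aes128_encrypt and its error convention are hypothetical, and the sketch assumes a Linux system whose kernel has AF_ALG and a cbc(aes) implementation available.

```c
#include <linux/if_alg.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

/*
 * Hypothetical helper: encrypt `len` bytes (a multiple of 16) with
 * AES-128-CBC through AF_ALG. Returns 0 on success and -1 if the kernel
 * interface is unavailable or any step fails.
 */
int afalg_cbc_aes128_encrypt(const uint8_t key[16], const uint8_t iv[16],
                             const uint8_t *in, uint8_t *out, size_t len)
{
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type = "skcipher",   /* symmetric cipher */
        .salg_name = "cbc(aes)",   /* CryptoAPI picks the implementation */
    };
    union {
        char buf[CMSG_SPACE(sizeof(uint32_t)) +
                 CMSG_SPACE(sizeof(struct af_alg_iv) + 16)];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = (void *)in, .iov_len = len };
    struct msghdr msg = {0};
    struct cmsghdr *c;
    struct af_alg_iv *aiv;
    uint32_t opcode = ALG_OP_ENCRYPT;
    int rc = -1, op = -1;

    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0)
        return -1;
    if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, 16) < 0)
        goto out;
    op = accept(tfm, NULL, 0);     /* one "operation" socket per request stream */
    if (op < 0)
        goto out;

    /* Two control messages: the direction and the IV. */
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_ALG;
    c->cmsg_type = ALG_SET_OP;
    c->cmsg_len = CMSG_LEN(sizeof(uint32_t));
    memcpy(CMSG_DATA(c), &opcode, sizeof(opcode));

    c = CMSG_NXTHDR(&msg, c);
    c->cmsg_level = SOL_ALG;
    c->cmsg_type = ALG_SET_IV;
    c->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + 16);
    aiv = (struct af_alg_iv *)CMSG_DATA(c);
    aiv->ivlen = 16;
    memcpy(aiv->iv, iv, 16);

    /* Send the plaintext, then read back the ciphertext. */
    if (sendmsg(op, &msg, 0) == (ssize_t)len &&
        read(op, out, len) == (ssize_t)len)
        rc = 0;
out:
    if (op >= 0)
        close(op);
    close(tfm);
    return rc;
}
```

Note that the program names only the algorithm, "cbc(aes)". Which implementation actually runs, software or hardware, is decided inside the kernel.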

Linux CryptoAPI

The diagram shows CryptoAPI, where two implementations of AES-128 are registered: software and hardware, accessed through the appropriate driver.

Requests to CryptoAPI can come, for example, from IPsec, as well as from the af_alg.ko and cryptodev.ko modules. Both modules give user programs access to the cryptographic subsystem of the kernel: the first through a socket address family, the second through a character device.

As a rule, this happens transparently: when CryptoAPI receives a request, it chooses which implementation to use. If you wish, however, you can learn some details of its operation from the /proc/crypto file. It contains, in particular, the following fields:

name - the name of the implemented algorithm.
driver - a unique name for a particular implementation of the algorithm. Names ending in -generic usually denote the standard software implementations in the kernel.
priority - each driver assigns a priority of its choosing when it registers an algorithm. If the kernel has several implementations of the same algorithm, it selects the one with the highest priority.
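The selection rule can be illustrated with a small userspace sketch that scans text in the /proc/crypto format and reports which driver would win for a given algorithm name. The function best_driver is hypothetical, and this is not how the kernel itself performs the lookup. The sketch simply applies the "highest priority wins" rule to the three fields listed above, and assumes that on equal priorities the first entry wins.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan text in /proc/crypto format and find the entry for algorithm `alg`
 * that has the highest priority. Its driver name is copied into `driver`.
 * Returns that priority, or -1 if no entry matches.
 * Assumption: on equal priorities the first entry wins; the kernel's own
 * tie-breaking may differ.
 */
int best_driver(const char *text, const char *alg, char *driver, size_t cap)
{
    char name[128] = "", drv[128] = "", line[256];
    int best = -1, prio;

    for (const char *p = text; *p; ) {
        size_t len = strcspn(p, "\n");
        size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;

        /* Copy one line and advance past it. */
        memcpy(line, p, n);
        line[n] = '\0';
        p += len;
        if (*p == '\n')
            p++;

        if (sscanf(line, "name : %127s", name) == 1) {
            drv[0] = '\0';              /* a new entry starts here */
        } else if (sscanf(line, "driver : %127s", drv) == 1) {
            /* remembered until the priority line of this entry */
        } else if (sscanf(line, "priority : %d", &prio) == 1 &&
                   strcmp(name, alg) == 0 && prio > best) {
            best = prio;
            snprintf(driver, cap, "%s", drv);
        }
    }
    return best;
}
```

Fed with the contents of /proc/crypto and the name "cbc(aes)", such a function shows at a glance whether a hardware driver or a -generic implementation would be chosen.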

Userspace

Although CryptoAPI is already a high-level abstraction, it has yet another wrapper: very few userspace programs call it directly, and most prefer libraries such as libcrypto and libssl from the openssl project. These are used, for example, by openssh, openvpn, and openssl itself. The libraries support engines (what are usually called plug-ins): mechanisms for adding new implementations of cryptographic algorithms.

The openssl developers have already written engines for encryption in the kernel via AF_ALG and /dev/crypto. As a result, many programs automatically gain access to a hardware implementation of a cryptographic algorithm once it is registered in CryptoAPI.

Interaction of the kernel CryptoAPI with userspace interfaces

The diagram shows several programs that use the cryptographic functions of the libssl and libcrypto libraries, which can reach CryptoAPI via /dev/crypto or AF_ALG. In libcrypto, these two paths are provided by the cryptodev engine and the afalg engine respectively.

Real experience

Having dealt with the theory, we turn to the harsh reality: the description of our practical experience.

Purpose of the study

When developing the encryption accelerator driver, our main research question was how much bandwidth we can provide between the FPGA and user programs on our boards with a Cyclone V on board (for example, Ethond or BlueSom). In our setting this turned out to matter: when encryption itself is this fast, most of the time goes into transferring data and synchronizing the different parts of the system.

Hardware

Our driver is almost entirely determined by what the hardware offers us, so we start with a description of the hardware.

Hardware diagram

The diagram shows a simplified model of the interaction between the processor and the FPGA inside the Cyclone V SoC. For a more accurate and detailed description, see the original figure in the "Introduction to the Hard Processor" chapter of the Cyclone V Device Handbook, Volume 3: Hard Processor Technical Reference Manual, an exciting but thick book.

Linux with our driver runs on the MPU (Microprocessor Unit) inside the HPS (Hard Processor System). The processor can access FPGA registers through the L3 switch over the HPS-to-FPGA interface. Register accesses are shown in the diagram by blue arrows.

Devices implemented in the FPGA can access SDRAM through the FPGA-to-HPS interface using DMA (green arrows) and can send interrupts to the processor (red arrows). The FPGA implements two independent entities: an encryption accelerator and a decryption accelerator. Each consists of two linked blocks: one implements the algorithm (encryption or decryption), and the other exchanges data with memory via DMA.

Control and status registers of the encryption accelerator in the FPGA

Both pairs of "blocks" implemented in the FPGA have their own registers.

Encrypt Core and Decrypt Core have identical sets of registers that let you specify the key (Key) and the initialization vector (IV) to be used in the next encryption or decryption operation.

Decrypt DMA and Encrypt DMA also mirror each other. Through their registers you can specify the addresses and lengths of memory segments (we call these sets of parameters descriptors): some hold the source data, and others are where the result of the AES operation must be placed. You can also enable or disable the interrupts that signal when processing of each descriptor has finished.

Crypto API

Let us say a little more about some concepts of the Linux cryptographic subsystem.

One of its main entities is the "transformation": this is the name for any processing of data, such as hash calculation, compression, or encryption. A driver can "register" a transformation itself, that is, make it available for use.

There are quite a few transformation types, but only three of them deal with encryption:

CRYPTO_ALG_TYPE_CIPHER : a cipher that operates on single blocks (in terms of block ciphers).
CRYPTO_ALG_TYPE_BLKCIPHER : a cipher that operates on chunks of data whose length is a multiple of the block size, and does so synchronously: the encryption function does not return until encryption is complete.
CRYPTO_ALG_TYPE_ABLKCIPHER : differs from the previous one in being asynchronous: the encryption function only starts the operation and returns without waiting for it to finish. The driver that implements encryption reports completion itself.

CryptoAPI also has "templates": implementations of composite entities, for example a specific block cipher mode or HMAC, built on top of simple transformations such as encrypting a single block of data or computing a hash.

In particular, templates make it possible to use a CRYPTO_ALG_TYPE_CIPHER transformation to encrypt data of arbitrary length in some block cipher mode, even though the transformation itself operates only on 16-byte blocks. However, this is very CPU-intensive, because the kernel itself drives the cipher's mode of operation. CRYPTO_ALG_TYPE_BLKCIPHER and CRYPTO_ALG_TYPE_ABLKCIPHER transformations take the mode upon themselves, so the kernel does not need the templates that provide it.
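What a mode template does on top of a single-block cipher can be shown with a short userspace sketch of CBC chaining. This is not the kernel's cbc template: the names block_fn, toy_xor_block, cbc_encrypt and cbc_decrypt are hypothetical, and the "cipher" is a placeholder that just XORs a block with the key so that the example stays short. Only the chaining logic is the point here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16  /* AES block size in bytes */

/* A single-block primitive, the role played by CRYPTO_ALG_TYPE_CIPHER. */
typedef void (*block_fn)(const uint8_t key[BLK], const uint8_t in[BLK],
                         uint8_t out[BLK]);

/* Placeholder "cipher": XOR with the key. NOT AES and not secure.
 * XOR is its own inverse, so it serves for both directions. */
void toy_xor_block(const uint8_t key[BLK], const uint8_t in[BLK],
                   uint8_t out[BLK])
{
    for (int i = 0; i < BLK; i++)
        out[i] = in[i] ^ key[i];
}

/* CBC encryption: each plaintext block is XORed with the previous
 * ciphertext block (the IV for the first one) before being encrypted.
 * `len` must be a multiple of BLK. */
void cbc_encrypt(block_fn enc, const uint8_t key[BLK], const uint8_t iv[BLK],
                 const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t prev[BLK], tmp[BLK];

    memcpy(prev, iv, BLK);
    for (size_t off = 0; off + BLK <= len; off += BLK) {
        for (int i = 0; i < BLK; i++)
            tmp[i] = in[off + i] ^ prev[i];
        enc(key, tmp, out + off);
        memcpy(prev, out + off, BLK);
    }
}

/* CBC decryption: decrypt the block, then XOR with the previous
 * ciphertext block. The copy into `cur` allows in-place operation. */
void cbc_decrypt(block_fn dec, const uint8_t key[BLK], const uint8_t iv[BLK],
                 const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t prev[BLK], cur[BLK], tmp[BLK];

    memcpy(prev, iv, BLK);
    for (size_t off = 0; off + BLK <= len; off += BLK) {
        memcpy(cur, in + off, BLK);
        dec(key, cur, tmp);
        for (int i = 0; i < BLK; i++)
            out[off + i] = tmp[i] ^ prev[i];
        memcpy(prev, cur, BLK);
    }
}
```

With a template, the loop and the XOR run on the CPU for every 16-byte block, and only the call through block_fn could be offloaded. That per-block round trip is the overhead a BLKCIPHER-type transformation avoids by accepting the whole buffer at once.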

Since our FPGA firmware currently implements AES-128 in CBC mode, CRYPTO_ALG_TYPE_CIPHER is of no interest to us: the required mode of operation is already implemented in hardware, and no extra work is needed from the processor.

Our driver provides a transformation of the CRYPTO_ALG_TYPE_BLKCIPHER type: it seemed to us that it would be easier to implement.

The transformation provides CryptoAPI with functions for setting the key and for encryption and decryption, along with parameters such as the IV size. The driver specifies all of this in the fields of the blkcipher_alg structure when it registers the transformation. The structure looks like this:

    struct blkcipher_alg {
        int (*setkey)(struct crypto_tfm *tfm, const u8 *key,
                      unsigned int keylen);
        int (*encrypt)(struct blkcipher_desc *desc,
                       struct scatterlist *dst, struct scatterlist *src,
                       unsigned int nbytes);
        int (*decrypt)(struct blkcipher_desc *desc,
                       struct scatterlist *dst, struct scatterlist *src,
                       unsigned int nbytes);
        const char *geniv;
        unsigned int min_keysize;
        unsigned int max_keysize;
        unsigned int ivsize;
    };

Consider the most interesting fields: setkey, the callback for setting a key, and encrypt / decrypt, the callbacks for encryption and decryption. The blkcipher_desc structure carries the IV for the operation. The src and dst arguments of encrypt and decrypt specify the memory areas from which to take the source data and into which to place the result.

Cryptographic stack options, using OpenSSL as an example

Now we have a somewhat better idea of how the cryptographic stack is implemented in Linux, and we can weigh more carefully the different ways of building it, each with its own advantages and disadvantages.

Possible options for the implementation of the driver

Obviously, if the encryption accelerator driver is to be used by the kernel itself, for example in IPsec, you need to register your implementation with CryptoAPI. In this case, both user programs and the kernel have access to the driver. This is marked on the diagram as "Driver option 1".

However, there is an alternative, designated "Driver option 2": implementing our own interface for userspace, bypassing CryptoAPI and the cryptodev and af_alg modules. This may seem like the wrong decision: avoiding standardized mechanisms rarely leads to pleasant results. However, we have not studied the question enough to be sure that CryptoAPI suits our case well and does not impose significant limits on performance.

In the current implementation, we chose the first option, but we are ready to try the second one.

Soc-aes-accel architecture

When the kernel wants us to encrypt or decrypt something, it calls our functions fpga_encrypt and fpga_decrypt. They have identical signatures and do the same thing, except that the first accesses the registers of the encryption device in the FPGA and the second accesses those of the decryption device.

These functions take two pointers to arrays of struct scatterlist. Each array describes a sequence of memory areas: src is where to take the source data from, and dst is where to put the result. Each of these memory pieces is guaranteed not to cross a page boundary.

The driver task looks trivial:

Map the address of each memory area to the corresponding bus address, the one the device can use for DMA accesses;
Write the bus addresses and lengths of the memory areas to the registers of the DMA controller;
Ask the DMA controller to raise an interrupt after all the pieces have been processed;
Wait for the interrupt.

But the devil is in the details: the version of the DMA controller currently implemented in the FPGA accepts only input whose length is a multiple of 16 bytes. This restriction will be lifted soon, but until the FPGA developers have had time to change the firmware, we work with what we have: we copy memory pieces whose length is not a multiple of 16 bytes into a contiguous buffer.
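The descriptor-building step and the 16-byte workaround can be modelled in userspace as follows. This is a simplified sketch, not the driver's code: struct seg and struct desc are hypothetical stand-ins for scatterlist entries and DMA descriptors, plain pointers stand in for bus addresses, and the sketch assumes the simplest policy of copying the whole request into one bounce buffer as soon as any piece is not a multiple of 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK 16

/* Hypothetical stand-in for one scatterlist entry. */
struct seg {
    const uint8_t *addr;
    size_t len;
};

/* Hypothetical stand-in for one DMA descriptor (address + length). */
struct desc {
    const uint8_t *addr;
    size_t len;
};

/*
 * Build descriptors for the source segments.
 * - If every segment length is a multiple of 16, emit one descriptor per
 *   segment and let the device read the data in place.
 * - Otherwise copy all segments into `bounce` and emit a single descriptor
 *   covering the contiguous copy.
 * Returns the number of descriptors written, or -1 if the total length is
 * not a multiple of 16 or the output/bounce space is too small.
 */
int build_descs(const struct seg *segs, size_t nsegs,
                uint8_t *bounce, size_t bounce_cap,
                struct desc *out, size_t out_cap)
{
    size_t total = 0, off = 0;
    int aligned = 1;

    for (size_t i = 0; i < nsegs; i++) {
        total += segs[i].len;
        if (segs[i].len % AES_BLOCK)
            aligned = 0;
    }
    if (total % AES_BLOCK)
        return -1;

    if (aligned) {
        if (nsegs > out_cap)
            return -1;
        for (size_t i = 0; i < nsegs; i++) {
            out[i].addr = segs[i].addr;
            out[i].len = segs[i].len;
        }
        return (int)nsegs;
    }

    /* Slow path: linearize the request into the bounce buffer. */
    if (total > bounce_cap || out_cap < 1)
        return -1;
    for (size_t i = 0; i < nsegs; i++) {
        memcpy(bounce + off, segs[i].addr, segs[i].len);
        off += segs[i].len;
    }
    out[0].addr = bounce;
    out[0].len = total;
    return 1;
}
```

The copy on the slow path costs CPU time on every such request, which is one reason the 16-byte restriction in the firmware is worth removing.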

Memory alignment scheme

Performance

Now that the driver is written, it is time to answer the burning question: how fast is it?

Of course, we could have used the openssl tool for the measurements, but we decided to write our own program, openssl_benchmark.c, because openssl does not give us enough flexibility. In particular, you cannot set the exact size of a single buffer passed to libcrypto, since openssl may decide to process the input in parts. The performance figures also become harder to isolate, because some time goes into input/output, memory allocation, openssl initialization, and the like.

Our program works simply: it allocates and zeroes input buffers of a given size and then passes them to libcrypto for processing a specified number of times. It measures only the time spent in calls to libcrypto. Because the processing is repeated many times, we can determine quite accurately the performance of AES itself, without the losses caused by auxiliary tasks such as obtaining input data.

Using this program, we encrypted and decrypted buffers of different lengths (1000 operations of each type for each buffer length) and then calculated the throughput in Mbit/s. We did all this twice: in the first case libcrypto encrypted our data with its software implementation ("Software" on the charts), and in the second it handed the data to the kernel via the cryptodev engine ("Hardware" on the charts).
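The measurement itself boils down to timing a loop and converting bytes to megabits. The sketch below shows that arithmetic. It is not openssl_benchmark.c: the names throughput_mbps, time_calls and touch_buffer are hypothetical, touch_buffer merely stands in for the call into libcrypto, and 1 Mbit is assumed to be 10^6 bits.

```c
#include <stddef.h>
#include <time.h>

/* Throughput in Mbit/s for `ops` operations on buffers of `buf_len`
 * bytes that took `seconds` in total (1 Mbit = 10^6 bits assumed). */
double throughput_mbps(size_t buf_len, unsigned long ops, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)buf_len * 8.0 * (double)ops / seconds / 1e6;
}

/* Stand-in for the libcrypto call being measured: flips every byte. */
void touch_buffer(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0xff;
}

/* Wall-clock time, in seconds, of `ops` calls to fn(buf, len).
 * Wall-clock rather than CPU time, because with hardware offload the
 * CPU may be idle while the request is in flight. */
double time_calls(void (*fn)(unsigned char *, size_t),
                  unsigned char *buf, size_t len, unsigned long ops)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    for (unsigned long i = 0; i < ops; i++)
        fn(buf, len);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

For example, 1000 operations on 4096-byte buffers completing in exactly one second correspond to 4096 * 8 * 1000 / 10^6 = 32.768 Mbit/s.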

Encryption performance
Decryption performance

Of particular interest is the point at which the hardware implementation overtakes the software one in throughput. Let's take measurements over a narrower range and with a finer step:

Encryption performance
Decryption performance

We see that the performance of the software implementation is almost independent of the buffer size. This is because data exchange between our program and libcrypto is very fast, and as the buffer size grows, the time spent on function calls quickly becomes negligible.

With encryption on the FPGA, the situation is more complicated. A call to libcrypto is passed to the cryptodev engine, which opens /dev/crypto, sets up an encryption session with several system calls, and only then passes pointers to the buffers to the cryptodev module. The module, in turn, builds two lists of the physical pages onto which the user buffer is mapped in its virtual address space, in the form of struct scatterlist *, and passes them to CryptoAPI. Our driver has the highest priority of all registered AES implementations, so it receives the lists. Once the driver has made sure that the lengths of all the memory pieces are multiples of the AES block length, it obtains the corresponding bus addresses and writes them to the FPGA registers.

We see that requests between the FPGA and user programs must pass through several layers, and the time lost in them is very large. Only for very large buffers does the per-request overhead stop having a strong effect on throughput. On the graph this point can be recognized where the line becomes almost horizontal: most of the time is then spent on processing the data.
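The shape of such a curve can be captured by a simple model, which is our illustration rather than something measured: assume each request costs a fixed overhead t_0 (system calls, list building, register writes, interrupt handling) plus a transfer time proportional to the buffer size n in bytes, with B_max the throughput of the data path itself in bit/s.

```latex
B(n) = \frac{8n}{t_0 + \dfrac{8n}{B_{\max}}}, \qquad
\lim_{n \to \infty} B(n) = B_{\max}, \qquad
B(n_{1/2}) = \frac{B_{\max}}{2} \quad \text{at} \quad n_{1/2} = \frac{t_0 \, B_{\max}}{8}
```

With purely illustrative values t_0 = 100 microseconds and B_max = 400 Mbit/s, half of the plateau is reached at n = 10^-4 * 4 * 10^8 / 8 = 5000 bytes. Under this model, small buffers are dominated by t_0 and the line flattens only when 8n / B_max is much larger than t_0.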

A reader who looks not only at the shape of the lines but also at the specific numbers will, of course, wonder: why does decryption level off at 250 Mbit/s and encryption at 400 Mbit/s?

CPU figures from top for each buffer size:

16	256	1024	4096	8192	450000
92%	89%	81%	63%	53%	29%
93%	91%	86%	74%	64%	43%


Conclusion

FPGA . : . . - .

Source: https://habr.com/ru/post/324042/
