Over the past few years, the performance of the Linux network stack has become an increasingly relevant topic. This is understandable: the volume of data transmitted over networks, and the corresponding loads, are growing by leaps and bounds.
Even the widespread adoption of 10GE network cards does not solve the problem: the Linux kernel itself contains many bottlenecks that prevent fast packet processing.
There have been numerous attempts to work around these bottlenecks. The techniques used are collectively called kernel bypass (a brief overview can be found, for example, here). They eliminate the Linux network stack from packet processing entirely and let an application running in user space interact with the network device directly. In today's article we would like to talk about one of these solutions: DPDK (Data Plane Development Kit), developed by Intel.
There are many publications about DPDK, including in Russian (see, for example: 1, 2 and 3). Some of them are quite good, but they do not answer the main question: how exactly does packet processing with DPDK work? What stages does a packet pass through on its way from the network device to the user?
These are the questions we will try to answer. Finding the answers took considerable effort: since we could not find everything we needed in the official documentation, we had to work through a lot of additional material and dig into the source code. But first things first: before we talk about DPDK and the problems it helps to solve, we should recall how packet processing is done in Linux. That is where we will begin.
Linux packet handling: basic steps
When a packet arrives at the network card, it is first copied to main memory via DMA (Direct Memory Access).
UPD. Clarification: on newer hardware, the packet is copied directly into the Last Level Cache of the socket that initiated the DMA, and from there into memory. Thanks to izard.
After that, the system needs to be informed about the new packet, and the data needs to be passed on to a dedicated buffer (Linux allocates such a buffer for every packet). For this purpose Linux uses an interrupt mechanism: an interrupt is generated whenever a new packet enters the system. The packet then still needs to be transferred to user space.
One bottleneck is already apparent: the more packets need to be handled, the more resources are spent on this, which negatively affects the operation of the system as a whole.
UPD. Modern network cards support interrupt moderation, which can be used to reduce the number of interrupts and unload the processor. Thanks to T0R for the clarification.
Packet data, as mentioned above, is stored in a dedicated buffer, or, more precisely, in the sk_buff structure. This structure is allocated for each packet and freed when the packet reaches user space. This operation consumes a lot of bus cycles (i.e., cycles that transfer data from the CPU to main memory).
There is another problem with the sk_buff structure: the Linux network stack was originally designed to be compatible with as many protocols as possible. Metadata for all of these protocols is built into sk_buff, even though much of it is simply not needed to process a given packet. This excessive complexity slows processing down.
Another factor that hurts performance is context switching. When an application running in user space needs to receive or send a packet, it makes a system call: the context switches to kernel mode and then back to user mode. This consumes a noticeable amount of system resources.
To solve some of the problems described above, NAPI (New API), which combines the interrupt method with polling, has been included in the Linux kernel since version 2.6. Let's briefly look at how it works.
Initially, the network card operates in interrupt mode, but as soon as a packet arrives, the network interface registers itself in a poll list and disables interrupts. The system periodically checks the list for new devices and picks up packets for further processing. Once the packets have been processed, the card is removed from the list and interrupts are enabled again.
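To make this contract more concrete, here is a rough C sketch of what NAPI looks like from a driver's point of view. This is not code from any real driver: struct my_dev and the my_dev_*() helpers are hypothetical placeholders; only napi_schedule(), napi_complete() and the poll callback signature are the actual kernel API.

```c
/*
 * Schematic NAPI fragment, not a complete driver: the interrupt handler
 * masks RX interrupts and schedules polling; the poll callback then
 * processes up to `budget` packets with interrupts off. All my_dev_*()
 * helpers and struct my_dev are hypothetical placeholders.
 */
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct my_dev {
    struct napi_struct napi;
    /* ... device-specific state ... */
};

static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_dev *dev = data;

    my_dev_disable_rx_irq(dev);   /* hypothetical: mask RX interrupts */
    napi_schedule(&dev->napi);    /* register the device in the poll list */
    return IRQ_HANDLED;
}

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_dev *dev = container_of(napi, struct my_dev, napi);
    int done = 0;

    while (done < budget && my_dev_rx_pending(dev))   /* hypothetical */
        done += my_dev_handle_one_packet(dev);        /* hypothetical */

    if (done < budget) {          /* queue drained: back to interrupt mode */
        napi_complete(napi);
        my_dev_enable_rx_irq(dev);                    /* hypothetical */
    }
    return done;
}
```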
We have described the packet processing flow only in broad strokes. A more detailed description can be found, for example, in a series of articles in the Private Internet Access blog. However, even this brief overview is enough to see the problems that slow packet processing down. In the next section we describe how DPDK solves them.
DPDK: how it works
A general overview
Consider the following illustration:
The left-hand side shows packet processing done the "traditional" way, and the right-hand side shows it done with DPDK. As you can see, in the second case the kernel is not involved at all: interaction with the network card goes through specialized drivers and libraries.
If you have already read about DPDK, or have at least some experience with it, you know that the NIC ports that will receive traffic must be removed from Linux control entirely. This is done with the dpdk-devbind command (in earlier versions, dpdk_nic_bind, invoked as ./dpdk_nic_bind.py).
How are ports handed over to DPDK? Every driver in Linux has so-called bind and unbind files. The network card driver has them too:
```
$ ls /sys/bus/pci/drivers/ixgbe
bind  module  new_id  remove_id  uevent  unbind
```
To detach a device from its driver, you write the device's bus number into the unbind file. Accordingly, to hand the device over to another driver, you write its bus number into that driver's bind file. More details about this can be found in this article.
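For illustration, here is a minimal C sketch of the same operation. The PCI address 0000:01:00.0 and the driver names are assumptions for the example; note that for uio_pci_generic, the device's vendor and device IDs usually have to be written to the driver's new_id file first.

```c
/* A minimal sketch: detach a device from ixgbe and hand it to
 * uio_pci_generic by writing its PCI address into the sysfs files.
 * The address 0000:01:00.0 is an example; run as root. */
#include <stdio.h>

static int sysfs_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s", value);
    fclose(f);
    return 0;
}

int main(void)
{
    sysfs_write("/sys/bus/pci/drivers/ixgbe/unbind", "0000:01:00.0");
    sysfs_write("/sys/bus/pci/drivers/uio_pci_generic/bind", "0000:01:00.0");
    return 0;
}
```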
The DPDK installation instructions indicate that the ports must be brought under the control of the vfio_pci, igb_uio or uio_pci_generic driver.
All these drivers (we will not discuss their specifics in this article; we refer readers to these articles on kernel.org: 1 and 2) make it possible to interact with devices from user space. Each of them does include a kernel module, but its functions are reduced to initializing devices and providing a PCI interface.
All further work of organizing communication between the application and the network card is handled by the PMD (poll mode driver) included in DPDK. DPDK has PMDs for all supported network cards, as well as for virtual devices (thanks to T0R for pointing this out).
To work with DPDK, you also need to configure hugepages (large memory pages). This is needed to reduce the load on the TLB (Translation Lookaside Buffer): with large pages, far fewer entries are needed to cover the same amount of memory.
We will discuss all the nuances in more detail below, but for now let's briefly describe the main stages of packet processing with DPDK (a code sketch of this loop follows the list):
- Incoming packets go into a ring buffer (we will look at how it works in the next section). The application periodically checks this buffer for new packets.
- If the buffer contains new packet descriptors, the application uses the pointers in those descriptors to access the DPDK packet buffers stored in a dedicated memory pool.
- If there are no packets in the ring buffer, the application polls the network devices under DPDK's control and then accesses the ring again.
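Here is a minimal sketch of such a receive loop using the DPDK API. rte_eth_rx_burst() and rte_pktmbuf_free() are real DPDK calls; port 0, queue 0, the burst size and process_packet() are assumptions for the example, and all port initialization is omitted.

```c
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void process_packet(struct rte_mbuf *m)
{
    (void)m; /* application logic would go here */
}

static void rx_loop(void)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* poll the RX ring of port 0, queue 0 for up to BURST_SIZE descriptors */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            process_packet(bufs[i]);   /* the buffer lives in the memory pool */
            rte_pktmbuf_free(bufs[i]); /* return it to the pool */
        }
        /* if nb_rx == 0, we simply come back and poll the ring again */
    }
}
```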
Let's now examine DPDK's internal structure in more detail.
EAL: environment abstraction
EAL (Environment Abstraction Layer) is the central concept of DPDK.
EAL is a set of software tools that allow DPDK to work in a specific hardware environment and under a specific operating system. In the official DPDK repository, the libraries and drivers that make up EAL are stored in the rte_eal directory.
Drivers and libraries for Linux and BSD systems are stored there, along with sets of header files for various processor architectures: ARM, x86, TILE64, PPC64.
The tools included in EAL come into play when building DPDK from source:
```
make config T=x86_64-native-linuxapp-gcc
```
In this command, as you can guess, we specify that DPDK should be built for the x86_64 architecture on Linux.
It is EAL that "binds" DPDK to applications. All applications that use DPDK (see examples here) must include the header files that ship with EAL.
We list the most common ones:
- rte_lcore.h - control functions for processor cores and sockets;
- rte_memory.h - memory management functions;
- rte_pci.h - functions providing access to the PCI address space;
- rte_debug.h - tracing and debugging functions (logging, dump_stack and others);
- rte_interrupts.h - interrupt handling functions.
More information about the structure and functions of EAL can be found in the documentation.
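As an illustration, a minimal DPDK program starts by handing its command-line arguments to EAL. In this sketch, rte_eal_init(), rte_panic() and rte_lcore_id() are the real API; everything else is a placeholder.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_debug.h>

int main(int argc, char **argv)
{
    /* EAL parses its own arguments (core mask, memory options, etc.)
       and returns how many of them it consumed */
    int ret = rte_eal_init(argc, argv);

    if (ret < 0)
        rte_panic("Cannot initialize EAL\n");

    printf("EAL is up, running on lcore %u\n", rte_lcore_id());
    /* application logic goes here */
    return 0;
}
```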
Queue management: rte_ring library
As mentioned above, a packet arriving at the network card goes into a receive queue, which is a ring buffer. In DPDK, newly arrived packets are likewise placed into a queue implemented on top of the rte_ring library. The description of this library below is based on the developer's guide and on comments in the source code.
When developing rte_ring, the FreeBSD ring buffer implementation was taken as the basis. If you look at the sources, you will notice the comment: Derived from FreeBSD's bufring.c.
The queue is a lockless ring buffer organized on the FIFO (First In, First Out) principle. The ring buffer itself is a table of pointers to objects stored in memory. All pointers are divided into four types: prod_tail, prod_head, cons_tail and cons_head. Prod and cons are short for producer and consumer.
The producer is the process that is currently writing data into the buffer, and the consumer is the process that is currently reading data from it. The tail is the position where writing into the ring buffer currently happens; the position the current read is performed from is called the head.
The idea behind the enqueue operation is as follows: when a new object is added to the queue, the end result should be that the ring->prod_tail pointer points to the location where ring->prod_head pointed before the operation began.
We give only a brief description here; more about the ring buffer's usage scenarios can be found in the developer's guide on the DPDK website.
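To give a feel for the API, here is a minimal sketch of creating a ring and passing a pointer through it. The name, size and flags are arbitrary example values; the size must be a power of two, and EAL is assumed to be initialized.

```c
#include <rte_ring.h>
#include <rte_lcore.h>

static void ring_example(void)
{
    /* a single-producer/single-consumer ring of 1024 pointer slots */
    struct rte_ring *r = rte_ring_create("example_ring", 1024,
                                         rte_socket_id(),
                                         RING_F_SP_ENQ | RING_F_SC_DEQ);
    static int value = 42;
    void *obj = NULL;

    rte_ring_enqueue(r, &value); /* producer side: prod_head/prod_tail move */
    rte_ring_dequeue(r, &obj);   /* consumer side: cons_head/cons_tail move */
}
```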
Among the advantages of this approach to queue management, the first one worth singling out is higher write throughput into the buffer. Second, bulk enqueue and bulk dequeue operations cause far fewer cache misses, because the pointers are stored in a table.
The disadvantage of DPDK's ring buffer implementation is its fixed size, which cannot be increased on the fly. In addition, a ring structure takes much more memory to work with than a linked list, since the ring always holds the maximum possible number of pointers.
Memory management: rte_mempool library
We have already said that DPDK needs large memory pages (hugepages) to work. The installation instructions recommend creating hugepages of 2 MB each.
These pages are combined into segments, which are then divided into zones. Objects created by applications or other libraries, such as queues and packet buffers, are placed in these zones.
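For example, a named zone can be reserved inside the hugepage segments roughly like this (rte_memzone_reserve() is the real call; the zone name and size are arbitrary example values, and EAL is assumed to be initialized):

```c
#include <rte_memzone.h>
#include <rte_lcore.h>

static const struct rte_memzone *reserve_zone(void)
{
    /* reserve 4 KB in a zone named "example_zone" on the local socket */
    return rte_memzone_reserve("example_zone", 4096, rte_socket_id(), 0);
}
```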
Among such objects are the memory pools created by the rte_mempool library. These are pools of fixed-size objects that use rte_ring to store free objects and can be identified by a unique name.
Alignment techniques can be used to improve performance.
Although access to free objects is organized through a lock-free ring buffer, the cost in system resources can still be high. The ring is accessed by several processor cores, and every time a core accesses it, a compare-and-set (CAS) operation has to be performed.
To keep the ring from becoming a bottleneck, each core gets an additional local cache in the memory pool. A core has full access to its own cache of free objects without any locking. When the cache fills up or empties out, the memory pool exchanges objects with the ring. This gives the core fast access to frequently used objects.
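In code, this per-core cache is simply a parameter of pool creation. A minimal sketch (rte_pktmbuf_pool_create() is the real call; the pool name and sizes are example values, and EAL is assumed to be initialized):

```c
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_lcore.h>

static struct rte_mempool *make_pool(void)
{
    return rte_pktmbuf_pool_create("example_pool",
                                   8191,  /* number of mbufs in the pool */
                                   256,   /* per-core cache discussed above */
                                   0,     /* private area size */
                                   RTE_MBUF_DEFAULT_BUF_SIZE,
                                   rte_socket_id());
}
```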
Buffer management: rte_mbuf library
In the Linux network stack, as noted above, all network packets are represented by the sk_buff structure. DPDK uses the rte_mbuf structure for this purpose, described in the rte_mbuf.h header file.
DPDK's approach to buffer management strongly resembles the one used in FreeBSD: instead of one large sk_buff structure, there are many small rte_mbuf buffers. The buffers are created before the DPDK application is launched and are stored in memory pools (the rte_mempool library is used to allocate them).
In addition to the packet data itself, each buffer contains metadata (message type, length, the address of the beginning of the data segment). A buffer also holds a pointer to the next buffer. This is needed when handling packets with large amounts of data: the buffers for such packets can be chained together (just as it is done in FreeBSD; you can read more about this, for example, here).
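A short sketch of reading this metadata and walking a chain of segments; m is assumed to come from a receive call such as rte_eth_rx_burst().

```c
#include <stdio.h>
#include <rte_mbuf.h>

static void inspect_mbuf(struct rte_mbuf *m)
{
    /* pointer to the packet data in the first segment */
    char *data = rte_pktmbuf_mtod(m, char *);
    (void)data;

    printf("total length: %u bytes in %u segment(s)\n",
           rte_pktmbuf_pkt_len(m), m->nb_segs);

    /* follow the next pointers that chain the buffers of one large packet */
    for (struct rte_mbuf *seg = m; seg != NULL; seg = seg->next)
        printf("  segment: %u bytes\n", rte_pktmbuf_data_len(seg));
}
```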
Other libraries: quick overview
In the previous sections we covered only the most essential DPDK libraries. There are many others, and covering them all in one article is hardly possible, so we will limit ourselves to a brief overview.
DPDK's LPM library implements the Longest Prefix Match (LPM) algorithm, used to forward packets based on their IPv4 addresses. The main functions of this library are adding and removing IP addresses, as well as looking up new addresses using the LPM algorithm.
For IPv6 addresses, similar functionality is implemented in the LPM6 library.
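A hedged sketch of the IPv4 variant follows. The exact signatures have changed between DPDK releases; this follows the newer rte_lpm_config variant (in older releases the macro was spelled IPv4() and the next-hop fields were narrower), and the route 10.0.0.0/8 with next hop 1 is an arbitrary example.

```c
#include <stdio.h>
#include <rte_ip.h>
#include <rte_lcore.h>
#include <rte_lpm.h>

static void lpm_example(void)
{
    struct rte_lpm_config cfg = {
        .max_rules = 1024,
        .number_tbl8s = 256,
        .flags = 0,
    };
    struct rte_lpm *lpm = rte_lpm_create("example_lpm",
                                         rte_socket_id(), &cfg);
    uint32_t next_hop = 0;

    rte_lpm_add(lpm, RTE_IPV4(10, 0, 0, 0), 8, 1);  /* 10.0.0.0/8 -> hop 1 */
    if (rte_lpm_lookup(lpm, RTE_IPV4(10, 1, 2, 3), &next_hop) == 0)
        printf("next hop: %u\n", next_hop);         /* prints 1 */
}
```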
Other libraries implement similar functionality on top of hash functions. With rte_hash, you can search through a large set of entries using a unique key. This library can be used, for example, to classify and distribute packets.
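A minimal sketch with 32-bit keys (e.g. IPv4 addresses); the table name and size are example values, and rte_jhash is one of the hash functions shipped with DPDK:

```c
#include <stdint.h>
#include <rte_hash.h>
#include <rte_jhash.h>
#include <rte_lcore.h>

static void hash_example(void)
{
    struct rte_hash_parameters params = {
        .name = "example_flows",
        .entries = 1024,
        .key_len = sizeof(uint32_t),
        .hash_func = rte_jhash,
        .hash_func_init_val = 0,
        .socket_id = rte_socket_id(),
    };
    struct rte_hash *h = rte_hash_create(&params);
    uint32_t key = 0x0a000001; /* 10.0.0.1 */

    int pos = rte_hash_add_key(h, &key);  /* returns the entry's position */
    /* rte_hash_lookup(h, &key) will later return the same position */
    (void)pos;
}
```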
The rte_timer library provides asynchronous execution of functions. A timer can run once or periodically.
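A hedged sketch of a periodic timer: rte_timer_*() and rte_get_timer_hz() are the real API, and timers only fire when some lcore regularly calls rte_timer_manage().

```c
#include <stdio.h>
#include <rte_cycles.h>
#include <rte_lcore.h>
#include <rte_timer.h>

static void on_tick(struct rte_timer *tim, void *arg)
{
    (void)tim;
    (void)arg;
    printf("tick\n");
}

static void timer_example(void)
{
    static struct rte_timer tim;

    rte_timer_subsystem_init();
    rte_timer_init(&tim);
    /* fire once per second, on the current lcore, until stopped */
    rte_timer_reset(&tim, rte_get_timer_hz(), PERIODICAL,
                    rte_lcore_id(), on_tick, NULL);

    for (;;)
        rte_timer_manage(); /* runs the callbacks of expired timers */
}
```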
Conclusion
In this article we tried to describe the internal structure of DPDK and how it works. Tried, but did not finish: this topic is so complex and extensive that one article is clearly not enough. So stay tuned: in the next article we will talk in more detail about the practical aspects of using DPDK.
We will be happy to answer your questions in the comments. And if any of you have experience with DPDK, we will be grateful for any remarks and additions.
For those who want to learn more, here are some useful links on the topic:
If for some reason you cannot leave comments here, you are welcome to comment in our corporate blog.