
Monitoring and Tuning the Linux Network Stack: Receiving Data



In this article, we will look at how packets are received on computers running the Linux kernel, as well as analyze the monitoring and configuration of each component of the network stack as packets move from the network to user space applications. Here you will find a lot of source code, because without a deep understanding of the processes you will not be able to configure and monitor the Linux network stack.

We also recommend reading the illustrated guide on the same topic; it contains explanatory diagrams and additional information.

Content
1. General advice on monitoring and configuring the Linux network stack
2. Overview of issues
3. Detailed analysis
3.1. Network device driver
3.2. Softirq
3.3. Linux Networking Subsystem
3.4. Receive Packet Steering (RPS) mechanism
3.5. Receive Flow Steering (RFS) mechanism
3.6. Hardware-accelerated Receive Flow Steering (aRFS)
3.7. Moving up the network stack: netif_receive_skb
3.8. netif_receive_skb
3.9. Protocol layers
3.10. Additional Information
4. Conclusion

1. General advice on monitoring and configuring the network stack in Linux


The network stack is complex and there is no one-size-fits-all solution. If performance and correctness of networking are critical to you or your business, you will have to invest a significant amount of time, effort, and money into understanding how the different parts of the system interact.

Ideally, you should measure packet loss at each layer of the network stack; then you can decide exactly which components need tuning. This, it seems to me, is where many people give up, assuming that sysctl settings or /proc values can simply be applied wholesale. In reality, the system is so interconnected and full of nuance that if you want useful monitoring or effective tuning, you will have to understand how it works at a low level. Otherwise, just use the default settings; they may be good enough until further optimization (and the investment needed to track the effect of those settings) becomes necessary.

Many of the example settings in this article are used solely as illustrations and are not a recommendation for or against a particular configuration or the defaults. Before applying any setting, think about what you need to monitor in order to detect a meaningful change.

Applying network settings over a remote connection is risky: you can easily lock yourself out or take networking down entirely. Do not apply settings directly on production machines; if possible, try them on new machines first and only then roll them out to production.

2. Overview of issues


You may want to have a copy of your device's data sheet on hand. This article discusses the Intel I350 controller, driven by the igb driver; you can download its data sheet from Intel.
The high-level path that the packet passes from arriving to the receiving socket buffer looks like this:

  1. The driver is loaded and initialized.
  2. The packet comes from the network to the network card.
  3. The packet is copied (via DMA) to the kernel's circular memory buffer.
  4. A hardware interrupt is generated so that the system knows about the appearance of the packet in memory.
  5. The driver calls NAPI to start a poll loop, if it has not already started.
  6. On each CPU of the system, ksoftirqd processes are running. They are registered at boot time. These processes pull packets out of the ring buffer by calling the NAPI poll function registered by the device driver during initialization.
  7. The driver unmaps the memory regions in the ring buffer that the network data was written to.
  8. Data sent directly to memory (DMA) is transmitted for further processing to the network layer in the form of 'skb'.
  9. If packet steering is enabled, or if the network card has multiple receive queues, the incoming data frames are distributed across several CPUs.
  10. Network data frames are handed from the queue to the protocol layers.
  11. The protocol layers process the data.
  12. The protocol layers add the data to receive buffers attached to sockets.

Next we will look at this whole flow in detail. We will use the IP and UDP protocol layers as examples; most of the information applies to other protocol layers as well.

3. Detailed analysis


We will be looking at Linux kernel version 3.13.0; code examples and GitHub links are used throughout the article.

It is very important to understand exactly how packets are received by the kernel. We will have to read the network driver carefully so that the later description of the network stack is easier to follow.

We will use igb as the example network driver; it drives a fairly common server network card, the Intel I350. So let's start by analyzing how this driver works.

3.1. Network device driver


Initialization


The driver registers an initialization function that the kernel calls when the driver is loaded. Registration is done with the module_init macro.
You can find the igb initialization function (igb_init_module) and its registration with module_init in drivers/net/ethernet/intel/igb/igb_main.c. It's pretty simple:

/**
 * igb_init_module - driver registration routine
 *
 * igb_init_module is the first routine called when the driver is loaded.
 * All it does is register with the PCI subsystem.
 **/
static int __init igb_init_module(void)
{
	int ret;
	pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
	pr_info("%s\n", igb_copyright);

	/* ... */

	ret = pci_register_driver(&igb_driver);
	return ret;
}

module_init(igb_init_module);

As we will see, the main part of the initialization work of the device occurs when pci_register_driver is called.

PCI initialization


The Intel I350 network card is a device with a PCI express interface.

PCI devices identify themselves using a series of registers in the PCI configuration space .

When the device driver is compiled, the MODULE_DEVICE_TABLE macro (from include/module.h) is used to export a table of PCI device IDs identifying the devices the driver can control. Below we will see that this table is also registered as part of a driver structure.

This table is used by the kernel to determine which driver to load to control the device. Thus, the operating system understands which device is connected and which driver allows you to interact with it.

You can find the table and the PCI device IDs for the igb driver in drivers/net/ethernet/intel/igb/igb_main.c and drivers/net/ethernet/intel/igb/e1000_hw.h, respectively:

 static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = { { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 }, /* ... */ }; MODULE_DEVICE_TABLE(pci, igb_pci_tbl); 

As we saw above, pci_register_driver is called by the initialization driver function.

This function registers a structure of pointers. Most of them are function pointers, but the PCI device ID table is registered as well. The kernel uses the functions registered by the driver to bring up the PCI device.

From drivers/net/ethernet/intel/igb/igb_main.c:

 static struct pci_driver igb_driver = { .name = igb_driver_name, .id_table = igb_pci_tbl, .probe = igb_probe, .remove = igb_remove, /* ... */ }; 

PCI Probe


When a device is identified by its PCI ID, the kernel can select the appropriate driver. Each driver registers a probe function with the kernel's PCI subsystem, and the kernel calls it for devices that no driver has claimed yet. Once a driver claims a device, other drivers are not asked about it. Most drivers contain a lot of code that prepares the device for use, and the exact procedures vary greatly from driver to driver.

Here are some typical procedures:

  1. Turn on the PCI device.
  2. Requesting memory regions and I/O ports.
  3. Setting the DMA mask.
  4. Registering the ethtool functions the driver supports (described below).
  5. Starting watchdog timers (for example, e1000e has a timer that checks whether the hardware is hung).
  6. Other device-specific procedures, such as working around hardware quirks, enabling optional hardware features, and the like.
  7. Creating, initializing, and registering a struct net_device_ops structure. It contains pointers to various functions needed to open the device, send data to the network, configure the MAC address, and so on.
  8. Creating, initializing, and registering a high-level struct net_device structure representing a network device.

Let's go over some of these procedures for the igb driver and the igb_probe function.

A quick look at PCI initialization


The following code from the igb_probe function performs the basic PCI configuration. Taken from drivers/net/ethernet/intel/igb/igb_main.c:

 err = pci_enable_device_mem(pdev); /* ... */ err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)); /* ... */ err = pci_request_selected_regions(pdev, pci_select_bars(pdev, IORESOURCE_MEM), igb_driver_name); pci_enable_pcie_error_reporting(pdev); pci_set_master(pdev); pci_save_state(pdev); 

Initially, the device is initialized using pci_enable_device_mem. If the device is in sleep mode, it wakes up, memory sources are activated, and so on.

Then the DMA mask is configured. Our device can read and write to 64-bit memory addresses, so dma_set_mask_and_coherent is called using DMA_BIT_MASK (64).

By calling pci_request_selected_regions, memory areas are reserved. The PCI Express Advanced Error Reporting service starts if its driver is loaded. Using the pci_set_master call activates the DMA, and the PCI configuration space is saved using the pci_save_state call.

Fuh.

Additional information about the PCI driver for Linux


A full review of the work of a PCI device is beyond the scope of this article, but you can read these materials:


Network Device Initialization


The igb_probe function does the important work of initializing the network device. In addition to the procedures specific to PCI, it also performs more general operations for networking and operating a network device:

  1. Registers a struct net_device_ops.
  2. Registers the ethtool operations.
  3. Obtains the default MAC address from the network card.
  4. Sets the net_device feature flags.
  5. And much more.

We will need all of this later, so let's take a quick look.

struct net_device_ops


struct net_device_ops contains function pointers to many important operations required by the network subsystem to control the device. We will mention this structure more than once in the article.

The net_device_ops structure is attached to a struct net_device in igb_probe. Taken from drivers/net/ethernet/intel/igb/igb_main.c:

 static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { /* ... */ netdev->netdev_ops = &igb_netdev_ops; 

The same file defines the functions whose pointers are stored in the net_device_ops structure. Taken from drivers/net/ethernet/intel/igb/igb_main.c:

 static const struct net_device_ops igb_netdev_ops = { .ndo_open = igb_open, .ndo_stop = igb_close, .ndo_start_xmit = igb_xmit_frame, .ndo_get_stats64 = igb_get_stats64, .ndo_set_rx_mode = igb_set_rx_mode, .ndo_set_mac_address = igb_set_mac, .ndo_change_mtu = igb_change_mtu, .ndo_do_ioctl = igb_ioctl, /* ... */ 

As you can see, the struct has several interesting fields, for example, ndo_open, ndo_stop, ndo_start_xmit and ndo_get_stats64, which contain the addresses of the functions implemented by the igb driver. Some of them will be discussed later.

Register ethtool


ethtool is a command-line program. With it, you can get and configure various drivers and hardware options. Under Ubuntu, this program can be installed like this: apt-get install ethtool.

Typically, ethtool is used to collect detailed statistics from network devices. Other uses will be described below.

The program communicates with the drivers using the ioctl system call. The device driver registers a series of functions performed for ethtool operations, and the kernel provides glue.

When ethtool calls ioctl, the kernel finds the ethtool structure registered by the appropriate driver and performs the registered functions. Implementing the ethtool driver function can do anything — from changing a simple software flag in the driver to controlling how the physical NIC equipment works by writing registers to the device.
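To make the ioctl glue concrete, here is a minimal user-space sketch (not taken from the article or the kernel sources) that asks the kernel for the driver information ethtool -i prints, via the ETHTOOL_GDRVINFO command; the interface name eth0 is an assumption.

/* Minimal sketch: query driver info through the ethtool ioctl, like `ethtool -i eth0`.
 * "eth0" is an assumed interface name; error handling is kept short. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(void)
{
	struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works as an ioctl handle */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&drvinfo;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
		printf("driver: %s, version: %s, bus: %s\n",
		       drvinfo.driver, drvinfo.version, drvinfo.bus_info);
	else
		perror("SIOCETHTOOL");

	close(fd);
	return 0;
}

The kernel routes this request to the ethtool_ops structure registered by whichever driver controls eth0, which is exactly the glue described above.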

The igb driver registers its ethtool operations in igb_probe by calling igb_set_ethtool_ops:

 static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { /* ... */ igb_set_ethtool_ops(netdev); 

The entire ethtool code of the igb driver, along with the igb_set_ethtool_ops function, can be found in drivers/net/ethernet/intel/igb/igb_ethtool.c.

Taken from drivers/net/ethernet/intel/igb/igb_ethtool.c:

 void igb_set_ethtool_ops(struct net_device *netdev) { SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops); } 

There you can also find the igb_ethtool_ops structure, whose fields are filled in with the ethtool functions the igb driver supports.

Taken from drivers/net/ethernet/intel/igb/igb_ethtool.c:

 static const struct ethtool_ops igb_ethtool_ops = { .get_settings = igb_get_settings, .set_settings = igb_set_settings, .get_drvinfo = igb_get_drvinfo, .get_regs_len = igb_get_regs_len, .get_regs = igb_get_regs, /* ... */ 

Each driver, at its discretion, decides which ethtool functions are relevant and which need to be implemented. Unfortunately, not all drivers implement all ethtool functions.

The function get_ethtool_stats is quite interesting, which (if it is implemented) creates detailed statistical counters that are tracked either by the software driver or by the device itself.

In the dedicated monitoring part we will look at how to use ethtool to get these statistics.

IRQ


When a data frame is stored in memory using DMA, how does the network card inform the system that the data is ready for processing?

Typically, the card generates an interrupt indicating that data has arrived. There are three common interrupt types: MSI-X, MSI, and legacy IRQ; we will look at them shortly. Generating an interrupt when data is written to memory is simple enough, but if many frames arrive, a large number of IRQs is generated, and the more interrupts there are, the less CPU time remains for higher-level work such as user processes.

New API (NAPI) was created as a mechanism to reduce the number of interrupts generated by network devices as packets arrive. Still, NAPI cannot eliminate interrupts completely; we will find out why later.

NAPI


NAPI differs from the legacy data collection method in several important ways. It allows a device driver to register a poll function that the NAPI subsystem calls to harvest data frames.

The algorithm for using NAPI network device drivers is as follows:

  1. The driver includes NAPI, but initially it is in an inactive state.
  2. A packet arrives and the network card directly sends it to memory.
  3. The network card generates the IRQ by running the interrupt handler in the driver.
  4. The driver wakes up the NAPI subsystem via a SoftIRQ (more on this below). NAPI then starts harvesting packets by calling the poll function registered by the driver in a separate execution context.
  5. The driver should disable further interrupt generation by the network card. This allows the NAPI subsystem to process packets without interference from the device.
  6. When all the work is done, the NAPI subsystem is disabled and interrupts from the device are re-enabled.
  7. The cycle is repeated starting from point 2.

This approach to harvesting data frames has lower overhead than the legacy method, because many frames can be consumed at once without generating a separate IRQ for each of them.

The device driver implements the poll function and registers it with NAPI by calling netif_napi_add. The driver also sets a weight; most drivers hardcode a value of 64. We will see later why this particular value is used.

Usually, drivers register their NAPI poll functions during driver initialization.
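As a sketch of what such a registration and poll function look like in a hypothetical driver (the mydrv_* names are invented for illustration and are not igb code; only the NAPI calls match the 3.13-era kernel API), the poll callback receives a budget and returns how much work it did:

/* Hypothetical driver sketch: registering and implementing a NAPI poll function.
 * mydrv_* identifiers are invented; only the NAPI API calls are real. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_ring *ring = container_of(napi, struct mydrv_ring, napi);
	int done = mydrv_clean_rx_ring(ring, budget);   /* harvest up to `budget` packets */

	if (done < budget) {
		/* the ring is drained: leave polled mode, re-enable the device IRQ */
		napi_complete(napi);
		mydrv_enable_irq(ring);
	}

	return done;   /* net_rx_action subtracts this from its overall budget */
}

static void mydrv_setup_napi(struct mydrv_ring *ring)
{
	/* 64 is the weight most drivers hard-code */
	netif_napi_add(ring->netdev, &ring->napi, mydrv_poll, 64);
}

Returning a value smaller than the budget is how a driver tells NAPI that the queue is empty and interrupt-driven operation can resume; we will see the real igb version of this logic in igb_poll later.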

NAPI initialization in igb driver


The igb driver does this with a long call chain:

  1. igb_probe calls igb_sw_init.
  2. igb_sw_init calls igb_init_interrupt_scheme.
  3. igb_init_interrupt_scheme calls igb_alloc_q_vectors.
  4. igb_alloc_q_vectors calls igb_alloc_q_vector.
  5. igb_alloc_q_vector calls netif_napi_add.

The result is a series of high-level operations:

  1. If MSI-X is supported, then it is enabled by calling pci_enable_msix.
  2. Calculated and initialized various settings; for example, the number of transmit and receive queues that the device and driver will use to send and receive packets.
  3. igb_alloc_q_vector is called once for each transmit and receive queue created.
  4. Each time igb_alloc_q_vector is called, netif_napi_add is also called to register the poll function for a specific queue. When the poll function is called to collect packets, it will be given an instance of struct napi_struct.

Let's take a look at igb_alloc_q_vector to understand how a callback poll and its private data are registered.

Taken from drivers/net/ethernet/intel/igb/igb_main.c:

static int igb_alloc_q_vector(struct igb_adapter *adapter,
			      int v_count, int v_idx,
			      int txr_count, int txr_idx,
			      int rxr_count, int rxr_idx)
{
	/* ... */

	/* allocate q_vector and rings */
	q_vector = kzalloc(size, GFP_KERNEL);
	if (!q_vector)
		return -ENOMEM;

	/* initialize NAPI */
	netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

	/* ... */

The code above allocates memory for a receive queue and registers the igb_poll function with the NAPI subsystem. It obtains a reference to the struct napi_struct associated with this newly created receive queue (&q_vector->napi); when the time comes to harvest packets from this queue, the NAPI subsystem will call igb_poll and pass it this reference.

We will understand the importance of the described algorithm when we study the data flow from the driver to the network stack.

Bringing up a network device


Remember the net_device_ops structure, which registers the functions for bringing up a network device, transmitting packets, setting the MAC address, and so on?

When a network device is brought up (for example, with ifconfig eth0 up), the function attached to the ndo_open field of the net_device_ops structure is called.

The ndo_open function usually does the following:

  1. Allocates memory for the receive and transmit queues.
  2. Enables NAPI.
  3. Registers an interrupt handler.
  4. Enables hardware interrupts.
  5. And much more.

In the case of the igb driver, the function attached to the ndo_open field of the net_device_ops structure is igb_open.

Preparing to receive data from the network


Most modern network cards use DMA to write data directly to memory, from which the operating system can extract it for further processing. The structure most often used for this is similar to a queue created on the basis of a ring buffer.

First, the device driver must, together with the OS, reserve in memory the area that will be used by the network card. Next, the card is informed about the allocation of memory, where later incoming data will be recorded, which can be taken and processed using the network subsystem.

It looks simple, but what if the packet rate is so high that one CPU does not have time to process them? The data structure is based on a fixed-size memory area, so packets will be dropped.

In this case, Receive Side Scaling (RSS) , a multi-queue system, can help.

Some devices can simultaneously write incoming packets to several different areas of memory. Each area serves a separate queue. This allows the OS to use multiple CPUs for parallel processing of incoming data at the hardware level. But not all network cards can do this.

The Intel I350 can. We see evidence of this in the igb driver: one of the first things it does when brought up is call igb_setup_all_rx_resources. For each receive queue, this function calls igb_setup_rx_resources, which arranges the DMA memory that the network card will write incoming data to.

If you are interested in details, read github.com/torvalds/linux/blob/v3.13/Documentation/DMA-API-HOWTO.txt .

Using ethtool, you can customize the number and size of receive queues. Changing these parameters can significantly affect the ratio of processed and dropped frames.

To determine which queue to send data to, the network card uses a hash function in the header fields (source, destination, port, and so on).

Some network cards allow you to adjust the weight of the receive queues, so that you can send more traffic to specific queues.
Less common is the ability to customize the hash function itself. If you can customize it, you can direct a specific flow to a specific queue, or even drop packets at the hardware level.
Below we will look at how the hash function is configured.
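To illustrate the idea (this is not the hash any real NIC uses; hardware typically implements a Toeplitz hash with a configurable key), queue selection boils down to hashing the flow tuple and mapping the result through an indirection table onto the available queues:

/* Toy illustration of RSS-style queue selection: hash the 4-tuple, then map
 * the hash through an indirection table (like the one `ethtool -x` prints)
 * onto n_queues. NOT the Toeplitz hash real hardware uses. */
#include <stdint.h>
#include <stdio.h>

static uint32_t toy_flow_hash(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	h ^= h >> 16;
	h *= 0x45d9f3b;      /* arbitrary mixing constant */
	h ^= h >> 16;
	return h;
}

int main(void)
{
	enum { N_QUEUES = 4, TABLE_SIZE = 128 };
	int indirection[TABLE_SIZE];

	for (int i = 0; i < TABLE_SIZE; i++)
		indirection[i] = i % N_QUEUES;   /* an "equal weight" table */

	uint32_t h = toy_flow_hash(0x0a000001, 0x0a000002, 34567, 80);
	printf("flow hashes to RX queue %d\n", indirection[h % TABLE_SIZE]);
	return 0;
}

Weighting queues, as described above, simply means filling more slots of the indirection table with the queues that should receive more traffic.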

Enable NAPI


When a network device is brought up, the driver usually enables NAPI. We saw earlier how drivers register poll functions with NAPI; NAPI is normally not enabled until the device is brought up.

Enabling it is pretty simple: a call to napi_enable flips a bit in the struct napi_struct to mark it as enabled. As noted above, even when enabled, NAPI starts out in an inactive state.

In the case of the igb driver, NAPI is enabled for each q_vector when the device is brought up, or when the queue count or size is changed with ethtool.

Taken from drivers / net / ethernet / intel / igb / igb_main.c :

 for (i = 0; i < adapter->num_q_vectors; i++) napi_enable(&(adapter->q_vector[i]->napi)); 

Register Interrupt Handler


After NAPI is enabled, you must register an interrupt handler. A device can generate interrupts in various ways: MSI-X, MSI, and Legacy interrupts. Therefore, the code may be different, depending on the supported methods.

The driver must determine which method is supported by this device and register the corresponding handler function that is executed when an interrupt is received.

Some drivers, including igb, try to register a handler for each method, falling back to the next untried method on failure.

MSI-X interrupts are the preferred method, especially for network cards that support multiple receive queues. Each queue gets its own hardware interrupt, which can be handled by a specific CPU (with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity). As we will see shortly, the CPU that handles the interrupt is also the CPU that processes the packet, so incoming packets can be processed by different CPUs throughout the network stack, starting at the hardware interrupt level.

If MSI-X is not available, then the driver uses MSI (if supported), which still has advantages over legacy interrupts. Read more about this in the English Wikipedia .

In the igb driver, the MSI-X, MSI, and legacy interrupt handlers are the functions igb_msix_ring, igb_intr_msi, and igb_intr, respectively.

The driver code that tries each method in turn can be found in drivers/net/ethernet/intel/igb/igb_main.c:

static int igb_request_irq(struct igb_adapter *adapter)
{
	struct net_device *netdev = adapter->netdev;
	struct pci_dev *pdev = adapter->pdev;
	int err = 0;

	if (adapter->msix_entries) {
		err = igb_request_msix(adapter);
		if (!err)
			goto request_done;
		/* fall back to MSI */
		/* ... */
	}

	/* ... */

	if (adapter->flags & IGB_FLAG_HAS_MSI) {
		err = request_irq(pdev->irq, igb_intr_msi, 0,
				  netdev->name, adapter);
		if (!err)
			goto request_done;

		/* fall back to legacy interrupts */
		/* ... */
	}

	err = request_irq(pdev->irq, igb_intr, IRQF_SHARED,
			  netdev->name, adapter);

	if (err)
		dev_err(&pdev->dev, "Error %d getting interrupt\n", err);

request_done:
	return err;
}

As you can see, the driver first tries to use the igb_request_msix handler for MSI-X interrupts; if it fails, it goes to MSI. To register the MSI handler igb_intr_msi, request_irq is used. If this does not work either, the driver proceeds to legacy interrupts. To register igb_intr, request_irq is used again.

At this point almost everything is set up. The only thing left is to enable interrupts from the network card so that it can signal that data has arrived.

Enabling interrupts is hardware specific, but the igb driver does it in __igb_open by calling the helper function igb_irq_enable.

Interrupts are enabled for this device by writing to registers:
 static void igb_irq_enable(struct igb_adapter *adapter) { /* ... */ wr32(E1000_IMS, IMS_ENABLE_MASK | E1000_IMS_DRSTA); wr32(E1000_IAM, IMS_ENABLE_MASK | E1000_IMS_DRSTA); /* ... */ } 

The network device is now up. Drivers may do a few additional things, such as starting timers or work queues, or other hardware-specific setup; once that is done, the device is ready for use.

Monitoring network devices

There are several different ways to monitor network devices, offering different levels of granularity and complexity. Let's start with the most granular and move toward the least granular.

ethtool -S


You can install ethtool on an Ubuntu system with: sudo apt-get install ethtool.
Once installed, you can access the statistics by passing the -S flag along with the name of the network device you want statistics for.

Monitor detailed NIC device statistics (for example, packet drops) with `ethtool -S`:

 $ sudo ethtool -S eth0 NIC statistics: rx_packets: 597028087 tx_packets: 5924278060 rx_bytes: 112643393747 tx_bytes: 990080156714 rx_broadcast: 96 tx_broadcast: 116 rx_multicast: 20294528 .... 

Monitoring this data can be difficult: it is easy to obtain, but the field names are not standardized. Different drivers, or even different versions of the same driver, may use different names for counters that mean the same thing.

Look for values with "drop", "buffer", "miss", and the like in their labels. Next, you will have to read the driver source to work out which counters are maintained purely in software (for example, incremented when there is no memory) and which come directly from hardware via register reads. For register values, consult the data sheet of your network card to find out what the counter really means; many of the labels shown by ethtool can be misleading.

sysfs


sysfs also provides many statistics, but they are slightly higher level than the NIC-level statistics above.

You can find, for example, the number of dropped incoming frames for eth0 by running cat on a file.

Monitor higher-level NIC statistics via sysfs:

 $ cat /sys/class/net/eth0/statistics/rx_dropped 2 

The counter values are split into files such as collisions, rx_dropped, rx_errors, rx_missed_errors, and so on.

Unfortunately, it is up to each driver to decide what these fields mean, when to increment them, and where the values come from. You may notice that one driver counts a certain error condition as a drop while another counts the same condition as a miss.

If these values are critical for you, you will have to read the driver source to understand exactly what your driver means by each of them.

/proc/net/dev


An even higher-level file is /proc/net/dev, which provides summary information for every network adapter in the system.

Monitor high-level NIC statistics by reading /proc/net/dev:

 $ cat /proc/net/dev Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed eth0: 110346752214 597737500 0 2 0 0 0 20963860 990024805984 6066582604 0 0 0 0 0 0 lo: 428349463836 1579868535 0 0 0 0 0 0 428349463836 1579868535 0 0 0 0 0 0 

This file shows a subset of the statistics found in the sysfs files mentioned above, and it is a convenient quick reference.

The caveat above applies here as well: the values are accumulated by the drivers, so the exact meaning of columns such as errs, drop, and fifo depends on the driver.



Tuning network devices

Check the number of RX queues being used

If your NIC and driver support RSS (multiple queues), you can usually adjust the number of RX queues (also called RX channels) with ethtool.

Check the number of NIC receive queues:

 $ sudo ethtool -l eth0 Channel parameters for eth0: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 8 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 4 

This output shows the pre-set maximums (enforced by the driver and the hardware) and the current settings.

Note: not all device drivers support this operation.

If your NIC does not support it, you will see an error like this:

 $ sudo ethtool -l eth0 Channel parameters for eth0: Cannot get device channel parameters : Operation not supported 

This means the driver has not implemented the ethtool get_channels operation, possibly because the card does not support adjusting the number of queues, does not support RSS/multiqueue, or the driver simply has not been updated to support this feature.


Adjusting the number of RX queues

Once you know the current and maximum values, you can change the number of queues with sudo ethtool -L.

Note: for most drivers, changing this setting takes the interface down and then brings it back up, so connections on this interface will be interrupted. For a one-time change this may not matter much.

Set the number of combined (receive and transmit) queues to 8 with ethtool -L:

 $ sudo ethtool -L eth0 combined 8 

If your NIC and driver support setting the RX and TX queue counts separately, you can set only the number of RX queues to 8 like this:

 $ sudo ethtool -L eth0 rx 8 

Note: for most drivers this change also takes the interface down and brings it back up; connections will be interrupted.


Adjusting the size of the RX queues

Some NICs and their drivers also support adjusting the size of each RX queue. Exactly how this works is hardware specific, but fortunately ethtool provides a generic way to adjust it. Increasing the queue size can help prevent drops at the NIC during bursts of incoming frames, although data may still be dropped later in software, so additional tuning may be required.

Check the current size of the RX queues with ethtool -g:

 $ sudo ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 512 RX Mini: 0 RX Jumbo: 0 TX: 512 

The output above shows that the hardware supports up to 4096 receive and transmit descriptors, but only 512 are currently in use.

Increase the size of each RX queue to 4096 with ethtool -G:

 $ sudo ethtool -G eth0 rx 4096 

Note: for most drivers this change takes the interface down and brings it back up; connections will be interrupted.


Adjusting the processing weight of RX queues

Some NICs allow you to adjust how network data is distributed among the RX queues by assigning weights.

You can configure this if your NIC supports flow indirection, your driver implements the ethtool operations for reading and writing the indirection table, and your version of ethtool supports -x and -X.

Check the current RX flow indirection table:

 $ sudo ethtool -x eth0 RX flow hash indirection table for eth3 with 2 RX ring(s): 0: 0 1 0 1 0 1 0 1 8: 0 1 0 1 0 1 0 1 16: 0 1 0 1 0 1 0 1 24: 0 1 0 1 0 1 0 1 

The hash values are listed on the left, with receive queues 0 and 1 shown inline: a packet that hashes to 2 will be delivered to queue 0, while a packet that hashes to 3 will be delivered to queue 1.

Example: spread processing evenly between the first two RX queues:

 $ sudo ethtool -X eth0 equal 2 

If you want to set custom weights so that certain receive queues (and therefore certain CPUs) get more traffic than others, you can specify them with ethtool -X:

 $ sudo ethtool -X eth0 weight 6 2 

This gives receive queue 0 a weight of 6 and receive queue 1 a weight of 2, pushing more of the processing to queue 0.

Some NICs also let you adjust the fields that are used to compute the hash, as we will see next.


Adjusting the RX hash fields for network flows

You can use ethtool to adjust which fields are used when computing the hash for RSS.

Check which fields are used for the UDP receive flow hash with ethtool -n:

 $ sudo ethtool -n eth0 rx-flow-hash udp4 UDP over IPV4 flows use these fields for computing Hash flow key: IP SA IP DA 

For eth0, the hash for UDP flows over IPv4 is computed from the IPv4 source and destination addresses only. Let's include the source and destination ports as well:

 $ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn 

The sdfn string is a bit cryptic; see the ethtool man page for an explanation of each letter.

Adjusting the hash fields is useful, but ntuple filtering gives even finer-grained control over which flows are handled by which RX queue.

ntuple filtering for steering network flows


Some NICs support a feature known as ntuple filtering. It allows the user to specify, via ethtool, a set of parameters used to filter incoming data in hardware and steer it to a particular RX queue. For example, you can specify that TCP packets destined to a particular port should be sent to RX queue 1.

On Intel NICs this feature is usually called Intel Ethernet Flow Director; other vendors may have their own marketing names for it.

As we will see later, ntuple filtering is a key component of Accelerated Receive Flow Steering (aRFS), which makes ntuple much easier to use if your NIC supports it. aRFS is covered below.

This feature is useful when you want to maximize data locality and CPU cache hit rates while processing network data. For example, consider a web server bound to port 80: if the server process is pinned to CPU 2, IRQs for a given RX queue are handled by CPU 2, and an ntuple rule steers TCP traffic for port 80 to that queue, then all incoming traffic for port 80 is processed on CPU 2 from arrival all the way up to the application.

Before you can use ntuple filters, check whether the feature is enabled on your device:

 $ sudo ethtool -k eth0 Offload parameters for eth0: ... ntuple-filters: off receive-hashing: on 

As you can see, ntuple-filters is off on this device.

Enable ntuple filters:

 $ sudo ethtool -K eth0 ntuple on 

Once enabled, you can list the existing ntuple rules with ethtool -u:

 $ sudo ethtool -u eth0 40 RX rings available Total 0 rules 

As you can see, this device has no ntuple filter rules yet. You can add a rule on the ethtool command line; for example, to direct all TCP traffic with destination port 80 to RX queue 2:

 $ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2 

You can also use ntuple filtering to drop packets of particular flows at the hardware level, which can be handy when dealing with heavy traffic from specific IP addresses. See the ethtool man page for details.

Statistics on ntuple rule hits and misses are usually available via ethtool -S [device name]; on Intel NICs, for example, the fdir_match and fdir_miss counters track matches and misses of the filtering rules. Consult your driver source and data sheet for specifics.

3.2. SoftIRQ


Before examining the rest of the network stack, let's take a short detour into the kernel's SoftIRQ system.

What is a SoftIRQ?

A SoftIRQ is a mechanism for executing code outside the context of a driver's hardware interrupt handler. This matters because hardware interrupts may be disabled while an interrupt handler runs; the longer they stay disabled, the greater the chance of missing events. So any long-running work should be deferred out of the interrupt handler, allowing it to finish quickly and re-enable interrupts from the device.

The kernel has other mechanisms for deferring work, but for the network stack we are interested in SoftIRQs.

The SoftIRQ system can be pictured as a set of kernel threads (one per CPU) that run handler functions registered for different SoftIRQ events. If you have ever seen ksoftirqd/0 in top, that is the SoftIRQ kernel thread running on CPU 0.

Kernel subsystems (such as networking) register their SoftIRQ handlers with open_softirq. Later we will see how the networking subsystem registers its handlers; for now, let's look at how SoftIRQs themselves work.
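A sketch of that registration pattern is shown below (the API names open_softirq, raise_softirq, and NET_RX_SOFTIRQ are real; my_rx_action is a stand-in for the networking handler net_rx_action, whose actual registration in net_dev_init appears later in this article):

/* Sketch of the SoftIRQ register/raise pattern used by the networking code.
 * my_rx_action stands in for net_rx_action; the API calls are the real ones. */
#include <linux/interrupt.h>

static void my_rx_action(struct softirq_action *h)
{
	/* deferred packet processing runs here, with hardware IRQs enabled */
}

static int __init my_subsys_init(void)
{
	/* done once at boot; net_dev_init registers net_rx_action exactly like this */
	open_softirq(NET_RX_SOFTIRQ, my_rx_action);
	return 0;
}

static void my_irq_handler_tail(void)
{
	/* called from (or right after) a hardware interrupt handler:
	 * mark the softirq pending so ksoftirqd or the irq-exit path runs it soon */
	raise_softirq(NET_RX_SOFTIRQ);
}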

ksoftirqd


Because SoftIRQs are so important for deferring device driver work, the ksoftirqd threads are spawned for each CPU very early in the kernel's life cycle.

The code from kernel/softirq.c shows how the ksoftirqd system is initialized:

 static struct smp_hotplug_thread softirq_threads = { .store = &ksoftirqd, .thread_should_run = ksoftirqd_should_run, .thread_fn = run_ksoftirqd, .thread_comm = "ksoftirqd/%u", }; static __init int spawn_ksoftirqd(void) { register_cpu_notifier(&cpu_nfb); BUG_ON(smpboot_register_percpu_thread(&softirq_threads)); return 0; } early_initcall(spawn_ksoftirqd); 

As you can see from the struct smp_hotplug_thread definition, two function pointers are registered: ksoftirqd_should_run and run_ksoftirqd.

Both are called from kernel/smpboot.c as part of something resembling an event loop.

The thread in kernel/smpboot.c first calls ksoftirqd_should_run, which checks whether any SoftIRQs are pending; if there are, run_ksoftirqd is executed, which does some minor bookkeeping and then calls __do_softirq.

__do_softirq


The __do_softirq function determines which SoftIRQs are pending, accounts the time spent, increments the SoftIRQ execution statistics, and runs the handler registered (via open_softirq) for each pending SoftIRQ.

So when you look at CPU usage graphs and see softirq or si, you now know that it measures the time spent running deferred work in SoftIRQ context.

Monitoring


/proc/softirqs


The SoftIRQ system exports statistics counters in /proc/softirqs. Monitoring them gives you a sense of how often SoftIRQs of each type are being raised.

Check SoftIRQ statistics by reading /proc/softirqs:

 $ cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI: 0 0 0 0 TIMER: 2831512516 1337085411 1103326083 1423923272 NET_TX: 15774435 779806 733217 749512 NET_RX: 1671622615 1257853535 2088429526 2674732223 BLOCK: 1800253852 1466177 1791366 634534 BLOCK_IOPOLL: 0 0 0 0 TASKLET: 25 0 0 0 SCHED: 2642378225 1711756029 629040543 682215771 HRTIMER: 2547911 2046898 1558136 1521176 RCU: 2056528783 4231862865 3545088730 844379888 

This file shows how network receive (NET_RX) processing is currently distributed across your CPUs. If the distribution is uneven, you will see much larger counts on some CPUs than on others; that is one sign that Receive Packet Steering / Receive Flow Steering, described later, might help. Be careful about relying on this file alone when monitoring performance: during periods of heavy network activity you would expect the NET_RX rate to rise, but that is not necessarily the case, because other tuning knobs in the stack affect how often NET_RX SoftIRQs are raised.

Keep this in mind, though: if you adjust those other knobs, you should check /proc/softirqs and expect to see the change reflected here.

Now let's move on to the networking subsystem itself.

3.3. Linux networking subsystem


Now that we have looked at how network drivers and SoftIRQs work, let's see how the Linux networking subsystem is initialized. Then we can follow the path of a packet from its arrival.

Initialization of the networking subsystem

The network device (netdev) subsystem is initialized in the function net_dev_init. A number of interesting things happen in this initialization function.

struct softnet_data


net_dev_init creates a set of struct softnet_data structures, one per CPU. These structures hold references to several things needed for processing network data (the NAPI structures to poll on this CPU, the backlog queue, the processing weight, and so on).


SoftIRQ


net_dev_init also registers the transmit and receive SoftIRQ handlers that will be used to process outgoing and incoming network data:

 static int __init net_dev_init(void) { /* ... */ open_softirq(NET_TX_SOFTIRQ, net_tx_action); open_softirq(NET_RX_SOFTIRQ, net_rx_action); /* ... */ } 

We will soon see how a driver's interrupt handler causes net_rx_action, the function registered for NET_RX_SOFTIRQ, to be "raised" (executed).


At last, network data arrives!

Assuming the receive queue has enough free descriptors, the packet is written to RAM via DMA, and the device raises the interrupt assigned to it (or, in the case of MSI-X, the interrupt bound to the receive queue the packet arrived on).


In general, an interrupt handler should defer as much processing as possible outside of interrupt context, because while an interrupt is being handled, other interrupts may be blocked.

Let's look at the MSI-X interrupt handler; it nicely illustrates the idea that the interrupt handler does as little work as possible.

drivers/net/ethernet/intel/igb/igb_main.c :
static irqreturn_t igb_msix_ring(int irq, void *data)
{
	struct igb_q_vector *q_vector = data;

	/* Write the ITR value calculated from the previous interrupt. */
	igb_write_itr(q_vector);

	napi_schedule(&q_vector->napi);

	return IRQ_HANDLED;
}

The interrupt handler is intentionally very short and does just two things:

  1. igb_write_itr simply updates a hardware register with the current interrupt throttle rate. This rate ("Interrupt Throttling", also known as Interrupt Coalescing) limits how often the device raises interrupts, and therefore how much CPU time is spent servicing them; as shown earlier, the coalescing settings can be adjusted with ethtool.
  2. napi_schedule is called, which wakes up NAPI packet harvesting if it is not already active. Note that this runs in hardware interrupt context; the heavy lifting is deferred to the SoftIRQ from which NAPI runs.

This is how the handler defers the bulk of the work to the SoftIRQ context, keeping the time spent with interrupts blocked to a minimum.

NAPI napi_schedule


, napi_schedule .

, NAPI , . , poll (bootstrapped) . , NAPI , , . NAPI . , NAPI , , .

poll , napi_schedule. -, , __napi_schedule.

net/core/dev.c :

/**
 * __napi_schedule - schedule for receive
 * @n: entry to schedule
 *
 * The entry's receive function will be scheduled to run.
 */
void __napi_schedule(struct napi_struct *n)
{
	unsigned long flags;

	local_irq_save(flags);
	____napi_schedule(&__get_cpu_var(softnet_data), n);
	local_irq_restore(flags);
}
EXPORT_SYMBOL(__napi_schedule);

softnet_data, CPU, __get_cpu_var. ____napi_schedule struct napi_struct. .

____napi_schedule, net/core/dev.c :

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

:

  1. struct napi_struct, , poll_list, softnet_data, CPU.
  2. __raise_softirq_irqoff SoftIRQ NET_RX_SOFTIRQ. net_rx_action, , .

, SoftIRQ - net_rx_action NAPI poll.

CPU


, , SoftIRQ, , CPU.

IRQ- , SoftIRQ- CPU, IRQ-. , CPU IRQ: CPU , SoftIRQ NAPI.

, ( Receive Packet Steering ) CPU .



: . NAPI. .

/proc/interrupts, :

 $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 46 0 0 0 IR-IO-APIC-edge timer 1: 3 0 0 0 IR-IO-APIC-edge i8042 30: 3361234770 0 0 0 IR-IO-APIC-fasteoi aacraid 64: 0 0 0 0 DMAR_MSI-edge dmar0 65: 1 0 0 0 IR-PCI-MSI-edge eth0 66: 863649703 0 0 0 IR-PCI-MSI-edge eth0-TxRx-0 67: 986285573 0 0 0 IR-PCI-MSI-edge eth0-TxRx-1 68: 45 0 0 0 IR-PCI-MSI-edge eth0-TxRx-2 69: 394 0 0 0 IR-PCI-MSI-edge eth0-TxRx-3 NMI: 9729927 4008190 3068645 3375402 Non-maskable interrupts LOC: 2913290785 1585321306 1495872829 1803524526 Local timer interrupts 

/proc/interrupts, . , CPU. , , , , NAPI. , (interrupt coalescing) , . , .

, /proc/softirqs /proc. We will discuss this below.



CPU, .

« » , . , CPU. : , CPU.

, igb, e1000 InterruptThrottleRate. generic ethtool.

IRQ:

 $ sudo ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 ... 

ethtool generic- . , . , , . ethtool: «, , ».

« /» (adaptive RX/TX IRQ coalescing). . , , - (bookkeeping) ( igb).

, .

:

 $ sudo ethtool -C eth0 adaptive-rx on 

ethtool -C . :


.

, . .

, , . include/uapi/linux/ethtool.h , ethtool ( , ).

: . , , . .

IRQ


RSS, , CPU .

CPU. , .

IRQ, , irqbalance. CPU, . irqbalance, --banirq IRQBALANCE_BANNED_CPUS, irqbalance , CPU.

/proc/interrupts . , /proc/irq/IRQ_NUMBER/smp_affinity, , CPU . , , CPU .

: IRQ 8 CPU 0:

 $ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity' 


SoftIRQ- , SoftIRQ , net_rx_action .

net_rx_action, , , .

net_rx_action


net_rx_action , DMA.

NAPI, CPU, .

NAPI- poll. :

  1. (work budget) ( ),
  2. .

net/core/dev.c :

 while (!list_empty(&sd->poll_list)) { struct napi_struct *n; int work, weight; /*    SoftIRQ -  *      ,    *     1.5/. */ if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit))) goto softnet_break; 

CPU. budget — , NAPI-, CPU.

, IRQ. , CPU, , SoftIRQ. CPU .

, , , CPU NAPI-. CPU «» .

CPU , net_rx_action budget, CPU . CPU ( sitime si top ), , .

: CPU jiffies , .

NAPI- poll weight


, poll netif_napi_add. , igb :

/* initialize NAPI */
netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

NAPI-, 64. , net_rx_action.

net/core/dev.c :

 weight = n->weight; work = 0; if (test_bit(NAPI_STATE_SCHED, &n->state)) { work = n->poll(n, weight); trace_napi_poll(n); } WARN_ON_ONCE(work > weight); budget -= work; 

, NAPI, poll, NAPI ( igb_poll).

poll . work, budget.

Assume:
  1. 64 ( Linux 3.13.0 ),
  2. budget 300.

, :

  1. igb_poll 5 ( , , ),
  2. 2 jiffies.

NAPI


, NAPI . NAPI.


, net_rx_action . , poll, , .

net_rx_action


net_rx_action , NAPI. net/core/dev.c :

 /*      NAPI,   *    .       * «»  NAPI, , ,  *       . */ if (unlikely(work == weight)) { if (unlikely(napi_disable_pending(n))) { local_irq_enable(); napi_complete(n); local_irq_disable(); } else { if (n->gro_list) { /*     *  HZ < 1000,   . */ local_irq_enable(); napi_gro_flush(n, HZ >= 1000); local_irq_disable(); } list_move_tail(&n->poll_list, &sd->poll_list); } } 

, net_rx_action :

  1. (, ifconfig eth0 down).
  2. , , generic receive offload (GRO). ( timer tick rate ) >= 1000, GRO, , . GRO. NAPI- CPU, NAPI-.

poll, . , .


net_rx_action , :




  /*    SoftIRQ - . *      ,    *     1.5/. */ if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit))) goto softnet_break; 

label softnet_break, - . net/core/dev.c :

 softnet_break: sd->time_squeeze++; __raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out; 

struct softnet_data SoftIRQ NET_RX_SOFTIRQ. time_squeeze — , net_rx_action , , . . . NET_RX_SOFTIRQ , . , , , CPU.

(label) out. out , NAPI- , , , NAPI, net_rx_action .

net_rx_action, out : net_rps_action_and_irq_enable. ( Receive Packet Steering ), CPU, .

RPS. , net_rx_action, «» NAPI- poll, .

NAPI- poll


, , . , , .

igb ?

igb_poll


- igb_poll. . drivers/net/ethernet/intel/igb/igb_main.c :

 /** * igb_poll – NAPI Rx polling callback * @napi:   (polling) NAPI * @budget:   ,    **/ static int igb_poll(struct napi_struct *napi, int budget) { struct igb_q_vector *q_vector = container_of(napi, struct igb_q_vector, napi); bool clean_complete = true; #ifdef CONFIG_IGB_DCA if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED) igb_update_dca(q_vector); #endif /* ... */ if (q_vector->rx.ring) clean_complete &= igb_clean_rx_irq(q_vector, budget); /*     ,      */ if (!clean_complete) return budget; /*      ,     */ napi_complete(napi); igb_ring_irq_enable(q_vector); return 0; } 

:


, igb_clean_rx_irq .

igb_clean_rx_irq


igb_clean_rx_irq — , , budget .

:

  1. , . IGB_RX_BUFFER_WRITE (16) .
  2. skb.
  3. , “End of Packet”. , . skb. , .
  4. (layout) .
  5. skb->len .
  6. skb , , , VLAN id . . , csum_error. , UDP TCP, skb CHECKSUM_UNNECESSARY. , . eth_type_trans skb.
  7. skb napi_gro_receive.
  8. .
  9. , .

, .

, . -, , SoftIRQ . -, Generic Receive Offloading (GRO). , napi_gro_receive, .


/proc/net/softnet_stat



, net_rx_action, , , SoftIRQ. struct softnet_data, CPU. /proc/net/softnet_stat, , , . .

Linux 3.13.0 , /proc/net/softnet_stat . net/core/net-procfs.c :

 seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", sd->processed, sd->dropped, sd->time_squeeze, 0, 0, 0, 0, 0, /* was fastroute */ sd->cpu_collision, sd->received_rps, flow_limit_count); 

. , . squeeze_time net_rx_action, , .

/proc/net/softnet_stat, :

 $ cat /proc/net/softnet_stat 6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 

/proc/net/softnet_stat:


, , . .


net_rx_action


net_rx_action , NAPI-, CPU. sysctl net.core.netdev_budget.

: 600.

 $ sudo sysctl -w net.core.netdev_budget=600 

/etc/sysctl.conf , . Linux 3.13.0 300.

Generic Receive Offloading (GRO)


Generic Receive Offloading (GRO) — , Large Receive Offloading (LRO).

, , , « » . CPU. , , . , , . . , .

. - , . LRO .
GRO LRO, .

: - tcpdump , GRO . , tap' , GRO.

GRO ethtool


ethtool , GRO, .

:

 $ ethtool -k eth0 | grep generic-receive-offload generic-receive-offload: on 

generic-receive-offload. GRO:

 $ sudo ethtool -K eth0 gro on 

: , . .

napi_gro_receive


napi_gro_receive GRO ( GRO ) . dev_gro_receive.

dev_gro_receive


, GRO. , : offload-, , GRO. , , , , , GRO. , TCP- , / .

, , net/core/dev.c :

 list_for_each_entry_rcu(ptype, head, list) { if (ptype->type != type || !ptype->callbacks.gro_receive) continue; skb_set_network_header(skb, skb_gro_offset(skb)); skb_reset_mac_len(skb); NAPI_GRO_CB(skb)->same_flow = 0; NAPI_GRO_CB(skb)->flush = 0; NAPI_GRO_CB(skb)->free = 0; pp = ptype->callbacks.gro_receive(&napi->gro_list, skb); break; } 

, GRO-, . napi_gro_complete, callback gro_complete , netif_receive_skb.

net/core/dev.c :
 if (pp) { struct sk_buff *nskb = *pp; *pp = nskb->next; nskb->next = NULL; napi_gro_complete(nskb); napi->gro_count--; } 

, napi_gro_receive .

MAX_GRO_SKBS (8) GRO-, gro_list NAPI- CPU .

net/core/dev.c :

 if (NAPI_GRO_CB(skb)->flush || napi->gro_count >= MAX_GRO_SKBS) goto normal; napi->gro_count++; NAPI_GRO_CB(skb)->count = 1; NAPI_GRO_CB(skb)->age = jiffies; skb_shinfo(skb)->gso_size = skb_gro_len(skb); skb->next = napi->gro_list; napi->gro_list = skb; ret = GRO_HELD; 

GRO Linux.

napi_skb_finish


dev_gro_receive napi_skb_finish, , , netif_receive_skb ( GRO MAX_GRO_SKBS).

(Receive Packet Steering (RPS)).

3.4. Receive Packet Steering (RPS)


, , NAPI- poll? NAPI SoftIRQ, CPU. , CPU, , SoftIRQ- .

, CPU poll.

( Intel I350) . , , . NAPI-. CPU.

Receive Side Scaling (RSS).

Receive Packet Steering (RPS) — RSS. , , . , , RPS , DMA- .

, CPU poll, , , CPU .

RPS , , CPU . (backlog) . backlog (IPI), , . /proc/net/softnet_stat , softnet_data IPI ( received_rps).

, netif_receive_skb RPS CPU.

RPS


For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu with kernel 3.13.0), and you must provide a bitmask describing which CPUs should process packets for a given interface and receive queue.

The bitmasks live in sysfs, one per interface and receive queue:

 /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus 

For example, /sys/class/net/eth0/queues/rx-0/rps_cpus controls which CPUs may process packets arriving on receive queue 0 of eth0; writing a CPU bitmask to this file enables RPS for that queue.
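For example (a sketch, assuming the device is eth0 and you want receive queue rx-0 to be processed by CPUs 0-3), the file takes a hexadecimal CPU bitmask in which bit N stands for CPU N, so CPUs 0-3 correspond to the mask f:

/* Sketch: enable RPS for eth0 queue rx-0 on CPUs 0-3 by writing the CPU bitmask
 * to sysfs. Assumes the interface is eth0 and the program runs as root. */
#include <stdio.h>

int main(void)
{
	unsigned long mask = (1UL << 0) | (1UL << 1) | (1UL << 2) | (1UL << 3); /* 0xf */
	FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

	if (!f) {
		perror("rps_cpus");
		return 1;
	}
	fprintf(f, "%lx\n", mask);
	fclose(f);
	return 0;
}

Writing 0 to the same file disables RPS for that queue again.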

Note: enabling RPS so that packet processing is distributed to CPUs that previously did not handle packets will increase the number of `NET_RX` SoftIRQs on those CPUs, as well as their `si` or `sitime` in CPU usage graphs. Compare before and after to confirm that RPS is doing what you want.

3.5. Receive Flow Steering (RFS)


Receive flow steering (RFS) RPS. RPS CPU, CPU. , RFS, CPU.

RFS


RFS keeps track of flows in a global hash table whose size is controlled by the sysctl net.core.rps_sock_flow_entries.

Set the size of the RFS socket flow hash table:

 $ sudo sysctl -w net.core.rps_sock_flow_entries=32768 

The second knob is rps_flow_cnt, the number of flow entries per receive queue, set via sysfs.

Example: set rps_flow_cnt to 2048 for receive queue 0 of eth0:

 $ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt' 

3.6. Accelerated Receive Flow Steering (aRFS)


RFS . , CPU . , . ndo_rx_flow_steer, aRFS.

aRFS


, . :

  1. RPS.
  2. RFS.
  3. CONFIG_RFS_ACCEL. , Ubuntu 3.13.0.
  4. ntuple, . , , ethtool.
  5. IRQ , CPU, .

, aRFS , CPU, . ntuple .

3.7. Moving up the network stack: netif_receive_skb


, netif_receive_skb, . ( ):


: netif_receive_skb SoftIRQ. top sitime si.

netif_receive_skb sysctl , , backlog-. , , RPS ( backlog-, CPU). , . RPS CPU, .

:


sysctl net.core.netdev_tstamp_prequeue.

:

 $ sudo sysctl -w net.core.netdev_tstamp_prequeue=0 

1. .

3.8. netif_receive_skb


, netif_receive_skb , , RPS. , RPS .

RPS ( )


__netif_receive_skb, - (bookkeeping), __netif_receive_skb_core, .

, __netif_receive_skb_core, RPS, __netif_receive_skb_core.

RPS


, netif_receive_skb , backlog- CPU . get_rps_cpu. net/core/dev.c :

 cpu = get_rps_cpu(skb->dev, skb, &rflow); if (cpu >= 0) { ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); rcu_read_unlock(); return ret; } 

get_rps_cpu RFS aRFS, , enqueue_to_backlog backlog CPU.

enqueue_to_backlog


softnet_data CPU, input_pkt_queue. input_pkt_queue CPU. net/core/dev.c :

 qlen = skb_queue_len(&sd->input_pkt_queue); if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) { 

input_pkt_queue netdev_max_backlog. , . , , . softnet_data. , CPU, . /proc/net/softnet_stat.

enqueue_to_backlog RPS, netif_rx. netif_rx, netif_receive_skb. RPS netif_rx, backlog' , .

: . netif_receive_skb RPS, netdev_max_backlog , input_pkt_queue.

, input_pkt_queue ( ), . , :


- goto , . net/core/dev.c :

 if (skb_queue_len(&sd->input_pkt_queue)) { enqueue: __skb_queue_tail(&sd->input_pkt_queue, skb); input_queue_tail_incr_save(sd, qtail); rps_unlock(sd); local_irq_restore(flags); return NET_RX_SUCCESS; } /* Schedule NAPI for backlog device * We can use non atomic operation since we own the queue lock */ if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) { if (!rps_ipi_queued(sd)) ____napi_schedule(sd, &sd->backlog); } goto enqueue; 


RPS CPU, . , backlog. .

if net/core/dev.c skb_flow_limit :

 if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) { 

, . . ( RPS).

input_pkt_queue


. /proc/net/softnet_stat. dropped — , input_pkt_queue CPU.

Customization


netdev_max_backlog


, .

RPS netif_rx, enqueue_to_backlog netdev_max_backlog.

: backlog 3000:

 $ sudo sysctl -w net.core.netdev_max_backlog=3000 

1000.

NAPI backlog poll


NAPI backlog' net.core.dev_weight sysctl. , poll backlog' (. net.core.netdev_budget).

: backlog' poll:

 $ sudo sysctl -w net.core.dev_weight=600 

64.

, backlog' SoftIRQ poll. , .

-


:

 $ sudo sysctl -w net.core.flow_limit_table_len=8192 

4096.

. , .

/proc/sys/net/core/flow_limit_cpu_bitmap, RPS, , CPU .

NAPI backlog-


Backlog- CPU NAPI , . poll, SoftIRQ. , weight.

NAPI .

net_dev_init net/core/dev.c :

 sd->backlog.poll = process_backlog; sd->backlog.weight = weight_p; sd->backlog.gro_list = NULL; sd->backlog.gro_count = 0; 

NAPI-c backlog' weight: . .

process_backlog


process_backlog — , , ( ) backlog' .

backlog- __netif_receive_skb. , RPS. , __netif_receive_skb __netif_receive_skb_core, .

process_backlog NAPI, : NAPI , . ____napi_schedule enqueue_to_backlog, .

, net_rx_action ( ) ( net.core.netdev_budget, ).

__netif_receive_skb_core packet taps


__netif_receive_skb_core does the main work of delivering data to the protocol layers. Before it does, it first checks whether any packet taps are installed that should receive a copy of the packet. Such taps are installed, for example, by AF_PACKET sockets, which libpcap-based tools use.

If a tap is installed (for example, tcpdump is running), the packet is delivered there first, and then handed to the protocol layers.
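As a concrete, hedged illustration of what a tap is from user space: tools such as tcpdump open an AF_PACKET socket, which installs a packet_type handler on this list so that every frame passing through __netif_receive_skb_core is also delivered to that socket. A minimal sketch (requires root or CAP_NET_RAW):

/* Minimal packet tap: an AF_PACKET socket receives a copy of every frame that
 * passes the tap list in __netif_receive_skb_core (this is how libpcap works). */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_ether.h>    /* ETH_P_ALL */
#include <linux/if_packet.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket(AF_PACKET)");
		return 1;
	}

	for (int i = 0; i < 5; i++) {
		char frame[2048];
		ssize_t len = recv(fd, frame, sizeof(frame), 0);
		printf("tapped a frame of %zd bytes\n", len);
	}

	close(fd);
	return 0;
}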

packet tap


. net/core/dev.c :

 list_for_each_entry_rcu(ptype, &ptype_all, list) { if (!ptype->dev || ptype->dev == skb->dev) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } } 

, pcap, net/packet/af_packet.c .


, , __netif_receive_skb_core . (deliver functions), .

__netif_receive_skb_core net/core/dev.c :

 type = skb->protocol; list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { if (ptype->type == type && (ptype->dev == null_or_dev || ptype->dev == skb->dev || ptype->dev == orig_dev)) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } } 

ptype_base - net/core/dev.c :

 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly; 

-, ptype_head:

 static inline struct list_head *ptype_head(const struct packet_type *pt) { if (pt->type == htons(ETH_P_ALL)) return &ptype_all; else return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK]; } 

dev_add_pack. .
, .

3.9. Protocol layers


, . IP, .

IP


IP - ptype_base, , .

inet_init, net/ipv4/af_inet.c :

 dev_add_pack(&ip_packet_type);     IP-,   <a href="https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L1673-L1676">net/ipv4/af_inet.c</a>: static struct packet_type ip_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_IP), .func = ip_rcv, }; 

__netif_receive_skb_core deliver_skb ( ), func ( – ip_rcv).

ip_rcv


ip_rcv . , , .

ip_rcv ip_rcv_finish netfilter . , iptables , IP, , .

, netfilter ip_rcv net/ipv4/ip_input.c :

 return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish); 

netfilter iptables


netfilter, iptables conntrack.

: NF_HOOK_THRESH , - IP, netfilter , , iptables conntrack.

: netfilter iptables, , SoftIRQ, . , , .

ip_rcv_finish


, netfilter , , ip_rcv_finish. , netfilter'.

ip_rcv_finish . , dst_entry . early_demux , , .

early_demux — , dst_entry. , dst_entry .

, net/ipv4/ip_input.c :

 if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) { const struct net_protocol *ipprot; int protocol = iph->protocol; ipprot = rcu_dereference(inet_protos[protocol]); if (ipprot && ipprot->early_demux) { ipprot->early_demux(skb); /*   iph, skb->head   */ iph = ip_hdr(skb); } } 

, sysctl_ip_early_demux. early_demux. , .

( ), , dst_entry .

, dst_input(skb). , , dst_entry, .

— , ip_local_deliver dst_entry.

early demux IP


early_demux:

 $ sudo sysctl -w net.ipv4.ip_early_demux=0 

1; early_demux .

sysctl 5% early_demux.

ip_local_deliver


IP:

  1. ip_rcv (bookkeeping).
  2. netfilter , callback', .
  3. ip_rcv_finish — callback, .

ip_local_deliver . net/ipv4/ip_input.c :

 /* *  IP-     . */ int ip_local_deliver(struct sk_buff *skb) { /* *  IP-. */ if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER)) return 0; } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish); } 

, netfilter , ip_local_deliver_finish. , netfilter'.

ip_local_deliver_finish


ip_local_deliver_finish , net_protocol , handler .

.

IP


/proc/net/snmp, IP:

 $ cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0 ... 

. IP. .

IP , -. enum- /proc/net/snmp, , include/uapi/linux/snmp.h :

 enum { IPSTATS_MIB_NUM = 0, /*    ,     - */ IPSTATS_MIB_INPKTS, /* InReceives */ IPSTATS_MIB_INOCTETS, /* InOctets */ IPSTATS_MIB_INDELIVERS, /* InDelivers */ IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */ IPSTATS_MIB_OUTPKTS, /* OutRequests */ IPSTATS_MIB_OUTOCTETS, /* OutOctets */ /* ... */ 

/proc/net/netstat, IP:

 $ cat /proc/net/netstat | grep IpExt IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pktsu InCEPkts IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0 

/proc/net/snmp, IpExt.

:


IP. , . , IP, , .


UDP, TCP , UDP.

net/ipv4/af_inet.c , - UDP, TCP ICMP IP. net/ipv4/af_inet.c :

 static const struct net_protocol tcp_protocol = { .early_demux = tcp_v4_early_demux, .handler = tcp_v4_rcv, .err_handler = tcp_v4_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol udp_protocol = { .early_demux = udp_v4_early_demux, .handler = udp_rcv, .err_handler = udp_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol icmp_protocol = { .handler = icmp_rcv, .err_handler = icmp_err, .no_policy = 1, .netns_ok = 1, }; 

inet. net/ipv4/af_inet.c :

 /* *    . */ if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) pr_crit("%s: Cannot add ICMP protocol\n", __func__); if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) pr_crit("%s: Cannot add UDP protocol\n", __func__); if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0) pr_crit("%s: Cannot add TCP protocol\n", __func__); 

UDP. , handler UDP udp_rcv. UPD, IP.

UDP


UDP : net/ipv4/udp.c .

udp_rcv


udp_rcv , __udp4_lib_rcv .

__udp4_lib_rcv


__udp4_lib_rcv UDP, UDP-, . .

, , IP, , dst_entry , ( — UDP).

dst_entry, __udp4_lib_rcv :

 sk = skb_steal_sock(skb); if (sk) { struct dst_entry *dst = skb_dst(skb); int ret; if (unlikely(sk->sk_rx_dst != dst)) udp_sk_rx_dst_set(sk, dst); ret = udp_queue_rcv_skb(sk, skb); sock_put(sk); /*   > 0    , *     –protocol  0 */ if (ret > 0) return -ret; return 0; } else { 

early_demux , __udp4_lib_lookup_skb.

:

 ret = udp_queue_rcv_skb(sk, skb); sock_put(sk); 

, :

 /*  .   ,     */ if (udp_lib_checksum_complete(skb)) goto csum_error; UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE); icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); /* * .   UDP-  ,  *   .  . */ kfree_skb(skb); return 0; 

udp_queue_rcv_skb


:

  1. , , , . , - .
  2. , UDP-Lite .
  3. UDP- .

- . . net/ipv4/udp.c :

 if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf)) goto drop; 

sk_rcvqueues_full


sk_rcvqueues_full backlog' sk_rmem_alloc , , sk_rcvbuf (sk->sk_rcvbuf ):

 /* *      backlog-. *    skb truesize, *       . */ static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb, unsigned int limit) { unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc); return qsize > limit; } 

, .

:


sk->sk_rcvbuf (passed to sk_rcvqueues_full above as the limit) can be increased by changing the sysctl net.core.rmem_max.

Set the maximum receive buffer size:

 $ sudo sysctl -w net.core.rmem_max=8388608 

The initial value of sk->sk_rcvbuf comes from net.core.rmem_default, which can also be adjusted via sysctl.

Set the default receive buffer size:

 $ sudo sysctl -w net.core.rmem_default=8388608 

An application can also change the size of sk->sk_rcvbuf by calling setsockopt with SO_RCVBUF; the maximum that can be set this way is limited by net.core.rmem_max.

The net.core.rmem_max limit can be overridden by calling setsockopt with SO_RCVBUFFORCE, but the process needs the CAP_NET_ADMIN capability.
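A small user-space sketch of the knob just described: requesting a larger receive buffer with SO_RCVBUF (clamped by net.core.rmem_max) and reading back the value the kernel actually applied; the kernel stores roughly double the requested size to account for bookkeeping overhead.

/* Sketch: request a larger UDP receive buffer and read back what the kernel
 * granted. The request is capped by net.core.rmem_max unless SO_RCVBUFFORCE is
 * used by a process with CAP_NET_ADMIN. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int requested = 4 * 1024 * 1024;
	int actual = 0;
	socklen_t len = sizeof(actual);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

	printf("requested %d bytes, kernel set sk_rcvbuf to %d\n", requested, actual);
	close(fd);
	return 0;
}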

sk->sk_rmem_alloc is incremented by calls to skb_set_owner_r, which sets the owning socket of an skb; we will see this called in the UDP layer.

sk->sk_backlog.len is incremented by sk_add_backlog.

udp_queue_rcv_skb


. net/ipv4/udp.c :

 bh_lock_sock(sk); if (!sock_owned_by_user(sk)) rc = __udp_queue_rcv_skb(sk, skb); else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) { bh_unlock_sock(sk); goto drop; } bh_unlock_sock(sk); return rc; 

, . , __udp_queue_rcv_skb. , backlog- sk_add_backlog.

backlog', release_sock .

__udp_queue_rcv_skb


__udp_queue_rcv_skb sock_queue_rcv_skb. , __udp_queue_rcv_skb .

net/ipv4/udp.c :

 rc = sock_queue_rcv_skb(sk, skb); if (rc < 0) { int is_udplite = IS_UDPLITE(sk); /*  ,   ENOMEM   */ if (rc == -ENOMEM) UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,is_udplite); UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite); kfree_skb(skb); trace_udp_fail_queue_rcv_skb(rc, sk); return -1; } 

UDP


UDP:


/proc/net/snmp


/proc/net/snmp, UDP.

 $ cat /proc/net/snmp | grep Udp\: Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 16314 0 0 17161 0 0 

IP, , , , .

InDatagrams: , :


NoPorts: , UDP- , .

InErrors: , :


OutDatagrams: , UDP- , IP .

RcvbufErrors: , sock_queue_rcv_skb ; , sk->sk_rmem_alloc sk->sk_rcvbuf.

SndbufErrors: , :


InCsumErrors: , UDP. , , , InCsumErrors InErrors. , InErrors — InCsumErros .

/proc/net/udp


/proc/net/udp, UDP.

 $ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0 558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0 588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0 769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0 812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0 

:


net/ipv4/udp.c .


Finally, data is added to the socket's receive queue by sock_queue_rcv_skb. Before queueing the skb, this function does a few things:

  1. The socket's allocated memory is checked against the receive buffer size; if it is exceeded, the socket's drop counter is incremented.
  2. sk_filter runs any Berkeley Packet Filter attached to the socket.
  3. sk_rmem_schedule verifies that enough receive buffer space exists to accept this skb.
  4. The size of the skb is charged to the socket with skb_set_owner_r, which increments sk->sk_rmem_alloc.
  5. The data is added to the queue with __skb_queue_tail.
  6. Finally, any process blocked waiting for data on the socket is woken up via the sk_data_ready notification handler.

That is how data arrives on a system and traverses the network stack until it reaches a socket, ready to be read by a user program.
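To close the loop from user space, here is a minimal, illustrative UDP receiver (the port number 9999 is arbitrary): the recvfrom call is what ultimately dequeues the skb that sock_queue_rcv_skb placed on the socket's receive queue, after sk_data_ready has woken the process up.

/* Minimal UDP receiver: recvfrom() pulls a datagram off the socket receive
 * queue filled by sock_queue_rcv_skb. Port 9999 is an arbitrary example. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr;
	char buf[2048];
	ssize_t n;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(9999);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);  /* blocks until a datagram arrives */
	printf("received %zd bytes\n", n);

	close(fd);
	return 0;
}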

3.10. Additional Information


, .


, . sysctl , RPS . , RPS . .

, , .

, - !

, :

 $ sudo ethtool -T eth0 Time stamping parameters for eth0: Capabilities: software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE) software-receive (SOF_TIMESTAMPING_RX_SOFTWARE) software-system-clock (SOF_TIMESTAMPING_SOFTWARE) PTP Hardware Clock: none Hardware Transmit Timestamp Modes: none Hardware Receive Filter Modes: none 

, , . , .


Busy polling

It is possible to set the SO_BUSY_POLL socket option, which causes the kernel to busy-poll for new data when a blocking receive is issued and there is no data yet.

IMPORTANT: for this to work, the device driver must support it. The igb driver in kernel 3.13.0 does not; the ixgbe driver does. If your driver has a function attached to the ndo_busy_poll field of its struct net_device_ops (described above), it supports SO_BUSY_POLL.

An Intel paper explains how this works and its performance characteristics in detail.

The option value is the number of microseconds the driver should busy-poll its receive queue for new data; after setting it, blocking reads on the socket will busy-poll for that long before sleeping.

You can also set the sysctl net.core.busy_poll to a value in microseconds, which controls how long calls using poll or select should busy-poll waiting for new data.

This option can reduce latency, but it increases CPU usage and power consumption.
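A hedged user-space sketch of the per-socket variant: the option value is the number of microseconds the kernel may busy-poll the device's receive queue during a blocking read on this socket (remember that the driver must implement ndo_busy_poll for this to have any effect).

/* Sketch: opt a single UDP socket into busy polling (value in microseconds).
 * SO_BUSY_POLL exists since kernel 3.11; the fallback define matches
 * asm-generic/socket.h in case the libc headers are older. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int busy_usecs = 50;   /* spin for up to 50 microseconds waiting for data */

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usecs, sizeof(busy_usecs)) != 0)
		perror("SO_BUSY_POLL");

	close(fd);
	return 0;
}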

Netpoll:


Linux , . API Netpoll. , kgdb netconsole .

Netpoll . ndo_poll_controller struct net_device_ops, probe.

, Netpoll, , .

__netif_receive_skb_core net/dev/core.c :

 static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc) { /* ... */ /*      NAPI,  netpoll */ if (netpoll_receive_skb(skb)) goto out; /* ... */ } 

Netpoll Linux, .

Netpoll API struct netpoll netpoll_setup. , API .

Netpoll API, netconsole , Netpoll API, 'include/linux/netpoll.h` .

SO_INCOMING_CPU


SO_INCOMING_CPU did not appear until Linux 3.19, but it is useful enough to mention here.

Using getsockopt with SO_INCOMING_CPU, you can determine which CPU is processing network packets for a particular socket. Your application can use this information to hand sockets to threads running on the desired CPUs, improving data locality and CPU cache hit rates.
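A hedged sketch of how an application would query this (SO_INCOMING_CPU is a real option, but the value is only meaningful after the socket has actually received traffic; the fallback define matches asm-generic/socket.h):

/* Sketch: ask the kernel which CPU last processed packets for this socket
 * (kernel >= 3.19). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int cpu = -1;
	socklen_t len = sizeof(cpu);

	/* ... after this socket has received some traffic ... */
	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) == 0)
		printf("RX processing for this socket last ran on CPU %d\n", cpu);

	close(fd);
	return 0;
}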

The mailing list message introducing the option shows a short example of the architecture where this is useful: patchwork.ozlabs.org/patch/408257.

DMA


DMA — , CPU , . DMA , , CPU.

Linux DMA, . .

DMA, — Intel IOAT DMA engine .

I/O Intel (Intel's I/O Acceleration Technology (IOAT))


Intel I/O AT , . — DMA. dmesg ioatdma , . DMA , — TCP.

Intel IOAT Linux 2.6.18, 3.13.11.10 - , . 3.13.11.10 ioatdma. , .

(Direct cache access (DCA))


, Intel I/O AT — Direct Cache Access (DCA).

( ) CPU. . igb igb_update_dca , igb_update_rx_dca . igb DCA .

DCA, BIOS, , dca .

IOAT DMA


, , ioatdma, sysfs.
memcpy DMA-:

 $ cat /sys/class/dma/dma0chan0/memcpy_count 123205655 

, DMA-:

 $ cat /sys/class/dma/dma0chan0/bytes_transferred 131791916307 

IOAT DMA


IOAT DMA , — copybreak. , DMA .

copybreak DMA:

 $ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048 

4096.

4. Conclusion


The Linux network stack is complicated, and there is no single monitoring or tuning recipe that fits every situation. If network performance really matters to you, there is no alternative to understanding how each part of the system works; you cannot simply copy someone else's sysctl.conf and expect good results.

Start by monitoring from the driver upward: measure drops and errors at each layer, and only then decide which settings to tune.

We hope this walkthrough helps you do exactly that.

Source: https://habr.com/ru/post/314168/

