SO_TIMESTAMPING in pictures. Receiving a packet

Sometimes an application needs to know exactly when a network packet was received or sent, for example to synchronize clocks (see PTP, NTP) or to measure network delays (see RFC 2544).

A naive solution is to record the time in the application right after receiving the packet from the kernel (or right before handing it to the kernel for sending):

 recv(sock, buffer, length, flags);
 clock_gettime(CLOCK_REALTIME, timespec);

Clearly, a time obtained this way can differ noticeably from the moment the network device actually received the packet. A more accurate time requires support from the operating system, the driver, and/or the network device.

Starting with version 2.6.30, Linux supports the SO_TIMESTAMPING socket option. It lets a user socket receive timestamps for sent and received packets. The timestamps can be taken by the kernel itself, by the driver, or by the network device (see the list of supporting devices and drivers). How it works and how to use it is described in Documentation/networking/timestamping.txt.
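As a rough illustration of what enabling the option looks like from user space, here is a minimal sketch. The function name enable_rx_timestamping is my own, and the choice of flags (software and raw hardware receive timestamps) is an assumption about what a receive-side test would typically request; see timestamping.txt for the full set.

```c
#include <linux/net_tstamp.h>   /* SOF_TIMESTAMPING_* flags */
#include <sys/socket.h>         /* setsockopt, SOL_SOCKET, SO_TIMESTAMPING */

/* Ask the kernel to attach software and raw hardware receive timestamps
 * to packets read from `sock`.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_rx_timestamping(int sock)
{
    int flags = SOF_TIMESTAMPING_RX_SOFTWARE    /* take a timestamp in the kernel */
              | SOF_TIMESTAMPING_SOFTWARE       /* report it to the socket */
              | SOF_TIMESTAMPING_RX_HARDWARE    /* take a timestamp in the NIC */
              | SOF_TIMESTAMPING_RAW_HARDWARE;  /* report it to the socket */

    return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}
```

The option only tells the kernel what to report. Whether a hardware timestamp actually shows up still depends on the driver and the device.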

In this article I will describe how packets travel from the network device to the user, when the timestamps are taken, how they are delivered to the user, and how accurate they are. The kernel code examples are taken from version 4.1.

Initial knowledge

struct sk_buff

All network packets in the kernel are represented by struct sk_buff, which is declared in include/linux/skbuff.h. Let's look at some of its fields:

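 struct sk_buff {
     /* time the packet arrived or was sent */
     ktime_t tstamp;
     /* socket that owns this packet */
     struct sock *sk;
     /* device the packet arrived on or will leave through */
     struct net_device *dev;
     /* L3 protocol */
     __be16 protocol;
     /* header offsets relative to head */
     __u16 transport_header;
     __u16 network_header;
     __u16 mac_header;
     /* sk_buff_data_t is either a pointer or an offset from head
      * tail - end of the data
      * end - end of the buffer */
     sk_buff_data_t tail;
     sk_buff_data_t end;
     /* head - start of the buffer
      * data - start of the data as seen by the current layer;
      * it moves past each header as the packet travels up the stack
      * (for example, past the Ethernet header for IP,
      * past the IP header for TCP/UDP) */
     unsigned char *head, *data;
     /* reference count */
     atomic_t users;
 };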

I will refer to instances of this structure as skb for short.

Together with each struct sk_buff, a buffer is allocated for the headers and payload (the one that skb->head and skb->tail point into), immediately followed by a struct skb_shared_info (see include/linux/skbuff.h).

Several skbs can refer to the same buffer and its struct skb_shared_info. This is convenient for delivering one packet to several users that are allowed to change the struct sk_buff fields but not the data in the buffer.

The fields of struct skb_shared_info that interest us most:

 struct skb_shared_hwtstamps {
     ktime_t hwtstamp;
 };
 /* ... */
 struct skb_shared_info {
     /* transmit-side flags */
     __u8 tx_flags;
     /* timestamp taken by the hardware */
     struct skb_shared_hwtstamps hwtstamps;
 };

The skb_shinfo(skb) macro gives access to struct skb_shared_info, and the skb_hwtstamps(skb) function gives access to its hwtstamps field (see include/linux/skbuff.h).
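To make the sharing concrete, here is a small userspace model. Nothing in it is kernel code: toy_skb, toy_shared_info and the helper functions are hypothetical names, and reference counting is reduced to a plain int. It only shows the layout idea: the shared block sits at the end of the data buffer, a clone copies the small structure and shares the buffer, so the hardware timestamp is common to all clones while the software timestamp is per skb.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, not kernel types. */
struct toy_shared_info {
    int     dataref;      /* how many toy_skb point at this buffer */
    int64_t hwtstamp_ns;  /* hardware timestamp, shared by all of them */
};

struct toy_skb {
    int64_t        tstamp_ns; /* software timestamp, private to each skb */
    unsigned char *head;      /* start of the shared buffer */
    unsigned char *end;       /* end of the data area; shared info sits here */
};

/* Plays the role of skb_shinfo(): the shared block lives at skb->end. */
static struct toy_shared_info *toy_shinfo(const struct toy_skb *skb)
{
    return (struct toy_shared_info *)skb->end;
}

/* Allocate the data area and the shared block in one piece. 0 on success. */
int toy_skb_alloc(struct toy_skb *skb, size_t size)
{
    size = (size + 15) & ~(size_t)15;  /* keep the trailing struct aligned */
    skb->head = malloc(size + sizeof(struct toy_shared_info));
    if (!skb->head)
        return -1;
    skb->end = skb->head + size;
    skb->tstamp_ns = 0;
    toy_shinfo(skb)->dataref = 1;
    toy_shinfo(skb)->hwtstamp_ns = 0;
    return 0;
}

/* Copy the small structure, share the buffer and its shared block. */
struct toy_skb toy_skb_clone(const struct toy_skb *skb)
{
    toy_shinfo(skb)->dataref++;
    return *skb;
}

/* Drop one reference; the buffer is freed with the last one. */
void toy_skb_free(struct toy_skb *skb)
{
    if (--toy_shinfo(skb)->dataref == 0)
        free(skb->head);
    skb->head = skb->end = NULL;
}
```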

struct net_device and struct sock

`struct net_device` is allocated for each network device registered in the system. It holds the device configuration, statistics, and a lot of other data. In addition, when registering a device, the driver stores in this structure pointers to its own functions, which the kernel then calls (see include/linux/netdevice.h).
`struct sock` is allocated for each socket created by the user and is initialized by the address family the user specified (AF_INET, AF_PACKET, ...). It holds pointers to the functions that implement system calls, queues of skbs accepted by the socket but not yet delivered to the user, flags, and more (see include/net/sock.h).

We will not look at these structures in detail. For now it is enough to understand that the first lets the kernel talk to the device, and the second to the user's socket. For a received packet, skb->dev points to the device on which it arrived and skb->sk to the socket to which it will be delivered. For a packet being sent, it is the other way around.

SOFTIRQ

When a processor receives an interrupt, it calls the corresponding handler. The handler runs in interrupt context: on the processor servicing the interrupt, ~~almost~~ all interrupts are disabled, so the handler will not be interrupted until it finishes on its own. The less time the processor spends in interrupt context, the sooner it can service new interrupts and respond to events from other devices.

An interrupt handler may need a lot of CPU time, but most of its work can usually wait. That is why its work is split into a top half and a bottom half. The top half runs in interrupt context, performs the urgent actions, and schedules the bottom half. The kernel runs the bottom half later, outside interrupt context, where it can itself be interrupted by other interrupts.

SOFTIRQ is a kernel mechanism for scheduling a deferred function call, often used to implement bottom halves. The v4.1 kernel has ten different SOFTIRQs (see the list), and their handlers are fixed when the kernel is compiled. While running, a handler can only be interrupted by a hardware interrupt. Each processor keeps its own mask of scheduled SOFTIRQs, so a handler is called on the same processor that scheduled it. (With some effort you can actually choose the processor on which a SOFTIRQ is scheduled.) The same SOFTIRQ can be scheduled and run independently on two different processors. Two SOFTIRQs are used for sending and receiving packets: NET_RX_SOFTIRQ with the net_rx_action handler and NET_TX_SOFTIRQ with the net_tx_action handler (see net_rx_action and net_tx_action).

Scheduled SOFTIRQs are executed by the do_softirq() function (see kernel/softirq.c). It calls the SOFTIRQ handlers in turn, starting with the highest priority (a lower number means a higher priority). It is called after every hardware interrupt handler. In addition, a ksoftirqd kernel thread runs on each processor and periodically calls do_softirq() (how often depends on the CPU load).
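The scheduling mask and the priority order can be modelled in a few lines of userspace C. This is only a sketch of the idea, not the code from kernel/softirq.c: all toy_ names are invented, there is a single mask instead of one per processor, and the two sample handlers just log their number (2 and 3 are the positions of NET_TX_SOFTIRQ and NET_RX_SOFTIRQ in the kernel's list).

```c
#include <stdint.h>

#define TOY_NR_SOFTIRQS 10  /* the article counts ten SOFTIRQs in v4.1 */

typedef void (*toy_softirq_handler)(void);

static toy_softirq_handler toy_vec[TOY_NR_SOFTIRQS]; /* one handler per number */
static uint32_t toy_pending;  /* pending mask; per processor in the real kernel */

/* Sample handlers for the demo: they only record the order of the calls. */
static int toy_log[8], toy_log_len;
static void toy_record(int nr) { if (toy_log_len < 8) toy_log[toy_log_len++] = nr; }
static void toy_net_tx(void) { toy_record(2); }
static void toy_net_rx(void) { toy_record(3); }

/* Register the handler for softirq number `nr`. */
void toy_open_softirq(int nr, toy_softirq_handler h) { toy_vec[nr] = h; }

/* Scheduling a softirq is just setting a bit in the mask. */
void toy_raise_softirq(int nr) { toy_pending |= 1u << nr; }

/* Run every pending handler, lowest number (highest priority) first.
 * Returns how many handlers were run. */
int toy_do_softirq(void)
{
    uint32_t pending = toy_pending;
    int ran = 0;

    toy_pending = 0;  /* handlers may raise softirqs again for a later pass */
    for (int nr = 0; nr < TOY_NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && toy_vec[nr]) {
            toy_vec[nr]();
            ran++;
        }
    }
    return ran;
}
```

Raising number 3 and then number 2 still runs the handler for 2 first, because the order comes from the bit positions and not from the order of scheduling.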

Receiving a packet

To communicate with network devices, Linux uses a mix of interrupts and polling (see NAPI).

Here's what it looks like:

Top half

When a packet is received, the device sends an interrupt to one of the processors.
The interrupt handler in the driver code is called. Its job is to tell the kernel that the device has packets waiting, so it calls `napi_schedule()`, which:
- adds the device to the `poll_list` ^*
- schedules NET_RX_SOFTIRQ
Both operations are very fast: the first adds an element to a doubly linked list, and the second sets a bit in the SOFTIRQ mask. Since the kernel now knows that the device has packets, the driver will most likely want to temporarily disable interrupts on the device.
The interrupt handler exits, and `do_softirq()` is called right after it.

^* poll_list is a list that the kernel keeps for each processor core (as one of the fields of struct softnet_data). It holds the devices from which NET_RX_SOFTIRQ will fetch packets.

Bottom half

do_softirq() is called. After the higher-priority SOFTIRQs have run, the NET_RX_SOFTIRQ handler, net_rx_action(), is called.

This function walks the poll_list and, for each device, calls the virtual function napi->poll (see include/linux/netdevice.h) for as long as the device has packets. That function:

- fetches packets from the device and builds an skb for each
- passes them to the kernel one at a time by calling `netif_receive_skb(skb)`
- if the device has no more packets, re-enables its interrupt and notifies the kernel with `napi_complete()`

Note that net_rx_action() processes at most netdev_budget packets per run (the value is exported as /proc/sys/net/core/netdev_budget) and limits its run time to 2/HZ seconds (on x86, HZ = 1000 by default, so the limit is 2 ms).
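The interplay between the poll function and the budget can be sketched as a userspace model. It is a simplification under several assumptions: the toy_ names are invented, the poll list is a plain array, the 2/HZ time limit is left out, and the per-call limit `weight` is capped so that the total never exceeds the budget.

```c
/* Hypothetical stand-in for a NIC with packets waiting in its receive ring. */
struct toy_napi {
    int queued;       /* packets waiting on the device */
    int irq_enabled;  /* 0 while the device is in the poll list */
};

/* Driver side: handle up to `budget` packets and re-enable the interrupt
 * once the device is drained. Returns the number of packets handled. */
int toy_poll(struct toy_napi *n, int budget)
{
    int done = 0;

    while (done < budget && n->queued > 0) {
        n->queued--;  /* here a driver builds an skb and passes it up */
        done++;
    }
    if (n->queued == 0)
        n->irq_enabled = 1;  /* the napi_complete() step from the text */
    return done;
}

/* Kernel side: one NET_RX_SOFTIRQ run over `count` devices.
 * Returns the total number of packets handled in this run. */
int toy_net_rx_action(struct toy_napi *devs, int count, int netdev_budget, int weight)
{
    int total = 0, progress = 1;

    while (progress && total < netdev_budget) {
        progress = 0;
        for (int i = 0; i < count && total < netdev_budget; i++) {
            if (devs[i].irq_enabled)
                continue;  /* device is not in the poll list */
            int room = netdev_budget - total;
            int done = toy_poll(&devs[i], room < weight ? room : weight);
            total += done;
            progress |= done > 0;
        }
    }
    return total;
}
```

With a budget of 300, a device holding 500 packets is left with 200 of them and its interrupt still disabled, so those packets have to wait for the next NET_RX_SOFTIRQ run.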

netif_receive_skb()

netif_receive_skb() is the function through which a packet passes from the driver into the kernel. (In fact it is just a wrapper around other functions that do all the work.) Let's see what the kernel does with the skb it receives:

 #define net_timestamp_check(COND, SKB)                  \
     if (static_key_false(&netstamp_needed)) {           \
         if ((COND) && !(SKB)->tstamp.tv64)              \
             __net_timestamp(SKB);                       \
     }

 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
     net_timestamp_check(netdev_tstamp_prequeue, skb);
     /* ... */
     return __netif_receive_skb(skb);
 }

We see that the first thing the function does on receiving a packet is invoke the net_timestamp_check macro. Step by step:

- `static_key_false(&netstamp_needed)` checks whether any user has asked for a timestamp to be taken as soon as the kernel receives a packet. It is implemented with static keys, a mechanism for efficiently enabling and disabling rarely used kernel features (see Documentation/static-keys.txt). We will not cover it in detail.
- `netdev_tstamp_prequeue` is a variable exported as `/proc/sys/net/core/netdev_tstamp_prequeue`. The default is 1. If it is set to 0, the timestamp is taken later, in `__netif_receive_skb_core()` (we will come back to it).
- `__net_timestamp(skb)` writes the current time to `skb->tstamp`.

Usually the packet is processed by the processor on which the SOFTIRQ was scheduled, and this is the processor on which the interrupt arrived. Some network cards send only one interrupt to only one processor, not allowing parallelization of packet processing on multiprocessor systems. To solve this problem, Receive Packet Steering (RPS) was invented.

If RPS is enabled in the kernel, netif_receive_skb_internal() can put the packet into the queue (backlog) of another processor and schedule NET_RX_SOFTIRQ there. Some time later, that processor starts handling the packet in __netif_receive_skb(), which calls __netif_receive_skb_core(). Remember the netdev_tstamp_prequeue variable? With RPS, it lets you choose when the timestamp is taken: before the packet is put into the other processor's queue, or after it is taken out of it.

RPS is configured via /sys/class/net/<interface>/queues/ . For more information, see the documentation on redhat.com .

__netif_receive_skb_core()

First of all, this function takes the timestamp if it was not already taken in netif_receive_skb_internal():

  net_timestamp_check(!netdev_tstamp_prequeue, skb);
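The effect of the two complementary checks can be reproduced in a userspace model. Everything here is a hypothetical stand-in (toy_pkt, toy_receive and the two global variables mirror the kernel names but are not kernel code), and the two moments in time are simply passed in as numbers.

```c
#include <stdint.h>

struct toy_pkt { int64_t tstamp; };  /* 0 means "not stamped yet" */

static int toy_netstamp_needed = 1;         /* somebody asked for software timestamps */
static int toy_netdev_tstamp_prequeue = 1;  /* mirrors the default of the sysctl */

/* Same logic as the net_timestamp_check macro, with the time passed in. */
static void toy_timestamp_check(int cond, struct toy_pkt *pkt, int64_t now)
{
    if (toy_netstamp_needed && cond && pkt->tstamp == 0)
        pkt->tstamp = now;
}

/* t_enqueue: time on the processor that took the interrupt.
 * t_dequeue: later time on the processor that finally handles the packet.
 * Returns the software timestamp the packet ends up with. */
int64_t toy_receive(int64_t t_enqueue, int64_t t_dequeue)
{
    struct toy_pkt pkt = { 0 };

    /* first check, as in netif_receive_skb_internal() */
    toy_timestamp_check(toy_netdev_tstamp_prequeue, &pkt, t_enqueue);
    /* ... the packet may wait in another processor's backlog here ... */
    /* second check, as in __netif_receive_skb_core() */
    toy_timestamp_check(!toy_netdev_tstamp_prequeue, &pkt, t_dequeue);
    return pkt.tstamp;
}
```

Exactly one of the two checks can fire for a given setting, so a packet is stamped once, either before or after the queue.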

Now we have an skb that contains:

- a pointer to the data in `skb->data`
- the time the kernel received the packet in `skb->tstamp`
- the time the network device received the packet in `skb_hwtstamps(skb)`

It remains to deliver it to all recipients.

Depending on the L3 protocol and the device they are interested in, recipients can be registered in several places:

- `ptype_all` is the list of those who want to receive all packets from all devices. Usually this is the AF_PACKET socket of some `tcpdump`.
- `ptype_base[PTYPE_HASH_SIZE]` is a hash table keyed by L3 protocol number. Those who want packets of one particular L3 protocol from all devices register here. For example, all AF_INET sockets share a single handler registered here, the `ip_rcv` function.
- `skb->dev->ptype_all` is the same kind of list as `ptype_all`, for those who want all packets from the `skb->dev` device.
- `skb->dev->ptype_specific` lists those who want packets of a particular protocol from the `skb->dev` device.

AF_PACKET sockets are frequent users of all four lists: on the bind system call, each of them registers a handler (named packet_rcv) in the appropriate list. There are in fact many more kinds of recipients, but we will limit ourselves to those that deliver packets to sockets in user space.

A UDP or TCP packet is accepted by the ip_rcv function. Through it, packets enter TCP/IP stack processing, which includes verifying checksums, traversing the iptables tables, looking up the receiving socket by IP address and port number, and stripping the L2, L3, and L4 headers (which must not reach user space).

Once it is clear which socket should receive the skb, the skb is placed in that socket's receive queue. When the user calls recvmsg on the socket, the packet data (what follows the TCP/UDP header) is copied into the user-supplied buffer, and both timestamps arrive in a control message (see man 3 cmsg and man 2 recvmsg) in struct timespec format (see Documentation/networking/timestamping.txt).

Timestamps are placed into the control message by the __sock_recv_timestamp function (see net/socket.c).
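On the user side, the timestamps have to be dug out of the control messages that recvmsg fills in. The sketch below shows one way to do that. The function name extract_timestamps is my own. The layout it assumes (three struct timespec values, with the software timestamp in slot 0 and the raw hardware timestamp in slot 2) is the one described in timestamping.txt, and it assumes a platform where the control message carries plain struct timespec values.

```c
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING  /* the cmsg type equals the option number */
#endif

/* Find the SCM_TIMESTAMPING control message in `msg` (as filled in by
 * recvmsg()) and copy out its three timestamps:
 *   ts[0] software, ts[1] unused, ts[2] raw hardware.
 * Returns 0 if the message was found, -1 otherwise. */
int extract_timestamps(struct msghdr *msg, struct timespec ts[3])
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_TIMESTAMPING &&
            cmsg->cmsg_len >= CMSG_LEN(3 * sizeof(struct timespec))) {
            memcpy(ts, CMSG_DATA(cmsg), 3 * sizeof(struct timespec));
            return 0;
        }
    }
    return -1;
}
```

Before calling recvmsg, the caller has to point msg_control at a buffer of at least CMSG_SPACE(3 * sizeof(struct timespec)) bytes, otherwise the kernel has nowhere to put the control message.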

An important difference: ip_rcv is registered as a handler in ptype_base only once, no matter how many AF_INET sockets you create, whereas packet_rcv is registered once for every AF_PACKET socket, in one of the lists.

Delay Testing

We have seen that a received packet is timestamped twice: by the network card (T_hard) and by the kernel right after it receives the packet (T_soft). To sum up, let's see what contributes to the delays between these moments and the moment the packet is delivered to the user (T_user):

T_soft - T_hard = [waiting for NET_RX_SOFTIRQ] + [kernel processing of other packets (including packets from other devices)]

If a NET_RX_SOFTIRQ run did not get to our packet (because netdev_budget or the 2/HZ time limit was exceeded), the kernel lets other processes run until a hardware interrupt or ksoftirqd calls `do_softirq()` again. Also keep in mind that there are other SOFTIRQs, and running them takes time too.

Even once NET_RX_SOFTIRQ is running, some packets will be processed before ours.
T_user - T_soft = [delivery to other sockets] + [delivery to our socket]
Depending on what those sockets are, delivery can take a varying amount of time.
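Given the three moments as struct timespec values, the two delays are plain differences. A minimal sketch follows; the names timespec_diff_ns, rx_delays and compute_rx_delays are made up for this illustration, and it assumes all three timestamps come from clocks that are synchronized with each other.

```c
#include <stdint.h>
#include <time.h>

/* later - earlier, in nanoseconds. */
int64_t timespec_diff_ns(struct timespec later, struct timespec earlier)
{
    return (int64_t)(later.tv_sec - earlier.tv_sec) * 1000000000LL
         + (later.tv_nsec - earlier.tv_nsec);
}

struct rx_delays {
    int64_t hard_to_soft_ns;  /* T_soft - T_hard */
    int64_t soft_to_user_ns;  /* T_user - T_soft */
};

/* t_hard: hardware timestamp, t_soft: kernel timestamp,
 * t_user: time taken by the application right after recvmsg() returns. */
struct rx_delays compute_rx_delays(struct timespec t_hard,
                                   struct timespec t_soft,
                                   struct timespec t_user)
{
    struct rx_delays d;

    d.hard_to_soft_ns = timespec_diff_ns(t_soft, t_hard);
    d.soft_to_user_ns = timespec_diff_ns(t_user, t_soft);
    return d;
}
```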

In order to estimate roughly how great these delays are and how predictable they are, we use the rxtest program. This program:

- creates a socket (packet or UDP)
- enables software and hardware timestamps for received packets via SO_TIMESTAMPING
- receives a given number of packets from the socket together with their timestamps
- computes the mean and standard deviation of the delays
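For the last step, one common way to get the mean and the spread without storing every sample is Welford's online algorithm. This is not taken from rxtest; it is just a sketch of how such statistics can be accumulated, with invented names. To stay free of libm it returns the sample variance, and the standard deviation is its square root.

```c
/* Running statistics over a stream of samples (Welford's algorithm). */
struct running_stats {
    long   n;     /* number of samples seen so far */
    double mean;  /* running mean */
    double m2;    /* sum of squared deviations from the running mean */
};

/* Fold one more sample into the statistics. */
void stats_add(struct running_stats *s, double x)
{
    double delta = x - s->mean;

    s->n++;
    s->mean += delta / s->n;
    s->m2 += delta * (x - s->mean);
}

/* Sample variance; the standard deviation is its square root. */
double stats_variance(const struct running_stats *s)
{
    return s->n > 1 ? s->m2 / (s->n - 1) : 0.0;
}
```

Start from a zero-initialized struct running_stats and call stats_add once per measured delay.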

The test was conducted on a Core i7 with Linux 4.0 with an Intel 82599ES network card running the ixgbe driver .

In my case the network card has its own hardware clock and takes timestamps from it, and nothing guarantees that this clock is in any way synchronized with the kernel's jiffies. To fix that, run the phc2sys program from linuxptp:

  # synchronize the eth5 clock to the system time
  phc2sys -s CLOCK_REALTIME -c eth5 -m

It must be kept running for the whole test. It keeps adjusting the network card's clock to the system time and prints the current offset. In my case the absolute offset did not exceed 10 ns.

Besides enabling SO_TIMESTAMPING on the socket, we need to ask the network card to record timestamps. For that we use the hwstamp_ctl utility from the same linuxptp.

  hwstamp_ctl -i eth5 -r 13

This makes the network card timestamp all PTP Sync packets. Why PTP Sync? Because our network card cannot timestamp all packets; it has to be told which type of PTP packet to timestamp. (The reason is that hardware timestamping support was introduced in Linux for PTP, which can synchronize time to nanosecond accuracy.)

Start rxtest:

  rxtest packet eth5 1000

Meanwhile, from the other end of the cable we send 10 PTP Sync packets per second. (I took sample PTP packets from https://wiki.wireshark.org/Protocols/ptp and sent them with tcpreplay.)

The output of rxtest:

  hard->soft delay: packets 1000: 18.1603 +- 1.18737 microseconds
  soft->user delay: packets 1000: 5.54756 +- 1.88607 microseconds

Note that no IP address was assigned to the tested interface, so our packets never reached IP protocol processing. This may explain the very small T_user - T_soft delay.

A proper study of the delays and how they depend on various parameters (netdev_budget, packet rate, CPU load, kernel configuration, the number and type of open sockets) would fill a whole article. The purpose of this one is to ~~pour water~~ examine the packet delivery mechanism and what causes the delays.

That's all. I would welcome any feedback and criticism.