Sometimes an application needs to know the exact time at which a network packet was received or sent, for example to synchronize clocks (see PTP, NTP) or to measure network delays (see RFC 2544).
A naive solution is to record the time in the application immediately after receiving a packet from the kernel (or just before handing it to the kernel for sending):
recv(sock, buffer, length, flags);
clock_gettime(CLOCK_REALTIME, &ts);
It is clear that the time obtained this way may differ markedly from the time at which the network device received the packet. A more accurate time requires support from the operating system, the driver, and/or the network device.
Starting with version 2.6.30, Linux supports the SO_TIMESTAMPING socket option. It lets a user socket receive timestamps for sent and received packets. Timestamps can be taken by the kernel itself, the driver, or the network device (see the list of supporting devices and drivers). How it works and how to use it are described in Documentation/networking/timestamping.txt.
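To make the option more concrete, here is a minimal sketch of how a socket might ask for receive timestamps. The helper names rx_timestamping_flags() and enable_rx_timestamping() are mine and purely illustrative; the SOF_TIMESTAMPING_* flags are the ones described in timestamping.txt. Whether a hardware timestamp is actually delivered still depends on the driver and the device.

```c
#include <linux/net_tstamp.h>  /* SOF_TIMESTAMPING_* flags */
#include <sys/socket.h>        /* setsockopt, SOL_SOCKET, SO_TIMESTAMPING */

/* Hypothetical helper: the set of flags for receive-side timestamps. */
int rx_timestamping_flags(void)
{
    return SOF_TIMESTAMPING_RX_SOFTWARE    /* kernel stamps received packets */
         | SOF_TIMESTAMPING_SOFTWARE       /* report the software timestamp */
         | SOF_TIMESTAMPING_RX_HARDWARE    /* device stamps received packets */
         | SOF_TIMESTAMPING_RAW_HARDWARE;  /* report the hardware timestamp */
}

/* Hypothetical helper: apply the flags to a socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_rx_timestamping(int sock)
{
    int flags = rx_timestamping_flags();

    return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
                      &flags, sizeof(flags));
}
```

The request is per socket, so other sockets on the same machine are not affected by it.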
In this article, I will discuss how packets are delivered from a network device to the user, when timestamps are taken, how they reach the user, and how accurate they are. The kernel code examples are taken from version 4.1.
All network packets in the kernel are represented by a struct sk_buff structure, which is declared in include/linux/skbuff.h. Consider some of its fields:
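struct sk_buff {
    /* Time the packet was received or sent, if timestamping is enabled */
    ktime_t tstamp;
    /* Socket that owns the packet */
    struct sock *sk;
    /* Device the packet arrived on or is leaving by */
    struct net_device *dev;
    /* L3 protocol */
    __be16 protocol;
    /* Header offsets relative to head */
    __u16 transport_header;
    __u16 network_header;
    __u16 mac_header;
    /* sk_buff_data_t is a pointer or an offset from head:
     * tail - end of the data
     * end  - end of the buffer */
    sk_buff_data_t tail;
    sk_buff_data_t end;
    /* head - start of the buffer
     * data - start of the current layer's payload. It moves as the
     * packet passes through the stack
     * (for example: for IP the payload is TCP/UDP,
     * for Ethernet it is IP) */
    unsigned char *head, *data;
    /* Reference count */
    atomic_t users;
};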
For brevity, I will call instances of this structure skb.
Together with each struct sk_buff, a buffer is allocated for the headers and payload (the one that skb->head and skb->tail point into), immediately followed by a struct skb_shared_info (see include/linux/skbuff.h).
Several different skb can refer to the same buffer and the corresponding struct skb_shared_info. This is convenient for delivering one skb to several users that are allowed to change the struct sk_buff fields but not the data in the buffer.
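The sharing can be illustrated with a small userspace model. This is a toy, not kernel code: struct toy_skb, struct toy_shared_info and the toy_* functions are invented for illustration and only mimic the idea that a clone copies the descriptor while the buffer and the shared info at its end stay common.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_shared_info {
    int dataref;             /* how many toy_skb share this buffer */
    long long hwtstamp_ns;   /* stands in for hwtstamps.hwtstamp */
};

struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    long long tstamp_ns;     /* per-descriptor field: each clone has its own */
};

/* The shared info sits right where the data buffer ends. */
struct toy_shared_info *toy_shinfo(const struct toy_skb *skb)
{
    return (struct toy_shared_info *)skb->end;
}

struct toy_skb *toy_alloc(size_t size)
{
    size_t aligned = (size + 7) & ~(size_t)7;  /* keep shared info aligned */
    struct toy_skb *skb = calloc(1, sizeof(*skb));
    unsigned char *buf = calloc(1, aligned + sizeof(struct toy_shared_info));

    if (!skb || !buf) {
        free(skb);
        free(buf);
        return NULL;
    }
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + aligned;
    toy_shinfo(skb)->dataref = 1;
    return skb;
}

/* Copy only the descriptor; the buffer gains one more user. */
struct toy_skb *toy_clone(const struct toy_skb *orig)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    *skb = *orig;
    toy_shinfo(skb)->dataref++;
    return skb;
}

/* Drop one descriptor; the buffer goes away with its last user. */
void toy_free(struct toy_skb *skb)
{
    assert(toy_shinfo(skb)->dataref > 0);
    if (--toy_shinfo(skb)->dataref == 0)
        free(skb->head);
    free(skb);
}
```

In this model a clone can change its own tstamp_ns without the original noticing, while a value written into the shared info is visible through both descriptors.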
The fields of struct skb_shared_info that interest us most are:
struct skb_shared_hwtstamps {
    ktime_t hwtstamp;
};
/* ... */
struct skb_shared_info {
    /* Transmit timestamping flags */
    __u8 tx_flags;
    /* Hardware timestamp, filled in by the driver */
    struct skb_shared_hwtstamps hwtstamps;
};
The skb_shinfo(skb) macro is used to access struct skb_shared_info, and the skb_hwtstamps(skb) function to access its hwtstamps field (see include/linux/skbuff.h).
We will not consider struct net_device and struct sock in detail. For now it is enough to understand that the first lets the kernel communicate with the device, and the second with the user socket. For a received packet, skb->dev points to the device on which it was received and skb->sk to the socket to which it will be delivered. For a sent packet it is the other way around.
When the processor receives an interrupt, it calls the corresponding handler. The handler runs in interrupt context: on the processor servicing the interrupt, almost all interrupts are disabled, so the handler will not be interrupted until it finishes. The less time the processor spends in interrupt context, the sooner it can service new interrupts and respond to events from other devices.
Although an interrupt handler may need a lot of CPU time, most of its work can usually wait. That is why the work of an interrupt handler is split into an upper half (Top Half) and a lower half (Bottom Half). The Top Half runs in interrupt context, performs the urgent actions, and schedules the Bottom Half. The kernel runs the Bottom Half later, outside interrupt context, where it may itself be interrupted by other interrupts.
SOFTIRQ is a kernel mechanism for scheduling a deferred function call, often used to implement the Bottom Half. The v4.1 kernel has ten different SOFTIRQs in total (see the list); their handlers are fixed when the kernel is compiled. While running, a handler can be interrupted only by a hardware interrupt. Each processor has its own mask of scheduled SOFTIRQs, so a handler is called on the same processor on which it was scheduled. (In fact, with some effort you can choose the processor on which a SOFTIRQ is scheduled.) The same SOFTIRQ can be scheduled and executed independently on two different processors. Two of them are used for sending and receiving packets: NET_RX_SOFTIRQ with the net_rx_action handler and NET_TX_SOFTIRQ with the net_tx_action handler (see net_rx_action and net_tx_action).
Scheduled SOFTIRQs are executed by the do_softirq() function (see kernel/softirq.c). It calls the SOFTIRQ handlers in turn, starting with the highest priority (a lower number means a higher priority). It is called after each hardware interrupt handler. In addition, a ksoftirqd kernel thread runs on each processor and periodically calls do_softirq() (how often depends on the CPU load).
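The per-processor mask and the priority order can be sketched in userspace. Everything named toy_* below is invented for illustration; only the relative order of the first four entries (HI, TIMER, NET_TX, NET_RX) follows the kernel's list. The real do_softirq() also handles restarts, preemption and handing work to ksoftirqd, none of which is modeled here.

```c
#include <assert.h>

/* First four entries, in the kernel's order; the rest are omitted. */
enum { TOY_HI, TOY_TIMER, TOY_NET_TX, TOY_NET_RX, TOY_NR_SOFTIRQS };

#define TOY_NR_CPUS   2
#define TOY_TRACE_MAX 8

typedef void (*toy_handler_t)(int cpu);

toy_handler_t toy_vec[TOY_NR_SOFTIRQS];  /* one handler per softirq */
unsigned int toy_pending[TOY_NR_CPUS];   /* one pending mask per CPU */

int toy_trace[TOY_TRACE_MAX];            /* order in which handlers ran */
int toy_trace_len;

void toy_open_softirq(int nr, toy_handler_t h)
{
    assert(nr >= 0 && nr < TOY_NR_SOFTIRQS);
    toy_vec[nr] = h;
}

/* Mark softirq `nr` as pending on one particular CPU. */
void toy_raise_softirq(int cpu, int nr)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    assert(nr >= 0 && nr < TOY_NR_SOFTIRQS);
    toy_pending[cpu] |= 1u << nr;
}

/* Run everything pending on this CPU, lowest number first.
 * Returns how many handlers were called. */
int toy_do_softirq(int cpu)
{
    unsigned int pending = toy_pending[cpu];
    int nr, ran = 0;

    toy_pending[cpu] = 0;  /* handlers may raise softirqs again */
    for (nr = 0; nr < TOY_NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && toy_vec[nr]) {
            toy_vec[nr](cpu);
            ran++;
        }
    }
    return ran;
}

/* Example handlers that only record that they ran. */
void toy_net_tx_action(int cpu)
{
    (void)cpu;
    if (toy_trace_len < TOY_TRACE_MAX)
        toy_trace[toy_trace_len++] = TOY_NET_TX;
}

void toy_net_rx_action(int cpu)
{
    (void)cpu;
    if (toy_trace_len < TOY_TRACE_MAX)
        toy_trace[toy_trace_len++] = TOY_NET_RX;
}
```

If both network softirqs are raised on the same CPU, the model runs the TX handler before the RX handler regardless of the order in which they were raised, and a softirq raised on CPU 1 is not run by CPU 0.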
To communicate with network devices, Linux uses a mixture of interrupts and polling (see NAPI).
Here is what it looks like:
* poll_list is a list that the kernel keeps for each processor core (as one of the fields of struct softnet_data). It holds the devices from which NET_RX_SOFTIRQ will receive packets.
Suppose do_softirq() has been called. After the higher-priority SOFTIRQs have run, the NET_RX_SOFTIRQ handler, net_rx_action(), is called.
This function traverses the poll_list and, for each device dev, as long as there are packets on it, calls the virtual function napi->poll (see include/linux/netdevice.h), which the driver implements to pass the received packets on to the kernel.
Note that net_rx_action() processes no more than netdev_budget (exported as /proc/sys/net/core/netdev_budget) packets at a time and limits its execution time to 2/HZ seconds (on x86 the default is HZ = 1000, i.e. a time limit of 2 ms).
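The interplay of the packet budget and the time limit can be sketched as a loop. This is a simplified model with invented names (toy_napi, toy_poll, toy_net_rx_action), not the kernel function. I assume a per-device quota of 64 packets per poll call, and I count every poll call as one tick of a simulated clock, whereas the kernel compares real jiffies.

```c
#define TOY_WEIGHT 64  /* assumed per-device quota for one poll call */

struct toy_napi {
    int pending;       /* packets waiting in the device's receive ring */
};

/* Stand-in for napi->poll: take up to `quota` packets from the device. */
int toy_poll(struct toy_napi *n, int quota)
{
    int work = n->pending < quota ? n->pending : quota;

    n->pending -= work;
    return work;
}

/* Visit the devices round-robin until all are drained, the packet budget
 * is spent, or `time_limit` simulated ticks have passed.
 * Returns the number of packets processed. */
int toy_net_rx_action(struct toy_napi *devs, int ndev,
                      int budget, int time_limit)
{
    int processed = 0, ticks = 0, more = 1;

    while (more) {
        more = 0;
        for (int i = 0; i < ndev; i++) {
            if (devs[i].pending == 0)
                continue;

            int work = toy_poll(&devs[i], TOY_WEIGHT);

            processed += work;
            budget -= work;
            ticks++;  /* assumption: one poll call costs one tick */
            if (budget <= 0 || ticks >= time_limit)
                return processed;  /* leave the rest for the next run */
            if (devs[i].pending > 0)
                more = 1;          /* device needs another visit */
        }
    }
    return processed;
}
```

Because the budget is checked only after a poll call returns, the last call in this model can take the total slightly past the budget, by less than one quota.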
netif_receive_skb() is the function through which a packet passes from the driver to the kernel. (In fact, it is just a wrapper around other functions that do all the work.) Let's see what the kernel does with the received skb:
#define net_timestamp_check(COND, SKB)              \
    if (static_key_false(&netstamp_needed)) {       \
        if ((COND) && !(SKB)->tstamp.tv64)          \
            __net_timestamp(SKB);                   \
    }                                               \

static int netif_receive_skb_internal(struct sk_buff *skb)
{
    net_timestamp_check(netdev_tstamp_prequeue, skb);
    /* ... */
    return __netif_receive_skb(skb);
}
We see that the first thing the function does after receiving a packet is invoke the net_timestamp_check macro. Let's go through it in order.
Usually a packet is processed by the processor on which the SOFTIRQ was scheduled, which is the processor on which the interrupt arrived. Some network cards raise their interrupt on only one processor, which prevents packet processing from being parallelized on multiprocessor systems. Receive Packet Steering (RPS) was invented to solve this problem.
If RPS is enabled in the kernel, netif_receive_skb_internal() can put a packet into the queue (backlog) of another processor and schedule NET_RX_SOFTIRQ on it. Some time later, that processor will start processing the packet with __netif_receive_skb(), which calls __netif_receive_skb_core(). Remember the netdev_tstamp_prequeue variable? With RPS, it lets you choose when the timestamp is taken: before the packet is put on the other processor's queue or after it is retrieved from there.
RPS is configured via /sys/class/net/<interface>/queues/. For more information, see the documentation on redhat.com.
First of all, __netif_receive_skb_core() takes the timestamp if it was not already taken in netif_receive_skb_internal():
net_timestamp_check(!netdev_tstamp_prequeue, skb);
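The effect of the two call sites can be modeled in a few lines. The names below (toy_pkt, toy_netstamp_needed, toy_tstamp_prequeue, toy_now_ns) are stand-ins I made up; the condition mirrors the macro: stamp only if someone needs timestamps, the call site is the selected one, and the packet has no timestamp yet.

```c
#include <stdbool.h>

struct toy_pkt {
    long long tstamp_ns;           /* 0 means "not stamped yet" */
};

bool toy_netstamp_needed;          /* some socket asked for timestamps */
bool toy_tstamp_prequeue = true;   /* stands in for netdev_tstamp_prequeue */
long long toy_now_ns;              /* simulated clock */

/* Same three conditions as in the net_timestamp_check macro. */
void toy_timestamp_check(bool cond, struct toy_pkt *p)
{
    if (toy_netstamp_needed && cond && p->tstamp_ns == 0)
        p->tstamp_ns = toy_now_ns;
}

/* First call site: before the packet may be queued to another CPU. */
void toy_receive_internal(struct toy_pkt *p)
{
    toy_timestamp_check(toy_tstamp_prequeue, p);
}

/* Second call site: after the packet has been taken off the queue. */
void toy_receive_core(struct toy_pkt *p)
{
    toy_timestamp_check(!toy_tstamp_prequeue, p);
}
```

Whatever the value of the prequeue switch, exactly one of the two call sites stamps the packet, and a timestamp that is already set is never overwritten.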
Now we have skb, which contains:
It remains to deliver it to all recipients.
Depending on the required L3 protocol and device, recipients can be registered in several places:
All 4 lists are frequently used by AF_PACKET sockets: each of them registers a handler (called packet_rcv) in the corresponding list on the bind system call. In fact, there are many more recipients, but we will limit ourselves to those that deliver packets to sockets in user space.
A UDP or TCP packet will be accepted by the ip_rcv function. Through this function, packets enter TCP/IP stack processing, which includes verifying checksums, traversing the iptables tables, looking up the receiving socket by IP address and port number, and stripping the L2, L3, and L4 headers (which should not reach userspace).
Once it is clear which socket should receive the skb, the skb is placed in that socket's receive queue. When the user calls recvmsg on the socket, the packet data (what follows the TCP/UDP header) is copied to the user-supplied buffer, and both timestamps are returned in a control message (see man 3 cmsg and man 2 recvmsg) in struct timespec format (see Documentation/networking/timestamping.txt).
The timestamps are put into the control message by the __sock_recv_timestamp function (see net/socket.c).
An important difference: ip_rcv is registered as a handler in ptype_base only once, no matter how many AF_INET sockets you create, whereas packet_rcv is registered once for each AF_PACKET socket, in one of the lists.
We have seen that timestamps are taken twice for received packets: by the network card (T_hard) and by the kernel immediately upon receipt (T_soft). To sum up, let's see what causes the delays between them and the moment the packet is delivered to the user (T_user):
In order to estimate roughly how great these delays are and how predictable they are, we use the rxtest program. This program:
The test was run on a Core i7 machine running Linux 4.0 with an Intel 82599ES network card using the ixgbe driver.
In my case, the network card has its own hardware clock and takes timestamps from it. There is no guarantee that this clock is synchronized in any way with the kernel's own clock. To fix this, run the phc2sys program from linuxptp:
# synchronize the eth5 clock to the system clock
phc2sys -s CLOCK_REALTIME -c eth5 -m
It must be kept running throughout the test. It adjusts the network card's clock to the system time and prints the current offset. In my case, the absolute offset did not exceed 10 ns.
In addition to setting SO_TIMESTAMPING on the socket, we need to ask the network card to record timestamps. For this we use the hwstamp_ctl utility from the same linuxptp package.
hwstamp_ctl -i eth5 -r 13
This makes the network card take timestamps for all PTP Sync packets. Why PTP Sync? Because our network card cannot timestamp all packets; it must be told which type of PTP packet to stamp. (This is because hardware timestamp support was introduced in Linux for the PTP protocol, which allows clocks to be synchronized to nanosecond accuracy.)
Start rxtest:
rxtest packet eth5 1000
Meanwhile, from the other end of the cable we send 10 PTP Sync packets per second. (I took sample PTP packets from https://wiki.wireshark.org/Protocols/ptp and sent them using tcpreplay.)
The output of rxtest:
hard->soft delay: packets 1000: 18.1603 +- 1.18737 microseconds
soft->user delay: packets 1000: 5.54756 +- 1.88607 microseconds
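Numbers of this kind can be accumulated without storing every sample. The sketch below is not taken from rxtest; it is one possible way to do it, assuming the value after "+-" is a standard deviation. The names timespec_diff_ns, delay_stats and the delay_stats_* functions are invented for illustration.

```c
#include <time.h>

/* Difference `later - earlier` in nanoseconds. */
long long timespec_diff_ns(struct timespec later, struct timespec earlier)
{
    return (long long)(later.tv_sec - earlier.tv_sec) * 1000000000LL
         + (later.tv_nsec - earlier.tv_nsec);
}

/* Running mean and variance (Welford's method). */
struct delay_stats {
    long n;       /* number of samples so far */
    double mean;  /* running mean */
    double m2;    /* sum of squared deviations from the mean */
};

void delay_stats_add(struct delay_stats *s, double x)
{
    double d = x - s->mean;

    s->n++;
    s->mean += d / s->n;
    s->m2 += d * (x - s->mean);
}

/* Sample variance; its square root is the standard deviation. */
double delay_stats_variance(const struct delay_stats *s)
{
    return s->n > 1 ? s->m2 / (s->n - 1) : 0.0;
}
```

For each packet one would feed timespec_diff_ns(T_soft, T_hard) into one accumulator and timespec_diff_ns(T_user, T_soft) into another, then print the mean and the square root of the variance (sqrt from math.h, linked with -lm).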
During the test, no IP address was assigned to the interface under test, so our packets were not processed by the IP stack. This may explain the very small T_user - T_soft delay.
A proper study of these delays and their dependence on various parameters (netdev_budget, rate of received packets, CPU load, kernel configuration, number and type of open sockets) would fill a whole article. The purpose of this article is to examine the packet delivery mechanism and what causes the delays.
That's all. I welcome any feedback and criticism.
http://www.linuxfoundation.org/collaborate/workgroups/networking
Source: https://habr.com/ru/post/304644/