This article will show you how to capture 10 million packets per second without libraries such as Netmap, PF_RING or DPDK. We will do it on a stock Linux kernel 3.16 and a certain amount of C and C++ code.

First, I would like to say a few words about how pcap, the well-known packet capture method, works. It is used in such popular utilities as iftop, tcpdump and arpwatch. Unfortunately, it puts a very heavy load on the processor.
So, you open the interface and wait for packets from it using the usual approach: bind/recv. The kernel, in turn, receives the data from the network card and stores it in kernel space; then it notices that the user wants to read it in user space and, via the argument of the recv call, gets the address of the buffer where the data should be placed. The kernel dutifully copies the data (for the second time!). That is quite costly, but it is not all of pcap's problems.
In addition, remember that recv is a system call, and we make it for every packet arriving on the interface. System calls are usually fast, but the rates of modern 10GE interfaces (up to 14.6 million packets per second) mean that even a lightweight call becomes very costly for the system simply because of how often it is made.
It is also worth noting that our server usually has more than two logical cores, and the data can arrive at any of them! Yet an application that receives data via pcap uses a single core. This is where kernel-side locks come in and dramatically slow down the capture process: now we are not only copying memory and processing packets, but also waiting for locks held by other cores to be released. Believe me, such locking can often eat up to 90% of the CPU resources of the entire server.
A nice list of problems, isn't it? So let's heroically try to solve them all!
So, for definiteness, let's assume we are working on mirrored ports (meaning that from somewhere outside the network we receive a copy of all the traffic of a specific server). These ports carry the traffic itself: minimum-size SYN flood packets at a rate of 14.6 mpps / 7.6 GE.
Network card: ixgbe, drivers from SourceForge, version 4.1.1, Debian 8 Jessie. Module configuration: modprobe ixgbe RSS=8,8 (this is important!). My processor is an i7 3820 with 8 logical cores. Therefore, wherever I use 8 (including in the code), you should use the number of cores that you actually have.
Distribute interrupts across the available cores
Note that packets arrive at the port with destination MAC addresses that do not match the MAC address of our network card. Otherwise the Linux TCP/IP stack would kick in and the machine would choke on the traffic. This point is very important: right now we are discussing only the capture of other machines' traffic, not the processing of traffic destined for this machine (although my method handles that case with ease too).
Now let's check how much traffic we can accept if we start listening to all the traffic.
Enable promisc mode on the network card:
ifconfig eth6 promisc
After that, htop shows a very unpleasant picture, with one of the cores completely overloaded:
1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%]
2 [0.0%]
3 [0.0%]
4 [0.0%]
5 [0.0%]
6 [0.0%]
7 [0.0%]
8 [0.0%]
To determine the rate on the interface, we use the small script pps.sh:
gist.github.com/pavel-odintsov/bc287860335e872db9a5
The rate on the interface is rather low, about 4 million packets per second:
bash /root/pps.sh eth6
TX eth6: 0 pkts/s RX eth6: 3882721 pkts/s
TX eth6: 0 pkts/s RX eth6: 3745027 pkts/s
To solve this problem and spread the load across all logical cores (I have 8), run the following script:
gist.github.com/pavel-odintsov/9b065f96900da40c5301
It distributes the interrupts of all 8 queues of the network card across all available logical cores.
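If you would rather not use the script, the idea behind it is simple enough to reproduce: pin the interrupt of each NIC queue to its own core by writing a CPU mask into /proc/irq/<irq>/smp_affinity. Below is a minimal sketch of that idea (not the script from the gist); it assumes you pass the IRQ numbers of the eth6 queues on the command line (look them up in /proc/interrupts) and maps queue N to logical core N:

```cpp
// Minimal sketch: pin each passed IRQ to its own logical core by writing a
// hexadecimal CPU mask into /proc/irq/<irq>/smp_affinity. Must be run as root.
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "Usage: %s irq_of_queue_0 irq_of_queue_1 ...\n", argv[0]);
        return 1;
    }

    for (int i = 1; i < argc; ++i) {
        int core = i - 1;                        // queue 0 -> core 0, queue 1 -> core 1, ...
        unsigned long long mask = 1ULL << core;  // one bit per logical core

        std::string path = "/proc/irq/" + std::string(argv[i]) + "/smp_affinity";
        std::ofstream affinity(path);
        if (!affinity) {
            std::fprintf(stderr, "Cannot open %s (run as root)\n", path.c_str());
            return 1;
        }

        // smp_affinity expects a hexadecimal CPU mask without the 0x prefix.
        affinity << std::hex << mask << "\n";
        std::printf("IRQ %s -> core %d (mask %llx)\n", argv[i], core, mask);
    }

    return 0;
}
```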
Great, the rate immediately shot up to 12 mpps (this is not capture yet, just an indication that we can read traffic off the wire at that rate):
bash /root/pps.sh eth6
TX eth6: 0 pkts/s RX eth6: 12528942 pkts/s
TX eth6: 0 pkts/s RX eth6: 12491898 pkts/s
TX eth6: 0 pkts/s RX eth6: 12554312 pkts/s
And the load on the cores evened out:
1 [||||| 7.4%]
2 [||||||| 9.7%]
3 [|||||| 8.9%]
4 [|| 2.8%]
5 [||| 4.1%]
6 [||| 3.9%]
7 [||| 4.1%]
8 [||||| 7.8%]
Let me point out right away that two code examples are used throughout the text; here they are:
AF_PACKET, AF_PACKET + FANOUT:
gist.github.com/pavel-odintsov/c2154f7799325aed46ae
AF_PACKET + RX_RING, AF_PACKET + RX_RING + FANOUT:
gist.github.com/pavel-odintsov/15b7435e484134650f20
These are complete applications with the highest level of optimization. I do not provide the intermediate (obviously slower) versions of the code, but all the switches that control the individual optimizations are declared in the code as bool flags, so you can easily retrace my steps.
First attempt to launch AF_PACKET capture without optimizations
So, we start the application for capturing traffic with AF_PACKET.
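Its core is the very approach described at the beginning: one AF_PACKET socket bound to the interface and one recvfrom() call per packet. Here is a minimal sketch of such an unoptimized receiver (the complete application is in the gist above; the hard-coded interface name eth6 is just for illustration):

```cpp
// Minimal unoptimized AF_PACKET capture: one socket, one system call and one
// kernel-to-user copy per packet. Requires root or CAP_NET_RAW.
#include <cstdio>
#include <cstring>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    // Bind the socket to a single interface so we only see eth6 traffic.
    sockaddr_ll bind_addr;
    memset(&bind_addr, 0, sizeof(bind_addr));
    bind_addr.sll_family   = AF_PACKET;
    bind_addr.sll_protocol = htons(ETH_P_ALL);
    bind_addr.sll_ifindex  = if_nametoindex("eth6");

    if (bind(fd, (sockaddr*)&bind_addr, sizeof(bind_addr)) < 0) {
        perror("bind");
        return 1;
    }

    unsigned long long received = 0;
    char packet[2048];

    while (true) {
        // One system call and one kernel-to-user copy for every single packet.
        int len = recvfrom(fd, packet, sizeof(packet), 0, NULL, NULL);
        if (len > 0) {
            received++; // a real application would count and print pps here
        }
    }

    close(fd);
    return 0;
}
```

The result of running it on the mirrored port: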
We process: 222048 pps
We process: 186315 pps
And the CPU load is through the roof:
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.1%]
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||| 84.1%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||| 79.8%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 88.3%]
5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||| 83.7%]
6 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.7%]
7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 89.8%]
8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 90.9%]
The reason is that the kernel has drowned in locks, on which it spends all of its CPU time:
Samples: 303K of event 'cpu-clock', Event count (approx.): 53015222600
59.57% [kernel] [k] _raw_spin_lock
9.13% [kernel] [k] packet_rcv
7.23% [ixgbe] [k] ixgbe_clean_rx_irq
3.35% [kernel] [k] pvclock_clocksource_read
2.76% [kernel] [k] __netif_receive_skb_core
2.00% [kernel] [k] dev_gro_receive
1.98% [kernel] [k] consume_skb
1.94% [kernel] [k] build_skb
1.42% [kernel] [k] kmem_cache_alloc
1.39% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] inet_gro_receive
0.89% [kernel] [k] __netdev_alloc_frag
0.79% [kernel] [k] tcp_gro_receive
Optimize AF_PACKET capture with FANOUT
So what do we do? Just think a little :) Locks occur when several processors try to use the same resource. In our case this happens because we have one socket, served by one application, which forces the remaining logical processors to wait constantly.
Here a wonderful feature comes to our aid: FANOUT, or, loosely translated, branching. With AF_PACKET we can open several sockets and serve them with several handler processes (the optimal number in our case is, of course, equal to the number of logical cores). In addition, we can specify the algorithm by which data is distributed among these sockets. I chose the PACKET_FANOUT_CPU mode, because in my case the data is spread very evenly across the queues of the network card, and this, in my opinion, is the least resource-intensive balancing option (though I cannot vouch for that; I recommend looking at the kernel code).
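In code this boils down to every worker opening its own AF_PACKET socket and joining the same fanout group via setsockopt. Below is a simplified sketch of the setup; it uses threads instead of the separate processes of my example code, and the group id 1234 is an arbitrary illustrative value:

```cpp
// Sketch of AF_PACKET + FANOUT: several sockets in one fanout group, and the
// kernel spreads packets between them (PACKET_FANOUT_CPU: by receiving CPU).
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

const int fanout_group_id = 1234;  // arbitrary, must be the same for all workers
const int number_of_workers = 8;   // use your own core count here

void capture_thread() {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    sockaddr_ll bind_addr;
    memset(&bind_addr, 0, sizeof(bind_addr));
    bind_addr.sll_family   = AF_PACKET;
    bind_addr.sll_protocol = htons(ETH_P_ALL);
    bind_addr.sll_ifindex  = if_nametoindex("eth6");
    bind(fd, (sockaddr*)&bind_addr, sizeof(bind_addr));

    // Join the fanout group: low 16 bits are the group id, high 16 bits the algorithm.
    int fanout_arg = fanout_group_id | (PACKET_FANOUT_CPU << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg)) < 0) {
        perror("setsockopt PACKET_FANOUT");
        return;
    }

    char packet[2048];
    while (true) {
        recvfrom(fd, packet, sizeof(packet), 0, NULL, NULL); // count/process here
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < number_of_workers; ++i) {
        workers.emplace_back(capture_thread);
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return 0;
}
```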
In the example code, enable the flag bool use_multiple_fanout_processes = true;
And run the application again.
And, lo and behold, a tenfold speedup:
We process: 2250709 pps
We process: 2234301 pps
We process: 2266138 pps
Processors, of course, are still fully loaded:
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 92.6%]
2 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.1%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.2%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.3%]
5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.1%]
6 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.7%]
7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.7%]
8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 93.2%]
But the perf top picture looks completely different, with no more locks:
Samples: 1M of event 'cpu-clock', Event count (approx.): 110166379815
17.22% [ixgbe] [k] ixgbe_clean_rx_irq
7.07% [kernel] [k] pvclock_clocksource_read
6.04% [kernel] [k] __netif_receive_skb_core
4.88% [kernel] [k] build_skb
4.76% [kernel] [k] dev_gro_receive
4.28% [kernel] [k] kmem_cache_free
3.95% [kernel] [k] kmem_cache_alloc
3.04% [kernel] [k] packet_rcv
2.47% [kernel] [k] __netdev_alloc_frag
2.39% [kernel] [k] inet_gro_receive
2.29% [kernel] [k] copy_user_generic_string
2.11% [kernel] [k] tcp_gro_receive
2.03% [kernel] [k] _raw_spin_unlock_irqrestore
In addition, sockets (though I am not sure about AF_PACKET specifically) allow setting the receive buffer size, SO_RCVBUF, but on my test bench it made no difference.
Optimize AF_PACKET capture with RX_RING, a ring buffer
What else can we do, and why is it still slow? The answer is in the build_skb function: it means that two memory copies are still being made inside the kernel!
Now let's try to deal with memory allocation by using RX_RING.
And hooray, the 4 MPPS mark is reached!!!
We process: 3582498 pps
We process: 3757254 pps
We process: 3669876 pps
We process: 3757254 pps
We process: 3815506 pps
We process: 3873758 pps
This speedup was achieved because memory is now copied out of the network card buffer only once, and no extra copy is made when data passes from kernel space to user space. This is provided by a common buffer that is allocated in the kernel and mapped into user space.
The way we work also changes: we no longer have to hang on a call and listen for each packet to arrive (remember, that is pure overhead!). Instead, with the poll call we can wait for a signal that an entire block has been filled, and only then start processing it.
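Here is a minimal sketch of this scheme, put together along the lines of the kernel's packet_mmap documentation (the complete application is in the gist above); the block and frame sizes are illustrative values rather than carefully tuned ones:

```cpp
// Sketch of AF_PACKET + PACKET_RX_RING (TPACKET_V3): the kernel fills blocks of
// an mmap'ed ring directly, we poll() and walk a whole block at a time.
#include <cstdint>
#include <cstring>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    // Switch the socket to the TPACKET_V3 ring format.
    int version = TPACKET_V3;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    // Describe the ring: 64 blocks of 4 MB, a block is retired to user space
    // after at most 60 ms even if it is not full. Illustrative numbers.
    tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 22;
    req.tp_frame_size = 1 << 11;
    req.tp_block_nr   = 64;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = 60;
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    // One shared buffer mapped into our address space: the kernel writes packets here.
    uint8_t* ring = (uint8_t*)mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    sockaddr_ll bind_addr;
    memset(&bind_addr, 0, sizeof(bind_addr));
    bind_addr.sll_family   = AF_PACKET;
    bind_addr.sll_protocol = htons(ETH_P_ALL);
    bind_addr.sll_ifindex  = if_nametoindex("eth6");
    bind(fd, (sockaddr*)&bind_addr, sizeof(bind_addr));

    pollfd pfd = { fd, POLLIN | POLLERR, 0 };
    unsigned current_block = 0;
    unsigned long long received = 0;

    while (true) {
        tpacket_block_desc* block =
            (tpacket_block_desc*)(ring + (size_t)current_block * req.tp_block_size);

        // Wait until the kernel hands the whole block over to user space.
        if (!(block->hdr.bh1.block_status & TP_STATUS_USER)) {
            poll(&pfd, 1, -1);
            continue;
        }

        // Walk every packet in the block without any extra copies.
        tpacket3_hdr* header =
            (tpacket3_hdr*)((uint8_t*)block + block->hdr.bh1.offset_to_first_pkt);
        for (unsigned i = 0; i < block->hdr.bh1.num_pkts; ++i) {
            received++; // packet data starts at (uint8_t*)header + header->tp_mac
            header = (tpacket3_hdr*)((uint8_t*)header + header->tp_next_offset);
        }

        // Return the block to the kernel and move on to the next one.
        block->hdr.bh1.block_status = TP_STATUS_KERNEL;
        current_block = (current_block + 1) % req.tp_block_nr;
    }

    munmap(ring, (size_t)req.tp_block_size * req.tp_block_nr);
    close(fd);
    return 0;
}
```

Note that we go to the kernel only once per block (the poll call), not once per packet.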
Optimize AF_PACKET capture with RX_RING using FANOUT
But we still have problems with locks! How do we beat them? The same old way: turn on FANOUT and allocate a ring of memory for each handler thread!
Samples: 778K of event 'cpu-clock', Event count (approx.): 87039903833
74.26% [kernel] [k] _raw_spin_lock
4.55% [ixgbe] [k] ixgbe_clean_rx_irq
3.18% [kernel] [k] tpacket_rcv
2.50% [kernel] [k] pvclock_clocksource_read
1.78% [kernel] [k] __netif_receive_skb_core
1.55% [kernel] [k] sock_def_readable
1.20% [kernel] [k] build_skb
1.19% [kernel] [k] dev_gro_receive
0.95% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] kmem_cache_alloc
0.60% [kernel] [k] inet_gro_receive
0.57% [kernel] [k] kfree_skb
0.52% [kernel] [k] tcp_gro_receive
0.52% [kernel] [k] __netdev_alloc_frag
So, let's enable FANOUT mode for the RX_RING version!
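Put together, the per-worker setup looks roughly like this (again a condensed sketch with the same illustrative parameters: each thread owns its own socket and its own ring, all sockets join one PACKET_FANOUT_CPU group, and the full tunable version is in the gist above):

```cpp
// Sketch of RX_RING + FANOUT: one TPACKET_V3 ring per worker thread, all
// sockets joined into a single fanout group, so no worker waits on another's locks.
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>

const int fanout_group_id = 1234;
const int number_of_workers = 8; // one per logical core

void worker() {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    int version = TPACKET_V3;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 22;
    req.tp_frame_size = 1 << 11;
    req.tp_block_nr   = 64;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = 60;
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    uint8_t* ring = (uint8_t*)mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex  = if_nametoindex("eth6");
    bind(fd, (sockaddr*)&addr, sizeof(addr));

    // The key addition compared to the plain RX_RING version: join the fanout group.
    int fanout_arg = fanout_group_id | (PACKET_FANOUT_CPU << 16);
    setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg));

    pollfd pfd = { fd, POLLIN | POLLERR, 0 };
    unsigned block_num = 0;
    unsigned long long received = 0;

    while (true) {
        tpacket_block_desc* block =
            (tpacket_block_desc*)(ring + (size_t)block_num * req.tp_block_size);
        if (!(block->hdr.bh1.block_status & TP_STATUS_USER)) {
            poll(&pfd, 1, -1);
            continue;
        }
        received += block->hdr.bh1.num_pkts;             // walk packets here as in the previous sketch
        block->hdr.bh1.block_status = TP_STATUS_KERNEL;  // hand the block back to the kernel
        block_num = (block_num + 1) % req.tp_block_nr;
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < number_of_workers; ++i) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
    return 0;
}
```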
HOORAY! A RECORD!!! 9 MPPS!!!
We process: 9611580 pps
We process: 8912556 pps
We process: 8941682 pps
We process: 8854304 pps
We process: 8912556 pps
We process: 8941682 pps
We process: 8883430 pps
We process: 8825178 pps
perf top:
Samples: 224K of event 'cpu-clock', Event count (approx.): 42501395417
21.79% [ixgbe] [k] ixgbe_clean_rx_irq
9.96% [kernel] [k] tpacket_rcv
6.58% [kernel] [k] pvclock_clocksource_read
5.88% [kernel] [k] __netif_receive_skb_core
4.99% [kernel] [k] memcpy
4.91% [kernel] [k] dev_gro_receive
4.55% [kernel] [k] build_skb
3.10% [kernel] [k] kmem_cache_alloc
3.09% [kernel] [k] kmem_cache_free
2.63% [kernel] [k] prb_fill_curr_block.isra.57
By the way, in all fairness, updating to the 4.0.0 kernel branch did not bring any particular speedup. The rate stayed in the same range, but the load on the cores dropped significantly!
1 [||||||||||||||||||||||||||||||||||||| 55.1%]
2 [|||||||||||||||||||||||||||||||||||| 52.5%]
3 [||||||||||||||||||||||||||||||||||||||||||| 62.5%]
4 [||||||||||||||||||||||||||||||||||||||||||| 62.5%]
5 [||||||||||||||||||||||||||||||||||||||| 57.7%]
6 [||||||||||||||||||||||||||||||||| 47.7%]
7 [||||||||||||||||||||||||||||||||||||||| 55.9%]
8 [||||||||||||||||||||||||||||||||||||||||| 61.4%]
In conclusion, I would like to add that Linux is simply a terrific platform for traffic analysis, even in environments where no specialized kernel module can be built. This is very, very encouraging. There is hope that in the nearest kernel versions it will become possible to process 10GE at a full wire speed of 14.6 million packets per second on an 1800 MHz processor :)
Recommended reading materials:
www.kernel.org/doc/Documentation/networking/packet_mmap.txt
man7.org/linux/man-pages/man7/packet.7.html