
Large traffic flows and interrupt management in Linux

In this post I will describe methods for increasing the performance of a Linux router. The topic became relevant for me when the traffic passing through a single Linux router grew quite high (> 150 Mbit/s, > 50 Kpps). Besides routing, the router also does shaping and acts as a firewall.

For high loads you should use Intel network cards based on the 82575/82576 (Gigabit) or 82598/82599 (10 Gigabit) chipsets, or similar. Their advantage is that they create eight interrupt queues per interface: four for rx and four for tx (the RPS/RFS features that appeared in kernel 2.6.35 can probably achieve something similar for ordinary network cards). These chips also accelerate traffic processing at the hardware level.
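For an ordinary single-queue card on kernel 2.6.35 or later, RPS can be enabled through sysfs. A minimal sketch, assuming a hypothetical eth0 and four CPUs (rps_cpus takes a hexadecimal CPU mask, just like smp_affinity):

  # spread softirq processing of eth0's single rx queue across CPU0-CPU3
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  # optional: enable RFS so that packets of one flow stay on one CPU
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt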
First, look at the contents of /proc/interrupts; this file shows what causes interrupts and which cores are involved in handling them.
  # cat /proc/interrupts 
            CPU0 CPU1 CPU2 CPU3       
   0: 53 1 9 336 IO-APIC-edge timer
   1: 0 0 0 2 IO-APIC-edge i8042
   7: 1 0 0 0 IO-APIC-edge    
   8: 0 0 0 75 IO-APIC-edge rtc0
   9: 0 0 0 0 IO-APIC-fasteoi acpi
  12: 0 0 0 4 IO-APIC-edge i8042
  14: 0 0 0 127 IO-APIC-edge pata_amd
  15: 0 0 0 0 IO-APIC-edge pata_amd
  18: 150 1497 12301 473020 IO-APIC-fasteoi ioc0
  21: 0 0 0 0 IO-APIC-fasteoi sata_nv
  22: 0 0 15 2613 IO-APIC-fasteoi sata_nv, ohci_hcd:usb2
  23: 0 0 0 2 IO-APIC-fasteoi sata_nv, ehci_hcd:usb1
  45: 0 0 0 1 PCI-MSI-edge eth0
  46: 138902469 21349 251748 4223124 PCI-MSI-edge eth0-rx-0
  47: 137306753 19896 260291 4741413 PCI-MSI-edge eth0-rx-1
  48: 2916 137767992 248035 4559088 PCI-MSI-edge eth0-rx-2
  49: 2860 138565213 244363 4627970 PCI-MSI-edge eth0-rx-3
  50: 2584 14822 118410604 3576451 PCI-MSI-edge eth0-tx-0
  51: 2175 15115 118588846 3440065 PCI-MSI-edge eth0-tx-1
  52: 2197 14343 166912 121908883 PCI-MSI-edge eth0-tx-2
  53: 1976 13245 157108 120248855 PCI-MSI-edge eth0-tx-3
  54: 0 0 0 1 PCI-MSI-edge eth1
  55: 3127 19377 122741196 3641483 PCI-MSI-edge eth1-rx-0
  56: 2581 18447 123601063 3865515 PCI-MSI-edge eth1-rx-1
  57: 2470 17277 183535 126715932 PCI-MSI-edge eth1-rx-2
  58: 2543 16913 173988 126962081 PCI-MSI-edge eth1-rx-3
  59: 128433517 11953 148762 4230122 PCI-MSI-edge eth1-tx-0
  60: 127590592 12028 142929 4160472 PCI-MSI-edge eth1-tx-1
  61: 1713 129757168 136431 4134936 PCI-MSI-edge eth1-tx-2
  62: 1854 126685399 122532 3785799 PCI-MSI-edge eth1-tx-3
 NMI: 0 0 0 0 Non-maskable interrupts
 LOC: 418232812 425024243 572346635 662126626 Local timer interrupts
 SPU: 0 0 0 0 Spurious interrupts
 PMI: 0 0 0 0 Performance monitoring interrupts
 PND: 0 0 0 0 Performance pending work
 RES: 94005109 96169918 19305366 4460077 Rescheduling interrupts
 CAL: 49 34 39 29 Function call interrupts
 TLB: 66588 144427 131671 91212 TLB shootdowns
 TRM: 0 0 0 0 Thermal event interrupts
 THR: 0 0 0 0 Threshold APIC interrupts
 MCE: 0 0 0 0 Machine check exceptions
 MCP: 199 199 199 199 Machine check polls
 ERR: 1
 MIS: 0 

In this example, Intel 82576 network cards are used. You can see that network interrupts are distributed evenly across the cores. By default, however, this is not the case: interrupts have to be spread across the processors manually. To do this, run echo N > /proc/irq/X/smp_affinity , where N is the CPU mask (it determines which processor gets the interrupt) and X is the interrupt number shown in the first column of the /proc/interrupts output. To compute the mask for processor cpu_N, raise 2 to the power cpu_N and convert the result to hexadecimal. With bc it is calculated like this: echo "obase=16; $[2 ** $cpu_N]" | bc . In this example the interrupts were distributed as follows (a small helper script for generating these commands is sketched after the listing):
  # CPU0
 echo 1 > /proc/irq/45/smp_affinity
 echo 1 > /proc/irq/54/smp_affinity

 echo 1 > /proc/irq/46/smp_affinity
 echo 1 > /proc/irq/59/smp_affinity
 echo 1 > /proc/irq/47/smp_affinity
 echo 1 > /proc/irq/60/smp_affinity

 # CPU1
 echo 2 > /proc/irq/48/smp_affinity
 echo 2 > /proc/irq/61/smp_affinity
 echo 2 > /proc/irq/49/smp_affinity
 echo 2 > /proc/irq/62/smp_affinity

 # CPU2
 echo 4 > /proc/irq/50/smp_affinity
 echo 4 > /proc/irq/55/smp_affinity
 echo 4 > /proc/irq/51/smp_affinity
 echo 4 > /proc/irq/56/smp_affinity

 # CPU3
 echo 8 > /proc/irq/52/smp_affinity
 echo 8 > /proc/irq/57/smp_affinity
 echo 8 > /proc/irq/53/smp_affinity
 echo 8 > /proc/irq/58/smp_affinity 
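For reference, here is a minimal sketch of a helper script that computes the mask for a given core with bc and writes it to smp_affinity of the listed interrupts; the core/IRQ pairs simply repeat the assignment above and must be adjusted for another machine:

  #!/bin/bash
  # bind_irq CORE IRQ...: compute the hex mask for CORE and write it
  # to smp_affinity of every listed interrupt
  bind_irq() {
      local core=$1; shift
      local mask=$(echo "obase=16; $((2 ** core))" | bc)
      for irq in "$@"; do
          echo "$mask" > "/proc/irq/$irq/smp_affinity"
      done
  }

  bind_irq 0 45 54 46 59 47 60   # CPU0
  bind_irq 1 48 61 49 62         # CPU1
  bind_irq 2 50 55 51 56         # CPU2
  bind_irq 3 52 57 53 58         # CPU3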

Also, if the router has two interfaces, one for inbound and one for outbound traffic (the classic scheme), then rx of one interface should be grouped with tx of the other interface on the same processor core. For example, in this case interrupts 46 (eth0-rx-0) and 59 (eth1-tx-0) were bound to the same core.
Another very important parameter is the delay between interrupts. You can view the current values with ethtool -c ethN (the rx-usecs and tx-usecs parameters). The larger the value, the higher the latency but the lower the CPU load. During peak hours try reducing this value, down to zero.
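For example, a sketch of dropping the delay to the minimum on a hypothetical eth0 (which values the driver actually accepts depends on the hardware):

  # show the current coalescing settings
  ethtool -c eth0
  # minimize the delay: more interrupts and CPU load, but lower latency
  ethtool -C eth0 rx-usecs 0 tx-usecs 0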
When preparing a server with Intel Xeon E5520 (8 cores, each with HyperThreading), I chose the following interrupt distribution scheme:
  # CPU6
 echo 40 > /proc/irq/71/smp_affinity
 echo 40 > /proc/irq/84/smp_affinity

 # CPU7
 echo 80 > /proc/irq/72/smp_affinity
 echo 80 > /proc/irq/85/smp_affinity

 # CPU8
 echo 100 > /proc/irq/73/smp_affinity
 echo 100 > /proc/irq/86/smp_affinity

 # CPU9
 echo 200 > /proc/irq/74/smp_affinity
 echo 200 > /proc/irq/87/smp_affinity

 # CPU10
 echo 400 > /proc/irq/75/smp_affinity
 echo 400 > /proc/irq/80/smp_affinity

 # CPU11
 echo 800 > /proc/irq/76/smp_affinity
 echo 800 > /proc/irq/81/smp_affinity

 # CPU12
 echo 1000 > /proc/irq/77/smp_affinity
 echo 1000 > /proc/irq/82/smp_affinity

 # CPU13
 echo 2000 > /proc/irq/78/smp_affinity
 echo 2000 > /proc/irq/83/smp_affinity

 # CPU14
 echo 4000 > /proc/irq/70/smp_affinity
 # CPU15
 echo 8000 > /proc/irq/79/smp_affinity 

The /proc/interrupts output of this server without load can be found here. I do not include it in the article because of its bulk.

UPD:
If the server works only as a router, then TCP stack tuning does not matter much. However, there are sysctl options that let you increase the size of the ARP cache, which may be relevant. If the ARP cache becomes too small, the message "Neighbour table overflow" will appear in dmesg.
For example:
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096


Description of parameters:
gc_thresh1 - the minimum number of entries to keep in the ARP cache. If the number of entries is below this value, the garbage collector does not touch the ARP cache.
gc_thresh2 - soft limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector starts within 5 seconds.
gc_thresh3 - hard limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector starts immediately.
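A minimal sketch of applying these values at runtime with the standard sysctl utility:

  sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
  sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
  sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
  # to keep them after a reboot, add the same three lines
  # (in "key = value" form) to /etc/sysctl.conf and run: sysctl -p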


Source: https://habr.com/ru/post/108240/

