
Increasing network device performance: fast path technology in Marvell Kirkwood processors



High-speed data networks surround us everywhere: at the office computer, in telephone calls, digital TV, ATMs, and everywhere else digital information has to be transmitted. And the greater the volume of this information and the number of its consumers, the more stringent the requirements on speed and throughput become.

In developing telecom and datacom devices, we constantly face the task of ensuring high data transfer rates, and in this article we discuss how to solve it. As an example, we will analyze how fast path technology works in processors of the Marvell Kirkwood line, measure network performance, and show how to improve the throughput of traffic-routing devices.
We invite engineers and programmers to read on: everyone who designs hardware and develops software for network routers. Our recipes can be used both in the SOHO sector (small office / home office) and in the Enterprise segment (development of high-performance network devices).

In local and wide area computer networks, the de facto standard is data transmission over Ethernet using the TCP/IP protocol family. These protocols support various topologies, with large networks divided into subnets connected by routers. The simplest network layout is shown below:



When a flow of data is transmitted from computer A to computer B, traffic in the form of packets arrives at the router's eth0 interface, from which each packet is passed to the operating system. There it travels through the layers of the TCP/IP protocol stack and is decoded to determine its further route. Having extracted the destination address and found the matching forwarding rule, the operating system re-packages the packet according to the protocols in use and sends it out through the eth1 interface. Most of the packet remains unchanged; only some of its header fields are modified. The faster a packet passes through all these stages, the more bandwidth the router can provide. And while router performance was not a pressing problem in the days of 100 Mbps networks, the arrival of gigabit speeds created a need for more efficient equipment.

It is easy to see that such full processing is redundant for most packets of known types. By identifying packets not destined for the device itself and redirecting them at an early stage, the processing time for transit traffic can be reduced significantly. This processing is usually performed before the packet reaches the operating system, which reduces latency. Because the packet's path is minimized, the technique is called fast path. Since this acceleration method hooks into the low-level part of the network stack and exchanges information with the network driver, the specific implementation of fast path technology depends on the hardware in use.

Marvell Kirkwood processors


Marvell Kirkwood is a family of systems-on-chip built around the ARMv5TE-compatible Sheeva core. These processors are designed specifically for network devices such as routers, access points, set-top boxes, network-attached storage, media servers, and plug computers.

The Kirkwood line consists of single- and dual-core processors with an extensive set of peripherals. Operating frequencies range from 600 MHz to 2 GHz, and the entire line carries 256 KB of L2 cache on board. The higher-end dual-core models also boast an FPU.

The main characteristics of Marvell Kirkwood processors are presented in the table:

| CPU     | Clock frequency | Cores | Memory interface             | Ethernet ports | PCI-Express | USB 2.0 | SATA 2.0 |
|---------|-----------------|-------|------------------------------|----------------|-------------|---------|----------|
| 88F6321 | 800 MHz         | 2     | 32/40-bit DDR2 up to 800 MHz | 2 GE           | 1           | 1       | 0        |
| 88F6322 | 800 MHz         | 2     | 32/40-bit DDR2 up to 800 MHz | 2 GE           | 2           | 2       | 0        |
| 88F6323 | 1.0 GHz         | 2     | 32/40-bit DDR2 up to 800 MHz | 3 GE           | 2           | 3       | 1        |
| 88F6282 | 1.6-2.0 GHz     | 1     | 16-bit DDR2/3 up to 1066 MHz | 2 GE           | 2           | 1       | 2        |
| 88F6283 | 600 MHz         | 1     | 16-bit DDR2/3 up to 1066 MHz | 2 GE           | 2           | 1       | 2        |
| 88F6281 | 1.0-1.2 GHz     | 1     | 16-bit DDR2 up to 800 MHz    | 2 GE           | 1           | 1       | 2        |
| 88F6280 | 1.0 GHz         | 1     | 16-bit DDR2 up to 400 MHz    | 1 GE           | 0           | 1       | 0        |
| 88F6192 | 800 MHz         | 1     | 16-bit DDR2 up to 400 MHz    | 2 GE           | 1           | 1       | 2        |
| 88F6190 | 600 MHz         | 1     | 16-bit DDR2 up to 600 MHz    | 1 FE + 1 GE    | 1           | 1       | 1        |
| 88F6180 | 600-800 MHz     | 1     | 16-bit DDR2 up to 600 MHz    | 1 GE           | 1           | 1       | 0        |

Network Fast Processing component


Since the Kirkwood line is aimed at traffic-forwarding devices, Marvell also faced the need to implement fast path technology for them. To solve this problem, the Network Fast Processing component, or NFP for short, was added to the HAL part of the platform support driver in the Linux 2.6.31.8 kernel.

The relationship between Marvell NFP and the rest of the Linux operating system can be represented as follows:



NFP is implemented as a “layer” between the gigabit interface driver and the operating system's network stack. In short, the basic principle of accelerating traffic is to screen incoming routed packets and send them out through the required interface, bypassing the OS. Packets that are destined for a local interface, or that cannot be handled within the fast path, are passed to the Linux kernel for processing by standard means.
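
As a minimal sketch of this layering (nfp_receive() and its return convention are illustrative names of ours, not Marvell's actual HAL API), a driver's receive handler offers each frame to the fast path first and falls back to the kernel stack only when the fast path refuses it:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/etherdevice.h>

    /* Hypothetical fast-path entry point: returns nonzero if the frame was
     * forwarded (and the skb consumed) without involving the OS stack. */
    int nfp_receive(struct net_device *dev, struct sk_buff *skb);

    static void rx_handler(struct net_device *dev, struct sk_buff *skb)
    {
        /* Fast path first: transit packets with a known rule are rewritten
         * and queued on the egress interface right here. */
        if (nfp_receive(dev, skb))
            return;

        /* Slow path: local, fragmented, ICMP and "first" packets go up
         * through the regular Linux network stack. */
        skb->protocol = eth_type_trans(skb, dev);
        netif_receive_skb(skb);
    }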

The fast path implemented by Marvell does not process every possible packet format, only the most common protocols up to the transport layer of the ISO/OSI model. The chain of supported protocols can be represented as follows:

 Ethernet (802.3) → [ VLAN (802.1Q) ] → [ PPPoE ] → IPv4 → [ IPsec ] → TCP/UDP 

Support for higher-level protocols is unnecessary, since that information is not used to route traffic. Transport-layer headers are analyzed because NAT needs them.
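
To make "up to the transport layer" concrete, here is a simplified, self-contained sketch of a parser for this chain (our illustration, not Marvell's code; validation is reduced to one crude length check):

    #include <stdint.h>
    #include <stddef.h>

    /* Walk Ethernet -> [802.1Q VLAN] -> [PPPoE] -> IPv4 and return a pointer
     * to the TCP/UDP header, or NULL to send the frame to the slow path. */
    static const uint8_t *find_l4(const uint8_t *f, size_t len, uint8_t *proto)
    {
        const uint8_t *p = f + 14;            /* past dst/src MAC + EtherType */
        uint16_t type = (f[12] << 8) | f[13];

        if (len < 54)                         /* eth + min IPv4 + min TCP     */
            return NULL;

        if (type == 0x8100) {                 /* 802.1Q tag: skip 4 bytes     */
            type = (p[2] << 8) | p[3];
            p += 4;
        }
        if (type == 0x8864) {                 /* PPPoE session stage          */
            uint16_t ppp = (p[6] << 8) | p[7];
            if (ppp != 0x0021)                /* 0x0021 = IPv4 carried in PPP */
                return NULL;
            p += 8;                           /* 6-byte PPPoE + 2-byte PPP    */
            type = 0x0800;
        }
        if (type != 0x0800)                   /* not IPv4: slow path          */
            return NULL;

        *proto = p[9];                        /* IPv4 protocol field          */
        return p + (p[0] & 0x0f) * 4;         /* skip IHL * 4 header bytes    */
    }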

Thanks to its modular structure, the set of components used can be configured when the Linux kernel is compiled. The following optional parts can be distinguished:


FDB (forwarding database) - a traffic-forwarding database located in the Linux kernel. Unlike the routing table, the FDB is optimized for fast record lookup. Marvell's fast path implementation uses its own local rule table, ruleDb, whose entries are added and deleted by the OS network stack (the corresponding changes have been made to the stack code).

For fast lookup, ruleDb is a hash table of key-value pairs: the value is the forwarding rule for a specific destination address, and the key used to index that rule is generated from the source and destination addresses by a special hash function. A well-constructed hash function maximizes the chance that each index corresponds to exactly one rule.

Since the FDB (and therefore ruleDb) is initially empty, every "first" packet (a packet without an existing FDB entry) is sent to the OS kernel, where a rule is created after processing. After a certain timeout, the entry is removed both from the FDB and from ruleDb.
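
A sketch of what such a table may look like (the structure, field set, and hash function below are our assumptions for illustration, not Marvell's actual ruleDb code):

    #include <stdint.h>

    #define RULEDB_SIZE 4096                  /* power of two: cheap masking */

    struct nfp_rule {
        uint32_t src_ip, dst_ip;              /* flow identity (key fields)  */
        uint16_t src_port, dst_port;
        uint32_t new_src_ip, new_dst_ip;      /* rewritten addresses (NAT)   */
        int      out_ifindex;                 /* egress interface            */
        struct nfp_rule *next;                /* chaining on hash collisions */
    };

    static struct nfp_rule *rule_db[RULEDB_SIZE];

    /* Mix source and destination addresses/ports into one bucket index. */
    static unsigned rule_hash(uint32_t sip, uint32_t dip,
                              uint16_t sp, uint16_t dp)
    {
        uint32_t h = sip ^ (dip << 7 | dip >> 25) ^ ((uint32_t)sp << 16 | dp);
        h ^= h >> 16;                         /* fold high bits into low ones */
        return h & (RULEDB_SIZE - 1);
    }

    static struct nfp_rule *rule_lookup(uint32_t sip, uint32_t dip,
                                        uint16_t sp, uint16_t dp)
    {
        struct nfp_rule *r = rule_db[rule_hash(sip, dip, sp, dp)];

        while (r && !(r->src_ip == sip && r->dst_ip == dip &&
                      r->src_port == sp && r->dst_port == dp))
            r = r->next;
        return r;                             /* NULL: "first" packet, slow path */
    }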

Consider the processing of traffic in more detail:

  1. The raw data of the received packet is passed to the NFP input.
  2. If the packet is destined for a multicast MAC address, it is sent to the OS TCP/IP stack.
  3. If the FDB is used and it has no entry for this MAC address, the packet is sent to the OS stack.
  4. The entry for this MAC address is retrieved from the FDB. If the address is not marked as local, the system treats it as bridged and sends the packet through the interface specified in the FDB entry.
  5. If a VLAN or PPPoE header is detected, it is stripped and a pointer to the beginning of the IP header is calculated.
  6. Packets marked as fragments are passed to the OS network stack.
  7. If the packet carries ICMP data, it is sent to the stack.
  8. Packets with an expired lifetime (TTL) are sent to the OS stack. Strictly speaking, such packets should be dropped, but the ICMP Time Exceeded response has to be generated by the operating system.
  9. The packet is checked for an IPsec header, and such packets receive the corresponding processing with certificate verification.
  10. Next, the Destination NAT rule is looked up to determine the destination IP address for this packet.
  11. If there is no route for the resulting destination address, the packet is sent to the network stack. Such packets must also be discarded, but a corresponding ICMP response must be generated first.
  12. Next, the Source NAT rule is looked up, and the IP and TCP/UDP header fields are updated according to the DNAT and SNAT rules.
  13. The output interface for the packet is determined from the routing table.
  14. If PPP tunneling is used on the output interface, the IP packet is wrapped in a PPPoE header, after the TTL has been decremented and the Ethernet header updated. Since in this case the IP header checksum cannot be computed in hardware, it has to be recalculated. But because the old checksum and the change to the packet data are known, the sum is not computed from scratch; it is merely adjusted to the required value (see the sketch after this list).
  15. If the packet size exceeds the maximum for the interface, the packet is sent to the OS stack.
  16. Otherwise, the Ethernet header, checksum, and type-of-service fields are updated (when necessary and when a matching iptables record exists).
  17. The resulting Ethernet frame is sent out through the appropriate network interface.
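
The checksum adjustment mentioned in step 14 can be done with one's-complement arithmetic in the spirit of RFC 1624, HC' = ~(~HC + ~m + m'). A minimal sketch (our function names, with the TTL decrement as the driving example):

    #include <stdint.h>

    /* Adjust a 16-bit one's-complement checksum after one header word
     * changes from old_w to new_w (RFC 1624: HC' = ~(~HC + ~m + m')). */
    static uint16_t csum_adjust(uint16_t check, uint16_t old_w, uint16_t new_w)
    {
        uint32_t sum = (uint16_t)~check + (uint16_t)~old_w + new_w;
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* Example: decrementing TTL changes the 16-bit word at offset 8 of the
     * IPv4 header (TTL | protocol); bytes 10-11 hold the header checksum. */
    static void decrease_ttl(uint8_t *iph)
    {
        uint16_t old_w = (iph[8] << 8) | iph[9];
        uint16_t new_w, check;

        iph[8]--;                             /* TTL lives in byte 8 */
        new_w = (iph[8] << 8) | iph[9];
        check = csum_adjust((uint16_t)((iph[10] << 8) | iph[11]), old_w, new_w);
        iph[10] = check >> 8;
        iph[11] = check & 0xff;
    }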

The whole sequence of checks can also be represented graphically as a diagram:


Diagram: “Packet Processing in NFP”

It is easy to see that traffic processing in NFP is a set of checks for the most common special cases rather than a universal solution for all packet types. In most cases, however, this set of protocols is enough for a noticeable increase in routing performance.

As for the shortcomings of Marvell's fast path implementation, one cannot overlook that traffic is handed off to the OS kernel whenever an ICMP packet has to be generated. This increases the load on the router during network attacks or any other surge of ICMP traffic.

With a large volume of multicast traffic, the router will also experience increased load, since such traffic is not processed by NFP and passes through the OS network stack.

This implementation also lacks IPv6 support, although the developers have left room to add it in the future.

As for the shortcomings of fast path technology as a whole, note that it still shares processor time with the operating system and therefore cannot use all available resources. This problem is addressed by Marvell's multiprocessor solutions, such as the quad-core Armada XP processors.

Router Performance Measurement


What real impact does Network Fast Processing have on router performance? To answer this question, let us measure the rate of packets passing through a router with NFP turned on and off.

As the test device we take a router based on a Marvell Kirkwood 88F6282 system-on-chip with a clock frequency of 1 GHz. This processor has two 1000Base-TX network interfaces on board, which makes it a good fit for this class of device.


In the diagram: the architecture of the Marvell Kirkwood 88F6282 SoC

Traffic in most networks is not uniform over time, so a hardware or software traffic generator is needed to evaluate real performance. Let's consider several options for software packet generation.

PackETH is a GUI utility for generating Ethernet frames; versions are available for Linux, Windows, and Mac. It is one of the easiest traffic-generation tools to use.


The graphical interface of the utility looks like this:


PackETH utility interface

iperf is another traffic-generation solution. It is much more widespread, but offers almost no control over the packet format. This console utility measures network bandwidth by generating and receiving TCP and UDP packets.

To use it, just launch one copy of the application in server mode with the command:

 # iperf -s 

and a second copy on another machine in client mode, specifying the server's address or hostname:

 # iperf -c server_host 

The program will measure the network bandwidth for 10 seconds and report the result.

The pktgen kernel module also provides the ability to generate UDP traffic directly. The parameters of the generated packets are configured through the /proc/net/pktgen directory of the procfs file system. The simplest configuration is defined as follows:

 # echo "add_device eth0" > /proc/net/pktgen/kpktgend_0 # echo "count 1000" > /proc/net/pktgen/eth0 # echo "dst 192.168.1.1" > /proc/net/pktgen/eth0 # echo "pkt_size 1000" > /proc/net/pktgen/eth0 # echo "delay 50" > /proc/net/pktgen/eth0 

Run the generator:

 # echo "start" > /proc/net/pktgen/pgctrl 

When the generator finishes, the sending rate is reported in its status file, /proc/net/pktgen/eth0.

The main advantage of pktgen is that it builds a packet for transmission only once and then sends copies of it (the number of re-sends per built packet is controlled by its clone_skb parameter), which allows it to reach higher speeds.

There are other solutions for generating traffic and measuring network bandwidth, such as brute, netperf, mpstat or sprayd.

Since our task is not to verify every possible case, iperf's capabilities will suffice. We will send TCP and UDP packets of 1400 bytes in two modes: with Network Fast Processing turned off and turned on. NFP can be controlled directly through procfs via the /proc/net/mv_eth_tool interface. For example, to disable NFP, it is enough to send it the command "c 0":

 # echo "c 0" > /proc/net/mv_eth_tool 

Where "c" is the command code, "0" is the NFP status to be set.

Let's measure network performance in both modes and tabulate the results:

| Packet type                 | TCP  | UDP  |
|-----------------------------|------|------|
| Packet size, bytes          | 1400 | 1400 |
| Bandwidth without NFP, Mbps | 281  | 338  |
| Bandwidth with NFP, Mbps    | 551  | 552  |
Since the actual bandwidth depends heavily on the device configuration and the applications running on it, the absolute values should not be taken as definitive. What interests us above all is the performance gain from enabling NFP. As you can see, for TCP traffic the bandwidth nearly doubled (a 96% increase), which is quite noticeable. For UDP packets the effect is weaker, a 63% gain, but that is still a good result.

Examples of our developments



  1. One example of our developments that uses fast path technology is the AK1100 thin client, which we have already covered on Habr. The hardware platform of this device is based on the Marvell Kirkwood 88F6282 (Sheeva core, 1.6 GHz). This processor has two Gigabit Ethernet ports (connected to Marvell's external 88E1121R PHY) and two PCIe ports: the first is used to connect the GPU, and the second goes to an internal mini-PCI connector for additional external devices or Wi-Fi modules. More technical details about the project can be found here: development of the thin client.
  2. Another example is the IP-Plug mini-server, the first commercial plug computer in Russia and our first project based on a Marvell processor. The device was built around the Marvell Kirkwood 88F6283 running Debian 6.0 Linux. We will also describe on Habr how NetBSD was installed on this plug computer; for now, attentive readers can find a detailed description of the device here: mini-server development.


Conclusions


It would seem that at the current level of technology, a problem like insufficient performance should no longer be relevant. However, in some projects the developers at the Promwad electronics design center still face strict power consumption limits or a rigid cost framework. In such cases, NFP helps us significantly increase device efficiency purely in software.

Of course, Network Fast Processing is not a universal solution, and with non-standard protocols it can even interfere with correct traffic routing. But in most cases, software engineers can tune NFP to the given conditions and gain all the advantages of fast path technology in the devices they develop.

Source: https://habr.com/ru/post/243447/

