
Implementing a NetFlow sensor on an FPGA + CPU - flexibly and quickly

Good day!


As you can guess from the title, this is yet another article about NetFlow, but this time from an unusual angle: implementing a NetFlow sensor on an FPGA.




Introduction


Yes, there are already many articles about NetFlow on Habr: from a detailed overview and a configuration HOWTO to applied uses such as catching a virus attack and traffic metering.


But this article is not about how to use NetFlow, it is about how to implement it.


Building a NetFlow sensor is interesting because it requires flexibility (to support NetFlow templates that can change at run time) and, at the same time, very high performance (to handle the traffic and work with memory efficiently).


At my job (I am an FPGA developer at STC Metrotek) we use platforms that let us choose between software and hardware implementations.


In most of our tasks, however, the software side mostly handles management while the FPGA does the heavy lifting. That is why building a NetFlow sensor, which is more interesting from the Software/Hardware Co-design point of view, seemed like a good occasion to share our work in this article.


About NetFlow


To keep the story coherent and agree on terminology, I will describe NetFlow very briefly; this information is available everywhere.


NetFlow is a protocol created by Cisco Systems.


image

What for? To be able to monitor network traffic remotely.


That is, somewhere in the local network there is an L3 switch acting as a NetFlow sensor, and somewhere else there is a collector that knows everything about what is happening in the network.


At the same time, the data we send to the collector must not create extra load on the network, so simply mirroring traffic to an analyzer will not do.


Items in NetFlow:



What else you need to know about NetFlow:



From these fields we conclude:



And keep in mind: this definition of a flow is not rigidly fixed by the standard. For our own needs we can change the notion of a flow, for example by aggregating flows by direction.
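To make the terms more tangible, here is a minimal sketch of how a flow could be modeled in code. The exact set of key fields is my assumption for illustration (the classic 5-tuple plus ToS and input interface), not a normative list:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FlowKey:
        """Fields that identify a flow (illustrative choice of key fields)."""
        src_ip: int      # IPv4 source address
        dst_ip: int      # IPv4 destination address
        src_port: int
        dst_port: int
        protocol: int    # IP protocol number (6 = TCP, 17 = UDP, ...)
        tos: int
        input_if: int    # ingress interface index

    @dataclass
    class FlowRecord:
        """Counters accumulated for one flow."""
        packets: int = 0
        octets: int = 0
        first_seen_ms: int = 0   # used for the active timeout
        last_seen_ms: int = 0    # used for the inactive timeout
        tcp_flags: int = 0       # OR of the TCP flags seen in the flow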


Design and exploratory work


Now that we know what NetFlow is, let's formulate the problem statement:


We need a NetFlow sensor such that:



Both target platforms are heterogeneous: CPU + FPGA.


The probes are based on an Altera SoC (Cyclone V): an ARM processor plus a small FPGA, all on a single chip.


The B100 is a powerful processor (Intel Core i7) plus a large FPGA (Stratix V), connected over PCIe.


Therefore, we have a lot of freedom in terms of Hardware / Software Co-design.


But let's start by explaining why the obvious implementation options do not fit.


SW Only


We do not use the FPGA at all and implement NetFlow purely in software.


We get a lot of flexibility, but what about performance?


  1. Receiving and processing input packets. It will be very hard for a userspace utility to process traffic at line rate. Even if we made the FPGA behave like an Intel NIC and used the DPDK libraries, it would be practically impossible, at least within realistic time frames, to build a network stack that receives 64-byte packets at 100G line rate without loss. Even for a 1G link on the ARM there are serious questions about how long it would take to optimize the stack.
  2. CPU load. On the B100 platform we can assume that, having solved the network stack problem, we would still have enough CPU time for something besides receiving packets; but on a small ARM there are doubts whether the system could even keep up with constantly scanning the flows in memory, let alone receive packets.

So we conclude: yes, such an implementation is possible, but:


  1. If we are talking about realistic development time frames, sampling of the input packets is inevitable.
  2. On the SoC platform there is a serious question about the maximum number of flows we can keep in memory, because all of them have to be scanned constantly.
  3. We know how to do it in an FPGA. Even if a software implementation is possible, it is quite obvious that many of these things can be done far more efficiently in an FPGA.

Still, looking at the software approach gave us something useful: projects where one can peek at the nuances of implementing a NetFlow sensor:



FPGA only


Now let's go to the other extreme - everything in the FPGA:


  1. Receiving and processing input packets. Maximum speed: line rate both on the B100 and on the SoC. The price is development time, but we would not be writing this from scratch - we already have a lot of groundwork in this area.
  2. CPU load is no longer an issue, and the number of flows is limited only by the memory size. With reasonable numbers (1-2 million flows) we are not limited by the number of reads needed to scan the flows. The memory interface problem remains - its bandwidth has to be shared between adding flows and scanning them - but we can always declare scanning the lower-priority task and do it only when we are not adding a new packet.

But there is a very important question about the flexibility of building NetFlow export packets.


Option 1 - we give up on modifiable NetFlow v9 templates and write an FPGA generator that can only send data using a single, fixed template.
If we want to change the template, we have to rewrite this FPGA module. This does not suit us, because different collectors handle NetFlow differently and sometimes misinterpret certain fields: for example, some NetFlow v9 fields have a size that the template is allowed to redefine, but not every collector is ready for that and expects the field to have a constant size.


Option 2 - we implement dynamic template changes in the FPGA. This is possible, but such a module would consume a lot of resources, and its development would take a very long time, both writing the module itself and debugging it.


One more downside of such an implementation: the standard allows NetFlow export packets to be transported not only over UDP but also over SCTP. So if we ever want to support that, we would have to implement half a network stack in the FPGA.


SW + HW


Now that we are convinced that neither an FPGA-only nor a software-only implementation fits this task, let's move on to a combined solution.


Functional blocks


To make it easier to find a solution, we will do the following:



image

This diagram shows the main functional blocks of the NetFlow sensor:



What this partitioning gives us:



Now, keeping this scheme in mind, let's go through the options:


Option 1


So far we know the following for certain:



The entire accumulation path, from receiving a packet to writing the flow into memory, should happen in the FPGA.


Suppose this is where the FPGA's work ends: the memory where the flows are stored is shared, the software has access to it, and the software itself scans the flows in memory and reads out the expired ones.


In this case we are still left with high CPU load when there are many flows, and on top of that a new problem appears - memory access conflicts: the FPGA may want to update a flow at the very moment the software is reading it, and then it is very easy to lose part of the flow information.


This problem has solutions, but they would make the architecture more complex.


Option 2


Let's go a bit further - the memory remains shared, but now the FPGA scans the flows (it regularly reads them and checks their timers). As soon as a flow needs to be exported, the FPGA passes the software a pointer to that flow. We have taken the load off the CPU, but the access-conflict problem has not gone away.


Option 3


We go even further - when the FPGA reads a flow from memory and sees that it is due for export, it removes the flow from memory itself and sends it to the software (over any available interface).
In this case the memory no longer has to be shared, and we can resolve conflicts inside the FPGA at the memory access arbitration stage.


The software part, which can be adapted to any export protocol, is then responsible for building the NetFlow export packets.


The last option is the one that suits us, so we can proceed to building a proof of concept on the SoC platform.


Implementation


Before implementing it, let's state a few requirements:



All this led us to the following implementation:


image

What is Avalon-ST?

Avalon Streaming Interface is an interface for transferring data streams inside an FPGA, used in Altera's IP cores. It can carry either a continuous data stream or packetized data (start-of-packet / end-of-packet signals are used). You can read more in the Avalon Interface Specifications.pdf



Flow data storage structure


The simplest data storage structure is a hash table without a collision resolution mechanism.


image

The hash computed over the flow key is treated as the offset in memory at which the flow record should be placed.


When adding a new flow, we compute its hash, read the record at that index from memory and check that it really is the same flow. If it is not (a hash collision has occurred), we force the NetFlow scan engine to read the old flow out of memory. Yes, its timers have not expired yet and it should still be kept in memory, but this is our simplest way of dealing with collisions. The drawbacks of this approach:



The first problem is addressed by keeping collisions rare: the lower the collision probability, the smaller the possible surge in load.


The second problem can be solved in the software part by implementing a small buffer for storing such prematurely exported flows.
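To make the add-flow logic concrete, here is a minimal software sketch of the same idea. The names and sizes are illustrative assumptions; in the real sensor this work is done by the accumulation logic in the FPGA, not by Python code:

    import zlib

    TABLE_BITS = 20
    TABLE_SIZE = 1 << TABLE_BITS          # 2**20 slots, as in our hash table

    table = [None] * TABLE_SIZE           # each slot: (key_bytes, record) or None
    export_queue = []                     # flows pushed out prematurely on collision

    def slot_index(key: bytes) -> int:
        """CRC32 (polynomial 0x04C11DB7, as in Ethernet), truncated to 20 bits."""
        return zlib.crc32(key) & (TABLE_SIZE - 1)

    def add_packet(key: bytes, length: int, now_ms: int) -> None:
        idx = slot_index(key)
        entry = table[idx]
        if entry is not None and entry[0] != key:
            # Hash collision: force-export the old flow even though
            # its timers have not expired yet.
            export_queue.append(entry)
            entry = None
        if entry is None:
            record = {"packets": 0, "octets": 0, "first": now_ms, "last": now_ms}
            table[idx] = (key, record)
        else:
            record = entry[1]
        record["packets"] += 1
        record["octets"] += length
        record["last"] = now_ms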


Choosing a hash function


The hash function currently used is CRC32 with the polynomial 0x04C11DB7 (the one used in Ethernet). Of course, we considered choosing a fancier hash function with fewer collisions, but simulations showed that CRC32 is good enough.


For the simulation we made a simple Python hash table with 2^20 positions (this size seemed the most suitable for the first implementation, so that is what we test the hash functions on). In other words, only 20 bits of the resulting hash value are used to determine the position in the table.


We also wrote a generator script that produces 2^20 unique words of 17 bytes each (that is how much data the fields unique to a flow add up to).


We then add this data to the table using different hash functions:



The CRC32 polynomials 0xEDB88321 and 0x82608EDB are transformations of the polynomial 0x04C11DB7. In one of our past projects we used such CRC functions to quickly generate a large number of different hashes, so we decided to check them here as well.


LookUp3 was chosen because it was designed specifically for hash tables and consists of operations that are quite tolerable for an FPGA: additions and shifts.


MurmurHash3 was chosen not only for its cool name, but also as an example of a multiplicative function that would be rather expensive to implement in an FPGA but should, in theory, fill the table more evenly.


After adding each new word we recorded the fill level of each table, so that we could plot the following graph:


image

As you can see from the graph, there are no significant differences on our data.


So CRC32 suits us just fine, at least until we decide to change the size of the data the hash is computed over, or until we want to protect ourselves against a deliberate table-filling attack.
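For reference, a minimal sketch of such a fill-level simulation could look like this. It only illustrates the approach and is not our actual test script; in particular, the keys here are random rather than derived from real flow fields:

    import os
    import zlib

    TABLE_BITS = 20
    TABLE_SIZE = 1 << TABLE_BITS

    def crc32_20bit(key: bytes) -> int:
        # zlib.crc32 implements the Ethernet CRC (polynomial 0x04C11DB7, reflected form)
        return zlib.crc32(key) & (TABLE_SIZE - 1)

    # 2**20 unique 17-byte keys, the same size as the flow key fields
    keys = set()
    while len(keys) < TABLE_SIZE:
        keys.add(os.urandom(17))

    occupied = [False] * TABLE_SIZE
    fill_history = []
    filled = 0
    for key in keys:
        idx = crc32_20bit(key)
        if not occupied[idx]:
            occupied[idx] = True
            filled += 1
        fill_history.append(filled / TABLE_SIZE)

    # With a good hash, inserting 2**20 keys into 2**20 slots
    # fills roughly 63% of the table (1 - 1/e).
    print(f"final fill level: {fill_history[-1]:.3f}")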


As the test platform we chose one of our boards with an SoC on board and two network interfaces connected to the FPGA.


image

The packets that feed NetFlow are taken from transit traffic - packets forwarded from one port to the other through the FPGA. Enabling and disabling this forwarding had already been implemented on this platform.


This way we can create a load of up to 2 Gb/s while controlling how many packets are fed to NetFlow.


On the chosen platform the FPGA and the CPU share the same memory. To carve out a region that only the FPGA will use (for storing the flow data), the amount of memory visible to the Linux kernel is limited at boot time, leaving us about 150 MB - enough to store 1 million flows.
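Just to illustrate the idea (the exact mechanism and sizes on our board differ), on a hypothetical board with 1 GB of RAM such a reservation could be expressed with the standard kernel boot parameter:

    # hypothetical example: Linux sees only the first 874 MB,
    # the top ~150 MB stays invisible to the kernel and is left for the FPGA
    mem=874M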


To build the export packets themselves we wrote a simple Python utility that uses the scapy library to listen on the network interface where the flow data arrives. With the same library it forms the NetFlow export packets and sends them off.
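That utility is not published here; as a rough illustration of what building a NetFlow v9 export packet involves, here is a minimal sketch that packs a template FlowSet and one data record by hand and sends them over UDP. The field selection, record layout and collector address are assumptions made for the example, not our actual template:

    import socket
    import struct
    import time

    COLLECTOR = ("192.0.2.10", 2055)   # example collector address
    TEMPLATE_ID = 256                  # user template IDs start at 256

    def v9_header(count: int, seq: int) -> bytes:
        # version, record count, sysUptime (ms), UNIX secs, sequence, source ID
        return struct.pack("!HHIIII", 9, count, 0, int(time.time()), seq, 0)

    def template_flowset() -> bytes:
        # field type/length pairs: IPV4_SRC_ADDR(8), IPV4_DST_ADDR(12),
        # L4_SRC_PORT(7), L4_DST_PORT(11), PROTOCOL(4), IN_PKTS(2), IN_BYTES(1)
        fields = [(8, 4), (12, 4), (7, 2), (11, 2), (4, 1), (2, 4), (1, 4)]
        body = struct.pack("!HH", TEMPLATE_ID, len(fields))
        for ftype, flen in fields:
            body += struct.pack("!HH", ftype, flen)
        # FlowSet ID 0 means "template FlowSet"
        return struct.pack("!HH", 0, 4 + len(body)) + body

    def data_flowset(src, dst, sport, dport, proto, pkts, octets) -> bytes:
        # the record layout must follow the field order declared in the template
        rec = socket.inet_aton(src) + socket.inet_aton(dst)
        rec += struct.pack("!HHBII", sport, dport, proto, pkts, octets)
        pad = (4 - (4 + len(rec)) % 4) % 4        # FlowSets are padded to 32 bits
        return struct.pack("!HH", TEMPLATE_ID, 4 + len(rec) + pad) + rec + b"\x00" * pad

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = template_flowset() + data_flowset("10.0.0.1", "10.0.0.2", 1234, 80, 6, 42, 61234)
    sock.sendto(v9_header(count=2, seq=0) + payload, COLLECTOR)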


Conclusion


Performance


In theory, this NetFlow implementation is limited only by memory bandwidth (my colleague Des333 measured this bandwidth in practice on the SoC platform and got about 20 Gb/s; you can read more in his article here).


Is 20 Gb/s enough for us?


Adding a single packet costs us a 45-byte read from memory and a 76-byte write.


In the worst case, Ethernet traffic on a 1G interface amounts to 1,488,095 packets per second (64-byte packets at line rate).


That means we put a load of about 1.44 Gb/s on the memory. The rest of the bandwidth can go to flow scanning.
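The arithmetic behind that figure, as a quick check:

    pps = 1_488_095                 # 64-byte packets at 1G line rate
    bytes_per_packet = 45 + 76      # read + write per added packet
    load_gbps = pps * bytes_per_packet * 8 / 1e9
    print(f"{load_gbps:.2f} Gb/s")  # ~1.44 Gb/s out of the ~20 Gb/s available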


In the current implementation, however, the performance is much lower: in practice we cannot cope with the worst-case line rate (64-byte packets) and some packets are not counted in the statistics. Simulation gives a numerical estimate of the problem: with the current memory settings the lossless limit is about 1,420,000 packets per second, which corresponds to line rate with 69-byte packets.


This is because the accumulator works strictly sequentially: it fully processes one packet before taking on the next.


In addition, the DDR3 memory latency turned out to be quite large: the interconnect automatically generated by the development environment plus the Stream SDRAM Ctrl IP core give about 15 clock cycles between a read request and the arrival of data at a frequency of 62.5 MHz.


The solution is to pipeline memory accesses when adding data: we issue a read request and, while waiting for the response, issue the read for the next flow (and so on, up to 15 outstanding requests).


We can also raise the frequency at which the Simple NetFlow IP core and all memory accesses run.


Known issues


This is only a PoC, so there are a number of limitations, such as:




That is about it. Thank you for reading to the end.


I will be glad to answer your questions in the comments.


PS


By the way, my colleagues from the systems group have since turned those test sketches in Python into a proper Linux driver and a Rust daemon that builds the NetFlow packets.


Maybe they will write about it someday. For now they are resting after the article about our other joint project, AES encryption acceleration; if you are interested, you can read it here.

Source: https://habr.com/ru/post/327894/

