Good day!
As you may have guessed from the title, this is yet another article about NetFlow, but this time from an unusual angle: implementing a NetFlow sensor on an FPGA.
Yes, there are many articles about NetFlow on Habr: from a detailed analysis and a tuning HOWTO to practical uses such as catching a virus attack and traffic accounting.
But this article is not about how to use NetFlow; it is about how to implement it.
Building a NetFlow sensor is an interesting task because it demands two things at once: flexibility, to support NetFlow templates that can change at run time, and very high performance, to process traffic and work with memory efficiently.
Where I work (I am an FPGA programmer at STC Metrotek), we use platforms that let us choose between software and hardware implementations.
But in most of our tasks the software part deals mainly with management, while the FPGA does the main work. That is why building a NetFlow sensor, as something more interesting in terms of software/hardware co-design, seemed to us a good reason to share our work with you in this article.
So as not to lose the thread of the story, and to settle the terminology, I will say a few words about NetFlow, though very briefly, because this information is available everywhere.
NetFlow is a protocol developed by Cisco Systems.
What for? To make it possible to monitor network traffic remotely.
That is, somewhere in the local network there is an L3 switch acting as a NetFlow sensor, and somewhere else there is a collector that knows everything about what is happening in the network.
The data sent to the collector must not put extra load on the network, so simply mirroring traffic to an analyzer is not an option.
NetFlow components:
Sensor (aka Exporter): a device that collects information about flows in the network.
This is usually an L3 switch or router that periodically (usually once per second) sends flow information to the collector.
What else you need to know about NetFlow:
Export packets are the packets that the sensor sends to the collector.
They carry the whole essence of NetFlow. You have probably heard that NetFlow comes in different versions; the format of these export packets is the main difference between them.
NetFlow versions: v1, v5, v8, v9. v9 is the most common; the rest either provide limited functionality (v1, v5) or are unnecessarily complex (v8).
In v9, the sensor decides what information it can provide, and the collector adapts to it. Along with the data, the sensor sends a template that describes how to interpret this data. Templates are very flexible. You can read more about NetFlow v9 in RFC 3954.
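To make the template idea more concrete, here is a minimal sketch of how a v9 export packet with one Template FlowSet and one Data FlowSet could be assembled. The header layout and field type numbers follow RFC 3954; the particular set of fields, the template ID 256, and all function names are assumptions chosen for illustration, not the template our sensor actually uses.

```python
import socket
import struct

# Field type numbers as defined in RFC 3954.
IN_BYTES, IN_PKTS, PROTOCOL = 1, 2, 4
L4_SRC_PORT, IPV4_SRC_ADDR = 7, 8
L4_DST_PORT, IPV4_DST_ADDR = 11, 12

TEMPLATE_ID = 256  # hypothetical; IDs below 256 are reserved for FlowSet IDs
# Hypothetical template: (field type, field length in bytes).
TEMPLATE = [
    (IPV4_SRC_ADDR, 4), (IPV4_DST_ADDR, 4),
    (L4_SRC_PORT, 2), (L4_DST_PORT, 2),
    (PROTOCOL, 1), (IN_PKTS, 4), (IN_BYTES, 4),
]


def pad4(body: bytes) -> bytes:
    """Pad a FlowSet body with zeros to a 4-byte boundary."""
    return body + b"\x00" * (-len(body) % 4)


def template_flowset() -> bytes:
    body = struct.pack("!HH", TEMPLATE_ID, len(TEMPLATE))
    for field_type, field_len in TEMPLATE:
        body += struct.pack("!HH", field_type, field_len)
    # FlowSet ID 0 marks a Template FlowSet; length includes the 4-byte header.
    return struct.pack("!HH", 0, 4 + len(body)) + body


def data_flowset(records) -> bytes:
    body = b""
    for src, dst, sport, dport, proto, pkts, octets in records:
        body += socket.inet_aton(src) + socket.inet_aton(dst)
        body += struct.pack("!HHBII", sport, dport, proto, pkts, octets)
    body = pad4(body)
    # A Data FlowSet is tagged with the ID of the template that describes it.
    return struct.pack("!HH", TEMPLATE_ID, 4 + len(body)) + body


def export_packet(records, uptime_ms, unix_secs, sequence, source_id=0) -> bytes:
    count = 1 + len(records)  # one template record plus the data records
    header = struct.pack("!HHIIII", 9, count, uptime_ms,
                         unix_secs, sequence, source_id)
    return header + template_flowset() + data_flowset(records)
```

Changing the exported fields then only means changing the `TEMPLATE` list and the record packing, which is exactly the kind of flexibility that is cheap in software.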
There is also IPFIX, which at the moment is functionally the same as NetFlow v9.
The difference is that NetFlow is controlled by Cisco, while IPFIX is standardized in an RFC.
From these fields we conclude:
And do not forget: this definition of a flow is not strictly fixed by the standard. Depending on our needs, we can change the notion of a flow, for example by aggregating flows by direction.
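As an illustration of how the notion of a flow can be changed, the sketch below merges both directions of a conversation into a single flow by ordering the two endpoints. The key here is the common 5-tuple, which is an assumption made for brevity; it is not necessarily the exact set of fields our sensor uses, and the names `FlowKey` and `bidirectional_key` are hypothetical.

```python
import ipaddress
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def bidirectional_key(key: FlowKey) -> FlowKey:
    """Map both directions of a conversation onto one key.

    The endpoints are sorted, so A -> B and B -> A produce the same key.
    """
    a = (ipaddress.ip_address(key.src_ip), key.src_port)
    b = (ipaddress.ip_address(key.dst_ip), key.dst_port)
    low, high = sorted((a, b))
    return FlowKey(str(low[0]), str(high[0]), low[1], high[1], key.protocol)
```

With such a key, a request and its reply update the same record, at the cost of losing per-direction counters.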
Now that we know what we need about NetFlow, let us formulate the requirements:
We need a NetFlow sensor such that:
Both target platforms are heterogeneous: CPU + FPGA.
Probes are based on an Altera SoC (Cyclone V): an ARM processor plus a small FPGA, all on one die.
The B100 has a powerful processor (Intel Core i7) and a large FPGA (Stratix V), connected via PCIe.
This gives us a lot of freedom in terms of hardware/software co-design.
But let's start by explaining why the obvious implementation options are not suitable.
We do not use the FPGA at all and rely on a software NetFlow solution.
We get a lot of flexibility, but what about performance?
Thus, we can conclude: yes, such an implementation is possible, but:
Still, we did learn something useful from considering the software implementation, namely projects where you can look up the finer points of implementing a NetFlow sensor:
Now let's go to the other extreme: everything in the FPGA:
But there is a very important question about flexibility in building NetFlow packets.
Option 1: we give up on modifying NetFlow v9 templates and write a generator for the FPGA that can send data using only one template.
If we want to change the template, we will have to rewrite this FPGA module. This approach does not suit us, because different collectors may handle NetFlow differently and sometimes do not fully understand certain fields. For example, some NetFlow v9 fields have a size that the template can redefine, but not all collectors are ready for this: they expect the field to have a constant size.
Option 2: we implement dynamic template changes in the FPGA. This is possible, but such a module will take up a lot of resources. In addition, developing and debugging it will take a very long time.
One more drawback of this implementation: the standard says that NetFlow export packets can be transported not only over UDP but also over SCTP. So if we want to support that, we will have to implement half a network stack in the FPGA.
Now that we are convinced that neither an FPGA-only nor a software-only implementation suits this task, let us move on to a joint solution.
To make it easier to find a solution, we will do the following:
This diagram shows the main functional blocks of the NetFlow sensor:
What this partition gave us:
Now, keeping this diagram in mind, let's move on to the options:
So far we know the following for certain:
The entire accumulation path, from receiving a packet to writing to memory, must run in the FPGA.
Suppose the FPGA's work ends there: the memory where the flows are stored is shared. Software has access to it and itself checks the flows in memory and reads them out.
In this case we still have the problem of a heavy CPU load when there are many flows, and on top of that a new problem appears, memory access collisions: the FPGA may want to update a flow that software was just about to read, and in that case it is very easy to lose some flow information.
This problem has solutions, but they complicate the architecture.
Then we go a little further: the memory remains shared, but the FPGA checks the flows (it regularly reads them and checks their timers). As soon as a flow needs to be exported, the FPGA passes software a pointer to that flow. We have taken the load off the CPU, but the collision problem remains unsolved.
We go further still: when the FPGA reads a flow from memory and sees that it is due for export, it removes the flow from memory itself and sends it to software (over any available interface).
In this case the memory no longer has to be shared, and we can resolve conflicts inside the FPGA, at the memory access arbitration stage.
The software part, which can be changed to work with any protocol, builds the NetFlow export packets.
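The last option can be modeled in software roughly as follows. This is only a sketch of the idea, not the FPGA logic: the scanner walks the table, and any flow whose timers have expired is removed and handed over for export in one step. The split into an active and an inactive timeout, their values, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass

ACTIVE_TIMEOUT = 60.0    # assumed: export a long-lived flow after this many seconds
INACTIVE_TIMEOUT = 15.0  # assumed: export a flow that has been idle this long


@dataclass
class Flow:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


def is_expired(flow: Flow, now: float) -> bool:
    """A flow is due for export when either of its timers has run out."""
    return (now - flow.last_seen >= INACTIVE_TIMEOUT
            or now - flow.first_seen >= ACTIVE_TIMEOUT)


def scan(table: dict, now: float) -> list:
    """Remove expired flows from the table and return them for export.

    Removal and hand-over happen together, so the consumer never reads
    a record that the accumulator might still be updating.
    """
    expired_keys = [key for key, flow in table.items() if is_expired(flow, now)]
    return [(key, table.pop(key)) for key in expired_keys]
```

Because the exported record leaves the table at the moment it is read, the software side only ever sees finished records and needs no locking against the accumulator.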
The last option suited us, so we can move on to building a proof of concept on the SoC platform.
Before implementing it, we set out some requirements:
All this led us to the following implementation:
Avalon Streaming Interface: an interface for transferring data streams inside an FPGA, used in Altera's IP cores. It can carry either a continuous data stream or packetized data (using start-of-packet and end-of-packet signals). You can read more here: Avalon Interface Specifications.pdf
NetDMA is also our own IP core, a DMA controller that receives Avalon-ST packets and writes them to memory using descriptors. We already have a driver that supplies this DMA with descriptors. This DMA and driver are currently used in our network stack.
Simple NetFlow Accumulator is the NetFlow sensor IP core itself. It handles flow detection, reads and updates information in memory for each received packet, constantly scans memory for flows due for export, and arbitrates memory requests between adding and scanning.
The simplest data structure for storage is a hash table without a collision resolution mechanism.
We treat the result of the flow's hash function as an offset in memory at which the corresponding information must be placed.
When adding a new flow, we compute its hash, read memory at the resulting index, and check that the stored entry really is this flow. If it is not (a hash collision has occurred), we force the NetFlow scanning engine to read the stored flow out of memory. Yes, its timers have not yet expired and it should still be kept in memory, but this is the simplest way for us to deal with collisions. Cons of this approach:
The first problem is mitigated by reducing the probability of collisions: the lower it is, the smaller the possible load spike.
The second problem can be solved in the software part, by implementing a small buffer of its own for storing such premature exports.
The hash function currently used is CRC32 with the polynomial 0x04C11DB7 (the one used in Ethernet). Of course, we considered choosing a more interesting hash function that would give fewer collisions, but simulations showed that CRC32 is fine.
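The collision handling described above can be sketched as a small software model. This is not the FPGA logic itself: the slot layout, the `FlowTable` name, and the table size parameter are assumptions. Python's `zlib.crc32` computes the same CRC-32 that Ethernet uses, so it serves as the hash here.

```python
import zlib


class FlowTable:
    """Hash table with no collision resolution: one flow per slot."""

    def __init__(self, bits: int):
        self.mask = (1 << bits) - 1
        self.slots = [None] * (1 << bits)

    def index(self, key: bytes) -> int:
        # Only the low `bits` bits of the CRC are used as the slot index.
        return zlib.crc32(key) & self.mask

    def update(self, key: bytes, length: int):
        """Account for one packet of the flow identified by `key`.

        Returns the evicted (key, packets, octets) record if another flow
        occupied the slot, otherwise None.
        """
        i = self.index(key)
        slot = self.slots[i]
        evicted = None
        if slot is not None and slot[0] != key:
            # Collision: export the resident flow early and reuse the slot.
            evicted = tuple(slot)
            slot = None
        if slot is None:
            slot = [key, 0, 0]
            self.slots[i] = slot
        slot[1] += 1
        slot[2] += length
        return evicted
```

Every non-`None` return value corresponds to one premature export, so counting them in a simulation gives a direct estimate of the extra load that collisions create.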
For the simulation, we wrote a simple hash table in Python that contains 2^20 positions (this size seemed the most suitable for the first implementation, so we test hash functions on it). That is, only 20 bits of the resulting hash value are used to determine the position in the table.
We also wrote a generator script that creates unique 17-byte words (this is how much data the fields that identify a flow add up to).
We add this data to our table using different hash functions:
The CRC32 variants with polynomials 0xEDB88321 and 0x82608EDB are transformations of the polynomial 0x04C11DB7. In a past project we used such CRC functions to quickly generate a large number of different hashes, so we decided to check them here as well.
LookUp3 was chosen because it was designed specifically for hash tables and uses operations that are quite tolerable for an FPGA: addition and shift.
MurmurHash3 was chosen not only for its cool name, but also as an example of a multiplicative function, which would be rather expensive to implement in an FPGA but should, in theory, fill the table better.
After adding each new word to each of the tables, we recorded the table's fill level in order to build this graph:
As the graph shows, there are no significant differences on our data.
So CRC32 suits us well until we decide to change the size of the data we hash, or until we want to protect ourselves from a deliberate attack that fills our table.
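An experiment of this kind is easy to reproduce in outline. The sketch below is not our original script: it generates unique random 17-byte words and counts how many table slots each hash function occupies. Only the Python standard library is used, so BLAKE2 stands in as the reference hash instead of LookUp3 and MurmurHash3; the function names and the seed are assumptions.

```python
import hashlib
import random
import zlib


def unique_words(count: int, size: int = 17, seed: int = 1) -> list:
    """Generate `count` distinct pseudo-random words of `size` bytes."""
    rng = random.Random(seed)
    words = set()
    while len(words) < count:
        words.add(rng.getrandbits(size * 8).to_bytes(size, "big"))
    return sorted(words)


def crc32_hash(word: bytes) -> int:
    return zlib.crc32(word)


def blake2_hash(word: bytes) -> int:
    # Stand-in reference hash; not one of the functions compared in the article.
    return int.from_bytes(hashlib.blake2b(word, digest_size=4).digest(), "big")


def occupied_slots(hash_fn, words, bits: int = 20) -> int:
    """Number of distinct slots hit when the low `bits` bits index the table."""
    mask = (1 << bits) - 1
    return len({hash_fn(word) & mask for word in words})
```

For an ideal hash, inserting n words into a table of m slots occupies about m(1 - e^(-n/m)) of them, so a function whose result stays close to that figure on the data of interest is good enough for this table.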
For the test we chose one of our platforms with a SoC on board and two network interfaces wired to the FPGA.
Packets that go to NetFlow are taken from transit traffic, that is, when a packet is forwarded from one port to the other through the FPGA. Enabling and disabling this forwarding was already implemented on this platform.
Thus we can create a load of up to 2 Gb/s and at the same time control the number of packets sent to NetFlow.
On the selected platform, the FPGA and CPU share memory. To set aside memory that only the FPGA will work with (for storing packets), we limit the memory visible to the Linux kernel at boot, which leaves ~150 MB, enough to store 1 million flows.
To build the export packets themselves, we wrote a simple Python utility that uses the scapy library to listen on the network interface where the flow data arrives. Using the same library, it builds NetFlow export packets and sends them.
In theory, this NetFlow implementation is limited only by memory bandwidth (in one of his articles, my colleague Des333 measured this bandwidth in practice on the SoC platform and got about 20 Gb/s; you can read more in his article here).
Is 20 Gb/s enough for us?
For us, adding each packet means reading 45 bytes from memory and writing 76.
In the worst case, Ethernet traffic on a 1G interface can produce 1,488,095 packets per second (64-byte packets at line rate).
That puts a load of 1.44 Gb/s on the memory. The rest of the bandwidth can be given to scanning the flows.
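The arithmetic behind these figures can be checked with a few lines of Python. The only number not taken from the text is the 20 bytes of per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap); the function names are hypothetical.

```python
LINE_RATE_BPS = 1_000_000_000  # 1G Ethernet
WIRE_OVERHEAD_BYTES = 20       # preamble + SFD (8) and inter-frame gap (12)
READ_BYTES = 45                # read from memory per packet
WRITE_BYTES = 76               # written to memory per packet


def packets_per_second(frame_bytes: int, line_rate_bps: int = LINE_RATE_BPS) -> float:
    """Maximum packet rate for frames of the given size at line rate."""
    return line_rate_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def memory_load_bps(pps: float) -> float:
    """Memory traffic in bits per second generated by accumulating `pps` packets."""
    return pps * (READ_BYTES + WRITE_BYTES) * 8


worst_case_pps = packets_per_second(64)
worst_case_load = memory_load_bps(worst_case_pps)
print(f"{worst_case_pps:.0f} packets/s, {worst_case_load / 1e9:.2f} Gb/s")
```

Each packet costs 121 bytes of memory traffic, so the worst-case accumulation load is a small fraction of the measured 20 Gb/s.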
But the current implementation performs much worse: in practice we cannot keep up with worst-case line rate (64-byte packets), and some packets are not counted in the statistics. The problem can be quantified in simulation, where the maximum lossless load, with the current memory settings, is ~1,420,000 packets per second. This corresponds to line rate for 69-byte packets.
This is because the accumulator works sequentially: it first processes each packet completely and only then takes the next one.
In addition, the DDR3 memory latency turned out to be quite large (the interconnect automatically generated by the development environment plus the Stream SDRAM Ctrl IP core gave ~15 clock cycles of latency between a read request and receiving the data, at a frequency of 62.5 MHz).
The solution to this problem is to pipeline memory accesses when adding data. That is, we issue a read request and, while waiting for the response, issue the read for the next flow (and so on, up to 15 times).
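A rough cycle-count model shows why pipelining helps. It is a simplification under stated assumptions, not a description of the real accumulator: `work` (the cycles spent on a packet apart from waiting for memory) is a hypothetical parameter, and the model ignores the case where consecutive packets belong to the same flow and must not overtake each other.

```python
import math

CLOCK_HZ = 62.5e6   # accumulator clock, from the text
READ_LATENCY = 15   # cycles between a read request and its data, from the text


def cycles_per_packet(clock_hz: float, packets_per_second: float) -> float:
    """Clock cycles available per packet at a given packet rate."""
    return clock_hz / packets_per_second


def sequential_cycles(packets: int, latency: int, work: int) -> int:
    """Each packet waits for its read and is processed before the next starts."""
    return packets * (latency + work)


def pipelined_cycles(packets: int, latency: int, work: int, depth: int) -> int:
    """Steady-state estimate with up to `depth` packets in flight at once."""
    if packets == 0:
        return 0
    # A new packet can start once a slot frees up and the processing
    # stage is available, whichever takes longer.
    interval = max(work, math.ceil((latency + work) / depth))
    return (latency + work) + (packets - 1) * interval
```

At 62.5 MHz, a lossless rate of about 1,420,000 packets per second corresponds to roughly 44 cycles per packet (`cycles_per_packet(CLOCK_HZ, 1_420_000)`). In the pipelined case the memory latency is paid once rather than per packet, so throughput is bounded by the processing work alone.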
We can also raise the clock frequency at which the Simple NetFlow IP core and all memory access run.
This is only a PoC, so it has a number of limitations, such as:
We do not track the end of a flow by TCP flags (only by timers), although NetFlow implies that this is our job;
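For reference, the missing check is small. The sketch below assumes that the flags in question are the TCP FIN and RST bits, which conventionally mark the end of a TCP flow; the function name is hypothetical, and the PoC does not contain this logic.

```python
IPPROTO_TCP = 6
TCP_FIN = 0x01
TCP_RST = 0x04


def ends_flow(protocol: int, tcp_flags: int) -> bool:
    """True if this packet signals that its TCP flow is finished."""
    return protocol == IPPROTO_TCP and bool(tcp_flags & (TCP_FIN | TCP_RST))
```

A flow marked this way could be exported on the next scan instead of waiting for its timers to expire.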
That, in general, is all. Thank you for reading to the end.
I will be glad to answer your questions in the comments.
PS
By the way, my colleagues from the system group managed to turn those Python test utilities into a decent Linux driver and a daemon in Rust that builds NetFlow packets.
Maybe they will write about it someday. For now they are resting after the article about another of our joint projects, accelerating AES encryption. If you are interested, you can read it here.
Source: https://habr.com/ru/post/327894/