Hello! Some time ago, ksdaemon invited me onto the SD podCast podcast. It turned out to be a very interesting journey through the depths of our project, shining light into corners nobody had ever asked about before :) An excellent demonstration of the saying: if you want to understand something (even your own product!), explain it to others.
Since the Habr audience generally prefers text, we decided to also publish the conversation plan that was prepared in advance. It differs slightly from what was actually said, but it contains all the useful information we wanted to share.
For those who are not yet familiar with the product, I recommend starting with our site and the project's GitHub.
Below is the text version of the conversation from Software Development podCast episode 58.
A few words about yourself
My name is Pavel Odintsov, I am the author of the FastNetMon project for detecting DDoS attacks. I now live in London and mainly work on the design and programming of network-related systems.
The standard question: how did you get into IT? How long have you been doing it, and what exactly?
Getting into IT was hardly accidental: I inherited my love of technology from my father, and my attention to detail and ability to concentrate on a specific problem from my mother.
I spent much of my childhood among stacks of magazines like Tekhnika Molodezhi (Technology for Youth) and Radio, which often described how to assemble a personal computer on your own. At the time that was beyond my means, of course, but the interest was born back then, and with the arrival of affordable computers (mine was a Celeron 266 with 32 MB of RAM) I finally got the chance to go beyond reading magazines and try everything in practice!
Then came years of reading books and magazines (mostly PC World, sometimes Hacker), sitting in IRC chat rooms (hello RusNet and DalNet!) and studying technical documentation on the Internet, which back then I only had access to at the dacha, at 33k dial-up speed.
After a while, affordable GPRS and satellite Internet providers appeared in my city, Samara, and that is probably when my professional practice began. It all started when an ICQ acquaintance asked me to write a simple script in Perl, which I was learning at the time. That project also brought the realization that developing on Windows was rather painful, so I decided to switch to Linux.
Over time, the hobby turned into quite confident knowledge of both Perl and Linux, and I got a job at the REG.RU domain registrar as a Perl programmer. In practice, though, I ended up handling a wide variety of Linux-related tasks, not just programming.
A few words about what these attacks are, what their subtypes are, how an attack usually unfolds, how it works and why
Since the main topic of the podcast is the FastNetMon project, I will speak in its context. There are many kinds of DoS/DDoS attacks, and we do not set ourselves the task of protecting users from all of them.
First of all, we focus on volumetric attacks carried out over L3/L4 protocols.
These attacks aim to exhaust channel capacity or equipment performance in order to disrupt the correct functioning of a service.
Often this is an attack on a particular site, but there are also cases where the attack targets the infrastructure of a specific operator or the company as a whole, which is much more dangerous.
Speaking of current attack types, the main ones used to exhaust channel capacity are NTP, SSDP, SNMP and DNS amplification. Their essence is quite simple: an attacker-controlled node sends forged requests, with the victim's address substituted for its own, to thousands (sometimes hundreds of thousands) of Internet hosts running a service vulnerable to this type of attack. On receiving such a request, each of these (often perfectly legitimate) services generates a response and sends it to the victim, and together they produce a huge flood of traffic that knocks the victim offline. The response is typically many times larger than the request: published estimates put DNS amplification at several dozen times and NTP's monlist in the hundreds, so even a modest uplink on the attacker's side is enough.
In addition, it is worth noting attacks that use spoofing directly; they are often launched either from misconfigured equipment of hosting providers and data centers, or from special hosting services that offer this on the black market. They are harder to deal with and can be very sophisticated.
What are the ways to deal with such attacks? Possible solutions
The typical scenario of dealing with channel-exhaustion attacks is rather sad. Usually it is not the owner of the site, VPS or server who faces them, but the system and network administrators of the data center or hosting company.
If the company has no tools for transparent traffic monitoring, the first steps resemble panic: everything is down and nobody understands what happened.
Usually this is followed by an attempt to capture traffic samples from routers, servers and switches, typically with tcpdump or a specific vendor's built-in tools.
Almost always the dump makes it obvious that an attack is under way against a specific IP address on the network, and it is often possible to identify a pattern (for example, that the attack consists of UDP packets coming from port 53, a clear marker of DNS amplification).
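To make that pattern concrete, here is a minimal libpcap sketch (purely illustrative, not FastNetMon code) that counts UDP packets by source port over a sample of traffic; a flood from source port 53 toward your network is exactly the DNS amplification marker described above. The interface name, sample size and reporting cutoff are assumptions.

```cpp
// Count UDP packets per source port on a traffic sample (illustrative).
#include <pcap.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <arpa/inet.h>
#include <cstdio>
#include <map>

static std::map<unsigned, unsigned long> packets_per_src_port;

static void on_packet(u_char*, const pcap_pkthdr*, const u_char* pkt) {
    // Assume Ethernet framing: the IP header starts 14 bytes in.
    const ip* iph = reinterpret_cast<const ip*>(pkt + 14);
    if (iph->ip_p != IPPROTO_UDP) return;
    const udphdr* udph =
        reinterpret_cast<const udphdr*>(pkt + 14 + iph->ip_hl * 4);
    packets_per_src_port[ntohs(udph->uh_sport)]++;
}

int main() {
    char err[PCAP_ERRBUF_SIZE];
    // "eth0" is an assumption; 128 bytes per packet is enough for headers.
    pcap_t* cap = pcap_open_live("eth0", 128, 1, 1000, err);
    if (!cap) { std::fprintf(stderr, "%s\n", err); return 1; }
    pcap_loop(cap, 10000, on_packet, nullptr); // sample 10,000 packets

    for (const auto& [port, count] : packets_per_src_port)
        if (count > 1000) // arbitrary cutoff: show only dominant ports
            std::printf("source port %u: %lu packets\n", port, count);
}
```

If almost the entire sample arrives from source port 53 (DNS) or 123 (NTP), you are most likely looking at an amplification attack.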
After that, a BGP Blackhole announcement is usually made for the node under attack, to cut off the parasitic traffic at the upstream operator's level. If the company's network is large enough, with ample spare capacity and modern equipment that supports BGP Flow Spec, it becomes possible not to block the entire node but to cut off only the parasitic traffic and keep the service running.
One possible way to protect against such attacks is to use "traffic filtering centers", but they come with a number of problems; in particular, someone still has to decide which traffic to divert to the filtering center, and when.
FastNetMon's goal is full automation of every step: detecting the fact of an attack, determining its type and deploying countermeasures. Usually this takes it no more than 5 seconds, with no human intervention at all. Of course, we also support the scenario where the client relies on traffic filtering centers for protection: FastNetMon can be used to divert traffic to the filtering center when an attack occurs.
How did the idea of writing it come about?
The idea was born back when I worked in the hosting industry: I had to solve the tasks described above dozens, maybe hundreds of times by hand, each time determining the type of the attack and repelling it manually.
What existed before this implementation? What were the alternatives?
Being a lead engineer, my task was not so much to write a solution as to find one that fit the task and the budget.
We reviewed a lot of products, but the decisive factor was always the price: it was completely unaffordable, exceeding the cost of our entire fleet of network equipment tenfold, which made deploying them completely unjustified.
How were DDoS protection issues solved before?
Manually, by phoning the administrator on duty in the middle of the night :)
The principle of the system
The key principle on which FNM is based is the concept of a traffic threshold. A threshold is the amount of traffic going to or from a node in our network (in megabits, flows per second or packets per second) beyond which the traffic is considered anomalous and a threat to the network. The right values differ from case to case, and often differ between nodes even within the same network.
Once this threshold is exceeded, either the node is unconditionally blocked, or all of its traffic is captured and analyzed to determine the target of the attack and the parameters of the spurious packets.
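As a minimal sketch of this idea (the names, units and default values here are illustrative assumptions, not FastNetMon's actual configuration):

```cpp
// Illustrative per-host threshold check; not FastNetMon's real internals.
#include <cstdint>

struct Threshold {
    uint64_t mbits_per_second = 1000;     // channel load
    uint64_t packets_per_second = 100000; // packet rate
    uint64_t flows_per_second = 5000;     // flow rate
};

struct HostRates {
    uint64_t mbits = 0, packets = 0, flows = 0; // measured over one second
};

// Traffic is anomalous if it exceeds the threshold on any one axis.
bool over_threshold(const HostRates& r, const Threshold& t) {
    return r.mbits > t.mbits_per_second ||
           r.packets > t.packets_per_second ||
           r.flows > t.flows_per_second;
}
```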
Internal organization
Inside, FastNetMon is a pipeline that accepts traffic at its input in almost any format.
Now we support:
- sFlow v4
- sFlow v5
- Netflow v5
- Netflow v9
- IPFIX
- Span
- Mirror
- PF_RING
- Netmap
- SnabbSwitch
After that, the traffic is converted from vendor-specific formats into an internal, universal representation.
Then, for each node in the network, a set of counters is created, broken down by protocol (TCP, UDP, ICMP), by flags (for example, TCP SYN) and by IP options (presence of fragmentation), and a separate tracking subprocess detects the moment the rate on a specific traffic counter exceeds the user-defined threshold for a unit of time.
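A rough sketch of what such per-host counters could look like (names and layout are illustrative assumptions, not the project's real code):

```cpp
// Per-host counters broken down the way the text describes:
// by protocol, by TCP SYN flag, by fragmentation (illustrative).
#include <cstdint>
#include <unordered_map>

struct HostCounters {
    uint64_t tcp_packets = 0, udp_packets = 0, icmp_packets = 0;
    uint64_t tcp_syn_packets = 0;    // flag-specific counter
    uint64_t fragmented_packets = 0; // IP-option-specific counter
    uint64_t bytes = 0;
};

// One counter set per IPv4 address, for each traffic direction.
std::unordered_map<uint32_t, HostCounters> incoming_counters;
std::unordered_map<uint32_t, HostCounters> outgoing_counters;

// A tracking pass runs once per second: it turns counter deltas into
// rates and compares each rate against the user-defined threshold.
```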
After that a little magic begins: statistical DPI methods come into play to determine the type of the attack and select the most appropriate countermeasures.
And finally, either a script is called or a BGP announcement is generated, to block either all of the node's traffic or only the parasitic part via BGP Flow Spec.
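To illustrate the BGP path: one common approach is to hand the announcement to a BGP daemon such as ExaBGP, which reads commands that a helper process prints to stdout. A minimal sketch, with example addresses and the well-known BLACKHOLE community from RFC 7999 (the surrounding ExaBGP configuration is assumed):

```cpp
// Illustrative ExaBGP helper: print an "announce" command to stdout,
// where ExaBGP picks it up and propagates it to the routers.
#include <cstdio>
#include <unistd.h>

void blackhole(const char* victim_ip) {
    // 65535:666 is the well-known BLACKHOLE community (RFC 7999):
    // the upstream drops all traffic toward this /32.
    std::printf("announce route %s/32 next-hop self community [65535:666]\n",
                victim_ip);
    std::fflush(stdout); // ExaBGP reads line by line, so flush immediately
}

int main() {
    blackhole("192.0.2.10"); // example victim address
    // A real helper keeps running and later prints a matching
    // "withdraw route ..." line to lift the block.
    for (;;) sleep(60);
}
```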
External API
For the most part there is no API in the usual sense.
FastNetMon can export information to Graphite and InfluxDB to visualize traffic.
To receive information we rely on well-standardized protocols such as sFlow, IPFIX and NetFlow, so if a vendor implements them correctly, support on our side comes automatically.
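Graphite, for instance, accepts its plaintext protocol: one "metric value unix_timestamp" line per data point, sent to TCP port 2003. A minimal sketch (the host, metric name and values are examples):

```cpp
// Send one data point to Graphite's plaintext interface (illustrative).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <ctime>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(2003); // Graphite's plaintext protocol port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); // example host
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0)
        return 1;

    // Dots separate path components, so the IP is written with underscores.
    char line[128];
    int n = std::snprintf(line, sizeof line,
                          "hosts.192_0_2_10.incoming.pps %lu %ld\n",
                          250000UL, static_cast<long>(std::time(nullptr)));
    write(fd, line, n);
    close(fd);
}
```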
Plugin system
It emerged on its own, once we understood that the world is very complex and a single protocol would not be enough (at the time we only captured traffic from a mirror/SPAN interface). We started adding new protocols, sFlow and NetFlow, and when it came time to add a fourth we carried out a serious refactoring and solemnly isolated each traffic capture module into a separate library with a fixed external API. Now anyone can quite easily develop a plugin implementing their own, special way of capturing traffic telemetry.
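To illustrate the idea, such a fixed capture API can be boiled down to a unified packet structure plus a callback that every plugin feeds. This is a sketch under invented names, not the project's actual interface:

```cpp
// Sketch of a capture-plugin API: every plugin, whatever its capture
// method, converts its input into one unified structure and reports it
// through the same callback (names are illustrative).
#include <cstdint>
#include <cstdio>

struct SimplePacket {
    uint32_t src_ip = 0, dst_ip = 0;
    uint16_t src_port = 0, dst_port = 0;
    uint8_t  protocol = 0;      // IPPROTO_TCP, IPPROTO_UDP, ...
    uint8_t  tcp_flags = 0;     // e.g. SYN
    uint64_t length = 0;        // bytes on the wire
    uint64_t sample_ratio = 1;  // 1 for mirrored traffic, >1 for sampled flows
};

using process_packet_cb = void (*)(const SimplePacket&);

// A toy "plugin"; in reality each capture module lives in its own
// library and implements sFlow, NetFlow, mirror capture and so on.
void toy_plugin_start(process_packet_cb callback) {
    SimplePacket p;
    p.protocol = 17;  // UDP
    p.length = 1400;
    callback(p);
}

int main() {
    toy_plugin_start([](const SimplePacket& p) {
        std::printf("packet: proto=%u len=%llu\n", p.protocol,
                    static_cast<unsigned long long>(p.length));
    });
}
```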
Documentation
This is a sore point, really, and I'm pretty sure it is for many open source projects too. There is usually no time left for documentation, but we try to consider every GitHub issue and mailing list request carefully and give the most detailed answers possible, which we then refer back to later. Unfortunately, there is no complete documentation describing every stage of the project's operation.
Testing
Over the years of the project's life we have accumulated a very large number of pcap dumps, covering nearly a hundred different device models. We use them in our internal testing system whenever we change the parsers.
Unfortunately, these dumps almost always contain clients' confidential information, so publishing them as open source is impossible; we are very careful about storing this data, and access to it is tightly controlled.
In addition, for critical and complex protocols (BGP Flow Spec, for example), we have unit tests.
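Such a test typically pins the encoder's output to a hand-verified expectation, so a refactoring cannot silently change the wire-facing result. A schematic example with an invented encoder, not the project's real test code:

```cpp
// Schematic unit test for a rule encoder (everything here is invented
// for illustration; FastNetMon's real tests cover its BGP Flow Spec code).
#include <cassert>
#include <string>

// Toy encoder: turn (protocol, source port) into a textual filter rule.
std::string encode_rule(const std::string& proto, int src_port) {
    return "match { protocol " + proto + "; source-port " +
           std::to_string(src_port) + "; } then { discard; }";
}

int main() {
    // The expected string was verified by hand once; the test guards it.
    assert(encode_rule("udp", 53) ==
           "match { protocol udp; source-port 53; } then { discard; }");
}
```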
Adapting to different platforms
Today we support almost any Linux distribution, we have an official FreeBSD port, and we made it into the official Debian package archive. Some time ago we added macOS support, simply because I wanted to play with it on my laptop :)
We write code in the most portable way possible, using the APIs available on each platform; porting to FreeBSD, for example, required changing literally 4 functions (different names were used for some constants).
The main problem is the very wide range of supported distributions and the frequent absence of the required library version in a distribution's repository. Right now this is solved inelegantly: on each platform, dependencies are built from source at installation time. We don't like it, but, alas, building binary packages for almost 20 supported distributions is a very difficult task for us. We tried, but gave up rather quickly: it turns into an extremely complex system.
What technologies is it built on (languages, frameworks, modules), and why were they chosen?
The project's main language is C++. We use the STL very actively, and Boost where the STL offers no solution. The code has few external dependencies, owing to the specifics of the project and the scarcity of existing open implementations in this area: only the essentials, plus connectors to the databases.
But we actively use external projects such as ExaBGP, InfluxDB, Graphite, Grafana and GoBGP for traffic visualization and for interacting with the outside world.
One of the main criteria for choosing a project to integrate with is the presence of an API and overall friendliness to developers.
For example, BGP daemons such as Quagga and Bird offer very poor integration options, so despite their popularity they did not suit us.
What can you say about fault tolerance, scaling and system performance? How are these issues solved?
Mostly, high availability is solved at the architecture level: the BGP protocol is inherently well suited to redundancy, so we don't have to make any special effort on that side. Normally FastNetMon announces nodes under attack to at least two independent routers on the network.
To make FastNetMon itself fault tolerant, you can usually just duplicate the traffic telemetry to a second instance (routers and switches typically support this, and beyond that you can always use samplicator). If one instance dies, the second will do the job and block the traffic.
As for load scaling, we have confirmed deployments on networks carrying almost 1.4 Tbps of traffic; those figures were achieved with a NetFlow v9 collector, and there was still plenty of headroom left.
If that is not enough, you can always split the traffic by any criterion and install additional FastNetMon instances.
Why open source?
The project was open from its very first steps; we have no moment in our history where hundreds of thousands of lines of code landed in a single commit - you can follow the evolution from the very beginning!
In general, though, the question never really arose: the goal was to create a project that would solve the problem not only for one particular provider, but in a generalized form, without being tied to specific details.
So there was only one way - open source! Otherwise the result would have been a narrowly specialized solution fixing one very small problem in one particular niche.
What benefits do you see in publishing an open source project? Of course, the huge impact. When you see your solution used in 103 countries, including Cuba, it is inspiring :)
What is the difference between open-source projects and closed commercial developments?
There are “closed” open-source projects and there are very “open” commercial projects with closed code.
What matters here is not so much the openness of the code as the philosophy of the project: that it, too, is open - to changes and improvements.
For many, open source is a guarantee of confidence in the future: that new management at the vendor will not change the licensing policy, that the product will survive the company going bankrupt, and that if support cannot help, you can always figure it out yourself or find a specialist who will.
Given the many spyware scandals, open source looks even more attractive: security claims can always be backed by a reference to the code and the possibility of independent verification.
The community around the project: how the project lives and develops. How much contribution comes from outside? Requests for new features and bug reports.
The main contributions to the project fall into several areas:
- Integration with various vendors (A10 Networks, Radware, Mikrotik).
- Requests for new functionality from existing users, with detailed descriptions of their use cases - very important information for the project's development.
- Testing across the widest possible variety of network devices and their software.
- Packaging for various distributions (AltLinux, FreeBSD, Debian).

Unfortunately, there is very little direct contribution to the core of the project (the attack detection and analytics modules themselves); most of the development there we do ourselves.
Plans for the further development of the project
The main goal is to accelerate the project's development and attract more developers to work on the system core itself.
The attacking side is moving very fast these days, and keeping countermeasures up to date with the new threats is quite difficult, but we are trying.
As part of this plan, a few months ago we launched a commercial version, FastNetMon Advanced, which adds a number of features critical for large companies and for large networks of TIER-2 class and above. They mainly concern simplified deployment, operation and more flexible management on large networks. Both products share the same project core.
How do you design the right architecture, one that lets the project grow without drastic changes and without rewriting most of the code? :)
When you see a lot of requests for new functionality from customers, your hands literally itch to grab them and implement everything at once! In the first stages of a project's life that is exactly what you should do, while it is still unclear whether anyone needs the project and it is searching for its niche.
Later, though, it is worth stopping from time to time and asking yourself: "how can this be done in a generic way?" and "who else might need it?". It often helps to freeze a request for a few weeks, think through the required architectural changes, make sure the idea is right, and only then start developing it.
In addition, it is worth pausing now and then simply to review the code of a particular subsystem carefully, looking for places that can be unified or improved.
Also, if you feel that some subsystem has exceeded all conceivable limits of complexity (in our case that was BGP Flow Spec), you should think about thorough unit test coverage, because careful reading alone is no longer enough.
How important is it to think through the architecture at the start, and how do you stick to it later as the project develops, without letting it slide into monstrosity and a pile of hacks? :)
We had no architecture at the start - only vague goals of what we wanted to achieve. In the first steps it was not even clear whether the project would work at all and cope with a load of tens of millions of packets per second! That is a lot, and in the beginning we could not handle even 120,000 packets per second!
So, as I said earlier, it is worth stopping from time to time and thinking about how the system can be divided into modules.
How important are documentation and testing?
As many like to say, for an open source project the best documentation is the code. I fundamentally disagree: in the entire history of the project, only a handful of people have carefully read the code and answered their questions that way.
But a lack of documentation can always be compensated by a responsive community and a quick developer reaction, and we can boast of both! Questions are often resolved long before I even get to a ticket: some community member decides to answer and solves the problem without any involvement from the developers at all :)
As for testing, in my opinion you need to keep a balance. Covering obvious things with tests is completely pointless, a waste of time you could spend productively. But if a system's complexity is very high, or the cost of failure when interacting with external systems is high, tests are absolutely necessary.
Social aspects of open-source development: bad PRs, stupid questions and mistakes
We have one cool social aspect: we often ask our users for help in getting a particular vendor to fix their problems.
The way many network equipment vendors handle bug fixing, an external company with no contract and no deployed equipment simply has no way to file a bug and get it fixed.
As a result we get dozens of reports like "FastNetMon does not work with device XXX from vendor YYY". At first we could only apologize that the vendor had a problem and we could do nothing about it.
Now we help investigate and isolate the problem in as much detail as possible, and then ask the client to file the bug-fix request with the vendor on their own behalf - and it works! Very many people cooperate, and in doing so they solve the problem for many other users of the same vendor!