
A neural network against DDoS

Foreword


Some of you may have recently taken the Stanford courses, in particular ai-class and ml-class. However, it is one thing to watch a few video lectures, answer quiz questions and write a dozen programs in Matlab / Octave, and quite another to start applying this knowledge in practice. So that the knowledge received from Andrew Ng did not end up in the same dark corner of my brain where dft, Special Relativity and the Euler-Lagrange Equation got lost, I decided not to repeat my institute mistakes and, while the knowledge was still fresh in my memory, get as much practice as possible.

And just then a DDoS arrived at our site. We could fight it off with admin-programmer (grep / awk / etc.) methods or resort to machine learning technologies.

What follows is the story of creating a neural network in Python 2.7 / PyBrain and using it to protect against DDoS.

Introduction


What we have: a server under DDoS from a mediocre botnet (20,000-50,000 hosts) and its access.log. Having an access.log from before the DDoS started is very useful, because it describes almost 100% of the legitimate clients and is therefore an excellent dataset for training the neural network.

What we want to get: a classifier that, given an access.log entry, would tell us with some probability who the request came from: a bot or a human. Using this classifier we can form an attack vector (a set of subnets) and send it to the firewall / hosting provider.

Training


To train the classifier we need two sets of data: one for "bad" requests and one for "good" ones. Yes, it is quite funny that in order to find bots you must first find bots. Here plain grep is enough, for example to pull out of the log all requests from IPs with a large number of 503 errors (nginx rate limiting).
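Something like the following rough Python sketch could do that grep step (the file names and the threshold of 10 errors are my assumptions, not values from the original script):

 # Collect IPs with many 503s (nginx rate limiting) and split the log
 # into "bad" and "good" parts based on them.
 from collections import Counter

 LOG = 'access.log'   # hypothetical path
 THRESHOLD = 10       # hypothetical: minimum number of 503s per IP

 hits = Counter()
 for line in open(LOG):
     parts = line.split()
     # in the combined format the status code is the 9th field
     if len(parts) > 8 and parts[8] == '503':
         hits[parts[0]] += 1

 bad_ips = set(ip for ip, n in hits.items() if n >= THRESHOLD)

 with open('bad.log', 'w') as bad, open('good.log', 'w') as good:
     for line in open(LOG):
         (bad if line.split()[0] in bad_ips else good).write(line)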

As a result, we should end up with three access.log files:

Next, you need to learn to parse the so-called combined log format of nginx. Surely there is a ready-made regex for it somewhere on the Internet, but I reinvented the wheel:
 (?P<ip>[0-9.:a-f]+) [^ ]+ [^ ]+ \[.*\] "(?P<url>.*)" (?P<code>[0-9]+) (?P<size>[0-9]+) "(?P<refer>.*)" "(?P<useragent>.*)"$ 


/* Because I wrote yet another log parser, I was once again reminded of the idea of serializing logs as json right from the start, since production logs are mostly read not by people but by programs. */
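For illustration, here is a minimal sketch of applying this regex to a single log line (the group names are the ones from the regex above):

 import re

 LINE_RE = re.compile(
     r'(?P<ip>[0-9.:a-f]+) [^ ]+ [^ ]+ \[.*\] "(?P<url>.*)" '
     r'(?P<code>[0-9]+) (?P<size>[0-9]+) "(?P<refer>.*)" "(?P<useragent>.*)"$')

 m = LINE_RE.match('0.0.0.0 - - [20/Dec/2011:20:00:08 +0400] '
                   '"POST /forum/index.php HTTP/1.1" 503 107 '
                   '"http://www.mozilla-europe.org/" "-"')
 if m:
     print m.groupdict()['url']  # POST /forum/index.php HTTP/1.1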

Once we have learned to parse the log, we need to select features (markers / attributes), compile a dictionary of them, and then build a feature vector for each dataset entry. More on that next.

Machine learning in general


We stopped at the selection of features from the logs. For this we take the "good" and "bad" logs and try to find all possible features in them. There are not that many of them in the combined log; I took the following (the full set can be seen in the dictionary example below):

- request method and HTTP version;
- URL: scheme, netloc, path and query-string keys;
- response code;
- referer: scheme, netloc and path (or the absence of a referer);
- User-Agent: its base, version and OS (or an empty User-Agent).

/* By the way, the first version of the script that I used to protect against the DDoS contained a nasty bug, due to which about half of the features were not taken into account. Even in that state the code managed to work tolerably well, which is why the bug stayed hidden for a long time. In this regard Andrew turned out to be very right when he said that it is not the smartest algorithm that wins, but the one that is fed more data. */

So, we have reached the point where we have on hand a huge list of all possible features that may be present in a request. This is our dictionary. The dictionary is needed in order to build a feature vector from any possible request: a binary (consisting of zeros and ones) M-dimensional vector (where M is the length of the dictionary) that reflects the presence of each feature from the dictionary in the request.
It is very good if, from the data structure point of view, the dictionary is a hash table, because there will be a lot of lookups like if word in dictionary .
Read more about this in Lecture Notes 11: Machine Learning System Design.

An example of building a dictionary and a feature vector


Suppose we teach our neural network with just two examples: one good and one bad. Then we will try to activate it on the test record.

Record from the "bad" log:
 0.0.0.0 - - [20/Dec/2011:20:00:08 +0400] "POST /forum/index.php HTTP/1.1" 503 107 "http://www.mozilla-europe.org/" "-" 


Record from the "good" log:
 0.0.0.0 - - [20/Dec/2011:15:00:03 +0400] "GET /forum/rss.php?topic=347425 HTTP/1.0" 200 1685 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0" 


The resulting dictionary:
 ['__UA___OS_U', '__UA_EMPTY', '__REQ___METHOD_POST', '__REQ___HTTP_VER_HTTP/1.0', '__REQ___URL___NETLOC_', '__REQ___URL___PATH_/forum/rss.php', '__REQ___URL___PATH_/forum/index.php', '__REQ___URL___SCHEME_', '__REQ___HTTP_VER_HTTP/1.1', '__UA___VER_Firefox/3.0', '__REFER___NETLOC_www.mozilla-europe.org', '__UA___OS_Windows', '__UA___BASE_Mozilla/5.0', '__CODE_503', '__UA___OS_pl', '__REFER___PATH_/', '__REFER___SCHEME_http', '__NO_REFER__', '__REQ___METHOD_GET', '__UA___OS_Windows NT 5.1', '__UA___OS_rv:1.9', '__REQ___URL___QS_topic', '__UA___VER_Gecko/2008052906'] 


Test record:
 0.0.0.0 - - [20/Dec/2011:20:00:01 +0400] "GET /forum/viewtopic.php?t=425550 HTTP/1.1" 502 107 "-" "BTWebClient/3000(25824)" 


Its feature vector:
 [False, False, False, False, True, False, False, True, True, False, False, False, False, False, False, False, False, True, True, False, False, False, False] 


Notice how sparse the feature vector is; this behavior will be observed for all requests.
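In code, building the dictionary and a feature vector could look roughly like this (a toy sketch, not the actual script; extract_features() is a hypothetical helper that turns a parsed log entry into markers like '__REQ___METHOD_POST'):

 # Dictionary = union of all features seen in both logs;
 # feature vector = presence of each dictionary word in the request.
 def make_dictionary(good_entries, bad_entries):
     vocabulary = set()
     for entry in good_entries + bad_entries:
         vocabulary.update(extract_features(entry))  # hypothetical helper
     return list(vocabulary)

 def make_vector(dictionary, entry):
     features = set(extract_features(entry))  # a set gives O(1) lookups
     return [word in features for word in dictionary]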

Dataset partitioning


It is good practice to split the dataset into several parts. I split it into two parts in a 70/30 proportion:

- Training set (70%)
- Test set (30%)

Such a partition is due to the fact that a neural network with the smallest training error (the error on the training set) will produce a larger error on new data, because we have "overfitted" the network, tuning it to the training set.
In the future, if you have to bother with choosing optimal constants, the dataset will have to be split into 3 parts in a 60/20/20 proportion: Training set, Test set and Cross validation set. The latter serves for selecting the optimal parameters of the neural network (for example, weightdecay).
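With PyBrain the split itself is a one-liner. A sketch (the class labels 0 = human, 1 = bot are my convention, not necessarily the original script's):

 from pybrain.datasets import ClassificationDataSet

 ds = ClassificationDataSet(len(dictionary), nb_classes=2)
 for vector in good_vectors:
     ds.addSample(vector, [0])  # 0 = human
 for vector in bad_vectors:
     ds.addSample(vector, [1])  # 1 = bot

 test_ds, train_ds = ds.splitWithProportion(0.3)  # 30% test, 70% train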

Neural network in particular


Now that we no longer have any text logs, only matrices of feature vectors, we can start building the neural network itself.

Let's start with the choice of structure. I chose a network with one hidden layer twice the size of the input layer. Why? It's simple: that is what Andrew Ng bequeathed for the case when you don't know where to start. I think that in the future you can play around with this by plotting learning curves.
The activation function for the hidden layer is the long-suffering sigmoid, and for the output layer it is Softmax. The latter is chosen for the case when you have to do multi-class classification with mutually exclusive classes. For example, send "good" requests to the backend, ban "bad" ones on the firewall, and make "gray" ones solve a captcha.

A neural network is prone to getting stuck in a local minimum, so in my code I build several networks and choose the one with the smallest test error (note: the error on the test set, not the training set).
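Put together, the training loop could look like this (a sketch along the lines of the standard PyBrain classification tutorial; the number of restarts, the epoch limit and the weightdecay value are arbitrary choices of mine):

 from pybrain.tools.shortcuts import buildNetwork
 from pybrain.structure import SigmoidLayer, SoftmaxLayer
 from pybrain.supervised.trainers import BackpropTrainer
 from pybrain.utilities import percentError

 # the softmax output layer expects one-of-many target encoding
 train_ds._convertToOneOfMany()
 test_ds._convertToOneOfMany()

 best_net, best_error = None, 100.0
 for attempt in range(5):  # several restarts to dodge local minima
     net = buildNetwork(train_ds.indim, train_ds.indim * 2, train_ds.outdim,
                        hiddenclass=SigmoidLayer, outclass=SoftmaxLayer)
     trainer = BackpropTrainer(net, dataset=train_ds, weightdecay=0.01)
     trainer.trainUntilConvergence(maxEpochs=100)
     error = percentError(trainer.testOnClassData(dataset=test_ds),
                          test_ds['class'])
     if error < best_error:  # keep the net with the lowest *test* error
         best_net, best_error = net, error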

Disclaimer


I am not a real welder. About machine learning I know only what I learned from ml-class and ai-class. I started programming in Python relatively recently, and the code below was written in 30 minutes (time, as you understand, was running out) and was only slightly polished afterwards.

Code


The script itself is small - 2-3 A4 pages depending on the font. You can view it here: github.com/SaveTheRbtz/junk/tree/master/neural_networks_vs_ddos

Also, this code is not self-sufficient. It still needs wrapper scripts around it. For example: if an IP made N bad requests within X minutes, ban it on the firewall.
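A hypothetical wrapper of that kind could look as follows (all names and thresholds here are made up):

 import subprocess
 import time
 from collections import defaultdict, deque

 N, X = 10, 5 * 60  # 10 "bot" verdicts within 5 minutes
 verdicts = defaultdict(deque)
 banned = set()

 def report_bad_request(ip):
     now = time.time()
     q = verdicts[ip]
     q.append(now)
     while q and now - q[0] > X:  # drop verdicts older than the window
         q.popleft()
     if len(q) >= N and ip not in banned:
         banned.add(ip)
         subprocess.call(['iptables', '-A', 'INPUT', '-s', ip, '-j', 'DROP'])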

Performance



For the future



More to read


Machine Learning Lecture Notes - lecture notes from the ml-class course.
UFLDL - Unsupervised Feature Learning and Deep Learning. Also from Professor Andrew Ng.

Instead of a conclusion


It is very interesting how the real welders, such as highloadlab, approach solving this classification problem. However, something tells me that their algorithms are a secret on the order of Yandex's ranking formulas.

Source: https://habr.com/ru/post/136237/

