Foreword
Some of you may have recently taken the Stanford online courses, in particular
ai-class and
ml-class. However, it is one thing to watch video lectures, answer quiz questions and write a dozen programs in
Matlab
/
Octave
, and quite another to start using this knowledge in practice. So that the knowledge gained from Andrew Ng would not fall into the same dark corner of my brain where the
DFT,
Special Relativity and the
Euler-Lagrange equation got lost, I decided not to repeat my university mistakes and to practice as much as possible while the knowledge was still fresh in my memory.
And just then a DDoS arrived at our site. It could be fought off by admin/programmer means (
grep
/
awk
/ etc.) or by resorting to machine learning technologies.
What follows is the story of creating a neural network in Python 2.7 /
PyBrain and using it to protect against the DDoS.
Introduction
What we have: a server under DDoS from a mediocre botnet (20,000-50,000 hosts) and the access.log from that server. Having an access.log from before the DDoS started is very useful, because it describes almost 100% of the legitimate clients and is therefore an excellent dataset for training the neural network.
What we want to get: a classifier that, given an access.log entry, tells us with some probability whether the request came from a bot or from a human. Using this classifier we can form an attack vector (a set of subnets) and send it to the firewall or the hosting provider.
Training
To train the classifier we need two datasets: one of "bad" requests and one of "good" ones. Yes, it is rather funny that in order to find bots you must first find bots. Plain grep is enough here, for example to pull out of the log all requests from IPs with a large number of 503 errors (nginx rate limiting).
As a result, we should have three access.logs:
- a dataset of "good" requests: the access.log from before the DDoS started;
- a dataset of "bad" requests: the ones extracted in the previous step;
- the dataset that we need to classify into good and bad; usually this is tail -f access.log from the server under DDoS.
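The grep step above can also be expressed as a small Python sketch: count 503 responses per IP and pick the noisiest ones (the threshold here is an arbitrary illustration, tune it to your traffic):

```python
from collections import Counter

def bad_ips(log_lines, threshold=100):
    """Return IPs that got more than `threshold` 503 responses
    (nginx rate limiting), which we assume to be bots."""
    counts = Counter()
    for line in log_lines:
        ip = line.split(' ', 1)[0]
        parts = line.split('"')
        try:
            # the status code is the first token after the quoted request
            code = parts[2].split()[0]
        except IndexError:
            continue
        if code == '503':
            counts[ip] += 1
    return {ip for ip, n in counts.items() if n > threshold}
```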
Next, you need to learn to parse the so-called combined nginx log format. Surely a ready-made regular expression already exists somewhere on the Internet, but I invented my own bicycle:
(?P<ip>[0-9a-fA-F.:]+) [^ ]+ [^ ]+ \[.*\] "(?P<url>.*)" (?P<code>[0-9]+) (?P<size>[0-9]+) "(?P<refer>.*)" "(?P<useragent>.*)"$
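In Python, applying this regular expression looks roughly like this (a sketch; the IP character class is widened to the full hex range so IPv6 addresses match):

```python
import re

LINE_RE = re.compile(
    r'(?P<ip>[0-9a-fA-F.:]+) [^ ]+ [^ ]+ \[.*\] "(?P<url>.*)" '
    r'(?P<code>[0-9]+) (?P<size>[0-9]+) "(?P<refer>.*)" "(?P<useragent>.*)"$'
)

def parse_line(line):
    """Parse one combined-log line into a dict, or return None on mismatch."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None
```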
/* Since I ended up writing yet another log parser, I was once again reminded of the idea of serializing logs as JSON at the source, because production logs are mostly consumed by programs, not people. */
Once we can parse the log, we need to select features (markers/attributes), compile a dictionary of them, and then compile a feature-vector for each dataset entry. More on that next.
Machine learning in general
We stopped at selecting features from the logs: we take the "good" and "bad" logs and look for every feature (a.k.a. attribute/marker) we can find in them. The combined log does not contain that many, so I took the following:
- The request line itself, parsed into the request method (HEAD/GET/POST/etc.), the URL and the HTTP version. The URL is further parsed into the scheme, host name, path and all keys from the query string.
- The Referer, parsed the same way as the URL in the request line.
- The User-Agent, parsed with pure street magic, because its format varies wildly from browser to browser and RFC 2616 says very little about it. Surely this can be done much better.
- The status code, but only codes 503/404/403. During a DDoS the server is fond of answering 500/502 anyway, so we account only for the codes listed above.
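A simplified sketch of what this feature extraction might look like (Python 3 here; real User-Agent parsing is far messier, and the feature naming merely mimics the dictionary example shown later):

```python
from urllib.parse import urlparse, parse_qs

def extract_features(entry):
    """entry is a dict with the regex groups above: 'url' (which holds the
    whole request line), 'code' and 'refer'.  Returns feature strings."""
    feats = []
    method, path, http_ver = (entry['url'].split(' ') + ['', ''])[:3]
    feats.append('__REQ___METHOD_' + method)
    feats.append('__REQ___HTTP_VER_' + http_ver)
    parsed = urlparse(path)
    feats.append('__REQ___URL___PATH_' + parsed.path)
    for key in parse_qs(parsed.query):          # one feature per query key
        feats.append('__REQ___URL___QS_' + key)
    if entry['code'] in ('503', '404', '403'):  # only the codes we care about
        feats.append('__CODE_' + entry['code'])
    if entry['refer'] == '-':
        feats.append('__NO_REFER__')
    return feats
```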
/* By the way, the first version of the script I used against the DDoS contained a nasty bug that silently dropped about half of the features. Even in that state the code managed to work tolerably well, which is why the bug stayed hidden for a long time. In this respect Andrew turned out to be very right when he said that it is not the smartest algorithm that wins, but the one that is fed more data. */
So we have arrived at the point where we have on hand a huge list of all the features that may appear in a request. This is our dictionary. The dictionary is needed to build a feature-vector from any possible request: a binary (consisting of zeros and ones) M-dimensional vector (where M is the length of the dictionary) that reflects the presence of each dictionary feature in the given request.
It is very good if, as a data structure, the dictionary is a hash table, because there will be a great many lookups of the form "if word in dictionary".
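A minimal sketch of building the dictionary (as a Python dict, i.e. a hash table with O(1) average lookups) and of turning a feature list into a binary feature-vector:

```python
def build_vocabulary(feature_lists):
    """Map every feature string seen in the training data to an index."""
    vocab = {}
    for feats in feature_lists:
        for f in feats:
            vocab.setdefault(f, len(vocab))
    return vocab

def to_vector(feats, vocab):
    """Binary M-dimensional vector: vec[i] == 1 iff vocab feature i is present."""
    vec = [0] * len(vocab)
    for f in feats:
        if f in vocab:      # the hash-table lookup mentioned above
            vec[vocab[f]] = 1
    return vec
```

Features unseen during training simply get no slot in the vector, which is why the dictionary is built from the training logs only.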
Read more about this in Lecture Notes 11: Machine Learning System Design.
An example of building a dictionary and feature-vector
Suppose we train our neural network on just two examples: one good and one bad. Then we will try to activate it on a test record.
Record from the "bad" log:
0.0.0.0 - - [20/Dec/2011:20:00:08 +0400] "POST /forum/index.php HTTP/1.1" 503 107 "http://www.mozilla-europe.org/" "-"
Record from the "good" log:
0.0.0.0 - - [20/Dec/2011:15:00:03 +0400] "GET /forum/rss.php?topic=347425 HTTP/1.0" 200 1685 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0"
The resulting dictionary:
['__UA___OS_U', '__UA_EMPTY', '__REQ___METHOD_POST', '__REQ___HTTP_VER_HTTP/1.0', '__REQ___URL___NETLOC_', '__REQ___URL___PATH_/forum/rss.php', '__REQ___URL___PATH_/forum/index.php', '__REQ___URL___SCHEME_', '__REQ___HTTP_VER_HTTP/1.1', '__UA___VER_Firefox/3.0', '__REFER___NETLOC_www.mozilla-europe.org', '__UA___OS_Windows', '__UA___BASE_Mozilla/5.0', '__CODE_503', '__UA___OS_pl', '__REFER___PATH_/', '__REFER___SCHEME_http', '__NO_REFER__', '__REQ___METHOD_GET', '__UA___OS_Windows NT 5.1', '__UA___OS_rv:1.9', '__REQ___URL___QS_topic', '__UA___VER_Gecko/2008052906']
Test record:
0.0.0.0 - - [20/Dec/2011:20:00:01 +0400] "GET /forum/viewtopic.php?t=425550 HTTP/1.1" 502 107 "-" "BTWebClient/3000(25824)"
Its feature-vector:
[False, False, False, False, True, False, False, True, True, False, False, False, False, False, False, False, False, True, True, False, False, False, False]
Notice how sparse the feature-vector is; this will be the case for all requests.
Dataset partitioning
It is good practice to split the dataset into several parts. I split it into two parts in a 70/30 proportion:
- Training set. We train our neural network on it.
- Test set. We use it to check how well our neural network has been trained.
Such a split is needed because the neural network with the smallest training error (the error on the training set) will show a larger error on new data: we have "overfitted" the network, tailoring it to the training set.
In the future, if you have to fiddle with choosing optimal constants, the dataset will have to be split into three parts in a 60/20/20 ratio: Training set, Test set and Cross validation set. The last one serves to select the optimal parameters of the neural network (for example, weightdecay).
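The 70/30 split itself is a few lines; a sketch:

```python
import random

def split_dataset(samples, train_frac=0.7, seed=0):
    """Shuffle and split into training and test sets (70/30 by default)."""
    rng = random.Random(seed)       # fixed seed: reproducible experiments
    samples = samples[:]            # do not mutate the caller's list
    rng.shuffle(samples)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]
```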
Neural network in particular
Now that we no longer have any text logs, only matrices of feature-vectors, we can build the neural network itself.
Let's start with the choice of structure. I picked a network with one hidden layer twice the size of the input layer. Why? It's simple: that is what Andrew Ng bequeathed for the case when you don't know where to start. I think in the future you can play around with this by plotting training graphs.
The activation function for the hidden layer is the long-suffering sigmoid, and for the output layer it is
Softmax. The latter is chosen for the case when you need
multi-class classification with mutually exclusive classes: for example, send "good" requests to the backend, ban "bad" ones on the firewall, and have "gray" ones solve a captcha.
A neural network is prone to getting stuck in a local minimum, so in my code I build several networks and pick the one with the smallest Test error (note that this is the error on the test set, not on the training set).
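The article does this with PyBrain; purely as an illustration of the described architecture and of the pick-the-best-of-several trick, here is a self-contained numpy sketch (not the author's actual code): one sigmoid hidden layer twice the input size, a softmax output trained with cross-entropy, and selection by test-set error.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyNet:
    """One hidden layer twice the input size; sigmoid hidden, softmax output."""
    def __init__(self, n_in, n_out, rng):
        self.W1 = rng.standard_normal((n_in, 2 * n_in)) * 0.1
        self.b1 = np.zeros(2 * n_in)
        self.W2 = rng.standard_normal((2 * n_in, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.H = sigmoid(X @ self.W1 + self.b1)
        return softmax(self.H @ self.W2 + self.b2)

    def train(self, X, Y, epochs=500, lr=1.0):
        for _ in range(epochs):
            P = self.forward(X)
            dZ2 = (P - Y) / len(X)              # softmax + cross-entropy gradient
            dW2 = self.H.T @ dZ2
            dH = dZ2 @ self.W2.T * self.H * (1 - self.H)
            self.W1 -= lr * (X.T @ dH)
            self.b1 -= lr * dH.sum(axis=0)
            self.W2 -= lr * dW2
            self.b2 -= lr * dZ2.sum(axis=0)

def error_rate(net, X, Y):
    return float(np.mean(net.forward(X).argmax(axis=1) != Y.argmax(axis=1)))

def best_of(n_tries, X_train, Y_train, X_test, Y_test, seed=0):
    """Train several nets from random starting points and keep the one with
    the lowest *test set* error, to dodge bad local minima as described."""
    rng = np.random.default_rng(seed)
    nets = []
    for _ in range(n_tries):
        net = TinyNet(X_train.shape[1], Y_train.shape[1], rng)
        net.train(X_train, Y_train)
        nets.append((error_rate(net, X_test, Y_test), net))
    return min(nets, key=lambda t: t[0])[1]
```

In PyBrain the same idea would use buildNetwork with a SoftmaxLayer output and a BackpropTrainer, looping over several freshly built networks.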
Disclaimer
I am not a real welder. Everything I know about machine learning I learned from ml-class and ai-class. I started programming in Python relatively recently, and the code below was written in 30 minutes (time, as you understand, was running out) and was only slightly polished afterwards.
Code
The script itself is small: 2-3 A4 pages, depending on the font. You can view it here:
github.com/SaveTheRbtz/junk/tree/master/neural_networks_vs_ddos
Also, this code is not self-sufficient. It still needs wrapper scripts around it: for example, if an IP made N bad requests within X minutes, ban it on the firewall.
Performance
- lfu_cache. Ported from ActiveState to greatly speed up the processing of "high-frequency" requests. Downside: increased memory consumption.
- PyBrain is, surprise, written in Python and is therefore not very fast; however, it can use the ATLAS-based arac module if you specify Fast=True when creating the network. You can read more about this in the PyBrain documentation.
- Parallelization. I trained my neural network on a fairly "fat" Nehalem server, yet even there the single-threaded training felt like a drawback. It is worth thinking about parallelizing the training. A simple solution is to train several neural networks in parallel and pick the best one, but that creates additional memory pressure, which is also not great. I would like a more universal solution. Perhaps it simply makes sense to rewrite everything in C, since the whole theoretical base was chewed through in ml-class.
- Memory consumption and the number of features. A good memory optimization was the switch from standard Python arrays to numpy arrays. Reducing the dictionary size and/or using PCA can also help a lot; more on that below.
For the future
- Additional fields in the log. You can add a lot more to the combined log; it is worth thinking about which fields would help identify bots. It may make sense to consider the first octet of the IP address: in a web project that is not international, Chinese users are most likely bots.
- Principal Component Analysis. It would greatly reduce the dimensionality of the feature-vectors and thereby the training time of the neural network.
- Preprocessing / normalizing of features. For example, anything that matches the regexp [a-fA-F0-9]{32} can be replaced with the word __MD5__, anything that looks like a date with the word __DATE__, etc., thereby reducing the number of low-frequency features and, accordingly, the size of the dictionary.
- Tuning the constants and the structure of the neural network. The current ones were pulled out of thin air, namely from the PyBrain manual on building a classifier (yes, remember, I wrote that the code was written very, very fast). It would make sense to plot graphs of the neural network training for different parameter values; fortunately matplotlib is at hand.
- Online training. The strategy of smart attackers changes often. It would be nice to come up with a way to re-train the neural network on the fly. However, this is easier said than done.
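The preprocessing idea from the list above could start out like this (the date pattern is just one illustrative format; real logs contain several):

```python
import re

NORMALIZERS = [
    (re.compile(r'\b[a-fA-F0-9]{32}\b'), '__MD5__'),
    (re.compile(r'\b\d{4}-\d{2}-\d{2}\b'), '__DATE__'),  # one of many date formats
]

def normalize(token):
    """Collapse low-frequency values into generic markers,
    shrinking the feature dictionary."""
    for pattern, marker in NORMALIZERS:
        token = pattern.sub(marker, token)
    return token
```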
More to read
Machine Learning Lecture Notes - lecture notes from the ml-class course.
UFLDL - Unsupervised Feature Learning and Deep Learning, also from Professor Andrew Ng.
Instead of a conclusion
It is very interesting how real welders, such as
highloadlab, approach this classification problem. However, something tells me their algorithms are a secret on the level of Yandex's ranking formulas.