
Out of the Tarantool web: synchronizing nodes when filtering traffic


Variti specializes in protection against bots and DDoS attacks, and also performs load and stress testing. Since we operate as an international service, it is extremely important for us to ensure smooth, real-time exchange of information between servers and clusters. At the Saint HighLoad++ 2019 conference, Variti developer Anton Barabanov explained how we use UDP and Tarantool, why we chose this combination, and how we ended up rewriting our Tarantool module from Lua to C.

The report's theses can also be read at the link, and the video is under the spoiler below.
Video report


When we started building the traffic filtering service, we immediately decided not to deal with IP transit but to protect HTTP, API, and game services. Thus, we terminate traffic at the L7 level in the TCP protocol and pass it on. Protection at L3/L4 happens automatically. The diagram below shows the service layout: requests from people pass through the cluster, that is, through servers and network equipment, while bots (shown as ghosts) are filtered out.



For filtering, we need to split traffic into separate requests and analyze sessions accurately and quickly; since we do not block by IP address, we can distinguish bots from people within connections coming from the same IP address.

What happens inside the cluster


Inside the cluster we have independent filtering nodes, that is, each node works on its own and only with its own slice of traffic. Traffic is distributed between the nodes randomly: if, for example, 10 connections come in from one user, they all end up spread across different servers.

We have very strict performance requirements, because our customers are in different countries. If, for example, a user from Switzerland opens a French site, they already face 15 milliseconds of network latency because of the longer traffic route. So we cannot afford to add another 15-20 milliseconds inside our processing center: the request would already take critically long. Besides, if we spent 15-20 milliseconds processing each HTTP request, a trivial attack of 20,000 RPS would take down the entire cluster. That, of course, is unacceptable.

Another requirement was not just to track individual requests but to understand the context. Suppose a user opens a web page and sends a request for "/". After that the page loads, and if it is HTTP/1.1, the browser opens 10 connections to the backend and in 10 threads requests statics and dynamics, makes AJAX requests and subrequests. If, while the page is being returned, you start interacting with the browser instead of proxying a subrequest and try to serve it, say, a JS challenge in response to that subrequest, you will most likely break the page. On the very first request you can serve a CAPTCHA (although that is bad) or a JS challenge and do a redirect, and any browser will then handle everything correctly. After the check, the information that the session is legitimate must be propagated to all nodes. If there is no information exchange between the nodes, the other nodes will pick up the session mid-way and will not know whether to let it through or not.

It is also important to react promptly to load spikes and changes in traffic. If a spike hits one node, then within 50-100 milliseconds it will hit all the other nodes too. So it is better if the nodes know about the change in advance and tighten the protection parameters ahead of time, so that the spike simply does not happen on the other nodes.
In addition to bot protection we have a post-markup service: we place a pixel on the site, record whether each visitor is a bot or a person, and expose this data via an API. These verdicts need to be stored somewhere. In other words, where earlier we talked about synchronization within a cluster, now we are adding synchronization of information between clusters. Below is a diagram of the service at the L7 level.



Between clusters


After building a cluster, we started scaling. We work via BGP anycast, that is, our subnets are announced from all clusters and traffic arrives at the nearest one. Simply put, a request from France goes to the cluster in Frankfurt, and one from St. Petersburg goes to the cluster in Moscow. The clusters must be independent, and the network flows are independent accordingly.

Why is this important? Suppose a person is driving a car, working with a site over mobile Internet, and crosses a certain boundary, after which the traffic suddenly switches to another cluster. Or another case: the traffic route is rebuilt because a switch or router burned out somewhere, something went down, a network segment got disconnected. In that case we supply the browser (for example, via cookies) with enough information so that, when it switches to another cluster, it can pass along the necessary parameters about which checks it has or has not passed.
In addition, the protection mode needs to be synchronized between clusters. This matters for low-volume attacks, which are most often carried out under the cover of a flood. Since the attacks run in parallel, people assume the flood is what is breaking the site and do not notice the low-volume attack. Protection-mode synchronization is needed precisely for the case when the low-volume attack hits one cluster and the flood hits another.

And, as already mentioned, we synchronize between clusters the verdicts that are accumulated and served via the API. There can be many verdicts, and they need to be synchronized reliably. In protection mode you can afford to lose something inside a cluster, but not between clusters.

It is worth noting that the latency between clusters is large: in the case of Moscow and Frankfurt it is 20 milliseconds. Synchronous requests are out of the question here; all interaction has to be asynchronous.

Below we show the interaction between clusters. M, l, p are technical parameters used in the exchange; u1, u2 denote the markup of users as illegitimate or legitimate.



Internal interaction between nodes


Initially, when we were building the service, L7 filtering ran on just a single node. This worked fine for two clients, but no more. When scaling, we wanted maximum efficiency and minimal latency.

It was important to minimize the CPU resources spent on packet processing, so interaction via, say, HTTP would not be appropriate. We also had to keep the overhead to a minimum, not only in computing resources but also in packet rate. After all, we are talking about filtering attacks, and those are exactly the situations where performance is in short supply. Usually, when building a web project, a x3 or x4 margin over the load is enough, but we always effectively have x1, because a large-scale attack can arrive at any moment.

Another requirement for the interaction interface was a place where we could write information and from which we could later read back what state we are in. It is no secret that filtering systems are often written in C++. Unfortunately, programs written in C++ sometimes crash and dump core. Sometimes such programs need to be restarted for an update or, for example, because the configuration was not re-read. And if we restart a node under attack, we need somewhere to get the context in which that node was operating. That is, the service cannot be stateless; it must remember that there is a certain set of users we have blocked and users we are still checking. The same internal communication channel must exist so that the service can receive its initial set of information. We considered putting some database next to it, for example SQLite, but quickly discarded that option: doing disk I/O on every server is awkward, and as an in-memory store it would work poorly.

In fact, we work with only three operations. The first is "send to all nodes". It covers, for example, messages for synchronizing the current load: each node must know the total load on the resource within the cluster in order to track peaks. The second operation is "save"; it covers the verdicts from checks. The third operation is the combination of "send to all" and "save". Here we are talking about state-change messages that we send to all nodes and then store so they can be read back later. Below is the resulting interaction scheme, to which we will still have to add parameters for saving.
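To make these operations concrete, here is a minimal sketch in Tarantool Lua of what such an interface could look like. The function names and the idea of handing the payload to a transport layer are our illustrative assumptions, not the actual Variti code.

```lua
-- Hypothetical sketch of the three node-interaction operations
-- (names and structure are illustrative).
local msgpack = require('msgpack')

-- 1. "send to all": e.g. current-load updates that every node must see
local function send_to_all(msg)
    local payload = msgpack.encode(msg)   -- binary, not text
    -- hand the payload to the transport layer
    -- (UDP multicast, discussed in the next section)
    return payload
end

-- 2. "save": persist a check verdict so it survives a node restart
local function save(verdict)
    -- write the verdict to local storage (the database added later)
end

-- 3. "send to all" + "save": state-change messages that are broadcast
--    and also written down so they can be read back afterwards
local function send_and_save(msg)
    send_to_all(msg)
    save(msg)
end
```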



Options and the result


What options for storing and distributing the verdicts did we consider? First we thought about the classics: RabbitMQ, Redis MQ, and our own TCP-based service. We discarded these solutions because they are slow. TCP alone doubles the packet rate. Besides, if one node has to send a message to all the others, then either we need a lot of sending nodes, or that node will only manage to deliver 1/16 of the messages that 16 machines can send to it. Clearly, that is unacceptable.

As a result, we took UDP multicast, because then the fan-out is done by the network equipment, which is not the performance bottleneck and completely solves the problems with send and receive speed. Naturally, with UDP we do not bother with text formats and send binary data.
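As a rough illustration (not the production code), sending such a binary state message over UDP multicast from Tarantool's Lua could look like this; the multicast group, port, and message fields are made-up values.

```lua
-- Sketch: broadcasting a binary state message over UDP multicast.
-- The multicast group, port, and field layout are illustrative.
local socket  = require('socket')
local msgpack = require('msgpack')

local MCAST_GROUP = '239.1.1.1'   -- assumed intra-cluster multicast group
local MCAST_PORT  = 4000          -- assumed port

local sock = socket('AF_INET', 'SOCK_DGRAM', 'udp')

local function broadcast_state(node_id, code, fields)
    -- binary, not text: msgpack keeps the message small and cheap to parse
    local payload = msgpack.encode({node_id, os.time(), code, fields})
    sock:sendto(MCAST_GROUP, MCAST_PORT, payload)
end

-- e.g. report the current load seen by this node
broadcast_state('node-1', 1, {rps = 1200, blocked = 17})
```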

On top of that, we immediately added message packing and a database. We took Tarantool because, firstly, all three founders of the company had experience with this database, and secondly, it is as flexible as it gets; in effect it is also an application server. Furthermore, Tarantool has a C API, and the ability to write in C is crucial for us, because protecting against DDoS requires maximum performance. No interpreted language can provide sufficient performance, unlike C.

In the diagram below, we added a database inside the cluster in which the states for internal communication are stored.



Adding a database


In the database we store the state in the form of a log of hits. When we were thinking about how to store the information, there were two options. We could keep a state that is constantly updated and changed, but that is rather difficult to implement correctly. So we took a different approach.

The point is that the structure of the data sent over UDP is uniform: there is a timestamp, some kind of code, and three or four data fields. So we started writing this structure into a Tarantool space and added a TTL to each entry, which indicates when the record becomes stale and should be removed. Thus Tarantool accumulates a log of messages, which we clean up on a given schedule. To delete the old data we initially took expirationd. Later we had to abandon it because it caused certain problems, which we will discuss below. For now, the scheme so far: two databases have been added to our structure.
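A minimal sketch of what this could look like in Tarantool Lua is below; the space name, field layout, and the one-hour TTL are our assumptions for illustration, not the actual schema.

```lua
-- Sketch of the message-log space and TTL-based cleanup with expirationd.
-- Space name, field layout, and TTL are illustrative assumptions.
local fiber       = require('fiber')
local expirationd = require('expirationd')

box.schema.space.create('hit_log', {if_not_exists = true})
box.space.hit_log:create_index('primary', {
    parts = {1, 'string'},        -- field 1: message id
    if_not_exists = true,
})

local TTL = 3600                  -- keep records for one hour

-- a tuple is expired once its timestamp (field 2) plus TTL is in the past
local function is_expired(args, tuple)
    return tuple[2] + args.ttl < fiber.time()
end

expirationd.start('hit_log_cleanup', box.space.hit_log.id, is_expired, {
    args = {ttl = TTL},
    tuples_per_iteration = 1024,
    full_scan_time = 60,          -- walk the whole space once a minute
})
```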



As already mentioned, besides storing the cluster's state we also need to synchronize the verdicts, and we synchronize them across clusters. For that we had to add another Tarantool installation; it would have been strange to use anything else, since Tarantool was already there and fits our service perfectly. In the new installation we started writing verdicts and replicating them to the other clusters. Here we use not master/slave but master/master. Tarantool currently offers only asynchronous master/master replication, which does not suit many use cases, but for us this model is optimal: given the far-from-negligible latency between clusters, synchronous replication would get in the way, while asynchronous replication causes no problems.
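Configuring asynchronous master/master replication between two Tarantool instances boils down to listing each peer in the other's configuration. A minimal sketch follows, with placeholder URIs and credentials.

```lua
-- Sketch: asynchronous master/master replication between two clusters.
-- The URIs, user, and password are placeholders.
box.cfg{
    listen      = 3301,
    read_only   = false,    -- both sides accept writes: master/master
    replication = {
        'replicator:secret@verdicts.moscow.example:3301',
        'replicator:secret@verdicts.frankfurt.example:3301',
    },
}
```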

Problems


But we had plenty of problems. The first block of difficulties is related to UDP: it is no secret that the protocol can corrupt and lose packets. We solved these problems with the ostrich method, that is, we simply buried our heads in the sand. Still, packet corruption and reordering are practically impossible in our case, because the communication happens within a single switch and there are no unstable links or flaky network equipment.

Packet loss can happen if a machine stalls, gets stuck on I/O somewhere, or the node is overloaded. If such a hang lasts a short time, say 50 milliseconds, it is not a disaster and is handled by enlarging the queues via sysctl. That is, we tune the queue sizes with sysctl and get a buffer where everything waits until the node is running again. If a longer hang occurs, the problem will not be the loss of internal connectivity but the share of client traffic that goes to that node. We have not had such cases so far.

The problems with Tarantool's asynchronous replication were much harder. Initially we took not master/master but the more conventional operational model, master/slave. Everything worked smoothly until the slave took over the master role for a long time. As a result, expirationd kept running and deleting data on the master, while on the slave it was not running. So after we had switched from master to slave and back several times, so much data accumulated on the slave that at some point everything fell over. So, for full failover, we had to move to asynchronous master/master replication.

And here, again, there were difficulties. First, keys can collide between different replicas. Suppose that within a cluster we wrote data to one master, the connection broke at that moment, we wrote everything to the second master, and after asynchronous replication caught up it turned out that the same primary key exists in the space and replication falls apart.

We solved this problem simply: we adopted a scheme in which the primary key necessarily contains the name of the Tarantool node being written to. Thanks to this, conflicts stopped occurring, although it became possible for user data to be duplicated. This is an extremely rare case, so we simply neglected it again. If duplication happened frequently, Tarantool has plenty of index types, so deduplication can always be done.
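A minimal sketch of such a key scheme, under our own assumptions about space, sequence, and node names (they are not from the article), might look like this:

```lua
-- Sketch: the primary key includes the node name, so writes made on
-- different masters cannot collide during master/master replication.
-- Space, sequence, and node names are illustrative.
local NODE_NAME = 'fra-tnt-01'

box.schema.space.create('verdicts', {if_not_exists = true})
box.space.verdicts:create_index('primary', {
    parts = {1, 'string', 2, 'unsigned'},   -- {node_name, local_id}
    if_not_exists = true,
})
box.schema.sequence.create('verdict_seq', {if_not_exists = true})

local function save_verdict(user_id, is_bot)
    local id = box.sequence.verdict_seq:next()
    -- keys written on different nodes differ in the first field,
    -- so replication cannot produce a duplicate primary key
    box.space.verdicts:insert{NODE_NAME, id, user_id, is_bot}
end
```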

Another problem concerns storing the verdicts: it arises when data written on one master has not yet appeared on the other, but a request for it has already arrived there. Honestly, we have not solved this yet and simply delay the verdict. If that turns out to be unacceptable, we will organize some kind of push notification about data readiness. That is how we dealt with master/master replication and its problems.

There was also a block of problems related directly to Tarantool, its drivers, and the expirationd module. Some time after launch, attacks started arriving every day, and accordingly the number of messages we store in the database for synchronization and context keeping became very large. During cleanup, so much data was deleted that the garbage collector could not keep up. We solved this by writing our own expirationd module in C, which we called IExpire.

However, one more difficulty with expirationd remains, and we have not dealt with it yet: expirationd runs on only one master. If the node running expirationd goes down, the cluster loses critical functionality. Suppose we clean out all data older than one hour; clearly, if the node is down for, say, five hours, the amount of data will be five times the normal amount. And if a major attack arrives at that moment, that is, two bad cases coincide, the cluster will go down. We do not yet know how to deal with this.

Finally, there were difficulties with the Tarantool driver for C. When the service crashed (for example, because of a race condition), it took a long time to find the cause and debug. So we simply wrote our own Tarantool driver. Implementing the protocol, together with testing, debugging, and rolling it out to production, took us five days, though we already had our own code for working with the network.

Problems outside


Recall that at this point we already have Tarantool replication in place and we already know how to synchronize verdicts, but there is still no infrastructure for sending messages about attacks or problems between clusters.
As for that infrastructure, we had many different ideas, including writing our own TCP service. But there is the Queue module from the Tarantool team. Besides, we already had Tarantool with inter-cluster replication, and the firewall "holes" were already punched through, so there was no need to go to the admins and ask them to open ports or route traffic. And integration with the filtering software was ready as well.
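A minimal sketch of how the Queue module can be used for such inter-cluster messages is shown below; the tube name, payload, and TTL are our illustrative assumptions.

```lua
-- Sketch: inter-cluster messaging on top of the tarantool/queue module.
-- Tube name, message fields, and TTL are illustrative.
local queue = require('queue')

-- 'fifottl' tubes support a per-task time-to-live
queue.create_tube('cluster_events', 'fifottl', {if_not_exists = true})

-- producer: announce a protection-mode change to the other clusters
queue.tube.cluster_events:put(
    {type = 'protection_mode', mode = 'strict', reason = 'low_volume_attack'},
    {ttl = 60}
)

-- consumer (only on the responsible node, see below): take, process, acknowledge
local task = queue.tube.cluster_events:take(1)   -- wait up to one second
if task ~= nil then
    -- ... apply the state change locally ...
    queue.tube.cluster_events:ack(task[1])       -- task[1] is the task id
end
```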

There was a difficulty with the coordinating node. Suppose there are n independent nodes inside the cluster and you have to choose the one that will work with the queue. Otherwise either 16 copies of a message will be sent, or the same message will be read from the queue 16 times. We solved this simply: we register the responsible node in a Tarantool space, and if that node dies, we simply update the record, provided we do not forget to. If we do forget, that is a problem, and one we also want to solve in the future.
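Here is a minimal sketch of such manual registration of the responsible node in a space; the space layout and node names are assumptions made for the example.

```lua
-- Sketch: registering the node responsible for reading the inter-cluster
-- queue, so that only one node inside the cluster consumes messages.
-- Space layout and node names are illustrative.
local MY_NAME = 'fra-tnt-01'

box.schema.space.create('queue_coordinator', {if_not_exists = true})
box.space.queue_coordinator:create_index('primary', {
    parts = {1, 'unsigned'},
    if_not_exists = true,
})

-- (re)assign the responsible node; today this is done by hand
local function set_coordinator(node_name)
    box.space.queue_coordinator:replace{1, node_name}
end

-- every node checks the record before touching the queue
local function i_am_coordinator()
    local rec = box.space.queue_coordinator:get(1)
    return rec ~= nil and rec[2] == MY_NAME
end
```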

Below is a detailed diagram of a cluster with an interaction interface.



What we want to improve and add


First, we want to release IExpire as open source. It seems to us a useful module: it does everything expirationd does, but with almost zero overhead. It would be worth adding a sorted index there so that only the oldest tuples are removed. We have not done this yet, because for us the main Tarantool operation is the write, and an extra index means extra overhead to maintain it. We also want to rewrite most of the methods against the C API so as not to bring the database down.
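If that sorted index were added, the cleanup could stop at the first tuple that is still fresh instead of scanning everything. A rough sketch, reusing the hypothetical hit_log space from above, is shown below; this is not in IExpire today.

```lua
-- Sketch: a secondary index on the timestamp field (field 2, assumed to
-- hold integer seconds) would let the cleanup remove only the oldest tuples.
-- Not implemented in IExpire; shown here only as an illustration.
local fiber = require('fiber')
local TTL = 3600

box.space.hit_log:create_index('by_time', {
    parts = {2, 'unsigned'},
    unique = false,
    if_not_exists = true,
})

-- walk tuples in timestamp order and stop at the first non-expired one
local to_delete = {}
for _, tuple in box.space.hit_log.index.by_time:pairs() do
    if tuple[2] + TTL >= fiber.time() then break end   -- everything after is newer
    table.insert(to_delete, tuple[1])
end
for _, key in ipairs(to_delete) do
    box.space.hit_log:delete(key)
end
```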

The question of choosing the logical master remains, but it seems this problem cannot be solved completely. That is, if the node running expirationd goes down, all that is left is to manually pick another node and start expirationd on it. It is hardly possible to do this automatically, since replication is asynchronous. Although we will probably consult the Tarantool team on this.

If the number of clusters grows sharply, we will also have to ask the Tarantool team for help. The point is that all-to-all replication is used for Tarantool Queue and for inter-cluster verdict storage. That works well while there are, say, three clusters, but when there are 100 of them, the number of connections that need monitoring becomes enormous and something will always be breaking. Secondly, it is far from certain that Tarantool would withstand such a load.

Conclusions


The first conclusions concern UDP multicast and Tarantool. Do not be afraid of multicast: using it inside a cluster is good, correct, and fast. There are many cases where states are synchronized constantly and, 50 milliseconds later, it no longer matters what happened before; in such cases losing a single state update is most likely not a problem. So using UDP multicast is justified: you do not cap your performance and you get the optimal packet rate.

The second point is Tarantool. If you have a service in Go, PHP, and so on, then most likely Tarantool is usable as is. But if you have a heavy load, you will need to reach for a file and do some fine-tuning. Let's be honest, though: in that case the file is needed for pretty much everything, for Oracle and for PostgreSQL alike.

Of course, there is the opinion that you should not reinvent the wheel, and that a small team should take a ready-made solution: Redis for synchronization, standard Go, Python, and so on. That is not true. If you are sure you need a new solution, if you have tried the open-source options and found that nothing fits, or you know in advance that there is no point even trying, then it is worth building your own. The other side of it is that it is important to stop in time: do not write your own Tarantool, do not implement your own messaging, and if all you need is a broker, just take Redis and you will be happy.

Source: https://habr.com/ru/post/459264/

