DoS on your own: What causes the uncontrolled growth of tables in the database

Billing is a complex system, one of the most important components of which is a database. It contains a variety of different tables, which eventually grow to large sizes.

In order for this growth not to slow down the operation of the database, there is an effective partitioning mechanism in Oracle, PostgreSQL and other DBMS - however, it is not always possible to use it. For example, it is absent in the relatively budget edition of the Oracle Standard Edition system.
')
Historically, in our billing for Hydra telecom operators, we did not implement our own partitioning mechanism, limiting ourselves to creating a customer guide, which described steps to track the growth of the tables and minimize possible problems with the speed of the system. As it turned out later, this was clearly not enough.

Prehistory

The Hydra billing is used by telecom operators, which means that one of the most important and large tables in the database is the one that is responsible for storing data on PPP sessions and CDR of subscribers of Internet or telephony providers. This table stores data on the beginning, end, and time of its last update, as well as information about the subscriber who initiated it, for example, the IP address from which he established the connection.

The more subscribers in the carrier's database, the faster the table with sessions data grows. The simplest solution in this situation is to simply delete old records from time to time. However, under Russian law, providers are obliged to keep data on subscribers' access to the network, which law enforcement agencies may need for three years.

In the development of billing many years ago, we took into account this point and assumed that the volume of the table with sessions can reach 15 million lines. Upon reaching this bar, the upload and archive data script should have been launched.

Nevertheless, the need to monitor the size of the table persisted - if you did not pay attention to it at all, it could grow to much larger sizes. What ultimately happened to one of our clients.

Everything is bad

For various reasons, the specialists of this provider did not set up monitoring the size of the table with sessions and did not connect our monitoring service (especially for lazy clients, we created a special tool that is able to collect system performance metrics and notify about possible problems).

The table grew, and neither the company in whose base everything happened, nor our support service knew about it. In the end, its size was 36 million lines, which more than doubled the maximum limit.

At this particular operator, subscribers got into the network via VPN, which means they had to get access to the VPN server, which had already accessed the database using the RADIUS protocol.

As a result, on the day when there was a surge in subscriber activity on the operator’s network due to the rebooting of one of the trunk switches, the billing RADIUS server was unable to authorize the flow of subscribers in a timely manner.
What happened is that the performance of the samples from the session table at a certain point has seriously fallen. This led to the fact that the authorization process of subscribers did not have time to complete in the allotted time for this. As a result, users began to receive network access denials and re-send authorization requests, which merged into a real avalanche that has swept the system.

With the request processing timeout, the VPN server sends a couple more authorization requests — and to process them again, you need to read the data from that huge table. And only after that the subscriber will see a message about the denial of access. Naturally, having received such a message, the user tries to establish the connection again, and everything repeats again. Thus, a situation arose with a real RADIUS server DoS.

What to do

Faced with serious problems of authorizing their own subscribers, the provider turned to our support team. It was possible to establish the cause of the problems very quickly - the table with the sessions was too large, besides, the very requests for reading from the session table were in the top of the IOWAIT queries. The only way to correct the situation was to clear the rows of the table in the database. At the same time, the data could not be lost, so they had to be copied somewhere beforehand.

Therefore, we uploaded entries to CSV files, thereby reducing the size of the table. In order to put some subscribers on the network during these activities, the offline mode of the RADIUS server was also activated. It consists in using a caching mechanism - data about the result of the last authorization of the subscriber is stored in the cache, so that if the connection with the billing is lost, the server can authorize users using this data.

The autonomous mode mechanism at that time was quite simple and had its drawbacks - for example, the data in the cache was not in constant replication with the database, so there could be discrepancies in them. This means that if the user last logged in a month ago and then did not pay for the Internet and had to be blocked, the system authorizes him to work offline when he is offline, or does not authorize who has the right to access but was blocked at the last connection attempt. But in the current situation, it was the lesser of evils.

As a result, when the table with sessions was reduced to 30 million lines, subscriber authorization started working again, and we were able to disable the offline mode of the RADIUS server.

Lessons

This story gave us a lot of food for thought. It turned out that it is not enough just to give recommendations to clients or to give a paid opportunity to monitor important points - if in the end the execution of actions is given to the client, then one should expect that he will not implement them. Therefore, we have implemented new monitoring schemes that allow us to track the occurrence of problems in client systems.

Now, when a potentially dangerous situation like the one described above is detected, an application is created to the Latera support service, with which we will deal in detail.

In addition, we also changed the approach to organizing work with data and the process of caching a RADIUS server. All this resulted in the creation of the archiving mechanism in the new version of billing - it provides for archiving the data of constantly growing tables into a separate Oracle scheme. This process runs in the background and does not affect the work of billing. Moreover, for the client, this mechanism works in such a way that when creating reports on sessions, it simultaneously receives data from the main tables and tables from the archive schema, without “tambourine dances”.

We also made a fully autonomous RADIUS server with a local database, which acts not as a cache, but locally stores the information replicated from the main billing base for authorizing subscribers. Thus, the problem of the possible provision of communication services to those subscribers who should be blocked and the denial of access for those who need to be authorized was solved. In the new version of the RADIUS server, mechanisms are implemented for the most painless way out of the situation when the connection with billing is broken. We described them in the article about the resiliency of billing.

If we describe the scheme briefly, then each access server consists of several components:

Database with subscriber profiles and data on consumed services.
Our application codenamed HARD. It responds to HTTP requests that come from the next component.
FreeRADIUS - directly the server implementing the standard AAA-protocol - RADIUS. It directly communicates with the subscriber network and translates requests from the binary format into plain HTTP + JSON for HARD.

The databases of all AAA servers (this is MongoDB) are combined into a group with one main node (master) and two subordinate (slave). All requests from the subscriber network go to one AAA-server, while it is not necessary and even undesirable that they be a server from the main database.

If something goes wrong, and one of the components refuses, then the subscribers will not lose access to the services. Most likely, they will not notice anything at all.

In addition, we continue to monitor the size of the tables, including tracking the volume of the session table - no one is immune from the growth of data in the non-archival period.

That's all for today, thank you for your attention! Do not forget to subscribe to our blog .

Source: https://habr.com/ru/post/277331/

All Articles

DoS on your own: What causes the uncontrolled growth of tables in the database

Prehistory

Everything is bad

What to do

Lessons

More articles: