
In our blog we write not only about the development of our product, the Hydra billing system for telecom operators, but also about the difficulties and problems we run into along the way. We have previously described how the uncontrolled growth of tables in the database of one company using our system led to a real DoS.
Today we will look at another interesting case of sudden failure, one that made this year's April Fools' Day anything but funny for the Latera support team.
Everything is lost
One telecom operator, a customer of ours, had been using Hydra billing for several years. At first everything was fine, but over time problems began to appear: for example, various parts of the system started to run more slowly.
On the morning of April 1, however, the quiet course of events was disrupted. Something had gone wrong, and extremely agitated customer representatives contacted technical support. It turned out that some subscribers who had not paid for Internet access had received it anyway (not the worst outcome), while subscribers who had paid everything on time were blocked, and that caused a storm of discontent.
The support team went on alert. The problem had to be found and fixed quickly.
Hot day
After logging in to the billing server, the support engineers first discovered that the main tablespace of the production database had run out of space an hour and a half earlier. As a result, several billing system processes had stopped, in particular the profile updates in the provisioning module (which manages subscriber access to services). Support breathed a sigh of relief and, looking forward to an early lunch, quickly extended the tablespace and restarted the processes.
It did not help: subscribers were no better off. Meanwhile the client began to panic as a flood of outraged "I paid, but nothing works" calls hit its call center. The investigation continued.
On the RADIUS authorization server, which had been moved to new hardware a few days earlier, the engineers discovered a severe slowdown. It quickly became clear that, because of the slowdown, the authorization server's database contained stale user profile data. That is why the system wrongly opened Internet access for some subscribers and closed it for those who had paid on time. The reason for such a dramatic drop in performance, however, remained unclear.
To get rid of the stale data, we decided to re-send the correct data from provisioning to the RADIUS server, in effect replicating all the information again. This operation had originally been designed as a last-resort tool for cases where the database had been destroyed.
A few minutes later it became clear that re-sending the data not only failed to help but made the situation significantly worse. By our estimates, the delay between a payment and access being enabled for the subscriber grew to several hours.
The only additional thing we could have done was to clear the cached data before replication, but that would have caused denial of service for subscribers who were working fine (and they were the majority), so we could not do it.
In the end we decided to break the problem into stages and check the system components at each of them. This allowed us to reconstruct the course of events of that day, eventually find the cause, and fix the problem.
Lyrical digression: When the city falls asleep
So, on the night of April 1, billing began generating new subscriber profile data to replace the old: if a subscriber had not paid for a service, access to it had to be revoked. At this step the provisioning module generated a large number of new profiles and sent them asynchronously to an internal Oracle message queue. Since working with Oracle queues directly from outside is inconvenient with our technology stack, an intermediate layer is used: an Apache ActiveMQ message broker, to which the RADIUS server connects and from which it writes data to MongoDB.
Billing must send the changed subscriber profiles in strict order. To preserve this order and avoid mixing up disconnect and connect profiles, the system has a dedicated process that reads data from the ActiveMQ queue and writes it to MongoDB.
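Why the order matters can be shown with a minimal Python sketch. The update format (a subscriber id plus an access flag) and the function name are assumptions made for illustration only; they are not Hydra's actual data model.

```python
from typing import Dict, List, Tuple

# Hypothetical update format: (subscriber_id, access_enabled).
Update = Tuple[str, bool]


def apply_in_order(updates: List[Update]) -> Dict[str, bool]:
    """Apply updates one by one; the last update for a subscriber wins."""
    profiles: Dict[str, bool] = {}
    for subscriber_id, access_enabled in updates:
        profiles[subscriber_id] = access_enabled
    return profiles


# The subscriber is disconnected at midnight, then pays and is reconnected.
sent = [("sub-1", False), ("sub-1", True)]

in_order = apply_in_order(sent)
reordered = apply_in_order(list(reversed(sent)))

print(in_order)   # {'sub-1': True}
print(reordered)  # {'sub-1': False}
```

If the two updates swap places, a subscriber who has paid ends up blocked, which is why a single process applies them sequentially.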
Understanding comes
The system was deployed so that ActiveMQ and MongoDB ran on different servers, and the ActiveMQ configuration in use did not require acknowledgement of received messages: everything sent from the queue was considered delivered by default. All that was needed was an incoming connection; data was then sent until the receiving OS's buffers were full.
In the past the RADIUS server had occasionally lost its connection to ActiveMQ, so we had implemented reconnection. To detect a lost connection we used a mechanism of the STOMP protocol, which has a setting for sending heartbeat packets. If such a signal is not received, the connection is considered lost and is forcibly re-established.
Once we had put all of this together, the cause of the problem became clear immediately.
Messages are always retrieved from ActiveMQ in the same execution thread that writes data to the subscriber profiles. The thread reads and writes in turn, so if a write is delayed, it cannot pick up the next packet. If that packet was a heartbeat, it is not received in time, and another thread, which checks for such packets, considers the connection lost and tries to re-establish it.
When the connection is re-established, the socket in use and its buffer are closed, and all the data that was sitting there, not yet parsed by the application, is lost. Meanwhile ActiveMQ is sure the data was sent successfully, since no acknowledgement is required, and the receiving side keeps re-establishing the connection.
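The failure mode can be reproduced with a small deterministic simulation in Python. This is a simplified model, not the real consumer: the `simulate` function, the frame format, the write time, and the timeout values are all assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = Tuple[str, int]  # ("data", sequence_number) or ("heartbeat", 0)


def simulate(n_updates: int, write_seconds: float,
             heartbeat_interval: float,
             heartbeat_timeout: float) -> Tuple[List[int], List[int]]:
    """Return (applied, lost) sequence numbers for one burst of updates."""
    # The broker pushes the whole burst at once and needs no acknowledgement.
    buffer: Deque[Frame] = deque(("data", i) for i in range(n_updates))
    applied: List[int] = []
    lost: List[int] = []
    now = 0.0
    last_heartbeat_seen = 0.0
    next_heartbeat_sent = heartbeat_interval

    while buffer:
        kind, seq = buffer.popleft()
        if kind == "heartbeat":
            last_heartbeat_seen = now
            continue
        now += write_seconds  # a slow database write blocks the reading thread
        applied.append(seq)
        # Heartbeats sent in the meantime queue up behind the unread data.
        while next_heartbeat_sent <= now:
            buffer.append(("heartbeat", 0))
            next_heartbeat_sent += heartbeat_interval
        # The watchdog has seen no heartbeat for too long: it resets the
        # connection, and everything still in the buffer is dropped.
        if now - last_heartbeat_seen > heartbeat_timeout:
            lost.extend(s for k, s in buffer if k == "data")
            buffer.clear()
            last_heartbeat_seen = now
    return applied, lost


applied, lost = simulate(20, 1.0, 5.0, 10.0)
print(len(applied), len(lost))  # 11 9
applied, lost = simulate(20, 1.0, 5.0, 300.0)
print(len(applied), len(lost))  # 20 0
```

With a short timeout the watchdog fires while data is still waiting in the buffer and nine updates vanish; with a timeout longer than the whole burst takes to write, nothing is lost.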
As a result, some profile updates are lost while the system carries on working. The replicated database cache ends up partially invalid, even though many updates did arrive.
We brought the problem out of its acute phase by increasing the heartbeat timeout. This allowed the thread processing the data to wait for each write to complete and successfully process all messages. After that the system was fully operational again, and correct information reached the authorization server's database. In total, six hours passed from the moment the client's request came in to the end of the investigation.
The crisis was over and we could breathe out, but now we had to understand what had caused the malfunction and rule out a recurrence.
Search for reasons
As always with large-scale incidents, several factors piled on top of each other. None of them alone would have led to such dire consequences, but together they had a snowball effect.
One of the reasons was the wish of our client's management to see all payments from all customers on the same day. That is why the system was set up so that all charges and credits take place on the first day of the new month.
The process works as follows: billing maintains an electronic document for each subscriber that records the fact of the service being provided. Every month this document is created anew. A few hours before the end of the month a balance check takes place: the documents are analyzed and, based on this analysis, after midnight subscribers either regain access to the services or lose it.
Since for this client these balance checks all take place at once at the start of a new month, the load on the system always rises on the first day of each month. The cause is not only the need to analyze all customers and disconnect non-payers: those who immediately decide to pay must be quickly let back in to the services, and that fact must be accounted for correctly.
All this relies on two important system processes. One of them executes the commands that disconnect subscribers. When billing determines that a subscriber must be disconnected, it sends a command through the built-in provisioning module to terminate the subscriber's access session.
The second process prevents the subscriber from establishing a new session. To do this, data from the Hydra provisioning module must be replicated to the authorization server's independent database of user configurations. This lets all parts of the system know that certain subscribers must not be served at the moment.
Important note: obviously, the wish to see all the money from all subscribers on one day does not help spread the load on the system evenly. On the contrary, on this “Doomsday” for subscribers the load grows several times over, both for billing and for the payment acceptance services. In addition, this configuration increases subscriber churn (we wrote about how to deal with it here and here).
The second reason was the desire to save on infrastructure and the refusal to sensibly spread services across servers. The server bought for billing was less powerful than needed. It ran for four years without problems, but the amount of data in the database grew, and resource-intensive additional services were piled onto the same server (an external reporting system running on a Java machine, the new provisioning module, also on a Java machine, and so on). The engineers' recommendations to split the services apart were shelved with the argument “it works as it is”.
So two of the four prerequisites for the future failure were already in place. Even then, large-scale problems could have been avoided had it not been for two unlucky accidents that happened at the same time, at the worst possible moment.
Shortly before April 1, a separate server was finally obtained for the authorization database with subscriber profiles. As it later turned out, the RAID array on this server was defective, and above a certain load authorization on it slowed down even more than on the old hardware. The bug had not shown up on any of the more than 100 other Hydra installations, because under normal conditions the authorization service is fast and has resources to spare.
The immediate trigger of the incident was probably the tablespace running out of space, which caused profile changes to accumulate. As soon as free space reappeared, all these changes were processed, and a stream of replication messages rushed through the system queues to the defective RADIUS server, which was unable to apply them.
Certain suspicions about a possible bug had arisen even before this incident: from time to time individual subscribers had incorrect profiles in the database. However, the problem could not be tracked down before the first day of the new month: the event was very rare, and extended logging did not help catch the error in time (partly because of the move to the new server).
Preventing problems in the future
Having dealt with the failure and worked out its causes, we then, in a calmer atmosphere, made changes designed to prevent the situation from recurring. Here is what we did:
- First of all, we added a requirement to acknowledge delivery of the data sent by ActiveMQ. The acknowledgement works in cumulative mode: the server acknowledges receipt not of every message but of batches of them (every five seconds). The message processing logic supports re-processing the queue from a certain point, even if some of the data has already reached the database.
- In addition, the heartbeat period was increased from five seconds to a few minutes. Alongside the heartbeat mechanism, the connection to the message broker is now established with the keepalive option and short intervals for checking connection activity (several tens of seconds versus the couple of hours set by the operating system by default).
- We also ran tests in which various modules of the system were randomly restarted while messages were being sent. In one of these tests some data was still lost. We then replaced the MongoDB storage engine itself, switching to WiredTiger. We had planned to do this earlier, and decided to combine the move with the tests.
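The first item in the list, cumulative acknowledgement combined with safe re-processing, can be sketched in Python as follows. Everything here is an assumption for illustration: `Broker` and `ProfileStore` are in-memory stand-ins rather than ActiveMQ or MongoDB APIs, acknowledgements are sent per batch of messages instead of every five seconds, and duplicates are skipped by sequence number, which is only one possible way to make re-processing safe.

```python
from typing import Dict, List, Optional, Tuple

Message = Tuple[int, str, bool]  # (sequence number, subscriber id, access enabled)


class ProfileStore:
    """Stand-in for the profile database; remembers the last applied sequence."""

    def __init__(self) -> None:
        self.profiles: Dict[str, bool] = {}
        self.last_applied = -1

    def apply(self, message: Message) -> bool:
        seq, subscriber_id, enabled = message
        if seq <= self.last_applied:
            return False  # already in the database: skip on redelivery
        self.profiles[subscriber_id] = enabled
        self.last_applied = seq
        return True


class Broker:
    """Stand-in for the queue: keeps every message until it is acknowledged."""

    def __init__(self, messages: List[Message]) -> None:
        self.pending = list(messages)

    def ack_up_to(self, seq: int) -> None:
        # Cumulative acknowledgement: one ack confirms all messages up to seq.
        self.pending = [m for m in self.pending if m[0] > seq]


def consume(broker: Broker, store: ProfileStore, batch_size: int,
            fail_after: Optional[int] = None) -> int:
    """Process pending messages, acknowledging once per batch."""
    applied = 0
    for index, message in enumerate(list(broker.pending)):
        if fail_after is not None and index == fail_after:
            return applied  # simulated connection loss before the next ack
        if store.apply(message):
            applied += 1
        if (index + 1) % batch_size == 0:
            broker.ack_up_to(message[0])
    if broker.pending:
        broker.ack_up_to(broker.pending[-1][0])
    return applied


messages = [(i, f"sub-{i % 3}", i % 2 == 0) for i in range(10)]
broker, store = Broker(messages), ProfileStore()
first = consume(broker, store, batch_size=4, fail_after=6)  # connection drops
second = consume(broker, store, batch_size=4)               # redelivery
print(first, second, len(broker.pending))  # 6 4 0
```

After the simulated disconnect the broker redelivers everything since the last acknowledgement; the two messages that had already been written are skipped, and the remaining four are applied, so nothing is lost or applied twice.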
These changes reduced the number of lost packets to zero and should prevent such problems in the future, even in very “aggressive” environments.
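The keepalive setting mentioned in the list above might look roughly like this in Python. The helper name and the concrete values (30 seconds idle, 10 seconds between probes, 3 probes) are assumptions, not the values used in production; the fine-grained TCP options are platform-specific, so the sketch applies them only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 30,
                     interval: int = 10, probes: int = 3) -> None:
    """Turn on TCP keepalive with short intervals on the given socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below exist on Linux; other platforms may lack them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Unlike the STOMP heartbeat, keepalive probes are answered by the operating system's TCP stack, so a consumer thread that is busy writing to the database does not cause them to be missed.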
In addition, following the recommendations of the Latera support engineers, the system's services were distributed across different servers (the money was quickly found). We managed to do this before May 1, the next day of mass billing operations.
Main lesson
Alarm bells signaling possible problems had been ringing back in March, but the problems could not be identified before the planned surge of activity began. Had there been no surge, or had it hit somewhere other than a sluggish RADIUS server with a maximum sequential write speed of 5 MB/s, the engineers would very likely have been able to localize the problem in April. They were short by just a few days.
The conclusion is this: if you suspect problems you do not fully understand in the operation of your systems, and a surge of activity is expected, be prepared for everything to follow the worst-case scenario at that very moment. When such suspicions exist, search for possible problems more intensively, so that you have time to deal with them before the load increases seriously.