One afternoon, the site went down. Right after the reboot it went down again. We knew this wasn't a DDoS attack but organic traffic: the requests were perfectly ordinary, the servers just couldn't keep up. Adding more hardware didn't help. It became clear it was time to optimize our system.
This may be useful to young startups wondering how to cope with growing load while their server software is still immature.
Drimkas makes online cash registers. In 2016 a new version of the law on cash registers was adopted. One of its key changes is that every cash register must send sales data to the tax service in real time. For us this meant that all cash registers now had to be connected to the Internet in order to send receipts to the FTS.
Since our cash registers send sales data anyway, it seemed logical to also collect it in the cloud so that the owner always has remote access to it. That's how the Drimkas Cabinet appeared.
We started implementing. The first idea - a single logical server that all cash registers would talk to directly - was rejected by the developers.
The difficulty is that we have several cash register models running Linux on the market, one on Windows 10, and fiscal registrars with no operating system at all. More devices were planned, but at that point nobody knew what they would run on. This meant supporting different protocol versions and data formats, which would make developing new features overly complicated.
We decided to create an intermediate node - the Hub - which would handle the different protocol versions, cash register models, communication transports and, if necessary, even encodings. The site - the Drimkas Cabinet - would receive normalized data and talk to all devices through one generalized protocol.
To test the hypothesis - do users even need such a service? - we launched the first version. The feedback came in: people asked for reports, exporting data to Excel, product management, an open API for integrators. The project turned out to be in demand, and we kept releasing new features.
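To give an idea of what that normalization could look like, here is a minimal Python sketch. The device models, field names and payload shapes are invented for illustration; the post doesn't describe the real formats.

```python
def normalize_check(device_model, raw):
    """Convert a device-specific payload into one common receipt format.
    Device models and field names here are purely illustrative."""
    if device_model == "linux_pos":
        positions = [{"name": p["title"], "price": p["cost"]} for p in raw["items"]]
    elif device_model == "fiscal_printer":
        positions = [{"name": name, "price": price} for name, price in raw["lines"]]
    else:
        raise ValueError(f"unknown device model: {device_model}")
    return {
        "device_id": raw["serial"],
        "created_at": raw["ts"],
        "positions": positions,
    }
```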
If the Cabinet for any reason failed to accept even one receipt from a pack, the Hub assumed that none of the tasks had been accepted and sent the whole pack again, over and over, until the Cabinet confirmed that the entire pack had been accepted.
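As a rough sketch of that all-or-nothing behaviour (the endpoint, payload shape and retry delay are assumptions, not the real protocol):

```python
import time
import requests  # a generic HTTP client; the real transport isn't described in the post

CABINET_URL = "https://cabinet.example/api/tasks"  # hypothetical endpoint

def send_pack(pack):
    """Keep resending the whole pack until the Cabinet confirms it accepted everything."""
    while True:
        try:
            resp = requests.post(CABINET_URL, json={"tasks": pack}, timeout=30)
            if resp.ok:
                return                 # the entire pack was accepted
        except requests.RequestException:
            pass                       # network error: fall through and retry
        time.sleep(5)                  # wait a little before resending the whole pack
```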
Receipts are the heaviest of all task types. You have to walk through all the positions, create any new products in Postgres, and then store the entire receipt in MongoDB. Why we decided to keep receipts in Mongo is a separate question with its own pros and cons - especially now, when there is a lot of data and we need to run complex queries with aggregations.
The system worked this way while about two thousand cash registers were connected to the Cabinet and receipts made up no more than 15% of all tasks.
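Schematically, processing one receipt might look like the sketch below. The table, collection and field names are assumptions; the actual schema isn't shown in the post.

```python
import psycopg2                      # Postgres driver
from pymongo import MongoClient      # MongoDB driver

pg = psycopg2.connect("dbname=cabinet user=cabinet")         # hypothetical connection string
mongo = MongoClient("mongodb://localhost:27017")["cabinet"]  # hypothetical database

def process_check(check):
    """Create any new products in Postgres, then store the whole receipt in MongoDB."""
    with pg, pg.cursor() as cur:
        for pos in check["positions"]:
            cur.execute(
                """INSERT INTO products (device_id, name, price)
                   VALUES (%s, %s, %s)
                   ON CONFLICT DO NOTHING""",           # skip products that already exist
                (check["device_id"], pos["name"], pos["price"]),
            )
    mongo.checks.insert_one(check)                      # the raw receipt document goes to Mongo
```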
It's interesting how our users' unstable Internet affected us. When the connection drops in a store, the cashier keeps selling and the register accumulates receipts. As soon as the connection comes back, it sends all the data at once, both to the CRF and to the Hub. The load on our servers would spike. As the number of cash registers grew, this load pattern became a real threat.
We started to notice daily spikes that grew worse over time. Then one day the server simply went down and couldn't come back up, because it was immediately flooded with new tasks.
The Hub itself had no problems - it wrote all the data to its database without any heavy logic and passed it on. The trouble was that a pack of tasks with a high share of receipts could take longer to process than nginx's request timeout. The request failed with a timeout error from the Cabinet, the Hub immediately tried to push the same data again even though the Cabinet was still processing the previous pack, and the Hub effectively DoSed the Cabinet until it went down.
Increasing the nginx timeout or reducing the number of tasks sent at once would only buy us a little time: in a month we'd be back at the same problem, but with bigger consequences - more cash registers connected, more people let down.
Looking at the averages, the hardware was powerful enough. The amount of data was modest, and during the day the load peaks alternated with lulls. We needed to smooth out the load so we wouldn't go down at the peaks.
The solution is a task queue: tasks can be added at any rate, with any bursts, and then processed as capacity allows. Since the problem was on the Cabinet's side, we decided to build the queue in there. RabbitMQ was chosen as the queue. All tasks from the Hub went into it one by one, and workers processed them.
The idea turned out so well that the Hub team wanted the same thing on their side, because it too sometimes had to send a hundred thousand tasks at once. So we decided to use RabbitMQ as the transport between the Hub and the Cabinet.
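A minimal sketch of the publishing side, using the pika client (the queue name and connection details are placeholders; the post doesn't show the actual code):

```python
import json
import pika  # one common Python client for RabbitMQ

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="tasks", durable=True)   # the queue survives broker restarts

def enqueue_task(task):
    """Drop a task into the queue and return immediately; workers pick it up later."""
    channel.basic_publish(
        exchange="",
        routing_key="tasks",
        body=json.dumps(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
    )
```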
Previously, if the Hub was down when products were saved on the site, we just showed the user a "Try again later" error. Now the data gets saved regardless. Placing a task into the queue is much faster, because we don't wait for confirmation from the Hub - it picks the task up when it can. And if the Cabinet hasn't finished processing a task and the connection with the worker is lost, the task simply reappears in the queue. RabbitMQ does all of this out of the box; no error handling is needed on our side.
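The requeue-on-disconnect behaviour comes from consumer acknowledgements: the worker acks a task only after it has finished, so an unacked task goes back to the queue if the connection dies. A sketch of the worker side (process_task and the queue name are hypothetical):

```python
import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="tasks", durable=True)
channel.basic_qos(prefetch_count=1)          # hand each worker one task at a time

def handle(ch, method, properties, body):
    task = json.loads(body)
    process_task(task)                               # hypothetical: the heavy processing step
    ch.basic_ack(delivery_tag=method.delivery_tag)   # acknowledge only after success

# auto_ack=False: if the connection drops before the ack,
# RabbitMQ automatically puts the task back into the queue.
channel.basic_consume(queue="tasks", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```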
Seeing how well it worked, we decided to find the upper limit of the transport and ran load tests. The office network resisted but eventually gave out, while Rabbit just kept accumulating tasks and waited for the workers to process them. The conclusion: RabbitMQ's capacity is more than enough for us - the main thing is to give it a bit of SSD and plenty of RAM.
That doesn't mean the original approach of exchanging packs of tasks was wrong. In the early stages of development the main resource is time; if we had worried about the distant future and optimized everything in advance, we might never have shipped at all.
The main thing is to recognize in time that your service has outgrown its old solutions, and that for stable operation and further growth you need to slow down feature development and revisit the project's architecture.
Source: https://habr.com/ru/post/335964/