As of the 1st, we are closing off the ability to create new machines. We have already stopped accepting new customers.
Existing clients' virtual machines will continue to be served without any changes. Also, please don't create a "spare machine just in case": we did not stop accepting new customers because things are going well.
The reason is that we have outgrown the capacity we originally planned for, and rewriting the architecture on the fly is a terrible practice. So we decided to take a timeout and stop chasing the advertising department (which, by the way, is why we went quiet on Habr: we hoped to reduce the flow of visitors a little). People kept coming anyway, and it got absurd: one of the components we had spent a long time carefully designing was written for a ceiling of roughly 10k connections. Testing and fixes (the preproduction process) dragged on for a month... and by the time we rolled that component out, the load was already right at that limit (6-9k connections per second). And we had spent several months writing it!
And it became obvious that we simply couldn't keep up. The decision to stop accepting new customers was not an easy one (you know, arguments in the style of "and what are we going to pay salaries from?", etc.), but sound technical judgment won out over a healthy greed for the company's success.
How long will the rework take? The planned timeframe is about 2-3 months; how long it will actually take, I don't know. For one thing, the architecture needs a serious overhaul: the centralized databases will be removed entirely, and decentralizing everything and anything is a far from trivial task.
Most likely we will not be able to rework the existing configuration in place, so a second copy of the cloud will be launched. What the migration of clients from the first to the second will look like is, again, something I don't know yet (I haven't even started thinking about it).
Now about the accidents. Yes, we were that "lucky": three unrelated failures in a row. One on Sunday, the second on Tuesday, the third on Friday. Who is to blame? Well, that depends on who you ask, but in reality it is us. All the failures were software-related (though not our own software); we cannot even nod in the direction of crooked electricians, cleaning ladies, and other traditional scapegoats.
For those who are curious what it looked like (sorry for the quality, high-quality photography was the last thing on our minds):
Accident 1 - 150 clients affected:
Uptime at the time of the failure: 4 months 24 days. This was the first failure since the system went into production.
Accident 2 - 391 clients affected:
Uptime: 6 months 4 days since the previous failure. (Back then, because of a bug in the NFS server, we had to force-reboot all the virtual machines and ask people to remove the NFS entries from /etc/fstab.)
Accident 3 - 398 clients affected:
The same storage; uptime at the time of the failure: 2 days 4 hours.
Eliminating such bottlenecks is the second task we will be tackling during the timeout we have taken.
The client data storage model we adopted did not account for a complete and unconditional halt of the system core. We had planned for controlled reboots, crashes of individual services, the death of disks in a multiply redundant RAID (and even the death of a SAS controller). But we had not foreseen this kind of "surprise".
That was our mistake, and I am the one responsible for it, because I relied on the assumption that we would at least be able to find out that the service had stopped. Fixing this will be one of the main things I work on during the overhaul of the cloud.
What's the problem?
When an accident happens, customers start taking a lot of actions: restarting their machines, trying to turn them off and on over and over.
Visually nothing happens, but inside, the system remembers everything. As a result, the task queue for some machines grows to 50-100 tasks. And while we did learn to merge identical tasks (if a client asked for a reboot three times, only one reboot actually needs to happen), a varied sequence of tasks is still executed exactly as requested. Yes, if you asked to reboot the machine, turn it off, turn it on, reboot it, turn it off, and turn it on again, that is exactly what will be done.
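A minimal sketch of what that coalescing amounts to (hypothetical names and data structures; our real scheduler is, of course, not a few lines of Python): consecutive duplicates of the same command collapse into one task, while a varied sequence still goes through in full.

from collections import defaultdict

# vm_id -> list of pending commands (a stand-in for the real task queue)
queues = defaultdict(list)

def enqueue(vm_id, command):
    """Queue a command, dropping it if it merely repeats the last pending one."""
    pending = queues[vm_id]
    if pending and pending[-1] == command:
        return  # "reboot, reboot, reboot" collapses into a single reboot
    pending.append(command)

# Three identical requests produce one task...
for _ in range(3):
    enqueue("vm-150", "reboot")
assert queues["vm-150"] == ["reboot"]

# ...but a varied sequence is executed exactly as asked.
for cmd in ["reboot", "shutdown", "start", "reboot", "shutdown", "start"]:
    enqueue("vm-151", cmd)
assert queues["vm-151"] == ["reboot", "shutdown", "start",
                            "reboot", "shutdown", "start"]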
And when there are several hundred such clients, it gets unpleasant, especially when all the requests arrive almost simultaneously. The pool master simply ran out of resources: 800% CPU load and a queue of several hundred tasks.
But we are simply not ready to split things across several pool masters. Not yet. That is another of the tasks we will be thinking over.
upd: the article was published without my participation; the pictures will appear tomorrow.