11:55:22 14.06.2011 I write:
Good day.
VPS 11111-1 has not been working for 4 days now.
What is the problem? When will the server be working again?
11:56:55 14.06.2011 The support service (Alexandra M * shuk ***) writes:
Has the server been in the "request processing" state for four days?
12:01:42 14.06.2011 The support service (Alexandra M * shuk ***) writes:
Check now. The server is up and reachable over SSH.
14:23:25 14.06.2011 I write:
It is available now. What was the problem?
14:27:27 14.06.2011 The support service (Alexandra M * shuk ***) writes:
The problem was on our side. Sorry for the inconvenience.
Then I asked about compensation:
Hello.
Under the public offer agreement of May 23, compensation is paid only to customers who have signed a contract with Clodo for fault-tolerant services. Unfortunately, we have to decline your request for compensation. We are very sorry about the incident. We have done everything we can to make sure nothing like it happens again. If you are interested in a high-availability solution deployed across two data centers, we can provide one within a month.
Explanation of the incident of 10.06.2011
As you know, Clodo is reworking its cluster management structure. Given our past bad experience, we carry out this work at night and break it up into atomic, non-destructive steps.
The night before the incident, our engineers carried out work to remove one of the InfiniBand switches from the cluster. The operation had been tested beforehand, and once it was complete we verified again that nothing was broken. No further work was done after that.
However, a considerable time later, virtual machines began to go down. The cause was a failure of the IP over InfiniBand (IPoIB) driver on SUSE Linux Enterprise Server, which is installed on our Xen nodes, cluster controller and relays. Unfortunately, the failure was severe: the virtual machines did not come back up automatically and had to be started manually. On top of that, the startup scripts needed emergency changes, so the virtual machines did not come back as quickly as we would have liked. A small number of virtual machines (10-15) lost the connection to their disks as a result of the failure. Restoring these machines took longer.
The failure was our fault. Its main causes:
insufficient testing before the operation (time-delayed errors were not ruled out);
incompletely tested interaction between SUSE Linux Enterprise Server and InfiniBand;
startup scripts that were not prepared for a failure of this kind.
All of these errors come down to the human factor. Those responsible have been barred from performing any operations on production servers.
Small FAQ
Why did the support team respond so slowly?
The answer is simple: during an incident, the load on technical support increases many times over. The support team was also brought in to help resolve the incident: it compiled lists of affected customers, monitored the restarts once the problems were fixed, and helped the system administrators in every way it could.
Why did nobody communicate with me on the forum / Habrahabr / Twitter during the incident and tell me what was happening?
All the people who could accurately understand and explain what was happening were busy dealing with the consequences. Taking time away from the people solving the problem would have been extremely unwise.
Will something like this happen again?
At the moment, all work is suspended. The only thing being done is restoring functionality that was disabled for the duration of the work. A new preparation procedure has been approved, which includes testing the health of the entire system as a whole after any planned change to its nodes.
On my own behalf, I want to apologize to those whose messages to my mailbox went unanswered. I am physically unable to process such a stream of mail.
Clodo CEO
Maxim Dyubarev
Source: https://habr.com/ru/post/121080/