
Clodo.ru and another mysterious outage


My server has been down since 13:30, and there is no response from tech support. The forum is unavailable too, so I can't find out the reason there. Apparently this is their calling card: go down without warning and come up with an explanation later.



Dear clodo.ru employees, where are you? Where is the email, the ticket message, or any other notice explaining why we are down and for how long?






UPDATE June 15: Compensation.



Today a Habr user sent me a very interesting message about his attempt to get compensation:

11:55:22 14.06.2011 I write:

Good day.

Our VPS 11111-1 has not been working for 4 days now.

What is the problem? When will the server work?

11:56:55 14.06.2011 The support service (Alexandra M * shuk ***) writes:

Has the server been in the "request processing" state for four days already?

12:01:42 14.06.2011 The support service (Alexandra M * shuk ***) writes:

Check now. It is reachable over SSH and working.

14:23:25 14.06.2011 I write:

It is available now. What was the problem?

14:27:27 14.06.2011 The support service (Alexandra M * shuk ***) writes:

The problem was on our side. Sorry for the inconvenience.



Then I asked about compensation:



Hello.

Under the public offer agreement dated May 23, compensation is paid only to customers who have signed a contract for fault-tolerant services with Clodo. Unfortunately, we have to decline your compensation request. We are very sorry about the incident. We have done everything we can so that nothing like this happens again. If you are interested in a high-availability solution deployed across two data centers, we can provide it within a month.


Clodo, Clodo, true to form as always: "we owe nobody anything." You have an excuse for everything ... in the contract. Aren't you ashamed to work for a company like this?





UPDATE June 14: It's been 4 days.



Explanation of the 10.06.2011 incident

As you know, Clodo is reworking its cluster management structure. Given all our sad experience, we carry out this work at night and break it up into atomic, non-destructive steps.



The night before the incident, our engineers removed one of the InfiniBand switches from the cluster. This operation was tested beforehand, and after it was completed we verified once more that nothing was broken. No further work was done after that.



However, a considerable time later, virtual machines began to go down. The cause was a failure of the IP over InfiniBand (IPoIB) driver on the SUSE Linux Enterprise Server installations running on our Xen nodes, cluster controller, and relays. Unfortunately, the crash was severe enough that the virtual machines did not come back up automatically and had to be started manually. Moreover, the startup scripts needed emergency changes, so the virtual machines did not come back as quickly as we would have liked. A small number of virtual machines (10-15) lost the connection to their disks as a result of the failure. Restoring these machines took longer.



The failure was our fault. Its main causes:



insufficient testing before the operation (time-delayed errors were not ruled out);

incompletely tested interaction between SUSE Linux Enterprise Server and InfiniBand;

startup scripts that were not prepared for an incident of this kind.



All of these errors come down to the human factor. Those responsible have been barred from performing any operations on production servers.



Small FAQ



Why did the support team respond so slowly?

The answer is simple: during an incident, the load on technical support increases many times over. The support team was also brought in to help resolve the incident: it compiled lists of affected customers, monitored machine startup after the problems were fixed, and helped the system administrators in every way it could.



Why did nobody communicate with me on the forum / Habrahabr / Twitter during the incident or tell me what was happening?

Everyone who could accurately understand and explain what was happening was busy dealing with the consequences. Taking up the time of the people solving the problem would have been extremely unwise.



Will something like this happen again?

At the moment, all work is suspended. The only thing being done is restoring functionality that was disabled for the duration of the work. A new procedure for preparing such work has been approved, which includes testing the health of the system as a whole after any planned change to its nodes.



Personally, I want to apologize to those whose messages to my mailbox I did not answer. I physically cannot keep up with such a stream of mail.



Clodo CEO



Maxim Dyubarev




For my part, I want to ask: Maxim, do you really think this is a worthy explanation? ... Then again, what am I saying: you explain things the same way you do them. You have only made things worse for yourselves.





UPDATE 19:45: my remaining 2 servers, which are on different accounts, are working.



UPDATE 19:15: the story continues. I have already moved my main project, but judging by the comments, the problem is still there. At the moment, the Clodo forum looks like this:



And the graphs for one of my servers over the last 6 hours look like this:





Are you leaving and not even saying goodbye?



UPDATE 07:10: more than 18 hours. All servers are up. It's time to move! Apparently the Clodo employees have had a good sleep, come in to work, and turned the servers back on :)



UPDATE 04:30: more than 15 hours. I would like to see the names of those responsible for the outage. It is almost 5 o'clock in the morning, and I am waiting for them to bring the server up so I can grab the latest backups. But something tells me that the Clodo staff went home to sleep long ago ...



UPDATE 02:00: we have been down for 13 hours. Dear Clodo staff, this is simply a mess and disrespect for your customers! I hope that after this outage all or most of your customers will leave you. Across all your downtime on all my servers, I have lost more money than your hosting costs altogether! Moreover, the reputation of one project has been damaged in a way that no money can repair. After each of your outages I gave my users whatever compensation and bonuses I could. What I can do now, I cannot imagine.

With this uptime, your cloud service is worth nothing. To say that I hate you is an understatement. Thank you once again for your quality and your attitude.



UPDATE 23:00: we have been down for 10 hours. No, gentlemen, this is going too far. Everything is so fault-tolerant that you can't bring it back up in 10 hours.


Source: https://habr.com/ru/post/121080/


