
The vendor's responsibility. Who answers for the outage?

Last Thursday our service experienced the biggest outage in its history. One of the installations at M9 remained inaccessible from the external network for several hours. What happened? What is to be done? What should a responsible vendor do? How do you preserve the reputation of the telephone platform vendor №1?

They say that an error corrected in time is no longer an error. Well, let's test that aphorism in practice. Marketers usually write a publication plan a month ahead, and the ideology of each post is something like the motivating "Faster, higher, stronger". That is the accepted way, and we are no exception, but this time we will step away from marketing dogma and honestly tell you about the problem that occurred, without the rustling pre-Christmas PR wrapping.



What happened?


It so happened that the cluster where ITooLabs Communications Server lives went down. True, it did not go down completely. More than 40 large operator platforms are "spinning" in our cloud, and the failure of even one segment means, offhand, five thousand companies left without their "Hello?". Being a vendor with 10% of the market is not only fun; it is also a huge responsibility. We never had any particular illusions that our market allows you to relax, but this incident made us feel the full weight of responsibility for the "telephone" life of a huge number of subscribers. All previous incidents had passed almost unnoticed by our partners and were promptly resolved; uptime matched the figures stated in the contracts. There was no serious reason to believe that the resiliency of ITooLabs needed rethinking. Until this last event.
All nodes of our platform are redundant. Some are in hot standby, some share the total load, but there is no single point of failure. This, of course, also applies to the network infrastructure: all switches are installed in pairs, every server is connected to two switches, all external links are duplicated. For a serious failure you need either something extraordinary, such as a DDoS attack or the Great Blackout of 2005, ...or a stupid little mistake. We made a mistake. We will not go deep into technical details. For those who are in the subject, let's just say that the process descriptions for our infrastructure engineers contained a record (written, of course, in blood): "make sure that VTP is disabled on the external link". But it so happened that around noon Moscow time, VLANs formed on both external switches at once. All VLANs. Most likely, a fragment of an upstream operator's problem that happened that day reached us - we were able to reproduce a similar situation on a simulator - and as a result the service was no longer accessible from anywhere.
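For those who like their runbook records executable rather than written in blood: below is a minimal sketch of how such a check could be automated. It is only an illustration under our assumptions - Cisco IOS switches reachable over SSH and the netmiko Python library; the hostnames and credentials are made-up placeholders, not our real topology.

```python
# A minimal sketch of the check implied by the runbook record
# "make sure that VTP is disabled on the external link".
# Assumes Cisco IOS switches over SSH and the netmiko library;
# hostnames and the account are hypothetical.
from netmiko import ConnectHandler

EDGE_SWITCHES = ["edge-sw1.example.net", "edge-sw2.example.net"]  # hypothetical

def vtp_mode(host: str) -> str:
    """Return the VTP operating mode reported by `show vtp status`."""
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username="audit",      # hypothetical read-only account
        password="secret",
    )
    try:
        output = conn.send_command("show vtp status")
    finally:
        conn.disconnect()
    for line in output.splitlines():
        if "VTP Operating Mode" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    for host in EDGE_SWITCHES:
        mode = vtp_mode(host)
        # Transparent/Off means the switch will not accept a foreign
        # VLAN database pushed over a trunk; anything else is a red flag.
        status = "OK" if mode.lower() in ("transparent", "off") else "ALERT"
        print(f"{host}: VTP mode = {mode} [{status}]")
```

The point of putting such a check into monitoring rather than into a wiki is that it complains about a misconfigured edge switch before a foreign VLAN database does.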

Our SaaS colleagues are, without a doubt, familiar with this chilling feeling when, suddenly, everything breaks and all that remains is the roar of the siren (of course, we have a siren in our office), panic, the first call from a client, and an empty head in which, for some reason, only two words begin to surface - starting with the letters "" and "".

We run many different drills: sudden failure of a switch, urgent commissioning of a new node to absorb the pre-New Year load. But we never had the scenario "both paired switches die at once".

It took us a few hours to reach the console through emergency channels, to understand what had happened, and, finally, to physically get to the site. All that time every resource was thrown at finding and fixing the fault, and support could only say: "We have a connectivity problem with the outside world. We cannot yet give a recovery time." This caused our operators extreme inconvenience, and then outright negativity.

Once the problem was understood, the fix took a few minutes. Calls went through right away, and the web interfaces came back an hour later.

But the damage was done.

What have we done?


Now for the most important part of today's post-mortem. We know for certain that the ITooLabs blog is read by all our partners, and we want to report in an open format on the conclusions we have drawn and the measures we are going to take. By doing so we make it clear that we strive for openness and transparency and are not going to hide behind email replies or a soulless "complaint handling" script.

It is clear that after service availability was restored, personal apologies were sent to all our partner operators, and there is no particular point in apologizing endlessly. Still, once again: we apologize to everyone affected by the problem.

But we are once again convinced of the correctness of the PaaS and revenue-sharing models. Under them, the vendor's responsibility for platform downtime is always at its maximum, so the vendor has every incentive to provide the highest level of service. We have witnessed many outages at telecom operators where the operations team could not immediately understand what had happened and waited long for a response from the slow vendor's technical support. We do not want to be, and will not be, a classic vendor. We monitor all installations, and if something bad happens, we react immediately and fix the problem "at the root". This ensures both the rapid development of ITooLabs and the peace of mind of telecom operators: if problems arise, there is a responsible vendor's operations service that reacts ahead of the curve and eliminates them.
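To make "we monitor all installations" slightly less abstract, here is a minimal sketch of an outside-in reachability probe - the kind of check suited to this particular failure mode, where the platform itself was healthy but unreachable from the outside. The installation names, hosts, and ports are hypothetical placeholders; a real probe would run from several external vantage points and feed an alerting system.

```python
# A minimal sketch of an outside-in availability probe. Assumes each
# installation exposes a SIP signalling port (TCP 5060 here) reachable
# from an external vantage point; endpoints are hypothetical.
import socket

INSTALLATIONS = {  # hypothetical installation endpoints
    "m9-cluster": ("m9.example.net", 5060),
    "m10-cluster": ("m10.example.net", 5060),
}

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in INSTALLATIONS.items():
        ok = reachable(host, port)
        print(f"{name}: {'UP' if ok else 'DOWN'} ({host}:{port})")
```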

Many thanks to everyone who supported us in this difficult situation, and to all the engineers at our partners' operations services who kept complete calm and professionalism in a rather tense situation.

To be continued.

Source: https://habr.com/ru/post/273225/

