
The vendor's responsibility. Who answers for the outage?

Last Thursday our service experienced the biggest outage in its history. One of the installations at M9 remained inaccessible from the external network for several hours. What happened? What is to be done? What should a responsible vendor do? How do you preserve the reputation of the telephone platform vendor №1?

They say that an error corrected in time is no longer an error. Well, let's test that aphorism in practice. Marketers usually write a publication plan a month ahead, and the ideology of each post is something like the motivating "Faster, higher, stronger". That is the accepted way, and we are no exception, but this time we will step away from marketing dogma and honestly tell you about the problem that occurred, without the rustling pre-Christmas PR wrapping.



What happened?


It so happened that the cluster where ITooLabs Communications Server lives went down. True, it did not go down completely. More than 40 large operator platforms are "spinning" in our cloud, and the failure of even one segment means, offhand, five thousand companies left without their "Hello?". Being a vendor with 10% of the market is not only fun; it is also a huge responsibility. We never had any particular illusions that our market allows you to relax, but this incident made us feel the full weight of responsibility for the "telephone" life of a huge number of subscribers. All previous incidents had passed almost unnoticed by our partners and were promptly resolved; uptime matched the figures stated in the contracts. There was no serious reason to believe that the resiliency of ITooLabs needed rethinking. Until this last event.
All nodes of our platform are redundant. Some are in hot standby, some share the total load, but there is no single point of failure. This, of course, also applies to the network infrastructure: all switches are installed in pairs, every server is connected to two switches, all external links are duplicated. For a serious failure you need either something extraordinary, such as a DDoS attack or the Great Blackout of 2005, ...or a stupid little mistake. We made a mistake. We will not go deep into technical details. For those who are in the subject, let's just say that the process descriptions for our infrastructure engineers contained a record (written, of course, in blood): "make sure that VTP is disabled on the external link". But it so happened that around noon Moscow time, VLANs formed on both external switches at once. All VLANs. Most likely, a fragment of an upstream operator's problem that happened that day reached us - we were able to reproduce a similar situation on a simulator - and as a result the service was no longer accessible from anywhere.
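For those who like their runbook records executable rather than written in blood: below is a minimal sketch of how such a check could be automated. It is only an illustration under our assumptions - Cisco IOS switches reachable over SSH and the netmiko Python library; the hostnames and credentials are made-up placeholders, not our real topology.

```python
# A minimal sketch of the check implied by the runbook record
# "make sure that VTP is disabled on the external link".
# Assumes Cisco IOS switches over SSH and the netmiko library;
# hostnames and the account are hypothetical.
from netmiko import ConnectHandler

EDGE_SWITCHES = ["edge-sw1.example.net", "edge-sw2.example.net"]  # hypothetical

def vtp_mode(host: str) -> str:
    """Return the VTP operating mode reported by `show vtp status`."""
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username="audit",      # hypothetical read-only account
        password="secret",
    )
    try:
        output = conn.send_command("show vtp status")
    finally:
        conn.disconnect()
    for line in output.splitlines():
        if "VTP Operating Mode" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    for host in EDGE_SWITCHES:
        mode = vtp_mode(host)
        # Transparent/Off means the switch will not accept a foreign
        # VLAN database pushed over a trunk; anything else is a red flag.
        status = "OK" if mode.lower() in ("transparent", "off") else "ALERT"
        print(f"{host}: VTP mode = {mode} [{status}]")
```

The point of putting such a check into monitoring rather than into a wiki is that it complains about a misconfigured edge switch before a foreign VLAN database does.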

Our SaaS colleagues are, without a doubt, familiar with this chilling feeling when, suddenly, everything breaks and all that remains is the roar of the siren (of course, we have a siren in our office), panic, the first call from a client, and an empty head in which, for some reason, only two words begin to surface - starting with the letters "" and "".

We run many different drills: sudden failure of a switch, urgent commissioning of a new node to absorb the pre-New Year load. But we never had the scenario "both paired switches die at once".

It took us a few hours to reach the console through emergency channels, to understand what had happened, and, finally, to physically get to the site. All that time every resource was thrown at finding and fixing the fault, and support could only say: "We have a connectivity problem with the outside world. We cannot yet give a recovery time." This caused our operators extreme inconvenience, and then outright negativity.

Once the problem was understood, the fix took a few minutes. Calls went through right away, and the web interfaces came back an hour later.

But the damage was done.

What have we done?


Now for the most important part of today's post-mortem. We know for certain that the ITooLabs blog is read by all our partners, and we want to report in an open format on the conclusions we have drawn and the measures we are going to take. By doing so we make it clear that we strive for openness and transparency and are not going to hide behind email replies or a soulless "complaint handling" script.

It is clear that after service availability was restored, personal apologies were sent to all our partner operators, and there is no particular point in apologizing endlessly. Still, once again: we apologize to everyone affected by the problem.

But we are once again convinced of the correctness of the PaaS and revenue-sharing models. Under them, the vendor's responsibility for platform downtime is always at its maximum, so the vendor has every incentive to provide the highest level of service. We have witnessed many outages at telecom operators where the operations team could not immediately understand what had happened and waited long for a response from the slow vendor's technical support. We do not want to be, and will not be, a classic vendor. We monitor all installations, and if something bad happens, we react immediately and fix the problem "at the root". This ensures both the rapid development of ITooLabs and the peace of mind of telecom operators: if problems arise, there is a responsible vendor's operations service that reacts ahead of the curve and eliminates them.
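To make "we monitor all installations" slightly less abstract, here is a minimal sketch of an outside-in reachability probe - the kind of check suited to this particular failure mode, where the platform itself was healthy but unreachable from the outside. The installation names, hosts, and ports are hypothetical placeholders; a real probe would run from several external vantage points and feed an alerting system.

```python
# A minimal sketch of an outside-in availability probe. Assumes each
# installation exposes a SIP signalling port (TCP 5060 here) reachable
# from an external vantage point; endpoints are hypothetical.
import socket

INSTALLATIONS = {  # hypothetical installation endpoints
    "m9-cluster": ("m9.example.net", 5060),
    "m10-cluster": ("m10.example.net", 5060),
}

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in INSTALLATIONS.items():
        ok = reachable(host, port)
        print(f"{name}: {'UP' if ok else 'DOWN'} ({host}:{port})")
```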

Many thanks to everyone who supported us in this difficult situation, and to all the engineers at our partners' operations services who kept complete calm and professionalism in a rather tense situation.

To be continued.

Source: https://habr.com/ru/post/273225/

