
Problems with services on September 24–25

First of all, we want to formally apologize for the biggest downtime in the history of Selectel. Below we will try to reconstruct the chronology of events in detail, describe what has been done to prevent such situations in the future, and explain the compensation for clients affected by these problems.


First failure


The problems began on the evening of Monday, September 24 (downtime 22:00–23:10). From the outside it looked like a loss of connectivity with the St. Petersburg segment of the network. This failure affected all of our Internet services in St. Petersburg; the Moscow network segment, as well as the local ports of servers, continued to work. The DNS servers located in St. Petersburg (ns1.selectel.org and ns2.selectel.org) were also unavailable, while the Moscow DNS server (ns3.selectel.org) was not affected by this failure. Because connectivity was lost, access to the site and the control panel was gone, the bulk of the load fell on telephony, and many clients therefore could not get through to a support response.
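
For illustration, the quickest way to see which of the nameservers is still answering is to query each of them directly; a sketch using dig (the queried name and record type are just examples):

 # query the Moscow nameserver directly; ns1/ns2 can be checked the same way
 dig @ns3.selectel.org selectel.org SOA +time=2 +tries=1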

While analyzing the situation, we quickly established that the problem was caused by incorrect operation of the aggregation-level switches: two Juniper EX4500s combined into a single virtual chassis. Visually everything looked operational, but on connecting to the console we saw a large number of messages which, however, did not allow us to determine the exact cause of the problem:
 Sep 24 22:02:02 chassism[903]: CM_TSUNAMI: i2c read on register (56) failed
 Sep 24 22:02:02 chassism[903]: cm_read_i2c errno: 16, device: 292

In effect, all optical 10G Ethernet ports in the aggregation-level switch chassis stopped working:

 Sep 24 22:01:49 chassisd[952]: CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic(0/3)
 Sep 24 22:01:49 craftd[954]: Minor alarm set, FPC 0 PEM 0 Removed
 Sep 24 22:01:49 craftd[954]: Minor alarm set, FPC 0 PEM 1 Removed
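
For reference, this is roughly how the switch state can be inspected from the JunOS console with standard operational-mode commands (a sketch; output omitted):

 # list active chassis alarms
 show chassis alarms
 # link state of the 10G Ethernet ports
 show interfaces terse | match xe-
 # most recent syslog entries
 show log messages | last 50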

After the reboot everything worked stably. Since the network configuration had not been changed for a long time, and no work had been carried out before the incident, we decided it was a one-off problem. Unfortunately, only 45 minutes later the same switches stopped responding again and had to be reset once more (23:55–00:05).

Decreasing the switch priority in the virtual chassis


Since in both cases it was the first of the two switches in the virtual chassis that failed while the second kept working, we assumed that the problem lay in the first one. The virtual chassis was reconfigured so that the second switch became the master, while the first remained only as a backup. In between these operations the switches had to be reset once again (00:40–00:55).
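
A minimal configuration sketch of such a re-prioritization on an EX virtual chassis (member IDs here are assumptions; by default both members have mastership priority 128, and the higher value wins):

 # in configuration mode: make member 1 the preferred master, demote member 0
 set virtual-chassis member 1 mastership-priority 255
 set virtual-chassis member 0 mastership-priority 1
 commit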

Disassembling the virtual chassis and transferring all links to one switch


About an hour later another failure showed that these measures were not enough. After freeing up and re-patching part of the port capacity, we decided to disconnect the failing device from the virtual chassis entirely and transfer all links to the "healthy" switch. By about 4:30 am this was done (02:28–03:01, 03:51–04:30).
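
For reference, the membership and the virtual-chassis links can be checked from the CLI before a member is physically detached (standard JunOS operational commands, output omitted):

 # members of the virtual chassis, their roles (Master/Backup) and status
 show virtual-chassis status
 # ports currently used as virtual-chassis links
 show virtual-chassis vc-port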

Replacing the switch with a spare


However, an hour later this switch also stopped working. While it was still limping along, an identical, completely new switch was taken from the reserve, installed and configured, and all traffic was transferred to it. Connectivity reappeared and the network came back up (05:30–06:05).
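
In sketch form, bringing up the spare with the existing configuration amounts to loading a saved configuration file onto it in configuration mode (the file path here is an assumption):

 # replace the candidate configuration on the spare with the saved one and apply it
 load override /var/tmp/aggregation-switch.conf
 commit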

JunOS update


Three hours later, at around 9 am, it all happened again. We decided to install a different version of the operating system (JunOS) on the switch. After the update everything worked (08:44–09:01).
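
Installing a different JunOS release on an EX switch is a single operational-mode command; a sketch (the package name and version are purely illustrative):

 # install the new JunOS image and reboot into it
 request system software add /var/tmp/jinstall-ex-4500-11.4R5-domestic-signed.tgz reboot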

Fiber break between data centers


Closer to 12:00 all cloud servers had been started. But at 12:45 the optical signal was lost in the cable that joins the network segments in the different data centers. At that point, because one of the two core switches had been taken out of service, the network was running over a single primary route, with the backup route disconnected. This led to a loss of connectivity in the cloud between the host machines and the storage systems (the storage network), as well as to the unavailability of the servers located in one of the St. Petersburg data centers.

When the emergency team arrived at the site of the cable damage, it turned out that the cable had been shot with an air rifle by hooligans, who were caught and handed over to the police.
The obvious course of action was to switch to the second channel without waiting for the fiber on the first one to be repaired. This was done quickly enough, but no sooner had everything started working than the switch hung again (12:45–13:05).

Optical SFP+ transceivers


This time, in the new version of JunOS, intelligible messages appeared in the logs, and it was possible to find a complaint about being unable to read the service information of one of the SFP+ modules:

 Sep 25 13:01:06 chassism[903]: CM_TSUNAMI[FPC: 0 PIC: 0 Port: 18]: Failed to read SFP+ ID EEPROM
 Sep 25 13:01:06 chassism[903]: xcvr_cache_eeprom: xcvr_read_eeprom failed - link: 18 pic_slot: 0

After this module was removed, the network recovered. We assumed that the problem lay in this transceiver and in the switch's reaction to it, since this transceiver had been in each of the three switches that we had replaced one after another.
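
A sketch of how the optics for the port from the log above can be examined from the CLI (standard JunOS commands, output omitted):

 # digital optical monitoring data for the suspect port
 show interfaces diagnostics optics xe-0/0/18
 # inventory of installed transceivers, including vendor and serial numbers
 show chassis hardware detail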

However, after 3 hours the situation repeated itself. This time there was no indication of a failed module in the messages, so we immediately decided to replace all the transceivers with new ones from the reserve, but that did not help either. We began checking all the transceivers in turn, pulling them out one at a time, and found another faulty transceiver, this time from the new batch. Having made sure that the problem with the switches was resolved, we re-patched the internal connections to return to the normal operating scheme (16:07–16:31, 17:39–18:04, 18:22–18:27).

Recovery of cloud servers


Since the scale of the problem was initially unclear, we tried several times to bring the cloud servers back up. The machines located on the new storage (SR uuids beginning with d7e… and e9f…) got through the first failures with nothing worse than a loss of Internet access. Cloud servers on the old storage, alas, received I/O errors on their disks. Very old virtual machines switched to read-only mode, while newer-generation machines have the errors=panic option in fstab, which terminates the machine when such an error occurs. After several restarts we unfortunately reached a point where preparing the hosts to launch VMs took an unacceptably long time (massive I/O errors on LVM are rather unpleasant: in some cases a dying virtual machine turns into a zombie, and catching and terminating each one requires manual work). It was decided to power-cycle the hosts. This forced a reboot of the virtual machines on the new storage, which we really did not want to do, but it let us cut the start-up time for all the others significantly (by at least a factor of three). The storage systems themselves remained without network activity and with their data intact.
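
For reference, the behaviour described above comes from the ext filesystem mount option errors=panic; a sketch of the corresponding /etc/fstab line inside a guest (the device name and filesystem type are assumptions):

 # root filesystem: on a filesystem error the kernel panics instead of remounting read-only,
 # so the VM stops rather than continuing with a damaged disk
 /dev/xvda1   /   ext4   defaults,errors=panic   0   1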

Measures taken


Even though we kept spare equipment in the data centers, the network was built with redundancy, and a number of other measures were in place to ensure stability and uninterrupted operation, this situation caught us off guard.

As a result, it was decided to take the following measures:
  1. Enhanced testing of optical transceivers and network equipment in a test environment;
  2. Armored (Kevlar-reinforced) fiber-optic cable in places at risk of damage from vandalism;
  3. Accelerated completion of the cloud server infrastructure upgrade.


Compensation


The question that interests everyone. The table below shows the amount of compensation for the various types of services, as a percentage of the monthly cost of the service, in accordance with the SLA.

Given that the formal downtime of the services was less than indicated in the table (connectivity periodically appeared and disappeared), it was decided to round the downtime up.
 Service                                          Downtime   Compensation
 Virtual dedicated server                         11 hours   30%
 Dedicated server / custom-configuration server   11 hours   30%
 Equipment placement (colocation)                 11 hours   30%
 CMS hosting                                      11 hours   30%
 Cloud servers                                    24 hours   50%
 Cloud storage                                    11 hours   50%


Once again, we apologize to everyone affected by this incident. We fully understand how badly network unavailability affects clients, but we could not resolve the problems any faster because the situation was so unusual. We took every possible action to fix the problems as quickly as we could, but unfortunately it proved impossible to pinpoint the exact cause analytically, and we had to search for it by going through all the possible options, which in turn took a lot of time.

Source: https://habr.com/ru/post/152351/

