
A quality takeoff after the "fall", or why so many bytes went "down"

On August 23, 2012, from 02:00 to 16:30, part of the PolyByte network did not function correctly, which resulted in partial or complete loss of connectivity for about a third of the company's customers. To dispel any rumors while the event is still fresh, we decided to explain what happened and what has been done to make sure it does not happen again.


I'll start from afar. About a year ago, rapid traffic growth began across the entire PolyByte network. This was driven by reasonable tariffs for bandwidth and traffic, good connectivity within Russia, and the overall growth in the amount of client equipment hosted in our data centers. As traffic increased, the Cisco Catalyst 6500 and 7600 series switches and routers we installed in 2007 and 2008 became insufficient for further growth. The reason is simple: 2x20 Gbit/s of fabric per slot, and therefore only 4 full-speed 10G ports per slot, is the hard limit. So at the beginning of 2012 we planned to move the network core to Juniper routers and to upgrade the network as a whole: to rebuild our "ring" connecting the nodes at MMTS-9, MMTS-10 and the data centers, in order to offer customers 10 Gbit/s connectivity and, accordingly, to pass their traffic to the outside world at the same speed.
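For the curious, here is a back-of-the-envelope sketch of that per-slot arithmetic. It is only an illustration built on the commonly quoted fabric figures for these chassis, not a statement about our exact line cards:

```python
# Rough per-slot capacity math for a Catalyst 6500/7600 line card.
# Illustrative only; real throughput depends on the card and fabric mode.

FABRIC_PER_SLOT_GBPS = 2 * 20   # dual 20 Gbit/s fabric channels per slot
PORT_SPEED_GBPS = 10            # one 10 Gigabit Ethernet port

full_speed_ports = FABRIC_PER_SLOT_GBPS // PORT_SPEED_GBPS
print(f"Fabric per slot: {FABRIC_PER_SLOT_GBPS} Gbit/s")
print(f"Non-oversubscribed 10G ports per slot: {full_speed_ports}")
# -> 4 ports; anything denser is oversubscribed, which is exactly
#    why the platform ran out of headroom for our traffic growth.
```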


Juniper MX960 3D
Having received the necessary equipment (DWDM multiplexers, DWDM SFP+ modules, 10 Gbit/s switches, Juniper routers), we moved the "ring" onto the new hardware. On July 5, 2012 we successfully replaced the router at our MMTS-9 site, and almost none of the data center clients noticed it. Not that the work was easy: it was the core router, after all!

For August 23, 2012 we planned the next router replacement. This time the task was much harder: we had to switch over a dozen access switches and about 130 client connections plugged directly into the router. We prepared quite thoroughly: a separate switch was added to our ring, and clients were moved onto it in several stages. These clients were also routed through another router. On the night of August 23 we planned to move the access switches onto that same "piece of ring" and hand their clients over to other routers, so the total downtime for those customers would be under an hour. The 130 direct connections, however, had nowhere to go: they would have to wait until the new Juniper was brought up. For the reader I will also note that those 130 connections include not only 1 Gbit/s ports but 10 Gbit/s ports as well.


Juniper EX8216

At 02:00, according to plan, we started moving client routing to the other router and switching over the access switches. However, after the connections had been transferred and we began dismantling the Cisco Catalyst router, strange problems started on the backup switch: it ran out of memory and its CPU came under periodic heavy load. We tried to solve the problem, but it yielded only partially. As a result, some of the access switches were left without network access, and we could not roll everything back. We are still investigating, because the very same switch in the very same configuration had previously passed about 15 Gbit/s of traffic without any strain.
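We now keep a much closer eye on that class of switch. A minimal sketch of the kind of watchdog we mean, polling the device's CPU and memory tables over SNMP with the Net-SNMP command-line tools; the host and community strings are placeholders, and the base OIDs (CISCO-PROCESS-MIB and CISCO-MEMORY-POOL-MIB) should be checked against the MIBs for your software version:

```python
# Minimal monitoring sketch: walk a switch's CPU and memory tables over SNMP
# and print the raw values. Requires the Net-SNMP tools (snmpwalk) installed.
# HOST and COMMUNITY are placeholders; verify the OIDs against your platform.
import subprocess

HOST = "192.0.2.10"      # placeholder management address
COMMUNITY = "public"     # placeholder read-only community

TABLES = {
    "CPU utilisation (CISCO-PROCESS-MIB)":  "1.3.6.1.4.1.9.9.109.1.1.1",
    "Memory pools (CISCO-MEMORY-POOL-MIB)": "1.3.6.1.4.1.9.9.48.1.1.1",
}

def snmp_walk(oid: str) -> str:
    """Walk one OID subtree and return the textual output."""
    return subprocess.check_output(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, HOST, oid],
        text=True,
    )

if __name__ == "__main__":
    for title, oid in TABLES.items():
        print(f"=== {title} ===")
        print(snmp_walk(oid))
```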

Because of the problems with the access switches, our website and our telephony were temporarily knocked out as well. That is why customers complained that they could not reach us. The problem was resolved fairly quickly, though, and everything came back up.

The new Juniper was brought up at the scheduled time, and the switchover of access switches and customer connections began. Along with the connections came new problems, which we are also still studying. For example, a Layer 2 "loop" formed and was not caught immediately. Catching it took extra time, and the loop affected some customers in our other data centers.
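For readers wondering how one "catches" a loop like that: the usual giveaway is MAC addresses flapping between ports in the switch logs. Below is a rough sketch of that kind of log triage; the log line format is a made-up example, not the exact output of our switches:

```python
# Rough sketch: spot a Layer 2 loop by counting how often each MAC address
# "flaps" between ports in switch log output. The log format is hypothetical;
# real syslog wording differs per vendor.
import re
from collections import Counter

FLAP_RE = re.compile(
    r"MAC (?P<mac>[0-9a-f:]{17}) moved from port (?P<old>\S+) to port (?P<new>\S+)"
)

def count_flaps(log_lines):
    """Return a Counter of how many times each MAC changed ports."""
    flaps = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m:
            flaps[m.group("mac")] += 1
    return flaps

if __name__ == "__main__":
    sample = [
        "12:03:01 MAC 00:11:22:33:44:55 moved from port ge-0/0/1 to port ge-0/0/7",
        "12:03:01 MAC 00:11:22:33:44:55 moved from port ge-0/0/7 to port ge-0/0/1",
        "12:03:02 MAC 00:11:22:33:44:55 moved from port ge-0/0/1 to port ge-0/0/7",
    ]
    for mac, n in count_flaps(sample).most_common():
        if n >= 3:
            print(f"{mac}: {n} port moves -> likely looped segment")
```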

While connecting the access switches, it also turned out that the Cisco Catalysts did not really want to be friends with the Juniper gear, and we had to do a fair amount of shamanic dancing with a console cable at each switch, reconfiguring it by hand. By 11:00, five hours behind schedule, most of the data center clients were up and running without problems.
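If we had to do it again, that per-switch hand-reconfiguration is the part we would automate. A minimal sketch of the idea using the netmiko library; the addresses, credentials and the example spanning-tree command are placeholders for illustration, since the actual interop fixes were specific to each switch:

```python
# Minimal sketch: push the same interop-related config to a list of Cisco
# access switches with netmiko instead of consoling into each one by hand.
# Hosts, credentials and the example command are placeholders.
from netmiko import ConnectHandler

SWITCHES = ["192.0.2.21", "192.0.2.22"]                      # placeholder addresses
CREDENTIALS = {"username": "admin", "password": "changeme"}  # placeholders

CONFIG_COMMANDS = [
    # Hypothetical example: align the spanning-tree flavour with the Juniper side.
    "spanning-tree mode mst",
]

for host in SWITCHES:
    conn = ConnectHandler(device_type="cisco_ios", host=host, **CREDENTIALS)
    try:
        output = conn.send_config_set(CONFIG_COMMANDS)
        print(f"--- {host} ---\n{output}")
    finally:
        conn.disconnect()
```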

But that was not the end of the day's problems. Differing views among Juniper Networks, Extreme Networks and Cisco Systems on the seemingly fully standardized STP and MPLS protocols left some of the customers without connectivity. We were hunting a glitch with the passage of large packets until 16:30, by which point slightly fewer than 100 connections to the data center access switches remained affected. Some customer switches connected to our access switches were also misconfigured and were affecting our network. After explanatory conversations with those customers, reconfiguration of their equipment, and installation of a variety of filters on those ports, the problem was finally resolved, and at around 18:30 the last affected customers regained normal network access.
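For the curious, that "large packets" class of problem is easy to reproduce once you suspect it: send pings with the don't-fragment bit set at increasing sizes and see where they stop getting through. A quick sketch of such a sweep on Linux, with a placeholder target address:

```python
# Quick sketch: find the largest ICMP payload that passes without fragmentation
# by pinging with the don't-fragment bit set (Linux iputils ping flags).
import subprocess

TARGET = "192.0.2.1"   # placeholder address on the far side of the suspect path

def ping_df(size: int) -> bool:
    """Send one ping of `size` payload bytes with DF set; True if it got through."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", "-M", "do", "-s", str(size), TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    # 1472 bytes of payload + 8 ICMP + 20 IP = a standard 1500-byte MTU.
    for size in range(1472, 1300, -4):
        if ping_df(size):
            print(f"Largest passing payload: {size} bytes "
                  f"(path MTU approx. {size + 28} bytes)")
            break
    else:
        print("Nothing below 1472 bytes passed; the path is badly broken or filtered.")
```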

What's next


All affected customers will receive compensation and pleasant bonuses, no doubt about it. As I said, traffic on the PolyByte network keeps growing, and the completed upgrade will let us continue meeting the needs of both existing and new customers. By the way, we are one of the few Moscow data centers that provide connectivity for client servers at 10 Gbit/s. Expect attractive new tariffs and a more flexible tariff policy. A blessing in disguise, as they say!

Thanks to all our customers who have stayed with us over the years and showed patience on a day that was difficult for us and for them!

Source: https://habr.com/ru/post/150194/

