The end of June and the beginning of July 2019 were not easy and were marked by several major outages of worldwide IT services. Among the notable ones: two serious incidents involving the CloudFlare infrastructure (the first caused by sloppy, careless handling of BGP by a US ISP; the second by a botched deployment at CF itself, which affected everyone who relies on CF, and that includes many notable services), and instability in the Facebook CDN infrastructure (which affected all FB products, including Instagram and WhatsApp). We got caught up in this as well, although our outage was far less noticeable against the global background. Some people have already started dragging in black helicopters and "sovereign" conspiracy theories, so we are publishing a public post-mortem of our incident.
03.07.2019, 16:05: We started registering problems with our resources that looked like a loss of internal network connectivity. Without having checked everything thoroughly, we began to suspect the external link towards DataLine, since it had become clear that the problem was with the internal network's access to the Internet (NAT), to the point where we took down the BGP session towards DataLine.
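For context, a quick triage at this point can be sketched roughly as follows; the addresses are placeholders (not our real network), and the script is only an assumption about how one might separate "the LAN is broken" from "NAT or the uplink is broken".

```python
# Rough triage: ping an internal host, the NAT gateway and a public IP
# literal (no DNS involved) to see where connectivity actually breaks.
# All addresses below are placeholders, not real infrastructure.
import subprocess

TARGETS = {
    "internal host":        "10.0.0.10",
    "NAT gateway":          "10.0.0.1",
    "public Internet (IP)": "8.8.8.8",
}

def reachable(ip: str) -> bool:
    """One ICMP echo request with a 2-second timeout (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for label, ip in TARGETS.items():
        print(f"{label:24s} {ip:15s} {'OK' if reachable(ip) else 'FAIL'}")
```

If the internal targets respond but the public IP does not, the fault lies somewhere around NAT or the uplink rather than inside the LAN, which is roughly the conclusion we reached at this stage.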
03.07.2019, 16:35: It became obvious that the equipment performing network address translation and access from the local network to the Internet (NAT) had failed. Attempts to reboot the equipment led nowhere, so we started looking for alternative ways to organize connectivity without waiting for a reply from technical support, since experience suggested that it would most likely not help anyway.
The problem was aggravated by the fact that this same equipment also terminated the incoming VPN connections of employees, which made remote recovery work more difficult.
03.07.2019, 16:40: We tried to revive the backup NAT scheme that had existed before and had once worked reliably. But it became clear that a number of network upgrades had rendered this scheme almost completely inoperable: restoring it would at best do nothing and at worst break what was still working.
We started working on a couple of ideas for shifting traffic onto the set of new routers serving the backbone, but they looked unworkable because of how routes are distributed in the core network.
03.07.2019, 17:05: At the same time, a problem surfaced with name resolution on the name servers, which led to resolution errors in applications, and we started quickly adding entries for critical services to the hosts files.
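As a rough illustration of that stopgap, a minimal sketch of pushing such hosts-file entries out is shown below; the hostnames and addresses are made up for the example and are not our actual configuration.

```python
# Temporary workaround while DNS is broken: append known name -> IP
# mappings for critical services to /etc/hosts (requires root).
# The names and addresses below are made-up examples.
CRITICAL_HOSTS = {
    "db-master.internal.example": "10.0.10.5",
    "cache01.internal.example":   "10.0.11.7",
    "api.internal.example":       "10.0.12.3",
}

HOSTS_FILE = "/etc/hosts"

def patch_hosts(path: str = HOSTS_FILE) -> None:
    with open(path, "r+", encoding="utf-8") as f:
        existing = f.read()                    # cursor is now at end of file
        for name, ip in CRITICAL_HOSTS.items():
            if name not in existing:           # avoid duplicate entries
                f.write(f"{ip}\t{name}\t# temporary, added during outage\n")

if __name__ == "__main__":
    patch_hosts()
```

The point of the idempotency check is that such a script may be run repeatedly from configuration management without piling up duplicate lines; the entries still have to be cleaned up once DNS is healthy again.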
03.07.2019, 17:27: Habr was restored to limited operation.
03.07.2019, 17:43: In the end, a relatively safe way to pass traffic through one of the border routers was found and quickly implemented. Internet connectivity was restored.
Within the next few minutes a flood of notifications about recovered monitoring agents arrived from the monitoring systems, but some services remained inoperable because name resolution on the name servers (DNS) was broken.
03.07.2019, 17:52: The name servers were restarted and their caches flushed. Resolution recovered.
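To confirm that resolution had genuinely recovered, it is enough to walk through the critical names and check that they resolve again; a minimal sketch follows, using the system resolver and example hostnames that stand in for our actual service list.

```python
# Verify that critical names resolve again after the name servers were
# restarted and their caches flushed. The hostnames here are examples.
import socket
import sys

CRITICAL_NAMES = [
    "habr.com",
    "db-master.internal.example",
    "api.internal.example",
]

def check_resolution(names):
    ok = True
    for name in names:
        try:
            addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
            print(f"{name}: {', '.join(addrs)}")
        except socket.gaierror as exc:
            print(f"{name}: resolution FAILED ({exc})")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_resolution(CRITICAL_NAMES) else 1)
```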
03.07.2019, 17:55: All services came back up except MK, Freelansim and Toaster.
03.07.2019, 18:02: MK and Freelansim came back up.
03.07.2019, 18:07: Brought back the BGP session with DataLine, which had been innocent all along.
03.07.2019, 18:25: We started noticing flapping on the resources; it was caused by the change of the external NAT pool address and its absence from the ACLs of a number of services. This was quickly fixed, and Toaster came back up right away.
03.07.2019, 20:30: Noticed errors related to the Telegram bots. It turned out we had forgotten to add the new external address to a couple of ACLs (on the proxy servers); quickly fixed.
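Both of the last two entries come down to the same allow-list problem: the new external NAT pool address was simply missing from some services' ACLs. A hedged sketch of how such an audit might look is below; the address (taken from a documentation range) and the file paths are hypothetical.

```python
# Report which service allow-lists are missing the new external NAT pool
# address. The address (documentation range) and paths are hypothetical.
import ipaddress
import pathlib

NEW_NAT_ADDRESS = ipaddress.ip_address("203.0.113.10")

ACL_FILES = [
    "/etc/service-a/allowlist.txt",
    "/etc/service-b/allowlist.txt",
]

def acl_covers(path, addr):
    """True if any IP or CIDR entry in the ACL file covers addr."""
    for line in pathlib.Path(path).read_text().splitlines():
        entry = line.split("#", 1)[0].strip()   # strip comments and blanks
        if not entry:
            continue
        try:
            if addr in ipaddress.ip_network(entry, strict=False):
                return True
        except ValueError:
            continue                            # not an IP/CIDR entry
    return False

if __name__ == "__main__":
    for acl in ACL_FILES:
        status = "ok" if acl_covers(acl, NEW_NAT_ADDRESS) else "MISSING new NAT address"
        print(f"{acl}: {status}")
```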
Findings
- The equipment that failed had already raised doubts about its fitness. There were plans to take it out of service, since it hindered the development of the network and had compatibility problems, but at the same time it performed a critical function, which is why replacing it without interrupting services was not technically easy. Now we can move on.
- DNS problems can be avoided by moving the name servers closer to the new backbone network, out from behind NAT, while keeping full connectivity to the private ("gray") network without translation (as was planned before the incident).
- Domain names should not be used when assembling RDBMS clusters: the convenience of transparently changing an IP address is not really needed, since such manipulations require rebuilding the cluster anyway. This decision was dictated by historical reasons, above all by the habit of specifying endpoints by name in the RDBMS configurations. In general, a classic trap (see the sketch after this list).
- In effect, we have run an exercise comparable to the "sovereignization of the RuNet"; there is something to think about in terms of improving our capacity for autonomous survival.
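On the RDBMS point above, a quick audit of cluster configuration for name-based endpoints could look something like the sketch below; the endpoint list and its format are assumptions for illustration, not our actual setup.

```python
# Flag RDBMS cluster endpoints that are configured by hostname instead of
# a literal IP address. The endpoint list below is a made-up example.
import ipaddress

CLUSTER_ENDPOINTS = [
    "10.0.20.11:5432",
    "db-replica-1.internal.example:5432",   # name-based: will be flagged
    "10.0.20.13:5432",
]

def is_literal_ip(endpoint: str) -> bool:
    host = endpoint.rsplit(":", 1)[0]
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

if __name__ == "__main__":
    for ep in CLUSTER_ENDPOINTS:
        if not is_literal_ip(ep):
            print(f"WARNING: {ep} is configured by name and will break "
                  "whenever name resolution is unavailable")
```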