
On the night of July 15-16, 2017, the Selectel network went through one of its most memorable incidents: connectivity between the Selectel network and the foreign segment of the Internet degraded, in places to complete unreachability. It was memorable enough that Selectel's technical director devoted a talk to it at ENOG-14, the conference of network operators.
But first things first.
Network description
At the time, the Selectel network looked like this. Selectel is present in two Russian cities, St. Petersburg and Moscow, with two border routers in each city.

Each router announces the Selectel networks “to the world” from autonomous system AS 49505. Uplinks and peerings are connected to each of the routers.

The largest uplinks to Selectel are TransTelecom, RETN and Rask.
The largest peering points to which the Selectel network is connected are MSK-IX, DATA-IX, DE-CIX.
At some exchange points (MSK-IX, DATA-IX and others), more-specific prefixes were announced to attract more incoming traffic through those exchange points. No more-specific prefixes were announced at DE-CIX.

Autonomous system 49505 is also used for a distributed network of DNS servers. The servers sit in several countries behind different uplinks, and two /24 subnets are announced from them using BGP anycast.
Internet Exchange
An Internet Exchange (IX), or traffic exchange point, is a peering network for different network operators. It consists of one logical switch to which all operators are connected and a pair of servers that distribute routing information among the participants (route servers, RS).

An IX is typically responsible for two things. The first is passing traffic from one participant to another by switching packets. The second is keeping the route servers working correctly: filtering received routes when a participant announces incorrect ones, relaying correct information from one participant to another and, which matters especially at large exchange points, blackholing traffic for routes marked with a special community.
Blackhole
Blackhole (“black hole”) is a mechanism that lets telecom operators protect their networks from DDoS attacks. It does not keep the attacked resource available, but it is meant to keep the network of the operator that hosts the resource, or through which the resource is reached, operational.
To mark the destinations for which traffic should be dropped, operators use the so-called blackhole community. At the end of 2016 this community was registered as a well-known community by RFC 7999 (https://tools.ietf.org/html/rfc7999).
Operators usually apply the blackhole community very carefully, and as a rule only to /32 prefixes. But the settings of some exchange points and some operators allow blackhole routes with a prefix length other than /32 to be accepted.
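The difference between a strict and a loose acceptance policy can be illustrated with a small Python sketch. It is not the configuration of any real exchange point: the function name and the min_len / max_len parameters are hypothetical, only the community value 65535:666 comes from RFC 7999, and IPv4 is assumed.

```python
import ipaddress

# Well-known BLACKHOLE community from RFC 7999 (65535:666).
BLACKHOLE = (65535, 666)


def accept_blackhole(prefix, communities, min_len=32, max_len=32):
    """Return True if a route may be accepted as a blackhole request.

    min_len and max_len are hypothetical policy knobs: a strict policy
    keeps both at 32, a looser one widens the window.
    """
    if BLACKHOLE not in communities:
        return False  # not a blackhole request at all
    network = ipaddress.ip_network(prefix)
    return min_len <= network.prefixlen <= max_len


# Under the strict default a host route is accepted,
# while a /22 carrying the same community is rejected.
print(accept_blackhole("188.93.16.1/32", [BLACKHOLE]))
print(accept_blackhole("188.93.16.0/22", [BLACKHOLE]))
```

With the defaults only host routes pass; a policy that leaves max_len at 32 but lowers min_len far enough would let a whole /22 be blackholed.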
Alexander Ilyin, Technical Director, MSK-IX
MSK-IX was one of the first traffic exchange points in the world to introduce a BGP Blackholing service. The problem described by our colleagues from Selectel did not affect our infrastructure, because the Blackhole rules on our Route Servers only permit routes from /25 to /32. In accordance with the policies of the Internet registries (RIPE, RADB, etc.), the Blackhole community may be set only on networks the participant already announces. The RS accepts such announcements from a participant only if the same networks are also announced by that participant without the attribute, so errors of this kind are filtered on input. We also maintain an up-to-date database of participant contacts and in such cases block violators promptly.
Incident: First Blood
So, it all happened on the night of July 15-16, that is, from Saturday to Sunday. On a summer night between Saturday and Sunday most network engineers, I dare say, are resting. At 0:30 (all times are Moscow time) the technical director received a call on the mobile phone from Selectel technical support engineers: “Something strange is happening, we do not know what yet. The symptoms: customers complain that foreign servers are unreachable from Selectel servers, or that servers in Selectel are unreachable from foreign servers. Also, Slack has stopped working on the office network.”
In response, technical support was told to collect traces of the problem routes from the complaining customers. Analysis of the logs and monitoring messages revealed no obvious problems. The Selectel network was reachable as usual from mobile operators' networks, and remote access worked fine. A deeper look showed a drop in traffic on one of the uplinks, TransTelecom, with no corresponding increase on the other uplinks.

0:30 is a very unfortunate time to hunt for anomalies on the utilization graphs of Internet links. Around this time the busy hour on Internet networks usually ends, so a drop in traffic may be caused not by a failure but by the normal daily pattern, and is not very noticeable in general.
Given how little material had been collected and how little time had been spent on diagnosis, the conclusion was “something probably broke at TTK”. The BGP session with TTK was then shut down. Route reconvergence fixed the problem: Slack started working, connectivity was restored, and customers began confirming that service was fully operational. Technical support was told to contact TransTelecom's technical support.
The interrupted night's sleep was continued.
Incident: main part
Around 02:00 Slack stopped working on the office network again, and a significant share of foreign resources again became unreachable. This happened with the TransTelecom link disabled, so either the first assumption had been wrong (but then why did everything work after the BGP session with TTK was shut down?) or the problem had spread. The support engineers on duty reported a flood of customer complaints about unreachable resources. The volume of complaints and an analysis of the subnets assigned to those customers showed that most of the IP prefixes used in the Selectel network were affected; the problem was global and not confined to any single prefix.
Technical support again began collecting traceroute output, both from servers inside Selectel and from servers outside.
The traces showed that the problems were localized around DE-CIX, and that the problem traffic flows were markedly asymmetric. If the route from Selectel to the destination passed through DE-CIX, there might be no problem at all. But if the return route passed through DE-CIX, the unreachability showed clearly. Diagnosing such problems is often complicated by traffic asymmetry and by operators' use of ECMP. For example, one packet from operator A's network to the Selectel network may be sent via DE-CIX, the second via DATA-IX and the third via Cloud-IX.

During the attempts to localize and diagnose the problem, the session with TransTelecom was brought back up.
The engineers then began carefully going through the looking glasses of various operators. The output from the TransTelecom routers caught the eye.

Selectel announces the prefix 188.93.16.0/21 to its uplinks and the more-specific prefix 188.93.16.0/22 to some of the exchange points. So that uplinks do not pick up the more-specific prefixes “from outside”, the same more-specific prefixes are also announced to all uplinks with the no-export community; inside the uplink networks these prefixes should be in the RIB with the best route pointing to the customer session. But with the TransTelecom session back up, it turned out that TTK's Moscow router (with which Selectel had no direct session at that moment) held a Selectel more-specific prefix leading through some autonomous system 2854.
Autonomous system 2854 is not among Selectel's uplinks or peers, so it should not be a transit autonomous system for Selectel prefixes. Where did TTK get this prefix from? Unclear.
We look at the DE-CIX looking glass. Here is the first route server; everything matches expectations.

As expected, the prefix on the route server points straight to the Selectel router. There is no more-specific prefix, since Selectel does not announce more-specifics at DE-CIX. But something kept us from moving on to the next looking glass, and routing information was requested from route server No. 2.

Oh!
First, a more-specific prefix 188.93.16.0/22 had appeared on route server No. 2 from somewhere. Where did it come from? AS path 2854 49505. That is, AS 2854 is again suddenly announcing Selectel prefixes through itself. Second, the community (65535, 666) raised an alarm. This is the blackhole community! AS 2854 picks up Selectel's more-specific prefixes somewhere and then passes them to DE-CIX route server No. 2 with the blackhole community set!
Compare the output from RS1 and RS2 at DE-CIX:

Yes. RS1 has no more-specific, and the route goes straight to Selectel, as it should. RS2 has a more-specific via 2854 with the blackhole community set.
Logically, the routers of operators that take routes from the DE-CIX route servers “see” the more-specific prefix as the best route to Selectel, and traffic for that prefix is dropped by DE-CIX itself.
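Why the more-specific wins regardless of its AS path follows from longest-prefix matching, which can be sketched in a few lines of Python. This is only an illustration: the route table layout and the next-hop labels are made up for the example, and only the two prefixes and AS paths are taken from the incident.

```python
import ipaddress


def longest_prefix_match(destination, routes):
    """Pick the route whose prefix covers destination with the longest mask.

    routes maps a prefix string to an arbitrary next-hop label.
    Returns (prefix, label), or None when no prefix covers the address.
    """
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, label in routes.items():
        network = ipaddress.ip_network(prefix)
        if address not in network:
            continue
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, label)
    return None if best is None else (str(best[0]), best[1])


# Hypothetical view of a router that learned both routes.
routes = {
    "188.93.16.0/21": "AS path 49505",
    "188.93.16.0/22": "AS path 2854 49505, blackhole",
}

# An address inside the /22 follows the blackholed more-specific,
# an address in the other half of the /21 still follows the aggregate.
print(longest_prefix_match("188.93.17.10", routes))
print(longest_prefix_match("188.93.21.10", routes))
```

AS path length is only compared between routes for the same prefix, so the shorter, correct path of the /21 never gets a chance to compete with the /22.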
Once it became clear what the routing problem through DE-CIX was, questions appeared in the chat group of St. Petersburg telecom operators: “why have we also lost connectivity to some foreign resources?” Diagnosis along the now familiar path showed the same problem with AS 2854, which was announcing through itself many routes, including routes to St. Petersburg operators, with the blackhole community. For broadband operators, subscriber complaints became massive only by morning. That is down to the nature of residential broadband and the timing of the problem (a reminder: a summer night between Saturday and Sunday, when many home users are at their summer cottages, on vacation or asleep).
Who is to blame
During diagnosis we established the number of the autonomous system announcing incorrect routes on DE-CIX RS2: AS 2854. Who is it? We needed to know in order to contact them quickly and tell them they were doing something wrong. AS 2854 is the Russian subsidiary of the global operator Equant (also known as Orange Business Services); in Russia this company was previously called Rosprint.

Naturally, a letter describing the problem was sent to noc@rosprint.net, and technical support engineers began calling the listed phone numbers. Among the technical contacts we even found the mobile number of one of the engineers. Calling it did not help: the Rosprint engineer was on vacation somewhere near Lake Baikal, away from a computer. Calls to Rosprint technical support at first ended with a suggestion to write them a letter describing the problem, and on Monday “the network engineers will come in and maybe sort it out”. The noc@rosprint.net mailbox is apparently ignored outside business hours. It got to the point where Rosprint technical support, on hearing “I am from Selectel”, simply hung up.
What to do
Besides trying every way to reach Rosprint's engineers, we tried to fix the situation somehow, or at least reduce the damage.
The first attempt was to fix things from Selectel's side. The idea: stop announcing the more-specifics, so that they no longer reach DE-CIX RS2, where they become the best routes for 100% of DE-CIX participants. Done. These routes did stop appearing on RS2, but now RS2 had the aggregate routes to the Selectel networks. RS1 had the route 188.93.16.0/21 via AS 49505 (the correct route from Selectel), while RS2 had the route 188.93.16.0/21 with AS path 2854 49505 and the blackhole community.
Connectivity was partially restored, since DE-CIX no longer had more-specific routes to the Selectel networks. It did not recover for those operators and resources that preferred the routes from RS2 despite the longer AS path.
OK. We cannot reach Rosprint's engineers, so we will knock on other engineers' doors.
Writing a letter describing the problem and then calling DE-CIX technical support to raise the priority of the request did not succeed.

DE-CIX support engineers began denying any involvement in managing routing information on the route servers. We asked them to disconnect Rosprint from the DE-CIX route servers, since Rosprint was sending them incorrect information and we could not reach Rosprint's engineers.
Then we began writing to MSK-IX and DATA-IX asking them to disconnect Rosprint (since it was mishandling prefixes learned from an IX). The calculation was that if two large links at traffic exchange points went down, Rosprint technical support would have to notify its network engineers, and they would start looking into it.
The night has passed
So the night passed in diagnosis and attempts to reach Rosprint. Suddenly, at 11:26 (when MSK-IX and DATA-IX had also been convinced that Rosprint's actions were wrong and were ready to disconnect it from their route servers), a message arrived from Rosprint:

That's all. No apology, no contacts, nothing...
Lessons
We learned a few lessons from what happened. Unfortunately I learned them the hard way, on our own problems, but I hope this text will help other operators in the future:
- At an IX, always check both route servers.
- Connectivity to your network must be monitored from outside, using not just ping / HTTP GET but also various API services, including stat.ripe.net.
- Having more-specific prefixes, contrary to the common opinion among operators that they are harmful, helped us diagnose the problem.
- When contracting Internet bandwidth from uplinks, it is advisable to provision enough to run on a single operator, and not only the redundancy needed if one border router fails.
- Anycast segments, if you have any, are best placed in an autonomous system separate from the main network.
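The first two lessons could be automated along the following lines. This is a minimal Python sketch under stated assumptions: the layout of the views dictionary is hypothetical and does not correspond to any real looking-glass or stat.ripe.net API, and fetching the data is left out entirely. Only the ASNs, prefixes and the community value come from the incident.

```python
BLACKHOLE = (65535, 666)  # well-known community from RFC 7999
OWN_ASN = 49505


def audit_route_servers(views, own_asn=OWN_ASN):
    """Return human-readable findings for routes that look wrong.

    views maps a route server name to a list of route dicts with the
    keys "prefix", "as_path" (list of ASNs) and "communities" (list of
    2-tuples). This layout is hypothetical, not a real API response.
    """
    findings = []
    for server, routes in views.items():
        for route in routes:
            path = route["as_path"]
            # Any ASN other than our own in front of our origin means
            # somebody is re-announcing our prefix to the route server.
            foreign = [asn for asn in path if asn != own_asn]
            if path and path[-1] == own_asn and foreign:
                findings.append(
                    f"{server}: {route['prefix']} re-announced via AS {foreign[0]}"
                )
            if BLACKHOLE in route["communities"]:
                findings.append(
                    f"{server}: {route['prefix']} carries the blackhole community"
                )
    return findings


# Example data modelled on what the two DE-CIX route servers showed.
views = {
    "rs1": [{"prefix": "188.93.16.0/21", "as_path": [49505], "communities": []}],
    "rs2": [{"prefix": "188.93.16.0/22", "as_path": [2854, 49505],
             "communities": [(65535, 666)]}],
}

for line in audit_route_servers(views):
    print(line)
```

Checking each route server separately is the point: in the incident one of them looked perfectly healthy while the other carried the bad route.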
Why did the route server at DE-CIX accept routes to Selectel networks from AS 2854?

Because AS 49505 is a member of the AS-SET AS-URAL, for the sake of its anycast prefixes. And AS 2854 is allowed to announce prefixes from AS-URAL.
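How AS-SET membership turns into an accepted announcement can be sketched in Python. The sketch assumes that a route server builds its per-peer filter by recursively expanding the AS-SET registered for the peer; the set name AS-PEER-2854 and the contents of both sets are hypothetical stand-ins, and the real filter logic at DE-CIX is not described in this text.

```python
def expand_as_set(name, as_sets):
    """Recursively expand an AS-SET into the set of member ASNs.

    as_sets maps a set name to a list whose items are either ASNs
    (int) or names of nested AS-SETs (str).
    """
    members, seen, stack = set(), set(), [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against AS-SETs that reference each other
        seen.add(current)
        for member in as_sets.get(current, []):
            if isinstance(member, int):
                members.add(member)
            else:
                stack.append(member)
    return members


def filter_accepts(peer_as_set, origin_asn, as_sets):
    """True if a route originated by origin_asn passes the peer's filter."""
    return origin_asn in expand_as_set(peer_as_set, as_sets)


# Hypothetical registry contents: only the membership of AS 49505
# in AS-URAL is taken from the text, the rest is illustrative.
AS_SETS = {
    "AS-PEER-2854": [2854, "AS-URAL"],
    "AS-URAL": [49505],
}

print(filter_accepts("AS-PEER-2854", 49505, AS_SETS))
```

Under this assumption the filter checks only the origin ASN, so every prefix originated by AS 49505 becomes acceptable from the peer, not just the anycast ones.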
Alexey Kuznetsov, Deputy Head, UIT, St. Petersburg State University
Modern IT systems and communication networks are a constant interaction between users and various IT services, which in turn depend on and interact with other IT services, both at the user's request (authentication through social networks and so on) and directly among themselves. A disruption of IT services, or of their connection to users or to each other, affects users immediately and often in non-obvious ways. “Does not work at all” is the simplest kind of failure; other cases require troubleshooting both by the support staff who deal directly with the users and owners of services and by engineers. What is easier: to dig into the details, or to answer “the error is outside our area of responsibility” or “this is the Internet, nobody here is responsible for anything”? Yes, errors and problems at some telecom operators can affect users and services belonging to other operators that have no direct relationship with the “source of the problem” and do not even send traffic through its networks. A separate problem is that telecom operators have staff of different skill levels, especially on night and weekend shifts, and different escalation policies. For an operator, its own users and services come first, so whether someone else's problem gets escalated at night to the system administrator, or perhaps first to management, depends on the person on duty, on how colleagues reacted to previous escalations, and so on. Often the only way to escalate quickly is direct contact between the managers and administrators of telecom operators: the ability to work through a chain of contacts and reach the right employee.
One does not bring one's own rules into someone else's house, so the initiative can be taken either by national organizations or by already recognized international ones, such as RIPE for the Europe-Russia region, by setting “rules of the game” common to all their members.