Building and operating a fault-tolerant anycast network
Hi, Habr! What follows is a transcript of the talk by Evgeny (error2407) Bogomazov, network R&D engineer, and Dmitry (h8r) Shemonaev, head of NOC, given at the recent UPTIMEDAY. The video is at the end of the post.
Today we would like to talk about the problems that arise when building an anycast network: why we got into this and what we do there.
At Qrator Labs, we build our own anycast network to solve specific problems that differ from those of "ordinary" telecom operators. Accordingly, we have points of presence in these regions - we just forgot to add Russia here. And that means we have a lot of stories about how things should and should not be done. Today we will share some of them with you.
What are we going to cover, given how broad the topic is? At first we wanted to do only Q&A (questions and answers), but we were asked to give a talk after all. So if we leave something out, or run out of time for something, catch us afterwards in the hallway. :)
As planned, we will try to cover the difference between balancing with DNS and with BGP, how to choose new sites and what to pay attention to in order to avoid pain later; and Dmitry will talk about how to operate all of this and how demanding that job is. To begin, let's find out how familiar you are with the subject. - Who here knows what anycast is and why it's needed? (about a third of the hands in the hall go up) - And who is familiar with DNS and has set up a server? (about the same number of hands) - And BGP? (two hands in the frame)
Still, that's a lot.
- Well, the last question: who is familiar with NOC work? Who has had problems with providers and tried to solve them? (the hand of Habr's system administrator is visible in the frame)
Great. In that case, I hope you will follow what we are going to tell you. Before moving on to anycast, let's see why it is needed at all. You have an application that handles customer requests. It is hosted somewhere - you are not thinking much about that yet. You buy a DNS name, set up resolution, and so on. Then you get a certificate, because it's HTTPS. Your app grows. First, you must handle the load. If your application suddenly "takes off" - that is, becomes sharply popular - many more users come to you. You have to buy more hardware and balance the load across it.
Additionally, particularly demanding customers may appear, saying: "Guys, for this kind of money you should be available always and everywhere!" Which means you provision redundant computing resources not only for handling peak load, but also simply as a reserve.
Additionally - this has already been mentioned in other talks today - you cannot put all the hardware in one data center: a natural disaster may occur, the application will go down, and that will cause financial and other losses. So if your application has grown enough, it should already live in several data centers, otherwise things will be bad. The problem has another side: if your application is time-sensitive, as in financial analytics or trading, it is important to answer your users as quickly as possible. That notorious latency has two aspects. The first: if you want requests served as quickly as possible, the number of round trips between request and response should be as small as possible. And when a user connects to you for the very first time, nothing is set up yet, and they have to go through all the circles of hell. The second aspect is the speed of light: a packet from Western Europe to Russia cannot travel faster than a certain number of milliseconds, and nothing can be done about that. We need several data centers both because we want redundancy and because we need to stay close to potential customers. If, say, your main client region is America, you place your equipment there, so that traffic does not travel through other countries and parts of the world.
It turns out that, from a certain point on, you will have to be present at sites in different parts of the world. And this is still not anycast. So, you need several sites. You have to choose them somehow - both the initial ones and new ones - and understand how they can scale: when overloaded, you will have to buy additional hardware.
If you already have several sites, you need to learn how to distribute users between them. There are two options: BGP and DNS. We will start with the latter. And there are, again, two main approaches. In the first, different sites have different IPs; when a user's request comes in, they get the IP of a particular site and are mapped to it.
What are we solving here? We want a user from a specific region to go to a site located in that same region. The easiest and dumbest solution is GeoDNS. You have a mapping of which prefixes belong to which regions - you take this data, load it into the DNS server, and if the source IP falls into the right prefix, you map the user to the right site. But there is a problem - resolvers. About 15-20% of requests come through public resolvers - that is, the source IP will be something like 8.8.8.8. Where do you map that?
For this there is the EDNS Client Subnet extension, which allows the original client subnet to be carried inside the request. As you know, DNS Flag Day happened on February 1, 2019 - from that day on, all DNS servers are supposed to handle EDNS correctly.
In this scheme, you can have one or several sites with DNS servers serving users - and those servers themselves can be distributed around the world. Within DNS itself, anycast can also be used - we will talk about that a bit later.
In the general scheme, you map the user to the nearest site by handing out that specific site's address. This approach is used less often.
The third approach stems from the fact that even if the user comes from the same region where the site is located, that does not mean the latency problem is solved. It may be more profitable to send the user to another site: the nearest one may be overloaded, and if alternative paths exist, wouldn't it be nice to use them? Unfortunately, there are almost no off-the-shelf solutions for this. Facebook once gave a talk about what it does here - but there is nothing boxed, you have to build everything yourself. So what do we end up with for DNS?
The advantages: different users can be given different addresses, and a specific user can be directed to a strictly defined site - that is, you can work with individual users. And DNS is easy to configure.
What are the cons? If you do granular configuration, the config grows quickly, to the point where it cannot be maintained by hand. You need automation. And if the automation is done wrong, everything breaks: if DNS is down, the application is unavailable.
On the other hand, with DNS balancing the user is mapped to a specific site, and that site's IP becomes vulnerable. This is why we do not use DNS balancing ourselves: all the attack traffic can then be concentrated on exactly one point, taking it out.
And, as already mentioned, DNS does not support latency-based balancing out of the box, and building it yourself is very hard. Let's finally move on to more beautiful things, namely BGP anycast.
This is our case. What's the point? All sites have the same IP - more precisely, they announce the same prefix. The user is mapped to the "nearest" site, "nearest" from the BGP point of view: the prefix is announced over different routes, and if an operator has several routes to it, it will most often choose the shortest one - again, in BGP terms. We will soon explain why this is bad.
Also, BGP operates on prefix reachability, so you always work with a subnet and cannot manipulate individual IPs.
As a result, since the same prefix is announced from all sites, all users from one region are directed to the same site. An attacker has no way to shift load from one region to another, so they would have to muster attack capacity in each location separately. And even if they manage to, the site can still be protected.
Announcing the same prefix - what could be simpler? But there are problems. The first: because you need to announce the same prefix around the world, you are forced to buy provider-independent address space, which costs several times more.
The second: users from one region cannot simply be moved to another - say, when some of them turn out to be illegitimate, or to spread attack traffic across other sites - without some kind of pain. There are simply no such knobs.
The third problem: within BGP it is very easy to choose the "wrong" site and the "wrong" providers. It will seem to you that you have redundancy and availability, when in fact you have neither. So, you have several sites across which you want to spread users. What knobs are there to restrict a particular region, binding its users to a specific site?
There are geo communities. Why do they exist? Let me remind you: the nearest route is chosen from the BGP point of view. Say you have a Tier-1 operator, for example Level3, with its own backbone spanning the whole world. A Level3 client, if you are connected to Level3 directly, is two hops away from you. And some local operator is three hops away. Accordingly, an operator from America will be "closer" to you than an operator from Russia or Europe, because from the BGP point of view it is.
With geo communities you can limit the region in which such a large international operator will announce your route. The problem is that geo communities are not always available.
We have several cases like this from our own experience. Dima, over to you?
(Dmitry Shemonaev takes over)
"Out of the box", many operators do not provide this, saying they won't restrict anything - net neutrality, freedom, and so on. You have to explain to operators, long and hard, who we are, why we want this and why it matters so much to us, and also explain why it has nothing to do with the neutrality they have in mind. Sometimes it works - and sometimes it doesn't, and we simply refuse to work with otherwise interesting operators, because such cooperation would lead to ongoing pain in operating our service.
We also quite often run into the operators Evgeny already mentioned - the Tier-1s, who buy transit from no one and only exchange traffic among themselves. But besides them there are at least a couple dozen operators who are not Tier-1 - they do buy transit, yet they also have networks deployed around the world. No need to look far: the closest to us are Rostelecom and ReTN; a bit further away there are the wonderful Taipei telecom, China Unicom, Singtel and so on.
And in Asia we quite often ran into this situation: it would seem we have several points of presence in Asia and are connected to several operators that are quite big for that region. Yet we constantly found that traffic from Asia reached our site via Europe, or even made a transatlantic journey. From the BGP point of view this is perfectly normal, because BGP does not consider latency. But the application suffers in such conditions, its users suffer too - in general, everyone suffers, while from the BGP point of view everything is fine.
So you have to make manual changes, reverse-engineer how this or that operator's routing is arranged, sometimes negotiate, ask, beg, get on your knees - in general, do anything to solve these problems. Our NOC faces this with enviable regularity.
As a rule, operators do meet you halfway and in some cases are ready to provide some set... Actually, can those who have worked with BGP communities raise their hands? (smiles) Great! So, operators are ready to provide a set of control communities to, for example, lower local pref in a specific region, add prepends, not announce the prefix somewhere, or something else.
Accordingly, there are two knobs for load balancing in BGP. The first, as written on the slide, is the so-called prepends. The AS path in BGP can be imagined as a small string listing the autonomous systems a packet traverses from sender to receiver. You can append your own autonomous system to this path several extra times; the path lengthens and stops being the preferred one. This is a blunt instrument and it does not always work - a prepend is not granular: everyone in the customer cone of the operator where you apply this manipulation will see it.
On the other hand, there are BGP communities. Some are markings, used to understand where a given prefix came from and what its source is relative to the operator - a peer, a customer, or an upstream - where it was learned, and so on. And there are control communities, which you send to the operator's router so that it takes certain actions with that prefix.
Most operators have restrictive communities. Take an abstract Russian operator connected, in a vacuum, to a number of other Russian operators. With some it has peering relations implying parity traffic exchange; from some it buys transit. Accordingly, they provide communities to add prepends in a given direction, lengthening the AS path, or to not announce at all, or to change local pref. If you operate BGP, look at the communities and find out what a prospective supplier can do for you. Sometimes the communities are hidden, and you have to talk to the operator's managers or engineers to get them to show you the supported set.
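As a sketch of what such a control-community scheme looks like in practice: the community values below are entirely invented, for an imaginary upstream AS64500 - every real operator publishes its own scheme - but the shape (community string mapped to a policy action) is typical.

```python
# Entirely hypothetical control communities for an imaginary upstream
# AS64500; real operators document their own values, often in the
# RIPE DB "remarks:" field.
CONTROL = {
    "64500:666":  ("blackhole", None),        # drop traffic to the prefix
    "64500:80":   ("set-local-pref", 80),     # deprioritize inside AS64500
    "64500:1001": ("prepend", 1),             # one prepend toward peers
    "64500:1003": ("prepend", 3),             # three prepends toward peers
    "64500:0":    ("do-not-announce", None),  # keep the prefix internal
}

def actions(communities):
    """Translate the communities we attach to an announcement into the
    actions the upstream's routing policy would apply to the prefix."""
    return [CONTROL[c] for c in communities if c in CONTROL]
```

Communities the operator does not recognize (here, anything outside its own scheme) are simply ignored, which is why you must learn each supplier's set before relying on it.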
By default, for the European region, communities are described in the RIPE DB: you do a whois query on the autonomous system number, and the Remarks field usually describes what the operator has in terms of marking and control communities. Sometimes it isn't there at all, so you often have to look in various interesting places.
As soon as you start operating BGP, you are in essence saying that the network is part of your application, not something abstract, so you have to weigh the risks.
For example, we had a case with a Latvian financial institution whose prefix, when switched on through our network, became unavailable in about half of Latvia. Although, it would seem, nothing had changed: the same prefix, announced by us to Tier-1 operators, to Europe - everything should be there, including redundancy. But we could not even imagine that about half of Latvia's operators had border devices unable to digest the full view (the entire BGP routing table), which at that time was about 650 thousand prefixes. If anyone knows what a Catalyst 3550 is - that's what was sitting there, and it could only hold about 12,000 prefixes. So they took some prefixes from the IX - where we, of course, were not present - plus a default route. Meanwhile, from another operator, Latvian Television, the IX carried not our /24 but a /22 that covered this /24.
As a result, traffic went to an operator that had no idea where to route it, and everything went down the drain. Fixing this took us about two days of persistent correspondence with Latvian operators, until they showed us the output from their border device and we spotted the culprit just from the hostname. Greetings to everyone - sometimes it's this much fun.
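The mechanics of that failure come down to longest-prefix match. A toy sketch (addresses are documentation ranges, next-hop names are ours for illustration): a router with the full table prefers the more specific /24, but a router that dropped our /24 for lack of memory falls back to the covering /22 from the IX.

```python
import ipaddress

def lookup(fib, dst):
    """Longest-prefix match: the most specific matching route wins."""
    matches = [(net, nh) for net, nh in fib if ipaddress.ip_address(dst) in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

our_24 = ipaddress.ip_network("203.0.113.0/24")    # our anycast prefix
cover_22 = ipaddress.ip_network("203.0.112.0/22")  # covering /22 seen at the IX

# A border router holding the full table routes the /24 to us.
full_fib = [(our_24, "qrator"), (cover_22, "latvian-tv")]
# A router that cannot fit the full view keeps only the covering /22.
small_fib = [(cover_22, "latvian-tv")]
```

Same destination address, completely different next hop - which is exactly why the prefix "disappeared" for half the country while looking perfectly healthy from our side.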
There are many operators with old hardware. There are many operators with a strange understanding of how a network should work. And now that is your problem too, if you are going to play with BGP. Finally, there are many one-legged operators (a single upstream provider of connectivity), so they have their own sets of crutches.
(Evgeny Bogomazov continues)
As you can see, even this topic alone could be developed for a long time, and it is hard to fit into 40 minutes.
So, you have knobs with which to limit a region. Now let's look at what you need to check and what is important to consider when setting up at a new site.
The best case is not to buy your own hardware, but to arrange hosting with a cloud. Later you can arrange with it to connect to certain providers on your own.
On the other hand, if you have gone down this path anyway, you should understand roughly which region - with or without those knobs - will be pulled to this site. For this you need modeling; more precisely, you need to understand which of several routes from different sites will be chosen as the best. To do this, you must have some idea of how BGP works and how routes propagate in the current situation.
The two main points are path length, which prepends affect, and local preference, which says that routes from customers are preferable to routes from anywhere else. In principle, these two points are enough to understand which region will be pulled where you set up.
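Those two points are enough for a first-order model of route selection. A minimal sketch, assuming only the two attributes named above (real BGP decision processes have more tiebreakers); the ASNs and site names are invented:

```python
def best_route(routes):
    """Pick the best path using the two attributes that matter most for
    this kind of modeling: highest local_pref first, then shortest AS
    path (which prepending deliberately lengthens)."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

routes = [
    # Route learned from a customer: short path, high local_pref.
    {"site": "msk", "local_pref": 200, "as_path": [64500, 64496]},
    # Route via an upstream, artificially lengthened with prepends.
    {"site": "nsk", "local_pref": 100, "as_path": [64501, 64496, 64496, 64496]},
]
```

Note that local_pref dominates: no number of prepends on the "msk" route would push traffic to "nsk" as long as the operator prefers customer routes - which is precisely why prepends alone so often fail to move traffic.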
Among other things, you need to account for a couple more facts: what your supplier's connectivity looks like, and that some suppliers do not talk to each other ("peering wars") - so even if you connect to a regional Tier-1, that does not mean all local users will see you.
Another thing that is often forgotten: connectivity in IPv4 and IPv6 is completely different and does not carry over. And here we come to the main point, answering the question "Where to set up?". The choice seems obvious: if you have users in a region, you join the IX in that region and think no further. There is connectivity; most users, in theory, should be reachable through it; and most content companies, like Yandex and others, connect to IXes first and to suppliers only after that. But suppliers can have unique customers, and some suppliers are not present at the IX themselves - as a result, you cannot pull those users to yourself, and they will reach you in strange ways.
When choosing suppliers, you cannot switch your brain off - we had a couple of cases where the wrong approach led to problems. Suppliers matter because, if you do not have that many resources for connections, then by going to the largest regional players you end up with roughly the same connectivity as at the IX.
How to choose the right suppliers - Dima, will you tell us?
(and again Dmitry Shemonaev)
Okay. Let's imagine that we have one site, and our region of interest is Russia. We have a point in a conditionally good data center in Moscow, we operate our own autonomous system with our own set of prefixes, and we decide to scale using BGP anycast - stylish and fashionable.
The business, together with the technical folks, decided that the RTT from Vladivostok to Moscow is very large, and that is bad, that is, not good. Say we keep the site in Moscow and add a point of presence in Novosibirsk - everything will get better, the RTT will drop, of course. No sooner said than done.
This raises the question of the facility for hosting the equipment, but that is a bit outside today's conversation; the question of choosing an operator, however, is squarely within it. The choice would seem obvious: in Moscow we are connected to the conditional "Moskvatelecom", so let's connect to it in Novosibirsk too. Yes, in principle we can rely on the supplier's internal routing, but that is not always right: in this case we put all the eggs in one basket, and we must understand that routing over the operator's IGP can be, to put it mildly, suboptimal, because it is not always clear what drives it. Sometimes it is understandable, sometimes not so much - but that's a different story, and besides, management has forbidden me to swear, so I just can't analyze some examples in detail.
Modern trends are such that even Moskvatelecom can decide the time of SDN has come and install a wonderful controller to steer the network. And at some point such a controller can simply take that network down. I don't remember a case specifically with an SDN controller, but just recently in America a large network operator (CenturyLink) went down and its entire network was unstable across the USA. Because of one network card. The operator's NOC spent three or four days resolving the problem. Because of one network card.
If you are connected to one operator, I sincerely congratulate you.
Well then, we have decided not to take a single conditional operator in Moscow and Novosibirsk: here we connect to Moskvatelecom, there to Novosibirsktelecom (all coincidences are accidental). The client cones of these two telecoms differ like a tortoise and an elephant, and all your traffic will land where the bigger client cone is - that is, at Moscow's Moskvatelecom. It is always desirable that the operators be of comparable size and have peerings with each other within the region of your interest. In Russia a few years ago, the largest operators, such as Rostelecom and TTC, had peerings in Moscow, St. Petersburg, Nizhny Novgorod, Novosibirsk and Vladivostok, so traffic between these operators flowed more or less optimally.
But the operator still needs to be chosen carefully - so that it has communities, so that it has a NOC. All of this really matters. Last year there was a wonderful case: one rather large Russian operator was testing some of its services and, at night, announced a lot of prefixes from the city of St. Petersburg, with its autonomous system inserted into the path, to the route server of DE-CIX in Frankfurt. And it announced them with the blackhole community.
As a result, a lot of St. Petersburg operators and data centers faced the unavailability of, for example, the TTK network. It touched us too, but we were able to route around it: between our points there is a network, overlay in some places and physical in others, and we rerouted traffic from the problem operator to where there were no problems. In short, we coped. But I am telling you this so you know that an operator's NOC must be adequate: in that case the NOC did not respond on the night from Friday to Saturday and only woke up on Monday. Three days of partial unavailability for a number of operators. Better to think three times.

Let's get back to the NOC. A Network Operations Center is the division of a company that operates the network, performs work on it, and so on, and answers the stream of tickets coming in about the network. What would I add? The specialists raised by ITSumma in this room probably already know everything good about monitoring. It really is important. In some cases you will need to monitor very specific things. Some users may complain that "everything is somehow bad" without being able to provide the diagnostics needed to even start fixing the situation: there is a signal that something is bad, but what and where is unclear. In such cases we try to work with the NOC of the operator in whose client cone that user sits. If that does not work, we look at what we can correlate - for example, whether there are nodes of the RIPE Atlas project inside that cone. In general, we gather what we can, and we accept that we will not always get the full picture.
In some cases it makes sense to monitor which communities a given prefix arrives with at your border router, and to keep a historical record. For example, take three operators: Megafon, Rostelecom and Transtelecom. Suppose they all peer on the territory of the Russian Federation, and you are connected to, say, the conditional Rostelecom. You see the prefixes your users sit in, carrying some marking community. These can be collected and recorded, and when something happens, the community will change. Say you were receiving a prefix with a community meaning "peer in Russia". Fine, recorded. And then that community changes to one meaning "peer in Frankfurt". What does that mean? That these operators have broken their peering, and your latency is now not great - you are going through a European loop. In this case you can act proactively, but it is time-consuming and requires determination, among other qualities.
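The monitoring described above - collect, record, alert on change - can be sketched as a diff between two snapshots of prefix-to-communities state. The marking community values here are invented for illustration.

```python
def community_changes(prev, curr):
    """Compare two snapshots of {prefix: set of communities} and report
    prefixes whose marking changed, e.g. "peer in Russia" becoming
    "peer in Frankfurt" when a peering breaks."""
    return {p: (prev.get(p, set()), c)
            for p, c in curr.items() if prev.get(p) != c}

# Hypothetical marking scheme: 64500:3000 = peer, RU; 64500:3100 = peer, DE.
before = {"198.51.100.0/24": {"64500:3000"}}
after  = {"198.51.100.0/24": {"64500:3100"}}
```

An empty result means the marking is stable; a non-empty one is exactly the early signal that something changed upstream before your users start complaining about latency.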
And, if at all possible, automate everything. Ten years ago it was hard; now there is a bunch of tooling - Ansible, Chef, Puppet - that can talk to network elements. Why is automation important? I have been configuring BGP for a very long time, and the first rule goes like this: "Whoever you bring up a BGP session with, it is not a given that there is a nice person on the other side." And from that person's perspective, the rule applies to you as well.
I personally had a case at a certain Samara operator, which we will not name, where I was moving all the peerings from one border router to another. I had an interconnect with a large content provider, an online cinema, and an interconnect with a local Rostelecom subsidiary. The content-provider link was a gigabit; the subsidiary link was only 100 megabits. And I, being a nice person, moved it all at night. I look at the graphs on the hundred-megabit link and think: "Oh, how nice!" Then I look again - this one has flatlined at capacity, that one has flatlined (slaps forehead): I forgot to set up the filters. To fence yourself off from actions like this of your own - after all, both sides happily accepted everything from each other - you simply need the protection of automation. Automation is the enemy of the bad and the friend of the good.
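The kind of automation that would have caught that mistake is a pre-change check: refuse to enable a session whose policy is missing. A minimal sketch, with invented field names - real checks would run against the router's actual config model:

```python
def safe_to_enable(session):
    """Return the list of policy problems that should block bringing up
    a BGP session - the class of mistake described above, where both
    sides accept and announce everything."""
    problems = []
    if not session.get("import_filter"):
        problems.append("no import filter: will accept anything")
    if not session.get("export_filter"):
        problems.append("no export filter: will announce anything")
    return problems

# Hypothetical session description: import policy set, export forgotten.
peer = {"neighbor": "192.0.2.1",
        "import_filter": "AS-CONTENT-IN",
        "export_filter": None}
```

A deployment script that aborts whenever this returns a non-empty list turns "I forgot the filters" from a 3 a.m. incident into a rejected change.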
Over to you, Zhenya. (Evgeny Bogomazov continues)
So, we have covered all the original points. But it does not end with anycast: beyond it, you need to keep track of other, additional things. Let's see what else there is. You need to see how well your application lends itself to distribution: with several sites, you must be able to spread content across them. If that is impossible, then no matter how distributed your system is, all users will go to the site where the application actually lives, and you will not save on RTT at all.
On the other hand, if you have no user state as such, you can place the application at each site - so do that. And if the application supports all of this, use anycast clouds rather than developing your own infrastructure - it will pay off handsomely.
And if you already have several sites and a user lands on one of them, something can happen there - say it goes down or the links break - and users will move to another site. But they should not notice this. So you must be able to shift that traffic as quickly as possible within your anycast network, and in general you should treat this problem as inevitable - it is worth building something into the application to prepare for such a turn of events.
Ideally, if you have business metrics for the application, then when they drop you immediately query the network monitoring and generate a report on the state of the internal or the external network - better yet, both. Business metrics usually drop because something happened somewhere; this, of course, is a utopia - even we have not come close to it. You have an external network, but there are also the internal sites, and they must talk to each other. You do not have to own physical infrastructure - you can use third-party operators' networks, the main thing is to set up virtual tunnels. Additionally, you must configure the routing of the internal network, because things do not end with BGP: due to the peculiarities of our traffic processing, we have our own protocols, communication style and scenarios.
If you have many sites, you should be able to update configs in different places simultaneously. You receive a new prefix - you have to announce it everywhere; you updated DNS - same thing. In the spirit of SDN, you must collect data from the sites, aggregate it somewhere, and push the resulting changes back out to the sites. The last item is DNS. The Dyn example was telling: as you remember, in 2016 they were hit by an attack they failed to withstand, and a very large number of resources popular in the United States became unavailable. DNS must also be protected, otherwise users will not find your application on the network. DNS caches partly save the day, and there is interesting IETF work on this topic, but it always comes down to whether particular DNS resolvers will support it.
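The "announce a new prefix everywhere" step above is, at its core, reconciling every site against one desired state. A minimal sketch with invented site names and documentation prefixes:

```python
def sync_plan(desired_prefixes, site_state):
    """For each site, compute which prefixes must be announced or
    withdrawn so that every site ends up announcing the same set."""
    plan = {}
    for site, announced in site_state.items():
        add = desired_prefixes - announced
        remove = announced - desired_prefixes
        if add or remove:
            plan[site] = {"announce": sorted(add), "withdraw": sorted(remove)}
    return plan

desired = {"203.0.113.0/24", "198.51.100.0/24"}
sites = {
    "ams": {"203.0.113.0/24"},                       # missing the new prefix
    "hkg": {"203.0.113.0/24", "198.51.100.0/24"},    # already in sync
}
```

Tools like Ansible then just execute the plan; the important property is that the diff is computed centrally, so no site is forgotten.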
In any case, you must have resilient DNS. It is the very first step that must work for a user to reach your application at all. On the very first page load, on top of the RTTs we have already mentioned, there will be additional delays from DNS queries, and you must be prepared for the first load to take long enough that some users will not wait. If a long first load is critical for you, you will have to speed up DNS as well.
With anycast, you can cache DNS responses at your points of presence - you already have those sites, and DNS responses will come back quite quickly. So what are the problems? Well, latency, and balancing. And, as we have already mentioned, there is also nature: partly because of it you need to be placed in different locations. This should not be forgotten, even though the risk seems small.
There is also the human factor and plain chance. So automate everything you can, test the configs you push, and monitor changes. Even if you screwed something up, it is good to be able to make a fix quickly and locally.
That covers 90% of cases. The remaining 10% is when competitors decide to take you out. Then you have serious problems. Why is "Backup" highlighted in large type? If someone decides to take you down, you will need very large channel capacity at your sites, which means agreements with a large number of providers - otherwise, at the current average attack level, you simply will not cope. So it is better to delegate than to buy your own hardware. Even the part of the functionality we have described today within anycast, and the problems that arise, are easy to solve incorrectly. So if you have the opportunity not to solve these problems yourself and to shift them onto someone else's shoulders, it is probably worth doing. Otherwise, you need a precise answer to the question of why you need to implement all of this.
Well, and in the event of an attack, turn to the clouds that specialize in solving such problems. Or, well, you can also talk to us.