Balancing caching DNS servers with Cisco IP SLA


A caching DNS service is a mandatory element in the network of any Internet provider. Since the Internet cannot work without DNS, there will be at least two such servers, with mandatory redundancy and load balancing. In this article I will try to describe one option for balancing load across several caching servers, based on Cisco IP SLA.

Possible solutions

Balancing on the client side

The easiest way to make a caching DNS service redundant is to configure several DNS server addresses on the client side. These addresses can be communicated to the client in various ways:

through the appropriate attributes of DHCP (for those subscribers who use IPoE technology and automatic configuration);
via IPCP for PPPoE subscribers (or any other PPP-based technology);
by listing them in an appendix to the contract or in the instructions, for subscribers whose IP connection is configured manually.

If the first server is unavailable, the client automatically switches to the second, and so on. But this method has two obvious drawbacks.

The first is that there is no full load balancing. While the first server is available and working (and we presume that the server’s failure is an unpleasant thing, but extremely rare), the subscribers have no reason to switch to the second one, and it is idle.

The second drawback stems entirely from the first, but is much more significant. The criteria for switching to the backup server are determined entirely by the settings on the client side. Take, for example, the standard DNS client timings of Windows XP: Windows XP switches to a backup server only if the main server does not respond within one second. That is, you can imagine a situation where, as a result of an overload, a configuration error or a failure, the main server works unsatisfactorily but still responds within 0.95 seconds. Obviously, with such a delay there can be no talk of normal Internet operation for the subscriber. But since the response time is less than one second, subscribers "will cry and get pricked, but keep eating the cactus"; in other words, they will experience a degraded service but will not switch to the idle standby DNS server.
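The switching rule described above can be illustrated with a minimal sketch. The function name `server_used` and the latency values are hypothetical; the only figure taken from the text is the one-second timeout, and the real Windows XP resolver logic is more involved than this model.

```python
def server_used(latencies_s, timeout_s=1.0):
    """Return the index of the first server that answers in time.

    latencies_s holds the response time of each configured server in
    seconds, or None if the server never answers. The client tries the
    servers in order and moves on only when one fails to answer within
    timeout_s.
    """
    for index, latency in enumerate(latencies_s):
        if latency is not None and latency <= timeout_s:
            return index
    return None  # no server answered in time


# A degraded primary that answers in 0.95 s is still used.
print(server_used([0.95, 0.01]))
# Only a primary that is silent (or slower than 1 s) triggers a switch.
print(server_used([None, 0.01]))
```

In this model the healthy second server is reached only when the first one misses the timeout entirely, which is exactly the problem described above.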

Of course, such behavior of the DNS service is abnormal and must be detected by the monitoring system, with subsequent escalation of the problem. But monitoring is outside the scope of this article.

Destination NAT

Another balancing method is destination NAT. This solution is actively used to distribute load across clusters of web servers. In this case, subscribers use a single DNS server address, which is the address of the NAT interface, while the real DNS servers have private addresses. When a subscriber sends a request to the DNS server, the destination IP address of the request is translated to the address of one of the real servers. With this approach, the load can be distributed evenly. But this method is somewhat excessive: full-fledged NAT is justified for web servers because their transport is TCP, which is stateful, in contrast to the stateless UDP transport usually used for DNS queries. Unlike traffic exchange with a web server, the client's dialogue with a DNS server is very brief and consists of a single request-response pair.

Static routing

A simpler way is static routes. Imagine that we have three DNS servers with IP addresses:

10.0.0.1,
10.0.0.2,
10.0.0.3.

A loopback interface with the address 10.10.10.10 is configured on each of them. The DNS service is configured to accept requests on this address, and the firewall has rules allowing TCP and UDP port 53.

In this case, any of these servers is ready to serve DNS requests with the destination address 10.10.10.10. This will be the DNS server address used by all our subscribers. It remains to distribute their requests among the real servers. Nothing special is needed for this: on the router to which the servers are connected, we configure three static routes:

 ip route 10.10.10.10 255.255.255.255 10.0.0.1
 ip route 10.10.10.10 255.255.255.255 10.0.0.2
 ip route 10.10.10.10 255.255.255.255 10.0.0.3

As a result, we get the following picture:

 Router#show ip route 10.10.10.10
 Routing entry for 10.10.10.10/32
   Known via "static", distance 1, metric 0
   Routing Descriptor Blocks:
   * 10.0.0.1
       Route metric is 0, traffic share count is 1
     10.0.0.2
       Route metric is 0, traffic share count is 1
     10.0.0.3
       Route metric is 0, traffic share count is 1

Now the standard per-flow balancing mechanism will distribute requests from different subscribers across the three available routes:

Balancing scheme
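To get a feel for per-flow balancing, here is a rough sketch in which each (source, destination) pair is hashed onto one of the three next hops. The hash function and the client addresses from the 192.0.2.0/24 documentation range are assumptions made for illustration; the router uses its own internal algorithm, which is not described in this article.

```python
import hashlib
from collections import Counter

NEXT_HOPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def pick_next_hop(src_ip, dst_ip, next_hops=NEXT_HOPS):
    """Map a (source, destination) flow to one next hop.

    Hypothetical hash for illustration only: a real router applies
    its own algorithm, but the idea is the same. One flow always
    follows one path, and different flows spread over all paths.
    """
    key = f"{src_ip}->{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[bucket]


# 200 hypothetical subscribers querying the shared DNS address.
clients = [f"192.0.2.{i}" for i in range(1, 201)]
share = Counter(pick_next_hop(c, "10.10.10.10") for c in clients)
print(share)
```

The same subscriber always lands on the same server while the set of routes is unchanged, and the subscribers as a whole are spread over all three servers.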

Using the IP SLAs DNS operation for failover

Now that we have taken care of balancing, it remains to solve the problem of redirecting traffic when one of the servers fails. The Cisco IP SLA mechanism will help us here, or more precisely its IP SLAs DNS Operation feature.

How it works

Briefly, this function can be described as follows:

Three IP SLA operations are created in the router configuration. Each periodically queries one of the three DNS servers (for example, with a standard A record query for www.google.com). The result of the query determines the status of the operation.
Three track objects are associated with the IP SLA operations. A track object goes DOWN if the status of the corresponding IP SLA operation is not OK.
Each of the three static routes from the section above is bound to its own track object.

Now, if at a certain moment one of the servers does not respond or does not respond correctly to the DNS query, the corresponding Track object will switch to the DOWN state, and the static route associated with it will disappear from the routing table. As a result, the problem server will be excluded from the balancing mechanism.
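The relationship between probes, track objects and routes can be modeled in a few lines. This is only a toy model with hypothetical names (`TrackedRoute`, `active_next_hops`), not a description of how the router implements tracking internally.

```python
from dataclasses import dataclass


@dataclass
class TrackedRoute:
    """A static route bound to a track object (toy model)."""
    next_hop: str
    probe_ok: bool = True  # True while the bound track object is UP


def active_next_hops(routes):
    """Return the next hops that remain in the routing table.

    A route whose track object is DOWN is withdrawn, so its next hop
    no longer takes part in balancing.
    """
    return [route.next_hop for route in routes if route.probe_ok]


routes = [
    TrackedRoute("10.0.0.1"),
    TrackedRoute("10.0.0.2"),
    TrackedRoute("10.0.0.3"),
]

# The probe against 10.0.0.1 fails, so its track object goes DOWN.
routes[0].probe_ok = False
print(active_next_hops(routes))
```

When the probe succeeds again, setting `probe_ok` back to `True` returns the next hop to the list, which mirrors the route reappearing in the routing table.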

Configuration example

IP SLA configuration:

 ip sla 1
  dns www.google.com name-server 10.0.0.1
  timeout 5000
  frequency 9
  threshold 10
 ip sla 2
  dns www.google.com name-server 10.0.0.2
  timeout 5000
  frequency 9
  threshold 10
 ip sla 3
  dns www.google.com name-server 10.0.0.3
  timeout 5000
  frequency 9
  threshold 10
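For intuition about what such a probe does on the wire, the sketch below builds a minimal DNS A query and sends it to a server over UDP. It is not the router's implementation: the function names are hypothetical, and the success criterion (response code 0 and at least one answer record) is an assumption, since the article does not describe exactly what the Cisco operation checks.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary identifier chosen for this sketch


def build_a_query(hostname, query_id=QUERY_ID):
    """Build a minimal DNS query for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe(server_ip, hostname="www.google.com", timeout_s=5.0):
    """Return True if the server gives a usable answer in time."""
    query = build_a_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable servers
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _, answers, _, _ = struct.unpack("!HHHHHH", reply[:12])
    # Assumed criterion: matching ID, RCODE 0, at least one answer.
    return reply_id == QUERY_ID and (flags & 0x000F) == 0 and answers > 0
```

Calling `probe("10.0.0.1")` from a host that can reach the server would return `True` or `False` in the same spirit as the OK and Timeout return codes of the IP SLA operation.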

Starting the operations and configuring tracking:

 ip sla schedule 1 life forever start-time now
 ip sla schedule 2 life forever start-time now
 ip sla schedule 3 life forever start-time now
 track 1 ip sla 1
 track 2 ip sla 2
 track 3 ip sla 3

Configuring the routes bound to the track objects:

 ip route 10.10.10.10 255.255.255.255 10.0.0.1 track 1
 ip route 10.10.10.10 255.255.255.255 10.0.0.2 track 2
 ip route 10.10.10.10 255.255.255.255 10.0.0.3 track 3

Testing

Now, besides the routes themselves, we have three running IP SLA operations, and the result of the last query is OK for all of them.

 Router# show ip sla statistics
 IPSLA operation id: 1
         Latest RTT: 1 milliseconds
 Latest operation start time: 15:33:27 UTC+7 Wed Aug 17 2016
 Latest operation return code: OK
 Number of successes: 373
 Number of failures: 0
 Operation time to live: Forever

 IPSLA operation id: 2
         Latest RTT: 1 milliseconds
 Latest operation start time: 15:33:27 UTC+7 Wed Aug 17 2016
 Latest operation return code: OK
 Number of successes: 373
 Number of failures: 0
 Operation time to live: Forever

 IPSLA operation id: 3
         Latest RTT: 1 milliseconds
 Latest operation start time: 15:33:27 UTC+7 Wed Aug 17 2016
 Latest operation return code: OK
 Number of successes: 373
 Number of failures: 0
 Operation time to live: Forever

Let's disable the DNS service on the DNS-1 server (10.0.0.1). At the next check (checks run every nine seconds), the corresponding IP SLA operation will report the problem:

 Router# show ip sla statistics 1
 IPSLA operation id: 1
         Latest RTT: NoConnection/Busy/Timeout
 Latest operation start time: 15:37:48 UTC+7 Wed Aug 17 2016
 Latest operation return code: Timeout
 Number of successes: 1
 Number of failures: 1
 Operation time to live: Forever

And the corresponding route with next-hop 10.0.0.1 will disappear from the routing table:

 Router#show ip route 10.10.10.10
 Routing entry for 10.10.10.10/32
   Known via "static", distance 1, metric 0
   Routing Descriptor Blocks:
   * 10.0.0.2
       Route metric is 0, traffic share count is 1
     10.0.0.3
       Route metric is 0, traffic share count is 1

Client requests are now distributed between the two remaining servers. If we start the service again, the next IP SLA check will restore the route, and the server will again take part in balancing.

Drawback of the method

The main and most significant drawback of this method is that the minimum possible polling interval is nine seconds. Very often, such a reaction time is critical. Unfortunately, this is a limitation of the Cisco functionality. If someone can tell me how to get around it, I will be very grateful.

Conclusion

We examined the use of static routing and Cisco IP SLA for balancing and redundancy of a caching DNS service. Obvious advantages of the method:

ease of setup;
everything runs on the router and requires no additional tools;
the response of the service itself is monitored, and not just, for example, reachability via ICMP echo requests.

Disadvantage:

The minimum polling period is 9 seconds, which means the reaction time is up to 9 * 2 = 18 seconds.
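The 18-second figure can be reproduced with a small calculation under one assumption about the timeline: the server fails right after a successful check, the next check starts one polling period later, and the failure is registered only when that check times out. The function name and this timeline are my assumptions, not something stated in Cisco documentation.

```python
def worst_case_detection_s(frequency_s, timeout_s):
    """Worst-case delay between a failure and its detection.

    Assumed timeline: the failure happens just after a successful
    probe, the next probe starts frequency_s later, and it is declared
    failed only after waiting timeout_s for a reply.
    """
    return frequency_s + timeout_s


# Timeout equal to the 9-second polling period: 9 + 9 = 18 seconds.
print(worst_case_detection_s(9, 9))
```

Under the same assumption, a shorter probe timeout would reduce the worst case, but the nine-second polling period would still dominate it.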

Source: https://habr.com/ru/post/307932/
