I want to share a problem that hit us suddenly last week and caused us a lot of trouble, along with its equally sudden solution.
The situation is quite standard: the company's central office is connected to its remote branches over communication channels. Connectivity (Internet and VPN) is provided by two operators. To minimize branch downtime when one of the office's channels goes down, two DMVPN tunnels are built for each branch. Routing within the network is dynamic, using EIGRP. Accordingly, two Cisco routers are used in the central office.
There are about 70 remote branches, so each router builds the same number of tunnels. The average channel load is 40-60% of the bandwidth guaranteed by the operators.
The DMVPN configuration was fairly standard, the kind described in any primer:
hub:
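interface Tunnel201
 description -=DMVPN_201=-
 ip address 10.10.201.1 255.255.255.0
 no ip redirects
 ! reduced MTU leaves room for the tunnel encapsulation overhead
 ip mtu 1416
 ! EIGRP hold time of 25 seconds for AS 1 on this interface
 ip hold-time eigrp 1 25
 ! keep the originating spoke as next hop when re-advertising routes
 no ip next-hop-self eigrp 1
 ip nhrp authentication 11111
 ! replicate multicast (EIGRP hellos) to dynamically registered spokes
 ip nhrp map multicast dynamic
 ip nhrp network-id 201
 ! allow routes learned from one spoke to be advertised to the others
 no ip split-horizon eigrp 1
 delay 1000
 cdp enable
 tunnel source GigabitEthernet0/0.2
 tunnel mode gre multipoint
 tunnel key 11111
!
router eigrp 1
 network 10.10.201.0 0.0.0.255
 network 192.168.0.0 0.0.1.255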
In general, a classic scheme that had worked perfectly for several months in a row.
A similar example from Cisco: www.cisco.com/en/US/tech/tk583/tk372/technologies_configuration_example09186a008014bcd7.shtml
Until one day we ran into a problem: all internal routes learned through EIGRP started dropping every 30-90 seconds. At the same time, the tunnel itself stayed up and the tunnel interface responded perfectly. The logs contained errors like:
*Apr 23 2013 15:19:47.759 GMT+11: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.10.202.9 (Tunnel202) is down: holding time expired
*Apr 23 2013 15:19:52.707 GMT+11: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.10.202.9 (Tunnel202) is up: new adjacency
*Apr 23 2013 15:23:56.298 GMT+11: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.10.202.57 (Tunnel202) is up: new adjacency
*Apr 23 2013 15:24:43.070 GMT+11: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.10.202.9 (Tunnel202) is down: holding time expired
Moreover, the problem appeared suddenly and on both routers at once. We started the usual dancing with a tambourine, poring over cisco.com, and so on.
Rebooting each of the routers separately did not solve the problem.
Rebooting both routers at once, undertaken as a last resort, left the branches without connectivity for a fairly long time (up to 15-20 minutes), but it did clear the problem. We could breathe out and calmly start looking for the cause, hoping that everything would work fine for a few more months, as it had before.
However, we rejoiced too early: 3 days later the problem recurred in exactly the same way. All the recommendations from cisco.com to change the MTU on the tunnel and the physical interface, along with our other shamanic rituals, brought no results. After quite a long time, on one very small forum, we found a thread about a similar problem, and the last message in it said something like this:
“Thank you all, the problem is fixed. I don’t know how or why, but enabling ip bandwidth-percent eigrp helped.”
Since there was nothing else left to try, and without much faith in success, we entered that command on the tunnel interface, specifying 100 as the percentage of the channel that EIGRP may use (there was nothing to lose anyway), and, MIRACLE, everything started working like clockwork.
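For reference, a minimal sketch of where the command goes, assuming the hub tunnel interface and EIGRP autonomous system 1 from the configuration shown earlier. The general form is ip bandwidth-percent eigrp <as-number> <percent>. The value 100 is simply what we tried first, not a recommendation.

```
interface Tunnel201
 ! let EIGRP AS 1 use up to 100% of the bandwidth configured on
 ! this interface for its own packets (the IOS default is 50%)
 ip bandwidth-percent eigrp 1 100
```

The percentage is applied to the bandwidth value configured on the interface, not to the real capacity of the operator's channel, so the two are worth checking together.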
Yet while digging through a pile of forums, we had repeatedly come across this same problem reported by other colleagues, and the solution was never described.
Naturally, we later reduced the percentage.
Maybe we have reinvented the wheel, but we did not find this information anywhere else. Maybe someone else will need it. Feel free to use it.
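A sketch of that follow-up step, under stated assumptions: the value 75 is purely illustrative (the figure we finally settled on is not given here), and the show commands are standard IOS commands that can be used to confirm the neighbors stay up after the change.

```
interface Tunnel201
 ! illustrative value only; lower it gradually and watch the adjacencies
 ip bandwidth-percent eigrp 1 75
!
! then, from privileged EXEC mode:
! neighbor uptime should keep growing instead of resetting
show ip eigrp neighbors
! per-interface EIGRP details, including pacing timers
show ip eigrp interfaces detail
```

If the "holding time expired" messages come back after lowering the value, returning to the previous percentage is the obvious first step.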