“If MikroTik does not work in a simple configuration, then you don’t know how to cook it ... or you have obviously missed something.”

How failover and netwatch work together: a view from the inside.
Sooner or later, almost every company that has grown up a bit starts caring about the quality of its connectivity. Among other things, a customer often wants a fault-tolerant “Dual WAN” and VoIP telephony (also fault tolerant, of course). Plenty of manuals and articles have been written on each topic separately, but it suddenly turned out that not everyone can combine the two.
On Habr there is already an article, “Mikrotik. Failover. Load balancing”, by vdemchuk. As it turned out, it became the source of copy-pasted router configs for many people.
It is a good, working solution, but SIP clients in the LAN, connected to an external IP-PBX through NAT, were losing their sessions on failover. The problem is well known. It is related to how the connection tracker works: it remembers existing outbound connections and keeps their state regardless of any other conditions.
You can understand why this happens by looking at the packet flow diagram:

For transit traffic, connection-tracker processing is performed in just one chain, prerouting (i.e. before routing), before the route and the outgoing interface have been selected. At that stage it is not yet known through which interface the packet will leave for the Internet, so with several WAN interfaces the source IP cannot be chosen there. The mechanism records established connections after the fact, and keeps each entry as long as packets keep flowing through the connection or until the specified timeout expires.
The described behavior is typical not only for MikroTik routers, but for most Linux-based systems doing NAT.
As a result, when the link through WAN1 goes down, the data flow is dutifully sent through WAN2, but the source IP of packets that have already passed through NAT remains unchanged, taken from the WAN1 interface, because an entry for the connection already exists in the connection tracker. Naturally, the replies to such packets arrive at the WAN1 interface, which has already lost contact with the outside world. In the end, the connection appears to exist, but in fact it does not. At the same time, all new connections are established correctly.
Hint: you can see which addresses NAT translates from and to in the “Reply Src. Address” and “Reply Dst. Address” columns. These columns can be enabled in the “Connections” table via the right mouse button.
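The same information can also be pulled from the terminal; a minimal sketch (the reply-src-address / reply-dst-address property names are those shown by recent RouterOS builds, so verify them on your version):
# show tracked SIP connections together with the NAT-translated (“reply”) addresses
/ip firewall connection print detail where dst-address~":5060"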

At first glance the way out looks quite simple: on switchover, reset the previously established SIP connections so that they are re-established with the new source IP. A simple script for this has been wandering around the Internet for a while.
Script:
:foreach i in=[/ip firewall connection find dst-address~":5060"] do={ /ip firewall connection remove $i }
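To try it by hand from the terminal first, the same loop can be pasted as-is; the :put line below is an illustrative addition (not part of the original script) that prints what is being removed:
:foreach i in=[/ip firewall connection find dst-address~":5060"] do={
    :put ("clearing $[/ip firewall connection get $i src-address] -> $[/ip firewall connection get $i dst-address]")
    /ip firewall connection remove $i
}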
Three steps to failure
Step one. Copy-pasters faithfully transfer the config for failover via recursive routing:
Routing setup from the article “Mikrotik. Failover. Load balancing”:
# Set up the provider networks:
/ip address add address=10.100.1.1/24 interface=ISP1
/ip address add address=10.200.1.1/24 interface=ISP2
# Set up the local interface
/ip address add address=10.1.1.1/24 interface=LAN
# hide behind NAT everything that leaves the local network
/ip firewall nat add src-address=10.1.1.0/24 action=masquerade chain=srcnat
### Provide failover with deeper channel analysis ###
# using the scope parameter, specify recursive paths to nodes 8.8.8.8 and 8.8.4.4
/ip route add dst-address=8.8.8.8 gateway=10.100.1.254 scope=10
/ip route add dst-address=8.8.4.4 gateway=10.200.1.254 scope=10
# specify 2 default gateways through the nodes whose paths are set recursively
/ip route add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=1 check-gateway=ping
/ip route add dst-address=0.0.0.0/0 gateway=8.8.4.4 distance=2 check-gateway=ping
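Before moving on, it is worth checking that the recursive routes actually resolve; a minimal sanity check from the terminal (addresses as in the config above, output layout depends on the RouterOS version):
# the default routes should show 8.8.8.8 / 8.8.4.4 as recursively resolved gateways
/ip route print detail where dst-address=0.0.0.0/0
# the monitoring hosts themselves should answer through their pinned routes
/ping 8.8.8.8 count=4
/ping 8.8.4.4 count=4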
Step two. Track the switchover event. With what? “/tool netwatch”, naturally! An attempt to watch the WAN1 gateway usually looks like this:
Netwatch config:
/tool netwatch
# note: the inner quotes and $ signs are escaped so that the command can be pasted into the terminal as-is
add comment="Check Main Link via 8.8.8.8" host=8.8.8.8 timeout=500ms \
    down-script=":log warning (\"WAN1 DOWN\")
    :foreach i in=[/ip firewall connection find dst-address~\":5060\"] do={
        :log warning (\"clear-SIP-connections: clearing connection src-address: \$[/ip firewall connection get \$i src-address] dst-address: \$[/ip firewall connection get \$i dst-address]\")
        /ip firewall connection remove \$i }" \
    up-script=":log warning (\"WAN1 UP\")
    :foreach i in=[/ip firewall connection find dst-address~\":5060\"] do={
        :log warning (\"clear-SIP-connections: clearing connection src-address: \$[/ip firewall connection get \$i src-address] dst-address: \$[/ip firewall connection get \$i dst-address]\")
        /ip firewall connection remove \$i }"
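To confirm that the netwatch scripts actually fire, the warnings they write can be fished out of the log; a sketch, assuming the message filter syntax of recent RouterOS versions:
/log print where message~"WAN1"
/log print where message~"clear-SIP-connections"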
Step three. Check.
The admin shuts down the WAN1 uplink and runs the script manually. The SIP clients reconnect. Does it work? It works!
The admin turns WAN1 back on and runs the script manually. The SIP clients reconnect. Does it work? It works!
Fail
In a real situation such a config refuses to work. Repeating step three over and over drives the admin to bitterness, and we hear: “Your MikroTik does not work!”

Debriefing
It is all a matter of misunderstanding how Netwatch works. With recursive routing in place, the utility simply pings the specified host according to the main routing table, using the currently active routes.
Let's do an experiment: disable the main channel WAN1 and look at /tool netwatch. We will see that host 8.8.8.8 still has the UP state.
For comparison, the check-gateway=ping option works for each route separately, including recursive ones, and marks the route itself as active or inactive.
Netwatch, however, uses whatever routes are active at the moment. When something happens on the link to the ISP1 provider's gateway (WAN1), the route to 8.8.8.8 via WAN1 becomes inactive, and netwatch ignores it, sending its pings along the new default route. Failover plays a cruel joke, and netwatch thinks everything is fine.
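The difference is easy to see side by side during the experiment above; a sketch of what to look at (standard commands, the comments are our interpretation):
# the pinned route to 8.8.8.8 loses its “active” flag when the WAN1 gateway disappears
/ip route print detail where dst-address=8.8.8.8/32
# ...while netwatch keeps reporting the host as UP, because its pings now leave via WAN2
/tool netwatch print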
The second netwatch quirk is double triggering. Its mechanism is as follows: if netwatch's pings fall into a check-gateway timeout, the host is reported DOWN for one check cycle. The channel-switch script fires, and the SIP connections are correctly re-established over the new link. Works? Not quite.
Soon the routing table is rebuilt, host 8.8.8.8 gets the UP status again, and the SIP-reset script fires a second time. The connections are reset once more, now via WAN2.
As a result, when ISP1 is returned to service and working traffic moves back to WAN1, the SIP connections remain stuck on ISP2 (WAN2). The danger is that if problems later appear on the backup channel, the system will not notice them, and you will be left without telephony.

Solution
To make sure that traffic to the monitoring host 8.8.8.8 does not slip over to ISP2, we need a backup route to 8.8.8.8. For the case when ISP1 goes down, create a backup route with a larger distance, for example distance=10, and type=blackhole. It becomes active when the link to the WAN1 gateway disappears:
/ip route add distance=10 dst-address=8.8.8.8 type=blackhole
As a result, we only need to add one line to the config:
Corrected routing:
# Set up the provider networks:
/ip address add address=10.100.1.1/24 interface=ISP1
/ip address add address=10.200.1.1/24 interface=ISP2
# Set up the local interface
/ip address add address=10.1.1.1/24 interface=LAN
# hide behind NAT everything that leaves the local network
/ip firewall nat add src-address=10.1.1.0/24 action=masquerade chain=srcnat
### Provide failover with deeper channel analysis ###
# using the scope parameter, specify recursive paths to nodes 8.8.8.8 and 8.8.4.4
/ip route add dst-address=8.8.8.8 gateway=10.100.1.254 scope=10
# backup blackhole route that takes over when the route via ISP1 becomes inactive
/ip route add distance=10 dst-address=8.8.8.8 type=blackhole
/ip route add dst-address=8.8.4.4 gateway=10.200.1.254 scope=10
# specify 2 default gateways through the nodes whose paths are set recursively
/ip route add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=1 check-gateway=ping
/ip route add dst-address=0.0.0.0/0 gateway=8.8.4.4 distance=2 check-gateway=ping
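With the blackhole route added, the same experiment behaves as expected; a quick check sketch (commands as above, the interpretation is ours):
# when the link to the ISP1 gateway is lost, the blackhole route with distance=10 becomes active,
# so pings to 8.8.8.8 are dropped on the router instead of leaking out via ISP2
/ip route print detail where dst-address=8.8.8.8/32
# netwatch now sees 8.8.8.8 as DOWN and runs the down-script once
/tool netwatch print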
This situation is typical of last-mile failures, when the ISP1 gateway itself becomes unreachable, and of setups with tunnels, which are more prone to drops because of their chained dependencies.
I hope this article helps you avoid such mistakes. Choose fresh manuals, stay informed, and everything will “take off” for you.