This is a short story from real-life practice about how a small problem, well disguised by fault tolerance, turned into a headache.
A little background
A small branch office has its own PBX (Asterisk + FreePBX) running on desktop-grade hardware, plus an equally modest local terminal server hosting 1C, a file dump, and a virtual read-only (RO) domain controller. A MikroTik provides internet access. The branch is small, so that is enough for them.
It all started with monitoring (which, for lack of time and out of laziness, doesn't cover everything) reporting that one server in the branch office, the one running the PBX, had overheated. While the locals were dealing with that, the old box hung and slightly corrupted the MySQL database.
There were plenty of omens of trouble, but not of this one...
No big deal: the database is repaired, so everything should work. But the locals complain that calls are failing. Fine, so the problem must be in FreePBX: I take a backup, restore it, and everything looks OK.
But the trouble hasn't gone anywhere: the locals still complain that calls don't go through properly. Incoming calls reach them fine, but when they dial out or call each other, there is a delay of several seconds. I start digging through the voluminous and cryptic Asterisk and FreePBX logs, but can't spot the problem in them. I recall an earlier problem with STUN and ICE that caused a similar delay. I disable all of it; the result is zero.
Despondency - the path to making bad decisions
I get discouraged: many hours of tinkering with the PBX have led nowhere, it is already deep into the night, and the problem is still not solved.
I leave the problem until morning, hoping for a fresh mind. In the morning I make another bad decision: since the system crashed (although the consequences could hardly have been that devastating), I try to fix it by reinstalling all the packages. The result is slightly better than zero: the delay has decreased (not significantly, but it is progress).
I make yet another bad decision. Partially repairing the OS (and restoring the database from backup) brought little success, the root cause is still unclear, and a lot of time has already gone into searching for it, so I decide to act radically: wipe the OS and roll everything out from scratch (automation makes that possible in a reasonable time). I restore the FreePBX configuration from a copy. Another failure. The result is zero!
Despair - the mind clouds over, decisions get worse
I fall into despair. Very bad thoughts creep in: maybe the config in the backup is broken (I have had this happen before, when a config stopped working after a series of updates and I never found the reason). Nothing else is left: I have to set everything up from a clean slate by hand. What a disgrace! The result is strictly zero, and a lot more time is gone!
Acceptance - the path to awareness
In desperate attempts to understand what is going on, I start studying the logs carefully. I notice a pattern: a call to a single extension takes exactly 5 seconds to go through, and a call to a ring group of 3 extensions takes 15! I start googling the call delay again, this time with the specific figure. And I come across an answer I had already seen: people say the problem is DNS. But I know for sure there is no problem there, all addresses resolve!
The obvious is the unbelievable
With nothing left to try, I pick up nslookup and bingo (this is where I should have started)! The primary DNS (the VM with the domain controller) is down, and I hadn't noticed! Had there been only one DNS server, the error would have shown up right away ;)
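The same check can be scripted. The system resolver hides a dead primary by silently falling back to the next server after a timeout (the glibc default is 5 seconds per server, which would match the delay pattern if the PBX uses the system resolver with default settings). Querying each server directly, as `nslookup name server` does, exposes the dead one. Below is a minimal sketch using only the Python standard library; the server addresses and the hostname are hypothetical placeholders, and the packet builder covers only a plain A-record query.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe_dns_server(server, hostname, timeout=2.0, port=53):
    """Query one specific server; return elapsed seconds, or None if no valid reply."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and network errors both mean "no answer"
            return None
        # Only count a reply that echoes our query ID.
        if reply[:2] != query[:2]:
            return None
        return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical addresses: replace with the servers from your resolver config.
    for server in ("192.0.2.10", "192.0.2.11"):
        elapsed = probe_dns_server(server, "example.com")
        status = "NO ANSWER" if elapsed is None else f"answered in {elapsed:.3f}s"
        print(f"{server}: {status}")
```

The point of probing servers one by one is that a plain "does this name resolve?" test passes as long as any server in the list answers, which is exactly how the failure stayed hidden here.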
Summary
An elementary problem that monitoring could have caught (it still needs to be configured for all nodes), masked by DNS fault tolerance, cost almost two working days spent sorting out a silly situation. Being too lazy to spend a minute setting up monitoring means spending two days looking for a problem where there isn't one.
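One way a monitoring check could catch this kind of masked failure is to measure how long resolution takes, not just whether it succeeds. The sketch below is an illustration rather than a ready-made plugin: the 1-second threshold and the status names are assumptions, and `timed_resolve` simply wraps the standard `socket.getaddrinfo` call.

```python
import socket
import time


def timed_resolve(hostname, resolver=socket.getaddrinfo):
    """Resolve via the system resolver; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.monotonic() - start


def classify(ok, elapsed, slow_threshold=1.0):
    """Map a resolution result to a monitoring status (threshold is an assumption)."""
    if not ok:
        return "CRITICAL"  # nothing resolves at all
    if elapsed >= slow_threshold:
        return "WARNING"   # resolves, but slowly: a server may be timing out
    return "OK"


if __name__ == "__main__":
    ok, elapsed = timed_resolve("example.com")
    print(classify(ok, elapsed), f"{elapsed:.3f}s")
```

A resolution that succeeds only after several seconds is the signature of a fallback past a dead server, so a latency threshold flags it even though every address still resolves. Note that local caching can hide the delay, so the name used for the check should be one that is not served from a cache.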