Which address the router should have on a network by default is an open question. In fact, nothing prevents it from being any address in the subnet. The OpenStack developers made their own choice: just take the first one.
As a result, before you know it, everything goes down. Why? Because, to everyone's surprise, the default gateway address now lives not on the router, where it belongs, but on your OpenStack. Customers are calling, the boss is furious, and you are still looking for some other cause of the outage. All that happened was that a colleague detached an existing address in order to replace it, and OpenStack turned out to be smarter...
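To make the trap concrete, here is a minimal sketch, assuming the gateway defaults to the first host address of the subnet. The subnet, the port addresses, and both helper names are made up for illustration; nothing here talks to a real OpenStack API.

```python
import ipaddress


def default_gateway(cidr):
    """Return the first usable host address of the subnet.

    Assumption: this is the address picked as the gateway when none
    is given explicitly.
    """
    network = ipaddress.ip_network(cidr)
    return next(iter(network.hosts()))


def ports_on_gateway(cidr, port_ips):
    """Return the port addresses that collide with the assumed gateway."""
    gateway = default_gateway(cidr)
    return [ip for ip in port_ips if ipaddress.ip_address(ip) == gateway]


# Hypothetical subnet: the physical router is expected at .1,
# but one of the ports has grabbed that address.
print(default_gateway("192.0.2.0/24"))
print(ports_on_gateway("192.0.2.0/24", ["192.0.2.1", "192.0.2.15"]))
```

Running a check like this against the real list of port addresses before and after detaching an address would show at once whether something other than the router now holds the gateway address.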
In some cases the problem shows up immediately, in others it does not. A reminder: the old problem was that some IP packets intermittently went missing.
Let me make a few excuses. Our problems often coincided with external attacks, and in many cases the cause appeared to be overloaded links. Sometimes we really did exceed the link capacity, and packets really were dropped. This was aggravated by infected machines on the platform, which generated an incredible amount of internal traffic. On top of that came network equipment malfunctions, where programmer errors caused the wrong packets to be dropped as well. In addition, the configuration files are huge.
I am neither a robot nor a wizard: with careful reading one can understand what each option does, but it was completely unclear whether a given option was needed in our specific context. I had to guess at the most reasonable settings and check them in practice.
Therefore, it was hard for my colleagues and me to isolate and identify the problem. Worse, the problem did not appear in the newly created farm. We spun up three hundred machines there, and everything ran like clockwork. Of course, we immediately began preparing the farm for production, which meant adding "ragged" ranges of IP addresses. We cleaned the farm out, removing those three hundred machines. And suddenly, with only three test virtual machines, the same thing happened as on the old farm: packets began to disappear in large numbers. So we decided that the problem was somewhere deep inside OpenStack.
On the old farm we found a relatively simple workaround: detach the internal IP address from the machine and assign one from a different subnet. For this we had to add new subnets frequently. The problem went away for a while, and some machines worked fine.
In the course of a long investigation, interrupted by project work and by problems from VIPs, we still managed to identify several errors. The configuration files also differ depending on whether or not the controller doubles as a compute node. One of our first working configurations used the controller that way; later we abandoned it, but some of the settings stayed behind. As a result, two of the nine machines had incorrect settings: the compute nodes had ended up with dvr_snat instead of dvr. In the end I found the right parameter and put it in place.
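A leftover setting like this is easy to catch with a small check. The sketch below assumes the value in question is `agent_mode` in the `[DEFAULT]` section of the L3 agent configuration, that compute nodes are meant to run `dvr`, and that nodes hosting centralized SNAT run `dvr_snat`. The role names and the sample file contents are hypothetical.

```python
import configparser

# Assumed mapping of node role to the expected agent_mode.
EXPECTED_MODE = {"compute": "dvr", "network": "dvr_snat"}


def check_agent_mode(role, l3_agent_ini):
    """Return True if the config text carries the mode expected for the role."""
    parser = configparser.ConfigParser()
    parser.read_string(l3_agent_ini)
    # Fall back to "legacy" when the option is absent from the file.
    actual = parser.get("DEFAULT", "agent_mode", fallback="legacy")
    return actual == EXPECTED_MODE[role]


# Hypothetical config left over from the days when the controller
# also served as a compute node.
leftover = "[DEFAULT]\nagent_mode = dvr_snat\n"
print(check_agent_mode("compute", leftover))   # False: wrong for a compute node
print(check_agent_mode("network", leftover))   # True
```

Run against the configuration of every node, a check of this kind would have pointed at the two misconfigured machines out of nine.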
I also had to configure the virtual router without understanding how it works or where it gets its settings from. In theory, it should have one address and, accordingly, one MAC address. Logical, right? That was the reasoning my colleague and I followed, and we configured it accordingly.
At some point, while investigating DHCP problems (see Part 2), I found duplicate MAC addresses. Not one or two, but many more. What a surprise!
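Finding such duplicates by eye is tedious, so here is a minimal sketch of how a port list could be scanned for them. The port identifiers and MAC addresses are invented, and the helper name is hypothetical; the input is assumed to be a plain list of (port id, MAC) pairs exported from wherever the ports are listed.

```python
from collections import defaultdict


def find_duplicate_macs(ports):
    """Group port ids by MAC address and keep only MACs used more than once."""
    by_mac = defaultdict(list)
    for port_id, mac in ports:
        # Normalize case so "FA:16:..." and "fa:16:..." count as the same MAC.
        by_mac[mac.lower()].append(port_id)
    return {mac: ids for mac, ids in by_mac.items() if len(ids) > 1}


# Hypothetical export of (port id, MAC address) pairs.
ports = [
    ("port-a", "fa:16:3e:00:00:01"),
    ("port-b", "FA:16:3E:00:00:01"),
    ("port-c", "fa:16:3e:00:00:02"),
]
print(find_duplicate_macs(ports))
# {'fa:16:3e:00:00:01': ['port-a', 'port-b']}
```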
We decided to change the base_mac and dvr_base_mac settings. Now these parameters are different on every compute node and every controller.
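One way to keep the values distinct across nodes is to derive them from a node number. The sketch below is only an illustration of that idea, not the scheme we actually used: it assumes the prefixes `fa:16:3e` and `fa:16:3f` and puts a hypothetical node index into the fourth octet. The host names are invented.

```python
def node_base_macs(node_index):
    """Build a distinct base_mac / dvr_base_mac pair for one node.

    Assumption: the node index goes into the fourth octet, so it must
    fit in one byte and be non-zero.
    """
    if not 1 <= node_index <= 255:
        raise ValueError("node_index must be between 1 and 255")
    octet = f"{node_index:02x}"
    return {
        "base_mac": f"fa:16:3e:{octet}:00:00",
        "dvr_base_mac": f"fa:16:3f:{octet}:00:00",
    }


# Hypothetical host list; each host gets its own pair of values.
for index, host in enumerate(["controller-1", "compute-1", "compute-2"], start=1):
    print(host, node_base_macs(index))
```

Whatever scheme is chosen, the point is the same: no two nodes, and no two parameters on one node, should share a value.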
From the very beginning we had not enabled l2population; we simply never got around to it. On the new farm we enabled it. And lo and behold, after all these changes it worked! More than that, pings stopped disappearing altogether. Previously a packet would vanish every now and then for no apparent reason, about 0.1%, and we considered that quite good, because it is much worse when a quarter or even half of them disappear.
We waited patiently for a day (though we wanted to run around shouting “it worked!”) and then applied the same changes to the old farm. It has been running normally for the second week now.
Conclusion: of course, none of this would have happened if we had set everything up with an automated installer rather than by hand. However, the experience gained is invaluable.
Source: https://habr.com/ru/post/333872/