OpenStack: a detective story, or where does the connection get lost? Part three

“Who builds like this?!”

Which address a router should take on a network by default is a big question. In fact, nothing prevents it from being any address in the subnet. And the authors of OpenStack apparently decided not to overthink it: let it be the first one.

As a result, before you can catch your breath, everything falls over. Why? Because, to everyone's surprise, the default gateway is not on the router, where it belongs, but on your OpenStack. Customers are calling, the boss is furious. And you are hunting for some other cause of the outage. All that happened was that a colleague detached an existing address in order to replace it, and OpenStack turned out to be smarter...
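The collision described above can be checked in advance. Below is a minimal sketch, assuming that a subnet created without an explicit gateway gets the first usable address as its gateway and that every other address lands in the allocation pool. The function names and the sample addresses are hypothetical and only illustrate the idea.

```python
import ipaddress


def default_gateway(cidr):
    """First usable host address of the subnet.

    Assumption: this mirrors the gateway address the subnet receives
    when none is specified explicitly.
    """
    network = ipaddress.ip_network(cidr)
    return next(iter(network.hosts()))


def router_address_at_risk(cidr, physical_router_ip):
    """True if the physical router does not sit on the default gateway address.

    In that case its address falls inside the default allocation pool
    and could be handed out to some port.
    """
    router = ipaddress.ip_address(physical_router_ip)
    if router not in ipaddress.ip_network(cidr):
        raise ValueError(f"{router} is not inside {cidr}")
    return router != default_gateway(cidr)


if __name__ == "__main__":
    # Hypothetical subnet and router address, for illustration only.
    print(default_gateway("192.168.1.0/24"))
    print(router_address_at_risk("192.168.1.0/24", "192.168.1.254"))
```

If the check reports a risk, the usual remedy is to state the gateway explicitly when creating the subnet, or to exclude the router's address from the allocation pool.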

Life goes on

In some cases the problem shows up immediately, in others it does not. Let me remind you: the old problem was that, intermittently, some IP packets started to disappear.

Let me offer a few excuses. Our problems often coincided with external attacks. At the same time, in many cases it looked as though the cause was overloaded channels. Sometimes we really did exceed the channel limit, and packets really were dropped. This was aggravated by infected machines on the platform, which generated an incredible amount of internal traffic. On top of that came faults in the network equipment, which, because of programming errors, also dropped packets it should not have. In addition, the configuration files are huge.

I am not a robot or a wizard: with careful reading you can understand what each option does, but it was completely unclear whether a given option was needed in our specific context. In practice I had to guess intuitively which assumptions were the most reasonable.

That is why it was hard for me and my colleagues to isolate and identify the problem. Worse, the problem did not show up on the newly built farm. We spun up three hundred machines, and everything ran like clockwork. Of course, we immediately began preparing it for production. That meant adding "ragged" (non-contiguous) ranges of IP addresses. We cleaned out the farm, deleting those three hundred machines. And suddenly, with only three test virtual machines, the same thing happened as on the old farm: packets began to disappear in large numbers. So we concluded that the problem was somewhere deep inside OpenStack.

Strange temporary solutions

On the old farm we found a relatively simple workaround: detach the internal IP address from the machine and assign it one from another subnet, which meant we had to add new subnets frequently. The problem went away for a while. Some machines worked well.

Solution somewhere nearby

In the course of a long investigation, interrupted by project work and by problems coming from VIPs, we did manage to identify several errors. One complication is that these configuration files differ depending on whether or not you use the controller as a compute node. In one of our first working configurations we did use it that way. Later we gave that up, but some of the settings stayed behind. As a result, two of the nine machines had incorrect settings (compute nodes had ended up with dvr_snat instead of dvr). In the end I found the right parameter and put it in place.
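A check like the one we ended up doing by hand can be sketched in a few lines. This is only an illustration: it assumes the mode in question is the `agent_mode` value of the L3 agent, that controllers are expected to run `dvr_snat` and compute nodes `dvr`, and it takes a hand-written inventory rather than reading real files. The host names are hypothetical.

```python
# Assumed mapping of node role to the L3 agent mode it should run in a DVR setup.
EXPECTED_AGENT_MODE = {
    "controller": "dvr_snat",
    "compute": "dvr",
}


def find_wrong_agent_modes(nodes):
    """Return (host, actual_mode, expected_mode) for every misconfigured node.

    nodes is an iterable of (host, role, agent_mode) tuples.
    """
    wrong = []
    for host, role, agent_mode in nodes:
        expected = EXPECTED_AGENT_MODE[role]
        if agent_mode != expected:
            wrong.append((host, agent_mode, expected))
    return wrong


if __name__ == "__main__":
    # Hypothetical inventory: one compute node still carries the controller mode.
    inventory = [
        ("mama", "controller", "dvr_snat"),
        ("baby1", "compute", "dvr"),
        ("baby2", "compute", "dvr_snat"),
    ]
    for host, actual, expected in find_wrong_agent_modes(inventory):
        print(f"{host}: agent_mode is {actual}, expected {expected}")
```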

Without understanding how the virtual router works or where it gets its settings, I had to configure it as well. In theory it should have a single address and, accordingly, a single MAC address. Logical, isn't it? That is how my colleague and I reasoned, and we configured it accordingly.

At some point, while investigating the DHCP problems (see Part 2), I found duplicate MAC addresses. Not one or two, but many more. What a surprise!

We decided to change the base_mac and dvr_base_mac settings. Now these parameters are different on every compute node and every controller.

We had not enabled l2population from the very beginning; we simply never got around to it. On the new farm we turned it on. And look: after all these changes, it worked! Better still, pings stopped disappearing altogether. Previously a packet would vanish every now and then for no apparent reason, about 0.1%, and we thought that was pretty good, because it is much worse when a quarter or even half of them disappear.
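Finding such duplicates is a simple grouping exercise once the port list is in hand. A minimal sketch, assuming the ports have already been exported as (port ID, MAC address) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_macs(ports):
    """Group port IDs by MAC address and keep only MACs used more than once.

    ports is an iterable of (port_id, mac_address) pairs.
    MAC addresses are compared case-insensitively.
    """
    by_mac = defaultdict(list)
    for port_id, mac in ports:
        by_mac[mac.lower()].append(port_id)
    return {mac: ids for mac, ids in by_mac.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical sample: two ports share one MAC address.
    sample = [
        ("port-a", "fa:17:a1:00:00:01"),
        ("port-b", "FA:17:A1:00:00:01"),
        ("port-c", "fa:17:a1:00:00:02"),
    ]
    for mac, ids in find_duplicate_macs(sample).items():
        print(mac, "is used by", ", ".join(ids))
```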
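Keeping these values distinct by hand across nine machines is error-prone, so a small checker helps. The sketch below assumes the text of each node's neutron.conf has already been collected into a dictionary keyed by host name; it reads `base_mac` and `dvr_base_mac` from the `[DEFAULT]` section and reports any value that appears more than once. The helper names are hypothetical.

```python
import configparser
from collections import defaultdict

MAC_OPTIONS = ("base_mac", "dvr_base_mac")


def read_mac_bases(conf_text):
    """Extract base_mac and dvr_base_mac from the [DEFAULT] section."""
    parser = configparser.ConfigParser(
        interpolation=None,   # values may contain '%', do not expand them
        delimiters=("=",),    # MAC addresses contain ':', so only '=' separates
        strict=False,         # tolerate repeated options in hand-edited files
    )
    parser.read_string(conf_text)
    defaults = parser.defaults()
    return {opt: defaults[opt].lower() for opt in MAC_OPTIONS if opt in defaults}


def find_shared_mac_bases(configs):
    """Return {value: [(host, option), ...]} for values used more than once.

    configs maps a host name to the text of its neutron.conf.
    """
    users = defaultdict(list)
    for host, conf_text in configs.items():
        for option, value in read_mac_bases(conf_text).items():
            users[value].append((host, option))
    return {value: owners for value, owners in users.items() if len(owners) > 1}
```

An empty result means every node has its own pair of values, which is the state we were aiming for.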
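To compare "before" and "after" with numbers rather than impressions, the loss figure can be pulled out of saved ping output. A minimal sketch, assuming the summary line has the form printed by the common Linux ping ("N packets transmitted, M received"); the function name and the sample line are hypothetical.

```python
import re

# Matches the summary line of the common Linux ping utility (assumed format).
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output):
    """Compute the packet loss percentage from ping's summary line."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    # Hypothetical summary line: one packet lost out of a thousand.
    sample = "1000 packets transmitted, 999 received, 0.1% packet loss, time 999ms"
    print(f"{packet_loss_percent(sample):.1f}% loss")
```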

Settings that did us good: neutron.conf for the controller node

root@mama:~# cat /etc/neutron/neutron.conf | grep -v "^#.*" | strings
[DEFAULT]
bind_host = 192.168.1.4
auth_strategy = keystone
core_plugin = ml2
allow_overlapping_ips = true
service_plugins = router
base_mac = fa:17:a1:00:00:00
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
advertise_mtu = true
allow_automatic_dhcp_failover = true
dhcp_agents_per_network = 3
dvr_base_mac = fa:17:b1:00:00:00
router_distributed = true
allow_automatic_l3agent_failover = true
l3_ha = true
max_l3_agents_per_router = 3
rpc_backend = rabbit
[agent]
root_helper = sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
[database]
connection = mysql+pymysql://neutron:ZPASSWORDZ@mama/neutron
[keystone_authtoken]
auth_uri = mama:5000
auth_url = mama:35357
memcached_servers = mama:11230
auth_plugin = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = ZPASSWORDZ
[nova]
auth_url = mama:35357
auth_plugin = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = nova
password = ZPASSWORDZ
[oslo_messaging_rabbit]
rabbit_userid = openstack
rabbit_password = ZPASSWORDZ
rabbit_durable_queues = true
rabbit_hosts = mama:5673
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0
rabbit_ha_queues = false
[quotas]
quota_network = 100
quota_subnet = 200
quota_port = -1
quota_router = 100
quota_floatingip = -1
quota_security_group = -1
quota_security_group_rule = -1

Settings that did us good: neutron.conf for the compute node

root@baby:~# cat /etc/neutron/neutron.conf | grep -v "^#.*" | strings
[DEFAULT]
bind_host = 192.168.1.7
bind_port = 9696
auth_strategy = keystone
core_plugin = ml2
allow_overlapping_ips = true
service_plugins = router
base_mac = fa:17:c1:00:00:00
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
allow_automatic_dhcp_failover = true
dhcp_agents_per_network = 3
dvr_base_mac = fa:17:d1:00:00:00
router_distributed = true
allow_automatic_l3agent_failover = true
l3_ha = true
max_l3_agents_per_router = 3
rpc_backend = rabbit
[agent]
root_helper = sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
[database]
connection = mysql+pymysql://neutron:ZPASSWORDZ@mama/neutron
[keystone_authtoken]
auth_uri = mama:5000
auth_url = mama:35357
memcached_servers = mama:11230
auth_plugin = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = ZPASSWORDZ
[nova]
auth_url = mama:35357
auth_plugin = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = nova
password = ZPASSWORDZ
[oslo_messaging_rabbit]
rabbit_hosts = mama:5673
rabbit_userid = openstack
rabbit_password = ZPASSWORDZ
rabbit_durable_queues = true
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0
rabbit_ha_queues = true

We waited patiently for a day (though we wanted to run around shouting “it works!”), then applied similar changes to the old farm. It is now the second week of smooth flying.

Conclusion

Conclusion: of course, none of this would have happened if we had set everything up with an automated installer rather than by hand. However, the experience gained is invaluable.

Source: https://habr.com/ru/post/333872/
