
The story of one bug (#1653967)

Abstract: A true story from the lives of real administrators about hunting down an idiotic bug.
The moral: never underestimate the dependencies of your dependencies.

Introduction


A routine upgrade in the lab from OpenStack Mitaka to OpenStack Newton (a newer release). Several options in the configuration files were deprecated, and keystone moved from eventlet to WSGI, which broke the existing setup with haproxy. Because of the typical “IPv6 listen”, apache did not conflict with haproxy over the same ports on the wildcard address (one listened on IPv6, the other on IPv4 only), so requests went to haproxy instead of apache and died there with a 503, since there was no upstream... However, this story is not about that.

Once the main problems were fixed, Nova (one of the OpenStack components) began to crash on startup with the error: ConfigFileValueError: Value for option url is not valid: invalid URI: 'http://neutron-server.example.com:21345'. That was very strange. Since a huge number of options had changed in the config, we suspected that we were using an obsolete option that should no longer be set. However, the documentation gave url = http://controller:9696 as an example for this very option.

Debugging


Obvious debugging steps:

The upshot: a hyphen in the hostname causes a ConfigFileValueError. Bug report: bugs.launchpad.net/ubuntu/+source/nova/+bug/1653967

First, check that this really is a bug. RFC 3986 states:
    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reg-name   = *( unreserved / pct-encoded / sub-delims )
    host       = IP-literal / IPv4address / reg-name

(This is BNF-style notation; it says that a hyphen is allowed in the host.)
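The grammar above can be checked mechanically. Here is a minimal sketch that translates the reg-name production into a regular expression; the helper name is_reg_name is made up for illustration, and the sketch covers only reg-name, not IP-literal or IPv4address.

```python
import re

# Character classes taken from the RFC 3986 grammar quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(
    r"(?:[%s%s]|%s)*" % (UNRESERVED, SUB_DELIMS, PCT_ENCODED))


def is_reg_name(host):
    """Return True if host matches the reg-name production (hypothetical helper)."""
    return REG_NAME.fullmatch(host) is not None


if __name__ == "__main__":
    for candidate in ("neutron-server.example.com", "test-test.com", "bad host"):
        print(candidate, is_reg_name(candidate))
```

By this grammar, both hyphenated hostnames are valid, and only the one containing a space is rejected.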

We all know that already, but it is always better to double-check.

Examine the code that reports the error:

    try:
        return convert(opt._get_from_namespace(namespace, group_name))
    except KeyError:  # nosec: Valid control flow instruction
        pass
    except ValueError as ve:
        raise ConfigFileValueError(
            "Value for option %s is not valid: %s"
            % (opt.name, str(ve)))

The error occurred on two options: url and novncproxy_base_url. The error is identical for both, but the second name is easier to grep for, so we start with it. Here is how it is defined in the code:

    cfg.URIOpt(
        'novncproxy_base_url',
        default='http://127.0.0.1:6080/vnc_auto.html',
        deprecated_group='DEFAULT',
        help="""

Aha. And cfg comes from: from oslo_config import cfg. oslo.config is the OpenStack library for working with configs. Time to read the source.

We see:

    class URI(ConfigType):
        ...
        def __call__(self, value):
            if not rfc3986.is_valid_uri(value, require_scheme=True,
                                        require_authority=True):
                raise ValueError('invalid URI: %r' % value)

Suddenly:

    >>> import rfc3986
    >>> rfc3986.is_valid_uri('http://test.com')
    True
    >>> rfc3986.is_valid_uri('http://test-test.com')
    False
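To see how a validator bug in a dependency surfaces as a configuration error, here is a self-contained sketch of the same call chain. Everything in it is a stand-in: the two regular expressions are hypothetical (the strict one is not the actual pattern from rfc3986 0.2.0, it merely shows the same symptom), and ConfigFileValueError, uri_type and load_option are local imitations of the oslo.config pieces quoted above, not the library's API.

```python
import re

# HYPOTHETICAL too-strict pattern: it forgets "-" in the host.
# NOT the real rfc3986 0.2.0 regex, only a stand-in with the same symptom.
STRICT_URI = re.compile(r"^[a-z]+://[A-Za-z0-9.]+(:\d+)?(/.*)?$")
# HYPOTHETICAL stand-in for a fixed validator that allows "-" in the host.
FIXED_URI = re.compile(r"^[a-z]+://[A-Za-z0-9.\-]+(:\d+)?(/.*)?$")


class ConfigFileValueError(Exception):
    """Local stand-in for the oslo.config exception of the same name."""


def uri_type(value, pattern):
    # Same shape as URI.__call__ above: reject the value with ValueError.
    if not pattern.match(value):
        raise ValueError('invalid URI: %r' % value)
    return value


def load_option(name, raw, pattern):
    # Same shape as the loader above: ValueError becomes ConfigFileValueError.
    try:
        return uri_type(raw, pattern)
    except ValueError as ve:
        raise ConfigFileValueError(
            "Value for option %s is not valid: %s" % (name, str(ve)))


if __name__ == "__main__":
    url = 'http://neutron-server.example.com:21345'
    print(load_option('url', url, FIXED_URI))
    try:
        load_option('url', url, STRICT_URI)
    except ConfigFileValueError as exc:
        print(exc)
```

The point of the sketch is that the application code is identical in both cases; only the validator supplied by the dependency differs.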

Oops. That is broken. And sure enough, there is an issue for it: github.com/sigmavirus24/rfc3986/issues/11
The bug was fixed long ago, in version 0.2.2. And on our host:

    apt-cache policy python-rfc3986
    python-rfc3986:
      Installed: 0.2.0-2
      Candidate: 0.2.0-2
      Version table:
     *** 0.2.0-2 500
            500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages
            100 /var/lib/dpkg/status

The more recent zesty release, however, ships version 0.3.1-2, which does not have this problem.
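A check like this is easy to script. The sketch below compares a package version against the release that contains the fix (0.2.2, per the issue mentioned above). The function names are made up for illustration, and the parsing assumes the plain "X.Y.Z-REV" form seen in the apt-cache output; real Debian version comparison (epochs, tildes, letters) is more involved.

```python
FIXED_IN = (0, 2, 2)  # upstream release containing the fix


def upstream_version(deb_version):
    """Return the upstream part of a simple Debian version as a tuple.

    Assumes the plain "X.Y.Z-REV" form (no epoch, no suffixes).
    """
    upstream = deb_version.rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def is_affected(deb_version):
    """True if the packaged version predates the fix (hypothetical helper)."""
    return upstream_version(deb_version) < FIXED_IN


if __name__ == "__main__":
    for version in ("0.2.0-2", "0.3.1-2"):
        print(version, "affected" if is_affected(version) else "ok")
```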

Further investigation


A long time ago, a bug was introduced. It lived for a while, then it was fixed. But in the meantime the code containing the bug had been packaged, nobody noticed the fix, and the buggy version sat in the deb repository for years. It bothered no one, until two commits landed in oslo.config and nova:

    commit 45ee2bed52a57b9801435b43ad45d8f50204580d
    Author: Masaki Matsushita <glass.saga@gmail.com>
    Date:   Mon Sep 28 20:28:28 2015 +0900

        Add URIOpt

        This change add URIOpt which validates string as URI.

        Closes-Bug: #1500398
        Change-Id: Ie8736b8654b9feb2a2b174159f08dbea03568d84

    commit 6091de77eda12286786e28ae4f0779e7efc54634
    Author: Maciej Szankin <maciej.szankin@intel.com>
    Date:   Thu Jul 28 10:30:59 2016 -0500

        Improve consistency in VNC opts

        * Updated header flags
        * Moved all vars to list
        * Removed possible values and related options sections where they
          were not needed
        * Changed IntOpt to PortOpt where needed

        Change-Id: I3255a867091f8e14c907c7fde9a2aa3abc249ae9
        Implements: Blueprint centralize-config-options-newton

The nova commit changed the option from StrOpt to URIOpt, and nova thereby started using python-rfc3986 (via oslo.config). Because an old version of python-rfc3986 had been packaged, an unexpected regression appeared in the software.

Bonus: how we will fix it


Usually in such cases, if upgrading to a newer version is easy (and causes no other problems), we simply take the package from a newer release of the distribution (in this case zesty, aka Ubuntu 17.04). We put it as is into our private repository, which runs on aptly, and use it when installing and configuring servers. If no such package existed, we would set up a CI job to build it and publish it to the aptly repository. If that option were not available either (for example, because of incompatible changes), we would add one more patch to our patch queue for nova, replacing URIOpt with StrOpt. That implies rebuilding nova from the Ubuntu package with our own patches. This is done by CI, which publishes the packages to that same private repository.

A bit of flame


And how would this problem be solved in a proprietary environment? Everyone makes mistakes (otherwise we would have software without bugs). The report would land at first-level support; after they had checked the installed versions, updates and contracts, it would move to second-level support, then to the third, and so on until it reached someone with real qualifications who can read the code. That person would find and fix the problem. What is the estimate for such a fix? Two hours at the first level, another hour at the second, a business day to investigate the problem, another business day for the fix, maybe one more day for release and testing. And that is the ideal scenario. In practice, my most optimistic estimates run to weeks, sliding into “fixed in the next release, in half a year”.

How long did it take me, in an open-source project, to sort out the problem on my own? At about 14:30 today the problem surfaced, and I filed it on Launchpad. By 15:20 the dependency problem was known, and by 15:30 it was verified that the problem does not occur with the new version of python-rfc3986. At 16:50 (Cyprus time) I am finishing this post on Habr.

Source: https://habr.com/ru/post/318982/

