How we beat the twilight zone between testing and production

Some time ago, we at HeadHunter discovered a “twilight zone” in moving a new version of the site from testing to production. Not paying enough attention to the differences between the test and production infrastructure periodically brought the site down.

Leaving the twilight

The old test benches differed noticeably in their internal structure from the production cluster. The init scripts that launched services were different, and the configuration files differed in both location and content. Services interacted with each other without taking the specifics of the production environment into account.
I will walk through the reasoning behind our solution, which let us reach a qualitatively new level of testing.

This article follows up on my talk at SQA Days-18.

Configuration files

The first thing we noticed was that the configuration files on the test benches differed from those in production. In our case they were even different files in different locations. As for their content, they were written by different people and in different ways. What does this mean? Each package brought to the production servers a default settings file that lived in /var/lib/... or /usr/shared/... These defaults are the settings that are convenient in the test environment. The settings that had to be overridden for production lived in a configuration file in /etc/package_name. System administrators wrote those settings when they received the corresponding task in Jira.

Did this scheme guarantee that the config files in production were set up correctly? Absolutely not. The settings files used in the production environment were never tested! So testers and developers were frequent guests in the operations team, asking: “This week we asked you to change this setting. Did you write it in? Show us what you ended up with.”
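The gap described above can be made concrete with a small sketch. It assumes, purely for illustration, that settings are flat key/value pairs; the function name and the sample keys are hypothetical and are not taken from our packages. It lists the settings whose production values never ran on the stand, because the stand used the packaged defaults only.

```python
_MISSING = object()  # sentinel: the key has no packaged default at all


def untested_settings(package_defaults, etc_overrides):
    """Return overrides whose production value differs from the packaged default.

    The old stand ran with the defaults only, so every entry returned here is
    a value that reached production without ever being tested.
    """
    return {
        key: value
        for key, value in etc_overrides.items()
        if package_defaults.get(key, _MISSING) != value
    }


# Hypothetical sample data, not real settings.
defaults = {"db_host": "localhost", "timeout_ms": 500, "workers": 2}
overrides = {"db_host": "db.internal", "workers": 2, "sms_gateway": "sms.internal"}
print(untested_settings(defaults, overrides))
```

Anything this function returns is exactly the kind of setting testers used to ask the sysadmins about.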

Why did this situation arise? Because on the test benches, as we were used to them, all services lived on the same server, in the same file system. If the settings of one of them had to be changed in the /etc/default/jetty file, the settings of another had to be changed in some other way. To handle this, special init scripts were written for the test bench to manage the different services. Those same init scripts also pointed to the config files, which sat in non-standard places.

Resolving the conflict

How do we resolve this conflict? We decided that we needed to isolate the services from each other at the file system level. After all, in production each service runs on a separate physical or virtual server.

Perhaps chroot can solve the problem of isolating services on a test bench at the file system level. Then each service gets its own /etc folder, with its configuration files in the same place as on the production servers. The log files will also be in the same folders as in production. This brings us closer to solving the main problem: how to use the same config files on the test bench as in the production environment.
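To illustrate the idea, here is a minimal sketch of how the same in-service path ends up in different places on the host once each service has its own root directory. The directory layout under /srv/stand and the service names are assumptions made up for this example; the sketch only computes paths and does not call chroot itself.

```python
from pathlib import PurePosixPath


def path_inside_chroot(chroot_root, absolute_path):
    """Map a path as the service sees it to its real location on the host."""
    path = PurePosixPath(absolute_path)
    if not path.is_absolute():
        raise ValueError("expected an absolute path, got %r" % absolute_path)
    # Drop the leading "/" and re-anchor the rest under the service's root.
    return str(PurePosixPath(chroot_root).joinpath(*path.parts[1:]))


# Hypothetical roots: both services see /etc/default/jetty,
# but on the host these are two different files.
for root in ("/srv/stand/service-a", "/srv/stand/service-b"):
    print(path_inside_chroot(root, "/etc/default/jetty"))
```

Because each service sees the standard path, its config file can stay byte-for-byte the same as in production.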

Is file system isolation enough?

On the old test benches, all services listened on localhost, each on its own port, and talked to each other through localhost. On the real site, however, each service lives on its own server. Moreover, each service runs in two or more instances. This is needed, first, for load distribution and horizontal scalability and, second, for reliability. When one of the servers has to be stopped for maintenance, the others take over its share of the work.

Thus, to distribute requests across the multiple servers that back a single service, you need an internal load balancer.
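What such a balancer has to do can be shown with a simplified sketch: hand out instances in turn and skip the ones that are stopped for maintenance. This is an illustration of the principle only, not the algorithm nginx actually uses; the class and the instance names are hypothetical.

```python
class RoundRobinBalancer:
    """Hands out backends in turn, skipping those marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._next = 0

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend not in self.down:
                return backend
        raise RuntimeError("no backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")  # stopped for maintenance
print([balancer.pick() for _ in range(4)])
```

With one instance down, the remaining ones absorb its share of the requests, which is the behaviour the old localhost-only stands could not reproduce.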

A win for the sysadmins?

And here we saw another item for our to-do list. The balancer configuration was something we had never tested before! The risk of writing the configuration for nginx, which acts as the balancer, therefore rested entirely with the sysadmins. They had no way to check that it worked correctly anywhere except on the live site. And sometimes the experiments did not end well...

Interesting. It turns out that system administrators also stand to benefit from the new test bench scheme. Perhaps we can get them involved in the new test bench project.

Balancer

So the new stands will have a load balancer. That means the nginx configuration has to list the IP addresses and ports of the servers in each upstream. It may really be more convenient to give each service its own server to run on.
So, in addition to isolating services at the file system level, we added isolation by IP address. On top of that, auxiliary services such as rsyslog will be able to work the same way as on the real site. But only if the config files of each service are the same as in production.
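As a rough illustration of why separate IP addresses help, the sketch below renders an nginx upstream block from a list of servers. The upstream and server directives are standard nginx; the upstream name, the addresses and the helper function are assumptions invented for this example and are not our real configuration.

```python
def render_upstream(name, servers):
    """Render an nginx upstream block for the given (ip, port) pairs."""
    lines = ["upstream %s {" % name]
    for ip, port in servers:
        lines.append("    server %s:%d;" % (ip, port))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical containers on a stand, each with its own IP address.
print(render_upstream("search_backend", [("10.0.3.11", 8080), ("10.0.3.12", 8080)]))
```

When every instance has its own address, the stand's upstream block has the same shape as the production one and differs only in the values.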

And this is the third item on our to-do list. How do we make sure that the same configuration files, with the same content, are used both on the test benches and on the production servers?

How to achieve the same configs?
If we have already decided that each service runs on a separate server, then maybe we can use the deploy scripts that the system administrators already have, and use them to lay out the same configuration files on the test bench as on the main site? Yes, we can, provided that the passwords for payment systems and SMS mailing services remain secret.

Then, finally, the deploy scripts and configs can be stored on GitHub, because the secret data has been stripped out of them. Developers and testers will stop visiting the system administrators to ask how a setting is actually configured on the site.

Configs on GitHub

To hide passwords in configuration files, we use variables. The config files themselves became Jinja2 templates for Ansible, on which we built our deployment system. Ansible lets us have two sets of variables, that is, two pairs of group_vars and host_vars folders where the variable values are defined. One set lives in the playbooks folder, the other in the folder with the inventory file. One of these sets always takes precedence over the other.
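The effect of having two variable sets can be sketched in a few lines. This is a plain-Python model of the idea, not Ansible's own code; which of the two sets wins is left as a parameter here, and the variable names and values are hypothetical.

```python
def resolve_variables(lower_priority, higher_priority):
    """Merge two variable sets; on a clash the higher-priority value wins."""
    resolved = dict(lower_priority)
    resolved.update(higher_priority)
    return resolved


# Hypothetical values: a shared set plus a smaller set that overrides part of it.
shared_set = {"jetty_threads": 50, "request_timeout_s": 5}
overriding_set = {"jetty_threads": 4}
print(resolve_variables(shared_set, overriding_set))
```

Only the values that really differ between environments need to appear in the overriding set; everything else is inherited unchanged.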

Thus, we put on GitHub not only the deploy scripts and config files, but also one set of variable values, as long as they are not secret. These can be memory limits for applications, the number of threads or forked processes, and timeouts, that is, the values that differ between production and the test bench.

Keep secrets

Secret values, such as passwords for payment systems and SMS mailing services, as well as database passwords, live in the system administrators' private repository and are not available to testers.

On the test bench, instead of the private variable set from the operations team, testers use their own values from their own repository. There they define the passwords specific to the test bench, along with the memory limits, thread and process counts, and timeouts specific to the test environment.
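Here is a minimal sketch of one template being filled in from the testers' own values. To stay self-contained it uses string.Template from the Python standard library as a stand-in for Jinja2, so the placeholder syntax ($name) differs from real Jinja2 templates; the variable names and values are hypothetical.

```python
from string import Template

# One shared template; only the variable values differ per environment.
CONFIG_TEMPLATE = Template(
    "db_password = $db_password\n"
    "max_heap_mb = $max_heap_mb\n"
)


def render_config(variables):
    """Fill in the template; raises KeyError if a variable is not defined."""
    return CONFIG_TEMPLATE.substitute(variables)


# Values a test stand might define in the testers' repository (made up).
stand_values = {"db_password": "stand-only-password", "max_heap_mb": 512}
print(render_config(stand_values))
```

A missing variable fails loudly at render time, so a value that was forgotten in the testers' repository shows up on the stand rather than on the live site.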

A separate server for each stand is wasteful
From the above, we can see which requirements the individual servers for the services on the test bench must meet:

file system isolation;
a separate IP address;
an init system to run our packages and the related ones;
an OpenSSH server (sshd) for Ansible.

Linux containers (LXC) can satisfy these requirements. An additional advantage is that they share the host's memory, which is allocated as the applications inside the containers request it. Unlike virtual machines, they do not bite off a large chunk of RAM up front. This saves us memory.

What are the benefits of our decision?

Because we

allocated a pseudo-server to each service on the test bench;
use the same deploy scripts and the same settings files as in production;
reproduced the internal balancer in a Linux container inside the stand;
must first build a package from the Git branch in order to roll it out to a pseudo-server using the deploy scripts,

we achieved several results.

First, we moved from testing a code branch to testing a package for our distribution. Now we can be sure that in production the package will install on a clean server and create all the folders it needs, with sufficient permissions for it to work.

Second, we began testing the very config files that will be used on the real site. We thereby eliminated the human factor in writing configs, which could bring the site down. The differences between the stand and the production environment were moved into variables.

Third, we started testing the balancer configuration. Thus, while a task is still being prepared, we check how services interact with each other in the same infrastructure that runs in production.

And fourth, we can now run two or more instances of each service. We can not only debug retries in nginx, but also test how the site behaves while a new version is being released.

Imagine: now you can run autotests at the very moment we simulate the release of a new version of the site! The goal is stable operation of the stand while one server is already running the new software version, the second is stopped for its update, and the third is still responding with the old version. Now that is a worthy task!
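The release scenario can be modelled with a short sketch. It assumes, for illustration only, that instances are updated strictly one at a time; the function and the version labels are hypothetical and are not our release tooling. Each snapshot records what every instance is serving at that moment, with None for an instance that is stopped.

```python
def rolling_release(instance_count, old="v1", new="v2"):
    """Simulate a one-at-a-time update and return the state after each step."""
    versions = [old] * instance_count
    running = [True] * instance_count
    snapshots = []
    for index in range(instance_count):
        running[index] = False  # stop this instance for its update
        snapshots.append(
            [version if up else None for version, up in zip(versions, running)]
        )
        versions[index] = new
        running[index] = True  # bring it back on the new version
    snapshots.append(list(versions))  # final state: everything updated
    return snapshots


for state in rolling_release(3):
    print(state)
```

The second snapshot for three instances is the mixed state described above: one instance on the new version, one stopped, one still on the old version. Autotests running against the stand at that moment exercise both the balancer and the compatibility of the two versions.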

Result

To summarize: re-engineering the testing process and refactoring the test benches gave an excellent result. For two years, the site's uptime has exceeded 99.9%. That is a great figure if you count it in minutes: less than 43 minutes of downtime per month. At the same time, we tightened the definition of downtime threefold, from 60 responses with a 500 error per second to 20.
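The minutes figure follows from simple arithmetic, sketched below for a 30-day month (the month length is an assumption of this example). At exactly 99.9% uptime the budget comes to about 43.2 minutes, so uptime above 99.9% keeps downtime at roughly 43 minutes or less.

```python
def downtime_budget_minutes(uptime_percent, days=30):
    """Minutes of allowed downtime over the period for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


print(round(downtime_budget_minutes(99.9), 1))
```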

And for an internet business such as HeadHunter, better uptime means real money saved. Add to that the clients hh.ru has attracted thanks to the site running more stably than before. What do you think: is site uptime a key success factor for your business?

So, four simple steps:

allocate a separate server on the stand to each service;
test the package, not the branch with the code;
use the deploy scripts from operations and test the config files;
repeat the balancer and test the release process (the irreality has already been told about how to do this).

The victory of light is inevitable!

Source: https://habr.com/ru/post/272067/
