
Accelerate Ansible

Under the hood at d2c.io we use Ansible. With it we create virtual machines at cloud providers, install software, and manage Docker containers with client applications.


Ansible is a handy tool that is ready for use with almost no configuration. This is possible because it is agentless: there is no need to pre-install anything on the hosts being managed.


In most cases, SSH is used to connect to hosts. The flip side, however, is a certain slowness: all the logic lives on the management server, and Ansible builds each task locally, sends it for execution over an SSH connection, takes the result, analyzes it, and only then proceeds to the next step. In this article we will discuss how to speed up Ansible.


Let's start with the measurements


You cannot improve what you cannot measure, so we will write a small script to calculate the execution time.


First, create a test playbook:


test.yml
---
- hosts: all
  # gather_facts: no
  tasks:
    - name: Create directory
      file:
        path: /tmp/ansible_speed
        state: directory

    - name: Create file
      copy:
        content: SPEED
        dest: /tmp/ansible_speed/speed

    - name: Remove directory
      file:
        path: /tmp/ansible_speed
        state: absent

And now we will write a script to calculate the execution time:


time_test.sh
#!/bin/bash
# calculate the mean average of wall clock time from multiple /usr/bin/time results.
# credits to https://stackoverflow.com/a/8216082/2795592

cat /dev/null > time.log

for i in `seq 1 10`; do
    echo "Iteration $i: $@"
    /usr/bin/time -p -a -o time.log $@
    rm -rf /home/ubuntu/.ansible/cp/*
done

file=time.log
cnt=0

if [ ${#file} -lt 1 ]; then
    echo "you must specify a file containing output of /usr/bin/time results"
    exit 1
elif [ ${#file} -gt 1 ]; then
    samples=(`grep --color=never real ${file} | awk '{print $2}' | cut -dm -f2 | cut -ds -f1`)
    for sample in `grep --color=never real ${file} | awk '{print $2}' | cut -dm -f2 | cut -ds -f1`; do
        cnt=$(echo ${cnt}+${sample} | bc -l)
    done
    # Calculate the 'Mean' average (sum / samples).
    mean_avg=$(echo ${cnt}/${#samples[@]} | bc -l)
    mean_avg=$(echo ${mean_avg} | cut -b1-6)
    printf "\tSamples:\t%s \n\tMean Avg:\t%s\n\n" ${#samples[@]} ${mean_avg}
    grep --color=never real ${file}
fi

Run our test playbook 10 times and take the average, for example: ./time_test.sh ansible-playbook -i hosts test.yml (the inventory file name here is illustrative).


SSH multiplexing


In the local network: it was 7.68 s, it became 2.38 s
On remote hosts: it was 26.64 s, it became 10.85 s

The first thing to check is whether the reuse of SSH connections is working. Since Ansible performs all actions via SSH, any delay in establishing a connection slows down the execution of the playbook as a whole. By default, this setting is enabled in Ansible. In the configuration file, it looks like this:


 [ssh_connection] ssh_args = -o ControlMaster=auto -o ControlPersist=60s 

However, be careful: if for some reason you override the ssh_args parameter, you must explicitly specify values for ControlMaster and ControlPersist, otherwise Ansible will "forget" about them.
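For example, if you add your own SSH options, keep the multiplexing ones alongside them. A sketch of such an ansible.cfg fragment (the extra ServerAliveInterval option is only an illustration):

```ini
[ssh_connection]
; keep multiplexing enabled even when overriding ssh_args
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30
```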


You can check whether SSH connection reuse is working in your case by running Ansible with the -vvvv parameter:


 ansible test -vvvv -m ping 

In the output, we should see the launch of SSH with the necessary parameters:


 SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s ... -o ControlPath=/home/ubuntu/.ansible/cp/7c223265ce 

And further among the set of debug information:


Trying existing master
Control socket "/home/ubuntu/.ansible/cp/7c223265ce" does not exist
setting up multiplex master socket

You can also verify that, within 60 seconds after the task completes, the open socket is visible as a file (in our example, /home/ubuntu/.ansible/cp/7c223265ce).
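A quick way to check from the shell whether a master connection is still alive behind a given socket: OpenSSH's ssh -O check asks the master process to report its status. A sketch, using the socket path from the example above (the host name is a placeholder):

```shell
# check a multiplexing master socket; path taken from the example above
sock=/home/ubuntu/.ansible/cp/7c223265ce
if [ -S "$sock" ]; then
    # ssh -O check asks the master behind the socket whether it is running
    ssh -O check -o ControlPath="$sock" placeholder-host
else
    echo "no master socket at $sock"
fi
```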


Attention: if you work with several identical environments from one management machine (blue/green deployment, stage/prod), make sure not to shoot yourself in the foot! If, for example, you first pull the current settings from production, and in the next step want to update the test environment with a new version of the components while the SSH sockets to production are still open, then oh... So you need to either make ControlPath different for these environments, or force the master sessions to close (or remove the sockets) before starting work with another environment.
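One way to force-close leftover master sessions before switching environments, sketched below. The ControlPath directory is Ansible's default; the host argument to ssh -O exit is a placeholder, adjust both to your setup:

```shell
# ask every live master under the default ControlPath directory to exit,
# then remove any stale socket files
cp_dir="$HOME/.ansible/cp"
for sock in "$cp_dir"/*; do
    [ -S "$sock" ] && ssh -O exit -o ControlPath="$sock" placeholder-host
done
rm -f "$cp_dir"/*
echo "control sockets cleared"
```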


Pipelining


In the local network: it was 2.38 s, it became 1.96 s
On remote hosts: it was 10.85 s, it became 5.23 s

By default, Ansible runs modules on target hosts as follows:

- it opens an SSH connection and creates a temporary directory on the target host;
- over another SSH call it prepares that directory;
- it copies the generated module code and its parameters there via SFTP;
- over one more SSH call it runs the module with the remote Python interpreter and removes the temporary directory.

Given that this entire sequence is executed for every task, the overhead is significant. To speed this up, Ansible has a pipelining mode, by analogy with piping I/O between commands in Linux. In the configuration file, the setting looks like this:


 [ssh_connection] pipelining = true 

When pipelining is enabled, Ansible opens a single SSH connection, starts the remote Python interpreter, and pipes the module code and its arguments to it directly via stdin, without creating temporary files on the target host.

Total: where there were 4 SSH connections and a few extra commands, there is now one. The speedup is obvious, especially for remote servers over a WAN.


To check whether pipelining works in your case, run Ansible with verbose logging, for example:


 ansible test -vvv -m ping 

If there are several ssh calls in the output:


SSH: EXEC ssh ...
SSH: EXEC ssh ...
SSH: EXEC sftp ...
SSH: EXEC ssh ... python ... ping.py

then pipelining is not working. If there is only one ssh call:


 SSH: EXEC ssh ... python && sleep 0 

then pipelining is working.


Important: this setting is disabled in Ansible by default, as it may conflict with requiretty in the sudo settings. At the time of writing, the latest Ubuntu and RHEL images on Amazon EC2 have requiretty disabled, so pipelining can be used safely there.
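If your images still have requiretty enabled, it can be relaxed for the Ansible user only. An illustrative sudoers drop-in (the user name "ansible" is an assumption; edit such files with visudo -f):

```
# /etc/sudoers.d/99-ansible
Defaults:ansible !requiretty
```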


PreferredAuthentications and UseDNS


In the local network: it was 1.96 s, it became 1.92 s
On remote hosts: it was 5.23 s, it became 4.92 s

If you manage dozens of target machines, any little thing affecting connection speed matters. Reverse DNS queries on the server and client side can be one of those little things, especially if you use a non-local DNS server.


UseDNS


UseDNS is an SSH server setting (in the /etc/ssh/sshd_config file) that makes the server look up the PTR record for the client's IP address. Fortunately, it is disabled by default in modern distributions. But just in case, it is worth checking on your target hosts if the initial SSH connection seems slow.
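To pin it explicitly, a one-line sshd_config fragment (restart sshd after editing):

```
# /etc/ssh/sshd_config — do not reverse-resolve client addresses on connect
UseDNS no
```

On the host you can confirm the effective value with sshd -T | grep -i usedns (requires root).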


PreferredAuthentications


This is an SSH client setting that informs the server about the authentication methods that the client is ready to use. By default, Ansible uses


 -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 

And if GSSAPIAuthentication is enabled on the servers, as in the RHEL image on Amazon EC2, this method will be tried first. That leads to unnecessary round trips and to PTR record lookups by the client.


Most often we only need public key authentication, so we specify it explicitly in ansible.cfg:


 [ssh_connection] ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey 

This setting will save the client from unnecessary negotiations with the server and speed up the installation of the master session.


Collecting facts


In the local network: it was 1.96 s, it became 1.47 s
On remote hosts: it was 4.92 s, it became 4.77 s

When executing playbooks (this does not apply to ad-hoc commands), Ansible by default collects facts about the remote system. This step is equivalent to running the setup module and requires a separate SSH connection. In our test playbook we do not use a single fact, so we can skip this step with a playbook parameter:


 gather_facts: no 

If you often run playbooks that use facts, but fact gathering noticeably affects the speed at which playbooks execute, consider using an external fact caching system (this is partly described in the Ansible documentation). For example, set up Redis, refresh it with fresh facts once an hour, and disable forced fact gathering in all working playbooks; facts will then be taken from the local cache.
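A sketch of a Redis-backed fact cache in ansible.cfg; the one-hour timeout matches the example above, and the host/port/db values are assumptions for a local Redis:

```ini
[defaults]
; "smart" gathers facts only for hosts without valid cached facts
gathering = smart
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 3600
```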


WAN → LAN


It was 4.77 s, it became 1.47 s

Despite fast communication channels, everything usually works faster on a local network than over the Internet. So if you often run playbooks on many servers, say on Amazon EC2 in the eu-west-1 region, it makes sense to place the management server at the same site. Depending on the usage scenario, moving the management server closer to the target servers can significantly speed up playbook execution.


Pull mode


It was 1.47 s, it became 1.25 s

If you need even more speed, you can run the playbook locally on the target server. For this there is the ansible-pull utility; you can also read about it in the official Ansible documentation. It works as follows: the target host itself pulls a repository with playbooks from a version control system and then runs ansible-playbook locally against itself.

One of the use cases is to poll the repository on a schedule and execute the playbook only if something has changed (the --only-if-changed parameter).
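For example, a crontab entry along these lines (the repository URL and playbook name are hypothetical):

```
# poll every 15 minutes, run the playbook only if the repository changed
*/15 * * * * ansible-pull --only-if-changed -U https://example.com/infra.git local.yml
```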


Fork


In the previous sections we talked about speeding up playbook execution on each host individually. But if you run a playbook against dozens of servers at once, the bottleneck may be the number of forks: the separate parallel processes performing tasks on different servers. In the configuration file this setting looks like this:


 [defaults] forks = 20 

The default value of forks is 5, which means Ansible communicates with no more than five hosts simultaneously. Often the CPU of the management server and the communication channel allow servicing many more hosts at once; the optimal number is found experimentally in each particular environment.


Another point about concurrency: if the number of forks is small and there are many servers, master SSH sessions can expire, forcing a full SSH session to be re-established. This happens because Ansible defaults to the linear execution strategy: it waits until a task has completed on all servers and only then proceeds to the next one. The "first" server may finish its task quickly and then wait for all the others for so long that the ControlPersist time runs out.
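Besides raising forks, one way to avoid this waiting (a sketch, not mentioned in the measurements above) is the free execution strategy, set at the playbook level, which lets each host move on to its next task without waiting for the others:

```yaml
- hosts: all
  strategy: free   # each host proceeds independently, no per-task barrier
  tasks:
    - name: Example task
      ping:
```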


Poll interval


After a module is launched on the target server, the Ansible process on the management machine goes into a polling loop, waiting for the module's results. How aggressively it polls affects CPU load on the management machine, and we need that CPU time to increase the number of parallel processes (see the previous section). The interval between these internal polls is specified in seconds in the settings file:


 [defaults] internal_poll_interval = 0.001 

If you run long tasks on a large number of hosts, whose results there is no point in checking so often, or there is little free CPU on the management machine, you can poll less frequently by increasing the interval, for example, to 0.05.


-


P.S. Those are probably the highlights. Further speed gains must be sought in optimizing the playbooks themselves, but that is a completely different story.



Source: https://habr.com/ru/post/343368/

