We put Selenium Grid on Apache Mesos wheels

Hi, Habr! My name is Nastya, and I don't like queues. So, using Alpha Laboratories and our research as an example, I will tell you how to organize the infrastructure and architecture for running tests so that results arrive several times faster. For instance, we brought the total test time for an application down to 5 minutes. To get there, we had to change our approach to launching Selenium Grid.

Before I get to Selenium Grid itself and everything related to it, I want to explain the problem we were trying to solve.
Last year we introduced DevOps as a process. At some point, while automating anything and everything, we realized that the time to market for each artifact at the testing stage should not exceed 30 minutes. Conceptually, we wanted some releases to pass automatically if they do not need acceptance testing. For artifacts that have to be checked by hand, 30 minutes is the time in which the tester receives the results of the autotest run, analyzes them, and performs acceptance testing. The autotests themselves should be launched automatically within our pipeline.

To achieve this, we needed to speed up the autotest runs. But beyond speeding them up, we had to make sure that, with so many projects, no queue formed to launch them.

The task of speeding up autotest runs is usually solved in one of two ways:

The approach of the rich: throw money at the problem by buying more hardware and cloud capacity and hiring new people.
The approach for commoners: solve the problem with engineering.

At our company we follow the second approach, though not because we have no money. I am an engineer and, like many engineers, lazy in such matters. So I decided to take the harder and more interesting path, and to save the bank that bag of money along the way.

So the goal is clear: speed up the autotests and eliminate the queues to run them, without any additional funding.

At the very beginning we had a rather small fleet of 15 virtual machines.

The average machine configuration was 4 GB RAM / 2 cores / 50 GB HDD.
One machine could run no more than 2 test threads at a time without losing speed, i.e. no more than 2 browser sessions. Any more, and the tests slowed down.
All the machines ran Windows, which also imposed certain restrictions on us (for example, we did not test cross-browser compatibility).
The machines also sat in different subnets of the bank (different data centers), so managing their configurations was extremely difficult: re-creating and administering them was in the hands of the system administrators.

In total, we have about 20 projects with autotests, which are launched at different times and with different frequencies.

Our teams:

some want to release 3 to 5 times a day;
others release no more than once every 1-2 weeks.

All of them focus on delivering value to the customer quickly, and of course no one wants to “hang” in a queue waiting to run autotests.

Resources were sorely lacking. Why? Let's look at a specific example:

Take a project with about 30 tests (an average figure).
Run in a single thread, they take at least 30 minutes.
Our goal is to fit into 10 minutes, which means parallelizing the test run across several browsers and, accordingly, several machines.
So we run these tests in at least 3 parallel threads. In practice, each project generates 5 to 10 threads.
Now recall our 20 projects. If everyone wants to run autotests at the same time, at least 60 test sessions must be started to avoid a queue.
40 of them will still start, given 2 sessions per virtual machine.
The rest will wait in the queue for at least 10 minutes.
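
The first part of this arithmetic fits into a few lines of shell. This is only a sketch under the simplifying assumption that every test takes about one minute (30 tests in one thread is roughly 30 minutes); the function names are invented for illustration.

```shell
# Smallest number of threads that fits the time budget (ceiling division).
# Assumes all tests take the same time and are spread evenly over threads.
threads_needed() {
  tests=$1; minutes_per_test=$2; budget_minutes=$3
  total=$((tests * minutes_per_test))
  echo $(( (total + budget_minutes - 1) / budget_minutes ))
}

# Sessions that must be available if all projects start at the same time.
sessions_needed() {
  projects=$1; threads_per_project=$2
  echo $((projects * threads_per_project))
}

threads=$(threads_needed 30 1 10)          # 30 tests, ~1 min each, 10 min budget
sessions=$(sessions_needed 20 "$threads")  # 20 projects
echo "threads per project: $threads"
echo "sessions for 20 projects: $sessions"
```

With 5 to 10 threads per project, as happens in practice, the same formula gives 100 to 200 sessions.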

Note that this is a very optimistic case: few tests per project and only 3 threads. There is not enough hardware, so we need to think about how to lighten the load on the virtual machines. What if we move from virtual machines to Docker containers?

We did the math:

Let's take our 15 machines and build a single space out of them, in which we will create the Docker containers that run our tests.
15 virtual machines add up to 60 GB RAM, 30 cores and 750 GB HDD, spread over three data centers, so we can build a fault-tolerant space.

Now look at the configuration of one Docker container that can run tests in 1 thread, and compare it with what we had on virtual machines:
500 MB RAM, 0.01% of a core and 400 MB HDD.

It turns out that we can run 120 containers at the same time!
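
Here is roughly how that number comes out. The sketch assumes that 1 GB is counted as 1000 MB, which matches the round figure of 120, and it leaves the CPU share out because it is tiny compared with the other two resources. The helper name fit_by is made up for this example.

```shell
# How many containers fit when limited by a single resource.
# Both arguments must be in the same unit (MB here).
fit_by() {
  echo $(( $1 / $2 ))
}

# Assumption: 1 GB = 1000 MB, to match the article's round numbers.
ram_total_mb=$((15 * 4 * 1000))    # 15 VMs x 4 GB RAM
hdd_total_mb=$((15 * 50 * 1000))   # 15 VMs x 50 GB HDD

by_ram=$(fit_by "$ram_total_mb" 500)   # 500 MB RAM per container
by_hdd=$(fit_by "$hdd_total_mb" 400)   # 400 MB disk per container

# The scarcest resource sets the limit.
if [ "$by_ram" -lt "$by_hdd" ]; then limit=$by_ram; else limit=$by_hdd; fi
echo "by RAM: $by_ram, by HDD: $by_hdd, limit: $limit"
```

RAM is the bottleneck here, so any further growth in the number of sessions would have to come from adding memory to the pool rather than disk.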

This not only covers our need for 60 sessions but also leaves headroom for the future. After all, the number of teams is growing, and so is the number of projects being launched. So it became quite obvious that we should take the resources we have and combine them into a single space of computing power, also called a sandbox. Once they are combined, we don't want to think in terms of individual hosts or virtual machines. We just want a space that we can connect to through some API and in which we can create our own Docker containers to run tests.

Dynamic sandbox

So, we need to create a sandbox of computing resources. It has to be dynamic, i.e. we should be able to connect resources to it and disconnect them at any time. The hosts we connect may have different configurations and sit on different subnets; all that matters to us is that they can communicate with each other over certain IPs and ports. A dynamic sandbox is also called a cloud or a cluster, and it gives us an interface for creating and managing Docker containers.

Once we understood how we wanted to solve the problem, we built our sandbox by combining our hosts into a cluster using Apache Mesos and Marathon.

The result is a common space of computing resources with its own API. Marathon provides the API, and Apache Mesos unites the hosts.

Test orchestrator: Selenium Grid to the rescue

We decided we needed a cluster and even built one. But how are we going to run tests in it? Remember, we still want test results in no more than 10 minutes.

This is where parallelizing the test run comes to our aid.

To solve this problem, we need a centralized tool that can run tests and parallelize them across several threads for each project. There are several popular tools:

Jenkins
Selenium's native orchestrator, Selenium Grid

Although my story is about running Selenium Grid in Docker containers, let's first look at how the grid is set up on virtual machines.

The whole procedure consists of 3 steps:

1. We copy Selenium Standalone Server (the version we need) to some directory.
2. Then we run the command that launches this server in the mode we need: hub or node. Note that the same physical jar file, copied to different hosts, serves both roles.

 $ java -jar selenium-server-standalone.jar -role hub

3. We configure the node: the set of browsers and their parameters is specified either on the command line or in a JSON file.

 $ java \
     -jar selenium-server-standalone.jar \
     -role node \
     -hub http://host1:4444/grid/register

What the hub does after the grid starts

Creates new sessions with nodes;
Queues test requests if all nodes are busy;
Returns an error if it has no nodes at all, or no node with the requested parameters.

What the node does

After we have started the server in node mode on a virtual machine and specified the hub address in the command parameters, the node's first task is to register on the hub, that is, to tell the hub that it belongs to its grid and which browsers and drivers it has.
The registration itself is an HTTP request carrying a JSON payload with all the information about the node.
The node's next task is to execute the requests it receives through the hub once the hub has created a session with it.
By requests I mean the commands sent by our jar file with autotests. An example would be a step like “Find a button with this id on the page”. For the hub to route such a test step, it has to know which test the step belongs to, because it can be handling several tests at once. The step also assumes that whoever executes it has already carried out the preceding commands, for example, has just opened the right page. This is exactly what the unique identifier of the browser session on the node is for: the hub creates it and then uses it to decide which node to forward requests to.
The node simply waits for commands from the hub and executes the HTTP requests that the hub forwards to it.

How is starting the grid in Docker containers different?

1. The node is already configured at the moment it starts.

Let's look inside the node. The JSON config file for the node lives in the container alongside it; it is written to config.json, and our server learns its parameters from this file:

 /opt/selenium/generate_config > /opt/selenium/config.json

Moreover, if we look at the node's Dockerfile, we will see that the node environment is configured through environment variables, which are then written to this config. So we don't need to dig into the “guts” of the container to change the node's launch parameters; we only need to override the values of the variables set in the Dockerfile. And that's all.
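
To make the idea concrete, here is a hypothetical stand-in for such a generate_config script. It is not the script shipped in the image: the variable names, the defaults and the exact set of config keys are assumptions for illustration, and the real node config schema depends on the Selenium version.

```shell
# Hypothetical stand-in for the image's generate_config script.
# Each value comes from an environment variable, with a default
# as if it had been set by ENV in the Dockerfile.
generate_node_config() {
  cat <<EOF
{
  "capabilities": [
    {
      "browserName": "${NODE_BROWSER:-chrome}",
      "maxInstances": ${NODE_MAX_INSTANCES:-1},
      "seleniumProtocol": "WebDriver"
    }
  ],
  "maxSession": ${NODE_MAX_SESSION:-1},
  "port": ${NODE_PORT:-5555},
  "register": true
}
EOF
}

# With the defaults only:
generate_node_config

# With one value overridden, as `docker run -e NODE_MAX_SESSION=2 ...` would do.
# The subshell keeps the override from leaking into the rest of the script.
( NODE_MAX_SESSION=2; generate_node_config )
```

The point of the pattern is that the image stays the same for every project, and only the environment passed at start changes.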

2. When we start a node in a container, we can always be sure that the environment already has a browser and a driver for it, because all of this is installed and configured when the image is built.

 /opt/selenium$ ls
 chromedriver-2.29 selenium-server-standalone.jar config.json

3. There is also an sh script that runs once the container has started, and in it we see that our Java server is launched right away.

 $ java ${JAVA_OPTS} -jar /opt/selenium/selenium-server-standalone.jar \
     -role node \
     -hub http://$HUB_HOST:$HUB_PORT/grid/register \
     -nodeConfig /opt/selenium/config.json \
     ${SE_OPTS} &

The same applies to the hub.

As a result, launching Selenium Grid in a container comes down to a single command: starting the Docker container.

Static grid problem

Although the hub handles queues and timeouts well, at the very beginning of using a static grid we ran into problems caused by timeouts. If the hub and nodes had not been used for a long time, the next connection would sometimes fail: the session being created on the node dropped, either because of timeouts or because RemoteWebDriver could not start the browser. All of these problems were cured by restarting the grid, and that is when we realized that an on-demand Selenium Grid would be the solution for us.

We also didn't want a static grid to simply take up room in a cluster that is already small in our case. And what do we do when different projects need different grid configurations, when one project needs one browser version and another needs a different one? Obviously, keeping all these grids running permanently is not a good idea.

Selenium Grid On-Demand

So we wanted to bring up Selenium Grid on request. Let me explain with an example.

Suppose we want to run tests for a project with 30 tests, split into 3 test suites.
We run a job that first creates a Selenium Grid in the cluster for this run, passing the number of test suites as the number of nodes in the grid. That is, each project gets its own grid configuration.
Once the command that brings up the grid has finished, the tests are run.
After the tests have run, the grid is deleted.
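
The lifecycle described in these steps can be sketched in shell as follows. The three functions are placeholders that only print what they would do (in our case the real work is done by Ansible and Gradle); the names and the wrapper are invented for this illustration.

```shell
# Placeholder steps: each one only prints what it would do.
create_grid() { echo "create grid '$1' with $2 nodes"; }
run_tests()   { echo "run tests against grid '$1'"; }
delete_grid() { echo "delete grid '$1'"; }

# Bring the grid up, run the tests, and always tear the grid down,
# returning the exit status of the test run.
run_with_grid() {
  grid=$1; suites=$2
  create_grid "$grid" "$suites" || return 1
  status=0
  run_tests "$grid" || status=$?
  delete_grid "$grid"
  return "$status"
}

run_with_grid myproject 3
```

Deleting the grid even when the tests fail matters here: a grid left behind after a red build would keep occupying the cluster, which is exactly what the on-demand approach is meant to avoid.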

It looks like an ideal concept. This approach solves two problems at once: grid degradation, and the lack of room in the cluster to keep grids of various configurations.

Automation of the creation of Selenium Grid On-Demand

To solve this problem, we had to write a script that creates the grid automatically. We did it with Ansible, writing the necessary roles. I will not explain what Ansible is, but you could just as well write such a script in bash or another language, as long as it gives you two commands: one to create a grid and one to delete it.

Remember that starting a grid comes down to running a couple of commands, each with its own parameters. To automate the launch, these parameters have to be either calculated automatically before the command runs or hardcoded.

We cannot hardcode them, because we do not know in advance on which host and port the Selenium Grid components will come up: Apache Mesos decides that for us.

Of course, we could cheat and manually track the open ports and hosts where we bring up Selenium Grid, but then why would we need Apache Mesos and Marathon at all if we do everything by hand?

So, we had to automate the calculation of the following parameters:

the number of nodes to bring up;
the address of the hub (the host and port it came up on), which must be passed to the node, otherwise the node cannot register.

The Marathon API helped here: through it we found out which host and port the hub had come up on, and passed this value to the node before starting it. So, here is what we have:

Deploy Selenium Grid

 $ ansible-playbook -i inventory play-site.yml \
     -e test_id=mytest \
     -e nodes_type=chrome \
     -e nodes_count=4

test_id:
nodes_count:
nodes_type: [chrome|firefox]

Delete Selenium Grid

 $ ansible-playbook -i inventory play-site.yml \
     -e test_id=mytest \
     -e clean=true

The shell script executed on Jenkins calculates the variable values automatically and passes them to the Ansible playbook. The test run is built into the pipeline using Job DSL.

 # Name of the grid for this run
 export grid_name=testproject
 # Number of nodes = number of feature files found in the test sources
 export nodes_count=$(find tests -name "*feature" \
     | grep -v build | grep -v classes | grep features | wc -l)

 # Bring up the grid in the cluster
 cd ansible
 ansible-playbook -i inventory play-site.yml \
     -e test_id=$grid_name \
     -e nodes_type=chrome \
     -e nodes_count=$nodes_count

 # Read the hub address from hub.url
 export hub_url=$(cat hub.url)
 currentdir=$(pwd)

 # Run the tests against the newly created grid
 cd ../tests
 ./gradlew clean generateCucumberReport \
     -i -Pbrowser=$browser -PremoteHub=$hub_url
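
For those wondering how a hub address can be obtained from the Marathon API, here is a rough sketch. Marathon's task listing for an app includes the host and the allocated ports of each task; the sample response below is made up, and the sed patterns assume exactly this compact single-line form. A real script would more likely use a proper JSON parser, and this is not the code from our Ansible roles.

```shell
# Extract the first task's host and first port from a Marathon
# "tasks" response read from standard input.
hub_address() {
  json=$(cat)
  host=$(printf '%s\n' "$json" \
    | sed -n 's/.*"host"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
  port=$(printf '%s\n' "$json" \
    | sed -n 's/.*"ports"[[:space:]]*:[[:space:]]*\[[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
  echo "http://$host:$port"
}

# Made-up sample of what the task listing for the hub app might contain.
sample='{"tasks":[{"id":"grid-mytest-hub.1","host":"192.168.1.5","ports":[20345]}]}'
printf '%s\n' "$sample" | hub_address
```

In a live setup the input would come from a request such as `curl -s "$MARATHON_URL/v2/apps/<app id>/tasks"`, where MARATHON_URL is a placeholder for the address of your Marathon instance.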

As soon as we solved this problem and learned to bring up Selenium Grid in our cluster, we hurried to run the tests, and that is where disappointment awaited us. The tests did not run; worse, the hub could not even create a session with a node.

The problem of raising Selenium Grid On-Demand in a distributed cluster

Let's see what our scripts lacked.

Take another look at the command we would use if we started a grid node in a Docker container by hand each time:

 $ docker run -d -p 6666:5555 selenium/node-chrome

Do you see the two ports? Some of you are probably wondering where the second one came from. In Docker there is an internal port and an external port. The external port is the one the container itself listens on, and the internal port is the one listened on by the Selenium Standalone Server process running in node mode.

In this example, all requests to the container's port 6666 are forwarded to port 5555 of the node inside it.

Running a node in Marathon

When configuring an Apache Mesos cluster, we specify a range of ports for each host. This range is used for the containers started by Marathon.

For example, if we set a range of 20000-21000, our containers will receive a random port from this range.

The Marathon agent runs something like this:

 $ docker run -d -p <?>:5555 selenium/node-chrome

When the container is launched, Marathon picks the next free port and substitutes it for the question mark. So when a node starts in bridge network mode, we get a port mapping:

 $ docker run -d -p 20345:5555 selenium/node-chrome

Marathon starts a container on a random host and a random port.
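
For reference, this is roughly what an app definition for a node might look like in the Marathon versions of that time. It is a sketch, not our actual definition: the app id and the resource figures are illustrative, and the function that prints it is invented for the example. A hostPort of 0 asks Marathon to pick a free port from the range offered by Mesos.

```shell
# Print a Marathon app definition for a group of grid nodes.
# Bridge networking with containerPort 5555 mapped to a port chosen
# by Marathon (hostPort 0).
node_app_json() {
  app_id=$1; instances=$2
  cat <<EOF
{
  "id": "$app_id",
  "instances": $instances,
  "cpus": 0.1,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "selenium/node-chrome",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 5555, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  }
}
EOF
}

node_app_json /grid-mytest/node 4
```

Such a document would be sent to Marathon's apps endpoint, for example with `curl -X POST -H 'Content-Type: application/json' -d @- "$MARATHON_URL/v2/apps"`, where MARATHON_URL is again a placeholder.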

The node sends the wrong coordinates.

Docker containers run in bridge mode by default. What does this mean for us? It means that the node does not see its real IP and port! Suppose Apache Mesos has started a node for us on host 192.168.1.5 and port 20345. The node process inside the container, however, will think that it is running on some 172.17.0.2 and that its port is 5555.

 host = 172.17.0.2
 port = 5555

And that is the return address it registers with on the hub. Naturally, the hub cannot reach the node at this address, so when tests run, the hub cannot start a browser session.

Solving the problem of registering nodes on the hub

But there is also host mode, in which a container uses the host's ports directly and there is no such thing as an internal port.

When we thought about this problem, we naturally wondered: why start the container with a network bridge at all, and why not use host mode? We would specify a single port for the container to come up on, and the Selenium server would listen on it directly.

No such luck. Our tests run in a Docker container, which has no display of its own, and we also need to take screenshots, so we use an Xvfb server, which also occupies a certain port when the container starts. So host mode does not suit us at all, and we have to make bridge mode work somehow.

Container environment variables

When Marathon starts a container, it writes the actual host and ports on which the container was started into that container's environment variables.

That is, the container has the variables HOST and PORT0.
This means that inside the container we know which host it is deployed on and which external ports it has.

For everything to work, the host and port values sent in the registration request must carry the values of the container's HOST and PORT0 variables.

 {
   …
   "host": "$HOST",
   "port": "$PORT0",
   …
 }

The host is easy to specify: Selenium has a dedicated setting for it.

The port is harder. If we pass PORT0 to Selenium, it will not only register on the hub with this port but also come up on it! Why is this a problem?

For example, Apache Mesos gave us external port 20765. When the container starts, the mapping 20765:5555 is created. The second number is hardcoded by us in the config, so Docker expects the node inside the container to listen on 5555 and forwards connections from external port 20765 there.

But if we pass -port 20765 to the node, it will listen on 20765 inside the container, not on 5555, and no requests from outside will be processed.

You may have already guessed that the problem is solved by splitting the notion of a port in two: the port on which the node comes up, and the port it must report to the hub. In a Docker environment these values usually differ.

How do we tell the node about these two ports?
There is no way.

Out of the box, Selenium Standalone Server cannot do this.
We need to patch Selenium.

Patching Selenium Server

The code of Selenium itself is on GitHub, so we decided to add some more ... wonderful code to the Selenium Standalone Server.

We added an advertisePort parameter:

 @Expose( serialize = false )
 @Parameter(
   names = "-advertisePort",
   description = "<Integer> : advertise port of Node. " +
     "This port is sent to Hub for backward communication with this node."
 )
 public Integer advertisePort;

And a condition in the registration method on the server:

 if (registrationRequest.getConfiguration().advertisePort != null) {
   registrationRequest.getConfiguration().port =
     registrationRequest.getConfiguration().advertisePort;
 }

Now, if the advertisePort parameter is set when the node starts, it is used instead of the standard port during registration on the hub. This is a local patch; we have not yet sent a pull request to the Selenium repository. We will do so once we have run our scheme in end to end.
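
With the patch in place, the node's start command inside the container can combine the values set by Marathon with the fixed internal port. A minimal sketch: the function name and the hub address are made up, and -advertisePort exists only in the patched server described here, not in stock Selenium.

```shell
# Build the node's command-line arguments inside a container started
# by Marathon. HOST and PORT0 are the variables Marathon sets; 5555 is
# the internal port that Docker forwards the external port to.
node_args() {
  hub_url=$1
  echo "-role node -hub $hub_url/grid/register -host $HOST -port 5555 -advertisePort $PORT0"
}

# Values from the example above, set by hand here for illustration.
HOST=192.168.1.5
PORT0=20765
node_args http://192.168.1.7:4444
```

The node then listens on 5555 inside the container, while the hub is told to call it back on 192.168.1.5:20765, which Docker maps to that internal port.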

With this parameter, nodes register on the hub correctly. We checked: it works, and the tests run.

And yes, we used Marathon because our developers use it. This is essentially a proof of concept. In general, though, this framework is not ideal for running Selenium Grid, since it is geared toward long-running tasks such as services and UI applications.

Conclusions

A dynamic organizational environment requires dynamic resource management. A static setup will break on process problems.

Therefore, our test run system consists of the following components:

a cluster in which Docker containers are created;
Selenium Grid as an application consisting of two components: hub and node;
Jenkins as the application that executes our jobs;
and scripts that automate part of the work: Ansible and sh scripts.

We did not need additional funding. And we sped up the test run not just to 10 minutes but to 5. The average for our projects is now exactly 5 minutes: 2 minutes for bringing up and removing the grid, building the project and so on, and 3 minutes for the test suites themselves.

Was the result worth the effort? Of course: in the end, we made the test run at least twice as fast.

If you don't like queues either and are trying to speed up your test runs, perhaps our experience will be useful to you.

By the way, if posts about testing make your heart beat faster and you feel like doing something similar, note that we have an opening for a tester.

And if you have any questions or need clarification, be sure to write in the comments.

Source: https://habr.com/ru/post/331434/
