Anyone who tests web applications knows that most routine actions on a web service can be automated with the Selenium framework. At Yandex, millions of autotests that use Selenium to drive browsers run every day, so we need thousands of different browsers available simultaneously, 24/7. And this is where things get interesting.

Selenium with a large number of browsers has many scaling and fault-tolerance problems. After several attempts, we arrived at an elegant and easy-to-maintain solution, and we want to share it with you. Our
gridrouter project lets
you organize a fault-tolerant Selenium grid from any number of browsers. The code is open source and available on
GitHub. Below, I will describe which shortcomings of Selenium we focused on, how we arrived at our solution, and explain how to configure it.
Problem
Selenium has changed dramatically since its inception. The current architecture, called Selenium Grid, works like this.

A cluster consists of two applications: a
hub and a
node.
A hub is an API server that accepts user requests and dispatches them to nodes.
A node is a worker process that launches browsers and executes test steps in them. In theory, an unlimited number of nodes can be connected to a single hub, and each node can run any of the supported browsers. But what happens in practice?
- There is a single point of failure. The hub is the only access point to the browsers. If the hub process stops responding for any reason, all browsers become unavailable. The service obviously also stops working if the data center hosting the hub has a network or power failure.
- Selenium Grid doesn't scale well. Our long-term experience of running Selenium on different hardware shows that under load a single hub can handle no more than a few dozen connected nodes. If you keep adding nodes, then at peak load the hub may stop responding over the network, or it processes requests too slowly.
- There are no quotas. You cannot create users and specify which browser versions each user may use.
Solution
To avoid suffering when a single hub goes down, you can run several. But the usual Selenium client libraries are designed to work with only one hub, so you have to teach them to work with a distributed system.
Client balancing
Initially, we solved the problem of working with several hubs using a small library that was used in the test code and performed balancing on the client side.

Here's how it works:
- Information about the hosts running hubs and the browser versions available on them is stored in a configuration file.
- The user connects the library in their tests and requests the browser.
- A host is randomly selected from the list and an attempt is made to get a browser.
- If the attempt is successful, the browser is given to the user and the tests begin.
- If a browser could not be obtained, the next host is chosen at random, and so on. Since different hubs may have different numbers of browsers available, you can assign weights to the hubs in the configuration file, and the random selection takes these weights into account. This approach achieves a uniform load distribution.
- The user receives an error only if a browser could not be obtained from any of the hubs.
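The weighted random selection with retries described above can be sketched in Java. This is a minimal illustration; the class and method names are hypothetical and not taken from the actual Yandex library:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Sketch of client-side balancing: a hub is picked at random with
// probability proportional to its weight (the number of browsers it
// offers); a hub that failed is dropped before the next attempt.
public class HubBalancer {
    // hostname -> weight (number of browsers available on that hub)
    private final Map<String, Integer> hubs;
    private final Random random = new Random();

    public HubBalancer(Map<String, Integer> hubs) {
        this.hubs = new LinkedHashMap<>(hubs);
    }

    // Pick a hub with probability proportional to its weight.
    public String pickHub() {
        int total = hubs.values().stream().mapToInt(Integer::intValue).sum();
        if (total <= 0) {
            throw new IllegalStateException("No hubs left to try");
        }
        int r = random.nextInt(total);
        for (Map.Entry<String, Integer> e : hubs.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new AssertionError("unreachable");
    }

    // Drop a hub that failed so the next attempt tries the remaining ones.
    public void markFailed(String hub) {
        hubs.remove(hub);
    }
}
```

The caller loops: pick a hub, try to create a session on it, and on failure call markFailed and pick again until the map is empty.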
The implementation of this algorithm is simple, but it requires integration with every Selenium client library. Suppose your tests obtain a browser with the following code:
WebDriver driver = new RemoteWebDriver(SELENIUM_SERVER_URL, capabilities);
Here
RemoteWebDriver is the standard class for working with Selenium in Java. To work in our infrastructure, you would have to wrap it in our own code that chooses a hub:
WebDriver driver = SeleniumHubFinder.find(capabilities);
The test code no longer contains a Selenium URL; it lives in the library configuration. This also means that the test code is now tied to
SeleniumHubFinder and will not run without it. Moreover, if you have tests not only in Java but also in other languages, you would have to write a client-side balancer for each of them, which can be expensive. It is much easier to put the balancing code on a server and specify the server's address in the test code.
Server balancing
When designing the server, we set the following natural requirements:
- The server must implement the Selenium API (the JsonWire protocol) so that tests can work with it as with a regular Selenium hub.
- You can deploy as many server instances as you want in any data centers and balance between them with a hardware or software load balancer (SLB).
- The server instances are completely independent of each other and store no shared state.
- Out of the box, the server provides quotas, that is, independent operation of several users.

The resulting architecture looks like this:
- The load balancer (SLB) spreads user requests across one of the N server instances listening on the standard port (4444).
- Each instance stores configuration describing all the available Selenium hubs.
- When a browser request arrives, the server applies the balancing algorithm described in the previous section and obtains a browser.
- In standard Selenium, each running browser gets a unique identifier called the session ID. The client sends this value to the hub with every request. When a browser is obtained, the server replaces the real session ID with a new one that additionally encodes which hub the session was obtained from. The resulting session with the extended ID is returned to the client.
- For subsequent requests, the server extracts the hub's host address from the session ID and proxies the request to that host. Since all the information the server needs is in the request itself, there is no need to synchronize state between instances; each of them works independently.
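The session ID trick can be illustrated with a small sketch. The exact encoding here (hex-encoded hub address plus a separator) is an assumption for illustration only; gridrouter's real format differs, but the idea is the same: the extended ID alone is enough to route the request.

```java
import java.nio.charset.StandardCharsets;

// Illustrative session ID extension: embed the hub address into the
// session ID returned to the client, so any stateless server instance
// can later recover the target hub from the request alone.
public class SessionId {
    private static final String SEP = "-"; // hypothetical separator

    // Prepend the hex-encoded hub address to the real session ID.
    static String extend(String hubHost, String realSessionId) {
        StringBuilder hex = new StringBuilder();
        for (byte b : hubHost.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b));
        }
        return hex + SEP + realSessionId;
    }

    // Recover [hubHost, realSessionId] from the extended ID.
    // Hex digits never contain '-', so the first '-' is the separator.
    static String[] split(String extendedId) {
        int sep = extendedId.indexOf(SEP);
        String hexHost = extendedId.substring(0, sep);
        StringBuilder host = new StringBuilder();
        for (int i = 0; i < hexHost.length(); i += 2) {
            host.append((char) Integer.parseInt(hexHost.substring(i, i + 2), 16));
        }
        return new String[]{host.toString(), extendedId.substring(sep + 1)};
    }
}
```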
Gridrouter
We called the
server a gridrouter and decided to share its code with everyone. The server is written in Java using the
Spring Framework . Project sources can be viewed at the
link . We also prepared
Debian packages that install the server.
At the moment, the gridrouter is installed as a combat server used by various Yandex teams. The total number of available browsers in this grid is more than three thousand. At peak loads, we serve about the same number of user sessions.
How to set up a gridrouter
To configure the gridrouter, you need to define a
list of users and
quotas for each user. We did not aim for super-secure authentication with hash functions and salt, so we use
basic HTTP authentication and store logins and passwords in clear text in the
/etc/grid-router/users.properties text file:
user:password,user
user2:password2,user
Each line contains a login and a password separated by a colon, as well as a role, which is currently always the same:
user. Quotas are also very simple. Each quota is a separate XML file
/etc/grid-router/quota/<login>.xml, where
<login> is the user's name. The file looks like this:
<qa:browsers xmlns:qa="urn:config.gridrouter.qatools.ru">
    <browser name="firefox" defaultVersion="33.0">
        <version number="33.0">
            <region name="us-west">
                <host name="my-firefox33-host-1.example.com" port="4444" count="5"/>
            </region>
            <region name="us-east">
                <host name="my-firefox33-host-2.example.com" port="4444" count="5"/>
            </region>
        </version>
        <version number="38.0">
            <region name="us-west">
                <host name="my-firefox38-host-1.example.com" port="4444" count="4"/>
                <host name="my-firefox38-host-2.example.com" port="4444" count="4"/>
            </region>
            <region name="us-east">
                <host name="my-firefox38-host-3.example.com" port="4444" count="4"/>
            </region>
        </version>
    </browser>
    <browser name="chrome" defaultVersion="42.0">
        <version number="42.0">
            <region name="us-west">
                <host name="my-chrome42-host-1.example.com" port="4444" count="1"/>
            </region>
            <region name="us-east">
                <host name="my-chrome42-host-2.example.com" port="4444" count="4"/>
                <host name="my-chrome42-host-3.example.com" port="4444" count="3"/>
            </region>
        </version>
    </browser>
</qa:browsers>
As you can see, the file defines the names and versions of the available browsers, which must exactly match those configured on the hubs. For each browser version, one or more regions (that is, different data centers) are defined, each containing a host, a port, and the number of available browsers (this is the weight). Region names can be arbitrary. Region information matters when one of the data centers becomes unavailable: after one unsuccessful attempt to obtain a browser in a region, the gridrouter tries to get a browser from another region. This algorithm greatly increases the chances of obtaining a browser quickly.
How to run tests
Although we mostly write in Java, we tested our server with Selenium tests in other programming languages too. Usually the hub URL in tests looks something like this:
http://example.com:4444/wd/hub
Since we use basic HTTP authentication, URLs of the following form should be used when working with the gridrouter:
http://username:password@example.com:4444/wd/hub
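A hedged sketch of how such a URL can be built and passed to a test; the host name and credentials are placeholders, and the helper class is hypothetical:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Builds the gridrouter endpoint URL with basic-auth credentials
// embedded in the userinfo part, using the standard /wd/hub path.
public class GridRouterUrl {
    static URL endpoint(String user, String password, String host) {
        try {
            return new URL("http://" + user + ":" + password + "@" + host + ":4444/wd/hub");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

In a Java test this URL is then passed to RemoteWebDriver as usual: `new RemoteWebDriver(GridRouterUrl.endpoint("user", "password", "gridrouter.example.com"), capabilities)`.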
If you have any problems with the configuration, please contact us or file an issue on
GitHub.
Recommendations for configuring hubs and nodes
We experimented with different configurations of hubs and nodes and concluded that the following approach is the most practical in terms of ease of operation, reliability, and ease of scaling. Traditionally, a single hub is installed with many nodes connected to it, because there must be a single entry point.

With the gridrouter you can install as many hubs as you like, so the easiest setup is one hub and several nodes connected to
localhost:4444 on a single machine. This is especially convenient when everything is deployed in virtual machines. For example, we found that a combination of one hub and five nodes is optimal for a virtual machine with two vCPUs and 4 GB of memory. We install only one browser version per virtual machine, since that makes it easy to measure memory consumption and convert the number of virtual machines into the number of available browsers.
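The one-hub-plus-five-nodes layout on a single VM can be started roughly like this, using the standard Selenium standalone server flags. The jar version, ports, and browser settings below are placeholders; adjust them to your environment:

```shell
# Start the hub on the standard port (runs in the background).
java -jar selenium-server-standalone-2.53.1.jar -role hub -port 4444 &

# Start five nodes, each registering with the local hub and
# offering a single instance of one browser version.
for p in 5555 5556 5557 5558 5559; do
  java -jar selenium-server-standalone-2.53.1.jar -role node \
    -hub http://localhost:4444/grid/register \
    -port "$p" -browser browserName=firefox,maxInstances=1 &
done
```

Each VM configured this way then appears in the gridrouter quota file as one host entry with count="5".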