
Optimal shard placement in a petabyte-scale Elasticsearch cluster: linear programming

At the very heart of the Meltwater and Fairhair.ai information retrieval systems lies a collection of Elasticsearch clusters holding billions of articles from media and social media.

Index shards in these clusters differ greatly in access pattern, workload, and size, which raises some very interesting problems.

In this article, we describe how we applied linear programming (linear optimization) to distribute the search and indexing workload as evenly as possible across all nodes in the clusters. This solution reduces the chance that a single node becomes a bottleneck in the system. As a result, we increased search speed and saved on infrastructure.

Background


The Fairhair.ai information retrieval systems contain about 40 billion posts from social media and editorial sources and process millions of queries daily. The platform provides customers with search results, graphs, analytics, and data export for more advanced analysis.
These massive datasets are housed in several 750-node Elasticsearch clusters with thousands of indices comprising more than 50,000 shards.

For more information about our cluster, see our previous articles on its architecture and our machine-learning-based load balancer.

Uneven workload distribution


Both our data and user queries are usually tied to a date. Most queries fall within a specific time period, for example, the last week, the last month, the last quarter, or an arbitrary range. To simplify indexing and querying, we use time-based indexing, similar to the ELK stack.
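To make this concrete, here is a small illustration, not taken from the article, of how time-based indices keep queries cheap: a query for a three-month range only needs to touch three monthly indices. The index naming scheme ("articles-YYYY-MM") and the helper function are hypothetical.

```python
from datetime import date


def monthly_indices(start, end, prefix="articles"):
    """Return the names of the monthly indices covering the [start, end] range."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# A "last quarter" query (September-November 2018) only has to search three indices,
# and an index older than the retention period can simply be deleted as a whole.
print(monthly_indices(date(2018, 9, 1), date(2018, 11, 30)))
# ['articles-2018-09', 'articles-2018-10', 'articles-2018-11']
```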

Such an index architecture has a number of advantages. For example, it allows efficient bulk indexing, and whole indices can be deleted when their data ages out. It also means that the workload on a given index varies greatly over time.

The newest indices receive exponentially more queries than the older ones.


Fig. 1. Index access pattern over time. The vertical axis shows the number of executed queries; the horizontal axis shows the age of the index. Weekly, monthly, and annual plateaus are clearly visible, followed by a long tail of lower workload on older indices.

The patterns in Fig. 1 were quite predictable, as our clients are most interested in recent information and regularly compare the current month with the previous one and/or this year with the last. The problem is that Elasticsearch does not know about this pattern and does not automatically optimize for the observed workload!

Elasticsearch's built-in shard placement algorithm takes only two factors into account:

  1. The number of shards on each node. The algorithm tries to balance the number of shards per node evenly across the cluster.
  2. Disk watermarks. Elasticsearch considers the available disk space on a node before deciding whether to allocate new shards to that node or move shards away from it. At 80% disk usage the node stops receiving new shards; at 90% the system starts actively moving shards off the node (the sketch after this list shows where these thresholds are configured).
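For reference, these thresholds correspond to Elasticsearch's disk allocation watermarks and can be adjusted through the standard cluster settings API. The sketch below is a hypothetical example, not part of the article; the cluster address is a placeholder, and the 80%/90% values simply mirror the numbers mentioned above.

```python
import json

import requests

ES_URL = "http://localhost:9200"  # placeholder cluster address

# Low watermark: stop allocating new shards to a node.
# High watermark: start actively relocating shards away from it.
settings = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}

resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    data=json.dumps(settings),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```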

The algorithm's fundamental assumption is that every shard in the cluster receives roughly the same amount of workload and that all shards are roughly the same size. In our case, this is very far from the truth.

The standard allocation therefore quickly leads to hot spots in the cluster. They appear and disappear at random, because the workload changes over time.

A hot spot is essentially a node operating near the limit of one or more system resources, such as CPU, disk I/O, or network bandwidth. When this happens, the node first queues requests for a while, which increases response times. But if the overload lasts long enough, requests are eventually rejected and users get errors.

Another common consequence of overload is unsustainable JVM garbage collection pressure from queries and indexing operations, which leads to the dreaded "GC hell": the JVM either cannot reclaim memory fast enough and dies with an out-of-memory error, or gets stuck in endless garbage collection cycles, freezes, and stops responding to requests and cluster pings.

The problem got worse when we refactored our architecture for AWS. Previously, we were saved by the fact that we ran up to four Elasticsearch nodes on our own powerful servers (24 cores) in our data center. This masked the effect of the skewed shard distribution: the load was largely smoothed out by the relatively large number of cores per machine.

After the refactoring, we placed only one node on each less powerful machine (8 cores), and the first tests immediately revealed serious problems with "hot spots".

Elasticsearch assigns shards essentially at random, and with more than 500 nodes in a cluster the likelihood of too many "hot" shards landing on one node increased greatly, and such nodes quickly became overloaded.

For users, this would mean a serious degradation of the experience, since overloaded nodes respond slowly and sometimes reject requests outright or crash. If such a system went into production, users would see frequent, seemingly random UI slowdowns and random timeouts.

At the same time, a large number of nodes hold shards that receive no particular load and sit practically idle. This leads to inefficient use of our cluster resources.

Both problems could be avoided if Elasticsearch distributed shards more intelligently, since the average use of system resources across all nodes is at a healthy level of 40%.

Continuous cluster change


While operating more than 500 nodes, we observed one more thing: the state of the nodes changes constantly. Shards continuously move back and forth between nodes under the influence of a number of factors.


Taking all this into account, we concluded that a comprehensive solution to these problems requires a continuous, dynamic re-optimization algorithm.

Solution: Shardonnay


After a long study of the available options, we concluded that we wanted to:

  1. Build our own solution. We did not find good articles, code, or other existing ideas that would work well at our scale and for our tasks.
  2. Run the rebalancing process outside of Elasticsearch and use the cluster reroute API (a minimal example appears after this list), rather than trying to build a plugin. We wanted a fast feedback loop, and deploying a plugin on a cluster of this size can take several weeks.
  3. Use linear programming to calculate the optimal shard moves at any point in time.
  4. Perform the optimization continuously, so that the cluster state gradually converges to the optimum.
  5. Do not move too many shards at the same time.
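For illustration, the cluster reroute API mentioned in point 2 accepts explicit move commands like the hypothetical one below. The index and node names are placeholders, and this is a minimal sketch rather than Shardonnay's actual code.

```python
import json

import requests

ES_URL = "http://localhost:9200"  # placeholder cluster address

# Move shard 0 of a (hypothetical) index from one node to another.
reroute_body = {
    "commands": [
        {
            "move": {
                "index": "articles-2018-11",
                "shard": 0,
                "from_node": "node-17",
                "to_node": "node-42",
            }
        }
    ]
}

resp = requests.post(
    f"{ES_URL}/_cluster/reroute",
    data=json.dumps(reroute_body),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json().get("acknowledged"))
```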

We noticed an interesting thing: if you move too many shards at once, it is very easy to trigger a cascading shard relocation storm. Once such a storm starts, it can go on for hours, with shards moving back and forth uncontrollably and pushing disk usage past the watermark levels in various places. That, in turn, triggers new shard movements, and so on.

To understand what happens, it is important to know that when an actively indexed shard is moved, it temporarily uses much more space on the disk it is moving from. This is due to the way Elasticsearch retains transaction logs. We have seen cases where an index doubled in size while moving off a node. This means that a node that started relocating shards because of high disk usage will use even more disk space for a while, until it has moved enough shards to other nodes.

To solve these problems, we developed the Shardonnay service, named in honor of the famous Chardonnay grape variety.

Linear optimization


Linear optimization (or linear programming, LP) is a method for achieving the best outcome, such as maximum profit or lowest cost, in a mathematical model whose requirements are represented by linear relationships.

The optimization method is based on a system of linear variables, a set of constraints that must be satisfied, and an objective function that defines what a successful solution looks like. The goal of linear optimization is to find the values of the variables that minimize the objective function while satisfying the constraints.
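In its textbook standard form (a generic formulation, not our specific model), a linear program over a variable vector x looks like this:

```latex
\begin{aligned}
\text{minimize}\quad   & c^{\top} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \ge 0,
\end{aligned}
```

where c encodes the objective's weights and the rows of A and b encode the linear constraints. In our case the variables are additionally restricted to be binary (0 or 1), which turns the problem into an integer program handled by the same family of solvers.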

Shard distribution as a linear optimization problem


Shardonnay runs continuously, and on each iteration it performs the following algorithm:

  1. Retrieves information about the existing shards, indices, and nodes in the cluster, as well as their current placement, via the Elasticsearch API.
  2. Models the cluster state as a set of binary LP variables. Each (node, index, shard, replica) combination gets its own variable. The LP model also includes a number of carefully designed heuristics, constraints, and an objective function; see below.
  3. Sends the LP model to a linear solver, which returns the optimal solution subject to the constraints and the objective function. The solution is a new assignment of shards to nodes.
  4. Interprets the LP solution and converts it into a sequence of shard moves.
  5. Instructs Elasticsearch to perform the shard moves via the cluster reroute API.
  6. Waits for the cluster to finish moving the shards.
  7. Returns to step 1.

The main thing is to design the right constraints and objective function; the rest is handled by the LP solver and Elasticsearch.
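To make the modeling step more concrete, here is a minimal sketch of a shard-to-node assignment LP using the open-source PuLP library. The node names, shard sizes, disk budget, and the "minimize the busiest node" objective are all assumptions for illustration; they are not Shardonnay's actual constraints or cost function.

```python
import pulp

nodes = ["node1", "node2", "node3"]
# Hypothetical shard sizes in GB.
shards = {"idx_2018_11_s0": 40, "idx_2018_11_s1": 35, "idx_2018_10_s0": 20}
disk_budget_gb = 70  # assumed per-node disk budget (stands in for the watermarks)

prob = pulp.LpProblem("shard_placement", pulp.LpMinimize)

# One binary variable per (shard, node) combination: 1 if the shard is placed on that node.
x = {
    (s, n): pulp.LpVariable(f"x_{s}_{n}", cat="Binary")
    for s in shards
    for n in nodes
}

# Objective: minimize the disk usage of the most loaded node,
# a simple proxy for "spread the load evenly".
max_usage = pulp.LpVariable("max_usage", lowBound=0)
prob += max_usage

# Every shard must be assigned to exactly one node.
for s in shards:
    prob += pulp.lpSum(x[s, n] for n in nodes) == 1

# Per-node disk budget, and the definition of max_usage.
for n in nodes:
    node_usage = pulp.lpSum(shards[s] * x[s, n] for s in shards)
    prob += node_usage <= disk_budget_gb
    prob += node_usage <= max_usage

prob.solve()
for (s, n), var in x.items():
    if var.value() == 1:
        print(f"{s} -> {n}")
```

The real model has one variable per (node, index, shard, replica) combination and many more constraints, but the structure, binary placement variables with linear constraints and a linear objective, is the same.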

Not surprisingly, this turned out to be a very difficult task for a cluster of this size and complexity!

Constraints


Some constraints in the model are based on rules dictated by Elasticsearch itself, for example, always respecting the disk watermarks, or forbidding a replica from being placed on the same node as another copy of the same shard.

Others were added based on experience gained over years of working with large clusters.



Cost function


Our cost function weighs a number of different factors against each other.


Reducing the number of LP variables


At our scale, the size of these LP models becomes a problem in itself. We quickly realized that problems with more than 60 million variables could not be solved in a reasonable time. Therefore, we applied many optimization and modeling tricks to drastically reduce the number of variables, among them biased sampling, heuristics, the divide-and-conquer method, iterative relaxation, and optimization.
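As a rough back-of-the-envelope illustration of why the naive model explodes (the shard and replica counts here are hypothetical, merely in the same ballpark as the cluster described above):

```python
# One binary variable per (shard copy, node) combination.
nodes = 750
shard_copies = 50_000 * 2  # assume each of 50,000+ shards has one replica
variables = nodes * shard_copies
print(f"{variables:,} binary variables")  # 75,000,000 -- far too many to solve directly
```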


Fig. 2. The heat map shows the unbalanced load on the Elasticsearch cluster, visible as the large variance in resource usage on the left side of the graph. Thanks to continuous optimization, the situation gradually stabilizes.


Fig. 3. The heat map shows CPU utilization on all nodes in the cluster before and after enabling the hotness feature in Shardonnay. A significant change in CPU usage is visible under a constant workload.


Fig. 4. The heat map shows disk read throughput during the same period as in Fig. 3. Read operations are also distributed more evenly across the cluster.

Results


As a result, our LP solver finds good solutions in a few minutes, even for our huge cluster. The system thus iteratively improves the cluster state toward the optimum.

And best of all, the variance of workload and disk usage converges as expected, and this near-optimal state has been maintained through the many intentional and unexpected changes in the cluster state since then!

We now maintain a healthy workload distribution in our Elasticsearch clusters, all thanks to linear optimization and our service, which we lovingly call Shardonnay.

Source: https://habr.com/ru/post/429738/

