Author: Alexey Ovchinnikov

Quite often, when creating a virtual machine in the cloud, you want to attach some storage device to it. Just as often, you want that virtual machine to work as fast as possible. When a storage device is attached to a virtual machine (VM), the exchange of data with it can significantly degrade the performance of the pair. It is clear that if the storage device is located on the same physical node where the VM is deployed, the delay will be minimal. What is not obvious is how to achieve such a convenient placement with the OpenStack platform.
Unfortunately, OpenStack does not yet provide the means for such fine tuning out of the box; however, being an open and easily extensible platform, it allows you to add this functionality yourself. In this post I will discuss how such an extension can be implemented and the pitfalls that may be encountered during its development and use.
I will begin with a simple question: how can a VM be placed on a particular node?
As everyone is (perhaps) well aware, the scheduler (the nova-scheduler component) is responsible for placing VMs on nodes, so to achieve our goal we need to modify its behavior so that it takes the distribution of storage devices into account. The standard approach here is to use scheduler filters. Filters influence which nodes the scheduler chooses, and they can be controlled from the command line by passing them the characteristics that the selected nodes must satisfy. There are several standard filters that cover a fairly wide class of scheduling tasks; they are described in the OpenStack Docs project documentation. For less trivial tasks there is always the option of developing your own filter, which is what we will do now.
A few words about filters
The general idea of filter-based scheduling is quite simple: the user specifies the characteristics that a node must satisfy, and the scheduler selects the set of nodes that satisfy them. The VM is then started on one of the nodes selected at this stage; exactly which one is determined by its load and a number of other characteristics that are irrelevant during filtering. Let us consider the filtering procedure in more detail.
There are usually several filters in the system. The scheduler first compiles a list of all available nodes and then applies each filter to this list, discarding unsuitable nodes at each iteration. In this model the task of a filter is very simple: examine the node it is given and decide whether it meets the filtering criteria. Each filter is an object of a filter class that has at least one method, host_passes(). This method takes a node and the filtering criteria as input and returns True or False depending on whether the node meets the criteria. All filter classes must inherit from the BaseHostFilter base class defined in nova.scheduler.filters. At startup the scheduler imports all modules listed in the list of available filters. Then, when the user sends a request to start a VM, the scheduler creates an object of each filter class and uses these objects to screen out unsuitable nodes. It is important to note that these objects exist only for the duration of a single scheduling session.
As an example, consider the RAM filter, which selects nodes with a sufficient amount of memory. This is a standard filter with a fairly simple structure, so it is a good starting point for developing more complex filters:
class RamFilter(filters.BaseHostFilter):
    """Ram Filter with over subscription flag"""

    def host_passes(self, host_state, filter_properties):
        """Only return hosts with sufficient available RAM."""
        instance_type = filter_properties.get('instance_type')
        requested_ram = instance_type['memory_mb']
        free_ram_mb = host_state.free_ram_mb
        total_usable_ram_mb = host_state.total_usable_ram_mb

        memory_mb_limit = total_usable_ram_mb * FLAGS.ram_allocation_ratio
        used_ram_mb = total_usable_ram_mb - free_ram_mb
        usable_ram = memory_mb_limit - used_ram_mb
        if not usable_ram >= requested_ram:
            LOG.debug(_("%(host_state)s does not have %(requested_ram)s MB "
                        "usable ram, it only has %(usable_ram)s MB usable ram."),
                      locals())
            return False

        # save oversubscription limit for compute node to test against:
        host_state.limits['memory_mb'] = memory_mb_limit
        return True
To determine whether a given node is suitable for a future VM, the filter needs to know how much RAM is currently available on the node and how much memory the VM requires. If the node has less free memory than the VM needs, host_passes() returns False and the node is removed from the list of available nodes. All node state information is contained in the host_state argument, while the information needed to make the decision is passed in the filter_properties argument. Constants reflecting a general scheduling policy, such as ram_allocation_ratio, can be defined elsewhere, in configuration files or in the filter code, but this is largely unimportant, since everything needed for scheduling can be passed to the filter using so-called scheduler hints.
Scheduler hints
Scheduler hints are nothing more than a dictionary of key-value pairs contained in every request generated by the nova boot command. If nothing is done, this dictionary remains empty and nothing interesting happens. If the user decides to pass a hint and thus populate the dictionary, this is easily done with the --hint key, for example: nova boot … --hint your_hint_name=desired_value. Now the hints dictionary is no longer empty; it contains the passed pair. If some scheduler extension can use this hint, it has just received information it should take into account; if there is no such extension, nothing happens. The second case is less interesting than the first, so let's focus on the first and see how an extension can use hints.
To use the hints, they obviously need to be extracted from the request. This is also quite simple: all hints are stored in the filter_properties dictionary under the scheduler_hints key. The following code snippet fully illustrates how to retrieve a hint:
scheduler_hints = filter_properties['scheduler_hints']
important_hint = scheduler_hints.get('important_hint', False)
In the nova scheduler, scheduler_hints is always present in the request, so you should not expect any unpleasant surprises here when developing your extension; however, you should be careful when reading the value of the hint itself.
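For example, a filter that relies on an optional hint (important_hint here is just the hypothetical name from the snippet above) would typically treat a missing or empty value as "nothing to check" inside host_passes():

        # Inside host_passes(): a hint the user never passed is simply
        # absent from the dictionary, so fall back to a default value
        # and accept the host rather than failing.
        important_hint = filter_properties['scheduler_hints'].get(
            'important_hint', False)
        if not important_hint:
            return True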
Now we know how to receive arbitrary hints. To reach our original goal, it remains to discuss how they can be used to
Increase the locality of attached storage devices!
Knowing how to extend the functionality of the scheduler, you can easily design a filter that runs VMs on the very nodes that physically host the storage devices the user is interested in. Obviously, we need some way to identify the storage device we are going to use. Here the volume_id string, unique to each device, comes to our aid. From the volume_id we must somehow obtain the name of the node to which the volume belongs, and then select that node at the filtering stage. Both of these tasks should be solved by the filter, and for it to work, the filter needs to be told which device we have in mind, with the help of the corresponding hint.
First we use the hint mechanism to pass the volume_id to the filter; to do this, we agree to use the hint name same_host_volume_id. This is a simple task, but it immediately leads to a less obvious one: how do we get the node name from the identifier of the storage device? Unfortunately, there appears to be no easy way to do this, so we will turn for help to the component responsible for data storage: cinder.
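Assuming this hint name, the boot request would then look something like this (other nova boot options omitted, and the volume ID is a placeholder):

nova boot … --hint same_host_volume_id=<volume_id>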
Cinder services can be used in different ways: for example, you could use a combination of API calls to get the metadata associated with a given volume_id and then extract the node name from it. This time, however, we will take a simpler path and rely on the cinderclient module to form the necessary requests, working with what it returns:
volume = cinder.cinderclient(context).volumes.get(volume_id)
vol_host = getattr(volume, 'os-vol-host-attr:host', None)
Note that this approach only works for the Grizzly release and later, since the cinder extension that exposes the information we need is only available there.
The rest of the implementation is trivial: you need to compare vol_host with the name of the incoming host and return True only when they match. Implementation details can be viewed either in the Grizzly package or in the Havana implementation.
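To make the picture concrete, here is a minimal sketch of what such a filter can look like. It follows the structure of the standard filters shown earlier and reuses the cinderclient call from the previous snippet; the class name and details are illustrative, and the real Grizzly and Havana implementations differ in matters such as logging and error handling:

from nova.scheduler import filters
from nova.volume import cinder


class VolumeAffinityFilter(filters.BaseHostFilter):
    """Schedule a VM on the host that stores the volume given by the
    same_host_volume_id scheduler hint."""

    def host_passes(self, host_state, filter_properties):
        context = filter_properties['context']
        scheduler_hints = filter_properties['scheduler_hints']
        volume_id = scheduler_hints.get('same_host_volume_id', None)

        # No hint was passed: there is nothing to check, accept the host.
        if not volume_id:
            return True

        # Ask cinder which host the volume lives on (Grizzly and later).
        volume = cinder.cinderclient(context).volumes.get(volume_id)
        vol_host = getattr(volume, 'os-vol-host-attr:host', None)

        # Accept only the host that physically stores the volume.
        return vol_host == host_state.host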
With some reflection on the resulting filter, the inevitable question arises:
Is this the best thing to do?
No, the method described here is neither optimal nor the only one possible. A straightforward implementation suffers from repeated calls to cinder, which are quite expensive, and from a number of other problems that slow the filter down. These problems are insignificant for small clusters but can lead to noticeable delays on a large number of nodes. To improve the situation you can modify the filter: for example, by caching the host name, which limits it to one cinder call per VM boot, or by adding flags that effectively switch the filter off as soon as the desired host has been found.
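As an illustration of the caching idea only, here is one hypothetical way to extend the sketch above (imports as before); it relies on the fact, noted earlier, that a filter object lives for a single scheduling session, so caching inside the object limits us to one cinder call per boot request:

class VolumeAffinityFilter(filters.BaseHostFilter):
    """Volume affinity filter with a per-request cache of the volume host."""

    def __init__(self):
        super(VolumeAffinityFilter, self).__init__()
        self._cached_volume_id = None
        self._cached_vol_host = None

    def _volume_host(self, context, volume_id):
        # The filter object exists for one scheduling session, so the
        # cache is consulted for every host but cinder is called once.
        if self._cached_volume_id != volume_id:
            volume = cinder.cinderclient(context).volumes.get(volume_id)
            self._cached_vol_host = getattr(volume,
                                            'os-vol-host-attr:host', None)
            self._cached_volume_id = volume_id
        return self._cached_vol_host

    def host_passes(self, host_state, filter_properties):
        volume_id = filter_properties['scheduler_hints'].get(
            'same_host_volume_id', None)
        if not volume_id:
            return True
        vol_host = self._volume_host(filter_properties['context'], volume_id)
        return vol_host == host_state.host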
To summarize, I’ll note that VolumeAffinityFilter is only the beginning of work on using locality to improve cloud performance, and there is room for development in this direction.
Instead of an afterword
The example I reviewed shows how you can develop a filter for the nova scheduler with a feature that distinguishes it from the others: it uses the API of another OpenStack component to do its job. Along with the added flexibility, this approach can hurt overall performance, since the services may be located far from each other. A possible solution to this kind of fine-tuning problem would be to combine the schedulers of all services into a single one with access to all the characteristics of the cloud, but at the moment there is no simple and effective way to do this.
Original article in English