How to use preemptible virtual machines in Yandex.Cloud and save on large-scale tasks

Today we want to talk about a useful Yandex.Cloud feature: preemptible virtual machines. This is a special option that you can select when creating a virtual machine to get computing resources at a reduced price. What is special about preemptible virtual machines, why are they cheaper than regular ones, and when does it make sense to use them?

The capacity of Yandex.Cloud, or more precisely of the Yandex Compute Cloud infrastructure service, is much larger than what users actually consume. By default, users are expected to be able to scale almost without limit. For this reason alone, other aspects aside, the resources available on the cloud platform significantly exceed current demand. Preemptible virtual machines are created on this spare capacity.

Main limitations

In short, the essence of preemptible virtual machines is this: the service offers its spare computing resources at a lower price, on the condition that these resources can be reclaimed at any time.

In general, preemptible virtual machines work like regular virtual machines, but a number of limitations apply to them:

They are not covered by a service level agreement (SLA).
They cannot be created or started if the availability zone does not have enough free resources.
They can be forcibly stopped at any time. The probability of a stop is small but not zero; it can change over time and differ between Yandex.Cloud availability zones.
A preemptible virtual machine cannot be converted into a regular one, nor a regular one into a preemptible one. The corresponding flag is set once and does not change.
The machine is always stopped no later than 24 hours after it starts.

In practice, in the overwhelming majority of cases preemptible virtual machines run for the full 24 hours allowed by the terms of service. A forced stop usually happens only when a large number of regular virtual machines are created in a particular availability zone within a short period: a new user with substantial needs arrives, or existing users scale up en masse.

A stopped virtual machine can be started again: all data on its disks is preserved after both automatic and manual stops.
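Because disk data survives a stop, a long-running job can record its progress on disk and pick up where it left off after the machine is started again. Below is a minimal sketch of that idea, not an official recipe: the function name `process_with_checkpoint`, the JSON checkpoint format, and the `stop_after` parameter (which only simulates a forced stop) are all assumptions made for illustration.

```python
import json
import os


def process_with_checkpoint(items, checkpoint_path, handle, stop_after=None):
    """Process items in order, saving progress to disk after each one.

    stop_after is a hypothetical knob that simulates a forced stop after
    that many items in the current run. Returns True when all items are done.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        # A previous run was interrupted: resume from the saved position.
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    processed_now = 0
    for index in range(done, len(items)):
        if stop_after is not None and processed_now >= stop_after:
            return False  # simulated stop of the virtual machine
        handle(items[index])
        processed_now += 1
        # Write to a temporary file and rename it, so a stop in the middle
        # of a write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return True
```

If a stop lands between handling an item and saving the checkpoint, that one item is handled again after the restart, so the handler should tolerate being repeated.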

Usage scenarios

The limitations of preemptible virtual machines raise a logical question: how can you use them if the resources can be reclaimed at any time? Here are some possible use cases.

Batch processing

Batch processing involves running a large number of resource-intensive tasks in parallel. Examples include file format conversion, image processing and recognition, and ETL operations. The key point is that batch processing has a task queue and a pool of worker processes that pull tasks from the queue. If an individual worker running on a preemptible machine stops, its task is simply handed to the next worker. In other words, stopping one or even several virtual machines has no significant negative effect on the process or the result.
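The queue-and-workers scheme can be sketched in a few lines. This is a simplified, single-process simulation under our own assumptions: the `Preempted` exception, `run_batch`, and `make_worker` are hypothetical names, and a real system would use a message queue and separate machines rather than function calls.

```python
from collections import deque


class Preempted(Exception):
    """Raised by a worker whose (simulated) preemptible machine was stopped."""


def run_batch(tasks, workers):
    """Run every task; a task interrupted by a stop goes back to the queue."""
    queue = deque(tasks)
    alive = list(workers)
    results = {}
    turn = 0
    while queue:
        if not alive:
            raise RuntimeError("no workers left to process the queue")
        worker = alive[turn % len(alive)]
        task = queue.popleft()
        try:
            results[task] = worker(task)
            turn += 1
        except Preempted:
            queue.append(task)    # the task is not lost, only delayed
            alive.remove(worker)  # the stopped worker takes no more tasks
    return results


def make_worker(name, fail_on=None):
    """Build a worker; it is 'stopped' when it receives the task fail_on."""
    def worker(task):
        if task == fail_on:
            raise Preempted(name)
        return f"{task} done by {name}"
    return worker
```

For example, with one regular worker and one preemptible worker that is stopped while holding task "b", all three tasks still finish:

```python
workers = [make_worker("regular"), make_worker("preemptible", fail_on="b")]
results = run_batch(["a", "b", "c"], workers)
```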

Batch processing typically uses dozens of virtual machines, so preemptible machines bring very noticeable savings. One of the main consumers of high-performance preemptible virtual machines with 32 cores today is Seismotech, a long-time Yandex.Cloud client. Seismotech processes the seismic data needed to explore gas and oil fields. Seismic exploration involves large volumes of information, and the data is processed in batches. The company runs more than 60 preemptible machines at a time: up to 2,000 vCPUs and 4,000 GB of RAM in total.

Hadoop Projects

Hadoop is used to develop and run distributed programs on clusters of hundreds or thousands of low-cost nodes. Its built-in file replication and automatic restart of tasks from failed nodes make the distributed system resilient to failures of individual machines. That is why, wherever Hadoop is used, at least some of the nodes can easily be deployed on preemptible virtual machines. If they are stopped early, their tasks are sent to other nodes.

Fault tolerance of web services

Continuous availability of a web service can be achieved with a cluster, which consists of two or more servers. For web services, one of a cluster's jobs is to keep things running stably at peak load. Typical examples are online stores and sports sites, where traffic growth is tied to specific dates. For stores, these are traditional holidays or sales periods; for sports sites, they are event days, when broadcasts run and reviews and photo reports are published. At such times, traffic can grow significantly.

The cluster must cope with the influx of visitors by distributing traffic across nodes. For a sharp but short-lived load increase, fault tolerance can be provided by adding servers on preemptible virtual machines. This option is inexpensive and does the job well. One condition matters: such a cluster must be hybrid, that is, it must also include regular virtual machines. Then even an unlikely stop of the preemptible machines will not bring the service down.

Kubernetes projects

Kubernetes automates the deployment, scaling, and management of containerized applications across a large number of nodes. One of its main entities, which can be called the building block of Kubernetes, is the pod. A pod runs one or more containers on a single node. The Kubernetes scheduler selects and assigns a node for each pod. If a node with a running pod fails, the pod is automatically rescheduled onto a healthy node. With this model, some of the nodes can be placed on preemptible virtual machines.

Continuous Integration Testing

Continuous integration is based on building and testing the project frequently, mostly with automated tests. Schematically, it looks like this: a test environment is created on a virtual machine, the latest build of the application is deployed to it, the automated tests run, the test results are exported, and the virtual machine is deleted. Testing usually takes a few tens of minutes, less often several hours.

The traditional weak points of continuous integration are the significant cost of maintaining the integration process itself and the high demand for computing resources. From this point of view, and given how long automated tests take, preemptible virtual machines look like a more than suitable option for continuous integration. They are much cheaper, and the probability that the machine is stopped during a test run is vanishingly small. Moreover, even if the machine is stopped, the damage to the business is minimal.

Use with other Yandex.Cloud services

The Yandex Instance Groups service lets you automatically monitor the state of a whole group of preemptible virtual machines. It can create virtual machines with the specified characteristics, maintain the required number of machines in a group, and restart preemptible instances if they stop. It does not matter whether the stop was forced or 24 hours have passed since the start. Only one thing matters: the restart happens only if resources are available. Yandex Instance Groups makes working with preemptible virtual machines more convenient, but it cannot guarantee that a specific availability zone will have free capacity.
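The behavior described above can be pictured as a reconciliation pass over the group. The sketch below is only a toy model under our own assumptions, not the Yandex Instance Groups API: the `reconcile` function, the state strings, and the `free_capacity` counter are hypothetical.

```python
def reconcile(states, free_capacity):
    """One pass over a group: restart stopped machines while capacity lasts.

    states: dict mapping a machine name to "running" or "stopped"
    free_capacity: how many machines the zone can still start (assumed known)
    Returns the new states and the remaining capacity.
    """
    new_states = dict(states)
    for name in sorted(states):
        if new_states[name] == "stopped" and free_capacity > 0:
            new_states[name] = "running"
            free_capacity -= 1
    return new_states, free_capacity


group = {"vm-1": "running", "vm-2": "stopped", "vm-3": "stopped"}
# Only one slot is free, so one machine stays stopped until a later pass.
group, left = reconcile(group, free_capacity=1)
```

The reason for the stop plays no role in the model; a machine that stays stopped is simply picked up by a later pass once capacity appears.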

Economic indicators

As we mentioned, preemptible virtual machines can reduce the cost of computing resources. Inside Yandex, we started working on a similar capability a few years ago. Splitting computing tasks into guaranteed and preemptible ones required considerable investment. But it paid off: as a result, we raised the useful utilization of the server infrastructure from 30-40% to 70-80%.

Now a similar capability is available to all Yandex.Cloud users at the press of a single button. A simple example: if you move half of your virtual machines with 100% core load to preemptible mode, you can save up to 35-40% of the budget.
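The arithmetic behind an estimate like this is easy to reproduce. In the sketch below, the function `budget_saving` and all of its inputs are our own illustrative assumptions; in particular, the discount values are not published prices but inputs chosen so that the result lands in the range quoted above. Check real prices in the calculator.

```python
def budget_saving(share_preemptible, discount, compute_share=1.0):
    """Return the fraction of the total budget that is saved.

    share_preemptible: fraction of compute spend moved to preemptible machines
    discount: assumed relative price cut for preemptible machines (0..1)
    compute_share: fraction of the budget spent on discounted resources;
        anything outside it is billed at the regular rate
    """
    for name, value in (
        ("share_preemptible", share_preemptible),
        ("discount", discount),
        ("compute_share", compute_share),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return compute_share * share_preemptible * discount


# Half of the machines moved, with assumed discounts of 70% and 80%.
low = budget_saving(0.5, 0.70)
high = budget_saving(0.5, 0.80)
```

Under these assumptions, `low` and `high` come out at about 0.35 and 0.40. Lowering `compute_share` shows how charges that are not discounted shrink the overall saving.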

The reduced price applies only to CPU and RAM resources. Disk space and IP addresses are charged at regular rates. A simple calculation for the Cascade Lake platform shows this.

If you wish, you can compare the cost of virtual machines in different modes using the calculator.

We hope we have made things clearer and shown, with some useful examples, where preemptible virtual machines can cut the cost of computing resources without sacrificing the quality of the work.


Source: https://habr.com/ru/post/457538/
