So, you are trying to assess the reliability of your cloud service.

An SLA (Service Level Agreement) is a form of reliability guarantee that service providers commonly publish. It is usually offered on a take-it-or-leave-it basis: either it suits you and you use the service, or you look for another one. A typical wording is “industry leading 99.95% monthly uptime SLA”, which should suit the majority of users.

A potential user who reads about the “99.95% monthly uptime SLA” is usually very pleased: a guarantee of no more than about 21 minutes of downtime in a 30-day month sounds quite promising.
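The 21-minute figure is simple arithmetic. A minimal sketch in Python (the function name `downtime_budget_minutes` is made up for this illustration and is not part of any provider's tooling):

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, allowed over `days` days at the given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - availability_percent) / 100.0


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% over 30 days allows {downtime_budget_minutes(sla):.1f} minutes of downtime")
```

For 99.95% this comes to 21.6 minutes, which is where the "no more than 21 minutes" rule of thumb comes from.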

Everything is relatively simple as long as you only consume the cloud service for your own needs. You look at 99.95%, think "no more than 21 minutes a month", and walk away impressed and satisfied. But what if you are building your own service on top of another service and have to decide which SLA you can offer?

Take, for example, an image processing service (suspiciously similar to the ABBYY Cloud OCR SDK). What SLA can such a service offer? It would seem that you need to list all its dependencies on other services, carefully read their SLAs, count the nines, and decide how many nines after the decimal point you can write into your own SLA.
Suppose the image processing service runs in Windows Azure and uses the so-called web and worker roles from Azure Cloud Services to execute code, and Azure Storage to store data. Fine. Open the Cloud Services SLA and read there (TL;DR) that role instances are guaranteed to be available at least 99.95% of the time during the month, provided that each role has at least two instances. Open the Azure Storage SLA and read there (TL;DR) that at least 99.9% of requests to storage are guaranteed to succeed. If the actual level of service falls short of the guaranteed one, you need to contact support, and the provider will then refund part of the money.

This was a very brief summary of the SLAs of these two services. If you use either of them, you should read the full text carefully and take all the caveats into account.

The following is fundamentally important: even in the worst case, you will get back a relatively small amount of money, which will cover ... well, nothing at all, because the refund is tied to a share of the cost of the service consumed, and the cost of cloud services is very low compared to, say, the salaries of the employees who will be dealing with the provider's support. The meaning of an SLA with three nines is very simple: “dear users, this is a very reliable service, we are trying very hard, ~~PLAY and~~ use it, we will issue an invoice before the 10th day of the next month”. In essence, an SLA sets expectations for the service, which is also very important. If availability were “guaranteed” for, say, 15% of the time, the expectations would be completely different.

Now back to the question of what guarantees a service can give its own users if it depends substantially on other services with the SLAs described above. The machines that execute the code appear to be guaranteed available at least 99.95% of the time. Some calls to storage may fail, but that is no more than one-tenth of a percent, which is not scary: you design the service so that failed storage operations are retried several times with an increasing pause, and if an operation still fails after several attempts, you fail the user's request. If this does not happen too often, users will be quite satisfied.
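The retry-with-increasing-pause idea can be sketched roughly as follows. This is an illustration only: `call_with_retries`, its parameters, and the doubling delay are assumptions made for the example, not the service's actual code or an Azure API, and a real implementation would catch only errors it knows to be transient rather than every exception.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); after each failure wait base_delay, 2*base_delay, ... and retry.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # hypothetical: narrow this to transient storage errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: the caller fails the user's request
            sleep(base_delay * 2 ** attempt)  # pause doubles after every failure
```

The `sleep` parameter exists only so the pause can be replaced in tests; in production code the default `time.sleep` would be used.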

Accordingly, after a few meetings and two weeks of email threads with everyone on CC, we can multiply everything by everything and decide that we can offer, for example, 99.9% availability of the service during the month. Having formulated such an SLA, we tell our users: “our service is reliable, use it, everything will be fine, and if not, we will fix it very quickly, DON'T PANIC”.
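"Multiplying everything by everything" can be made concrete with a rough sketch. It assumes the dependencies fail independently and that the service is up only when all of them are up. It also treats the storage SLA, which is stated as a share of requests, as if it were a share of time, so the result is an estimate rather than a guarantee. The function name `composite_availability` is made up for this illustration.

```python
import math


def composite_availability(*availabilities: float) -> float:
    """Availability of a service that needs every dependency to be up at once.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


# Cloud Services (99.95%) and Storage (99.9%) from the SLAs summarized above
combined = composite_availability(0.9995, 0.999)
print(f"{combined:.2%}")
```

Under these assumptions the naive product is about 99.85%, slightly below 99.9%, so a 99.9% promise relies on retries masking part of the storage failures.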

You publish such an SLA, and after a while, EXTREMELY UNEXPECTEDLY ...

... you realize that you urgently need to publish a fix for some extremely annoying bug. Or you need to change settings at the infrastructure level. Or the service itself has noticed that the load has increased and decided that it needs to issue a scaling command.

All these actions go through an additional management service in the cloud infrastructure (perhaps you use a portal that runs on top of such a service, or a program that sends calls to it). This is a very important service: it is precisely because it exists that clouds are so flexible and convenient to use. And this very important service, at this very important moment, when you really need to do something very urgently, refuses to process your request.

In numerous presentations, screencasts and tutorials you can see this service being used left and right: deploying new virtual machines, publishing a service package, and many other operations. Nobody tells you one important thing: this service is your only way to manage the cloud. As soon as something goes wrong with the management service, you potentially have very big problems.

Let us return to the wording of your SLA. Obviously, operations such as scaling and publishing updates must somehow be accounted for in it. And yes, our service is supposed to process a large (and unknown in advance) number of user images quickly enough, and for that it must be able to scale. These necessary operations require the "auxiliary" management service.

Then it is logical to look at the SLA of this management service in order to understand what to expect from it.

In Windows Azure, the Management API is used to manage the infrastructure (the management portal and cmdlets also work through it). So, open the Management API SLA and ...

... but no, you will not be able to read this document, because it simply does not exist. Amazon EC2 has no SLA for its infrastructure management service either.

Wait ... OH SHI ~

Yes, we very nearly overlooked the complete absence of an SLA for a service that our own service depends on substantially. This is not only about code updates (which look like they can be postponed, but in fact sometimes need to be published very urgently): the ability to scale is needed all the time.

Why is there no SLA for the management service? One can only guess.

One guess is that it is not so easy to make the cloud management infrastructure sufficiently reliable. It is one thing to promise that a particular virtual machine will be reachable over the network, and quite another to promise that it will definitely be possible to scale out to several more nodes.

Another guess is that users do not regard the management service as important and are quite satisfied with the current SLA wording for the “basic” services.

Alternatively, one can assume both at the same time ~~(and without bread)~~.

In any case, cloud providers still have room to improve their services, and users should pay closer attention to the dependencies of their own services. Otherwise, an impressive number of nines after the decimal point means nothing.

Dmitry Mescheryakov,
product department for developers

Source: https://habr.com/ru/post/213955/
