How to write a good SLA

How to write a good SLA (Service Level Agreement), and what makes an SLA good.

This article is an attempt to summarize accumulated experience; I also intend to refer to it whenever I am asked again what an SLA should look like. Having worked in the industry for more than a decade, I am surprised at how regularly I run into a serious lack of understanding of the basics on which an SLA is built. Probably this is because the document is rather exotic. I hope that after reading this text everything will fall into place for the thoughtful reader. Target audience: those who write SLAs and those who sympathize with them.

I will use the name SLA, Service Level Agreement. It came from ITIL / ITSM and stuck. This document is the cornerstone of current approaches to organizing IT functions. It is also one of the key documents when we either want to hire an external contractor for some services or want to make internal units as autonomous as possible, that is, in effect, treat them as an internal outsourcer. ITSM approaches are fairly universal and, with some adaptation, can be applied quite far outside IT. Still, in what follows my examples will assume that we have some IT system and that the service is its support, simply because this task is typical and occurs everywhere. You can write an SLA for any other activity in the same way; only the list of services and the evaluation criteria will change.

Next, I will describe (and, where I can, justify) what parts an SLA should consist of and what the information in each part affects. Understanding these cause-and-effect relationships will let you write a good SLA. Looking ahead: a good SLA is one that lets you steer the process.

Introductory part of the SLA

In the introductory part of the SLA (and it is not for nothing that I think of it as the defining part), it is worth establishing what exactly we are talking about.

It is best to start the SLA with a glossary, a brief description of the system, and the roles of the process participants. We state the name of the system and either the product and vendor it is built on, if it is based on an off-the-shelf product, or the technologies it is built with, if it is home-grown. The usual participants are users, key users, HelpDesk employees (first line of support), and employees of the second, third (and so on) support lines. You can also name the company departments involved in the process and list the roles of their employees.

Next, define the boundaries within which the SLA applies: territorial, temporal, and functional. That is, where the service will be provided (remotely or on site, with addresses) and when (from what time to what time, the work schedule, including weekends and holidays). The section on functional boundaries contains the major version of the system (the one that does not change when updates are installed), the list of system modules (if the system is modular), configurations (if there are different base configurations, as in 1C), and interfaces with other systems. It is better to agree on the interfaces right away: which part of them is covered by the SLA and which is not. If, in addition to the production instance, the SLA covers a test environment or any other copies of the system, this should be stated explicitly.

If an SLA is a service level agreement, then there must be a service. There is no magic here: any service consists of the individual services that make it up. They can be of different types, and all of them need to be listed in the SLA with a minimal but complete description that lets any interested person understand exactly what each service means. It is also useful to give examples for each type of service and to specify typical cases that the service includes and does not include. At the same time, the description of each service should be as compact as possible. We number the services so that they can be referenced conveniently.

To summarize, the general part of the SLA should clearly define the service that the rest of the document deals with. It should make clear what must be done and what must not be done under this SLA. If ambiguities remain, refine the description. Ideally, an outsider who knows the subject (for example, an employee of a specialized consulting company) should read it and say, "Yes, everything is clear!"

Somewhere around here the prologue begins to flow smoothly into the substance. I prefer to define the system of priorities right away, because the service level will most likely depend on priorities; I have yet to see a case where it does not. So this is the natural place for a description of the priorities (including how they are assigned and changed) and of the escalation procedure.

That is all; now we can move on to the substantive part and define the service level. Before writing down the actual numbers, it is worth pausing on a deeply philosophical question: what exactly are we going to write here, and what will ultimately determine how good the SLA turns out to be?

Which SLA is good?

We start with the question "What kind of SLA will we consider good?" It is a very worthy question, and very few people can answer it clearly. Skipping three tons of deliberation and a lot of worn-out tongues, I will go straight to the point.

Why is the SLA treated with such reverence? Why does it stand apart from the heap of documents describing work regulations and other internal policies of IT departments? Because the SLA is a steering document. It not only defines what service we will have and how (this part often just duplicates other regulatory documents), but also defines where we look while the service is being provided and what we want to see there. In essence, this determines the whole nature of the work. The skill with which the process metrics and, most importantly, their target values are chosen determines how the service will be delivered. This is what lets you control the process.

That is exactly what we want from an SLA. The more control it gives you, the better the SLA; the less control, the worse. If there is no control at all, you can throw the SLA away as unnecessary.

Choosing metrics for the SLA

Many great minds have devoted a lot of time and attention to inventing metrics, so it is usually not difficult to choose ones that suit a particular case. Knowledge and understanding of the subject area is key here. Interestingly, some processes do not lend themselves to metrics. For example, a programmer's work cannot be described by good metrics: a programmer can hit any of them to the detriment of the actual goal, that is, game it. And given the nature of the profession, any metric will certainly be gamed. But more about that some other time. For IT system support, everything is somewhat simpler. The usual choices are reaction time (sometimes understood as the time until the request is taken into processing) and the target time to resolve the request. If other generally accepted parameters have historically developed in your organization, take them. You can also study wider experience and pick your own metrics by searching for the keywords "SLA" and "metric".

What is important here? Without going into details (this is a topic for a separate article), metrics should have the following qualities:
(1) reflect the quality of service provided,
(2) be easily measurable,
(3) be as universal as possible (so that they can be used in all of your SLAs),
(4) be few in number.

If there is more than one metric, state explicitly which one is decisive. Otherwise there is a risk that the performer, instead of solving a critical problem, will be busy weighing metrics against each other. If an external contractor is hired for the service, it is violations of the main metric that fines should be tied to.

And finally, last in order (but not least):
(5) the metric should depend only on the work of the performer.

If the metric correlates only weakly with the performer's work, the metric will not work: control is lost, and the SLA does not work.

I will give an example of a bad metric. 99.99% availability of a specific IT system is a poor metric for HelpDesk's work, because HelpDesk has no influence whatsoever on the system's downtime. If the system has "gone down", all HelpDesk can do is pass the information as quickly as possible to the administrator who can "bring it back up". How long that takes (and whether the administrator hurries at all) does not depend on HelpDesk. Punishing HelpDesk for someone else's slow work is cruel and pointless. The only result will be that HelpDesk stops caring about such an SLA at all.

Metric Values

Now I want to show how to approach the choice of metric values competently.

A typical mistake looks like this. Let me describe the situation. Suppose we have a fairly large system (for example, some kind of ERP), and it is supported by:

HelpDesk (also known as 1st-level support), which accepts requests from all users of the company by phone, email, and intranet, registers them as incidents, and passes the incidents to the specialized 2nd-level support groups,
A 2nd-level support group of analysts who know the system from the functional side and can analyze an incident, help users, and identify errors in the system's code / data,
A 3rd-level support group of developers who can fix the code / data in the system and bring in the vendor of the base software if necessary,
The vendor of the base software, which in this scheme is the 4th level of support. Other company units, such as network and infrastructure teams, may also sit at this level.

If your setup looks simpler, do not worry. I am explaining a principle, and the simpler the setup, the easier it is to regulate.

We write in the SLA that a critical-priority problem in our system must be solved in, say, a day. Our reasoning is that users want the problem solved within 24 hours: we asked them, and they confirmed. This is the main metric in our SLA. Is that good or bad? Let us look at it from different sides.

HelpDesk will in any case manage everything that depends on it not in a day but in an hour at most. During the phone call the incident is registered, clarifying questions are asked, and the information is recorded and passed to the 2nd level. An hour is already a generous margin. So HelpDesk pays no attention to the metrics in the SLA. For them the main thing is that everything that arrives today leaves today, and then every SLA is met. But that is how they always work anyway.

Now the 2nd level has received the incident (directly or from HelpDesk) and has until the end of the day to deal with it. Not every incident can be resolved in that time, but most of them, especially critical ones, really are resolved within a day. True, if the incident was left forgotten in HelpDesk until the evening, there is no time left to solve it. Then the 2nd level's metric is broken, although HelpDesk is actually to blame...

But suppose the 2nd level did manage to sort out the incident by the evening, and found that its cause was an error in a report. To understand this, the report had to be run many times with different parameters, and the report is slow, so the work was finished only in the evening. The problem is written up as a request and passed on to the 3rd level.

Now the 3rd level, the developers, if they have not yet gone home, face a dilemma: work through the night, or break the SLA for certain and tackle the problem the next morning. If management pushes the situation by hand, the first option will of course be chosen, but I would not call that a normal way of working, because with this approach urgent work always lands in the evening. And this is the result of intense (and, most importantly, good) work by the colleagues at the 2nd level.

Debriefing. What do we see in light of our SLA? For HelpDesk and the 3rd level the SLA does not work; it works only for the 2nd level.

What happens if we increase the target resolution time to a week? Now we can start demanding that the 3rd level meet this SLA. On the other hand, the SLA has stopped working for the 2nd level: why hurry, we will manage tomorrow, or the day after. As a result, problems will land on the 3rd level on the last day of the allotted week, the 3rd level will be outraged, and (if common sense unexpectedly wins) the week in the SLA will be split into 2 days for the 2nd level and 3 days for the 3rd, or something similar. The 2nd level will of course relax, because it now clearly has more time to waste. HelpDesk no longer looks at the SLA at all; they could not break such an SLA even if they wanted to, since users would bite their heads off long before that. And the total time to solve problems grows. Somehow that is not very good.

And what would make the SLA start working for HelpDesk? Probably reducing the time, down to one hour. But then neither the 2nd nor the 3rd level can meet the SLA even in principle. They will stop looking at the SLA altogether, because from their point of view it contains nonsense. And gradually everyone will quit, because they cannot do their job well, and nobody likes that.

What to do? Draw conclusions. If we want control, we need to allocate a target working time to each level of support: give HelpDesk an hour, the 2nd level a day, and the 3rd level three days. Within this time each level must complete its part of the task. While others are working on the problem, one clock stops ticking and another starts. Now everyone watches their own time and does not waste it. Full control. When an additional level of support is brought in, the total time should increase, reflecting the depth of the problem found. If someone needs to be pushed harder, this can be done in a targeted way. For example, if everything really must fit into one day, we split it into 30 minutes for the 1st level, 4 hours for the 2nd, and 19.5 hours for the 3rd. In some cases these requirements may be unnecessarily strict; I explain the harm of excessively strict requirements later. But now the SLA gives us control and works, because the metric makes it easy to see who is not doing their part well. If you are writing an SLA for multi-level support, always specify the metrics separately for each level.
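The per-level clocks described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the level names, the `handoffs` structure, and the function names are hypothetical, and time is counted as plain calendar time rather than working time.

```python
from datetime import datetime, timedelta

# Per-level targets from the example in the text.
# Assumption: measured in calendar time, ignoring working hours.
TARGETS = {
    "helpdesk": timedelta(hours=1),
    "level2": timedelta(days=1),
    "level3": timedelta(days=3),
}


def time_per_level(handoffs, closed_at):
    """Sum the time each level held the incident.

    handoffs: list of (timestamp, level) in chronological order; each entry
    marks the moment that level received the incident.
    """
    spent = {}
    # Each level's clock stops when the next level's clock starts.
    ends = [ts for ts, _ in handoffs[1:]] + [closed_at]
    for (start, level), end in zip(handoffs, ends):
        spent[level] = spent.get(level, timedelta()) + (end - start)
    return spent


def breached_levels(handoffs, closed_at, targets=TARGETS):
    """Return the levels whose own clock exceeded their own target."""
    spent = time_per_level(handoffs, closed_at)
    return sorted(lvl for lvl, used in spent.items() if used > targets[lvl])


handoffs = [
    (datetime(2024, 1, 8, 9, 0), "helpdesk"),
    (datetime(2024, 1, 8, 9, 20), "level2"),
    (datetime(2024, 1, 9, 15, 0), "level3"),
]
closed = datetime(2024, 1, 10, 12, 0)
print(breached_levels(handoffs, closed))  # ['level2']
```

Here HelpDesk spent 20 minutes and the 3rd level 21 hours, both within target, while the 2nd level held the incident for almost 30 hours. The report points at exactly one level, which is the kind of control the text is asking for.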

Separately, I will answer the objection "but the users told us one day" and what to do about it. Users of IT systems very rarely have enough competence to implement, configure, develop, and maintain the very system they use. That is what IT departments are for: by the nature of their work they must understand and provide exactly what users actually need. So if your users named a day as the time to solve a critical problem, it means you asked them badly. Of course they wanted critical problems solved faster, say in an hour. Better still, the moment the problem occurs. Some especially clever ones may even demand that problems be solved before they occur, and yes, best practices do talk about proactive work. But then, in the event of complete success, nobody will see the IT department's work, and all IT employees will be fired. So people go to such lengths only where there is really no way without it (for example, in life-support systems). Do not hide behind the incompetence of users; do your job instead: explain to users that a simple problem will be solved not in a day but faster, and that a complex one will take longer, and there is no getting away from that. And, by the way, in most cases the solution will still arrive the same day.

One more observation. If the tasks vary widely in nature and in the time needed to solve them, and they cannot be split into different services, then it makes sense not to put a maximum time per task into the target metrics but to switch to statistical targets. For example: 80% of requests will be resolved within a day. The alternative is to allow the deadline of an individual task to be changed by agreement.
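A statistical target like this is easy to check mechanically. The sketch below assumes resolution times are already available as `timedelta` values; the function names and the treatment of an empty period are my own assumptions for illustration.

```python
from datetime import timedelta


def share_within_target(durations, target):
    """Fraction of requests resolved within the target time."""
    if not durations:
        # Assumption: a period with no requests counts as fully compliant.
        return 1.0
    on_time = sum(1 for d in durations if d <= target)
    return on_time / len(durations)


def meets_statistical_sla(durations, target, required_share=0.8):
    """True if at least `required_share` of requests met the target."""
    return share_within_target(durations, target) >= required_share


# Five requests, resolved in 2, 5, 20, 23 and 30 hours.
durations = [timedelta(hours=h) for h in (2, 5, 20, 23, 30)]
print(share_within_target(durations, timedelta(days=1)))    # 0.8
print(meets_statistical_sla(durations, timedelta(days=1)))  # True
```

One slow request out of five does not break the SLA here, which is the point of a statistical target for heterogeneous tasks.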

Why excessive requirements are dangerous

Now about the dangers of excessively strict metrics and of any other service parameters in the SLA. It is simple to the point of banality: tightening metrics increases the cost of the work. Dixi. Let me explain with examples.

Example 1. Suppose requests for some service take 4 hours to resolve on average. We also know that a performer with a high level of expertise can resolve such requests in 2 hours. What happens if we write 2 hours instead of 4 in the SLA? The performer will have to be an expert, and that costs more. On top of that come problems with motivating the expert later: on the one hand, an expert gets bored doing the same thing over and over, and on the other, there are plenty of places actively trying to hire them. In my many observations, the price of the service rises by a factor of one and a half to two.

Example 2. What happens if, in the same situation, we write 1 hour in the SLA (or 1 minute, which is the same in this context), that is, make the time unrealistic? To the previous price increase, feel free to add the expected penalties for missed deadlines and multiply the result by a risk factor of, say, 1.5-2. And, worst of all, the SLA stops working. There is no point in doing this.

Example 3. Instead of 8x5 mode (8 hours a day on weekdays) we want 24x7. The price tag immediately rises two to three times. And that is only if an on-call shift is enough, one that covers nights and weekends and phones the actual performers if something happens. If you need real continuous work in 24x7 mode, the price tag will be five times higher, if not more. Why? Because of three shifts, weekends and holidays, plus cover for vacations and sick leave. Moreover, qualified staff may refuse to work a non-standard schedule, and this gap in expectations will also have to be closed with money. Do you really need 24x7?
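Simple coverage arithmetic gives a rough feel for where a figure like "five times" can come from. The sketch below compares the hours to be covered with what one person on a 40-hour week provides. The 15% absence allowance is purely an illustrative assumption, not a figure from this article, and headcount is only a crude proxy for price.

```python
def coverage_multiplier(hours_per_day, days_per_week,
                        base_hours_per_week=40.0, availability=1.0):
    """How many 40-hour positions are needed to cover the given schedule.

    availability: share of paid time a person is actually at work
    (assumption: below 1.0 to account for vacations and sick leave).
    """
    required_hours = hours_per_day * days_per_week
    return required_hours / (base_hours_per_week * availability)


print(coverage_multiplier(8, 5))                       # 1.0
print(coverage_multiplier(24, 7))                      # 168 / 40
print(coverage_multiplier(24, 7, availability=0.85))   # with absence cover
```

Even before any premium for night and weekend work, continuous 24x7 presence needs a little over four positions per seat, and closer to five once absences are covered.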

Example 4. Do we want the performer constantly present in our office, so that we can see how they work and whether they are working, heaven forbid, on something else on the side? Then they can no longer help colleagues or take part in other projects in parallel, and they cannot be an employee from a regional office. Finally, the performer will have to follow our dress code and lose time commuting to us. In total, twice as expensive. Along the way we have closed off the option of using additional resources at peak loads and bringing in experts with the right qualifications as needed, which would have happened naturally if the contractor worked remotely. Or maybe it would not have happened, but now it definitely will not.

Any other wishes, especially ones unrelated to the core work, will also be priced and added to the cost. And the further a wish is from the core work, the higher it will be priced. Feather plumes, Gypsies with bears: all of it can be arranged, but all of it will be included in the price with an extra margin for unreasonableness. Up to a certain threshold, after which people will start carefully steering clear of your quirks.

I think these examples show that it is very much worth thinking about what is really important to have in the SLA and what is a whim. And if users insist on some whim, simply cost the service both ways and ask whether they are ready to pay for it. Sometimes, by the way, they will be. And sometimes it will turn out that what looked like a whim was not one, which is also useful.

Note in particular that all the examples above apply to both external and internal performers. So do not comfort yourself with the hope that an outsourcing provider will somehow do it cheaper. Yes, a provider may undercut on price for reasons of its own, but if the work is unprofitable, it will switch to something more promising. Or you will get another retelling of the fairy tale about the furrier and the seven caps.

How, then, do you choose the right parameters? It is best to put yourself in the performer's place, estimate the typical tasks, and work out a reasonable, sufficient time to solve them. Take these parameters for the SLA. Then watch how the SLA works in real life and make adjustments.

I will say explicitly, for those who have not guessed: if tightening requirements raises the price, then relaxing them obviously lowers it. This, too, can and should be used.

Final Wishes

Complete the SLA with references to other documents describing the process: policies and regulations. It also helps to state which systems are used for handling requests (inquiries, incidents, problems) and to reference the rules for working with them.

Important documents, and the SLA is undoubtedly one, should have a standard section with the version history, the process owner, and an approval sheet.

What if there are many systems? If you write a separate SLA for each, you end up with a thoroughly confusing zoo, so you need to unify. It is worth making the SLA metrics and their values universal from the start, so as not to reinvent the wheel for each IT system in the company; it also becomes easier to keep track of what is going on and to compare the situation across systems. Large companies have dozens, if not hundreds, of different IT systems. Established international practice is to divide all systems into classes (Mission Critical, Business Critical, etc.) and define the metrics per class. There may be individual exceptions, but most systems can be covered by a universal SLA in this way.
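As a data structure, the class-based approach is a small lookup: targets are defined once per class, each system is assigned a class, and exceptions are listed separately. In the sketch below the class names come from the text, while the system names and all target values are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Targets:
    reaction: timedelta
    resolution: timedelta


# Metrics are defined once per class (values are illustrative only).
CLASS_TARGETS = {
    "Mission Critical": Targets(timedelta(minutes=30), timedelta(hours=8)),
    "Business Critical": Targets(timedelta(hours=1), timedelta(hours=24)),
}

# Each system is assigned to a class (system names are hypothetical).
SYSTEM_CLASS = {
    "ERP": "Mission Critical",
    "Intranet portal": "Business Critical",
}

# Individual exceptions, kept explicit and few.
OVERRIDES = {
    "Intranet portal": Targets(timedelta(hours=2), timedelta(hours=24)),
}


def targets_for(system):
    """Return the targets for a system: its exception if any, else its class."""
    if system in OVERRIDES:
        return OVERRIDES[system]
    return CLASS_TARGETS[SYSTEM_CLASS[system]]


print(targets_for("ERP").resolution)            # 8:00:00
print(targets_for("Intranet portal").reaction)  # 2:00:00
```

Adding a new system then means assigning it a class rather than writing a new set of metrics.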

And finally. Since the SLA is a steering tool, it should be used as a tool, that is, reviewed regularly. The frequency depends on the subject area; once a year is usually a good first approximation. A review will not necessarily result in changes: it may turn out that the service suits all interested parties completely. Or you may need to adjust the metrics or make corrections to reflect changes that have occurred. Then the SLA will always stay relevant.

Appendix. SLA prototype

Below I have collected everything above into a template so that you can copy the structure and adapt it to your conditions. Starting from scratch is harder.

SLA template

Service information
Document owner, approval sheet, version history.
I. Introduction.

The XXX system based on the UUU product version 1.2.3.

List of system components
The system consists of modules:
first module
second module
interface for loading orders
test copy of the production system

Service boundaries

Services are provided on site at the following addresses:
Moscow, Red Square, 1
Gadyukino village, Lenin St., 2
and remotely to all users of the XXX system.

Services are provided from 09:00 to 18:00 Moscow time on weekdays, Monday to Friday, excluding public holidays of the Russian Federation.

List of services
1. Processing requests
2. Incident resolution
3. Fixing errors in system code / data
4. Consultation
5. Directory changes (see annex)
6. Monitoring free disk space

Performing user actions in the system, including non-standard data extracts (ad-hoc reports), is not a service.
II. Service level

Priorities

Priorities are defined as follows.
Highest (all hands on deck) - everyone involved drops what they are doing and starts solving this problem. Work usually proceeds in emergency mode (around the clock).
High - the problem is critical, but not critical enough to switch to emergency mode.
Normal - the problem is serious, but a manual or other workaround exists.
Low - should be resolved, but is not critical.

Use of priorities, escalation procedure (attracting attention) ...

Key Performance Indicators (KPIs)

Example for service No. 2 "Incident resolution":
Priority    Reaction time    Resolution time
Highest     1 hour           24 hours
High        1 hour           8 working hours
Normal      2 hours          5 working days
Low         1 working day    22 working days
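Targets expressed in working time, such as 8 working hours, only make sense together with the service window from the "Service boundaries" section. As a rough sketch, a deadline in working time could be computed as follows. Assumptions: the 09:00 to 18:00, Monday to Friday window from this template, no lunch break, and public holidays ignored; the function name is hypothetical.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)   # service window from the template
WORK_END = time(18, 0)


def add_working_time(start, amount):
    """Return the moment when `amount` of working time has elapsed.

    Assumption: Monday to Friday only; public holidays are not handled.
    """
    current = start
    remaining = amount
    while True:
        day_start = datetime.combine(current.date(), WORK_START)
        day_end = datetime.combine(current.date(), WORK_END)
        if current.weekday() < 5 and current < day_end:
            # Clock only runs inside the service window.
            current = max(current, day_start)
            available = day_end - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Jump to the start of the next calendar day.
        next_day = current.date() + timedelta(days=1)
        current = datetime.combine(next_day, WORK_START)


# A High-priority incident registered on Friday at 16:00.
registered = datetime(2024, 1, 12, 16, 0)
print(add_working_time(registered, timedelta(hours=8)))
# 2024-01-15 15:00:00
```

Two working hours are used up on Friday, and the remaining six run from 09:00 on Monday, so the deadline falls on Monday at 15:00.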

KPI target values

Metric Calculation Algorithm

Target metric value (example):
80% of incidents must be resolved within the target time.


UPDATE: what did not fit into the main article:
SLA philosophy: what is escalation and why is it needed?
SLA philosophy: about request priorities

Source: https://habr.com/ru/post/336868/
