Ensuring data and service availability: RPO and RTO targets and SLA planning

Today I will try to clarify what the concept of data availability means from the point of view of an IT specialist, be it an IT administrator, a system integrator, an implementation consultant, and so on. I hope this article will help readers prepare the business case for introducing appropriate software and/or hardware solutions, as well as service level agreements (SLAs), and will help someone make these documents more convincing.
To begin with, as a reminder, I will formulate two postulates with which many readers, I am sure, are already well acquainted:

RPO (recovery point objective) - acceptable data loss. Any information system should provide (by built-in or third-party means) protection of its data against loss above an acceptable level.
RTO (recovery time objective) - acceptable recovery time. Any information system should provide (by built-in or third-party means) the ability to restore its operation within an acceptable time.

This pair of indicators is often shown as a one-dimensional graph along the time axis.
But such a one-dimensional graph leaves out the most important thing that a business is guided by: money! Below I will explain how to calculate the RTO and RPO based on the requirements of the business.

Let's start from the very basics, that is, from a straight line along the time axis, where:

The point marks the event: a system failure.
A target RPO is noted to the left of this point (that is, in the past).
To the right (i.e. in the future) the target RTO value is noted.

(These are target values: each specific system should have its own specific values.)
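As a minimal sketch of this timeline, the two targets can be laid out around the failure point as follows. The function name `recovery_window` and the sample date and values are illustrative assumptions, not taken from any product.

```python
from datetime import datetime, timedelta


def recovery_window(failure_time: datetime, rpo: timedelta, rto: timedelta):
    """Return (oldest acceptable recovery point, deadline for restoring service)."""
    oldest_recovery_point = failure_time - rpo  # to the left of the failure (the past)
    restore_deadline = failure_time + rto       # to the right of the failure (the future)
    return oldest_recovery_point, restore_deadline


# Hypothetical failure time and targets, for illustration only.
failure = datetime(2024, 1, 10, 12, 0)
oldest, deadline = recovery_window(
    failure, rpo=timedelta(hours=24), rto=timedelta(minutes=5)
)
print("oldest acceptable recovery point:", oldest)
print("service must be back by:", deadline)
```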

Clearly, the systems in a company do not run for their own sake but serve various needs and goals. The company itself earns (and spends) money. When a system fails, the company obviously loses money. The RTO and RPO values indicate what level of these losses is acceptable.

Therefore, a second dimension is added to the graph: the financial one (here it is, money, $):

This graph already shows that the cost of service downtime grows with time: the longer the system is down, the more money the company loses.

The same goes for the cost of data loss: the further back in time the lost data reaches, the more the loss costs the company. And yes, in real life these two graphs are not symmetrical.

As a rule, these values do not change linearly, as the picture reflects. Most often there comes a moment when the cost of the loss starts to rise sharply. Hence the very sad stories of companies that lost so much from a system failure that some never managed to return to business.

To protect against such problems, you need to implement a system that provides protection against data loss and disaster recovery. Such systems have their own cost, so they can also be shown on the graph (let's draw them in blue):

As you can see from the graph, the lower the RPO and RTO that a protection solution delivers (that is, the less data is lost and the shorter the service downtime), the more expensive such protection is.

Determining the break-even point of protection solutions

And here we see that the curves on the graph intersect; I have marked these points with green arrows. These are the so-called break-even points for the protection system and for the protected information system. Moving away from such a point, we get either an expensive protection system whose cost exceeds the cost of the loss / downtime or, conversely, a cheap protection system that does not keep losses at an acceptable level.
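The intersection described above can be sketched numerically. Both curves below are invented purely for illustration (a loss that grows faster than linearly and a protection cost that falls as the allowed recovery time grows); real curves must come from the business and from vendor quotes. The crossing point is found by simple bisection.

```python
def downtime_loss(minutes: float) -> float:
    # Hypothetical loss curve: grows faster than linearly with downtime.
    return 100.0 * minutes ** 2


def protection_cost(minutes: float) -> float:
    # Hypothetical cost curve: the shorter the recovery time, the pricier the protection.
    return 50_000.0 / minutes


def break_even(lo: float, hi: float, iterations: int = 100) -> float:
    """Bisection: find the recovery time at which loss equals protection cost.

    Assumes the loss is below the cost at `lo` and above it at `hi`.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if downtime_loss(mid) < protection_cost(mid):
            lo = mid  # protection still costs more than the loss it prevents
        else:
            hi = mid  # the loss already exceeds the cost of protection
    return (lo + hi) / 2


rto_break_even = break_even(0.1, 1000.0)
print(f"break-even recovery time: {rto_break_even:.2f} minutes")
```

The same approach applies to the RPO side of the graph, with a data-loss cost curve in place of the downtime curve.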

The conclusion seems to suggest itself: we should select the systems that will give us the necessary protection by aiming precisely at the break-even point.

In fact, if we build such graphs using real-life data, we get a slightly different picture. In particular, the graph of the cost of the protection solution will look not like a solid line but like a set of points. Different protection solutions do not line up one after another along the curve; they are separate points, because each has its own “coordinates”: its cost (set by the vendor of the solution) and the data loss (RPO) and recovery time (RTO) that the solution can ensure.
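With discrete solutions there is no curve to intersect, so one possible approach is to rank the candidates by how far each one sits from break-even. The sketch below does this for the RTO axis only; the candidate names, prices, recovery times, and the loss curve are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    price: float        # hypothetical vendor price
    rto_minutes: float  # recovery time the solution can deliver


def downtime_loss(minutes: float) -> float:
    # Hypothetical loss curve; replace with figures from the business.
    return 100.0 * minutes ** 2


def distance_from_break_even(candidate: Candidate) -> float:
    # Gap between what the protection costs and the loss at its recovery time.
    return abs(candidate.price - downtime_loss(candidate.rto_minutes))


candidates = [
    Candidate("solution A", 30_000.0, 2.0),   # fast but expensive
    Candidate("solution B", 8_000.0, 10.0),
    Candidate("solution C", 2_000.0, 60.0),   # cheap but slow
]

closest = min(candidates, key=distance_from_break_even)
print("closest to break-even:", closest.name)
```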

In addition, as a rule, a solution is sought to protect not one specific information system (IS) but a group (or all) of the company's information systems, that is, the entire infrastructure. And each such system is likely to have its own graphs of the cost of downtime / data loss over time.

It turns out that our break-even points are no longer points, but areas:

If we look at our infrastructure more closely and start building graphs for each IS, we will see an interesting trend: systems group together with similar ones. More on this below.

Considering different classes of solutions

Note that up to this point I have been talking about “protection” without specifying what kind of protection: backup, a cluster, or something else? Protection systems differ, and they can be classified.

The diagram below shows approximately which class of solution is recommended, depending on the target RTO / RPO.

Of course, the picture is quite schematic. In reality, there are no clear boundaries between the types of solutions, nor exact values in the form of points.

For example, many backup solutions now offer technology for starting a service directly from a backup. With this technology, the time to make the service available averages about 2-5 minutes per VM. Such figures fall within the RTO range of replicas or even clusters.

A few words about clusters

Clusters, as well as DR solutions (and, in general, almost all solutions for protecting against data loss or restoring operation), have their own figures for recovery speed and for the amount of data lost. Therefore, they too are tied to RTO / RPO values.

Speaking, for example, about an HA cluster (HA - High Availability), we mean that its RTO equals the failover time. Suppose a two-node MSCS cluster fails over a DBMS in 30 seconds. Then the target RTO that this kind of cluster can serve is 30 seconds or more.

And what about VMware HA, which takes 2 minutes to act (including the start of the virtual machine, its guest OS, and applications)? This solution is therefore suitable for applications with a target RTO of 2 minutes or more.

Where, you may ask, are the losses in an HA cluster (and, accordingly, its RPO)? When the service comes back up, a small amount of data may be lost. For example, the DBMS checks the state of the database and may roll back a transaction that did not complete correctly. Or the file system reverts a file that was not saved correctly to an earlier version, and so on.

Conclusion: it is not always worth stacking similar solutions on top of one another, for example, HA over HA. This only complicates the infrastructure unnecessarily and makes such systems harder (and more expensive) to support.

Returning to the two HA examples above: determine what real RTO the application needs. For values of 2 minutes or more, there is no point in spending money on an HA cluster for services inside the VM.
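The comparison of the two HA examples can be sketched as a simple filter. The recovery times are the ones quoted in the text (30 seconds for the MSCS failover, 2 minutes for the VMware HA restart); the dictionary and function names are illustrative.

```python
# Recovery times in seconds, taken from the two HA examples in the text.
RECOVERY_TIME_S = {
    "MSCS failover": 30,
    "VMware HA restart": 120,
}


def suitable_solutions(target_rto_s: int) -> list[str]:
    """Return the solutions whose recovery time fits within the target RTO."""
    return [name for name, t in RECOVERY_TIME_S.items() if t <= target_rto_s]


print(suitable_solutions(60))   # a 1-minute target
print(suitable_solutions(120))  # a 2-minute target
```

Once the target reaches 2 minutes, VMware HA alone already qualifies, which is why an additional in-guest cluster adds cost without improving the RTO.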

Let's pay attention to a number of factors:

Different availability systems can solve different problems and cover different risks (risk management is a separate topic). Even different clusters can cover different potential problems and complement each other.

For example, backing up a mail server does not exclude, but complements, the use of an HA cluster for mail servers. The cluster protects against the failure of a physical server and provides fast switching to a standby server. But the cluster does not protect against data loss (accidentally deleted data, a VM that cannot start after a hardware failure, etc.). That requires backup.
Clusters themselves can also be designed to protect against different failures and complement each other. For example, a Microsoft Exchange DAG cluster (HA) protects not only against the failure of one of the cluster's compute nodes (the server itself) but also against the failure of a server's disk, because the data is duplicated on other nodes.

What does adding VMware vSphere HA on top give? Quick restoration of protection. If a server hosting one of the MS Exchange nodes simply shuts down, the DAG acts first, switching services to another node, and then VMware HA restarts the failed server on another host of its cluster. And the system is ready to go again. (Although in this example I would consider virtualization not only for this one cluster function but also for all the other advantages of the platform itself.)
I also marked archiving solutions on the chart above. Note that for archives it makes no sense to consider an RTO, since such solutions are used to recover old historical data. For such historical data, it is the RPO that must be ensured. That is, in this case we are talking about the depth and long-term storage of data not used in the company's current operations.

So the combined use of various solutions is a good thing. The main thing is to be sensible and understand what each solution is for.

Speaking and writing correctly! Or, once again, about RTO and RPO

I want to emphasize this, since I occasionally make this mistake myself, and so I am warning you too:

RTO ≠ recovery speed!
RPO ≠ amount of lost data!

RTO and RPO are target values for information systems (IS), the maximum limits within which we must stay. And these target values are communicated to us, the IT specialists, by the business, more precisely by the business owners of the corresponding IS, and not the other way around.

That is:
You cannot say that the RTO is 2 minutes because Instant Recovery takes 2 minutes.
You cannot assume that a backup once a day means an RPO of 24 hours.
Everything goes in the opposite direction, that is, from the business. Specifically for the RTO, it will sound like this:

For a certain service, if the system supporting this service fails, recovery must be ensured such that the service is down for no more than 5 minutes (RTO - 5 minutes). This means that a suitable solution is one that makes the system available in less than 5 minutes.

Or for RPO:

For a database, in the event of a DBMS failure, recovery must be ensured with an acceptable data loss of no more than 24 hours counting back from the moment of failure. This means that a suitable solution is one that guarantees restoration of the database from recovery points created more often than once a day. Note also that backing up once an hour, which creates 24 recovery points a day, gives a stronger guarantee of recovery than copying once a day, which creates only 1 point.
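The relationship between the backup schedule and the RPO can be sketched as follows. This is a simplified model under stated assumptions: it ignores how long a backup takes to complete and assumes every backup succeeds. The function names are illustrative.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: the failure happens just before the next recovery point,
    # so up to one full interval of data is lost. Following the text,
    # points must be created more often than the RPO (strict comparison).
    return backup_interval < rpo


def recovery_points_per_day(backup_interval: timedelta) -> int:
    # Dividing one timedelta by another with // yields a whole number.
    return timedelta(days=1) // backup_interval


rpo = timedelta(hours=24)
for interval in (timedelta(hours=24), timedelta(hours=1)):
    print(
        f"every {interval}: meets RPO = {meets_rpo(interval, rpo)}, "
        f"points per day = {recovery_points_per_day(interval)}"
    )
```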

Here is a practical example.

Suppose the business says the following: “This is a very important, critical service. If it is down for more than 5 minutes, financial losses of such-and-such a sum will occur, and after 30 minutes of downtime, 10 times more! And that is no longer acceptable for the company.”

It would seem that you can reason like this:

"Using Instant Recovery will provide a technical recovery process in 2 minutes ..."

But! At the same time, you need to keep several points in mind:

First of all, we need to detect the moment of failure.
Determine the consequences of the failure: is everything broken, or is something still available?
It is advisable to find the cause of the failure, or at least localize (isolate) the problem: if there is a fire, put it out before trying to restore anything on the same infrastructure; if a virus is corrupting data, disconnect the infected server from the network so that the restored server is not fed to the virus.
Next, choose the method of resuscitation (the most suitable recovery procedure: reboot the VM, wait for the cluster to fail over to another node, or restore from a backup).
Decide to recover and actually run the recovery.

All this affects the total recovery time.

Therefore, we reason further:

“I have set up monitoring for this service, so an alert will fire and be noticed within a minute or two (I have to take the phone with the SMS out of my pocket, or open the mail client with the new message). I will sit down at the computer, ping the service, try to open a console on it, and see what is going on with the hypervisor and the hardware. I will perform first aid (try to reboot the machine). All of this will take me about 15 minutes. If quick resuscitation does not help, I will recover from the backup. But since copying the data from the backup would take another 15-20 minutes, I will use Instant VM Recovery, which takes 2 minutes, and then start migrating the machine's data to production online.”

As you can see, we can hardly fit into 5 minutes.

Now let's think: maybe we need an HA cluster with a recovery time of 2 minutes? But it will not protect us against all types of failure: the machine may well restart into a BSOD, the VM disk is also a point of failure, and so on. Therefore, additional protection is needed. So we continue our reasoning:

“With a cluster, I will recover in 2 minutes. In addition, if I have to restore from a backup, I will spend the 15 + 2 minutes already estimated, for a total of 2 + 15 + 2 = 19 minutes, which still leaves 11 minutes in reserve.”
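The arithmetic above can be checked with a short sketch. The 5-minute and 30-minute limits come from the business statement earlier in this example; the constant names are illustrative.

```python
SOFT_LIMIT_MIN = 5   # losses begin after 5 minutes of downtime
HARD_LIMIT_MIN = 30  # after 30 minutes the losses are unacceptable

CLUSTER_FAILOVER_MIN = 2   # HA cluster switches to another node
DIAGNOSIS_MIN = 15         # noticing the alert, checks, first aid
INSTANT_RECOVERY_MIN = 2   # starting the VM directly from the backup

# Best case: the cluster alone handles the failure.
cluster_fits_soft_limit = CLUSTER_FAILOVER_MIN <= SOFT_LIMIT_MIN

# Worst case: the cluster does not help and we fall back to the backup.
worst_case_min = CLUSTER_FAILOVER_MIN + DIAGNOSIS_MIN + INSTANT_RECOVERY_MIN
reserve_min = HARD_LIMIT_MIN - worst_case_min

print(f"cluster within {SOFT_LIMIT_MIN} min: {cluster_fits_soft_limit}")
print(f"worst case: {worst_case_min} min, reserve: {reserve_min} min")
```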

To sum up, your answer to the business owner of the IS will be:

"OK. I will provide an RTO of 30 minutes. I will put this service in a cluster to ensure a 5-minute RTO, and set up backup to protect against failures with more serious consequences."

Very important! The most important advice: once you have agreed on specific targets with the owner of the IS, be sure to record the agreement in writing by signing a service level agreement (SLA) with them.

Why do we call this the “availability concept”?

Have you noticed that I keep writing “data recovery and recovery of service”? Most often I put them together in one sentence. These are two interconnected things that can almost never live without each other.

We restored the service but lost all its data: this is unacceptable. We restored the database, but the DBMS does not start and cannot read the data: this is unacceptable too. That is why we are talking about the availability of both data and services. Both the RPO and the RTO are important: together, they ensure the availability of both.

Access to the service, with its data for the entire previous period of operation (up to 1 hour inclusive), restored 15 minutes after the failure: this is what overall availability is all about.

Such is the dualism ;) Together, the RTO and RPO pair is an important indicator in that very Service Level Agreement (SLA) for a specific IS, in terms of ensuring the availability of its services and data in the event of a failure. And the agreement is signed, as I said above, between the owner of the IS (the customer of the service) and you, the IT department (the service provider).
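To illustrate how the pair works together, here is a minimal sketch that checks an incident against both targets at once. The class and function names are hypothetical, the 30-minute RTO is taken from the example above, and the 1-hour RPO is assumed purely for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class AvailabilityTargets:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


def incident_within_sla(
    targets: AvailabilityTargets, downtime: timedelta, data_loss: timedelta
) -> bool:
    # Both conditions must hold: the service is back in time
    # AND the restored data is recent enough.
    return downtime <= targets.rto and data_loss <= targets.rpo


# Assumed targets: RTO from the example above, RPO chosen for illustration.
targets = AvailabilityTargets(rto=timedelta(minutes=30), rpo=timedelta(hours=1))

within = incident_within_sla(
    targets, downtime=timedelta(minutes=15), data_loss=timedelta(hours=1)
)
too_slow = incident_within_sla(
    targets, downtime=timedelta(minutes=45), data_loss=timedelta(minutes=10)
)
print("15 min downtime, 1 h of data lost:", within)
print("45 min downtime, 10 min of data lost:", too_slow)
```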