A few words about planning a recovery strategy

Pause for a moment and answer honestly: how critical is one minute of downtime for your service? Got an answer? I suspect most readers, if not all, thought: "We'll survive." Now, how critical is five minutes of downtime? Thirty minutes, an hour, a day? At some step a voice in your head will say: "No, that is already too much." You have just set one of the key parameters needed to draw up a continuity plan for an IT service. What this parameter is, and how best to approach it, is covered below.

Everything fails sooner or later. As a provider of dedicated server rentals, we regularly see how different customers handle keeping their services running and restoring them after a failure. And we have come to a sad conclusion: despite everything that has been written and said about backing up data and hardware, some projects still have no worked-out recovery strategy at all. When something happens, they panic, pull in employees at random, and sometimes blame everyone and everything.

“Business continuity planning (also sometimes called business continuity and resiliency planning) determines the extent to which an organization is exposed to internal and external threats, and specifies the necessary hardware and software tools to effectively counter and restore the normal functioning of an organization while maintaining competitive advantage and system integrity” (Elliot et al., 1999).

The term was originally introduced for more "serious" cases: disruptions to offices or data centers caused by fires, natural disasters, criminal acts by third parties and other events that usually occur far less often than, say, a hard disk failure. The British Standards Institution has even issued a dedicated standard for business continuity management, BS 25999. We will not go that far; we will just try to help you work out for yourself how, and how thoroughly, you should prepare for possible interruptions.

What are you ready to lose?

Any business involves risk. For a business to succeed, its risks must not be left to live a life of their own; they need to be managed. IT projects and services hosted online face a characteristic set of risks that lead to temporary unavailability. Each of them can be described numerically by parameters such as the probability of occurrence, the duration of the impact, and the cost of fully or partially mitigating its effects.
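As a rough illustration of describing a risk with numbers, the sketch below multiplies an expected frequency, an outage duration and an hourly loss figure to get an expected yearly loss. The formula, the hourly loss figure, the `Risk` class and every number in it are assumptions made for the example, not data from our practice.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One source of downtime; every field is an estimate you supply yourself."""
    name: str
    incidents_per_year: float  # expected frequency of the event
    downtime_hours: float      # expected outage per incident
    loss_per_hour: float       # money lost per hour of downtime


def expected_annual_loss(risk: Risk) -> float:
    """Expected yearly loss = frequency * duration * hourly loss."""
    return risk.incidents_per_year * risk.downtime_hours * risk.loss_per_hour


# Hypothetical numbers, for illustration only.
risks = [
    Risk("disk failure", incidents_per_year=0.5, downtime_hours=4, loss_per_hour=100),
    Risk("data center outage", incidents_per_year=0.1, downtime_hours=24, loss_per_hour=100),
]

# List the risks from most to least expensive.
for risk in sorted(risks, key=expected_annual_loss, reverse=True):
    print(f"{risk.name}: {expected_annual_loss(risk):.0f} per year")
```

Ranking risks this way gives a first idea of where money spent on mitigation is likely to pay off.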

When an emergency occurs, there are three main things you can lose: data, time and money. Related problems such as reputational damage or lost profits can ultimately be reduced to these three.

The three are closely interrelated. For example, the less time and data you are willing to lose, the more money you need to invest in redundant capacity and backups. Cutting costs while keeping the same recovery time may increase data loss. And so on.

I hope that while reading the introduction you already decided what the maximum acceptable downtime for your project is. In fault tolerance planning, this parameter is called the recovery time objective (RTO): the time within which the normal operation of a service or business process must be restored to avoid serious consequences. Which consequences count as serious is, naturally, also up to you to decide.

The second important parameter to evaluate when planning is the recovery point objective (RPO). This is also a time interval: the maximum period for which the loss of IT service data is acceptable. It is somewhat harder to describe. It cannot simply be called the acceptable amount of data loss, although to a first approximation it is treated exactly that way. Roughly speaking, it is the upper limit on the time from the start of the last available backup to the moment of the failure.
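To make the RPO definition concrete, here is a minimal sketch that checks a backup schedule against an RPO. It assumes backups start at a fixed interval and that a backup only becomes usable once it has finished, so the worst case is a failure just before the next backup completes. The function names and the sample schedules are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Longest possible gap between the start of the last usable backup
    and the failure: one full interval plus the still-running backup."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Hypothetical schedules, for illustration only.
nightly = meets_rpo(timedelta(hours=24), timedelta(hours=2), rpo=timedelta(hours=24))
hourly = meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo=timedelta(hours=2))
print("nightly backup meets a 24 h RPO:", nightly)
print("hourly backup meets a 2 h RPO:", hourly)
```

Note that under these assumptions a nightly backup does not by itself guarantee a 24-hour RPO, because the time the backup takes to complete counts as well.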

There are two more parameters, the actual recovery time and the actual recovery point, but you can only find them out during a drill or in a real incident.

In large companies, the targets are set by dedicated resilience analysts, who then hand the task over to a team of technical specialists responsible for meeting them. That team, in turn, determines what needs to be stored, kept in reserve and set aside for a rainy day, where, and in what quantities.

But if your project consists of you and your programmer or sysadmin, that is no reason to skip this analysis and say it does not apply to you. In our practice there has been more than one case where the complete lack of a well-thought-out monitoring and recovery strategy caused problems ranging from a drop in search engine indexes to a financial service being unavailable for roughly half a day, because all the data was stored on one server and no live replication was in place.

Who is to blame?

First of all, the project manager and the specialists responsible. Providers do everything in their power to ensure maximum uptime, but almost any service agreement will state (perhaps in fine print) that the provider is not liable for interruptions or data loss, whatever the cause. Even if a drunk engineer accidentally formats the wrong server, you should not expect much more than sincere apologies and regrets. And remember the thesis: everything fails sooner or later, even services marketed as ultra-reliable (just recall the large-scale outage of the Amazon cloud).

The safety of your data and the availability of your services should be your concern first and foremost. And you should be the one to answer the next question:

What to do?

Learn from the mistakes of others wherever possible. There is now enough public information to analyze a great many failures and assess the potential weak points of your own project.

The first thing to do on the way to a continuity plan is to get rid of illusions. In our practice there was a customer who simply ignored the need for backups. The automatic backup in the control panel was not working properly? Well, then it is not needed. He sincerely believed that RAID1 would save him. Imagine his surprise when the first disk in the array degraded badly and the second turned out to have a lot of errors in the file table. As you might guess, an attempt to quickly replace the first disk and rebuild the array led to nothing good. Our administrators had to put back the disk that was on its way to complete failure and spend a long, painful time pulling data off it byte by byte. The customer's reason for not making backups surprised us: “This has never happened to me in 6 years of work.” Apparently, the sooner a major data loss happens in an administrator's life, the better for his future projects.

Second, identify the potential threats, their likelihood and how long their impact lasts. How long will it take to switch to a DDoS filtering service? How long will it take to replace a disk or an entire server in your data center? How long will it take to roll out the project in another data center if yours suffers a fire or a flood, or the provider simply ceases to exist? Where will you deploy it, and how quickly will new equipment be provided? If the numbers do not fit into your target RTO, look in advance for other providers whose infrastructure can help you recover. Also decide how much data you are willing to lose, and choose a backup scheme to match.
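The check described above can be sketched in a few lines: list your recovery time estimates per threat and see which ones do not fit into the RTO. The function name, the threat list and all the durations below are hypothetical placeholders to be replaced with your own estimates.

```python
from datetime import timedelta


def threats_exceeding_rto(recovery_estimates: dict[str, timedelta],
                          rto: timedelta) -> list[str]:
    """Return the threats whose estimated recovery time is longer
    than the RTO, slowest recovery first."""
    over = {name: t for name, t in recovery_estimates.items() if t > rto}
    return sorted(over, key=over.get, reverse=True)


# Hypothetical estimates, for illustration only.
estimates = {
    "switch to DDoS filtering": timedelta(hours=1),
    "replace a failed disk": timedelta(hours=3),
    "redeploy in another data center": timedelta(hours=36),
}
rto = timedelta(hours=4)

for name in threats_exceeding_rto(estimates, rto):
    print(f"Does not fit into the RTO: {name} ({estimates[name]})")
```

Every threat this prints is a scenario for which you need a faster procedure, a standby resource, or an agreement with another provider arranged in advance.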

Third: count. As I already wrote, the less time and data you are willing to lose, the more it will cost you. Estimate the one-time and recurring costs of achieving the continuity targets you need. Are you ready to pay the resulting amount? If not, then your data is not as important to you as you thought. Reassess your targets, this time taking your recovery budget into account.
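A simple way to do this counting is to add up one-time and recurring costs over a planning period and compare the total with your budget. The sketch below assumes a 12-month period by default; the option names and prices are invented for the example.

```python
def total_cost(one_time: float, monthly: float, months: int) -> float:
    """One-time cost plus recurring cost over the planning period."""
    return one_time + monthly * months


def affordable_options(options: dict[str, tuple[float, float]],
                       budget: float,
                       months: int = 12) -> list[str]:
    """Return the options whose total cost fits the budget, cheapest first.

    `options` maps a name to (one_time_cost, monthly_cost).
    """
    costs = {name: total_cost(one_time, monthly, months)
             for name, (one_time, monthly) in options.items()}
    return sorted((n for n, c in costs.items() if c <= budget), key=costs.get)


# Hypothetical options and prices, for illustration only.
options = {
    "nightly backups to remote storage": (0, 20),
    "hot standby server": (200, 150),
    "second data center with replication": (1000, 600),
}

print(affordable_options(options, budget=2500))
```

If the option that meets your RTO and RPO is missing from the result, either the budget or the targets have to change.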

Fourth: implement. Counting and estimating is not enough; the measures have to be put into practice. Order the backup equipment and services you need, sign the contracts, turn on monitoring. Write down in a text document which service to contact in which case and what steps to take in each situation. You can even simulate a failure once. You will thank yourself for clear, step-by-step instructions when something happens. A written recovery plan will save you a lot of time and nerves: the situation moves from the category of "unforeseen" to the category of "emergency", and you will not be stumbling around in the dark like a blind kitten.

The value of anything in our life is determined by how much we are willing to give to keep it. If you really value the results of your work, do not forget to take care of their safety. Who will, if not you?

Source: https://habr.com/ru/post/239223/
