
Disaster Recovery Planning. Part Two

Preparing for any failure




This article continues a series on disaster recovery planning. The previous article dealt with defining the planning scope and finding the points of failure that can bring down user services. The next step is to use the information about the points of failure to determine the shortest recovery time that technical specialists can deliver when they have all the necessary resources.

The necessary resources themselves will remain a subject of bargaining with company management, as you look for a balance between investment in information technology, downtime, and data loss in the event of a failure. That comes later. For now we need to determine what recovery time we can, in principle, squeeze out of the IT infrastructure when something fails. Let's go:
1. Prepare to detect faulty elements quickly: draw up localization procedures

The longest downtime occurs when a technical support specialist persistently tries to repair the mail client on the requesting user's computer while it is the mail server itself that needs repair. Our task at this stage is to make sure that information about critical failures promptly reaches the specialists who can restore the service, without pulling in everyone else along the way. To do this, we:


After that, you can estimate the time needed to localize a failure at each point of failure. The largest of these values is your “localization time”, which we will need in further calculations.
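
The calculation of the “localization time” described above can be sketched in a few lines of Python. This is only an illustration: the point-of-failure names and the estimates in minutes are made up, and the helper `localization_time` is a hypothetical name, not something from the original article.

```python
# Hypothetical estimates, in minutes, of how long it takes to localize
# a failure at each point of failure (names and numbers are invented).
localization_estimates = {
    "mail server": 20,
    "core switch": 15,
    "external DNS hosting": 45,
    "UPS": 10,
}


def localization_time(estimates):
    """Return the slowest point of failure and its localization estimate."""
    worst_point = max(estimates, key=estimates.get)
    return worst_point, estimates[worst_point]


point, minutes = localization_time(localization_estimates)
print(f"Localization time: {minutes} min (set by '{point}')")
```

Taking the maximum rather than the average matches the intent of the text: the figure has to hold for whichever point of failure turns out to be the hardest to pin down.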

2. Determine the necessary resources and conditions for recovery

The disaster recovery process can be divided into four stages:

  1. User service is not working.
  2. User service works with restrictions (low quality or temporary solution).
  3. User service is restored in full, but one or more IT systems are degraded and/or the necessary reserves are missing.
  4. All IT systems are restored, the necessary reserves are replenished.

When planning disaster recovery, we are primarily interested in the resources and conditions needed to reach stage 3, which is the necessary and sufficient condition for fully restoring the end-user service. Usually these are:


Depending on the point of failure, there may be specifics: a power supply failure requires either a diesel generator or a backup site to start the systems; a UPS failure requires switching to mains power; a failure of external DNS hosting requires the contact details from the agreement with the registrar, so that the domain can be transferred to a new hosting provider; and so on.

Write down all the necessary resources, tie them to the points of failure, and mark which ones you already have and which you still need to obtain.
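
One possible way to keep such a list is sketched below. The resources and points of failure are taken from the examples above, while the availability flags, the `Resource` class, and the `missing_by_point` helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    point_of_failure: str
    available: bool  # True if you already have it (values here are invented)


inventory = [
    Resource("diesel generator", "power supply", False),
    Resource("backup site", "power supply", False),
    Resource("procedure for switching to mains power", "UPS", True),
    Resource("registrar agreement contact details", "external DNS hosting", True),
]


def missing_by_point(resources):
    """Group the resources that still have to be obtained by point of failure."""
    missing = {}
    for res in resources:
        if not res.available:
            missing.setdefault(res.point_of_failure, []).append(res.name)
    return missing


for point, names in missing_by_point(inventory).items():
    print(f"{point}: still needed -> {', '.join(names)}")
```

The grouped list of missing items is the part that later goes into the discussion with management.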

3. Determine the minimum guaranteed recovery time of the user service

In general, the procedure for restoring a user service is as follows:



The greatest difficulty at this stage is determining the guaranteed recovery time for a point of failure. The recovery procedure has only one route with a predictable duration: after a brief but sufficient investigation of the cause of the failure, the point of failure is restored in full. Yes, in most cases fixing the error is faster than a full recovery, but a time can be guaranteed only for the full-recovery route, which is why it is the only one we can rely on.

However, restoring a single point of failure does not always restore the user service, because dependent points of failure may be faulty as well (see the dependency diagram in the first article). By finding the longest possible scenario in that diagram, you get the “minimum recovery time” of a user service that the IT department can guarantee to the business. If even you find this period unreasonably long, that is a reason to think about how to shorten it.
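
Before turning to optimization, the longest-scenario calculation itself can be sketched in code. The sketch rests on several assumptions that are not in the original article: the dependency map and the hours are invented, dependencies along one chain are restored one after another while independent branches are restored in parallel, the map contains no cycles, and the localization time is simply added on top.

```python
from functools import lru_cache

# Hypothetical guaranteed full-recovery times per point of failure, in hours.
FULL_RECOVERY_HOURS = {
    "power supply": 2.0,
    "server hardware": 4.0,
    "external DNS hosting": 6.0,
    "mail server": 3.0,
}

# Hypothetical dependency map: each point lists the points it depends on.
DEPENDS_ON = {
    "mail server": ["server hardware", "external DNS hosting"],
    "server hardware": ["power supply"],
    "power supply": [],
    "external DNS hosting": [],
}


@lru_cache(maxsize=None)
def longest_scenario(point):
    """Worst-case hours to restore `point` if its whole dependency chain failed."""
    deps = DEPENDS_ON.get(point, [])
    slowest_dependency = max((longest_scenario(d) for d in deps), default=0.0)
    return FULL_RECOVERY_HOURS[point] + slowest_dependency


def minimum_guaranteed_recovery(service_point, localization_hours):
    """Localization time plus the longest recovery chain (an assumed model)."""
    return localization_hours + longest_scenario(service_point)


print(minimum_guaranteed_recovery("mail server", localization_hours=0.75))
```

If your specialists cannot work on independent branches in parallel, replacing `max` with `sum` gives a more pessimistic figure.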


Document your conclusions about recovery times and the ways to reduce them; they will be useful later in the dialogue with management. This stage could end here, were it not for a couple of surprises we have not yet considered:

4. Determine the risk factors of the disaster recovery procedure and plan measures to control them

How unpleasant it is to discover, at the moment of an accident, that the generator has no fuel or the battery is dead, that the disaster recovery instructions (not to mention the passwords) were stored on the very server that went down, that building security simply will not let anyone into the server room at night, or that the backup you need so badly has not been created for several months in a row.

To prevent this, determine in advance what could keep you from getting a required resource at the right time, in the right place, and in the right condition. Then plan tasks (or whole sets of measures) that keep the risk factors under control and, if they cannot eliminate them completely, at least reduce their impact on disaster recovery. Examples of such tasks:


And, of course, do not forget to directly test the procedures for full recovery of the points of failure.

I recommend choosing the frequency of routine tasks at your own discretion, based on the criticality of the risk factor, the likelihood that it occurs, and the effort needed to control it. Keep in mind that performing routine tasks, and thus controlling the risk factors, may itself require additional resources.
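
Once the intervals are chosen, tracking them can be as simple as the sketch below. The tasks echo the surprises listed at the start of this section; the intervals, the dates, and the `overdue` helper are assumptions for illustration only, not recommended values.

```python
from datetime import date, timedelta

# (task, interval in days, date last performed) -- all values are invented.
routine_checks = [
    ("check generator fuel level", 30, date(2024, 1, 10)),
    ("verify that the latest backup exists", 7, date(2024, 2, 1)),
    ("confirm night-time access to the server room", 90, date(2023, 12, 1)),
    ("refresh the offline copy of recovery instructions", 180, date(2023, 6, 1)),
]


def overdue(checks, today):
    """Return (task, days overdue) pairs, the most overdue task first."""
    result = []
    for task, interval_days, last_done in checks:
        due = last_done + timedelta(days=interval_days)
        if due < today:
            result.append((task, (today - due).days))
    return sorted(result, key=lambda item: item[1], reverse=True)


for task, days in overdue(routine_checks, today=date(2024, 2, 12)):
    print(f"{task}: overdue by {days} days")
```

Passing `today` as a parameter keeps the check reproducible; in real use it would be `date.today()`.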

5. Identify situations that go beyond planning



The strongest negative impact on the business comes not from single (or sequential) failures, for which technicians are prepared to some degree, but from force majeure situations that bring down several systems of the same kind at once. Fires, severe voltage fluctuations, virus attacks, and even illegal actions by third parties can not only cause great damage but also prove fatal for the business. The term “operational recovery” hardly applies in such situations, but a number of measures can soften the blow:


In general, force majeure planning is a big topic of its own. Within disaster recovery planning, the term is used rather for situations to which no guaranteed recovery time applies. Such situations are usually worded as “simultaneous failure of two or more units of equipment or software of the same class”, since rarely does anyone have double reserves and a staff of specialists able to work in parallel on two or more identical systems. Nevertheless, situations differ, and in your case management may agree to this additional degree of reliability.

Summing up all the conclusions, you can determine the set of resources and routine tasks needed to minimize the recovery time of user services within the existing IT infrastructure, and list the situations in which no time frame can be guaranteed. Schematically, your plan will look like this:



All that remains is to reconcile it with the realities and needs of the business and, together with management, find a solution that suits everyone. That is the subject of the next article.

Part 1: habrahabr.ru/post/225719
Part 3: habrahabr.ru/post/228115

Good luck!

Ivan Kormachev
Company "IT Department"
www.depit.ru

Source: https://habr.com/ru/post/226681/

