Preparing for any failure

This article continues a series on disaster recovery planning. The previous article dealt with defining a planning zone and finding the points of failure that could lead to user service failures. The next step is to use the information about the points of failure to determine the shortest incident resolution time that technical specialists can provide when they have all the necessary resources.
The necessary resources themselves will later become the subject of bargaining with company management, in order to find a balance between investment in information technology on the one hand and downtime and data loss in the event of a failure on the other. That comes later; for now we need to determine what recovery time we can, in principle, squeeze out of the IT infrastructure in the event of a failure. Let's go:
1. Prepare to detect faulty elements quickly: write localization procedures
The greatest downtime occurs when a technical support specialist persistently tries to repair the mail client on the computer of the user who called, when it is the mail server itself that needs repair. Our task at this stage is to make sure that information about critical failures promptly reaches the specialists who can restore the service, without disturbing them needlessly. To do this:
- Create procedures for testing the operation of user services and points of failure. Using the dependency scheme (Article 1), a technical support specialist should be able to diagnose both the user service and the points of failure on which it depends.
- Configure monitoring of the points of failure. In some situations it will detect a problem before users report it. In others, it will let you exclude some points of failure from the list of suspects.
- Define the escalation rules. For example: if a problem affects the whole business, inform the duty system administrator immediately; if it affects a single department, perform localization (no more than 5 minutes) and then involve the appropriate specialists for recovery, or inform the duty system administrator if the cause of the failure could not be localized; and so on.
- Train technical support specialists so that they understand the role of particular infrastructure elements in the operation of user services, have general skills in diagnosing points of failure, understand their goals and objectives, and are not afraid to disturb their senior colleagues once more.
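The escalation rules from the list above can be written down as a small decision function. This is only a sketch: the names `Impact`, `escalate` and `LOCALIZATION_LIMIT_MIN` are hypothetical, and only the 5-minute limit comes from the rule itself; your own rules will probably have more branches.

```python
from enum import Enum


class Impact(Enum):
    BUSINESS = "business"        # the whole business is affected
    SUBDIVISION = "subdivision"  # a single department is affected


# Limit for first-line localization, taken from the escalation rule.
LOCALIZATION_LIMIT_MIN = 5


def escalate(impact: Impact, localized: bool, minutes_spent: float) -> str:
    """Return who the support specialist should involve next."""
    if impact is Impact.BUSINESS:
        # Business-wide problems go straight to the duty administrator.
        return "duty system administrator"
    if localized:
        # Cause found: hand over to the specialist for that point of failure.
        return "responsible specialist"
    if minutes_spent >= LOCALIZATION_LIMIT_MIN:
        # Time limit used up without a result: escalate.
        return "duty system administrator"
    return "continue localization"


print(escalate(Impact.SUBDIVISION, localized=False, minutes_spent=5))
```

Keeping the rules this explicit makes it easier to train the support team on them and to notice the cases the rules do not cover.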
After that, you can estimate the failure localization time for each point of failure. The largest of these values will be your “localization time”, which we will need in further calculations.
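As a minimal illustration of this calculation, assuming you have already written down an estimate for each point of failure (the element names and the minutes below are invented for the example):

```python
# Hypothetical estimates, in minutes, for localizing a failure at each point.
localization_minutes = {
    "mail server": 5,
    "internet link": 10,
    "domain controller": 15,
}

# The worst case over all points of failure is the "localization time".
slowest_point = max(localization_minutes, key=localization_minutes.get)
localization_time = localization_minutes[slowest_point]

print(f"Localization time: {localization_time} min (set by {slowest_point})")
```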
2. Determine the necessary resources and conditions for recovery
Disaster recovery can be divided into four stages:
- User service is not working.
- User service works with restrictions (low quality or temporary solution).
- User service is restored in full, but with degradation of one or more IT systems and/or a lack of the necessary reserves.
- All IT systems are restored, the necessary reserves are replenished.
When planning disaster recovery, we are primarily interested in the resources and conditions needed to reach the third stage, which is the necessary and sufficient condition for full recovery of the end-user service. Usually these are:
- redundant units of equipment with similar functionality and capacity.
- backup copies of data / configurations and access to them at the time of the accident.
- software distributions.
- access to equipment and applications (both physical and password information).
- a specialist with relevant qualifications.
Depending on the point of failure, there may be specifics: if the power supply fails, either a diesel generator or a backup site is needed to start the systems; if a UPS fails, the equipment has to be switched to mains power; if external DNS hosting fails, you need the contact details under the agreement with the registrar in order to transfer the domain to a new hosting provider; and so on.
Write down all the necessary resources, tie them to the points of failure, and mark which of them you already have and which you still need to obtain.
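Such a list can live in a spreadsheet, but even a few lines of code are enough to show the idea. The sketch below is an assumption about how the record might look: the `Resource` class, the `missing_by_point` helper and all the entries are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    point_of_failure: str
    available: bool


# Hypothetical inventory; the entries are examples only.
inventory = [
    Resource("spare server", "mail server", True),
    Resource("configuration backup", "mail server", False),
    Resource("registrar contact details", "external DNS hosting", False),
    Resource("diesel generator", "power supply", True),
]


def missing_by_point(resources: list[Resource]) -> dict[str, list[str]]:
    """Group the resources that still have to be obtained by point of failure."""
    missing: dict[str, list[str]] = {}
    for res in resources:
        if not res.available:
            missing.setdefault(res.point_of_failure, []).append(res.name)
    return missing


for point, names in missing_by_point(inventory).items():
    print(f"{point}: still needed: {', '.join(names)}")
```

The resulting list of missing resources is a ready-made argument for the later conversation with management.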
3. Determine the minimum guaranteed recovery time of the user service.
In general, the procedure for restoring a user service is as follows:

The greatest difficulty at this stage is determining the guaranteed recovery time of a point of failure. The recovery procedure has only one route with a predictable duration: after a brief but sufficient investigation of the causes of the failure, the point of failure is restored completely. Yes, in most cases it is faster to correct the error than to carry out a full recovery, but a time can be guaranteed only for the second scenario, so it is the only one we can rely on.
However, restoring a single point of failure does not always restore the user service, since dependent points of failure can also be faulty (see the dependency diagram in the first article). Having determined the longest possible scenario on the basis of this scheme, you get the “minimum recovery time” of the user service that the IT department can guarantee to the business. If this period goes beyond all reasonable limits even in your own opinion, it is a reason to think about optimizing it:
- Make advance preparations to speed up recovery.
- Reduce the time spent investigating incidents (at the cost of a higher likelihood of data loss).
- Change the architecture of points of failure to increase the speed of recovery.
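One possible way to find the longest scenario is sketched below. It rests on assumptions that are mine, not the article's: every faulty element on a dependency chain undergoes a full recovery, elements on one chain are restored one after another, the dependency scheme has no cycles, and the localization time from step 1 is added on top. All names and numbers are hypothetical.

```python
from functools import lru_cache

# Hypothetical full-recovery times in minutes for each element.
recovery_minutes = {
    "mail service": 0,      # the user service itself, nothing to restore
    "mail server": 120,
    "storage": 240,
    "network switch": 30,
}

# Hypothetical dependency scheme: each element lists what it depends on.
depends_on = {
    "mail service": ["mail server", "network switch"],
    "mail server": ["storage"],
    "storage": [],
    "network switch": [],
}


@lru_cache(maxsize=None)
def worst_chain(element: str) -> int:
    """Longest sequential recovery along any dependency chain from element."""
    deepest = max((worst_chain(dep) for dep in depends_on[element]), default=0)
    return recovery_minutes[element] + deepest


localization_time = 15  # hypothetical value from step 1
minimum_recovery_time = localization_time + worst_chain("mail service")
print(f"Minimum guaranteed recovery time: {minimum_recovery_time} min")
```

If different specialists can restore independent chains in parallel, taking the longest chain is enough, as here; if one person restores everything, the times would have to be summed instead.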
Your conclusions about recovery times and the ways to reduce them should be documented: they will be useful later in the dialogue with management. This stage could end here, if not for a couple of surprises that we have not yet considered:
4. Determine the risk factors of the disaster recovery procedure and plan measures to control them.
How unpleasant it is to find out at the time of the accident that there is no fuel in the generator or its battery is dead, that the disaster recovery instructions (not to mention the passwords) were stored on the very server that went down, that the building security service simply does not let anyone into the server room at night, or that the backup you need so badly has not been created for several months in a row.
To prevent this, determine in advance the reasons that could stop you from getting the necessary resource at the right time, in the right place and in the right quality. Then plan tasks (or whole sets of measures) that let you control the risk factors and, if not eliminate them completely, at least reduce their impact on disaster recovery. Examples of such tasks:
- validation of backup copies,
- quality checks of backup communication channels,
- checks that the necessary equipment reserves are available,
- monitoring of the status of uninterruptible power supplies and generators,
- analysis of whether the full recovery plans match the current state of affairs,
- etc.,
and, of course, do not forget to test the full recovery procedures for the points of failure directly.
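Part of this control can be automated. As an example, here is a minimal sketch of a check that catches the "no backup for several months" surprise. The `backup_is_stale` function, the system names and the dates are hypothetical; where the time of the last successful backup comes from depends on your backup software. Note that this checks only the age of a backup, not whether it can actually be restored.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_is_stale(last_backup: Optional[datetime],
                    max_age: timedelta,
                    now: Optional[datetime] = None) -> bool:
    """Return True if the newest backup is missing or older than max_age."""
    if last_backup is None:
        return True
    now = now or datetime.now()
    return now - last_backup > max_age


# Hypothetical data: when the last successful backup of each system was made.
checked_at = datetime(2014, 7, 1, 9, 0)
last_backups = {
    "mail server": datetime(2014, 6, 30, 23, 0),
    "file server": datetime(2014, 3, 12, 23, 0),
    "accounting database": None,
}

for system, made_at in last_backups.items():
    if backup_is_stale(made_at, timedelta(days=1), now=checked_at):
        print(f"WARNING: no fresh backup for {system}")
```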
I recommend choosing the frequency of routine tasks at your own discretion, based on the criticality of the risk factor, the likelihood of its occurrence and the complexity of the tasks that control it. Keep in mind that performing routine tasks, and therefore controlling risk factors, may require additional resources.
5. Determine situations that go beyond planning.

The strongest negative impact on a business comes not from single (or sequential) failures, for which technicians are prepared to some extent, but from force majeure situations that bring down several systems of the same kind at once. Fire, severe voltage fluctuations, virus attacks and even illegal actions of third parties can not only cause great damage but also prove fatal for the business. In such situations it is hard to speak of “operational recovery”, but a number of measures can soften the blow:
- Work out how data is backed up in case of force majeure. Backup media should be stored not only in the company's office but also, for example, in a safe deposit box. If the company has several locations, you can set up cross-backup between them.
- Prioritize the recovery of user services. There is always something unique without which the business will not survive; everything else can wait.
- Protect the reserves from force majeure factors. If the reserves are fully stocked, you will be able to run at least one service on them.
- Prepare (or at least designate) a backup site for deployment, even if it is the general director's apartment: in war all means are good.
In general, force majeure planning is a large topic of its own. Within disaster recovery planning, the term is used rather for situations to which recovery time guarantees do not apply. Such situations are usually worded as “simultaneous failure of two or more units of equipment or software of the same class”, since hardly anyone has double reserves and a staff of specialists able to work in parallel on two or more identical systems. Still, situations differ, and in your case management may agree to such an additional degree of reliability.
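The wording “two or more units of the same class” is easy to check formally. A small sketch, assuming each failed element is recorded together with its class (the `beyond_planning` function, the class names and the sample incident are hypothetical):

```python
from collections import Counter

# Hypothetical incident: (element, class) pairs for everything that failed.
failed = [
    ("mail server", "server"),
    ("file server", "server"),
    ("core switch", "network"),
]


def beyond_planning(failures: list[tuple[str, str]]) -> list[str]:
    """Return the classes with two or more simultaneous failures."""
    per_class = Counter(cls for _, cls in failures)
    return sorted(cls for cls, count in per_class.items() if count >= 2)


classes = beyond_planning(failed)
if classes:
    print("No guaranteed recovery time; affected classes:", ", ".join(classes))
```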
Summarizing all the conclusions, you can determine the set of resources and routine tasks needed to minimize the recovery time of user services within the existing IT infrastructure, and single out the situations in which no time frame can be guaranteed. Schematically, your plan will look like this:

It only remains to relate it to the realities and needs of the business and, together with management, find a solution that suits everyone. That is the subject of the next article.
Part 1: habrahabr.ru/post/225719
Part 3: habrahabr.ru/post/228115
Good luck!
Ivan Kormachev
Company "IT Department"
www.depit.ru