Disaster Recovery Planning. Part Three - Final

Matching the needs of the business to its capabilities.

The previous articles (1, 2) on disaster recovery planning described procedures for collecting and processing information about an organization's IT infrastructure. Those procedures yield accurate information on:

The IT services critical to the company's business,
The current time needed to restore them after a failure,
The minimum achievable disaster recovery time,
The resources needed to achieve that minimum.

All would be well if it were not for the organization's limited finances, which do not allow it to acquire every reserve needed for rapid recovery. For this reason, the final task of disaster recovery planning is to find a balance between the needs and the financial capabilities of the business, and to record that balance in a Service Level Agreement (SLA) for incident resolution.
This stage consists entirely of agreeing the following aspects of cooperation with the company's management:

1. Support hours provided by the internal IT service

The readiness of technicians to begin disaster recovery immediately after learning of a failure is the main factor that determines support hours. An eight-hour working day, vacations, illness, and time off naturally limit this readiness. If you lack specialists with the competencies needed for recovery work, or if engineers do not overlap sufficiently, both across working hours and when one of them is absent, the business should not count on 24/7 support. If the current overlap does not guarantee a prompt response even on a 9x5 schedule, the following options are possible:

Measure recovery time not from the moment the incident occurred, but from the moment the specialist begins working on it,
Prepare in advance so that less qualified specialists can restore the user service,
Train a backup specialist in the necessary skills,
Hand over the point of failure, or the entire user service, to an external contractor that meets the required SLA parameters.
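Whether the overlap between specialists is sufficient can be checked mechanically. The following is a minimal sketch, assuming a hypothetical roster format; the engineer names, skills, and working hours are illustrative and do not come from any real schedule.

```python
from datetime import time

# Hypothetical roster: who can restore which service, and during which hours.
# All names, skills, and hours are illustrative assumptions.
ROSTER = [
    {"name": "engineer_a", "skills": {"mail", "erp"}, "start": time(9), "end": time(18)},
    {"name": "engineer_b", "skills": {"mail"}, "start": time(12), "end": time(21)},
]


def covering_engineers(roster, service, at, absent=frozenset()):
    """Return the names of engineers able to start restoring `service` at time `at`."""
    return [
        e["name"]
        for e in roster
        if service in e["skills"]
        and e["name"] not in absent
        and e["start"] <= at < e["end"]
    ]


def single_points_of_failure(roster, service, at):
    """Return the engineer whose absence would leave `service` unsupported at `at`."""
    available = covering_engineers(roster, service, at)
    return available if len(available) == 1 else []
```

Running `single_points_of_failure` for each critical service and each hour of the promised support window shows where a vacation or illness would leave the business without support.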

However, even with external contractors it’s not so simple:

2. SLA with external contractors

The apparent smoothness of working with an external contractor may hide its inability to resolve incidents within the time frame the business requires. Convenient and efficient cooperation can turn into a headache at the very first problem if the level of service you require from the external supplier was never clearly understood.

If the existing service level agreement with the external supplier does not suit your business (or simply does not exist), the following options are possible:

Agree on changed terms with the existing contractor, and secure the right to several random checks of SLA compliance,
Switch to a contractor whose standard SLA meets your requirements, and again verify that it is honored,
Connect a backup provider's services so that you can switch over quickly if the main one has problems,
Accept the situation and leave everything unchanged if the contractor is a monopolist; bring this to the attention of the company's management and get their agreement on record,
Provide the service in-house.

Once you have decided which people and/or companies will carry out the recovery work, you can set the support hours for user services, which can then be written into the service level agreement between the IT department and the business. All that remains is to agree on the recovery deadlines, and for that you need to discuss:

3. Getting the reserves needed for disaster recovery

The availability of the necessary equipment reserves directly affects the ability to restore a service quickly. If your company has a single physical server and it fails, you will have no means of restoring operations (for more on determining the necessary reserves, see the previous article). If your company does not currently have all the equipment reserves needed for recovery work, the following options are possible:

Purchase the equipment in advance if the cost of downtime clearly exceeds its price. For example, a backup switch costs significantly less than the downtime incurred while one is being purchased,
Sign a service contract for replacement of failed equipment, if a “next business day replacement” condition is acceptable to the business,
Agree on rapid allocation of funds to buy the required item when a failure occurs, if the cost of downtime is comparable to the price of the reserve item,
Agree on degraded system performance in the event of a failure and/or on shutting down secondary services so that business-critical systems can be started,
Agree on rapid allocation of funds to buy less powerful equipment for a temporary launch of the failed service with reduced quality.
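The first three options come down to comparing the cost of downtime with the price of the reserve item. The sketch below shows one way to express that comparison. The `margin` threshold that stands in for "clearly exceeds", the order in which the options are tried, and all figures in the usage note are assumptions made for illustration, not rules from this article.

```python
def downtime_cost(cost_per_hour, hours_to_procure):
    """Estimated loss while a replacement item is being purchased and delivered."""
    return cost_per_hour * hours_to_procure


def reserve_option(item_price, cost_per_hour, hours_to_procure,
                   margin=2.0, next_business_day_ok=False):
    """Suggest a reserve strategy for one item.

    `margin` is an assumed threshold: downtime must cost at least
    `margin` times the item price before buying in advance is suggested.
    """
    loss = downtime_cost(cost_per_hour, hours_to_procure)
    if loss >= margin * item_price:
        return "purchase in advance"
    if next_business_day_ok:
        return "service contract"
    return "allocate funds on failure"
```

For instance, with a hypothetical switch priced at 500, downtime costing 200 per hour, and a 48-hour procurement time, the estimated loss is 9600, so `reserve_option(500, 200, 48)` suggests purchasing in advance.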

In principle, at this stage you can already state the time within which each user service can be restored after any given failure. If these times do not suit management even with all the necessary reserves available, that is a reason to discuss:

4. Advance preparations to speed up disaster recovery

This can be an additional monitoring system, a backup, or an additional server or piece of network equipment that is already configured and running in hot-standby mode. You may need them to localize the fault and restore the user service a little faster.

Once management has approved all the necessary investments in people, service contracts, equipment, and software, you can agree not only on support hours but also on deadlines for restoring user services. But to make sure these deadlines are met, one more small touch is needed:

5. The scope of routine maintenance tasks

To guarantee recovery after a failure, you must be sure that in an emergency all the resources needed for recovery will actually be at hand. To be sure of that, you must continually verify that they are present and in working order. Knowing which reserves and resources were agreed on earlier, you can draw up an exact list of the required routine checks, whose regular execution may call for additional technical specialists. This is a necessary price for reliability, but unfortunately sometimes even that does not help:

6. Situations beyond the scope of the SLA

There are situations in which recovery time is difficult to predict and which fall outside the scope of planning. These include not only force majeure, but also the simultaneous failure of two or more elements of the same type, an event that probability theory does not rule out.
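How likely a simultaneous failure is can be estimated roughly. The sketch below assumes that elements fail independently and at a constant rate (an exponential model), which is a simplification: real failures are often correlated, for example through shared power or a common production batch. The function name and parameters are hypothetical.

```python
import math


def second_failure_probability(n_elements, annual_failure_rate, replacement_days):
    """Probability that at least one of the remaining elements fails
    while the first failed element is still being replaced.

    Assumes independent failures at a constant rate (exponential model).
    """
    rate_per_day = annual_failure_rate / 365
    remaining = n_elements - 1
    return 1 - math.exp(-remaining * rate_per_day * replacement_days)
```

The estimate grows with both the number of identical elements and the length of the replacement window, so shortening the replacement time is one way to make such events rarer.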

Often it makes no economic sense to prepare the IT infrastructure and IT staff to deal promptly with every possible accident. In some cases it is much cheaper and more effective to prepare the business itself to act when one occurs. For example, prepare blank invoice forms for registering goods by hand in case of a complete failure of computer systems, or keep strict records of primary documents so that, after a force majeure event, business transactions made since the last database backup can be restored without difficulty. Possible technical measures for reducing the negative impact of such situations on the business were described earlier.

At this point the coordination can be considered complete; only minor formalities remain:

Recording the agreed parameters and acting on them

The results of your negotiations with management should be put on paper, stating:

The support hours for user services,
The guaranteed time to restore them after a failure,
The money (including when it will be allocated) and the activities needed to achieve these goals,
The situations that fall outside the scope of planning and a list of measures to reduce damage if they occur.
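If the agreement is also kept in machine-readable form, the same four items can be recorded per service. This is only a sketch of one possible layout; the class name, field names, and the example values in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceAgreement:
    """One agreed record per user service (all field names are illustrative)."""
    service: str
    support_hours: str                  # e.g. "9x5" or "24x7"
    guaranteed_restore_hours: float     # agreed recovery deadline
    budget: float                       # money agreed for reserves and preparations
    budget_available_within_days: int   # how quickly that money is allocated
    out_of_scope: List[str] = field(default_factory=list)
    damage_reduction_measures: List[str] = field(default_factory=list)

    def deadline_met(self, actual_restore_hours: float) -> bool:
        """Check a real incident against the agreed recovery deadline."""
        return actual_restore_hours <= self.guaranteed_restore_hours
```

For example, `ServiceAgreement("mail", "9x5", 4.0, 1000.0, 3).deadline_met(3.5)` returns `True`, while a six-hour recovery for the same record would not meet the deadline.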

Recorded in a document, the agreement will let you move from a situation where "IT infrastructure pretends to work, and the business pretends to invest in it" to one where the business understands what level of service it can expect for a given level of IT investment.

At this point, disaster recovery planning can be considered successfully completed. However, sometimes, after evaluating all the necessary changes and their cost, it becomes clear that it is cheaper to fundamentally change the existing IT infrastructure. But that's another story.

Good luck!

Ivan Kormachev
Company "IT Department"
www.depit.ru

Source: https://habr.com/ru/post/228115/
