Working out where it is worth laying down the straw

Information system failures can never be ruled out completely. Whatever caused a failure, once it happens the responsibility for promptly restoring not only the IT systems but the business as a whole falls on the system administrator.
In a series of three short articles, I will try to describe how to build a disaster recovery plan, which lets you turn system recovery tasks into planned activities with their own schedule, resources and budget.
The first article deals with defining the planning scope, that is, finding those infrastructure elements whose failure raises the system administrator's pulse rate. So, in order:
1. Make a list of critical user IT services
The goal of disaster recovery planning is to ensure prompt recovery of the end service that the user receives, not of any particular piece of hardware or software. The user does not care whether the printer is working or broken; what matters is whether documents can be printed. The user will complain not that a hard disk has failed in the server, but that “1C-ka” or “mail” does not work.
For this reason, the first thing we do is draw up the list of critical user IT services for which we will plan disaster recovery. Usually these are:
- Email,
- Telephone communications,
- Enterprise management system,
- Collaboration on documents,
- Printing documents,
- Internet access,
- And so on.
In essence, user services are the work tools that a business buys by investing in hardware, software and specialists' salaries, and that are critical to its operation. For example, a Counter Strike server is certainly an important element in improving employees' working mood, but it is not critical to the business.
2. Determine the points of failure of user services
Even though the user complains about a problem with some end service, it is still a specific element of the IT infrastructure that has to be repaired. Therefore, at this stage you need to identify all the systems, applications and IT services whose failure will inevitably stop critical user services or degrade their quality. Simply put, your task is to find all points of failure.
By a point of failure we mean an infrastructure unit about which we can say nothing more than “it does not work”. For example, if your router is modular, then both the chassis itself and the modules inserted into it may fail. If you are competent enough to localize and replace the failed module, you have several points of failure in one device; if not, the device is a single point of failure.
So, the Email service may have the following points of failure (including but not limited to):
- Server OS,
- Mail server application,
- Core switch,
- Power supply,
- External DNS zone,
- Blacklisting,
- Server room air conditioning.
Important! Do not exclude super-reliable equipment, with which “nothing will ever happen”, from the points of failure. When (exactly when, not if) your ultra-reliable storage system loses all its data, whether you keep laughing at the circus or not will depend only on how ready you are for that situation.
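As a rough illustration of steps 1 and 2, the inventory can be kept as a simple mapping from each user service to its points of failure and then inverted to see which services a given failure touches. This is only a sketch: the Email entries are a subset of the list above, while the "Printing documents" entries and the names `SERVICE_FAILURE_POINTS` and `services_by_failure_point` are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical inventory: critical user service -> its points of failure.
# The Email entries are a subset of the article's list; the entries for
# "Printing documents" are invented purely for illustration.
SERVICE_FAILURE_POINTS = {
    "Email": [
        "Server OS",
        "Power supply",
        "External DNS zone",
        "Blacklisting",
    ],
    "Printing documents": [
        "Print server OS",
        "Power supply",
    ],
}


def services_by_failure_point(inventory):
    """Invert the inventory: point of failure -> sorted list of affected services."""
    affected = defaultdict(set)
    for service, points in inventory.items():
        for point in points:
            affected[point].add(service)
    return {point: sorted(services) for point, services in affected.items()}


index = services_by_failure_point(SERVICE_FAILURE_POINTS)
for point in sorted(index):
    print(f"{point}: {', '.join(index[point])}")
```

A point of failure that shows up under several services, like the power supply here, is a natural candidate for the top of the planning list.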
3. Determine the dependencies between points of failure
The failure of one point of failure can cause others to fail. For example, a UPS failure will shut down a server, and as a result something else may not come back up when power is restored. Likewise, stopping a hypervisor can cause errors in the virtual servers running on it. By contrast, the failure of a client switch does not affect the operation of other equipment or services, and if it is replaced correctly, everything will work as before.
For the user Email service, the dependencies between points of failure might look like this:

Scheme 1. Dependencies of points of failure.
You should add the other critical user services and their corresponding points of failure to this scheme.
A clear understanding of how points of failure affect each other and user services will help you with further planning, namely in drawing up procedures for localizing points of failure and in determining recovery conditions and risk factors. But more on this in the next article.
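One possible way to work with such a scheme is to record it as a directed graph and walk it to list everything a single failure can drag down. The sketch below assumes a hypothetical chain built from the examples in this section (UPS, server, hypervisor, virtual server); the node names, `DIRECT_IMPACT` and `impacted_by` are illustrative and do not reproduce the actual scheme.

```python
from collections import deque

# Hypothetical dependency map: if the key fails, the listed items are
# directly affected. The chain is assumed for illustration only.
DIRECT_IMPACT = {
    "UPS": ["Physical server"],
    "Physical server": ["Hypervisor"],
    "Hypervisor": ["Virtual mail server"],
    "Virtual mail server": ["Email service"],
    "Client switch": [],  # affects no other equipment or services
}


def impacted_by(failed, direct_impact):
    """List everything affected, directly or transitively, by one failure.

    Uses a breadth-first walk, so closer dependents come first.
    """
    seen = {failed}
    queue = deque([failed])
    order = []
    while queue:
        current = queue.popleft()
        for dependent in direct_impact.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


print(impacted_by("UPS", DIRECT_IMPACT))
print(impacted_by("Client switch", DIRECT_IMPACT))
```

With this map, a UPS failure reaches the Email service through three intermediate elements, while the client switch returns an empty list, matching the contrast drawn above.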
Part 2: habrahabr.ru/post/226681
Part 3: habrahabr.ru/post/228115
Good luck!
Ivan Kormachev
Company "IT Department"
www.depit.ru