The best outcome of my work is that nobody ever needs its results. You could call me a professional paranoid: my job is to develop contingency plans and train people to act competently when those contingencies happen. Why is this needed? Quite simply, so that there is always insurance against unforeseen situations.
For example, do you know what will happen if an earthquake destroys the main Moscow data center?
- Automation kicks in and transfers some services to other data centers. Everything that was running active-active keeps working (these are the basic network functions, such as calls and SMS).
- Then the basic response scenario is activated. Immediately after the incident, recovery teams are formed on site from specially trained people who know every aspect of how the facility operates: a shift engineer, a security guard, a system administrator, and so on. They drop all their current work and deal only with recovery.
- During the first 10 minutes, the “bronze” recovery team analyzes the situation. In the 11th minute, the team leader reports to a higher-level team (“silver”, which as a rule is not present at the facility), for example the chief engineer and the head of the department.
- The “silver” team makes decisions at its own level. In our case the problem is clearly of particular importance, so it contacts the “gold” team, the top-level managers. It takes another 10 minutes to declare the situation an emergency (which is very fast), and another 5 minutes to activate our disaster recovery plans.
- The leaders of the “bronze” teams gather their people and go to restore what they can on the spot. At the same time, a crisis committee assembles, made up of the experts listed in the plan for this case.
- Next, the crisis committee interacts with HR, PR, security and other services. PR in particular will definitely need information at this point: subscribers have already been without mobile Internet for half an hour, and estimates of the recovery time have to be provided.
- The backup site is brought up. The infrastructure layer is restored within 20-30 minutes. Then comes restoration of the DBMS and, where necessary, restoration from tape archives. Next, applications are restored (from half an hour to a day); a rough sketch of this sequence follows the list.
- In parallel, during the first hour, we verify how everything has failed over.
- Then detailed reports are written. The disaster recovery plan is closed out, and we go back to standby until the next situation.
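To make the sequence above a little more concrete, here is a minimal sketch of how such a recovery runbook could be expressed in code. The stage names, target durations and the run_stage helper are hypothetical illustrations rather than our actual tooling; only the ordering (infrastructure, then DBMS, then tape restores, then applications) follows the plan described above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RecoveryStage:
    name: str
    target_minutes: int          # rough target, loosely based on the plan above
    action: Callable[[], None]   # what actually performs the restoration

def run_stage(stage: RecoveryStage) -> None:
    """Hypothetical helper: execute one stage and report on it."""
    print(f"Starting: {stage.name} (target ~{stage.target_minutes} min)")
    stage.action()
    print(f"Finished: {stage.name}")

# Illustrative runbook mirroring the order described in the plan:
# infrastructure first, then the DBMS, then tape restores, then applications.
runbook: List[RecoveryStage] = [
    RecoveryStage("Bring up infrastructure layer at the backup site", 30, lambda: None),
    RecoveryStage("Restore the DBMS", 60, lambda: None),
    RecoveryStage("Restore archives from tape where needed", 120, lambda: None),
    RecoveryStage("Restore applications", 480, lambda: None),
]

if __name__ == "__main__":
    for stage in runbook:
        run_stage(stage)
```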
Risks, Plans, Redundancy
There are baseline risks (for example, a data center outage or a fire at the head office), there are risks known to management, and there is our own analysis of how critical each technical system is. This list of risks is constantly updated, and each threat must be covered.
For the company, this approach means protection against a whole pile of threats, plus plain savings on potential downtime. By a rough estimate, the cost of our work came to only 5-10 percent of the losses that would have been incurred in the emergencies we prevented.
It is important to understand that we do not write a separate plan for every scenario. Our plans are aimed at dealing with the worst case. The documents cover all the parties involved; depending on the situation, some branches of a plan may simply not be used. In other words, we do not write a special plan for an invasion of hostile aliens, but the recovery teams would still be able to work through the “data center damaged” branches and so on.
Since we are a technical division, our plans focus on fixing problems in the technical domain, and as the technical service we are involved in all the plans. Call centers take part everywhere, since they notify subscribers. PR staff are also members of the crisis committee: they know about all our incidents and recoveries and, depending on the committee's decisions, they know what to say publicly.
Besides drawing up all kinds of recovery and backup plans, we also do preventive work; in particular, we initiate projects to add redundancy to systems within the company.
Approaches to Redundancy
- Load balancing means geographically separated active-active services: if one site goes down, the other keeps working. In the example above, balancing simply shifted the basic network functionality to another data center. These are the so-called continuous-availability solutions.
- Full, Enhanced and Basic DR are classes for high-availability applications. The Basic class is recovery from backup (it takes a few minutes), Enhanced is no longer a backup but synchronous replication, and Full is replication to a complete standby system that can be switched to in a matter of seconds; see the sketch after this list.
- Best Endeavors is a class that assumes only that dedicated, pre-installed spare equipment is available.
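As a rough illustration of how a recovery class might be assigned from RTO and RPO requirements (the idea behind the table mentioned below), here is a minimal sketch. The numeric thresholds are invented for explanation only and do not reproduce our actual assignment table.

```python
def assign_recovery_class(rto_minutes: float, rpo_minutes: float) -> str:
    """Pick a recovery class from RTO (acceptable recovery time) and RPO
    (acceptable data-loss window). All thresholds are illustrative."""
    if rto_minutes == 0 and rpo_minutes == 0:
        return "Continuous: geographically separated active-active"
    if rto_minutes <= 1 and rpo_minutes == 0:
        return "Full DR: fully replicated standby, switchover in seconds"
    if rto_minutes <= 30 and rpo_minutes <= 5:
        return "Enhanced DR: synchronous replication"
    if rto_minutes <= 8 * 60:
        return "Basic DR: restore from backup"
    return "Best Endeavors: pre-installed spare equipment"

# Example: an application that can tolerate two hours of downtime
# and an hour of data loss falls into the Basic DR class here.
print(assign_recovery_class(rto_minutes=120, rpo_minutes=60))
```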
The table below illustrates how a recovery class and redundancy solutions are assigned to IT applications based on RTO (acceptable recovery time) and RPO (acceptable recovery point, i.e. how much data may be lost).
The actual distribution of recovery classes and technical redundancy solutions is not exactly as in the table above; it is given for explanation only.
The life of a plan
It never happens that we write a document and it just sits in a drawer. Plans change constantly as the situation changes. People come and go. The hardware and software behind a system or service keep changing. Technology changes. The criticality of a service changes, and with it the recovery time requirements, so the technical solution has to change too. In other words, a plan is a living organism.
An example incident
Employees have long taken it calmly that we constantly run both “paper” tests, along the lines of “what will you do if...”, and simulations, where emergency situations are acted out on site.
One of the recent major real-life examples: in the summer of 2010, during the abnormal heat, the air conditioners in the main data center began to fail and stop one after another. The temperature inside the data center started to rise dangerously.
It never came to activating the plan: our groundwork on redundancy paid off. By moving a large part of the systems to the backup data center, we reduced the heat load and avoided an outage. In many other cases as well, things like tape backups in the backup data center, spare equipment available during the billing cycle, and replicated data on the array in the backup center helped restore service quickly.
Tellingly, nobody is afraid of unplanned outages or summer heat anymore. Yes, a service may “sag” for a while, but it will be brought back up very quickly.
How it all began
In the early days we were regarded as nothing more than paranoids.
In 2003-2004 the topic of disaster recovery in IT was practically unknown in Russia (and nobody had even heard of disaster recovery for the business as a whole). The push came from the security side: proactive people in the information security department began to promote the idea. They gave a pile of presentations and brought in a consultant to help draw things up and persuade people. The turning point was a fire at one of the integrators that was then working closely with VimpelCom. After their data center burned down, management took the matter seriously. Specialists were brought in from England who developed the first policies and strategies and set the direction for further work.
One of the first steps was the introduction of total backup. The idea is simple: all information is backed up. Even paper documents are scanned and written to tape as well. There is a cross-site backup scheme: everything that is backed up is also duplicated in the backup data center and stored there under the same policies.
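As a rough illustration of the cross-site idea, such a policy could be described like the sketch below. The site names, retention value and apply_policy helper are hypothetical, not our actual backup tooling; in practice these steps would call the backup system's own API.

```python
from dataclasses import dataclass

@dataclass
class BackupPolicy:
    dataset: str
    schedule: str          # when the backup runs
    retention_days: int    # how long copies are kept
    primary_site: str      # data center where the backup is taken
    secondary_site: str    # backup data center that receives a duplicate copy

def apply_policy(policy: BackupPolicy) -> None:
    """Sketch only: print what a cross-site backup run would do."""
    print(f"[{policy.primary_site}] backing up {policy.dataset} ({policy.schedule})")
    print(f"[{policy.secondary_site}] duplicating the same copy cross-site")
    print(f"retention: {policy.retention_days} days at both sites")

# Example: a billing database backed up nightly and mirrored to the backup DC.
apply_policy(BackupPolicy(
    dataset="billing-db",
    schedule="daily 02:00",
    retention_days=90,
    primary_site="dc-main",
    secondary_site="dc-backup",
))
```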
We regularly run training and test how the recovery teams work; these are something like fire drills. We run courses, certify people, and give them practice, because there are simply no ready-made specialists on the market.
So, while at first we were seen as paranoids and not really understood, everything has changed now. In many documents our requirements have become blocking ones. Department heads take the possible risks seriously, and at the same time they understand that, partly thanks to our work, any complex IT problem can be recovered from quickly. I think it saves everyone a lot of nerves.