Before the Thunder Strikes, or Business Continuity and GOST R 53647.4-2011 / ISO/PAS 22399:2007

A few years ago, the gas fire suppression system in the Moscow office of one company depressurized. The threat to people's life and health was more than real: 1 person died, 13 were hospitalized with varying degrees of poisoning, and 60 were evacuated. Such threats remain highly relevant, since dozens of administrative and office buildings in Moscow, St. Petersburg and other cities are equipped with exactly the same fire suppression systems.

Another situation is possible: a bank suspends operations because of a bomb threat. The alarm may come from an attacker or a prankster, or from a bank employee who finds a suspicious object or package on the premises. Whether the object is actually dangerous will be established later, but following the instructions in such a situation is strictly mandatory.

Another extreme case: a smoke bomb is thrown into an organization's office. There may be no casualties, but panic is guaranteed. Smoke seeping into the corridor and noise in the next room are unlikely to leave anyone indifferent, and they certainly do not help the working atmosphere. Someone may fall ill simply from the stress, and nobody knows where to find medicine.

To avoid casualties and other serious consequences, certain procedures must be followed: for example, notify everyone in the organization, evacuate employees and visitors, inform emergency services and security agencies (and in some cases the media), contact employees' relatives, and report the incident to management. All employees, not just managers or specially designated people, should know how to act in a crisis. The regulations on business continuity currently in force in Russia ought to contain this kind of information.

The procedures employees follow during an incident are part of a much wider area: ensuring business continuity. Below we analyze the ISO/PAS 22399:2007 standard in force in our country (Guideline for incident preparedness and operational continuity management): can the guidelines presented there really help companies prepare for possible emergencies and improve their internal response processes?

To our chagrin, the standard says little about incident preparedness; it focuses mostly on business continuity. Despite the ambitious title, many questions remain unanswered. We will try to answer them ourselves, guided by our experience.

How to determine the scale of the incident?

Here a pre-compiled list of questions is useful:

Are there any injuries? Does the incident threaten people's life and health?
How quickly is the situation changing? Can it change for the worse?
Are business processes disrupted, or is there a threat of disruption?
Could the disruption of business processes be prolonged?
Has damage been caused, or could it be caused, to:
- the image and reputation of the company;
- partners, clients or counterparties;
- the financial position of the organization?

Clearly, once disaster has struck and nothing is certain, no one will sit down and answer these questions. It is therefore worth drawing up in advance a table of damage by type and by range of losses (see Table 1).

Tab. 1. Example of damage table

Loss range	Financial losses	Loss of control	Damage due to violation of laws / regulations	Reputation damage	Personnel losses
Catastrophic losses	over ...	Disruption of production processes, product recall, letters of explanation, etc.	Unscheduled inspections by supervisory and / or inspection bodies, revocation of a license, violation of legal requirements, etc.	Negative comments, reviews and articles, customer churn, a rise in complaints, doubts among partners, etc.	Casualties, people otherwise affected by the incident, overtime, dismissals caused by the incident, etc.
Big losses	from … to …
Sensitive loss	from … to …
Low loss	from … to …

A table with ranges of measurable parameters will help you make an informed decision about the scale of the event.
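
The lookup that such a table supports can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_financial_loss` and the threshold values are hypothetical, since Table 1 deliberately leaves the actual ranges for each company to fill in.

```python
from bisect import bisect_left

# Loss ranges from Table 1, ordered from least to most severe.
LOSS_RANGES = ["Low loss", "Sensitive loss", "Big losses", "Catastrophic losses"]


def classify_financial_loss(amount, upper_bounds):
    """Return the Table 1 loss range for an estimated financial loss.

    `upper_bounds` holds the upper limits of "Low loss", "Sensitive loss"
    and "Big losses" in ascending order; anything over the last limit
    counts as "Catastrophic losses".
    """
    if len(upper_bounds) != len(LOSS_RANGES) - 1:
        raise ValueError("expected exactly three upper bounds")
    if list(upper_bounds) != sorted(upper_bounds):
        raise ValueError("upper bounds must be in ascending order")
    # bisect_left finds the first range whose upper bound is >= amount.
    return LOSS_RANGES[bisect_left(upper_bounds, amount)]


# Hypothetical limits in an arbitrary currency unit (an assumption made
# for this example; every company has to set its own values).
EXAMPLE_BOUNDS = [10_000, 100_000, 1_000_000]
print(classify_financial_loss(50_000, EXAMPLE_BOUNDS))
```

The same approach extends to the other columns of the table if they are expressed as measurable parameters.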

Who initiates the action?

An emergency can happen anywhere, so any employee may be the one to raise the alarm. Incident management requires two directions of information flow: bottom-up, an escalation tree from the person who reports the incident to the decision maker, and top-down, an alert tree that informs employees of the decision made by management. There are several types of escalation:

if there is a support service, the report goes to its operator;
if not, the report goes to the immediate superior of the person who discovered the incident;
if the company practices incident management, follow the established scheme. In that case, the escalation and alert schemes should be worked out in advance.
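
The two information flows can be sketched as walks over a reporting structure. The example below is a minimal illustration, not a prescribed scheme: the `REPORTS_TO` chart, the role names and both function names are hypothetical.

```python
from collections import deque

# Hypothetical org chart: each employee maps to their immediate superior;
# the decision maker at the top maps to None.
REPORTS_TO = {
    "engineer": "team_lead",
    "analyst": "team_lead",
    "team_lead": "it_director",
    "it_director": "crisis_manager",
    "crisis_manager": None,
}


def escalation_path(initiator, reports_to):
    """Bottom-up flow: the chain from the initiator to the decision maker."""
    path = [initiator]
    while reports_to.get(path[-1]) is not None:
        boss = reports_to[path[-1]]
        if boss in path:
            raise ValueError("cycle in the reporting structure")
        path.append(boss)
    return path


def alert_order(decision_maker, reports_to):
    """Top-down flow: everyone to notify, level by level."""
    subordinates = {}
    for employee, boss in reports_to.items():
        if boss is not None:
            subordinates.setdefault(boss, []).append(employee)
    order, queue = [], deque([decision_maker])
    while queue:
        person = queue.popleft()
        order.append(person)
        queue.extend(subordinates.get(person, []))
    return order


print(escalation_path("engineer", REPORTS_TO))
print(alert_order("crisis_manager", REPORTS_TO))
```

Keeping both trees in one structure means that a change in the org chart updates escalation and alerting at the same time.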

How do incident boundaries change over time?

The faster an incident is detected and contained, the less damage it does. Over time, the boundaries of the incident expand. For example, if recovery is fast, a server crash may go unnoticed. A lengthy downtime, however, can disrupt internal processes (for example, a report or a payment order will not be prepared). In some cases it may affect the company as a whole (failure to submit reports to the regulatory authorities or to pay for goods / services may cause significant financial or reputational damage).

The possible boundaries must be clearly defined in advance. During the incident itself only the scale is determined, that is, you choose the set of boundaries that most accurately describes what happened. As already mentioned, the damage table makes this choice easier.

How to limit the level of escalation (so that the general director is not called every time IT has a failure)?

If the company has formal instructions that describe the escalation procedure, follow them closely. You can argue about the logic of what is written when things are calm, not when reaction speed is critical.

If there are no such instructions, but there is a support service or a security service, inform them of what happened. They understand their areas of responsibility and know the sequence of actions in those areas.

Finally, if none of this exists and you need guidance, report the incident to your immediate supervisor or their deputy. If they cannot be reached, go higher up the hierarchy.
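
For the case where no formal instructions exist, the fallback order described above can be sketched as follows. All names here (`choose_contact`, the roles, the `reachable` set) are hypothetical and serve only to illustrate the order of the checks.

```python
def choose_contact(employee, reports_to, deputies, reachable, service=None):
    """Pick whom to inform when there are no formal escalation instructions.

    service    -- support or security service contact, if the company has one
    reports_to -- maps each employee to their immediate supervisor (or None)
    deputies   -- maps a supervisor to their deputy, where one exists
    reachable  -- set of people who can be contacted right now
    """
    # A support or security service takes priority when it exists.
    if service is not None:
        return service
    # Otherwise try the supervisor, then the deputy, then go one level up.
    boss = reports_to.get(employee)
    while boss is not None:
        for candidate in (boss, deputies.get(boss)):
            if candidate is not None and candidate in reachable:
                return candidate
        boss = reports_to.get(boss)
    return None  # nobody in the chain could be reached


# Hypothetical data for illustration only.
chain = {"engineer": "team_lead", "team_lead": "it_director", "it_director": None}
deputy_of = {"team_lead": "deputy_lead"}
print(choose_contact("engineer", chain, deputy_of, reachable={"it_director"}))
```

The loop stops at the first person who can be reached, so the report climbs only as high as it needs to.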

Who participates in the crisis committee?

The crisis committee must have the authority and competence to make decisions on incidents promptly. It must include representatives from all areas of the company:

financial director (allocates funds for emergency purchases of equipment, rental of additional premises, settling matters with partners, customers, suppliers, etc.);
HR director (handles issues with injured employees, recruitment of additional staff and mass layoffs, and holds the contact details of employees' relatives);
operations director (industrial accidents, customer complaints, product recalls);
administrative director (problems related to transport, logistics and supply);
director of information and telecommunications technologies;
director of public relations (contacts with the media, coverage of the incident in the press, on social networks and on the Internet);
director of security (physical and information security);
director of government relations (whose participation may be decisive when a crisis is caused by ill-considered decisions of government agencies).

Who coordinates all actions in the event of an incident?

The person responsible for management during an incident should have the authority to make decisions that are binding on all other employees of the company. This need not be the person who manages the company in normal operation: managing in a crisis requires stress resistance and the ability to make decisions quickly.

It is important to develop in advance standard schemes for how employees interact during various incidents, along with descriptions of authority and the chain of command.

What are the incident assessment options (incident rating scale)?

You can evaluate an incident on several scales, both qualitative and quantitative.

Tab. 2. Quantitative assessment: the frequency and scale of the impact of incidents.

	Almost never	Seldom	Often	Regularly
Catastrophic losses	High risk	Critical risk level	Unacceptable level of risk	Unacceptable level of risk
Big losses	Low risk	High risk	Critical risk level	Unacceptable level of risk
Sensitive loss	Negligible Risk	Low risk	High risk	Critical risk level
Low loss	Negligible Risk	Negligible Risk	Low risk	High risk
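
Table 2 translates directly into a lookup structure. The sketch below copies the cell values from the table as they are; the names `RISK_MATRIX` and `risk_level` are chosen for this example only.

```python
# Column headings of Table 2, from least to most frequent.
FREQUENCIES = ["Almost never", "Seldom", "Often", "Regularly"]

# Rows of Table 2: loss range -> risk level for each frequency column.
RISK_MATRIX = {
    "Catastrophic losses": ["High risk", "Critical risk level",
                            "Unacceptable level of risk", "Unacceptable level of risk"],
    "Big losses": ["Low risk", "High risk",
                   "Critical risk level", "Unacceptable level of risk"],
    "Sensitive loss": ["Negligible Risk", "Low risk",
                       "High risk", "Critical risk level"],
    "Low loss": ["Negligible Risk", "Negligible Risk",
                 "Low risk", "High risk"],
}


def risk_level(loss_range, frequency):
    """Return the Table 2 risk level for a loss range and a frequency."""
    return RISK_MATRIX[loss_range][FREQUENCIES.index(frequency)]


print(risk_level("Big losses", "Often"))  # Critical risk level
```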

Tab. 3. Qualitative assessment of the incident (an extended description of these terms is given in the appendix to the article)

Term	Description
Failure	A situation in which resources, such as IT infrastructure, do not work as expected. The impact of such a situation is considered minimal.
Critical situation (serious incident)	Arises when a serious first-priority incident cannot be resolved within the allotted time as part of incident management.
Accident	A destructive event in which company processes are not executed as expected. The availability of these processes and related equipment cannot be restored within the given period of time.
Crisis	A situation that differs from the normal state. Despite the preventive measures taken, such a condition can occur at any time and cannot be overcome by ordinary procedural or organizational measures.
Catastrophe	An event that a company cannot limit in time and space and that has a large-scale impact on people, property and the environment. The very existence of the company and the life and health of employees are at risk.

What technical tools support incident management?

As part of incident management, there are several separate tasks:

storing the required data: contact information, a list of actions to be performed, addresses of reserve sites and offices;
alerting a large group of people about the incident, the meeting place, management orders, etc.;
logging incident response actions;
operational analysis of the restoration of normal operation (i.e., how much the duration of the actions performed differs from the plan);
analysis of completed actions and incident reports (time of occurrence, time of elimination, number of participants, etc.);
providing a platform for sharing information on the recovery process and for solving / discussing the problems that arise in it.

There are products on the IT market that solve most of these tasks.
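
To make two of these tasks concrete, logging response actions and comparing their actual duration with the plan, here is a minimal sketch. It does not describe any particular product: the `LoggedAction` class, the `overdue` function and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LoggedAction:
    """One entry in the incident response log."""
    name: str
    planned: timedelta   # duration foreseen by the recovery plan
    started: datetime
    finished: datetime

    @property
    def deviation(self) -> timedelta:
        """Actual duration minus planned duration (positive means late)."""
        return (self.finished - self.started) - self.planned


def overdue(log):
    """Return the logged actions that took longer than planned."""
    return [action for action in log if action.deviation > timedelta(0)]


# Hypothetical log entries for illustration only.
start = datetime(2024, 1, 1, 9, 0)
log = [
    LoggedAction("notify staff", timedelta(minutes=15),
                 start, start + timedelta(minutes=10)),
    LoggedAction("switch to backup site", timedelta(hours=1),
                 start, start + timedelta(minutes=90)),
]
for action in overdue(log):
    print(action.name, "late by", action.deviation)
```

The same log can later feed the post-incident analysis, since it already holds start times, end times and planned durations.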

How to develop the necessary response measures?

It is impossible to foresee every incident, but it is possible to work out measures for the main areas of activity; these can then be combined and adapted to a specific situation. The main areas of a company's activity are:

purchase of materials / services;
delivery;
production and assembly;
providing products and services to customers;
marketing;
technical support;
manufacturing processes;
personnel, training;
IT and information security.

How to keep the plans up to date during normal operation?

Nothing better than regular drills and testing has been invented yet.

How to make changes? How often? Which details deserve attention, and what should the plan take into account?

Changes in the company should go through a dedicated, formalized change management process. Possible changes include a new organizational structure, new positions, different technical solutions, changed risks, and new products / services.

How to conduct testing?

Several arguments can help persuade the company's top management to take part in testing personally.

Managers are used to solving problems. They are unlikely to rehearse a detailed formalized plan; they are more likely to be drawn to solving the many problems that arise while dealing with a hypothetical incident. Instead of a plan, a leaflet with four steps / questions may be enough for them:
- data collection: what happened?
- analysis of the collected information: so what?
- formulating a plan of action: what now?
- notifying subordinates of the decision.
The test scenario must match the level of the manager taking part. Events that affect VIP clients, reach the media, affect the company's income, or involve changes in legislation and government decisions: this is the level of problem that leaders deal with.
In testing with management participation, good preparation and thorough preliminary analysis are very important. The scenario and the behavior of participants should be realistic: in a real incident, information never arrives ready-made. The scenario should also be unexpected. For example, what to do in a fire is more or less clear, but how to act after a leak of confidential data is much less so, which makes the second scenario the one to rehearse.
It is important to test yourself when a crisis begins, not just others. The attitude "I will wait while others deal with the crisis" is unacceptable for a leader; otherwise, neither heroism nor even a conscientious attitude can be expected from employees.
Top management prefers facts and figures. They will be interested in two kinds of stories: about companies that ran into difficulties because they did not take care of continuity in time (with the help of external or internal specialists), and about competitors that went out of business for lack of a well-tested plan.

Taking part in a tabletop exercise is sometimes enough to convince top management that their own company is not ready to respond to an incident properly.

Here are several ways to increase the involvement of rank-and-file employees in testing:

Participants need to feel that they matter and that the process itself matters. Do not limit testing to the single point of view of the business continuity specialist; encourage any unconventional thinking from employees.
Every participant in testing must have a role. There is nothing more boring than being an extra. For those not involved in the test itself, find other roles: an outsider, an employee of an external organization, a client, etc.
People are inspired when top management is involved in a business continuity project, because this underlines its importance.
Make sure that continuity is included in employees' duties and that the continuity policy is explicitly supported by top management.
Add interactivity to testing: arrange a visit by managers to the backup site, show them the conditions in which they and their employees would have to work, and demonstrate the facilities and means of communication available there.
Let participation in testing be a reason to reward employees for their work in preserving and strengthening the company's business.

What information should the incident report contain?

The incident report should include the following information:

list of affected business processes (stopped information resources);
causes of the incident;
description of response measures (including whether there was a move to a backup office / backup data center);
further measures needed to eliminate the consequences;
those responsible for the incident;
the duration of the incident's impact / downtime of information systems;
conclusions from the elimination of the incident that will help to avoid a repetition;
tasks for eliminating the shortcomings identified;
a log of the recovery progress.
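
The list above can be turned into a simple record with a completeness check. This is a sketch under stated assumptions: the field names, and the choice of which sections count as mandatory, are hypothetical and would depend on the company's own reporting rules.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class IncidentReport:
    """Incident report with one field per item in the list above."""
    affected_processes: list[str]
    causes: list[str]
    response_measures: list[str]
    moved_to_backup_site: bool
    downtime: timedelta
    remaining_measures: list[str] = field(default_factory=list)
    responsible: list[str] = field(default_factory=list)
    conclusions: list[str] = field(default_factory=list)
    remediation_tasks: list[str] = field(default_factory=list)
    recovery_log: list[str] = field(default_factory=list)

    # Sections treated as mandatory here (an assumption for this example).
    REQUIRED = ("affected_processes", "causes",
                "response_measures", "conclusions")

    def missing_sections(self):
        """Return the names of mandatory sections that are still empty."""
        return [name for name in self.REQUIRED if not getattr(self, name)]


# Hypothetical draft report for illustration only.
draft = IncidentReport(
    affected_processes=["payment processing"],
    causes=["server failure"],
    response_measures=["switched to backup data center"],
    moved_to_backup_site=True,
    downtime=timedelta(hours=3),
)
print(draft.missing_sections())
```

A check of this kind helps ensure that a report is not closed before the conclusions have been written down.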

Appendix:

A failure is a situation in which resources, such as IT infrastructure, do not work as expected. The impact of such a situation is considered minimal: the amount of damage does not prevent the company from carrying out its tasks (or the damage is negligible compared to its annual turnover). However, if the failure is not corrected in time, it can grow to the scale of an accident. Note that failures are handled by incident management (service desk, 2nd and 3rd support lines), not by the IT continuity process.

A critical situation (serious incident) occurs when, as part of incident management, it is not possible to resolve a serious first-priority incident in the allotted time.

An accident is a destructive event in which company processes are not performed as expected and their availability cannot be restored within the allotted period of time. Business operations are seriously affected, and meeting SLAs becomes impossible. The damage ranges from large to very large, i.e. the accident has an unacceptably large negative impact on the company's annual revenue.
Accidents cannot be handled like critical situations, that is, within the standard incident management procedures. Eliminating them requires a special response within the business continuity management process.

A crisis is a situation different from the normal state. Despite the preventive measures taken, such a condition can occur at any time and cannot be overcome by ordinary procedural or organizational measures. There is a need for crisis management. There are no clear, formalized procedures for managing in crisis conditions, only general recommendations. A typical feature of the crisis is its uniqueness.

Accidents affecting the course of business processes can grow to the extent of a crisis. That is, a crisis is an expanded accident that threatens the existence of a company or the life and health of employees. The crisis affects the company, but does not have a large impact on the environment or public safety. The crisis can largely be resolved by the company itself.

A number of crises have no direct impact on business processes. These include economic crises, liquidity crises, management crises, fraud cases, large product recalls, kidnappings and terrorist threats. As a rule, the company cannot eliminate such crises on its own; they require the involvement of external organizations (internal affairs bodies, regulators, financial institutions) and can be considered examples of catastrophes.

A catastrophe is an event that a company cannot limit in time and space and that has a large-scale impact on people, property and the environment. The very existence of the company and the life and health of employees are at risk. The consequences of an event of this magnitude cannot be eliminated by the organization's own efforts; this requires the participation of emergency services.

The article was prepared by Konstantin Musatov, a business continuity consultant at Jet Infosystems. We welcome your constructive comments.

Source: https://habr.com/ru/post/309900/