The concept of HumanOps was born at Server Density out of our accumulated experience monitoring computer systems and, as a consequence, keeping the team on constant call. In the company's early years I was effectively on call 24/7. As the team grew, however, we introduced processes and policies aimed at distributing the workload and reducing the negative impact of pulling employees away from their current work or calling them outside office hours, including at night.
While building and selling a product designed to wake people up, we noticed that our customers were experiencing on-call problems much like our own. Conversations with them convinced us that such problems are typical for the industry, so we decided to take a critical look at our own approaches and to study the best practices of other market participants. In the end, this led to the creation of a community to develop and discuss a set of principles, which we called HumanOps.
We hope that, just as DevOps practices accelerate deployment, bring new tools, and unite development and operations teams, HumanOps will help organizations adopt a more "humane" approach to building and operating systems.
Below you will find the 12 principles of HumanOps and a description of how they work in practice at Server Density.
The focus of HumanOps is on the people who build, use, and maintain systems. To some this may seem obvious, but it is still essential to state it as the first principle: without acknowledging that systems cannot work without people, it is too easy to start thinking only about servers, services, and APIs.
In practice, this means that from the very beginning you design systems with the people who will interact with them, in one way or another, in mind.
What exactly to take into account is covered by the principles that follow.
This is where any process improvement must begin. We reasonably expect computers to work the same regardless of the time of day. That is one of the key advantages of computer systems: they can perform their tasks reliably without getting tired.
However, many mistakenly try to apply the same logic to people. It is important to remember that people behave differently in different situations. Emotions, stress, and fatigue reduce the predictability of the overall system's behavior, so these factors must be taken into account when designing it.
As an example, consider human error. Computers do not make mistakes: they do not press the wrong button out of fatigue. People are capable of this and will most likely make a mistake somewhere if they have not been properly trained and the corresponding safeguards have not been built into the system. Human error is an integral part of any system and must be well understood. It should be viewed as a symptom rather than a problem in itself, and its occurrence should prompt a closer look at the situation in which the person made the wrong decision.
One way to improve the situation is training. Training should be as close to reality as possible, so that when a problem actually occurs it is perceived the same way as during practice. This reduces stress, because employees know what to do. Stress arises from uncertainty combined with the awareness that the system is down, so you should do everything possible to soften the psychological impact of abnormal situations. To simulate various failure scenarios we run war games, so that in any of the rehearsed situations every employee knows what to do.
A Service Level Agreement (SLA) is a proven way of defining what you can expect from a particular service or API. You should be able to easily determine whether a service is meeting its SLA, and know what to do if it is not.
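As a minimal illustration of what "easily determine" could mean in practice, here is a hypothetical Python sketch (not Server Density's actual tooling) that checks observed response times against an assumed latency target:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SLA:
    """Hypothetical SLA: a latency target that a given share of requests must meet."""
    max_latency_ms: float    # e.g. each request should complete within 500 ms
    target_fraction: float   # e.g. 0.99 means 99% of requests must meet the target


def meets_sla(latencies_ms: List[float], sla: SLA) -> bool:
    """Return True if the observed latencies satisfy the SLA."""
    if not latencies_ms:
        return True  # no traffic, nothing violated
    within_target = sum(1 for latency in latencies_ms if latency <= sla.max_latency_ms)
    return within_target / len(latencies_ms) >= sla.target_fraction


# Example with made-up measurements: one request out of six exceeded 500 ms
observed = [120, 95, 480, 700, 130, 110]
api_sla = SLA(max_latency_ms=500, target_fraction=0.99)
print(meets_sla(observed, api_sla))  # False
```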
As with principle 2, the contrast with computers matters: computers can run non-stop for months and years, whereas people need rest. Responding to incidents and dealing with complex systems tires people out quickly, so time to rest and recover must be an integral part of the process. A person can stay focused for only about 1.5 to 2 hours, after which they need a break; otherwise performance starts to decline.
At Server Density we address this with an on-call rotation. Team members take the primary and secondary roles in turn, and we have documents defining the required response time for each role. This reduces the feeling of being chained to your computer. For example, the secondary on-call engineer is not required to respond instantly, so they do not need to stay near a computer at all times.
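As a rough sketch of how such a rotation might be represented, assuming a simple weekly schedule with made-up names and response targets (the real schedule and response times are whatever your team documents):

```python
from datetime import date

# Hypothetical response-time policy per role (minutes)
RESPONSE_TIME_MIN = {"primary": 5, "secondary": 30}

# Made-up weekly rotation: (primary, secondary) pairs
ROTATION = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
]


def on_call(day: date, rotation_start: date = date(2017, 1, 2)) -> dict:
    """Return who is on call for the week containing `day`, with their response targets."""
    week_index = (day - rotation_start).days // 7
    primary, secondary = ROTATION[week_index % len(ROTATION)]
    return {
        "primary": (primary, RESPONSE_TIME_MIN["primary"]),
        "secondary": (secondary, RESPONSE_TIME_MIN["secondary"]),
    }


print(on_call(date(2017, 6, 28)))
```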
Moreover, time off to recover after out-of-hours incident work is allocated automatically. An employee may decline it, but the company never asks them to. This way we are confident that employees get enough time to recover and that no one is pressured into giving it up. The automatic granting of this rest is key: the employee does not need to do anything to receive it.
It may seem that we give employees time off after out-of-hours incidents purely out of human kindness. But there is business logic here too: overworked people make mistakes, and there are many examples of major incidents that were caused or made worse by operator fatigue.
As with insurance that you hope never to use, the direct benefit of this approach is hard to calculate. The point of reducing the likelihood of human error is to prevent something bad from happening, which is difficult to measure, but there is an undeniable logic in looking after operators' well-being: well-rested operators make better decisions.
Alert fatigue comes from receiving too many alerts. When there are so many of them, the operator starts to ignore them and risks missing something important. This significantly reduces the effectiveness of the alerting system, which should fire only rarely, to tell people that something genuinely serious has happened.
To address this, you need to audit the monitoring system, checking whether each generated alert actually requires an action and whether that action is in fact performed.
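A hedged sketch of what such an audit could look like, assuming you can export an alert history with a flag recording whether anyone acted on each alert (the record fields here are hypothetical):

```python
from collections import Counter
from typing import Iterable, Mapping


def audit_alerts(alert_history: Iterable[Mapping]) -> dict:
    """Summarise, per alert rule, how often it fired and how often someone acted on it.

    Each record is assumed to look like {"rule": "disk_full", "actioned": True}.
    Rules that fire often but are rarely acted on are candidates for removal,
    tuning, or automation.
    """
    fired = Counter()
    actioned = Counter()
    for alert in alert_history:
        fired[alert["rule"]] += 1
        if alert.get("actioned"):
            actioned[alert["rule"]] += 1

    return {
        rule: {
            "fired": count,
            "actioned": actioned[rule],
            "action_rate": actioned[rule] / count,
        }
        for rule, count in fired.items()
    }


# Made-up history: "high_load" fires a lot but is almost never acted on
history = [
    {"rule": "disk_full", "actioned": True},
    {"rule": "high_load", "actioned": False},
    {"rule": "high_load", "actioned": False},
    {"rule": "high_load", "actioned": True},
]
for rule, stats in audit_alerts(history).items():
    print(rule, stats)
```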
This principle is related to the previous one: alerts should reach people only when the system cannot repair itself. There is no need to wake someone up to restart a server or perform another simple action. If something can be automated, it should be. People should be involved only to assess complex situations and perform non-standard actions.
Unfortunately, once a system is in production it is difficult to add automation after the fact. With modern technologies such as Kubernetes and cloud APIs you can configure automatic recovery from almost any failure, but adapting newer technologies to older systems takes considerable effort. Redundancy and the introduction of new tooling cost money, of course, but those costs are repaid by the human time saved and by the increased overall reliability of the systems.
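As a rough illustration of "automate first, escalate to a human last", here is a hypothetical sketch: the service name, health endpoint, and paging function are placeholders standing in for whatever your tooling actually provides.

```python
import subprocess
import time
import urllib.request

SERVICE = "myapp"                              # hypothetical systemd unit
HEALTH_URL = "http://localhost:8080/health"    # hypothetical health endpoint


def healthy() -> bool:
    """Consider the service healthy if its health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def try_restart() -> None:
    """Attempt the simple remediation a human would otherwise be woken up for."""
    subprocess.run(["systemctl", "restart", SERVICE], check=False)


def page_human(message: str) -> None:
    """Placeholder for the real escalation (pager, SMS, chat, etc.)."""
    print("PAGING ON-CALL:", message)


def remediate(max_attempts: int = 2) -> None:
    """Restart automatically; only wake a person if that does not help."""
    for _ in range(max_attempts):
        if healthy():
            return
        try_restart()
        time.sleep(30)  # give the service time to come back
    if not healthy():
        page_human(f"{SERVICE} is still unhealthy after {max_attempts} automatic restarts")


if __name__ == "__main__":
    remediate()
```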
The correct approach to building infrastructure is that nothing in production should be done by hand. Everything should be templated, written as scripts, and run automatically.
When modernizing old infrastructure a balance has to be struck: containerizing the core components of the system, for example, may be impractical. However, there are ways to achieve similar goals, such as moving a database hosted on your own hardware to a managed service like AWS RDS.
In real life nobody likes writing documentation, but as the team grows and the system becomes more complex, it quickly becomes a necessity. You need enough documentation that a person with only a limited understanding of the system's internals can use it; checklists and playbooks work well for documentation aimed at solving problems.
Training is just as important. Among other things, it helps to identify gaps in the documentation. It is also vital to run realistic simulations with the people who respond to incidents and to explain to them how the system works.
To make the documentation easy to find and accessible to everyone in the company, at Server Density we use Google Drive, although there are many other options for hosting it.
When searching for the cause of a problem, you will almost always find an employee who made a mistake, failed to plan for every scenario, made a wrong assumption, or wrote bad code. This is normal, and people should not be shamed for it; otherwise, next time they will be reluctant to help investigate an incident.
Nobody is perfect. Each of us has broken something in production at least once. What matters is not blaming a particular person but making the system better and more resistant to this kind of problem. Systems are almost never broken deliberately, so people should feel safe admitting their mistakes as soon as they realize what they have done; this is important for quickly containing the consequences. Failures should be seen as an opportunity to learn and to make the team better.
This can be achieved through blameless post-mortems, in which the details and the root cause of the incident are established, but the people who made the mistake are not named.
There is a tendency to treat people problems and system problems separately. It is much easier to justify additional spending on server performance and fault tolerance than on staff-related issues. All of the above principles are meant to emphasize that people-related issues are no less important and deserve appropriate time and budget.
At Server Density, when planning development cycles, we often prioritize tasks by their potential to reduce the number of emergencies that require people out of hours. A fix for a problem identified during a post-mortem gets a higher priority if responding to the incident meant getting people out of bed, or if the problem at least has that potential.
People's health and well-being affect their work, including their ability to help solve the problems that arise, and system problems lead to lost revenue and reputational damage, so people's health has a direct impact on the health of the business. Finding new employees also takes a lot of time and effort, which makes taking care of people good for business.
Although it is of course important to treat people and systems as equals in terms of how much they affect each other and the business as a whole, in the end people still matter more. After all, why does a business exist? To sell products and services to other people. And why do people do a job at all, if not to make a living?
Improving your team's working conditions is fairly easy to justify. To be able to hire and retain the best specialists, your workflows must be well organized. Constantly being dragged out of bed, shamed for mistakes, and left with unfixed problems takes a toll on people. Prolonged stress can cause high blood pressure, heart disease, obesity, and diabetes. Many organizations that pay too little attention to staff well-being can thus inadvertently undermine their employees' health, sometimes very seriously.
We at Server Density believe that this is an unacceptable price for business success.
Source: https://habr.com/ru/post/331678/