A brief essay on mutual understanding or worth remembering when working in medium and large organizations

Hello everybody!

My name is Roman and I work in a medium-sized company (~ 500 people), which provides consulting services for the development of high-load systems for large companies in the US market. One of the customers is a company that is in the TOP-5 of the largest online stores in the United States selling clothes, shoes, haberdashery and other consumer goods.
')
In addition to our company, consulting services to the customer company are provided by another 12-15 large vendors working on an ongoing basis, plus a bunch of small ones. A large number of these same vendors (both large and small) have outsourced on the peninsula to the south-west of China. Traditions and mentality in India are significantly different from ours, and even more significantly different from the western, so with enviable regularity in the course of the work there is a lot of funny and not so situations, to describe the reasons for which the best fit non-Russian word "miscommunication". This opus, however, is not about working with teams of other cultures scattered around the globe, but about the fact that this very miscommunication can easily occur on level ground even within the same company, whose employees speak the same language, but also about the importance of understanding the context when communicating with other people.

A bit of context

The system we are working on is a lot of components. The main web-store is spinning on a cluster containing about 80 machines, smaller systems work in clusters of 5-15 servers. All this is necessary in order for the whole application to work properly at peak loads. Theoretically, any of the components may fail, so the system is designed so that the failure of any subsystem does not lead to the failure of the entire system, and thus does not lead to a breakdown of sales, which run 24/7/365 and are the main source of customer profit. As they say, the skill of the programmer is not to write programs that work without errors, but to write programs that work with any number of errors. If I remember correctly, the availability of applications for 2015 is now in the region of 99.98%, which is very even for this type of system.

Under the contract, our company has the so-called ownership over a number of components in the system. This lies in the fact that we are responsible for the development / implementation / maintenance / expansion of any software modules, and no one except our development team climbs into the code of this module without our permission. Well, in theory, it does not climb - given the geographical location of the other vendors, their penetrating energy and the regular desire to cut corners, in practice this is quite difficult to achieve.

In order to conduct parallel development of many modules, the infrastructure provides for many of them, each of which either has a full copy of the web-store or some part of it. Well, how much. Real pieces 150-200. All envs are supported and developed by our DevOps and NOC (Network Operation Center) teams, whose duties are to ensure that all enes work, have required required versions of components, and if something fell, it was quickly picked up, preferably without prejudice to the development.

Since the availability of all components in production is critical for the customer, our company also provides the services of round-the-clock support for modules developed by our teams. It would not be particularly correct to oblige all developers to provide all such support, so there is a special team SRE (Site Reliability Engineering), which is responsible for L2 / L3 support. They always come to the dudes with technical problems in production, and then they solve them. Usually, a solution to 95% of problems comes down to either giving recommendations from the series: “And here you have one instance left the cluster, so the latency grew by 20%. Restart to help. ”(L2), either:“ Oh, we have a jamb in the code here, we’ll see how to fix it quickly. And if we cannot do it ourselves - we will call the main team of developers, and we will tweak with them ”(L3).

Ending Holiday Season (Black Friday, plus Cyber Monday, plus a few days before and after) for retail companies is happiness and pain in one bottle. On the one hand, for this one week, the company can make about 15-25% of annual profits. On the other hand, an insane number of buyers who want to buy a bunch of items at a discount, create a peak load and the risk that a module will not withstand the load. If the component is somewhere far in the stack and represents, for example, the cache of already completed orders - this will simply create a disadvantage to the order processing teams. If a service that authorizes payment transactions suddenly falls down, and for some reason buyers will not be able to give money to the retailer, it will not seem to anyone much, because the amount of revenue per day is measured in tens of millions of bourgeois money. In short, Holiday Season, or, as it is also called, the peak - this is an extremely exciting, but stressful period for all those involved in the development and support.

At the peak of SRE, the team gives enhanced support 24/7 and actively keep abreast of all our components. To do this, the team set up a custom monitoring of components in production at one of such development sites. In fact, they wrote small utilities, which (in addition to the main monitoring) regularly send clever test helms checks to each of our systems, write some statistics, and if suddenly one of the components starts to blunt and respond for a long time, or not to respond at all - immediately shouts loudly on all channels of communication to the engineers on duty.

The property is called QI58. Formally, the env belongs to the SRE team, i.e., in theory, no one but them (and DevOps / NOC commands in case of problems) should not go there. Before the beginning of the peak, the team just in case wrote a letter to all our teams that the QI58 mission critical spins important services, it is not necessary to touch it, otherwise it will be atat.

The essence

It so happened that also QI58 is our only properly configured environment, on which one of the components can be tested. Therefore, in the usual time this env scurries with our team, and we use it for testing. Our development team for the last couple of days has been working on a small improvement of this particular component. Improving this with a peak has nothing to do, just planned expansion work. And today it has become necessary to test it on Enve. We wrote to the SRE team, saying that you don’t mind if we now test for QI58. Yes, yes, we remember monitoring. Yes, let's do everything carefully. It will take half an hour, at the end we will write off.

And then something like this happens:

One fighter from my development team is starting the manual deployment of the module on QI58 (yes, the automated deployment for this component, alas, is not working now for various reasons). When trying to send a test curl request from the same machine, it gets an error: sed: couldn't write 440 items to stdout: No space left on device . Understands that the problem most likely arose due to the large number of logs on the disk, can affect the monitoring, spinning on the same machine, and tells me about it.
We are talking about the problem of the SRE team. We say that you have a problem with a place in your car, it may be critical for monitoring.
We get immediate feedback, yes, yes, really critical. We also get the go-ahead so that we can get a ticket in JIRA on the NOC team, so that the dudes will look and solve the problem.
My fighter starts a ticket. I look at it, and I understand that, just in case, I should remind once again that QI58 is mission critical, and important applications are running on it. I write comments to the ticket, in the spirit of that: “NOC team, pay attention to this environment. “Please contact people of A, B or C in any case.”
The ticket is taken into development, and after 10 minutes it goes into Resolved status with a comment about the following content:
The problem is resolved by removing the following non-standard files and folders:
- gigaspace-monitoring.jar
- gigaspace-monitoring-libs /
- heartbeat.jar
~ 150mb was released. In case if the problem appears again, please reopen the ticket.

Well, then follows a few-hour general fun, mats, and other pleasures of life.

Lessons learned

I must say that, by and large, nothing terrible happened. Deleted files were promptly restored, no critical problems occurred with the components of the observed system during this time, not a single NOC engineer was hurt. But the fact that the most important thing from all that could be deleted on the server was deleted is, of course, the epic :)

By the way, if you think a little deeper, the NOC specialist did what he did, not because of his stupidity, and not because he forgot about the fact that QI58 is a critical env, and not even because he is wildly lucky and his eyes first The queue got the files that just did not need to touch. The root of the problem turned out to be that no one clearly communicated to him exactly which processes were critical, and that it was not necessary to touch. Moreover, all the work done by him was done professionally and within the framework of the instructions:

the problem was solved promptly and efficiently;
the server was not restarted, was not removed and created from scratch from the image, and so on. Those. availability enva was not affected, as required;
non-standard components were removed that are not applications for which the environmental was originally intended;
immediately after the discovery of the problem, the NOC team promptly set to work and restored the deleted components.

So the main conclusions to be made:

Not everything that is obvious to you is obvious to other people on another team that does other things outside of your context. And when people work with complex systems and large volumes of information, the incomplete context of a particular person is a frequent occurrence. A person assumes that everyone in the team is a specialist and everyone works with an understanding of what they are doing, but he misses that different people have different context of tasks and not always a complete understanding of all the details, if he has not worked with any element tightly.
If something is mission critical, it is necessary to do this not only so that everyone understands that it is mission critical, but what exactly is such (what OS processes, applications, files on the disk, etc.). In addition to this, it is important to make sure that there are physical or procedural access restrictions that do not allow anyone to create chaos where this is important. At the same time, there is always a risk of bending the stick and getting bogged down in bureaucracy, which will interfere with normal work, therefore balance is important.
In some cases, it makes sense to talk to people live, transfer the context, and be sure that everyone understood important details before starting a standard problem-solving process.
The human factor is the main cause of all problems. It is not about programming. Not about technologies. Not about business processes. It is all about people.

Communicate more, formulate thoughts clearly and clearly, and remember that the picture in the head of any other person is always different from yours.

All good!

Source: https://habr.com/ru/post/297022/

All Articles

A brief essay on mutual understanding or worth remembering when working in medium and large organizations

A bit of context

The essence

Lessons learned

More articles: