How the CROC engineering service works, and what happens when a cluster breaks somewhere far away at 3 a.m.

This DL360 is a Pentium I server kept for hot swapping. Somewhere far away in Siberia, its twin has been running under constant load for many years. If it fails, we have a replacement that lets the customer simply keep working without a radical changeover.

And this is the kind of view a morning on the road often starts with

Good morning! My name is Alexander, and I head the CROC service team.
Across the country there are many sites where a cluster failure immediately lands the local boss on the TV news. These are research institutes, industrial enterprises, regional nodes of banks and insurance companies, oil companies, airports, and so on. We install the hardware and software there, and we keep all of it under support.

Let's start with the fact that an installation almost never goes without adventures. We are lucky if someone just forgot to provide power or a network connection. It is worse when the server rack is stuck outside the building because someone gave the wrong door dimensions. There are also moments like: "Guys, we have prepared everything and connected it, there is just one nuance: your server was dropped during unloading. Well, just a couple of times." Now I will tell and show what our work looks like.

What the work is about

During my time at CROC, I have traveled almost the whole country for installations and support. Now I run the department, so I rarely travel myself.

My workplace. As you can see, there are more folders than hardware

The usual scenario for a duty shift is this: we sit and wait for a call. When something breaks, we have quite strict standards for the time allowed to fix it. For example, at critical sites in Moscow, the hardware replacement time is 4 hours from the moment of the request. There are especially important sites in Novosibirsk and other cities too; fortunately, booking tickets is no longer a problem.

The team waiting for a call has to be on site and on duty. As a rule, the engineers spend this time either poking at new hardware and studying it, or on self-study. In short, we train and sharpen our skills.

Sometimes we get curious about new solutions and order them to "see for ourselves". Many interesting projects have come out of this, from an office lighting system that adapts to the weather and open windows to various solutions for our own security.

Tests

Another group of engineers handles planned installation and maintenance. They do not have to drop everything and run to the station or rush to the airport. They know in advance what, where, how, and when. That does not make it any easier because, I repeat, every installation is an adventure of its own. It is best to prepare for it carefully, which in practice is far more nerve-racking than rushing to the rescue like Chip and Dale.

Outside the duty shift we also work with our hardware, but we can do it outside the office. Another important aspect is our engineers. These are people with a great deal of practical experience, and some of them often speak at internal training sessions as well as at various technical conferences. Except for those who are on the duty shift, of course. In theory, though, if several critical situations happen at once, a regular staff engineer may also have to break off a talk mid-sentence and run. As far as I remember, this has happened only once.

The cups are not mine. But they are very handy for, say, holding all sorts of small parts so that they do not get lost.

Heading out for an installation

A planned cluster installation, for example, usually needs more than one specialist. One person handles the operating system and the cluster configuration itself, another handles the stack, and a third handles the application, depending on what the customer deploys on top of the cluster. Sometimes we get by with two, since network engineers are often available on site, but sometimes a given site has no IT staff at all.

It starts with unloading. Sometimes the hardware gets damaged. We take photographs for the cases when we need to prove a fault (for example, that the equipment arrived broken through the fault of the transport company). After that, sorting things out takes a long time.

Suppose everything arrived as it should. We install the system, say that same cluster. Everything looks good: there is a specification, hardware, and software, we are ready to start the setup, and the managers have their agreements in place. Everything has been discussed a hundred times, and all the tricky points have been agreed on. Then the engineer arrives and realizes that this is not an ideal world.

He goes to, say, the local network engineer and says: "I need eight interfaces allocated on the switch." And the answer is: "I only have six, and two more will be here tomorrow or the day after. We have to order them from the warehouse." The engineer runs around asking everyone for something. By the time he has been given everything, a place in the rack has been freed up, power has been connected, and cables have been pulled, a couple of days may have passed.

Then he starts calling the administrators who register him in the domain, then the DBMS specialists who explain how everything is arranged, and the administrators also add him to their own system. Every time he works with someone new, and there is no guarantee that this person is prepared. It is a production system and the engineer does not know the password, so an admin has to sit next to him and type it in. That is not much fun for them either.

People can be very different, too. For example, the SQL guy likes to drink, and someone else walks around in a Simpsons T-shirt at minus thirty because his wife left him. You have to find an approach to each of them. Clearly all these people help, because there is a common task, but there is still a certain thrill in having to learn something from everyone in order to finish your own work. Everyone has to explain to you how things are arranged. Very often the documentation diverges somewhat from reality, and the installation plan may change. Or it suddenly turns out that a certain type of network packet is prohibited by a policy set in Moscow (and the time zone is different, it is the dead of night in Moscow, and there is no one to call).

At about this stage it may turn out that there has been no working backup for a year now. Ha ha. And that kicks off, once again, a whole series of painful adventures. Of course, we can do the setup without a backup; formally, it has nothing to do with us. But the bad impression will remain: some people came, broke everything, and left.

Spare parts

Our warehouse also deserves a mention. We keep about eighty thousand items in stock for hot replacement. Understandably, with a 4-hour replacement SLA, the warehouse has to hand over the part before you have even ridden the elevator down. That is why our storekeepers methodically keep accurate records and check everything.

The accounting system says: "Your part is in box such-and-such in block such-and-such." Regardless of whether it is small or large.
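To illustrate the idea of an accounting system that answers "which block, which box" for any part, here is a minimal sketch in Python. It is not our actual system: the class names, the part number, and the block and box labels are all hypothetical, and a real warehouse system would sit on top of a database with proper auditing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    block: str  # hypothetical warehouse block label
    box: str    # hypothetical box label inside that block


class SpareIndex:
    """Maps a part number to every place where a spare unit is stored."""

    def __init__(self):
        self._places = {}

    def put(self, part_number, location):
        # Register one more unit of this part at the given location.
        self._places.setdefault(part_number, []).append(location)

    def locate(self, part_number):
        # Return the first known location, or fail loudly if out of stock.
        places = self._places.get(part_number)
        if not places:
            raise KeyError(f"no spare in stock for {part_number}")
        return places[0]

    def take(self, part_number):
        # Hand the unit over to the engineer and remove it from the records.
        location = self.locate(part_number)
        self._places[part_number].remove(location)
        return location
```

Calling `locate` gives the storekeeper the block and box straight away, and `take` keeps the records accurate once the part has left the warehouse.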

You walk up, and it is immediately clear what is stored here.

In one section of the warehouse we have a "museum", the place where exhibits like these are kept

They are fully functional and really are needed for hot swapping. When a system is complex, critical, and of the "don't touch it while it works" kind, it is easier to replace a failed node with an identical one than to reconfigure and rebuild everything. That is why we keep stock worthy of a museum.

Source: https://habr.com/ru/post/228529/
