How we beat the hardware mess and became bureaucrats from scratch

The difference between documentation and a knowledge base: the documentation says that this device cools the air to +18 degrees Celsius, while the knowledge base tells you about a rare bug where two sensors suddenly report -51 thousand degrees and the device starts frantically heating the air for the servers.

When you start a new small project, you have one piece of hardware lying on the floor, no documentation, nothing at all, and you can still get work done. Then the project grows to several hundred people and thousands of pieces of hardware, and you need to know exactly where everything is, how things are done, and so on.

You need proper accounting of everything. You need documentation. You do not want situations where you do not know what is in your warehouse and how much of it. You do not want stories where an engineer gets sick and the others call him at home to ask how he configured a server a year ago. You do not want stories where someone asked for 10 servers to be brought up and two different people did it differently.
But we started simple. The questions were: Who updates the server firmware? Who is responsible for the result? How is it done? Who should be warned? How do you write a rollback plan, and what do you do if the server crashes? Did anyone at least write down all the phone numbers you need in advance?

In general, the very first rake you step on will either kill you outright or teach you to do everything right. For us it was the latter, and without the rake. Almost without the rake. If you already have chaos, our experience may be useful, because we feel better now.

Hardware accounting

The very first and simplest thing is to record where the hardware is and what it does. This is needed for working with changes, for inventories, and for proper accounting. For example, a memory module burns out in a server: you need to open an incident with the vendor and replace it. The vendor will immediately ask for the FRU number. The number is printed on the module itself, and it can also be viewed in IMM (but only while the module works). That is, without accounting, the engineer goes to the machine room, powers off the server, pulls out the module (because he does not know the FRU, but he does know the DIMM slot) and, squinting, reads out its number.

The right way is to scan all the numbers from the modules themselves before starting the server, or to read the data from the modules remotely once the server is running (working modules can be read remotely), so that you can quickly see where the failed memory is and what its number is.
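To illustrate why this record pays off, here is a minimal sketch of such a lookup in Python. The record layout, the server names, and the FRU and serial values are all made up for the example; they are not taken from the real cards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryModule:
    server: str      # hypothetical server name
    dimm_slot: int   # slot number the engineer knows from the error report
    fru: str         # FRU number the vendor asks for
    serial: str      # serial number printed on the module


# Hypothetical inventory records, captured when the server was installed.
INVENTORY = [
    MemoryModule("srv-01", 1, "FRU-0001", "SN-A1"),
    MemoryModule("srv-01", 2, "FRU-0001", "SN-A2"),
    MemoryModule("srv-02", 1, "FRU-0002", "SN-B1"),
]


def find_module(server: str, dimm_slot: int) -> Optional[MemoryModule]:
    """Return the recorded module for a server and DIMM slot, or None."""
    for module in INVENTORY:
        if module.server == server and module.dimm_slot == dimm_slot:
            return module
    return None
```

With a record like this, `find_module("srv-01", 2)` answers the vendor's question without anyone walking to the machine room.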

And in general it is good to know what is inside the server, when the warranty expires, which project it is used for, what its configuration is, and so on. Once you have more than a hundred servers, it is hard to manage the infrastructure without a constantly updated map. We do not want to leave this to carriers of unique knowledge, such as experienced operators who know it all by heart.

You also need a documented service-resource model and a stock of spare parts, and you need to understand what to do in case of an incident. When a new client arrives in, say, the segment that is certified for GIS PD, we must know exactly what equipment and how many licenses we have, and which resources we could connect.

We started with cards like this for each server:

As you can see, all serial numbers are recorded, the hardware specifications are entered, and operations staff can leave comments on specific pieces of hardware. The card also shows when support started and when it ends, so you can build reports on it, which is useful for planning expenses.
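A report on support end dates can be as simple as the following sketch. It is only an illustration in Python: the card contents, the 180-day window, and the function name are assumptions, not the report we actually run.

```python
from datetime import date, timedelta

# Hypothetical cards: (server name, support end date).
CARDS = [
    ("srv-01", date(2019, 3, 1)),
    ("srv-02", date(2018, 7, 15)),
    ("srv-03", date(2021, 1, 10)),
]


def support_ending_soon(cards, today, within_days=180):
    """List servers whose support ends within the window, soonest first."""
    deadline = today + timedelta(days=within_days)
    expiring = [(name, end) for name, end in cards if today <= end <= deadline]
    return sorted(expiring, key=lambda item: item[1])
```

Passing `today` explicitly keeps the report reproducible, which helps when the same list goes into a budget plan.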

The card also shows where everything is located and what is installed inside. You can see the following hierarchy:

The cards appear in this list:

We started with servers, then moved on to switches and transceivers, then began recording the elements of the SCS (structured cabling system): how much there is, where everything is, how it is connected. All servers are described: you can open any of them and see where it is, in which data center, room, rack, and unit, what is plugged into it, and how it is connected. We linked this to the accounting department's fixed-asset records: they count the server as a single unit, and now we know which network cards and drives are installed in it. The disks are generally identical and differ only by serial number, and previously it was unclear which one was where. Now each one even has the contract under which it was supplied attached to it. The practical convenience is that accounting books a large amount of equipment as one fixed asset, while we break it down by rack, and those racks can be in different halls, or the items can even sit in the spare-parts stock. Now, for any inventory request from accounting or a client, we can quickly find every item of the asset.
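The link between one accounting asset and many physical items can be pictured with a small Python sketch. The asset numbers, serials, and location strings below are invented for the example and do not reflect the real database schema.

```python
from collections import defaultdict

# Hypothetical component records: (asset number, kind, serial, location).
COMPONENTS = [
    ("FA-100", "server", "SRV-1", "DC1/hall 2/rack 14/unit 7"),
    ("FA-100", "disk", "DSK-1", "DC1/hall 2/rack 14/unit 7"),
    ("FA-100", "disk", "DSK-2", "spare parts"),
    ("FA-200", "switch", "SW-1", "DC2/hall 1/rack 3/unit 40"),
]


def items_of_asset(components):
    """Map each fixed-asset number to the items booked under it."""
    by_asset = defaultdict(list)
    for asset, kind, serial, location in components:
        by_asset[asset].append((kind, serial, location))
    return dict(by_asset)
```

An inventory request for asset `FA-100` then returns all three items, including the disk that sits in the spare-parts stock.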

An example of a disk card: we know everything about the disk

Now the processes

Once the hardware is in order, you need to start writing down processes for its further maintenance. For example, a bank moves into the cloud and you need to add capacity for it. We need a problem statement with the architecture and a description, to determine what to add, where to add it, how to configure it, and how to organize the network.

Previously, we discussed this by email, the task was described in the task tracker, and then, when everything was agreed, operations went and did it.

Now there are two types of processes. If a change is standard, then all the individual pieces of the process, such as “allocate a port there”, are already described (including the work plan, the rollback plan, the test plan, the assigned executors, and so on), and anyone can put it on the change calendar in ServiceNow. Since the process is standard, no approvals are needed: responsibility is assigned immediately, deadlines are set, and it all lands on specific performers. Each change has an SLA, so the requester knows the time frame for the task.

An example is port allocation. A port cannot be allocated just anywhere. First, there are, say, 100 different switches with 48 ports each. That is 100 × 48 = 4,800 ports, but it does not mean that any of them can be allocated. The switches are in different data centers and halls. You need to know which ports you can allocate and which you cannot. With the standard change plan, the engineer knows where and which ports to allocate and does not touch previously reserved ones. Second, the port has to talk to something, so the right settings must be applied (security, speed, bandwidth limits, QoS, and so on); all of this is either described in the request or specified in the change plan. Third, we know how much is allocated and how much is reserved, which is very useful for subsequent design.
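The first point can be shown with a short Python sketch. It is an assumption-laden illustration: the hall and switch naming, the spread of switches over four halls, and the "first free port" rule are invented here and are not the actual change plan.

```python
from typing import Iterable, Optional, Set, Tuple

Port = Tuple[str, str, int]  # (hall, switch name, port number), hypothetical naming


def allocate_port(candidates: Iterable[Port], hall: str,
                  reserved: Set[Port], allocated: Set[Port]) -> Optional[Port]:
    """Pick the first port in the hall that is neither reserved nor allocated."""
    for port in candidates:
        if port[0] == hall and port not in reserved and port not in allocated:
            allocated.add(port)  # record the allocation for later design work
            return port
    return None


# 100 switches with 48 ports each, spread over four hypothetical halls.
all_ports = [(f"hall-{s % 4}", f"sw-{s:03d}", p)
             for s in range(100) for p in range(1, 49)]
```

Keeping `reserved` and `allocated` as explicit sets also gives the totals mentioned in the third point for free: their sizes are the reserved and allocated counts.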

If a change is non-standard, then after registration it goes to the change manager. What is to be done and how is designed: the architecture for implementing the change and the work plans are prepared, risks and rollback plans are evaluated, downtime is determined, the people responsible for the work are named, and then everything is sent for approval... In general, the same scripted sequence is written out, with all the branches from the previous item. The person who writes it describes the risks (as he sees them) and prepares a rollback plan for each case. After that, the change goes to the change advisory board (CAB). CAB members are defined in advance by area of responsibility. Each member reviews the change plan and the rollback plans and either agrees or not. If necessary, the plan is reworked. If necessary, maintenance windows are agreed. After all approvals, tasks are assigned to the performers. In practice, though, it turned out that we are not that big. In the production system the CAB is kept “for growth” as an entity; in fact, it consists of managers and/or architects.
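The difference in approvals between the two kinds of change can be sketched as follows. This is only an illustration in Python: the function and parameter names, and the rule that a non-standard change with no CAB members defined cannot proceed, are assumptions and not the ServiceNow configuration.

```python
def ready_for_execution(standard: bool, cab_members: set, approved_by: set) -> bool:
    """Standard changes skip approval; others need every CAB member to agree."""
    if standard:
        return True
    # Assumption: an empty CAB means nobody can approve, so the change waits.
    return bool(cab_members) and cab_members <= approved_by
```

For example, a non-standard change with a CAB of `{"network", "security"}` stays blocked while only `"network"` has agreed.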

Each change has a change manager (the responsible person) who is accountable for the change; the duty shift follows all changes, watches the monitoring, and communicates with customers in emergencies. A change is closed only after the duty shift confirms that the service works and the change manager has checked it against the test plan and updated the CMDB and the documentation.

The point is that the result must be recorded in the documentation, and everything done during the change must be reflected in the asset database. At the very least, the connectivity data is updated. Only after that is everything closed.
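The closure rule reads naturally as a checklist. Here is a minimal Python sketch of it; the class and field names are made up for the example and do not come from the ITSM system.

```python
from dataclasses import dataclass


@dataclass
class ChangeResult:
    shift_confirmed_service: bool  # duty shift confirms the service works
    test_plan_passed: bool         # change manager checked against the test plan
    cmdb_updated: bool             # asset database reflects what was done
    docs_updated: bool             # documentation records the result


def missing_for_closure(result: ChangeResult) -> list:
    """Return the names of the conditions that still block closing the change."""
    return [name for name, done in vars(result).items() if not done]
```

A change can be closed only when the returned list is empty; otherwise the list says exactly what is still owed.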

These are the processes we began to write:

The difference between working with a process and without one is that with a process there is more documentation and bureaucracy, but it is clearer what to do. Previously, everyone did things the way they wanted. Of course, while there were few people, all of this rested on unwritten standards. Once we grew, they had to be written down, and besides, without this you cannot manage a large number of processes. Areas of responsibility appeared. Different SLAs for different cases appeared. An obligation appeared to prepare plans, to warn everyone concerned, and to coordinate and document everything at the end. Then all of it was automated in ITSM:

Knowledge base

All work within changes is tied to configuration items. When we register a change, we know there is a standardized process: how the work should be performed and on which equipment, and the circle of people is defined: who approves, who performs, who does what if something goes wrong, and so on.

We immediately see the risks and, in the future, the cost of the work as well.

But this is not enough.

You need a knowledge base that describes what the process does not. For example, a rare bug in the firmware. Or how best to assemble a rack. Or how to connect things. Here is an example entry:

It is built mainly from the documents the architect writes at the planning stage. But sometimes engineers and change managers add something new after doing the work.

You have no idea what a thrill it is to look at the records for a piece of hardware and know what was done with it and how it was before. And what a thrill it will be in 3-4 years, I just cannot convey.

Warehouse

The next part of the story is warehousing. We have spare parts, tools, and packaging from test equipment. We kept all this right in the offices where the duty shift, the maintenance department, and the architects sit. When engineers were getting ready for an installation in the data center, they would go around the cabinets collecting equipment and consumables (patch cords, transceivers, and so on). This leads to chaos, and chaos is cool only at first.

Again, a little bureaucracy, and now we have described the spare parts (more precisely, most of them; a number of things still live in the “box of small stuff”). The software supports address storage, but the warehouse is not yet organizationally ready to implement it; once it is, each item will have coordinates in the form of a place on a rack. When an engineer is assigned an installation task, something like “take this switch there, then these SFPs, these patch cords” will be attached to the task, along with their racks and shelves.
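What such an attachment might look like once address storage is in place can be sketched in Python. The part names, rack labels, and shelf numbers are invented for the example; the warehouse software itself is not shown here.

```python
# Hypothetical storage addresses: part name -> (rack, shelf).
ADDRESSES = {
    "switch": ("R1", 2),
    "SFP": ("R1", 4),
    "patch cord": ("R2", 1),
}


def pick_list(task_items):
    """Turn a task's parts list into lines telling where to find each part."""
    lines = []
    for part, quantity in task_items:
        rack, shelf = ADDRESSES[part]  # raises KeyError for an unrecorded part
        lines.append(f"{quantity} x {part}: rack {rack}, shelf {shelf}")
    return lines
```

Failing loudly on a part without an address is a deliberate choice in this sketch: a missing coordinate means the item is still in the “box of small stuff”.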

Roles in changes

I have already partly described change management. Another part of this story is that we have a distribution of roles:

That is, if you are from the VMware network group, you can submit a request only in the VMware zone. Each role has a list of processes it can perform and a list of tasks its members can create. That is, the company's entire organizational structure can also be entered into ITSM.
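The zone restriction amounts to a simple lookup. Below is a minimal Python sketch of it; the role names, the zone names, and the mapping itself are assumptions made for the example, not the real role model.

```python
# Hypothetical mapping of roles to the zones where they may submit requests.
ROLE_ZONES = {
    "vmware-network": {"vmware"},
    "storage": {"storage"},
    "change-manager": {"vmware", "storage", "network"},
}


def can_submit(role: str, zone: str) -> bool:
    """Return True if members of the role may submit a request in the zone."""
    return zone in ROLE_ZONES.get(role, set())
```

An unknown role gets an empty set of zones, so it is denied by default.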

Who fights chaos?

The whole team fights it, but I am responsible for the result. By the way, there is still a lot of work. My position is called “program manager”. It is something like a project manager whose task is to maintain a plan and a list of interdependent projects that must be implemented to achieve a common goal.

Before that, I brought order to retail: in particular, I automated accounting and deliveries in a large retail chain to get correct accounting, reduce the cost of receiving and recording goods, and shorten the time it takes goods to reach the shelf. I can say that in an IT company the change process goes much more smoothly: everyone understands why it is needed.

When it became clear that any new engineer could pull up the documentation, see what had been done, read the configuration logs (not only those where someone took care of posterity out of the kindness of their heart, but any) and not have to collect folklore from colleagues, I understood where we were heading.

At the start there was nothing, and we began work from several sides. The full result has not yet been achieved, but in six months we have gained a much clearer understanding of our hardware, processes, and changes. The cloud is evolving, and this is a logical step for it. It was not necessary to do this right away, but now is the time.

Any day now we are putting a new warehouse into test operation, and we continue to fight chaos along other vectors. There is no end of work, but the results are visible to everyone. If you are interested, I can tell a couple of stories from retail: there, the same kind of projects are complicated by the fact that end users and heads of other departments sometimes actively resist them. And on top of that, everyone has to be trained, and for many you have to overcome the “we were working fine already, don't touch it” mindset.

The text was prepared by Boris Kosolapov, project manager for automation at Technoserv Cloud.

Source: https://habr.com/ru/post/352038/
