Checking the contractor's UPS maintenance work against the checklist.
Hi, Habr! My name is Kirill Shadsky. I now design and build data centers and server rooms. Before that, I led the DataLine data center operations service for a long time (about 3,000 racks at the time). Together with my team, I passed the Uptime audit of operational processes (Management and Operations) with a score of 92 out of 100, and together with my colleagues I took part in the NORD 4 certification. Today I want to explain how to divide the operation of a data center or server room properly between your own team and contractors.
It is hard to run a data center using only your own staff or only a contractor. In all my experience I have never seen either option in its pure form; it is almost always some kind of hybrid. What stays with your own team and what goes to contractors is for each company to decide, based on finances, convenience, the availability of qualified engineers (try finding a diesel rotary UPS specialist in Tula) and sometimes politics. But no matter how wonderful your contractor is, some things are best kept in-house. Those are what we will talk about below.
Before dividing operation between your own team and the contractor, let's recall what the process involves. I will not go into detail on each item, since whole books could be written on the subject. I will only highlight the main points, which can be loosely divided into technical and organizational.
Technical points:
Organizational points:
Everything in the technical part can, and sometimes must, be outsourced. In that case you are left only with managing and supervising the contractors. I will explain below who on your side should do this.
The organizational part is harder. Almost everything on that list you will have to do yourself. Let's see why.
Record keeping. Regulations and instructions are needed so that the whole operations team has the same understanding of the processes and procedures (for example, how to test the diesel generator set), and so that the "sacred knowledge" does not disappear when engineer Vasya falls ill or quits. In theory, writing the documentation can also be entrusted to the contractor, especially since not every server room engineer is able or willing to do paperwork. But the truth is that nobody knows your processes better than you do, and tracking every change and keeping the documentation current without working on site all the time is "mission impossible". As an alternative, you can develop the documentation together with the contractor and then keep it up to date yourself on site.
Collection and analysis of statistics. The situation is about the same as in the previous paragraph, so we pick up a pen or keyboard and methodically keep a "medical history" for every air conditioner, diesel generator set and so on down the equipment list. Once a quarter, every six months or at least once a year, we review it to understand what breaks and how often. This information is useful for drawing up the operating budget and planning spare parts, and it also helps to identify equipment that repairs will no longer help and that needs to be replaced entirely.
List of breakdowns and types of repair for one of the air conditioners.
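The "medical history" described above can be kept in any form; as a minimal sketch, assuming each breakdown is stored as a simple record, the periodic review could look like this. The record fields, the unit names and the threshold of three breakdowns are hypothetical and chosen only for illustration, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RepairRecord:
    unit: str          # equipment identifier (hypothetical naming scheme)
    day: date          # date of the breakdown
    fault: str         # what broke
    repair_type: str   # what was done about it


def failures_per_unit(records, since):
    """Count recorded breakdowns per unit on or after `since`."""
    return Counter(r.unit for r in records if r.day >= since)


def replacement_candidates(records, since, threshold):
    """Units with at least `threshold` breakdowns in the review period."""
    counts = failures_per_unit(records, since)
    return sorted(unit for unit, n in counts.items() if n >= threshold)


# Made-up sample history, for illustration only.
history = [
    RepairRecord("AC-03", date(2017, 1, 12), "compressor trip", "repair"),
    RepairRecord("AC-03", date(2017, 3, 2), "fan failure", "part replaced"),
    RepairRecord("AC-03", date(2017, 5, 20), "refrigerant leak", "repair"),
    RepairRecord("DGS-1", date(2017, 2, 8), "starter battery low", "part replaced"),
    RepairRecord("AC-01", date(2016, 11, 3), "sensor fault", "repair"),
]

candidates = replacement_candidates(history, date(2017, 1, 1), threshold=3)
print(candidates)
```

The same counts can feed the spare parts plan: units that fail often are the ones whose parts are worth keeping in stock.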
Control of installation of IT equipment and power management. Many people forget about this, and they shouldn't. An IT specialist sees a free unit and mounts the equipment without checking whether the rack has enough power and cooling, or whether the equipment is installed correctly at all. Then the operations engineer gets all the complaints: about power blips (because a server with a single power supply was connected without an ATS, or both power supplies were plugged into one PDU) or about equipment slowing down because of local overheating.
To reduce the number of problems in this area, write clear instructions and checklists for those who install equipment, and periodically check how IT equipment is installed (especially carefully once the hall is more than 50% loaded). The frequency of inspections depends on how often new equipment appears in the machine hall.
Algorithm for processing the request for the installation of new equipment.
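Part of such a checklist can be automated. The sketch below covers only the power-related mistakes mentioned above (rack power budget, a single power supply without an ATS, both power supplies on one PDU); cooling is not modelled. All class names, field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    power_limit_w: int   # power allocated to the rack
    used_power_w: int    # power already consumed by installed equipment


@dataclass
class Device:
    name: str
    power_w: int
    psu_feeds: list            # PDU each power supply is plugged into, e.g. ["A", "B"]
    behind_ats: bool = False   # True if the device is fed through an ATS


def installation_problems(rack, device):
    """Return a list of power-related problems for a planned installation."""
    problems = []
    if rack.used_power_w + device.power_w > rack.power_limit_w:
        problems.append("rack power budget exceeded")
    if len(device.psu_feeds) == 1 and not device.behind_ats:
        problems.append("single power supply without ATS")
    if len(device.psu_feeds) > 1 and len(set(device.psu_feeds)) == 1:
        problems.append("all power supplies on one PDU")
    return problems


rack = Rack(power_limit_w=5000, used_power_w=4600)
bad = Device("srv-1", power_w=500, psu_feeds=["A", "A"])
good = Device("srv-2", power_w=300, psu_feeds=["A"], behind_ats=True)

print(installation_problems(rack, bad))
print(installation_problems(rack, good))
```

An empty list means the request passes these particular checks; it does not replace a look at the rack itself.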
Planning work (maintenance and work orders). Together with the contractor we agree on the work schedule, taking staff workload into account (work on all systems should not fall in the same week). We also issue work orders and agree with the contractor on the form of work acceptance (acceptance certificate, checklist, etc.).
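The rule that work on all systems should not land in one week is easy to check mechanically once the schedule is written down. A minimal sketch, assuming the schedule is a list of (date, system) pairs and that you pick your own limit of systems per week; the system names and dates are hypothetical.

```python
from collections import defaultdict
from datetime import date


def overloaded_weeks(jobs, max_systems_per_week):
    """Return ISO weeks in which more distinct systems are serviced than allowed.

    jobs is an iterable of (day, system) pairs.
    The result maps (iso_year, iso_week) to the sorted list of systems.
    """
    by_week = defaultdict(set)
    for day, system in jobs:
        year, week, _ = day.isocalendar()
        by_week[(year, week)].add(system)
    return {
        wk: sorted(systems)
        for wk, systems in by_week.items()
        if len(systems) > max_systems_per_week
    }


# Hypothetical draft schedule: three systems in one week, one in the next.
schedule = [
    (date(2017, 7, 3), "UPS"),
    (date(2017, 7, 4), "cooling"),
    (date(2017, 7, 6), "DGS"),
    (date(2017, 7, 12), "fire suppression"),
]

print(overloaded_weeks(schedule, max_systems_per_week=2))
```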
Budgeting. Better to do it yourself. How you do it depends on what is customary in your company: monthly, quarterly or for the whole year at once, operating or capital. I will soon write a separate article on budgeting. If you hand this over to the contractor, guess what will happen to the budget? That's right, most likely it will grow. This will not even be out of self-interest on the contractor's part, but simply because he will not care about saving as much as you would.
Even if you somehow manage to hand over everything described above to the contractor, you still will not be able to sit with your feet on the desk and just pay the bills: contractors need to be trained and supervised.
To train contractors, first of all you need written rules for working in the data center or server room. Besides "do not drink, do not smoke and do not make a row", there are technical nuances. For example, the contractor should learn from you that when servicing air conditioners, no more than one may be disconnected at a time, and that before disconnecting one you must check that the other air conditioners are working properly.
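The air conditioner rule is a good example of something that can be written as a precise check rather than left to memory. A minimal sketch, assuming the state of each unit is known from monitoring; the status values and unit names are hypothetical.

```python
def may_disconnect(unit, status):
    """Decide whether `unit` may be taken out of service for maintenance.

    status maps unit name -> "ok", "fault" or "maintenance".
    The unit may be disconnected only if every other unit is "ok",
    which also guarantees that no more than one unit is off at a time.
    """
    if unit not in status:
        raise KeyError(f"unknown unit: {unit}")
    return all(state == "ok" for name, state in status.items() if name != unit)


hall = {"AC-01": "ok", "AC-02": "ok", "AC-03": "ok"}
print(may_disconnect("AC-02", hall))

hall["AC-03"] = "maintenance"   # one unit is already switched off
print(may_disconnect("AC-02", hall))
```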
Control over access to the site also stays on your shoulders. Checking that the access lists are current, what the access schedule is (round the clock or working days only), and that electrical safety certificates and other required permits are in place is your task and yours alone.
In general, remember that responsibility for the operability of the server room or data center ultimately lies with you, not the contractor.
Excerpt from the rules of work in our data centers for contractors.
The number of people in your operations team depends on the declared SLA, the amount of infrastructure and how much you plan to do yourself. I cannot give you a universal formula, but here is what you can go by.
In what mode do we provide services? If 24x7, we need a round-the-clock duty service of at least four people working in four shifts, 24 hours on and three days off. If 8x5, half as many people are needed.
How many engineers are needed? A lot depends on their duties. If they only need to watch the monitoring, one is enough; if they also need to make rounds, at least two. If they have to do hands-on work (run cross-connects, install equipment, change air conditioner filters), you will need three.
Do you keep spare parts and consumables? If you store almost everything, you will need a storekeeper or purchaser who tracks stock levels and orders replacements.
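The three questions above can be combined into a rough estimate. This is only a sketch of one possible reading: it assumes that the numbers of engineers (one, two, three) are per shift and that they are multiplied by the number of crews (four for 24x7, two for 8x5). That multiplication is an assumption made for illustration, not a formula from the text, and all names are hypothetical.

```python
ENGINEERS_PER_SHIFT = {"monitoring": 1, "rounds": 2, "hands_on": 3}
CREWS = {"24x7": 4, "8x5": 2}   # 8x5 needs half as many people as 24x7


def estimate_headcount(mode, duties, keeps_spares=False):
    """Rough duty-team size for the given service mode and set of duties."""
    per_shift = max(ENGINEERS_PER_SHIFT[d] for d in duties)
    total = per_shift * CREWS[mode]
    if keeps_spares:
        total += 1   # storekeeper or purchaser
    return total


print(estimate_headcount("24x7", ["monitoring", "rounds"]))
print(estimate_headcount("8x5", ["monitoring"], keeps_spares=True))
```

Treat the result as a starting point for discussion; the real number still depends on the SLA and the amount of infrastructure.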
This is what the team at our NORD site looks like, for 2,720 racks.
Job titles and headcount will differ from case to case, but one function must be present in every case: the function of "being responsible". I conventionally call this position "chief engineer". In our hierarchy this is the head of the operations service. His main function is to make decisions that are not up for debate: whether to call out the contractor's emergency crew, whether repair of a backup air conditioner can be postponed. He also gives the command to switch off equipment for maintenance, approves urgent repairs and unplanned purchases, and leads the rescue of the data center in case of accidents. He can also be called on as arbiter if the operations engineer or the contractor cannot agree with the power engineer on test runs of the diesel generator set.
In short, the "chief engineer" is ultimately responsible to the business or the customers for the entire operation and engineering infrastructure.
Let's sum up. The bare minimum for the operations service of a data center or server room is as follows:
If you have questions, send me a private message or come to my next seminar on July 4th, where you can ask about everything in person.
Other articles on managing the engineering infrastructure of data centers and server rooms:
→ The path of electricity in the data center
→ Data center design errors that you only feel during the operation phase
→ About vital data center operation
→ How to test the diesel generator set in the data center
→ Monitoring of engineering infrastructure in the data center. Part 1. Highlights
→ Monitoring of engineering infrastructure in the data center. Part 2. Power supply system
→ Maintenance of data center engineering systems: what should be in the contract
→ Dumb ways to die, or why data centers “fall”
Source: https://habr.com/ru/post/331902/