My name is Sergey Kubasov, and I am the deputy technical director of Mail.Ru Group. Recently, at the DCDE forum, I talked about our experience of building and running our own data center. Now I have decided to share what we learned with Habr's readers.

How it was
When our own data center first went into operation, we worked with five service organizations that maintained the air conditioning and ventilation, diesel generator sets, uninterruptible power supplies, electrical automation, and gas fire suppression systems. At the time this seemed the best solution, since the same people who built the systems were operating them, which meant they knew the systems inside out.
After working this way for three years, we took stock and made a list of the most serious risks of this mode of operation. The first is overselling of services: the contractor promises to respond to an emergency within five hours, but cannot meet that deadline if an accident happens at even one of its other clients. For example, if it serves three facilities and two of them already have an emergency, it will not reach the third within the time specified in the contract, or within any reasonable time at all. Geography and Moscow traffic also play their part here. However hard you try, during rush hour you will not reach a site on the outskirts of the city in under two hours.
The second risk of buying the full range of services from contractors concerns the supply of spare parts and consumables. When you buy a service package, the contractor will certainly add a markup of at least 10%. We ran into markups of 50-100% over the base price at which the same parts could be bought in the "store next door". The contractor also stretched delivery out to 4-6 weeks even when consumables were needed immediately, or offered urgent delivery of spare parts at a further inflated price "for urgency".
And last but not least, after a long period of cooperation some companies get complacent and, however critical the situation, respond only after payment. Suppose there is a force majeure event: the contractor says it is ready to come out only after you have transferred the money for the service and it has been credited to its account (which takes 1-2 business days). What happens in the meantime to equipment left without air conditioning or backup power, I think everyone understands.
After evaluating all the risks, we decided to create our own operation service.
Now Mail.Ru Group projects are located in five leased data centers and in our own data center.
Under the hood: the data center operations service
In our data center (total area 2,100 sq. m, about 450 racks, with equipment able to draw up to 4 MW of power) we have set up our own operations service. It can be divided into two components.
Maintenance of engineering systems: power supply, ventilation, diesel generator sets, UPS, fire suppression systems, and others. This area is headed by a lead electrical engineer, who is supported by duty shift engineers and by an air conditioning and ventilation engineer who serves the entire facility. These are the people who make the daily rounds, respond to incidents, and repair faults on the spot, without phoning around to work out whom to call and how to get out of the situation. This gives our data center an immediate response to any emergency. In addition, the staff themselves carry out scheduled work and measurements, correct deficiencies in how the systems operate, and upgrade the facility.
Server fleet maintenance. This area is managed by the head of the hardware engineering service. There is a duty shift here too. Their main task is to monitor and make sure someone responds: in other words, if they see a red signal instead of a green one, they open the corresponding instruction, find out who is responsible for that kind of situation, and contact that person.
Monitoring as the guarantee of a healthy data center
Speaking of monitoring: in data center operations it is extremely important to detect deviations from acceptable values in time and to react promptly. Naturally, competent monitoring plays a major role here. We set up monitoring of all the critical elements of our system, while trying not to wander into a technological jungle where yet another service is needed just to maintain the monitoring system itself.
Electricity
We monitor the status of the UPS from several angles, since it is a key element of the entire system. Several control systems are involved here; for example, on any change in the state of the power network, the entire management team of the operations service receives an SMS with information about the failure. About six people are subscribed to these alerts, so someone is certain to respond and take steps to fix the problem.
To monitor the state of the incoming power feeds, we decided to use not modern technology but the tried and tested military kind used in missile silos. These are ordinary indicator-lamp control panels: a green lamp means everything is fine, a red lamp means there is a problem. With a sophisticated monitoring system, you can end up in a situation where the data center is going down and you do not even know about it, because you are busy repairing the monitoring system itself. Our control panels, which cost a few thousand rubles, are very reliable: there is simply nothing in them to break!
Temperature and humidity
Another important point is the control of temperature and humidity in the data center. We built a system for monitoring these values, again taking the cheapest route: sensors that collect statistics and show the situation on a map. We did not use a boxed solution; we simply bought about a hundred sensors and assembled a system ourselves that constantly shows what the temperature is in each corner of the building. If a reading goes outside the permissible limits, it also sends an SMS to those responsible and raises an alert in the centralized monitoring system.
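The limit check and SMS fan-out described above could be sketched roughly as follows. This is only an illustration, not our actual system: the names `Reading`, `build_alerts`, and `notify`, and the numeric limits, are assumptions made up for the example, and the `send` callable stands in for whatever SMS gateway is used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Reading:
    """One sample from a sensor (hypothetical structure)."""
    sensor_id: str
    location: str
    temperature_c: float
    humidity_pct: float


# Illustrative limits only; the article does not state the real thresholds.
TEMP_LIMITS_C: Tuple[float, float] = (18.0, 27.0)
HUMIDITY_LIMITS_PCT: Tuple[float, float] = (20.0, 80.0)


def out_of_range(value: float, limits: Tuple[float, float]) -> bool:
    """Return True if value is below the lower or above the upper limit."""
    low, high = limits
    return value < low or value > high


def build_alerts(readings: Iterable[Reading]) -> List[str]:
    """Produce one alert message per out-of-limit value."""
    alerts: List[str] = []
    for r in readings:
        if out_of_range(r.temperature_c, TEMP_LIMITS_C):
            alerts.append(
                f"{r.location} ({r.sensor_id}): temperature "
                f"{r.temperature_c:.1f} C outside {TEMP_LIMITS_C}"
            )
        if out_of_range(r.humidity_pct, HUMIDITY_LIMITS_PCT):
            alerts.append(
                f"{r.location} ({r.sensor_id}): humidity "
                f"{r.humidity_pct:.1f} % outside {HUMIDITY_LIMITS_PCT}"
            )
    return alerts


def notify(alerts: Iterable[str], recipients: Iterable[str],
           send: Callable[[str, str], None]) -> int:
    """Send every alert to every recipient; return the number of messages sent."""
    recipients = list(recipients)
    sent = 0
    for alert in alerts:
        for recipient in recipients:
            send(recipient, alert)  # e.g. a call into an SMS gateway
            sent += 1
    return sent
```

Passing `send` in as a parameter keeps the check itself independent of any particular SMS provider, so the same alerts can also be forwarded to the centralized monitoring system.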
Round checklists and permissible limits
Where we used to have only instructions for the rounds, we now have checklists. A checklist is a sheet of paper that states what to look at during the round and gives the lower and upper permissible limits for each item. There is also a column describing what to do if a value has gone outside those limits. Each round produces a completed checklist; rounds are made every 4 hours. Completed checklists are filed in a logbook that you can consult at any moment to find out what the situation was yesterday or, say, last spring. Our logbook is a perfectly ordinary paper one; we chose this way of storing the data because the people who deal with electrical systems are sometimes very far from IT, yet they may still need to check the readings. A person with any level of qualification can take the folder from the shelf, open it, and find the information they need; we simply removed one more barrier to responding to an emergency.
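Our checklists are on paper, but the structure described above (an item, its lower and upper limits, and an action column) is easy to picture as data. The sketch below is purely illustrative: the class names, the sample item, and its limits are invented for the example and do not come from our real checklists.

```python
from dataclasses import dataclass
from typing import Dict, List

ROUND_INTERVAL_HOURS = 4  # rounds are made every 4 hours


@dataclass(frozen=True)
class ChecklistItem:
    """One row of the checklist (hypothetical fields)."""
    name: str
    unit: str
    low: float      # lower permissible limit
    high: float     # upper permissible limit
    action: str     # what to do if the value is outside the limits


@dataclass(frozen=True)
class Finding:
    """A measured value recorded against a checklist item."""
    item: ChecklistItem
    value: float

    @property
    def ok(self) -> bool:
        return self.item.low <= self.value <= self.item.high


def run_round(items: List[ChecklistItem],
              measurements: Dict[str, float]) -> List[Finding]:
    """Record one value per item; a missing measurement is an error."""
    findings = []
    for item in items:
        if item.name not in measurements:
            raise KeyError(f"no measurement recorded for {item.name!r}")
        findings.append(Finding(item, measurements[item.name]))
    return findings


def format_round(findings: List[Finding]) -> List[str]:
    """Render the completed checklist as lines of text."""
    lines = []
    for f in findings:
        status = "OK" if f.ok else f"OUT OF LIMITS -> {f.item.action}"
        lines.append(
            f"{f.item.name}: {f.value} {f.item.unit} "
            f"[{f.item.low}..{f.item.high}] {status}"
        )
    return lines
```

The point of the action column is the same on paper and in code: whoever does the round does not have to decide what an out-of-limit value means, only to follow the instruction next to it.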
Hole placement strategy and other discoveries
Unfortunately, in practice not everything went so smoothly. Many of the things we learned while building the data center, we learned the hard way. In the end we managed to resolve every emergency and unforeseen situation, but some cases we still recall with laughter and a shudder at the same time.
There were amusing incidents that could have led to very serious consequences. For example, we entrusted contractors with purchasing the air conditioners for the data center. As it turned out later, the contractors were guided by what was widely available on the market and settled on one of the simpler solutions from a well-known company.
To ensure a proper level of redundancy, we use dual-circuit air conditioners. During regular checks, we were surprised to find that on one evaporator the boiling point of the refrigerant was the normal, expected +3 °C, while on the other it dropped to -0.5 °C. In other words, our air conditioner had turned into a freezer. And this did not happen all the time, but only under certain weather conditions.

It turned out that the engineers had made a mistake in positioning the evaporators relative to the air intake opening: they were offset from the center, so the air flow and heat load were not distributed evenly between the circuits. In the end we fixed this manufacturer's flaw very simply: we sawed the missing opening in the casing.

Results
After all the upheavals and changes, we were able to set up the kind of service we could previously only dream of. We cut the incident response time from two hours to five minutes, optimized the purchase of spare parts and consumables, and reduced maintenance costs.
In our own data center we are constantly improving the engineering systems and working on upgrading the monitoring and dispatching system. Since we operate the systems ourselves, we are able to analyze their shortcomings and take steps to eliminate them.
For three years we worked exclusively with contractors. Two years ago we set up our own operations service, and we are pleased with it. Of course, we still use third-party services, but now only as a way of reducing risk. Some of our engineers single-handedly serve the whole facility. In case such a person cannot come on duty for some reason, we have a minimum-cost service contract for each of the systems, just so that we can call and request help if necessary.
Another big bonus is that we now have the qualifications and the experience to supervise the work of contractors. Since we service our data center ourselves, we can judge the quality of third-party work, from complete solutions down to an individual weld. And, of course, we are free to choose who does the work. We are not tied to any service organization: we can solve a problem ourselves, or we can order inspection and repair from a third-party company.
We apply the experience we have accumulated in our leased data centers as well. I hope it will be useful to you too, and perhaps you will want to share your own in the comments.