Part of the post-accident inspection: a thermal survey of the machine room

This story happened at one company's data center quite a while ago; all the consequences of the accident have since been eliminated, and improvements are being made to prevent the situation from recurring. Still, I believe this report of what happened will be interesting both to those who run data centers and to those who enjoy IT stories that read almost like detective fiction.
A scheduled outage was expected. Two utility feeds came into the data center, the owners knew about the situation in advance, prepared, and ran all the necessary tests. All that remained was to switch over to the diesel generators in the standard procedure.
Hour X
The outage happened exactly as the power engineers had planned: the uninterruptible power supplies carried the load, and then the data center segments switched over to the diesel generators one by one (to avoid high inrush currents). Everything as it should be. The data center had not yet reached its design capacity, so the diesel generators had almost a twofold power margin.
Hour P
The diesels ran for an hour and a half, after which the fuel in the tanks built into the containers ran out. Regulations do not allow diesel fuel to be stored in large quantities right next to a generator set; otherwise you face a very complicated fire-safety certification. So separate storage tanks are used everywhere, and diesel fuel is pumped over from them. The system began pumping fuel over, and at that point the diesel generators "coughed" and stalled.
Understandably, panic. The diesels shut down abruptly one after another; there was simply no time to do anything, and power in the data center was lost. A little later it became clear what had happened: the fuel had been sitting in the storage tank for quite a long time, and temperature swings had been steadily producing condensation. Then plain physics takes over: water is heavier than diesel, so it sinks to the very bottom of the tank, which is exactly where the intake draws from. As a result, every diesel scooped up pure water. The generator sets did not even have water separators, though they would not have helped anyway: a filter holds about half a liter, and there was far more water than that. And even if anyone had understood what was happening in time, it still would not have been possible to truck in diesel fuel quickly enough.
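The mechanics of the failure are simple enough to sketch in a few lines. The numbers below (water volume, tank floor area, intake height) are purely illustrative, not from the actual site; the point is only that the water layer forms exactly at the level of the fuel intake.

```python
# Illustrative sketch (hypothetical numbers): why the generators drew water.
# Water (density ~1000 kg/m^3) settles below diesel (~840 kg/m^3), forming a
# layer on the tank bottom -- exactly where the fuel intake usually sits.

def water_layer_height_m(water_volume_m3: float, tank_floor_area_m2: float) -> float:
    """Height of the settled water layer in a flat-bottomed tank."""
    return water_volume_m3 / tank_floor_area_m2

def intake_draws_water(water_volume_m3: float, tank_floor_area_m2: float,
                       intake_height_m: float) -> bool:
    """True if the intake pipe opening sits inside the water layer."""
    return water_layer_height_m(water_volume_m3, tank_floor_area_m2) > intake_height_m

# Suppose years of condensation accumulated ~40 L of water in a tank with a
# 2 m^2 floor, and the intake sits 1 cm above the bottom (numbers invented):
print(intake_draws_water(0.040, 2.0, 0.01))  # True: the pump feeds water, not fuel
```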
The result: during a scheduled outage the company got an emergency shutdown of the entire data center. About eight hours of downtime, which for a large company can amount to nearly a quarter's profit. The data center stood idle from the moment the diesels swallowed the water until utility power was restored.
Getting back up is not enough: you must never go down again
Management drew the appropriate conclusions and decided to rule out this entire class of problems in the future. That required an audit of the whole infrastructure to find its bottlenecks. At that point we joined the project and began the analysis.
We used the methodology of the Uptime Institute, an organization that collects a knowledge base of data center failures from all over the world and distills it into precise recommendations on how things should work. By the way, you can read more about building a data center to such stringent requirements in the post about our increased-responsibility data center. One interesting point: utility power is always counted as a single source (even if the feeds come from five power plants) and is considered supplementary, because it is not fully under the data center owner's control. In the sense that it may be there (which is good) or it may not (which is almost normal), and it can drop without any warning, a situation that is quite frequent. So when Uptime Institute experts analyze a data center's infrastructure, they treat the utility input simply as "present or absent"; they do not go into details and do not study the city grid diagram.
The autonomous part, on the other hand, matters greatly in the Uptime Institute's standards. The quality of the diesels, the fuel, the maintenance and so on are studied carefully. There is probably no need to explain that in Russia it is customary to do the opposite: everyone describes the utility side in loving detail ("this feed comes from over there, that one from over here, and this substation is very good..."). Anyone who remembers the Chaginskaya substation incident, when all the feeds went down at once, will understand why the Uptime Institute treats such lines the way it does. By the standard, you must correctly design, and keep in good condition, the infrastructure that you, as the data center owner, actually control.
Survey
The data center was designed and built to a fairly modern standard, using up-to-date solutions, but as everyone knows, the devil is in the details. The first thing our power engineers noted was a very convoluted and illogical power distribution scheme. It is built on programmable logic controllers from a vendor with a closed architecture, and it has a single "head": if that controller fails, there is no backup. And here we come back to what our colleagues at the Uptime Institute say. The main principle for Tier III (in our experience, the most in-demand level) boils down to a single sentence:
any one element can be taken out of service, and the system must retain its full performance.

The project documentation made no mention of compliance with any standards except construction codes and the electrical installation rules (PUE): a telltale symptom that you need to dig deeper and audit every detail. Another similar symptom is when the cited standards require no certification external to the designer. In such cases it often turns out that one thing is stated on paper and quite another is built on site. Fortunately, in our case the project documentation matched the actual implementation exactly.
Power
Once again: any element can be taken out of service at any time, and the data center must keep working as if nothing had happened. At this particular site almost every element was duplicated, except, for example, the controller of the automation system. It manages the numerous automatic transfer switches and routes the power distribution lines from the input switchboards down to the actual load (the server racks). On paper the lines are redundant, the switchboards are partially redundant, and so are the UPS and the diesel generator sets. Yet this one element, which costs next to nothing compared with the total cost of the infrastructure, is not redundant and can become an additional cause of failures and accidents. And fixing it quickly is unrealistic: to replace that one piece of hardware you would have to shut down the data center.
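A back-of-the-envelope reliability calculation shows why a single non-redundant controller matters so much. The availability figures below are invented for illustration; the qualitative conclusion (the series chain is capped by its weakest non-redundant link) does not depend on them.

```python
# Sketch with invented availability figures: a single non-redundant
# controller dominates the availability of an otherwise duplicated path.

def parallel(a1: float, a2: float) -> float:
    """Availability of two redundant (parallel) elements."""
    return 1 - (1 - a1) * (1 - a2)

def series(*elements: float) -> float:
    """Availability of elements that must all work (series chain)."""
    result = 1.0
    for a in elements:
        result *= a
    return result

ups        = parallel(0.999, 0.999)    # duplicated UPS
switchgear = parallel(0.9995, 0.9995)  # duplicated distribution boards
controller = 0.999                     # the single automation controller

with_spof    = series(ups, switchgear, controller)
without_spof = series(ups, switchgear, parallel(0.999, 0.999))  # if duplicated

print(f"{with_spof:.6f}")     # the lone controller caps the whole chain
print(f"{without_spof:.6f}")  # duplication removes the bottleneck
```

With these numbers the whole duplicated chain is dragged down to roughly the availability of the one cheap controller, which is the auditors' point.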
An enlarged graph of current measurements at the switchboard input. The phases are loaded evenly; inrush currents appear periodically.
Enlarged graph 2: a current dip at the switchboard input. After the dip, the load stays at the same level for 5-6 seconds, then rises in steps back to the previous level.

The next point is the operation of the fuel supply system and the synchronization of the diesel generators. The fuel system is critical to the data center's operation and must be redundant. As for the absence of water separators, that is a matter of Western designers underestimating our local realities. Synchronization of the diesel generators is another point the Uptime Institute tests with a passion. When you start them all, one of them appoints itself the leader and sits on the common bus, and all the other machines take turns trying to synchronize with it. Roughly speaking, the sine waves of the alternating voltage must coincide so that there is no short circuit. As soon as a machine is synchronized, it closes its contactor and connects itself to the common bus. In this way all the diesels join the common line one after another. For all of this to go smoothly, modern diesel generators communicate with each other, that is, they are joined into a common information network. Another advantage of this approach is that the machines can balance the load between themselves.
The Uptime Institute proposes starting from the following postulate: imagine there is no external grid and you run on diesel generators all year round. Now it is time, say, to replace the control panel of one of the diesels. You shut it down; the rest have enough capacity to carry the data center. But then you have to disconnect it from the bus that ties the machines together and remove its control panel. Not a single off-the-shelf diesel generator set that we sell will pass this test, because their communication bus is, in effect, linear: the machines are daisy-chained, so break it in the middle and the right half has no idea what the left half is doing. For our increased-responsibility data center "Compressor", other control systems were built specifically to meet this requirement: the bus is looped into a ring. We can break it anywhere, no data is lost, and the machines keep talking to each other. Few people designing data centers know about this, and so there is yet another bottleneck. It may seem a petty detail, extremely unlikely to matter, but a chain of just such probabilities is what produces the worst situations, the ones you really want to avoid.
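The difference between a linear and a looped bus can be demonstrated with a tiny connectivity check. The four-genset topology below is hypothetical; the point is that removing any single link leaves a ring connected but splits a line.

```python
# Sketch (hypothetical topology): why a looped ("ring") communication bus
# between gensets survives a single break while a linear bus does not.
# Remove one link, then check whether every machine can still reach the rest.

from collections import deque

def still_connected(n_nodes: int, links: set) -> bool:
    """Breadth-first search: can node 0 reach all other nodes?"""
    adj = {i: [] for i in range(n_nodes)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    seen, queue = {0}, deque([0])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == n_nodes

n = 4  # four gensets
linear = {(0, 1), (1, 2), (2, 3)}
ring   = linear | {(3, 0)}  # same bus, but looped back on itself

# Break the middle link on each topology:
print(still_connected(n, linear - {(1, 2)}))  # linear bus splits -> False
print(still_connected(n, ring - {(1, 2)}))    # ring reroutes -> True
```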
Further tests
We also saw a suboptimal arrangement of the racks inside the data center. It seems a banal thing, much talked about and heard by everyone, and yet the undoubtedly competent and experienced specialists running the data center on the customer's side had made several mistakes in equipment placement.
Our air-conditioning specialists proposed a simple solution that, once implemented, would significantly improve the efficiency of the cooling system. The customer is also considering a capacity increase, where this will be all the more useful as the load grows. We additionally wrote up recommendations on what to do to increase capacity, and we evaluated the structured cabling system (SCS) and the suite of security systems; no remarks there, everything had been done correctly.
The main observations concerned the most important systems. For example, we took thermograms and searched for local overheating zones, and discovered an ejection zone:

The effect of ejection through the floor grille

There was a local overheating zone: a rack where the air jet did not reach high enough. The bottom of the rack was cooled, but the top was not. When a rack is not fully populated with servers, blanking panels should be installed to keep the hot and cold aisles separated; otherwise air leaks across. By and large, it was enough to raise the fan speed of the fan coil units: before that, the lower servers were cooled more than needed while the upper ones ran warm. A second example: there is a wall of blocks, and the mortar joints are visible on the thermogram because their thermal conductivity is higher than that of foam concrete. Naturally, this does not signal any problem; it is just a nice demonstration of what modern thermal imagers can do.
Rack area
The hot aisle zone of the racks
The cold aisle zone of the racks

At the same time we checked the quality of the data center's power supply: we measured the power quality at the input using specialized instruments, and everything there was fine. Then we tested several parameters of the SCS, measured the air flow rate, and compared the installed capacity of the fan coil units against the actual one: everything matched, everything worked fine.
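Raising the fan coil fan speed, the fix mentioned above for the under-cooled rack tops, is not free, and the standard fan affinity laws show why: airflow scales linearly with speed, but power draw scales with its cube. The 20 % speed increase below is an illustrative figure, not one from the audit.

```python
# Fan affinity laws: what a fan coil speed increase buys and what it costs.
# Airflow ~ n, static pressure ~ n^2, shaft power ~ n^3.

def scaled_fan(flow: float, pressure: float, power: float, speed_ratio: float):
    """Apply the fan affinity laws for a change in rotational speed."""
    return (flow * speed_ratio,           # airflow scales linearly
            pressure * speed_ratio ** 2,  # pressure scales with the square
            power * speed_ratio ** 3)     # power scales with the cube

# Normalize the baseline fan to 1.0 in each quantity, then add 20 % speed:
flow, pressure, power = scaled_fan(1.0, 1.0, 1.0, 1.2)
print(f"flow x{flow:.2f}, pressure x{pressure:.2f}, power x{power:.3f}")
```

So a modest speed bump extends the cold-air jet's reach up the rack at a disproportionate energy cost, which is why blanking panels are usually the first remedy.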
Summary
It all began with an accident and ended in an analysis that revealed further shortcomings and produced recommendations for an optimal capacity increase.
In Russia, instead of Tier III by the Uptime Institute methodology, "Tier III per TIA-942" is often claimed. TIA-942, in essence, describes the requirements you should meet for your data center to look like a Tier III facility, but whether you actually meet them is up to you. It is an advisory standard, and the assessment procedure is declarative: people build something and simply announce that it is Tier III. And since a "Tier III" level exists both at the Uptime Institute and in TIA, customers are often misled. To check whether a site has actually been certified, just go to the Uptime Institute website.
We have audited data centers many times (and helped build them), so we know that even when both the customer and the contractor are serious, it is still better and safer to bring in a third party for the audit. However good the suppliers, designers and installers are, they can make mistakes. With Tier III, for example, both the design and its implementation on site are checked by the folks from the Uptime Institute, people who for years have done nothing but walk through data centers and hunt for flaws in them. Getting their certification is hard, but with a careful approach the quest can be completed, and the data center really does run with four-nines availability.