
A history of UPS accidents

Continuing and complementing the theme of the issue "Failures and accidents in the data center", we will share several observations, without claiming to offer a serious cause-and-effect analysis. Some episodes may strike the reader as amusing, although everything that happened was quite serious. We hope these rather instructive stories will help readers draw their own conclusions.

A textbook scenario

Ironically, the accident discussed below occurred only three months after the customer and the general designer received a warning letter from our company recommending the installation of an external maintenance bypass for the uninterruptible power supply. There was no reaction to the warning.
During the construction of the data center, our company supplied and installed a flagship Trinergy UPS system with a capacity of 1 MW. Although this UPS is equipped with a built-in maintenance bypass, we still recommended that the head organization install an external common maintenance bypass for the system, so that in the event of an accident or other adverse developments the source could be fully serviced without interrupting the power supply. The general contractor's specialists objected that the UPS already had a maintenance bypass that would allow its internal components to be serviced in any situation. Nothing foreshadowed trouble, and a complete failure of the entire system seemed out of the question.

The Uptime Institute's approach to resiliency under the Tier III requirements implies the use of an external bypass precisely so that the internal bypass itself can be serviced. In this case that principle was neglected: the parent organization refused to fit the system with an external bypass, either because of budget constraints or out of a desire to increase margins.

Meanwhile, the facility was designed and built in an existing building adapted to the needs of the data center, and, as always, the conversion was done in a hurry. The waterproofing of the old building was poor, but it was not redone. Three months after the equipment was installed, spring meltwater began to flood the UPS, and the water came not from below but from above, essentially from under the ceiling. Quite a lot of water got in: a short circuit occurred inside the UPS and the source, with a "loud bang" (according to eyewitnesses), burned out.

Only at that moment did it become clear that repairing the UPS and redoing the waterproofing without shutting down the entire data center was simply impossible: the source fed the data center centrally through its built-in bypass. As a result, despite the high level of redundancy (a modular system with N+2 redundancy), after two power modules failed the data center's power supply ceased to be uninterruptible, and everyone became a hostage to the situation.

It should be noted that the UPS system itself performed admirably: it held on and did not drop the load. Only the power modules onto which the most water had poured burned out, while the remaining three modules, which received less water, stayed operational. But since the source carried the entire data center on its own, that is, the facility's entire power supply passed through it, and there was no external maintenance bypass, the UPS had to be switched off completely in order to repair the damaged power modules.
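As an aside, here is a back-of-the-envelope sketch of how N+2 module redundancy degrades as individual power modules fail. The module count and rating below are purely hypothetical and are not taken from this incident; the point it underlines is the real lesson of the story, namely that module-level redundancy says nothing about what happens when the whole frame must be de-energized for repairs and there is no external maintenance bypass.

```python
# Hypothetical N+2 redundancy arithmetic (illustrative numbers only,
# not the actual configuration of the UPS in this story).

LOAD_KW = 1000              # 1 MW critical load
MODULE_KW = 200             # assumed rating of a single power module
N = LOAD_KW // MODULE_KW    # modules needed to carry the load: 5
SPARE = 2                   # N + 2 configuration -> 7 modules installed


def remaining_capacity_kw(installed: int, failed: int, module_kw: float) -> float:
    """Capacity left after `failed` modules drop out of service."""
    return max(installed - failed, 0) * module_kw


for failed in range(4):
    cap = remaining_capacity_kw(N + SPARE, failed, MODULE_KW)
    if cap >= LOAD_KW + MODULE_KW:
        status = "still redundant"
    elif cap >= LOAD_KW:
        status = "carries the load, no spare left"
    else:
        status = "cannot carry the load"
    print(f"{failed} module(s) failed: {cap:.0f} kW available -> {status}")
```

Even while spare capacity remains on paper, repairing a centrally connected frame with no external bypass still means a full shutdown.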

As a result, however painful it was for the customer, a time had to be chosen and the data center stopped, after which the UPS was fully restored and the source was fitted with an external maintenance bypass. For the company that owns the data center, the stoppage was extremely critical and painful.

The accident had several causes. The first was the rush during construction and the poor waterproofing. In addition, the batteries stood on archival racks with plywood shelves; the data center was a picture of complete eclecticism, with the most advanced equipment sitting next to "artifacts" from the end of the last century.

In Uptime Institute terms, a system designed to Tier II does not allow every element to be serviced without disconnecting the load, and that is exactly what was demonstrated in this case. This accident belongs to the class of incidents that cannot be remedied without stopping the data center.

This is a textbook case: the customer is warned about possible risks, prefers to brush them off, and then ends up in exactly the situation he was warned about. Meanwhile, the cost of a maintenance bypass unit for a 1 MW source is negligible compared to the losses from a data center shutdown.

As a result, for a long time (more than six months), while the moment to stop the data center was being chosen, all of the IT systems ran without any protection at all! So much for risk management. It should also be understood that, like any machine that has been "drowned", the UPS system became markedly less reliable after the accident: its various components began to fail more often than one would expect from a system that had not been through such stress.

Rush-job data center construction
One might expect to hear this story from companies specializing in the maintenance of ventilation and air conditioning systems; coming from electricians it sounds incredible. Unfortunately, it is true, and it is an example of how, during data center construction, equipment can be put out of action at the construction stage, before it has ever been commissioned.

The data center is being built on the outskirts of the city. The contractor, seeking to shorten the construction schedule, presses suppliers to deliver equipment to the site even though the site is still far from construction-ready. At the same time, reports are sent "upstairs" (to the customer) that the equipment is on site. But it is precisely at this time that the most amazing things can happen to such equipment.

For example, at one such facility under construction the UPS was delivered far too early. The source stood unclaimed and unconnected for several months, and rodents did not fail to take advantage of it (judging by the traces of their activity): they built a nest there and settled in for the long haul. The workers ate their meals in the same room, and the rodents did not disdain the leftovers. The animals divided their quarters into zones: on one "floor" there was a nest; on another they ate; on a third, precisely where the printed circuit boards were located, they arranged a toilet.

When the time came to connect the equipment, service engineers, overcoming their disgust, approached the formally still-new equipment in respirators and rubber gloves. Of course, the UPS could not be started: the printed circuit boards had been destroyed by caustic liquid and required replacement.

As it turned out, once the equipment had been delivered to the site it no longer belonged to the supplier: the general contractor had accepted it and, without ever putting it into operation, already had unserviceable equipment on its hands. This was the price of the customer's short-sighted and unreasonable demand to bring all the specified equipment to the site at once, even though it was not yet needed. For those six months the equipment would have been far safer in the supplier's warehouse.

Authors: Sergey Ermakov, Stanislav Ilyenko

More than 20 incidents in Russian data centers are covered in the new issue of TsODY.RF magazine, No. 13, devoted to the topic "Accidents in Data Centers".

Source: https://habr.com/ru/post/272057/

