The most reliable uninterruptible power supply (UPS) is a cabinet with a single cable running through it. The joke sometimes rings true, especially in Russia, where the human factor often comes to the fore and anyone, from an engineer to the “grandmother-caretaker” at the back of the data center, can make a mistake. So the number of nines in the availability figure does not always reflect how reliable a facility really is.

An uninterruptible power supply system (SBE) is supposed to give facilities, applications and services the most reliable power possible. Occasionally, however, the SBE itself becomes the cause of a failure: one more point of failure. This is a problem many of you may face. The Latin saying “Praemonitus praemunitus” (“forewarned is forearmed”) sums up the purpose of this article. The material is based on real events and on personal experience from several workplaces. I now work at the system integrator R-Style, where I put this knowledge into practice and successfully avoid the situations described below. I hope the article helps you “arm yourself”. For reasons of professional ethics, details about the facilities are not disclosed, and any resemblance is coincidental. The accidents and their types are laid out item by item below, from the technical side.
Reference: the most common cause of data center failures is power failure (46% of all cases, according to data from the US IT Industry Information Association).
Accidents after one action
Accidents caused by common operating errors, which cannot be designed out at the SBE design stage.
Such situations are associated with the human factor. There are two main types of people who can bring down the data center:
- Engineers and electricians who are competent in general but have no experience of working at critical facilities and lack the specific knowledge. They can do everything by following the instructions of a more experienced colleague over the phone, with “trembling” hands.
- Experienced electricians with a pronounced urge to “improve” things. This is the kind of person the saying “one rationalizer is worse than two saboteurs” is about.
An example of an error by the first type of engineer:
The fairly routine procedure of transferring the SBE to bypass can end up de-energizing the critical load. The reason is the difference in the UPS input voltage windows: on average ±30% of the rated voltage for the main input, but ±10% for the bypass (as GOST requires). So when the voltage sags, the UPS can still run from the city grid without switching to batteries, while the bypass line is already unavailable. The engineer sees that the UPS has power at its input but attaches no importance to the message about the blocked bypass (most often the warning for this condition looks less alarming than, say, the one for a switch to batteries) and stops the inverter with a button or the output switch. As a result, the load is de-energized. To avoid this, analyze the situation before every action; sometimes it is enough to read the warnings on the UPS display and understand them.
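The voltage-window mismatch described above can be illustrated with a minimal sketch. The 230 V nominal value, the class `UpsLimits` and the function names are assumptions made for illustration; the ±30% and ±10% windows come from the text, and real limits must be taken from the datasheet of the specific UPS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpsLimits:
    """Illustrative input windows; real values come from the UPS datasheet."""
    nominal_v: float = 230.0        # assumed nominal phase voltage
    mains_tolerance: float = 0.30   # main (rectifier) input: about +/-30 %
    bypass_tolerance: float = 0.10  # bypass input: +/-10 %


def within(voltage: float, nominal: float, tolerance: float) -> bool:
    """Return True if voltage lies inside nominal +/- tolerance."""
    return abs(voltage - nominal) <= nominal * tolerance


def bypass_transfer_status(input_v: float, limits: UpsLimits = UpsLimits()) -> str:
    """Classify the input voltage before anyone stops the inverter."""
    mains_ok = within(input_v, limits.nominal_v, limits.mains_tolerance)
    bypass_ok = within(input_v, limits.nominal_v, limits.bypass_tolerance)
    if bypass_ok:
        return "bypass available"
    if mains_ok:
        # The dangerous case: the UPS still runs from the grid,
        # but the bypass line cannot take over the load.
        return "bypass blocked"
    return "on battery"


def safe_to_stop_inverter(input_v: float, limits: UpsLimits = UpsLimits()) -> bool:
    """The inverter may be stopped only if the bypass can carry the load."""
    return bypass_transfer_status(input_v, limits) == "bypass available"


if __name__ == "__main__":
    for volts in (225.0, 190.0, 150.0):
        print(volts, bypass_transfer_status(volts), safe_to_stop_inverter(volts))
```

At 190 V the sketch reports a blocked bypass even though the main input is still inside its window, which is exactly the state in which stopping the inverter drops the load.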
An example of an error by the second type of engineer:
Almost all UPSs offer two kinds of battery test: an autonomy test down to full discharge (1.65 V per cell on average) and a timed test, for example a one-minute test, which records the discharge curve so that the state of the battery can be judged from it. The engineer at one facility, believing the autonomy test to be the more correct one, ran it periodically and logged all the readings. Then one day the external power supply failed immediately after such a test: the batteries were discharged, there was no backup, the diesel generator set did not start, and the load went down. In most cases the automatic tests are enough to assess the batteries. The autonomy test is needed, for example, during commissioning on a ballast load in order to sign the acceptance certificates. If the test is needed anyway, for example to establish the autonomy after a battery replacement, do it in a dedicated maintenance window and hedge by starting the diesel generator set beforehand.
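The two precautions named above (a maintenance window and a running diesel generator set) can be written down as a simple pre-check. This is only a sketch: `SiteState` and both function names are hypothetical, and a real procedure would read these conditions from the site's own monitoring rather than from hand-set flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SiteState:
    """Hypothetical snapshot of the conditions named in the text."""
    in_maintenance_window: bool
    generator_running: bool


def autonomy_test_blockers(state: SiteState) -> List[str]:
    """List the reasons why a full-discharge autonomy test must not start."""
    blockers = []
    if not state.in_maintenance_window:
        blockers.append("outside the maintenance window")
    if not state.generator_running:
        blockers.append("diesel generator set is not running")
    return blockers


def may_run_autonomy_test(state: SiteState) -> bool:
    """Allow the test only when no blocker is present."""
    return not autonomy_test_blockers(state)


if __name__ == "__main__":
    state = SiteState(in_maintenance_window=True, generator_running=False)
    print(may_run_autonomy_test(state), autonomy_test_blockers(state))
```

Returning the list of blockers rather than a bare flag gives the engineer something concrete to read before acting, in the same spirit as reading the warnings on the UPS display.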
More cases, briefly:
- Wiping the UPS control panel and accidentally pressing the red EPO (Emergency Power Off) button.
- Drilling through the floor slab with a water-fed diamond drill directly above the UPS, with a short circuit once the drill broke through.
- A phase-to-phase short circuit on the UPS busbars after an excavator bucket tore a cable during earthworks; the switchboard equipment and the UPS were housed in different buildings.
- Starting the UPS in the wrong sequence. On some UPSs, closing the battery switch at start-up draws a huge inrush current to charge the DC-bus capacitors, and the battery fuses blow (true only for certain models; there are UPSs that must be started strictly with the batteries connected).
- Removing dust with a vacuum cleaner during servicing and tearing components off a board. Dust should be removed only by blowing air from a safe distance at an acceptable pressure.
- The UPS's own switches were routinely used to switch loaded lines, which caused sparking at every operation and, eventually, burnt contacts in the switch. Lines should be switched with the circuit breakers in the switchboards, not with the UPS switches, which the manufacturer did not design for regular switching under load.
Accidents after a chain of actions
Accidents caused by errors made at the design and/or implementation stage (which most often go unnoticed), where the accident itself happens after the “finishing shot” during operation.
Examples of accidents built in at the design stage:
- A maintenance bypass switch was designed without the “dry” contact whose signal makes the UPS stop its inverter automatically. In that case, if the engineer mistakenly transfers the SBE to the bypass line, the UPS inverter starts fighting the Territorial Generating Company. It is clear who wins. At best the UPS output fuses blow; at worst the inverter's IGBT transistors burn out. The data center stops and data is lost.
- A residual-current (differential) circuit breaker was designed only in the main switchboard (MSB), with no differential protection downstream. The MSB fed not only the SBE with its critical loads but also important loads that tolerate a short interruption (backed by the diesel generator set). In the UPSs of many vendors the main switches do not break the neutral conductor, yet for repair work with the UPS taken completely out of service this must be possible; most often the UPS has a separate switch for the purpose. During repair of a UPS whose neutral had not been disconnected, the neutral conductor touched the grounded UPS enclosure. Because there was no differential protection between the UPS and the main switchboard, the main incoming breaker tripped and de-energized the entire facility.
- In the design of a power supply system with a long-autonomy SBE, the battery charging current and the UPS efficiency were not taken into account, and the capacity of the substation transformer was almost equal to the load. As a result, once the load reached its rated value, the feeder was overloaded.
- The design did not take into account that at first start-up the active load of the data center is only 20% of nominal, while the SBE and the air-conditioning system, both connected to the guaranteed power supply system (SGE, which runs from a diesel generator set during an outage), were brought up in full. When the external supply fails, reactive currents from the UPS (which has no pre-charge circuit for its internal capacitors) and from the air-conditioning systems begin to “deceive” the SGE, because the voltage of most diesel generator sets is regulated by current. The output voltage starts to fall or rise, depending on whether the reactive load is capacitive or inductive. The voltage leaves the UPS's acceptable input range, the UPS switches to batteries, its reactive component stops acting on the diesel generator set, the voltage returns toward nominal, the UPS goes back to generator power, and the cycle repeats. The system starts to oscillate, and it ends in a loss of power once the batteries are fully discharged or the diesel generator set output is locked out. The problem is solved by choosing a UPS that can automatically switch idle modules off and on, by using a ballast load, or by using a reactive current compensator.
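The transformer-sizing mistake from the list above (charging current and UPS efficiency left out) comes down to simple arithmetic, sketched below. All figures are hypothetical, and the function names and the single `power_factor` parameter are simplifying assumptions; a real design calculation also covers the other loads on the feeder and the vendor's actual charging profile.

```python
def ups_input_kw(load_kw: float, efficiency: float, charge_kw: float) -> float:
    """Active power drawn from the feeder: load divided by UPS efficiency,
    plus the power spent on recharging the batteries."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return load_kw / efficiency + charge_kw


def transformer_margin_kw(transformer_kva: float, power_factor: float,
                          load_kw: float, efficiency: float,
                          charge_kw: float) -> float:
    """Remaining active-power headroom; a negative value means overload."""
    available_kw = transformer_kva * power_factor
    return available_kw - ups_input_kw(load_kw, efficiency, charge_kw)


if __name__ == "__main__":
    # Hypothetical figures: a 500 kVA transformer that looks sufficient
    # for a 400 kW load until efficiency and charging are included.
    naive = transformer_margin_kw(500.0, 0.9, 400.0, 1.0, 0.0)
    real = transformer_margin_kw(500.0, 0.9, 400.0, 0.94, 40.0)
    print(f"margin ignoring losses and charging: {naive:.1f} kW")
    print(f"margin with losses and charging:     {real:.1f} kW")
```

With these assumed numbers the naive margin is positive and the realistic one is negative, which mirrors the overload that appeared only once the load reached its rated value.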
Examples of accidents built in during delivery and installation:
- An arcing short circuit on the UPS busbars caused by a stray piece of foil that was either left behind after the air-conditioning installation or broke off the underside of a raised-floor tile and was blown along under the raised floor.
UPSs and batteries delivered too far in advance:
- The UPSs were installed before the painting work was finished. Hired workers painted the ceilings while standing on the UPSs. The protective film on some units was torn by trampling, and construction dirt from shoes fell into the UPSs through the grilles of the top fans. It could not be cleaned out completely during commissioning, and later, in operation, there were several failures, most likely triggered by the contamination of internal components.
- An SBE commissioned 8 months after delivery. Long storage without charging caused irreversible reactions in the batteries: the lead plates became covered with a film of coarse lead sulfate crystals (sulfation), which impedes the current-producing processes. Immediately after the SBE was started, the batteries failed (they did not pass the test).
The bottom line
This is only a small part of the possible situations, but in my experience the cases described tend to recur. I hope this article helps interested parties avoid accidents that can happen at facilities both with and without a UPS. The cost of an error can be very high. A data center may be down for several hours as a result of an accident, and there are known “catastrophes” in which a facility was completely out of service for two days. For a large company, an emergency shutdown of the entire data center for 8 hours can cost almost a quarter's profit, and this can often be avoided through caution, vigilance and attention to detail.
UPD: The question about grandmothers is a very interesting one, and it applies equally to plain, stern security guards. There are still server rooms with no engineer on duty. The funny thing is that expensive monitoring stations and the SNMP adapters and sensors built into the engineering equipment go unused, for example at night. There are, of course, setups with a specialist living nearby and a system that sends alarm messages to a mailbox or a phone, but that, to put it mildly, is not available always or everywhere. Some customers still ask suppliers for solutions built on “dry” contacts, and the whole DCIM comes down to a panel of light bulbs with a phone number written next to each one, which the grandmother or the security guard is supposed to call. By the way, in my experience the grandmother delivers more nines than the guard (I mean the availability figure).