The summer of 2017 was rich in hurricanes. But we have had a case of our own. Exactly 7 years ago, our first data center on Borovaya survived a hurricane that buried its chillers under 10 tons of sheet metal blown off a neighboring roof. Sensational photos of the battered chillers have long been circulating online, but the story of restoring a data center left without cooling has never been told. I decided to dig through the archives and fill the gap.
In 2010, DataLine was a novice data center operator. Only three halls with 360 racks were in operation at the OST site; in the north of Moscow (NORD) there was one building with a single hall of 147 racks.
This is how the scale of our infrastructure has changed since 2010.
Although we designed and built everything ourselves, back then we had no dedicated maintenance service. Unlike today, we had only a few specialists in diesel generator sets, air conditioning, and electrics; everything else was outsourced to contractors as far as possible. Our infrastructure was small, and so was our experience. Engineering was the responsibility of the production director, the chief power engineer, and me, the technical director. There were also duty engineers on shift (three at a time), but they handled client requests and monitoring.
One of the first halls of the OST-1 data center at the end of 2009.
This is how new halls look in OST today.
The three halls on Borovaya were only half full. We had so few customers that you could count them on one hand.
The site ran on an ethylene glycol cooling circuit with three Emicon chillers in a 2+1 redundancy scheme. I must say these chillers never reached the capacity declared by the manufacturer, but since the load was small, one chiller was almost enough for all three halls.
On July 20 the heat was pushing thirty degrees. Chillers struggle in such weather, so when it started to rain toward the end of the working day, I was glad, hoping they would feel better. A strong wind rose along with the rain, and soon I saw sheets of metal flying past my office window. I went outside: pieces of roofing lay on the other side of the road. Surprisingly, none of the employees' cars parked near the data center was seriously damaged.
Roofing metal hung from the wires like laundry.
Then it occurred to me that we should check the chillers, since the metal had flown from their direction. Together with colleagues I climbed onto the roof and saw a terrible picture: all three chillers were buried under iron beams and sheets.
The video surveillance recordings show that all the metal flew over in one powerful gust of wind. Here is what we later saw on one of the cameras:
The clock in the video lags behind; when it all happened, it was already 18:18.
The scale of the disaster turned out to be impressive. On one chiller the flying metal punctured the free-cooling heat exchanger (the chiller's external circuit); the second had damaged fans; and on the third, in addition to all of the above, the twisted, still-spinning fans had managed to slice through the freon tubes inside the chiller. By the time we got to the roof, two of the three chillers had already stopped.
The damaged frame and free-cooling heat exchanger of the first chiller. The heat exchanger is a "sandwich": the free-cooling coil is on the outside; inside, with a gap of five centimeters, sits a similar coil of the freon condenser.
Twisted fans of one of the chillers.
Glycol gushed from the punctured free-cooling heat exchangers. The pressure in the cooling system plummeted. The pumps stopped on dry-run protection, the last working chiller shut down, and the entire cooling system went down (it was 18:32, two minutes after the end of the working day). For a few seconds we stood in a stupor, not knowing what to do. Then we called our cooling contractor and summoned an emergency crew. Over the phone the contractor advised shutting off the external circuit and explained where the necessary valves and make-up taps were located. We closed the valves feeding the external heat exchangers, and the glycol stopped flowing.
It was getting hot in the cold aisles of the machine rooms. Realizing that we would not be able to restore cooling quickly, at 19:10 we started calling customers, not just to notify them of the accident but to ask them to shut down their computing equipment to avoid hardware failure. We saw no other option. Some customers refused to shut down and accepted the risk. Some brought portable air conditioners to the site for their racks.
At 18:51 we began topping up the glycol circuit with tap water and gradually brought the system pressure back to its working value.
At 19:45 an emergency crew arrived.
At 19:53 the pumps started up, but only one of the three chillers came back online. The second had damaged fans, and the third also had a punctured freon circuit.
While all this was going on, the glycol temperature had managed to rise from its working values (7–12 °C) to 20 degrees. The one surviving chiller was running overloaded, and periodically one of its two circuits would trip on an error. The error then had to be reset manually on the control panel, and after a five-minute guard interval the compressor would start. Or would not start. In that case, a complete power-down and reboot of the chiller helped.
Everyone who was in the office at that moment helped clear the chillers of the "flown-in" scrap metal and helped the emergency crew assemble one working chiller out of the two damaged ones.
The capital construction director threw his back out heaving the steel beams off the chillers.
The fans were removed from the chiller with the punctured freon tubes. It took some powerlifting: each fan weighs about 30 kg. By 23:00 the second chiller was, after a fashion, reassembled and started, and the temperature in the halls began slowly to drop.
By then it was dark, but the most interesting part was just beginning. The chillers began tripping on protection due to compressor overheating: the glycol temperature was still high despite most customers having shut down.
The head of production went out and bought a Kärcher, hoses, and headlamps so we could work through the night. We poured cold water over the chiller compressors, but it didn't help much: a compressor is a lump of iron weighing more than a ton and cannot be cooled quickly. Now, when a chiller tripped on an error, instead of five minutes we had to wait tens of minutes for the compressor to cool down and the Compressor Overload error to clear.
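Why water didn't help much can be shown with a rough lumped-thermal-mass estimate. All the numbers below are illustrative assumptions (compressor mass, temperature margin, achievable heat removal), not measurements from the accident:

```python
# Back-of-the-envelope estimate of why a >1 t compressor cannot be
# cooled quickly. All numbers are illustrative assumptions.

mass_kg = 1200          # assumed compressor mass
c_steel = 490           # J/(kg*K), specific heat of steel
delta_t = 30            # K of overheat to shed before the trip clears

heat_j = mass_kg * c_steel * delta_t   # ~17.6 MJ of heat to remove
cooling_w = 10_000      # optimistic heat removal by hosing with water

minutes = heat_j / cooling_w / 60
print(f"~{minutes:.0f} minutes to cool down")  # ~29 minutes
```

Even with generous assumptions, the cool-down comes out in the tens of minutes, which matches what we observed.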
The error message shown to us in turn by one chiller or the other.
In the dead of night, what we had feared happened: both chillers tripped simultaneously, and we could no longer bring them back up together. Of the four compressor circuits of the first and second chillers, only one or two were running at any moment; the rest took turns lying comatose with overload errors. The temperature in the halls settled at about 30 degrees. All the doors to the machine rooms were open, which helped shed at least some of the accumulated heat.
Together with the contractors we sat down to study the chiller circuit diagrams. After long and serious deliberation, they suggested, at our own risk, doing what must never be done: bypassing the protection with jumpers, i.e. short-circuiting the thermal protection relays. It was a sure way to finally kill the compressors, but there were no other options. At three o'clock in the morning the chillers started and did not stop again. The temperature in the cold aisles began to return to SLA values.
Temperature change in cold corridors from the beginning of the accident to its elimination.
1 - first stop of all chillers; 2 - start of the first chiller; 3 - start of the second chiller; 4 - chiller restart; 5 - chiller start with thermal protection disabled.
For the first time since the whole mess began, we had a chance to catch our breath and, in a slightly calmer mode, think about what to do next. The forecast promised another hot day tomorrow, and we had two chillers running on parole.
The next morning found us assembling a home-made irrigation system: water pipes were run up to the roof and holes punched in a garden hose.
The weather service did not deceive us: the heat again pushed 30 °C. With this knee-assembled system and the Kärcher, we watered the chillers almost without stopping; they kept working with the thermal protection disabled.
A historic shot: the chillers being rescued by the duty network engineer, Grigory Atrepyev, now head of the integrated projects department.
The glycol temperature returned to normal. We worked in this mode for three days in total, after which we restored the compressors' thermal protection. Within a couple of days the punctured freon tubes of the third chiller were sealed, and the circuit was evacuated and refilled with freon. While we waited for replacement fans to be delivered, only half of the third chiller was working.
Replacing the fans on the third chiller. The Emicon RAH1252F chiller with the free-cooling option consists of two modules, each housing 8 axial fans and a Bitzer compressor.
Freon refilling.
View of the backyard the next day. It took a long time to haul away the scrap metal.
Chillers

The damage was serious, and we spent quite a while more on repairs. After such abuse the compressors lasted about a year before they started failing: on two chillers the running without protection took its toll, and on the third we had apparently hurried when refilling the freon circuit (we hadn't evacuated it thoroughly enough and left traces of moisture). Oil samples taken from the still-living freon circuits showed high acidity, heralding the imminent death of the motor windings. During the second year after the accident we replaced almost all the compressors of the affected machines. We tried to repair one compressor by sending it out for rewinding, but after the repair it lasted only a few months and burned out again, so from then on we considered it wiser to buy new ones.
The tap water we used to top up the glycol circuit did not compromise the system's frost resistance. Measurements showed that the ethylene glycol concentration remained at a sufficient level.
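The effect of such a top-up is easy to estimate. The sketch below uses approximate reference values for ethylene glycol freezing points and purely hypothetical loop and top-up volumes (the actual figures from the accident are not in the story):

```python
# Estimate how topping up a glycol loop with water shifts its freezing
# point. Freezing-point pairs are approximate reference values for
# ethylene glycol; the volumes are illustrative, not actual figures.

# (volume % ethylene glycol, freezing point in degrees C), approximate
FREEZE_TABLE = [(0, 0.0), (10, -3.4), (20, -8.9),
                (30, -15.6), (40, -24.0), (50, -36.8)]

def freezing_point(concentration: float) -> float:
    """Linearly interpolate the freezing point for a given glycol %."""
    for (c1, t1), (c2, t2) in zip(FREEZE_TABLE, FREEZE_TABLE[1:]):
        if c1 <= concentration <= c2:
            return t1 + (t2 - t1) * (concentration - c1) / (c2 - c1)
    raise ValueError("concentration outside table range")

def diluted_concentration(volume_l: float, concentration: float,
                          added_water_l: float) -> float:
    """Glycol % after topping up the loop with pure water."""
    glycol_l = volume_l * concentration / 100
    return 100 * glycol_l / (volume_l + added_water_l)

# Hypothetical numbers: a 3000 L loop at 40% glycol, topped up with 300 L.
c_new = diluted_concentration(3000, 40, 300)
print(f"new concentration: {c_new:.1f}%")
print(f"freezing point: {freezing_point(c_new):.1f} C")
```

With a modest top-up relative to the loop volume, the freezing point stays well below anything a Moscow winter is likely to throw at an operating circuit, which is consistent with what our measurements showed.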
Since the chillers never delivered their rated cooling capacity (and the IT load grew as the data center filled), we had to keep watering them in hot weather. The heat exchangers did not survive the water treatment: over the years they became encrusted with lime scale, and the gap between the free-cooling heat exchanger and the freon condenser packed with dirt that the design left no way to clean out. A few years later we replaced two of the three chillers (that will be another fascinating story, this time without casualties), and on the remaining one we cut off the free-cooling heat exchangers. Today the OST site runs 4 chillers: two Stulz, a Hiref (added as the data center grew), and one old Emicon.
Chillers at the OST site in 2017.
Customers

Despite this operator's nightmare, our clients reacted to our misfortune with understanding, and not a single one left us.
I also remember that, to obtain the insurance payout for the chillers and a report for the affected clients, it took a long time to get a certificate from the Hydrometeorological Center confirming the local hurricane.
It is difficult to prepare for such force majeure in advance, but it is important to draw the right conclusions from any accident. Here is what our sweat and blood taught us:
Moscow has hurricanes too. Nowadays storm warnings come almost daily, but back then it was a novelty. Since that accident, when choosing a site or an existing building for a data center, we look especially carefully at whether there are flimsy sheds and other rickety structures dangerously close by. Naturally, the roof that landed on our chillers was re-covered by the neighbors under our strict supervision.
We began buying spare parts (fans, compressors, a stock of freon, etc.) and keeping them on hand. The recovery would have gone faster if we had had at least spare fans on site; at the time, delivery of the required quantity took several weeks.
Willy-nilly, we learned how chillers work inside; they are no longer "black boxes" to us. This came in handy later, because those wonderful refrigeration machines never stopped breaking.
We ran water lines onto the roof. For new data centers we now do this by default. Water is useful for flushing chillers and outdoor units of the dirt accumulated over autumn and winter, helps fight off poplar fluff in summer, and makes life easier for the cooling system during abnormal heat.
We beefed up monitoring and began measuring everything we could: pressure at several points, pump status, supply and return glycol temperatures, chiller power consumption, and so on. In this incident, alerts would have helped us spot the problem earlier and act faster.
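The idea boils down to simple threshold alerting on every measured parameter. Here is a minimal sketch; the sensor names and all limits except the 7–12 °C glycol range mentioned above are assumptions for illustration, not our actual monitoring configuration:

```python
# Minimal threshold-alerting sketch. Sensor names and the pressure
# limits are illustrative assumptions; the glycol supply range 7-12 C
# is taken from the working values mentioned in the text.

from dataclasses import dataclass

@dataclass
class Limit:
    low: float
    high: float

LIMITS = {
    "glycol_supply_c": Limit(7.0, 12.0),
    "loop_pressure_bar": Limit(2.0, 4.5),   # assumed working range
}

def check(readings: dict[str, float]) -> list[str]:
    """Return an alert message for every reading outside its limits."""
    alerts = []
    for name, value in readings.items():
        lim = LIMITS.get(name)
        if lim and not (lim.low <= value <= lim.high):
            alerts.append(f"ALERT {name}={value} outside "
                          f"[{lim.low}, {lim.high}]")
    return alerts

# A pressure collapse like the one in the accident fires immediately:
print(check({"glycol_supply_c": 20.0, "loop_pressure_bar": 0.3}))
```

With something like this wired to the sensors, the glycol leak at 18:18 would have paged the duty engineers before the pumps tripped on dry-run protection.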
We set up remote control of the chillers from the monitoring center.
We synchronized the clocks on all systems so as to have a clear picture of events when analyzing accidents.
We also worked hard on all the key processes: we documented and diagrammed everything we could reach and introduced regular drills. If some Armageddon strikes tomorrow, our data centers will be saved not by three and a half people improvising, but by a large and experienced operations service with clear, tested procedures. This lets us not only run an ever-growing network of seven data centers but also pass audits and certifications by the most respected and rigorous organizations, such as the Uptime Institute.
And what natural disasters has your server room or data center lived through, and what useful conclusions did you draw for yourself?
Source: https://habr.com/ru/post/333578/