Accidents on server farms

We continue the theme of accidents on server farms. The reasons large data center infrastructures go offline vary widely: power supply interruptions, cooling system problems, backup diesel generator failures, equipment faults, inadequate maintenance of that same equipment, and so on. Do not forget about the human factor either.

As the saying goes, you learn from mistakes, and it is best if they are not your own. Server farm operators can draw useful lessons on how to prepare for a potential accident, deal with its consequences, and avoid the kind of blunders that lead to considerable losses.

Cogeco Peer1

The Cogeco Peer1 data center in Atlanta went offline due to problems in the backup power system.

Cogeco Peer1, a managed hosting provider, became the focus of discussion and criticism on social networks after its server farm in Atlanta (USA) went offline. Many customers voiced their displeasure with the company, and many threatened to switch providers and move all their workloads to AWS. AWS gladly joined the conversation and tried to lure away Cogeco Peer1's disgruntled customers.

The server farm went down because of a partial power cut. Fixing the problem took about five hours: it all started at half past one, and the data center returned to full capacity only by seven in the evening. The power outage completely disabled the infrastructure in certain areas of the server farm. According to Cogeco Peer1, the downtime was caused by a failure in the data center's backup system.

TeliaSonera and the human factor

TeliaSonera provides telecommunications and network access services. Recently, a server farm engineer made an error while configuring a router in a data center, and many users of well-known Internet services, websites and applications such as WhatsApp, Reddit, CloudFlare and AWS were affected. Most of the traffic that should have gone to Europe was sent to Hong Kong instead. Millions of users felt the effects of this error when connecting to the Internet and working with popular applications. At first, experts assumed that the problem was caused by damage to a transatlantic backbone cable. It took two hours to fix the problem at the TeliaSonera server farm. The company sent customers a letter of apology and wrote in its blog that it plans to make every effort to automate its systems. Such automation should minimize downtime caused by human error.

Many companies stay silent about the causes of failures and downtime at their server farms. Data center owners are very reluctant to share information about accidents at their sites. One example is the website of Lending Club, one of the largest American lending companies, which went offline. Since 2006 the company has issued loans totaling $18 billion, so it is not surprising that this downtime greatly worried its investors. The outage was observed last week, and the cause was given as problems in the data center, without further details. The data center was down for several hours.

By the way, according to a study by Emerson that surveyed 450 server farm operators, the most common cause of data center failure is the failure of UPS batteries. The second most common is UPS overload, followed by errors in the installation of electrical connections, malfunctions of the automatic transfer switch (ATS) and short circuits. Half of the problems are connected with the same human factor. One-third of data center malfunctions are attributed to cooling systems, and in 35% of cases the cause is water leakage.

In our own market (Ukraine), owners are just as reluctant to share information about failures and the reasons their server farm infrastructure went offline. And everything begins, however trite it may sound, with the choice and design of the premises for the data center: old buildings, worn-out building structures, disguised cracks in the floors, a load-bearing wall with an opening of half a meter by one meter knocked through it... In summer, poplar fluff clogs the heat exchangers of the outdoor units, and in winter the same units often stop because they freeze or because icicles falling from the roof jam their fans. Saving on the ventilation system, namely skipping the installation of a heater in it, means that condensate drips from it in winter. UPS failures also occur when non-core loads are connected to the electrical circuit of the server farm: a powerful air conditioner in the director's office, the electric kettle of secretary Glasha, and so on. This is only a short list of the reasons server farms go offline.
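The point about non-core loads on the UPS can be illustrated with a rough load budget. The sketch below is only an assumption-laden example: the `Load` class, the helper functions and all wattage figures are hypothetical and do not come from any of the incidents described above, and it ignores power factor and battery runtime for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    watts: float
    core: bool  # True for IT equipment, False for office appliances


def ups_utilization(loads, ups_capacity_watts):
    """Return the fraction of UPS capacity consumed by the given loads."""
    total = sum(load.watts for load in loads)
    return total / ups_capacity_watts


def non_core_loads(loads):
    """Names of loads that do not belong on the server farm's circuit."""
    return [load.name for load in loads if not load.core]


# All figures below are made-up illustrative values (assumptions).
UPS_CAPACITY_W = 10_000
loads = [
    Load("server racks", 7_500, core=True),
    Load("network gear", 500, core=True),
    Load("director's air conditioner", 2_500, core=False),
    Load("electric kettle", 2_000, core=False),
]

core_only = [load for load in loads if load.core]
print(f"core only: {ups_utilization(core_only, UPS_CAPACITY_W):.0%}")
print(f"with office appliances: {ups_utilization(loads, UPS_CAPACITY_W):.0%}")
print("move off the UPS:", non_core_loads(loads))
```

With these invented numbers, the IT equipment alone stays within the UPS rating, while adding the office appliances pushes the load past it, which is the overload scenario described above.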

Source: https://habr.com/ru/post/305304/
