Here is another story about fault tolerance. All the external links of one large site were concentrated in a single building. Full redundancy was provided: two internal routers, two external routers, two VPNs, and redundant power backed by a diesel generator. Each DS3 circuit terminated on its own external router, used its own media converter, and the two circuits left the building on different sides.
The fiber ran around the building and eventually converged at the provider's site. There it plugged into two media converters that turned it back into DS3. Both media converters sat on a shelf, plugged into an ordinary household power strip, which in turn was plugged into a single outlet.
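The failure mode in this story can be sketched as a set intersection: if every redundant path lists the components it depends on, anything that appears in all of them is a single point of failure. This is only an illustrative sketch. The component names below are hypothetical labels for the parts mentioned in the story, and the helper `shared_components` is made up for the example.

```python
def shared_components(paths: dict[str, set[str]]) -> set[str]:
    """Return the components that every redundant path depends on."""
    component_sets = list(paths.values())
    if not component_sets:
        return set()
    # A component present in all paths takes all of them down when it fails.
    return set.intersection(*component_sets)


# Hypothetical inventory of the two DS3 paths from the story.
paths = {
    "ds3_a": {"external_router_a", "media_converter_a", "fiber_side_a",
              "provider_converter_a", "power_strip", "wall_outlet"},
    "ds3_b": {"external_router_b", "media_converter_b", "fiber_side_b",
              "provider_converter_b", "power_strip", "wall_outlet"},
}

single_points_of_failure = shared_components(paths)
print(sorted(single_points_of_failure))
```

The check is only as good as the inventory: in the story, the power strip at the provider's site was exactly the kind of component nobody had written down.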
Another generator story. One data center lost its utility feeds, and the diesel generators started. But someone had left a pile of wooden planks on one of them, and within an hour that generator caught fire.
Another diesel story, from 12-13 years ago. I worked for a large British provider (not BT), and one hot day (yes, we do get those) I was doing some work in one of our large data centers. I arrived early and saw a huge container being delivered. When I asked what it was, I was told it was a generator that would supply a little extra power: the cooling systems were running at their limit, and there was not enough capacity. I thought "cool" and got on with my work.
Late in the morning the fire alarm went off and the entire data center lost power; only the emergency lighting stayed on. I went outside and realized what had happened: the new generator had been installed close to the air intakes of the central ventilation system, and when the diesel started up, it belched out a huge cloud of smoke that was sucked into the ventilation and set off the smoke detectors inside the building.
When I came back the next day, a huge pipe had been fitted to the generator's exhaust, carrying the smoke well away from the data center building.
That site had a lot of resiliency, but in the end none of it helped...
At the EDU, everything was redundant. But there was one thing we could not fix: the machine room was located directly under the toilets of the art department. Long story short, one day we were quite literally flooded with shit. Did you know that you can order a visit from Sun Microsystems technicians who will wipe down your storage hardware with alcohol-soaked cotton swabs?
One of the universities in my city has its main data center in the basement of one of its central buildings. They had just finished constructing a neighboring building and needed to test its water supply system. For the duration of the test they opened a drain outside the building, but forgot to close it for the night. In the end, all the water ran down to the basement entrance, and the entire basement was flooded to a depth of 30 centimeters.
One day, a client decided to move part of their server hardware to another building, to free up space and add a bit of fault tolerance. Connectivity was provided by two OC-3s from one OPM, over two independent routes. We had worked out the relocation plan in detail and accounted for every little thing, and when the time came, I shut down the ports, powered off the equipment and started moving it to the other building. The engineer was ready to pull the fiber, and the provider was given the green light to cut the now unused circuit. Except... someone, somewhere, back when the setup was first being commissioned, had mixed up some of the circuit identifiers. So half of our data center was in transit from one place to another, and the other half had just had its only external link cut. Not very nice.
A few years ago (around 2005–2007), one of the major backbone links connecting Queensland with the rest of Australia (and, at the time, with the rest of the world too) went down. If I remember the sequence of events correctly, it went like this.
The backbone ran over two diverse routes, one along the coast and the other inland. At about 3 in the morning, the line card terminating the coastal fiber started throwing errors and then failed completely. Not a problem: all traffic rerouted over the inland path. The engineers were told to be on site at 9 in the morning to replace the card... But at about 6 in the morning, an excavator cut the inland fiber.
Ten years ago (I hope hardware developers have become smarter since then), I lost a RAID5 array of ten disks. It all started when disk number "3" failed. The engineer went to the array and pulled disk three, and the array went down. It turned out that the management interface numbered the disks from 0 to 9, while the labels on the front panel ran from 1 to 10, so the engineer had pulled a healthy disk.
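The mismatch is a classic off-by-one between two numbering schemes. A minimal sketch of making the conversion explicit, assuming only what the story states (management interface 0 to 9, front-panel labels 1 to 10); the function names and the `DISK_COUNT` constant are hypothetical and do not correspond to any real RAID tool:

```python
DISK_COUNT = 10  # assumption from the story: a ten-disk array


def panel_label(mgmt_index: int) -> int:
    """Convert a 0-based management-interface index to the 1-based front-panel label."""
    if not 0 <= mgmt_index < DISK_COUNT:
        raise ValueError(f"management index out of range: {mgmt_index}")
    return mgmt_index + 1


def mgmt_index(label: int) -> int:
    """Convert a 1-based front-panel label back to the 0-based management index."""
    if not 1 <= label <= DISK_COUNT:
        raise ValueError(f"front-panel label out of range: {label}")
    return label - 1


# The interface reported disk 3 as failed; the bay to pull is the one labeled 4.
failed = 3
print(f"Management disk {failed} is front-panel bay {panel_label(failed)}")
```

Spelling out both numbers in the work order ("management disk 3, front-panel bay 4") removes the ambiguity for whoever is standing in front of the chassis.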
A large logistics center with every kind of redundancy: UPS (batteries and diesel) and everything else. One day, the whole neighborhood loses power. No problem: the batteries pick up the load, the diesels start, and the office keeps working.
Power is restored. The whole neighborhood lights up, and the office goes dark.
In full accordance with Murphy's law, the diesel shut down properly, but the relay that switches the load from the diesel back to the utility feed did not operate...
The data center I used to work in had utility power and a UPS. The UPS was sized for 6 hours of runtime. In the event of an outage, the plan was to migrate the virtual servers to another site. Not the best solution, but it seemed good enough.
One day our data center really did lose external power, and we learned that the air conditioners were not connected to the UPS. The machine rooms overheated almost immediately, and after half an hour all the systems began to shut down.
There was another case in a third-world country. When the building lost power, the diesel generators did not start. It turned out that someone had drained the diesel fuel.
We are customers of a large data center where everything is redundant: batteries, diesel generators, duplicated fiber over diverse routes. An idyll. The machine rooms were being expanded, and some brave guys knocked down a couple of walls, having first sealed off the equipment from the dust. Then these two clowns came up with the brilliant idea of washing the floor in the room. Unfortunately, they chose a bucket of water and a rag, like in the good old days. Of course, one of them accidentally knocked the bucket over.
Source: https://habr.com/ru/post/159997/