Reliability and durability of server hardware

I decided to write this article after reading the publication “HP, Dell and IBM: components responsible for server reliability”, because I disagree with it on some points. This article does not claim to offer innovative approaches; it simply describes practical experience and, I hope, will help readers avoid trivial mistakes.

So, let's start by asking why servers need to run without interruption at all. Strictly speaking, it is not the servers that must be uninterrupted but the services they provide. The best continuity is achieved only by distributed systems that can operate independently of each other, switch over automatically (for speed) and are geographically separated (for disaster recovery). But this imposes special software requirements that cannot always be met. The disadvantages of such solutions are higher cost, problems with data replication, and the need to transfer state for seamless switching to the backup system. An additional advantage is that a properly implemented system can also improve performance: clients are divided between two or more sites and redistributed if one of them fails.

But some tasks are so critical and specific that they require exceptional continuity of the server itself. Special servers such as mainframes are built for them, with hot-swapping of all components, including processors, memory and even motherboards. Such solutions are much more expensive than ordinary servers, and those who buy them understand why they need them.

Let's go back to ordinary servers. Hot-swappable components significantly improve server continuity.

Hot-swappable power supplies

In my practice, burned-out power supply units (PSUs) have been rare, but a hot-swappable redundant PSU connected in an N + N scheme significantly improves the server's uninterrupted operation in many cases. If the server has more than two PSUs, an N + 1 scheme is often used, which does not allow the server to be powered from two independent sources or power lines. Feeding a rack from two independent lines improves continuity in a wide variety of situations, for example during maintenance or failure of the data center's power systems. In one case a server's PSU failed and created a short circuit, which tripped the PDU's protection and shut the PDU off; the neighboring servers, powered in a 1 + 1 scheme and also connected to a second PDU, kept working. PSU redundancy also makes it possible to change how the server is connected to the power network without interrupting its operation, for example to tidy up cable management (of course, the cables should be laid properly when the server is installed, but we do not live in an ideal world).
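The difference between N + N and N + 1 can be checked with simple arithmetic. The sketch below is only an illustration: the `PowerSupply` class, the feed labels and the wattages are assumptions made up for the example, not values from any particular server.

```python
from dataclasses import dataclass


@dataclass
class PowerSupply:
    watts: int  # rated output of the unit (illustrative value)
    feed: str   # label of the power line / PDU the unit is plugged into


def survives_unit_loss(supplies: list[PowerSupply], load_watts: int) -> bool:
    """True if the load is still covered after any single PSU fails."""
    total = sum(psu.watts for psu in supplies)
    return all(total - psu.watts >= load_watts for psu in supplies)


def survives_feed_loss(supplies: list[PowerSupply], load_watts: int) -> bool:
    """True if the load is still covered after any single feed goes dark."""
    feeds = {psu.feed for psu in supplies}
    for lost in feeds:
        remaining = sum(psu.watts for psu in supplies if psu.feed != lost)
        if remaining < load_watts:
            return False
    return True


# 1 + 1: one PSU is enough for the load, one PSU on each feed.
one_plus_one = [PowerSupply(800, "A"), PowerSupply(800, "B")]

# 2 + 1: two PSUs are needed for the load, so one feed always carries two units.
two_plus_one = [
    PowerSupply(400, "A"),
    PowerSupply(400, "A"),
    PowerSupply(400, "B"),
]
```

With these numbers, both layouts survive the loss of a single unit, but only the 1 + 1 layout survives the loss of a whole feed: however three units of a 2 + 1 set are split across two lines, one line ends up carrying two of them.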

Contrary to a common misconception, 80 Plus certification describes only the energy efficiency of the power supply and does not oblige the manufacturer to ensure any level of reliability.

Redundant power supplies also prevent most problems associated with power cables: poor contact in low-quality cables, or staff accidentally pulling a cable out during work. If your server has a single power supply, a high-quality, unworn cable that fits tightly into the socket and makes no extraneous sounds (crackling) under load is even more important, because it cannot be replaced without stopping the server. In a server with redundant power supplies, poor cable contact can lead to the failure of a power supply unit.

Hot-swappable drives

Hot-swappable drives can be made with almost all interface options. Of course, there are some limitations.

IDE devices rarely tolerate a second device on the same ribbon cable being disconnected or connected: there is a high risk that the working device will disappear from the system. The main problem with the IDE interface is whether the operating system handles this event correctly. Since the IDE interface makes no provision for hot-swapping, in most cases a device scan has to be started manually to detect new hardware. An important point is that the interface cable must be connected or disconnected only while the disk is powered off (connecting: interface first, then power; disconnecting: power first, then interface).

DISCLAIMER: if you disconnect or connect IDE devices on a running system, you do so at your own risk; no one guarantees that the equipment will keep working or that the OS will remain stable.

The FC, SAS and SATA (AHCI) interfaces fully support hot-swapping of disks; any problems are likely to come from the operating system. If the SATA controller is in IDE compatibility mode, you may need to start a bus scan manually. In AHCI mode the disk is detected automatically in most cases. I recommend using AHCI if your OS allows it, because this mode also improves disk performance, and TRIM is supported only in this controller mode.
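As an illustration of a manual bus scan, here is a minimal sketch that assumes a Linux host, where writing `- - -` to the `scan` attribute of a SCSI host in sysfs asks that adapter to rescan. The article does not name an operating system, so Linux is an assumption, and the function name `rescan_scsi_hosts` and its `sysfs_root` parameter are hypothetical. It needs root privileges on a real system.

```python
from pathlib import Path


def rescan_scsi_hosts(sysfs_root: str = "/sys") -> list[str]:
    """Ask every SCSI/SATA host adapter to rescan its bus.

    Returns the names of the hosts that were asked to rescan.
    """
    touched = []
    host_dir = Path(sysfs_root) / "class" / "scsi_host"
    for host in sorted(host_dir.glob("host*")):
        scan_file = host / "scan"
        if scan_file.exists():
            # "- - -" is a wildcard for channel, target and LUN.
            scan_file.write_text("- - -\n")
            touched.append(host.name)
    return touched
```

The `sysfs_root` parameter exists only so that the function can be tried against a temporary directory before it is pointed at the real `/sys`.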

To prolong the life of disks that you disconnect, I recommend disabling them in software first and removing the disk only after the spindle has stopped, i.e. about 30 seconds after shutdown for 7200 RPM drives. If the disk cannot be disabled in software and it sits in a hot-swap tray, I recommend pulling it out just far enough to disconnect it, waiting for the spindle to stop, and only then removing it completely. In most systems this distance corresponds to the fully opened tray handle. Of course, none of this makes practical sense if the disk has really failed, but perhaps it merely “hung”, will not be replaced under warranty, and will have to be reused in non-critical equipment.
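The "disable in software, then wait" sequence could look roughly like this on Linux, where writing `1` to a disk's `device/delete` attribute in sysfs detaches it from the kernel. This is a sketch under assumptions: Linux is not named in the article, `detach_disk` is a hypothetical helper, and whether the drive actually spins down at this point depends on the driver and controller. The 30-second default is the figure quoted above for 7200 RPM drives.

```python
import time
from pathlib import Path


def detach_disk(
    name: str,
    sysfs_root: str = "/sys",
    spin_down_wait: float = 30.0,
    sleep=time.sleep,
) -> None:
    """Detach a disk (e.g. "sdb") from the kernel, then wait before removal."""
    delete_file = Path(sysfs_root) / "block" / name / "device" / "delete"
    if not delete_file.exists():
        raise FileNotFoundError(f"no such disk in sysfs: {name}")
    # Tell the kernel to drop the device; it disappears from /dev afterwards.
    delete_file.write_text("1\n")
    # Give the spindle time to stop before the disk is physically pulled.
    sleep(spin_down_wait)
```

The `sleep` argument is injectable so the wait can be skipped in a dry run; in normal use the default `time.sleep` simply blocks for the given number of seconds.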

It is also important to know whether the disk is part of a RAID array or is used as a separate block device. A separate disk must be unmounted first to avoid malfunctions in the operating system and software. Even if the disk is not in use at the moment, removing a mounted disk often makes the entire OS lag. And of course, the disk on which the operating system is installed cannot be removed without a “freeze”.
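Before pulling a standalone disk it helps to confirm that nothing on it is still mounted. The sketch below assumes Linux and the text format of `/proc/mounts` (device in the first column, mount point in the second); the helper name `mounted_partitions` is hypothetical. It only looks at direct partitions of the disk and will not notice LVM or other layers stacked on top of it.

```python
import re


def mounted_partitions(disk: str, mounts_text: str) -> list[tuple[str, str]]:
    """Return (device, mount point) pairs from mounts_text that live on disk.

    disk is a kernel name such as "sdb" or "nvme0n1".
    """
    # Linux puts a "p" before the partition number when the disk name
    # itself ends in a digit (nvme0n1p2), and nothing otherwise (sdb1).
    sep = "p" if disk[-1].isdigit() else ""
    pattern = re.compile(rf"^/dev/{re.escape(disk)}(?:{sep}\d+)?$")
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and pattern.match(fields[0]):
            found.append((fields[0], fields[1]))
    return found
```

On a real system the second argument would be the contents of `/proc/mounts`, for example `mounted_partitions("sdb", open("/proc/mounts").read())`; an empty result suggests the disk can be detached without surprising a mounted filesystem.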

Most servers let you light up a disk's indicator with a command from the server; if possible, use this function to reduce the chance of pulling the wrong disk. For example, on SuperMicro servers the tray number is marked on the tray itself and may not match the slot number on the backplane. Many manufacturers have the same problem.
Also, before disconnecting a disk, it is advisable to record its details (model, capacity, serial number) and compare them with the label as soon as the disk is out. In many cases, if you have removed the wrong disk by mistake, this lets you correct the error immediately, and sometimes even prevents a malfunction or data loss.

In the case of RAID arrays, I recommend disconnecting the disk in software (marking it as failed) before removing it; this avoids the drop in disk subsystem performance immediately after the disk is disconnected.
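For Linux software RAID, marking a member as failed and then removing it from the array is done with `mdadm`. The sketch below is an assumption about the setup (the article does not say which RAID implementation is used, and hardware controllers have their own tools); the device names are examples and the helper names are hypothetical.

```python
import subprocess


def md_remove_commands(array: str, member: str) -> list[list[str]]:
    """Build the mdadm commands that fail a member and then remove it."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]


def fail_and_remove(array: str, member: str, run=subprocess.run) -> None:
    """Run the commands in order, stopping at the first one that fails."""
    for cmd in md_remove_commands(array, member):
        run(cmd, check=True)
```

A call such as `fail_and_remove("/dev/md0", "/dev/sdb1")` needs root privileges; building the commands separately makes it easy to print and review them before anything is executed.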

I did not notice any problems with SSDs under frequent hot plugging and removal, although I used several in this mode.

This concludes the first part. The next parts will cover RAID arrays, server memory, remote management systems and the importance of monitoring.

Source: https://habr.com/ru/post/271961/
