A day in the life of a new server: how we test and restore hardware

In this article I want to give a look behind the scenes at Server Mall and show how servers are tested and restored. I will try to show clearly how an ordinary second-hand server differs from a refurbished one, and describe how slightly "tired" hardware is turned into something almost new.

To explore the inner world of Server Mall's pre-sales department, I talked to one of the engineers, Andrei, who shared his professional know-how and experience. He happened to be working on a newly arrived IBM System x3650 M4, so the tour took a practical turn.

Let's say you decide to sell a server

Not only organizations but also individuals can do this by contacting Server Mall (SM) for an assessment. Habr already has a detailed article on the buying process, so I will talk about what happens next.

Based on the information gathered over the phone, the specialists decide whether the purchase makes sense; the answer is usually yes. Still, buying some second-generation ProLiant is unlikely to be worthwhile, so the initial assessment of the hardware's prospects matters. If all is well, a specially trained courier visits the seller, visually inspects the server, checks for obvious errors in operation and takes the hardware away. The company buys servers all across Russia.

The inspection gives a rough estimate of what restoring the server will cost: a large chip in the case, for instance, may well come from a fall, which can later cause intermittent errors due to microcracks in the PCB laminate. Servers are not dropped often, but when they are, the aim is remarkably good. I once watched three DL380 machines being transported in the trunk of a sedan, and one of them was lifted out clumsily. To the eye, the server had merely landed on a mounting ear and dented a corner, but at startup we got cooling system errors and periodic reboots.

During the inspection the server is also powered on so that the self-diagnostic indicators and console errors can be checked. If nothing critical turns up, the deal is closed and the machine is handed over to the engineers.

The sequence of checks described below did not come out of nowhere, so first some background on MTBF figures and where to find them.

Our engineers began by asking for time-to-failure statistics on the main components, so as not to guess at what needs replacing. The main reliability indicator is MTBF (Mean Time Between Failures), that is, the average operating time between failures. The figure differs for each component, and official data for all components is not easy to obtain.

For reference, though, you can use the reports of some OEM manufacturers whose hardware goes into servers of any brand. For example, the Intel SSD 520 has an MTBF of 1,200,000 hours. Of course, this does not mean the drive will run for roughly 137 years: the figure is statistical and is derived from testing a large batch. A more intuitive indicator is AFR (Annual Failure Rate), derived from MTBF by the formula AFR = 1 - exp(-8760 / MTBF), where 8760 is the number of hours in a year.

For our example, the probability that the SSD fails in the first year is approximately 0.007, that is, 0.7%. A less accurate estimate uses the simpler formula 8760 / MTBF. Quite a few articles have been written about calculating this indicator, so the curious can refer to published materials.
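As a quick check of the arithmetic, here is a minimal Python sketch of both formulas. The function and constant names are made up for the illustration; only the formulas and the 1,200,000-hour figure come from the text.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_exact(mtbf_hours: float) -> float:
    """AFR = 1 - exp(-8760 / MTBF), the formula quoted in the text."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def afr_approx(mtbf_hours: float) -> float:
    """Simplified estimate: AFR ~ 8760 / MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


ssd_mtbf = 1_200_000  # hours, the Intel SSD 520 figure from the text

print(f"exact:  {afr_exact(ssd_mtbf):.4%}")
print(f"approx: {afr_approx(ssd_mtbf):.4%}")
```

Both values come out at about 0.7%. The two formulas only drift apart noticeably when the MTBF is small compared with a year of operation.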

Server maintenance

Every newly arrived server goes through a mandatory cycle of testing and cleaning. Serious physical defects, such as bent mounting ears, are also repaired.

Purely cosmetic flaws, such as scratches on the metal and scuffs, are left as they are. Incidentally, the metal of server chassis is coated at the factory with a special antistatic lacquer that is not easy to restore. The exact composition is unknown, almost like the KFC seasoning, so we sacrifice aesthetics for the sake of protection against static.

If the so-called "ears", by which the server is conveniently pulled out of the rack, are damaged, they are usually replaced with new ones. HP's plastic parts are simply swapped for new ones, as are power supply hinges. Rack rails are simply ordered anew. If the chassis itself is badly damaged (deep and complex dents, for example), it is replaced with a new one as a whole assembly.

In all their experience, Server Mall engineers have never come across damaged IBM metal mounts. Apparently the well-known "indestructibility" of this manufacturer's systems shows even in the small things.

Incidentally, the time between failures for the chassis is quite long.

Here, for example, is one manufacturer's MTBF data:

The chassis itself is rated at 5,000,000 hours;
The drive cage and IPMI modules will run for 700,000 hours;
LEDs are rated for 2,000,000 hours.
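To get a feel for how such figures combine, here is a minimal sketch. It assumes that failures are independent with constant rates and that the failure of any one part counts as a failure of the whole group, in which case the failure rates (1 / MTBF) simply add up. That series model is a simplifying assumption of this illustration, not something the manufacturer's data states, and the function name is hypothetical.

```python
def series_mtbf(mtbf_hours):
    """Combined MTBF of parts in series: 1 / sum(1 / MTBF_i).

    Assumes independent parts with constant failure rates.
    """
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


# Figures from the list above, in hours.
# The drive cage and IPMI modules share one entry, as in the source.
parts = {
    "chassis": 5_000_000,
    "drive cage and IPMI modules": 700_000,
    "LEDs": 2_000_000,
}

combined = series_mtbf(parts.values())
print(f"combined MTBF: {combined:,.0f} hours")
```

Under these assumptions the combined figure is lower than the weakest single entry (700,000 hours), which is why even very reliable parts still add up to a noticeable overall failure rate.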

We once received a request to sell a server that had been cooled with tobacco smoke for several years. It simply stood in a server room whose air intake was fed from the neighboring smoking room. The aroma of burnt tobacco products could be felt on the way to the patient. The model was still in demand, so we decided to take a chance. Have you ever washed off an even layer of tobacco tar? The Server Mall engineers have, and one of them even quit smoking. Admittedly, the hardware never went on sale and was used for internal needs.

After the inspection, the engineer removes the case cover and starts the machine to listen to the fans, power supplies and disks. Some fans raise no errors in the diagnostic system, yet their sound leaves no faith in a bright future for the bearings. Such fans are simply replaced with new ones. The MTBF of Intel cooling systems is only 100,000 hours, so fan replacement is routine.

An equally familiar sound is the whine of capacitors in a power supply that shows green in monitoring right up to the end. Relatively recent servers use power supplies with solid capacitors, but models with electrolytic capacitors are still in use and therefore need careful diagnostics.

The time between failures of a modern power supply can reach 967,300 hours, according to data from the OEM manufacturer Intel. If there is whining or any suspicion of a fault, the entire PSU is replaced with a new one, because soldering work is not economically justified and is risky for the future buyer.

Light and digital diagnostics

Most modern servers have self-diagnostic systems. These may be LED indicators on the front panel, separate modules that list every component and its status, or just a single indicator that some error is present. Either way, serious component problems are visible at once.

A brief tour of basic diagnostics, using IBM, HP and Dell solutions as examples.

IBM's version is called Light Path: a slide-out panel with indicators and explanations;

Most Dell servers use an LCD panel for basic configuration and for showing errors with a brief description;

There are also simplified indicators;

HPE offers the Systems Insight Display, an LED self-diagnostic panel similar to IBM's.

After a quick look at the indicators, a lengthy software check begins, using the manufacturers' standard diagnostic tools.

These programs are launched locally or through IMM, DRAC or iLO. If the diagnostics are not built into the server's management controller, the engineer simply boots from the manufacturer's diagnostic disk. A full diagnostic run takes 2 to 3 hours and finds most problems with memory, processors, the diagnostic controller, fans, power supplies and disk controllers. Hard drives are not part of this process, since servers almost always go on sale with new ones.

Traditionally, the weak point of motherboards was electrolytic capacitors. They swelled, overheated, exploded and left the board completely inoperable. At maximum temperature, the MTTF of such components was up to 8,000 hours, which threatens unscheduled repairs after a couple of years of operation. Modern server systems therefore use solid-state capacitors, which last for several server "lifetimes". The overall MTBF of a motherboard, taking the Intel S1200V3RPM as an example, confirms this: 371,523 hours.

After this thorough check, the server is stripped down completely, to a bare case with the components laid out on the table, and every component is carefully cleaned and rinsed with alcohol. Alcohol does not harm the conductive traces, the circuitry or the lacquer of the motherboard, which is why it is widely used to return boards to a pristine look. To avoid unnecessary costs, and as a measure against alcoholism, isopropyl alcohol is used.

Close attention is paid to the motherboard connectors. In particular, the engineer examines the processor socket through a magnifying glass for bent pins, because even a single damaged pin can have the most unpredictable consequences. PCI slots and memory slots are not overlooked, and the link on the network ports is checked. As the cherry on top, we replace the BIOS battery, just in case.

After its bath, the server goes to the warehouse, where barcodes from all components are scanned into the internal warehouse database. The hardware then waits on the shelf for its buyer, together with the test logs and a warranty sheet listing the serial numbers of all components.

And then came the order for this server

A customer rarely picks the "as is" configuration without wanting to add anything. The ordered hardware is therefore fitted with new disks, processors, power supplies of the required wattage, memory and the necessary controllers. After that, the server goes back to the test engineers for a pre-sales check.

The tools used are the server manufacturer's built-in diagnostic software and a couple of utilities from an external disk. The pre-sales check takes about ten hours and is run in stress mode:

Processors and memory work at maximum capacity;
Power supplies deliver their full power, even when there are several of them;
Most defective hard drives are detected under load;
The server's entire component base works harder than it is ever likely to in everyday use.

This stage, incidentally, is where subtle power supply defects are caught, so Server Mall does not stop at checking for whine. At the same stage, a power supply may be replaced with a new one unconditionally if the customer has decided to buy a server with a single PSU, despite the option of a fault-tolerant configuration.

New hard drives go untested only when the customer, for reasons of their own, asks for them to be shipped unopened.

For a full check of all network interfaces, the machine boots from an external disk into a specially prepared environment based on Windows Server 2012 R2. The server is connected to the local network, and the engineer copies first one large file and then many small ones. If packet loss exceeds 1%, the network card must be diagnosed and replaced.
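The 1% rule is easy to express in code. The sketch below is only an illustration of the threshold check: the function names and packet counters are hypothetical, and the article does not say how the engineers actually collect these numbers.

```python
LOSS_THRESHOLD_PERCENT = 1.0  # the 1% limit mentioned in the text


def packet_loss_percent(sent: int, received: int) -> float:
    """Share of packets that never arrived, in percent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return (sent - received) / sent * 100.0


def nic_needs_attention(sent: int, received: int) -> bool:
    """True when loss exceeds the threshold and the card should be checked."""
    return packet_loss_percent(sent, received) > LOSS_THRESHOLD_PERCENT


# Hypothetical counters taken after the file-copy test
print(nic_needs_attention(sent=100_000, received=99_500))  # 0.5% loss
print(nic_needs_attention(sent=100_000, received=98_000))  # 2.0% loss
```

The first call stays under the limit, the second exceeds it and would send the card for diagnosis.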

Memory is additionally tested with Memtest on all systems except IBM. The reason is that on IBM machines Memtest almost always reports non-existent errors in one of the slots. It is simply a technical quirk.

If any server component fails, all testing starts over, which rules out possible compatibility problems with the replacement parts.

We once ran into a curious problem with the RAID controller in a Dell server: all tests passed, but after a reboot the BIOS began reporting errors from the rather rare H710 controller. Finding an equivalent replacement delayed the server by a day, which we made up for by fitting a more modern H330 adapter with twice the bandwidth.

In total, each server takes about 16 hours:

2 to 3 hours of initial testing;
3 hours of cleaning and washing;
10 hours of pre-sales testing.

Along with the tested hardware, the buyer receives a flash drive with the test log, the server manual, useful links and an offline copy of an article about common errors for that particular manufacturer.

The preparation of the server for shipping deserves a special mention. The packaging is our own design and, judging by reviews, is better than the original. The server is sealed in film with silica gel (a moisture absorber), wrapped in polyethylene foam, packed in sturdy cardboard and sent to the customer.

Instead of a conclusion

Server Mall gives its own 3-year warranty on machines restored in the way described above. The standard service package includes both replacement of failed components within a couple of days and replacement of the entire server in the event of a critical fault. You can read more about warranty support and how it differs from the branded offerings of HP, IBM and Dell in one of our earlier articles.

Incidentally, in the company's entire history a complete replacement has been needed only once. The glitch could not be reproduced: in the presence of Server Mall engineers everything ran like clockwork. There it is, the admin aura in action!

Source: https://habr.com/ru/post/313172/
