
How I Fought Death Screens on Legacy Blade Servers

A post about how I wrestled with new software on old hardware. The problems appeared after I added extra equipment.



If you are interested in server hardware and in chasing down errors, read on.

We ordered two additional Cisco switches for the HP C3000 blade enclosure, plus a mezzanine card for each blade server, so that everything could be set up properly. I wanted to separate the networks at the physical level and to improve performance and reliability.
The configuration was as follows: an HP C3000 enclosure with the blade servers in it.




Each blade has two mezzanine cards (HP NC382m Dual-Port 1GbE and HP NC364m Quad Port 1GbE) and embedded dual-port FlexFabric 10GbE.

The mezzanine cards look like this:


HP NC382m


HP NC364m

The servers run VMware ESXi 5.5.

Initially, everything ran stably without the Cisco switches and the four-port mezzanine cards. One HP switch carried the virtual machine network, the other carried the management and iSCSI networks. The second switch did not have enough performance, so we decided to move the iSCSI network to separate switches. That is why we bought the two Cisco switches and the mezzanine cards.

As you can guess, the 460 blades are rather outdated, but they still have to be supported. I obtained a current HP Service Pack image and updated the whole enclosure.

I took the 460 hosts out of the VMware cluster, installed the mezzanine cards, slid the blades back into the enclosure, and ... got a PSOD immediately on boot.


In this case, the error code is the string
PCPU0: 32840 / helper14-0
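
When comparing purple screens from several blades, it helps to split this line into its parts. Below is a minimal sketch, assuming the line follows the layout "PCPU<n>: <world ID> / <world name>" (my reading of the screen, not something taken from VMware documentation). The function name parse_psod_line is hypothetical.

```python
import re

# Assumed layout: "PCPU<n>: <world id> / <world name>".
# The field names below are my own labels for the three parts.
PSOD_LINE = re.compile(
    r"PCPU(?P<pcpu>\d+):\s*(?P<world_id>\d+)\s*/\s*(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return the CPU number, world ID and world name, or None if no match."""
    match = PSOD_LINE.search(line)
    if match is None:
        return None
    return {
        "pcpu": int(match.group("pcpu")),
        "world_id": int(match.group("world_id")),
        "world_name": match.group("world_name"),
    }


if __name__ == "__main__":
    print(parse_psod_line("PCPU0: 32840 / helper14-0"))
```

The world name is the useful part: if it changes between crashes, a different part of the system is probably failing.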

At first I suspected a motherboard problem, since one of the blades had already had its motherboard replaced precisely because of trouble with the network adapters: they would disappear from time to time.
But when the problem reproduced on the second blade server, I dropped that idea. Note that I tried booting the server with either mezzanine card alone, in different slots, and everything worked without problems, so the fault was neither in a card nor in a slot.

I switched the blade server into debug mode, read the logs, and read the VMware forum. It said this was a hardware problem and pointed to the manufacturer's forum. On the HP forum, people wrote that modern VMware products often have trouble with old hardware. I installed VMware ESXi 4.1 and everything worked stably, but our license is for ESXi 5.5, and so is the software tied to that license, such as vGate 2.7. I then installed Windows Server 2012 R2 to make sure the problem really was in the software, and ... BSOD.


NMI_HARDWARE_FAILURE

On the next boot Windows seemed stable, so I left it running under test. The next day I found another BSOD.
At the same time, the Integrated Management Log (IML) contained this entry: Uncorrectable PCI Express Error (Embedded Device, Bus 0, Device 9, Function 0, Error status 0x00000000). That is, an unrecoverable hardware error, and device 9 is exactly the second mezzanine card.
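
If you have to sift through IML entries from several blades, the bus/device/function triple can be extracted automatically. Here is a minimal sketch, assuming every such entry has exactly the wording of the one quoted above. The function name parse_iml_pcie_error is hypothetical, and other iLO firmware versions may word the entry differently.

```python
import re

# Pattern built from the single IML entry quoted in this post;
# other wordings are not covered.
IML_PCIE = re.compile(
    r"Uncorrectable PCI Express Error \((?P<source>[^,]+), "
    r"Bus (?P<bus>\d+), Device (?P<device>\d+), Function (?P<function>\d+), "
    r"Error status (?P<status>0x[0-9A-Fa-f]+)\)"
)


def parse_iml_pcie_error(entry):
    """Return the PCI address fields of an IML entry, or None if no match."""
    match = IML_PCIE.search(entry)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "bus": int(match.group("bus")),
        "device": int(match.group("device")),
        "function": int(match.group("function")),
        "status": int(match.group("status"), 16),  # hex string to integer
    }


if __name__ == "__main__":
    entry = (
        "Uncorrectable PCI Express Error (Embedded Device, Bus 0, "
        "Device 9, Function 0, Error status 0x00000000)"
    )
    print(parse_iml_pcie_error(entry))
```

Grouping the parsed entries by device number shows quickly whether the errors always point at the same card.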

I kept reading the HP forum, where it was written that the iLO firmware can have an effect. I found that a newer iLO firmware existed and reflashed both blades, but it did not help. The forum also said there was an incompatibility between the FlexFabric firmware and drivers. I reflashed FlexFabric: still the same error.

I tried different installation images: the stock VMware ESXi 5.5 image and HP's own image of the same build. The result was the same.
Then I noticed in the logs that the error was specifically in bnx2 (the driver for the FlexFabric network adapter). I installed the Broadcom drivers from the VMware site. Note that the driver is only overwritten when you install it from the ESXi console itself; if you install through vCenter, it is not overwritten. I rebooted, and everything ran fine! It was the same story with the Emulex FlexFabric on the 490 blades. I also updated the FlexFabric BIOS and reinstalled the driver. Everything worked stably and quickly,
... but not for long.


In this screenshot, the error code is the string
PCPU0: 32802 / UplinkWatchdogWorld

Then came a second problem, this time with a mezzanine card.
After a while, the four-port mezzanine card vanished completely on one of the blades, even from the host BIOS. Rebooting and resetting the BIOS did not help, until I found a BIOS item for the PCI mezzanine adapters. It became possible to choose the signal amplification level on the PCI lanes (only two options, 6 dB and 3.5 dB). Yes, "became", because this item appeared only once the four-port card was added. I switched the gain level, and right after the reboot the card showed up in the BIOS again.
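
To get a feel for how different the two settings are, the decibel values can be converted to amplitude ratios with the standard formula ratio = 10^(dB/20). The sketch below assumes the two BIOS options are attenuation levels of -6 dB and -3.5 dB (PCIe de-emphasis uses exactly these two values, but the BIOS text does not say so, and the post only calls them amplification levels).

```python
def db_to_amplitude_ratio(db):
    """Convert a level in decibels to a voltage (amplitude) ratio."""
    return 10 ** (db / 20)


if __name__ == "__main__":
    # Assumption: the BIOS options are attenuations, hence the minus sign.
    for level in (6.0, 3.5):
        ratio = db_to_amplitude_ratio(-level)
        print(f"-{level} dB -> {ratio:.3f} of full amplitude")
```

Under that assumption, the 6 dB option leaves about half of the full amplitude and the 3.5 dB option about two thirds, which is a large enough difference to decide whether a marginal link trains at all.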

Two working weeks have passed without a single purple screen.
After the firmware update on the network cards, a Wake-on-LAN function appeared that was not there before, so I configured power management in vCenter. Now hosts wake up as needed.

In conclusion, pay attention to any functionality that appears when you add new hardware (such as extra items in the BIOS), and remember that not every "uncorrectable" hardware error is beyond repair. Some errors are caused by stock drivers and an outdated BIOS.

I hope my torment with the blades will be useful to someone.

Source: https://habr.com/ru/post/275611/

