There are a huge number of plugins for monitoring systems. The right one can be found in the exchange.nagios.org and monitoringexchange directories. When searching for a plugin, it is better to check both repositories: despite their apparent similarity, their contents differ.
Another thing is that the quality and functionality of the plugins, even ones similar to each other, vary greatly. There are hacks thrown together in a hurry that work only under strictly defined conditions and solve one narrow problem; after writing such a plugin, the author did not send it to /dev/null but decided to tell the world about it. Other plugins are well-made products that work with entire families of devices and provide extensive information about the target systems.
I would like to talk about the latter, especially since while working with nagios/icinga it turned out that there is very little Russian-language information on plugins for monitoring systems.
This article focuses on monitoring HP Proliant servers, and the author sincerely hopes that it will help those who work with HP equipment and would like to monitor its parameters more fully.
Managing HP Proliant servers in their various forms (iLO).
Most Proliant servers are managed through iLO - Integrated Lights-Out. The basic version of iLO allows you to control the server remotely through an HTTP interface: turn the power on and off, toggle the blue UID indicator that helps locate the server in the rack, view the iLO and Integrated Management Log (IML) logs, and see the current readings of internal components - processor temperatures, fan speeds and other information useful for judging the health of the system. There is also an iLO Advanced license - an extended version for which you need to buy an activation key. It is inexpensive, somewhere around $50, and lets you redirect the console/keyboard/mouse to a remote browser, after which it becomes possible to control the BIOS during boot as well as log into the already booted system.
Currently there are four versions of iLO. Plain iLO, sometimes referred to as iLO1, was installed on Proliant servers of generations G1-G4, iLO2 on G5-G6, and iLO3 on G7. Starting with the G8 generation, which is now called gen8, servers come with iLO4. If you are interested in what goes on inside iLO4, there is a good article about it on Habr: Shedding light on the HP ProLiant iLO Management Engine.
All iLOs have their own independent network interface. It is active even when the server is powered down, as long as the Proliant is physically plugged into the outlet. The iLO is assigned its own address, and the interface itself is usually connected to a separate management switch with a separate VLAN and its own subnet.
Proliant blade servers also have iLO management interfaces, one per blade. The c3000/c7000 blade enclosures additionally have their own management interface (Onboard Administrator). Besides information about the overall state of the power supplies and the temperature sensors in the enclosure, the Onboard Administrator interface provides access to the iLO of every blade. The blades themselves carry at least iLO2, even in the G1 generation. The latest generation of blades (gen8) is equipped with iLO4.
Integrity servers and Superdome 2 blades have their own iLO versions (Integrity iLO through Integrity iLO3) - such servers are not common, so we will not consider them.
Some systems that are now outdated but still in service (the DL760 G2 - who would throw out such an 8-processor workhorse?) were not originally equipped with iLO; in them you can install a RILOE II card (full-size PCI) with a separate physical RJ45 LAN interface. RILOE II (Remote Insight Lights-Out Edition II) is quite a curious thing: it has a KVM interface (keyboard/VGA/mouse) for remote control, an RJ45 port for network connectivity and remote management, as well as... a connector for an external power supply.
There is also a cut-down version of iLO - LO100i. It was installed on some entry-level G6 and G7 models such as the DL160, DL180 and DL320, as well as the low-cost ML series. In the gen8 generation even the entry-level line no longer gets LO100i (at least that is what was said at an HP conference). LO100i works on one of the two Proliant network interfaces and can do so either in dedicated or in shared mode. In dedicated mode one of the server's network interfaces is fully occupied by the LO100i; in shared mode the bandwidth is shared between data and management. Management takes up a small slice and has virtually no effect on the data. The LO100i also has its own separate network address, independent of the main server address. Some entry-level models (for example, ML110/ML150 G2) have no management interface at all, but if necessary it can be added by installing a special RMP (Remote Management Processor) card. The card does not go into a slot - it mounts onto motherboard connectors (piggyback), and it cannot be installed in other Proliant models.
In real life not everything goes smoothly with LO100i in shared mode (in dedicated mode everything is fine). LO100i works well if the two Proliant network interfaces operate independently or are configured for Network Fault Tolerance (NFT) or Transmit Load Balancing (TLB) with hot spare. But when the links are aggregated with LACP (the most effective way of using multiple network interfaces), the LO100i becomes unreachable, although data keeps flowing through the interfaces just fine. Moreover, it is unreachable both from the external network - from a workstation in the same VLAN on the same switch - and from the system console. Since the documentation claims the opposite, a case was opened with HP about this back at the end of November last year. At the moment the problem has slowly been escalated to L1 (the developers), but HP has not given any specific recommendations (or even an explanation of the causes), although judging by the tracker the engineers are working on something. The problem certainly exists, but it affects the monitoring task only indirectly, although it does prevent quick out-of-band management of the server.
The exchange.nagios.org website has a fairly large number of plugins that talk to the iLO directly, making HTTP requests with XML payloads. But they only work well with iLO2 and newer versions. iLO1 responds to XML queries very unintelligibly and provides little useful information, yet servers with iLO1 are still alive and well, and you need to understand what is happening inside them.
Plugin check_ilo2_health.pl
For checking the iLO, check_ilo2_health.pl turned out to be the most effective. Download it here. The version on exchange.nagios.org is older and does not understand iLO4.
Installing the script does not take much work. The script is copied to the directory where the other nagios/icinga scripts live, then the check command is defined in the configuration and assigned to the hosts:
$USER1$/check_ilo2_health.pl -H $HOSTADDRESS$ -d 1 -u $ARG1$ -p $ARG2$
where $ARG1$ and $ARG2$ are the user name and password of the iLO interface.
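For reference, a full command and service definition in the nagios/icinga object configuration might look like the sketch below; the host name, credentials and service template are illustrative and will differ in your setup.
define command {
    command_name    check_ilo2_health
    command_line    $USER1$/check_ilo2_health.pl -H $HOSTADDRESS$ -d 1 -u $ARG1$ -p $ARG2$
}
define service {
    use                     generic-service       ; example service template
    host_name               ilo-dl380-01          ; hypothetical host entry for the iLO address
    service_description     ILO2_HEALTH
    check_command           check_ilo2_health!Administrator!secret
}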
The script requires several Perl modules (installed via CPAN): Nagios::Plugin (be sure to read the UPDATE at the end!), IO::Socket::SSL and XML::Simple.
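The modules can be installed straight from CPAN, for example (on FreeBSD the same modules may also be available as packages/ports):
perl -MCPAN -e 'install IO::Socket::SSL'
perl -MCPAN -e 'install XML::Simple'
perl -MCPAN -e 'install Nagios::Plugin'    # see UPDATE 2016: Monitoring::Plugin is the replacement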
Options:
-e - the plugin ignores syntax error messages in the XML output. May be useful for older firmware.
-n - without temperature indicators
-d - temperature data compatible with PerfParse
-v - output full XML message (for debugging)
-3 - iLO3 and iLO4 support
-a - check fan redundancy (if supported by the hardware)
-b - check HDD bays (if supported by hardware)
-o - check power supply redundancy (if supported by the hardware)
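For example, a check of an iLO3/iLO4 interface with temperature perfdata plus fan and power supply redundancy checks could be invoked like this (the address and credentials are placeholders):
./check_ilo2_health.pl -H 10.1.2.3 -u Administrator -p secret -d 1 -3 -a -o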
Besides the information it returns, the script is useful for checking that the iLO itself is reachable from the outside - an ordinary ping may not be enough here.
HP Systems Insight Manager
It would be a mistake to think that HP itself has nothing to offer the administrator. HP Systems Insight Manager (HP SIM) is a solution designed to manage and monitor the Proliant platform and, importantly, it is free to use with HP products (download here). Support costs money, and add-ons for monitoring some third-party systems are also paid. For full operation it requires drivers (agents) to be installed on the managed servers.
A reasonable question arises: can the information HP SIM collects be fed into icinga/nagios, so that everything is gathered in a single monitoring application? Surprisingly, someone has already solved this problem.
Plugin check_hpasm
The check_hpasm plugin comes from ConSol Labs; its author is Gerhard Lausser. The plugin is designed to collect information from the following Proliant systems:
• Linux with the HP System Health Application and Insight Management Agent (HPASM) installed;
• Windows 2003/2008/2008 R2/2012 with the HP SIM drivers installed;
• HP blade enclosures (c7000/c3000) with Onboard Administrator.
In all three cases SNMP must be enabled and configured.
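As a rough illustration for the Linux case: with net-snmp the relevant fragment of snmpd.conf might look like the lines below (the community string and the monitoring host address are example values; the HP agents hook into the local snmpd). For Windows and the Onboard Administrator the equivalent settings are made through their own interfaces.
# /etc/snmp/snmpd.conf (fragment) - example values
rocommunity public 10.0.0.10        # read-only community, limited to the monitoring host
trapsink    10.0.0.10 public        # destination for SNMP traps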
What it was tested on: the monitoring system is Icinga 1.8.4 (fully compatible with Nagios) installed on FreeBSD 9.0, so if you have a different monitoring system or OS, adjust the dependencies and paths accordingly. Perl with all the necessary modules is installed on the system.
Installing check_hpasm.
Download the latest version: go to the plugin page and find the link to check_hpasm; it is located at the bottom of the page. At the time of writing, version 4.6.3 was available.
The plugin has to be built and installed:
1. Copy it to the monitoring system via ftp or any other way.
2. Unpack it:
tar xvf check_hpasm-4.6.3.tar.gz
3. Change into the unpacked directory:
cd check_hpasm-4.6.3
4. run
./configure
The keys for configure:
--libexecdir=/usr/local/libexec/nagios
(the directory where the check_* scripts live; mine ended up the old-fashioned way in the Nagios location, although /usr/local/libexec/icinga exists too)
--with-nagios-user=icinga
(the user name under which the monitoring system runs; for nagios it will be nagios)
--with-nagios-group=icinga
(the group of the user the monitoring system runs as; for icinga it is icinga, for nagios it is nagios)
--enable-perfdata=YES
(whether to output performance data for statistics collection - YES; the default is no)
--enable-extendedinfo=YES
(whether to output extended information - YES; the default is no)
The final configuration line will be:
./configure --with-nagios-user=icinga --with-nagios-group=icinga --with-degrees=yes --enable-perfdata=yes --enable-extendedinfo=yes --libexecdir=/usr/local/libexec/nagios
Look through the configure output and make sure everything is as expected.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... no
checking for mawk... no
checking for nawk... nawk
checking whether make sets $(MAKE)... yes
checking how to create a pax tar archive... gnutar
checking build system type... i386-unknown-freebsd9.0
checking host system type... i386-unknown-freebsd9.0
checking whether make sets $(MAKE)... (cached) yes
checking for gawk... (cached) nawk
checking for sh... /bin/sh
checking for perl... /usr/bin/perl
configure: creating ./config.status
config.status: creating Makefile
config.status: WARNING: 'Makefile.in' seems to ignore the --datarootdir setting
config.status: creating plugins-scripts/Makefile
config.status: WARNING: 'plugins-scripts/Makefile.in' seems to ignore the --datarootdir setting
config.status: creating plugins-scripts/subst
--with-perl: /usr/bin/perl
--with-nagios-user: icinga
--with-nagios-group: icinga
--with-noinst-level: unknown
--with-degrees: yes
--enable-perfdata: yes
--enable-extendedinfo: yes
--enable-hwinfo: yes
--enable-hpacucli: no
Then build and install:
make
make install
In the directory given by the libexecdir option, the check_hpasm script should appear. If it is not there, or it is noticeably smaller than expected, the script did not get built. If everything is fine, check its permissions and correct them if necessary.
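A quick way to check and, if needed, fix the ownership and permissions (the path and the icinga user are the ones used above):
ls -l /usr/local/libexec/nagios/check_hpasm
chown icinga:icinga /usr/local/libexec/nagios/check_hpasm
chmod 755 /usr/local/libexec/nagios/check_hpasm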
You can check that the plugin works by running it from the command line:
check_hpasm -H <proliant_ip> --community <SNMP_community> -v
After that, the script will print detailed information about the state of the remote system. For example, for a c7000 blade enclosure the output in nagios/icinga looks like this:
OK - System: 'bladesystem c7000 enclosure g2', S/N: 'GBXXXXXXXX', hardware working fine, temp_1:1:1=20 temp_1:1:12=20 temp_1:1:13=20 temp_1:1:2=31 temp_1:1:4=20 temp_1:1:5=19 temp_1:1:6=19 temp_1:1:7=20
common enclosure Blade7000 condition is ok (Ser: GBXXXXXXXX, FW: 3.70)
fan 1:1:1 is present, location is 1, redundance is other, condition is ok
fan 1:1:10 is present, location is 10, redundance is other, condition is ok
fan 1:1:2 is present, location is 2, redundance is other, condition is ok
fan 1:1:3 is present, location is 3, redundance is other, condition is ok
fan 1:1:4 is present, location is 4, redundance is other, condition is ok
fan 1:1:5 is present, location is 5, redundance is other, condition is ok
fan 1:1:6 is present, location is 6, redundance is other, condition is ok
fan 1:1:7 is present, location is 7, redundance is other, condition is ok
fan 1:1:8 is present, location is 8, redundance is other, condition is ok
fan 1:1:9 is present, location is 9, redundance is other, condition is ok
Chassis temperature is 20C (42 max)
Blade Bay temperature is 20C (42 max)
Blade Bay temperature is 20C (42 max)
System temperature is 31C (75 max)
Blade Bay temperature is 20C (42 max)
Blade Bay temperature is 19C (42 max)
Blade Bay temperature is 19C (42 max)
Blade Bay temperature is 20C (42 max)
manager 1:1:1 is present, location is 1, redundance is redundant, condition is ok, role is active
manager 1:1:2 is present, location is 0, redundance is notRedundant, condition is ok, role is standby
power enclosure 1:1 'Blade7000' condition is ok
power supply 1:1:1 is present, condition is ok (Ser: 5AGUD0AHLZ93AC, FW: )
power supply 1:1:2 is present, condition is ok (Ser: 5AGUD0AHLZ93AE, FW: )
power supply 1:1:3 is present, condition is ok (Ser: 5AGUD0AHLZ93AL, FW: )
power supply 1:1:4 is present, condition is ok (Ser: 5AGUD0AHLZ93AK, FW: )
power supply 1:1:5 is present, condition is ok (Ser: 5AGUD0AHLZ92LT, FW: )
power supply 1:1:6 is present, condition is ok (Ser: 5AGUD0AHLZ93AD, FW: )
net connector 1:1:1 is present, model is HP 1Gb Ethernet Pass-Thru Module for c-Class BladeSystem (Ser: TWTXXXXXXX, FW: )
net connector 1:1:2 is present, model is HP 1Gb Ethernet Pass-Thru Module for c-Class BladeSystem (Ser: TWTXXXXXXX, FW: )
net connector 1:1:3 is present, model is BROCADE HP B-series 8/12c SAN Switch BladeSystem c-Class (Ser: CNXXXXXXXX, FW: )
net connector 1:1:4 is present, model is BROCADE HP B-series 8/12c SAN Switch BladeSystem c-Class (Ser: CNXXXXXXXX, FW: )
server blade 1:1:1 'BLADE1' is present, status is ok, powered is on
server blade 1:1:10 'BLADE10' is present, status is ok, powered is on
server blade 1:1:2 'BLADE2' is present, status is ok, powered is on
server blade 1:1:3 'BLADE3' is present, status is ok, powered is on
server blade 1:1:4 'BLADE4' is present, status is ok, powered is on
server blade 1:1:9 'BLADE9' is present, status is ok, powered is on
If a faulty component is detected, the plugin returns WARNING and indicates the cause. For example, this is the message about a failing battery on the array accelerator:
WARNING - controller accelerator battery needs attention, System: 'proliant dl380 g4', S/N: 'GB8640P5NS', ROM: 'P51 04/26/2006'
checking cpus
cpu 0 is ok
cpu 1 is ok
checking power supplies
powersupply 1 is ok
powersupply 2 is ok
checking fans
overall fan status: system=ok, cpu=ok
fan 1 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 2
fan 2 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 3
fan 3 is present, speed is normal, pctmax is 50%, location is ioBoard, redundance is redundant, partner is 4
fan 4 is present, speed is normal, pctmax is 50%, location is ioBoard, redundance is redundant, partner is 5
fan 5 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 6
fan 6 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 7
fan 7 is present, speed is normal, pctmax is 50%, location is powerSupply, redundance is redundant, partner is 8
fan 8 is present, speed is normal, pctmax is 50%, location is powerSupply, redundance is redundant, partner is 1
checking temperatures
1 cpu temperature is 38C (62 max)
2 cpu temperature is 37C (87 max)
3 ioBoard temperature is 34C (60 max)
4 cpu temperature is 40C (87 max)
5 powerSupply temperature is 31C (53 max)
checking memory
dimm module 0:1 (module 1 @ cartridge 0) is ok
dimm module 0:2 (module 2 @ cartridge 0) is ok
dimm module 0:3 (module 3 @ cartridge 0) is not present
dimm module 0:4 (module 4 @ cartridge 0) is not present
dimm module 0:5 (module 5 @ cartridge 0) is not present
dimm module 0:6 (module 6 @ cartridge 0) is not present
checking disk subsystem
controller accelerator is failed
controller accelerator battery is failed
logical drive 2:1 is ok (mirroring)
physical drive 2:144 is ok
physical drive 2:145 is ok
scsi controller 3:1 in slot 1 is ok
ide controller 0 in slot -1 is ok and unused
checking ASR
ASR overall condition is ok
checking events
By the way, this information cannot be collected through the iLO1 available on that server. And what does a dead array accelerator battery on the disk controller mean in practice? The write speed drops almost by half.
Even more detailed information can be obtained by adding the -vv key to the command line, and if you need a really long listing for diagnostics, use -vvv.
Register the check command in nagios:
$USER1$/check_hpasm --hostname $HOSTADDRESS$ --community public -v
Here public should be replaced with your SNMP community. Then assign the check command to hosts and services.
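As with the iLO check, the full object definitions might look roughly like this; the host name and service template are illustrative, and here the community is passed as $ARG1$ instead of being hard-coded:
define command {
    command_name    check_hpasm
    command_line    $USER1$/check_hpasm --hostname $HOSTADDRESS$ --community $ARG1$
}
define service {
    use                     generic-service      ; example service template
    host_name               proliant-dl380-01    ; hypothetical host entry
    service_description     HW_HEALTH
    check_command           check_hpasm!public
}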
In its work the plugin relies on the data returned by the drivers. Sometimes (usually because of firmware bugs) they return incorrect values: physically absent components reported as present, wrong temperature readings (99 degrees, as happened with iLO4 on gen8), and so on. For such cases there is an extensive set of options that let you exclude individual sensors or entire subsystems. The list of options is long and is given in full on the plugin page in the Blacklisting section.
For this to work under Windows 2003/2008/2012 on a Proliant, you need to install:
1. The SNMP service (not installed by default; it is found among the system components) and configure it properly: specify the community, set the nagios/icinga server address as an allowed SNMP host, and set the address for sending SNMP traps. The WBEM providers / SIM drivers depend on the SNMP service. WBEM stands for Web-Based Enterprise Management.
2. The latest versions of iLO, the HP SIM drivers and the WBEM providers for the correct version of the system.
For Linux, HPASM is installed in a similar way - it also requires configuring SNMP and the drivers. This is described in detail in the HOWTO.
You can install the drivers in two ways:
1. Individually for each model in your fleet, for example for the HP ProLiant DL380 G4 Server; they are located in the Driver - System Management section (iLO drivers) or Software - System Management (WBEM providers and SIM drivers).
2. Download the SPP (Service Pack for Proliant, formerly PSP) - the latest version is 2012.10.0. Downloading requires free registration with HP Passport and a number of manipulations to "purchase" the .iso with the SPP for $0.00. (I forgot to mention that downloading HP SIM is the same story.)
Installing drivers from the SPP is much more convenient, if only because you can do a mass installation on several servers at once. HP SUM (Smart Update Manager) takes care of that. It needs the credentials of a domain administrator or a local administrator of the server. In addition, before installing the drivers it is strongly recommended to update all available firmware.
Noted problems:
1. On a dozen or two servers the installation took a very, very long time - why is unclear.
2. It is not recommended to run HP SUM on a machine with production software, because it tends to load the processor up to 90%.
3. Some servers cannot be installed automatically, especially if the system is bare and the WBEM/SIM drivers have never been installed before. You will have to do it by hand.
4. Sometimes HP SUM does not let you select the WBEM/SIM drivers for the target servers - they will also have to be installed by hand; this especially concerns servers with LO100i. Noticed on the DL180 G6 and DL160 G6.
5. WBEM/SIM drivers will not install if Windows 2008 R2 is running on an officially unsupported platform (such as the outdated DL360 G4 or DL380 G4). The blog http://kf.livejournal.com describes the following solution:
What if you can't install HP Insight Management Agents or WBEM to Windows Server 2008 R2? On HP DL3x0 G4 servers running Microsoft Windows Server 2008 R2, the HP System Management Homepage installs, but the WBEM providers and HP Insight Management Agents do not. The WBEM/HPIMA installation from SmartStart fails with the message:
Installation for "HP Insight Management Agents for Windows Server 2003/2008 x64 Editions" requires one or more of the following that is not currently installed or in the install set:
- HP ProLiant Advanced System Management Controller Driver for Windows
- HP ProLiant iLO Advanced and Enhanced System Management Controller Driver for Windows
- HP ProLiant iLO 2 Management Controller Driver for Windows
- HP ProLiant iLO 3 Management Controller Driver for Windows
The workaround:
1) Download HP ProLiant iLO Advanced and Enhanced System Management Controller Driver for Windows Server 2008 x64 Editions (cp010914.exe) to the server.
2) Extract the downloaded file with its integrated extract feature.
3) Set the compatibility mode for cpqsetup.exe to Windows Server 2008 (Service Pack 1).
4) Run cpqsetup.exe; the installation should work fine.
5) Install HP Insight Management Agents/WBEM as usual.
6) Also, one of the unknown devices will disappear from Device Manager and will now be called "HP ProLiant iLO2 Advanced System Management Controller".
Check_hpasm errors and disadvantages
As of version 4.6.3, the following error was noticed: when running on FreeBSD the plugin prints the message:
Use of uninitialized value in lc at /usr/local/libexec/nagios/check_hpasm line 3622.
No effect on the operation of the plugin itself or on the information it reports was observed. This is fixed in newer versions.
It was assumed that the lc utility was missing, but installing the utility did not help - the error remained. The plugin author has been notified.
Among the shortcomings: the dates of the IML entries are printed in Unix format (seconds since the epoch), which makes them hard to read:
Event: 76 Added:1357193160 Class: (System Revision) informational ROM flashed (New version: 12/02/2011)
They can be converted with an online calculator, for example here.
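Or right on the monitoring host, without any calculator; for the event above the conversion looks like this (BSD and GNU date differ in syntax):
date -r 1357193160     # FreeBSD / BSD date
date -d @1357193160    # GNU date on Linux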
You can write to the plugin author (in plain English) about any errors you notice. Gerhard Lausser is a busy person, but he always tries to respond to reports of problems or misbehavior in check_hpasm.
Plugin benefits
Using this plugin, the following were quickly discovered:
• Faulty batteries on the array accelerators of the disk controllers in DL360 G4 servers, which led to a drop in write speed on the attached MSA20 arrays.
• A stuck fan in one of the Proliants. The fan was replaced.
• A dead redundant power supply in one of the servers. The power supply was replaced.
Of course, if you look at all the iLO interfaces every day and carefully read the IML logs, you will notice such errors yourself, but then the question arises: when are you supposed to do your actual work? This plugin, combined with nagios/icinga, simplifies administration and lets you cover the HP Proliant server infrastructure completely.
Some helpful guides
1. HP Integrated Lights-Out User Guide
2. HP Integrated Lights-Out 2 User Guide
3. HP iLO 3 User Guide
4. HP iLO 4 User Guide
5. HP management software for Linux on ProLiant servers. HOWTO, 6th edition
6. HP Remote Insight Lights-Out Edition II User Guide
7. HP ProLiant Lights-Out 100 User Guide
PS Would articles on other aspects of monitoring be of interest - configurators, practical solutions, interesting plugins, addons, non-trivial hardware? A lot of material has accumulated - enough to write a book.
UPDATE 2015:
In 2015 we had to come back to the tasks of monitoring HP servers, and it turned out that the article had become a bit outdated.
In particular, it turned out that when working with iLO2 an error appears:
ILO2_HEALTH UNKNOWN - ERROR: Failed to establish connection with <host_address>: 443.
iLO3 and iLO4 work fine.
Investigation showed that the source is the well-known set of problems with SSL. Our environment needs updating:
1. Update the script to at least version 1.60 (the script is on exchange.nagios.org, here).
2. Upgrade the iLO2 firmware to the latest version (1.94 or 1.96, available since July 2015).
The check command needs to be changed:
./check_ilo2_health.pl -H <host_address> -d 1 -u <admin> -p <password> -l --sslopts 'SSL_verify_mode => SSL_VERIFY_NONE, SSL_version => "TLSv1"'
The --sslopts key was added; it enables TLSv1 and disables SSL certificate checking.
Result:
ILO2_HEALTH OK - (Board-Version: ILO2) Temperatures: Temp_1 (Ok): 18, Temp_2 (Ok): 40, Temp_4 (Ok): 25, Temp_5 (Ok): 26, Temp_8 (Ok): 37, Temp_9 (Ok): 30, Temp_10 (Ok): 37, Temp_11 (Ok): 29, Temp_12 (Ok): 41, Temp_19 (Ok): 21, Temp_20 (Ok): 26, Temp_21 (Ok): 27, Temp_22 (Ok): 25, Temp_23 (Ok): 34, Temp_24 (Ok): 29, Temp_25 (Ok): 26, Temp_26 (Ok): 26, Temp_29 (Ok): 35, Temp_30 (Ok): 63
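In the nagios/icinga configuration the same change might look roughly like this (a sketch based on the command definition used earlier; the quoting of the option string may need adjusting for your setup):
define command {
    command_name    check_ilo2_health_tls
    command_line    $USER1$/check_ilo2_health.pl -H $HOSTADDRESS$ -d 1 -u $ARG1$ -p $ARG2$ -l --sslopts 'SSL_verify_mode => SSL_VERIFY_NONE, SSL_version => "TLSv1"'
}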
UPDATE 2016:
Regarding all Perl scripts that use Nagios::Plugin (such as check_ilo2_health.pl):
Solely because of copyright and trademark issues (Nagios is a registered trademark), CPAN no longer indexes Nagios::Plugin, so it can no longer be installed and used in the normal way.
Instead, we use Monitoring::Plugin, which provides identical functionality; in the text of the scripts, Nagios has to be replaced with Monitoring.
You can install Monitoring::Plugin using:
perl -MCPAN -e 'install Monitoring::Plugin'
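The replacement in the script text can be done, for example, with sed (the path is the one used earlier in this article; the -i.bak suffix keeps a backup copy):
sed -i.bak 's/Nagios::Plugin/Monitoring::Plugin/g' /usr/local/libexec/nagios/check_ilo2_health.pl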
And it works!