
Monitor SMART in Zabbix

This article is for those who use Zabbix and want to learn how to build their own templates and monitor non-standard systems (ones Zabbix does not yet support), as well as those who need advanced SMART monitoring and have not been satisfied with the existing templates. Read on below the cut.

It all started with the fact that the existing SMART template did not suit me. It let me look at a rather limited number of attributes, and extending it to a level I found acceptable became burdensome, especially because it relied on simple UserParameter entries in Zabbix Agent, which grew unwieldy as their number increased. Let's look at one line in the config file that requests a parameter (there are many similar ones):

UserParameter=uHDD[*], sudo smartctl -A /dev/$1| grep "$2"| tail -1| cut -c 88-|cut -f1 -d' '

All is well if you have only one such parameter, or a couple, but what if you have ten of them? And, say, a dozen drives? Are we going to run smartctl for every such parameter (poking the disk yet again)? In addition, each such parameter is a separate request from Zabbix Server (or a group of requests with parameters substituted for *). Unfortunately, Zabbix Agent offers no solution here, since it does not support any other way of getting data. This is where Zabbix Trapper and the zabbix_sender utility, which let you send a whole batch of parameters at once, come to the rescue.
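To make the cost concrete: the UserParameter above starts a new smartctl process for every single value. A minimal sketch of the alternative, assuming the usual ten-column `smartctl -A` attribute table, captures the output once and pulls several raw values from that one capture. The helper name `raw_values` and the sample table are hypothetical and only serve as an illustration.

```shell
#!/bin/bash
# Hypothetical helper: read `smartctl -A` output on stdin and print
# "<attribute_name> <raw_value>" for each attribute name given as an argument.
# Assumes the standard table layout: column 2 is the name, column 10 the raw value.
raw_values() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, names, " "); for (i = 1; i <= n; i++) want[names[i]] = 1 }
        $1 ~ /^[0-9]+$/ && ($2 in want) { print $2, $10 }
    '
}

# Sample data standing in for one real call such as: smartctl -A /dev/sda
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15927
194 Temperature_Celsius     0x0022   125   096   000    Old_age   Always       -       25'

# One capture, several values: no extra smartctl run per attribute.
result=$(printf '%s\n' "$sample" | raw_values Power_On_Hours Temperature_Celsius)
printf '%s\n' "$result"
```

Splitting on whitespace with awk also avoids the fixed character offset (`cut -c 88-`) used in the UserParameter, which depends on the exact column widths of the smartctl output.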

Here we will deal with preparing the data for Zabbix Trapper.
Let's start by finding the devices that actually report SMART data, for which we will need the sg kernel module and the sg_map utility (both named in the script's header).

Let's write this script (smartdiscovery.sh):
    #!/bin/bash
    # Requires: the sg kernel module and the sg_map utility.
    # Takes generic SCSI devices from sg_map or from /usr/local/etc/smartdev.lst
    # (the file is preferred), then tries to read SMART attributes from each one;
    # on success, prints the "<device> <type>" pair to STDOUT.

    /usr/sbin/modprobe sg

    # Why so few device types? Because I cannot test the others on the matching
    # controllers; /usr/local/etc/smartdev.lst can be used to set the type manually.
    DEV_TYPE=(sat scsi ata)
    DEV_LST='/usr/local/etc/smartdev.lst'

    while read -r -a attr; do
        # Use the second column of the sg_map line when it is present
        if [ -z "${attr[1]}" ]; then
            DEV=${attr[0]}
        else
            DEV=${attr[1]}
        fi
        for i in "${DEV_TYPE[@]}"; do
            if /usr/sbin/smartctl -A -d "$i" "$DEV" | grep -q 'ID#'; then
                DEV=$(basename "$DEV")
                # Print the device unless the override file lists it
                # (entries from the file are printed at the end)
                if [ ! -f "$DEV_LST" ] || ! grep -q "$DEV" "$DEV_LST"; then
                    echo "$DEV $i"
                fi
                break
            fi
        done
    done < <(/usr/bin/sg_map)

    if [ -f "$DEV_LST" ]; then
        cat "$DEV_LST"
    fi


It looks for devices for us: it queries the utilities and compares each device found with the /usr/local/etc/smartdev.lst file; if a match is found, the values from the file are used. This temporarily works around my inability to test with some controllers, for example 3ware. The result is listed as pairs of values: <device name> <connection type>
Then we pass this list to another script (zabbix_smart_discovery.sh), which generates JSON for Zabbix:
zabbix_smart_discovery.sh
    #!/bin/bash
    # Format the list of discovered devices as JSON for Zabbix

    echo -e "{\n\t\"data\":["
    LN=0
    while IFS=' ' read -r -a attribute; do
        # Separate objects with a comma, starting from the second one
        if [[ $LN != 0 ]]; then
            echo ","
        fi
        echo -e "\t\t{ \"{#DEVNAME}\":\"${attribute[0]}\", \"{#DEVTYPE}\":\"${attribute[1]}\" }\c"
        LN=1
    done < /dev/stdin
    echo -e "\n\t]\n}"


The output will be something like this:
smartctl.discovery
    {
        "data":[
            { "{#DEVNAME}":"sg1", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg2", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg3", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg4", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg5", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg6", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg7", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sg8", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sdb", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sdc", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sdd", "{#DEVTYPE}":"sat" },
            { "{#DEVNAME}":"sde", "{#DEVTYPE}":"sat" }
        ]
    }


{#DEVNAME} and {#DEVTYPE} are macros that Zabbix will use for substitutions.
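To illustrate the substitution: for every object in the discovery JSON, Zabbix creates items from item prototypes by replacing the macros in the prototype key. The sketch below mimics that expansion locally. The prototype key `smartctl.smart[{#DEVNAME},5,raw_value]` is an assumption modeled on the trapper keys the scripts emit, and `expand_prototype` is a hypothetical helper, not part of Zabbix.

```shell
#!/bin/bash
# Hypothetical helper: expand an item prototype key for each "<name> <type>"
# pair read from stdin (the same pairs smartdiscovery.sh prints).
expand_prototype() {
    local prototype=$1 name type key
    local m_name='{#DEVNAME}' m_type='{#DEVTYPE}'
    while read -r name type; do
        # Quoting the macro variables makes bash treat them as literal text
        key=${prototype//"$m_name"/$name}
        key=${key//"$m_type"/$type}
        printf '%s\n' "$key"
    done
}

# Assumed prototype key, modeled on the keys smart2zabbix.sh produces.
expanded=$(printf 'sg1 sat\nsdb sat\n' | expand_prototype 'smartctl.smart[{#DEVNAME},5,raw_value]')
printf '%s\n' "$expanded"
```

Each expanded key has to match, character for character, a key that arrives through the trapper, otherwise the server has no item to attach the value to.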
The smart2zabbix.sh script generates the data for Zabbix Trapper:
smart2zabbix.sh
    #!/bin/bash
    # Format smartctl output as zabbix_sender input
    # $1 path of the device being examined
    # $2 device type, as used in the smartctl -d parameter
    # $3 hostname of the monitored system; can be set to '-' when the -s or -c
    #    parameter of zabbix_sender is used

    DEV_PATH=$1
    DEV_TYPE=$2
    HOSTNAME=$3
    HEADERS=(id attribute_name flag value worst thresh type updated when_failed raw_value)
    DEVICE=$(basename "$DEV_PATH")
    SECTION=''

    while IFS='' read -r line; do
        # Section markers in the smartctl output switch the parser state
        case $line in
            '=== START OF INFORMATION SECTION ===')
                SECTION='INFO'
                continue
                ;;
            '=== START OF READ SMART DATA SECTION ===')
                SECTION='HEALTH'
                continue
                ;;
            'ID#'*)
                SECTION='ATTR'
                continue
                ;;
        esac
        case $SECTION in
            'INFO')
                if [ -z "$line" ]; then
                    SECTION=''
                else
                    IFS=':' read -r -a attribute <<< "$line"
                    PRE="$HOSTNAME smartctl.info[$DEVICE,"
                    ATTR_V=$(echo ${attribute[1]} | sed -e 's/^[ \t]*//')
                    ATTR_N=$(echo ${attribute[0]} | tr '[:upper:]' '[:lower:]' | sed 's/ /_/')
                    case ${attribute[0]} in
                        'Model Family' | 'Device Model' | 'Serial Number' | 'Firmware Version' | 'User Capacity' | 'Rotation Rate')
                            echo "${PRE}$ATTR_N] \"$ATTR_V\""
                            ;;
                        'Sector Size' | 'Sector Sizes')
                            # Both spellings map to the same sector_size key
                            echo "${PRE}sector_size] \"$ATTR_V\""
                            ;;
                    esac
                fi
                ;;
            'HEALTH')
                if [ -z "$line" ]; then
                    SECTION=''
                else
                    IFS=':' read -r -a attribute <<< "$line"
                    PRE="$HOSTNAME smartctl.smart[$DEVICE,"
                    ATTR=$(echo ${attribute[1]} | sed -e 's/^[ \t]*//')
                    case ${attribute[0]} in
                        'SMART overall-health self-assessment test result')
                            echo "${PRE}test_result] \"$ATTR\""
                            ;;
                    esac
                fi
                ;;
            'ATTR')
                if [ -z "$line" ]; then
                    SECTION=''
                else
                    read -r -a attribute <<< "$line"
                    PRE="$HOSTNAME smartctl.smart[$DEVICE,"
                    for i in "${!attribute[@]}"; do
                        # Column 0 is the attribute ID; it becomes part of the key
                        if [[ $i == 0 ]]; then
                            continue
                        fi
                        case ${attribute[$i]} in
                            ''|*[!0-9]*)
                                # Non-numeric fields are sent as quoted strings
                                ATTR="\"${attribute[$i]}\""
                                ;;
                            *)
                                # Numeric fields: strip leading zeros
                                ATTR="$(echo ${attribute[$i]} | sed 's/^0*//')"
                                ;;
                        esac
                        if [ -z "$ATTR" ]; then
                            ATTR=0
                        fi
                        echo "${PRE}${attribute[0]},${HEADERS[$i]}] $ATTR"
                    done
                fi
                ;;
        esac
    done < /dev/stdin


The output will be something like this:
    test.local smartctl.info[sg1,model_family] "Western Digital RE4 (SATA 6Gb/s)"
    test.local smartctl.info[sg1,device_model] "WDC WD2000FYYZ-01UL1B1"
    test.local smartctl.info[sg1,serial_number] "WD-WCC1P1175320"
    test.local smartctl.info[sg1,firmware_version] "01.01K02"
    test.local smartctl.info[sg1,user_capacity] "2 000 398 934 016 bytes [2,00 TB]"
    test.local smartctl.info[sg1,sector_size] "512 bytes logical/physical"
    test.local smartctl.info[sg1,rotation_rate] "7200 rpm"
    test.local smartctl.smart[sg1,test_result] "PASSED"
    test.local smartctl.smart[sg1,1,attribute_name] "Raw_Read_Error_Rate"
    test.local smartctl.smart[sg1,1,flag] "0x002f"
    test.local smartctl.smart[sg1,1,value] 200
    test.local smartctl.smart[sg1,1,worst] 200
    test.local smartctl.smart[sg1,1,thresh] 51
    test.local smartctl.smart[sg1,1,type] "Pre-fail"
    test.local smartctl.smart[sg1,1,updated] "Always"
    test.local smartctl.smart[sg1,1,when_failed] "-"
    test.local smartctl.smart[sg1,1,raw_value] 0
    test.local smartctl.smart[sg1,3,attribute_name] "Spin_Up_Time"
    test.local smartctl.smart[sg1,3,flag] "0x0027"
    test.local smartctl.smart[sg1,3,value] 169
    test.local smartctl.smart[sg1,3,worst] 169
    test.local smartctl.smart[sg1,3,thresh] 21
    test.local smartctl.smart[sg1,3,type] "Pre-fail"
    test.local smartctl.smart[sg1,3,updated] "Always"
    test.local smartctl.smart[sg1,3,when_failed] "-"
    test.local smartctl.smart[sg1,3,raw_value] 6508
    test.local smartctl.smart[sg1,4,attribute_name] "Start_Stop_Count"
    test.local smartctl.smart[sg1,4,flag] "0x0032"
    test.local smartctl.smart[sg1,4,value] 100
    test.local smartctl.smart[sg1,4,worst] 100
    test.local smartctl.smart[sg1,4,thresh] 0
    test.local smartctl.smart[sg1,4,type] "Old_age"
    test.local smartctl.smart[sg1,4,updated] "Always"
    test.local smartctl.smart[sg1,4,when_failed] "-"
    test.local smartctl.smart[sg1,4,raw_value] 36
    test.local smartctl.smart[sg1,5,attribute_name] "Reallocated_Sector_Ct"
    test.local smartctl.smart[sg1,5,flag] "0x0033"
    test.local smartctl.smart[sg1,5,value] 200
    test.local smartctl.smart[sg1,5,worst] 200
    test.local smartctl.smart[sg1,5,thresh] 140
    test.local smartctl.smart[sg1,5,type] "Pre-fail"
    test.local smartctl.smart[sg1,5,updated] "Always"
    test.local smartctl.smart[sg1,5,when_failed] "-"
    test.local smartctl.smart[sg1,5,raw_value] 0
    test.local smartctl.smart[sg1,7,attribute_name] "Seek_Error_Rate"
    test.local smartctl.smart[sg1,7,flag] "0x002e"
    test.local smartctl.smart[sg1,7,value] 200
    test.local smartctl.smart[sg1,7,worst] 200
    test.local smartctl.smart[sg1,7,thresh] 0
    test.local smartctl.smart[sg1,7,type] "Old_age"
    test.local smartctl.smart[sg1,7,updated] "Always"
    test.local smartctl.smart[sg1,7,when_failed] "-"
    test.local smartctl.smart[sg1,7,raw_value] 0
    test.local smartctl.smart[sg1,9,attribute_name] "Power_On_Hours"
    test.local smartctl.smart[sg1,9,flag] "0x0032"
    test.local smartctl.smart[sg1,9,value] 79
    test.local smartctl.smart[sg1,9,worst] 79
    test.local smartctl.smart[sg1,9,thresh] 0
    test.local smartctl.smart[sg1,9,type] "Old_age"
    test.local smartctl.smart[sg1,9,updated] "Always"
    test.local smartctl.smart[sg1,9,when_failed] "-"
    test.local smartctl.smart[sg1,9,raw_value] 15927
    test.local smartctl.smart[sg1,10,attribute_name] "Spin_Retry_Count"
    test.local smartctl.smart[sg1,10,flag] "0x0032"
    test.local smartctl.smart[sg1,10,value] 100
    test.local smartctl.smart[sg1,10,worst] 253
    test.local smartctl.smart[sg1,10,thresh] 0
    test.local smartctl.smart[sg1,10,type] "Old_age"
    test.local smartctl.smart[sg1,10,updated] "Always"
    test.local smartctl.smart[sg1,10,when_failed] "-"
    test.local smartctl.smart[sg1,10,raw_value] 0
    test.local smartctl.smart[sg1,11,attribute_name] "Calibration_Retry_Count"
    test.local smartctl.smart[sg1,11,flag] "0x0032"
    test.local smartctl.smart[sg1,11,value] 100
    test.local smartctl.smart[sg1,11,worst] 253
    test.local smartctl.smart[sg1,11,thresh] 0
    test.local smartctl.smart[sg1,11,type] "Old_age"
    test.local smartctl.smart[sg1,11,updated] "Always"
    test.local smartctl.smart[sg1,11,when_failed] "-"
    test.local smartctl.smart[sg1,11,raw_value] 0
    test.local smartctl.smart[sg1,12,attribute_name] "Power_Cycle_Count"
    test.local smartctl.smart[sg1,12,flag] "0x0032"
    test.local smartctl.smart[sg1,12,value] 100
    test.local smartctl.smart[sg1,12,worst] 100
    test.local smartctl.smart[sg1,12,thresh] 0
    test.local smartctl.smart[sg1,12,type] "Old_age"
    test.local smartctl.smart[sg1,12,updated] "Always"
    test.local smartctl.smart[sg1,12,when_failed] "-"
    test.local smartctl.smart[sg1,12,raw_value] 30
    test.local smartctl.smart[sg1,183,attribute_name] "Runtime_Bad_Block"
    test.local smartctl.smart[sg1,183,flag] "0x0032"
    test.local smartctl.smart[sg1,183,value] 100
    test.local smartctl.smart[sg1,183,worst] 100
    test.local smartctl.smart[sg1,183,thresh] 0
    test.local smartctl.smart[sg1,183,type] "Old_age"
    test.local smartctl.smart[sg1,183,updated] "Always"
    test.local smartctl.smart[sg1,183,when_failed] "-"
    test.local smartctl.smart[sg1,183,raw_value] 0
    test.local smartctl.smart[sg1,192,attribute_name] "Power-Off_Retract_Count"
    test.local smartctl.smart[sg1,192,flag] "0x0032"
    test.local smartctl.smart[sg1,192,value] 200
    test.local smartctl.smart[sg1,192,worst] 200
    test.local smartctl.smart[sg1,192,thresh] 0
    test.local smartctl.smart[sg1,192,type] "Old_age"
    test.local smartctl.smart[sg1,192,updated] "Always"
    test.local smartctl.smart[sg1,192,when_failed] "-"
    test.local smartctl.smart[sg1,192,raw_value] 29
    test.local smartctl.smart[sg1,193,attribute_name] "Load_Cycle_Count"
    test.local smartctl.smart[sg1,193,flag] "0x0032"
    test.local smartctl.smart[sg1,193,value] 200
    test.local smartctl.smart[sg1,193,worst] 200
    test.local smartctl.smart[sg1,193,thresh] 0
    test.local smartctl.smart[sg1,193,type] "Old_age"
    test.local smartctl.smart[sg1,193,updated] "Always"
    test.local smartctl.smart[sg1,193,when_failed] "-"
    test.local smartctl.smart[sg1,193,raw_value] 6
    test.local smartctl.smart[sg1,194,attribute_name] "Temperature_Celsius"
    test.local smartctl.smart[sg1,194,flag] "0x0022"
    test.local smartctl.smart[sg1,194,value] 125
    test.local smartctl.smart[sg1,194,worst] 96
    test.local smartctl.smart[sg1,194,thresh] 0
    test.local smartctl.smart[sg1,194,type] "Old_age"
    test.local smartctl.smart[sg1,194,updated] "Always"
    test.local smartctl.smart[sg1,194,when_failed] "-"
    test.local smartctl.smart[sg1,194,raw_value] 25
    test.local smartctl.smart[sg1,196,attribute_name] "Reallocated_Event_Count"
    test.local smartctl.smart[sg1,196,flag] "0x0032"
    test.local smartctl.smart[sg1,196,value] 200
    test.local smartctl.smart[sg1,196,worst] 200
    test.local smartctl.smart[sg1,196,thresh] 0
    test.local smartctl.smart[sg1,196,type] "Old_age"
    test.local smartctl.smart[sg1,196,updated] "Always"
    test.local smartctl.smart[sg1,196,when_failed] "-"
    test.local smartctl.smart[sg1,196,raw_value] 0
    test.local smartctl.smart[sg1,197,attribute_name] "Current_Pending_Sector"
    test.local smartctl.smart[sg1,197,flag] "0x0032"
    test.local smartctl.smart[sg1,197,value] 200
    test.local smartctl.smart[sg1,197,worst] 200
    test.local smartctl.smart[sg1,197,thresh] 0
    test.local smartctl.smart[sg1,197,type] "Old_age"
    test.local smartctl.smart[sg1,197,updated] "Always"
    test.local smartctl.smart[sg1,197,when_failed] "-"
    test.local smartctl.smart[sg1,197,raw_value] 0
    test.local smartctl.smart[sg1,198,attribute_name] "Offline_Uncorrectable"
    test.local smartctl.smart[sg1,198,flag] "0x0030"
    test.local smartctl.smart[sg1,198,value] 200
    test.local smartctl.smart[sg1,198,worst] 200
    test.local smartctl.smart[sg1,198,thresh] 0
    test.local smartctl.smart[sg1,198,type] "Old_age"
    test.local smartctl.smart[sg1,198,updated] "Offline"
    test.local smartctl.smart[sg1,198,when_failed] "-"
    test.local smartctl.smart[sg1,198,raw_value] 0
    test.local smartctl.smart[sg1,199,attribute_name] "UDMA_CRC_Error_Count"
    test.local smartctl.smart[sg1,199,flag] "0x0032"
    test.local smartctl.smart[sg1,199,value] 200
    test.local smartctl.smart[sg1,199,worst] 200
    test.local smartctl.smart[sg1,199,thresh] 0
    test.local smartctl.smart[sg1,199,type] "Old_age"
    test.local smartctl.smart[sg1,199,updated] "Always"
    test.local smartctl.smart[sg1,199,when_failed] "-"
    test.local smartctl.smart[sg1,199,raw_value] 0
    test.local smartctl.smart[sg1,200,attribute_name] "Multi_Zone_Error_Rate"
    test.local smartctl.smart[sg1,200,flag] "0x0008"
    test.local smartctl.smart[sg1,200,value] 200
    test.local smartctl.smart[sg1,200,worst] 200
    test.local smartctl.smart[sg1,200,thresh] 0
    test.local smartctl.smart[sg1,200,type] "Old_age"
    test.local smartctl.smart[sg1,200,updated] "Offline"
    test.local smartctl.smart[sg1,200,when_failed] "-"
    test.local smartctl.smart[sg1,200,raw_value] 0

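Each line of this output follows the `<hostname> <key> <value>` layout that zabbix_sender reads from an input file (`-i`). Before wiring the pipeline to a live server, it can be handy to sanity-check the generated lines locally. The sketch below is one hypothetical way to do that; `check_sender_lines` and its rules (a host, a key with bracketed parameters, and a value that is either a quoted string or a bare integer) are assumptions drawn from the sample output, not a Zabbix tool.

```shell
#!/bin/bash
# Hypothetical checker: read zabbix_sender input lines on stdin, print any
# malformed lines (with line numbers) to stderr, and return non-zero if found.
check_sender_lines() {
    # host, key[params], then either a "quoted string" or a bare integer;
    # the "]" after the bracket expression is a literal closing bracket
    local pattern='^[^ ]+ [a-z_.]+\[[^]]+] ("[^"]*"|[0-9]+)$'
    # grep -v selects the lines that do NOT match; -n prefixes line numbers
    if grep -Evn "$pattern" >&2; then
        return 1    # at least one malformed line was printed
    fi
    return 0
}

# Sample lines in the same layout smart2zabbix.sh prints
good='test.local smartctl.info[sg1,rotation_rate] "7200 rpm"
test.local smartctl.smart[sg1,5,raw_value] 0
- smartctl.smart[sg1,194,type] "Old_age"'

printf '%s\n' "$good" | check_sender_lines && echo "batch looks well-formed"
```

In a real run the same stream would go on to `zabbix_sender -i -`, which reads the lines from standard input.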

And then we simply send all of this to Zabbix Trapper:
zabbix_smartctl.sh
    #!/bin/bash
    # Sending collected data to the zabbix server
    # Get device list and type from STDIN, produced by smartdiscovery.sh

    PREFIX='/usr/local/bin'
    AGENT_CFG='/etc/zabbix/zabbix_agentd.conf'

    while IFS=' ' read -r -a attr; do
        smartctl -A -H -i -d ${attr[1]} /dev/${attr[0]} |
            $PREFIX/smart2zabbix.sh /dev/${attr[0]} ${attr[1]} - |
            /usr/bin/zabbix_sender -c $AGENT_CFG -i -
    done < /dev/stdin


Next, you only need to enable sudo for some scripts, place the task in cron and import the template on Zabbix Server.
The ready-made kit can be obtained from the official Zabbix Share portal, where it is all laid out for everyone: SMART monitoring with smartmontools (LLD, Trapper)
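The sudo and cron steps mentioned above might be sketched as follows. Everything here is an assumption for illustration: the wrapper name `smart_cycle.sh`, the `run` argument, the ten-minute interval, and the `zabbix` account in the sudoers comment. Only the `/usr/local/bin` prefix and the two script names come from the scripts themselves.

```shell
#!/bin/bash
# Hypothetical cron wrapper: discover the devices once, then send their SMART data.
# PREFIX matches the scripts above; SUDO may be set to an empty string when the
# wrapper already runs as root.
PREFIX=${PREFIX:-/usr/local/bin}
SUDO=${SUDO-sudo}

smart_cycle() {
    # zabbix_smartctl.sh reads the "<device> <type>" pairs from stdin
    $SUDO "$PREFIX/smartdiscovery.sh" | $SUDO "$PREFIX/zabbix_smartctl.sh"
}

# Run only when asked to, so the file can also be sourced for testing.
if [ "${1:-}" = "run" ]; then
    smart_cycle
fi

# Assumed crontab entry for the account that owns the wrapper:
#   */10 * * * * /usr/local/bin/smart_cycle.sh run
# Assumed sudoers entry (edit with visudo), if that account is "zabbix":
#   zabbix ALL=(root) NOPASSWD: /usr/local/bin/smartdiscovery.sh, /usr/local/bin/zabbix_smartctl.sh
```

Granting sudo for the two scripts rather than for smartctl itself fits the way the scripts call smartctl directly, without sudo.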
The main advantage over other similar templates and scripts is that every attribute is already being sent, so you can start using any of them later at will, without changing the scripts, just by adding the items on the server.

Source: https://habr.com/ru/post/274391/

