1 year, 1 month ago
If you use Zabbix and want to learn how to write templates and monitor non-standard systems (ones Zabbix does not support yet), or if you need extended S.M.A.R.T. monitoring and the existing templates did not suit you, read on.

It all started when the existing S.M.A.R.T. template did not suit me. It exposed a rather limited set of attributes, and extending it to a level acceptable to me turned out to be laborious, mostly because it relied on simple Zabbix Agent items, and as their number grew it became uncomfortable. Let's look at one line from the config that requests a parameter (there are many similar ones):

UserParameter=uHDD[*], sudo smartctl -A /dev/$1| grep "$2"| tail -1| cut -c 88-|cut -f1 -d' '

This is fine if you have just one such parameter, or a couple. But what if you have ten of them? And, say, ten disks? Each parameter runs smartctl again (and hits the disk again). Besides, each parameter is a separate request from Zabbix Server (or a group request with the parameters substituted for *). Zabbix Agent offers no way around this, since it does not support any other method of data acquisition. Zabbix Trapper and the zabbix_sender utility come to the rescue: together they let us send a whole batch of parameters at once.
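As a rough illustration of the batch idea (a sketch, not the script used later in this article): the output of a single smartctl call can be turned into many `<host> <key> <value>` lines, which zabbix_sender accepts on standard input with `-i -`. The function name `attrs_to_batch` and the sample rows are made up for this example; the key layout follows the one used by the scripts later in the article.

```shell
#!/bin/bash
# attrs_to_batch HOST DEVICE
# Reads a "smartctl -A" attribute table on stdin and prints one
# "<host> <key> <value>" line per attribute (raw value only).
attrs_to_batch() {
	awk -v host="$1" -v dev="$2" '
		# Attribute rows start with a numeric ID; column 10 is RAW_VALUE.
		$1 ~ /^[0-9]+$/ && NF >= 10 {
			printf "%s smartctl.smart[%s,%s,raw_value] %s\n", host, dev, $1, $10
		}'
}

# Hypothetical sample rows, shaped like smartctl -A output.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   125   096   000    Old_age   Always       -       25'

printf '%s\n' "$sample" | attrs_to_batch test.local sg1
```

On a real host the same function would sit between the two tools, for example `smartctl -A /dev/sda | attrs_to_batch - sda | zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i -`, so the disk is queried once no matter how many attributes are sent.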

These are what we will prepare the data for.
Let's begin by finding the devices that report S.M.A.R.T. at all. For that we need:
  • The sg driver (modprobe sg), which among other things lets us see disks behind a number of RAID controllers (in my case, Adaptec)
  • The sg_map utility, which lists the devices associated with the sg driver
  • And, of course, smartctl
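
Before going further, it may help to check that these tools are actually installed. The sketch below is only an illustration; the helper name `missing_tools` is hypothetical, and it checks PATH only (on some systems smartctl lives in /usr/sbin, which may not be in an unprivileged user's PATH).

```shell
#!/bin/bash
# missing_tools NAME...
# Prints every command from the argument list that is not found in PATH.
missing_tools() {
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || echo "$tool"
	done
}

missing=$(missing_tools sg_map smartctl zabbix_sender)
if [ -n "$missing" ]; then
	# $missing is left unquoted on purpose so the names end up on one line.
	echo "Missing tools:" $missing
else
	echo "All required tools are present"
fi
```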

Let's write the following script (smartdiscovery.sh):
#!/bin/bash
# Requires: the sg kernel module and the sg_map utility.
# Takes generic SCSI devices from sg_map, tries to read S.M.A.R.T. attributes from each one,
# and on success prints a "<device> <type>" pair to STDOUT.
# Entries from /usr/local/etc/smartdev.lst take precedence over detected ones.

/usr/sbin/modprobe sg
# The device type list is short because I cannot test other controllers;
# /usr/local/etc/smartdev.lst can be used to set the device type manually.
DEV_TYPE=(sat scsi ata)
DEV_LST='/usr/local/etc/smartdev.lst'
while read -r -a attr; do
	# Use the second column of the sg_map line when it is present.
	if [ -z "${attr[1]}" ]; then
		DEV=${attr[0]}
	else
		DEV=${attr[1]}
	fi
	for i in "${DEV_TYPE[@]}"; do
		if /usr/sbin/smartctl -A -d "$i" "$DEV" | grep -q 'ID#'; then
			DEV=$(basename "$DEV")
			# Print the detected pair unless the list file already describes this device
			# (the original printed nothing at all when the list file was absent).
			if [ ! -f "$DEV_LST" ] || ! grep -qw "$DEV" "$DEV_LST"; then
				echo "$DEV $i"
			fi
			break
		fi
	done
done < <(/usr/bin/sg_map)
# Append the manually maintained entries.
if [ -f "$DEV_LST" ]; then
	cat "$DEV_LST"
fi


The script looks for devices: it queries the utilities and checks what it finds against the file /usr/local/etc/smartdev.lst. If a device is listed there, the values from the file are used, which temporarily works around the inability to test some controllers, for example 3ware. It prints the list as pairs of values: <device name> <connection type>
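
The precedence rule can be shown in isolation. This is a minimal sketch, assuming the list file holds one `<device> <type>` pair per line (which is what the script's final `cat` implies); the function name `merge_devices` and the sample entries are hypothetical.

```shell
#!/bin/bash
# merge_devices DETECTED_FILE OVERRIDE_FILE
# Prints detected "<device> <type>" pairs whose device is not named in the
# override file, followed by every line of the override file (if it exists).
merge_devices() {
	awk -v ov="$2" '
		BEGIN {
			# Remember the first column of every override line.
			while ((getline line < ov) > 0) {
				split(line, f, " ")
				override[f[1]] = 1
			}
		}
		!($1 in override)' "$1"
	if [ -f "$2" ]; then
		cat "$2"
	fi
}

tmp=$(mktemp -d)
printf 'sdb sat\nsdc sat\n' > "$tmp/detected"
printf 'sdc scsi\n' > "$tmp/smartdev.lst"
merge_devices "$tmp/detected" "$tmp/smartdev.lst"
rm -r "$tmp"
```

Here the detected type for sdc is dropped in favour of the one from the list file, while sdb passes through unchanged.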
Next we pass this list to another script (zabbix_smart_discovery.sh), which builds the JSON for Zabbix:
zabbix_smart_discovery.sh
#!/bin/bash
# Format the list of discovered devices as JSON for Zabbix

echo -e "{\n\t\"data\":["
LN=0
while IFS=' ' read -r -a attribute; do	
	if [[ $LN != 0 ]]; then
		echo ","
	fi
	echo -e "\t\t{ \"{#DEVNAME}\":\"${attribute[0]}\", \"{#DEVTYPE}\":\"${attribute[1]}\" }\c"
	LN=1
done  < /dev/stdin
echo -e "\n\t]\n}"


The output will look roughly like this:
smartctl.discovery
{
        "data":[
                { "{#DEVNAME}":"sg1", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg2", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg3", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg4", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg5", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg6", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg7", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sg8", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sdb", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sdc", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sdd", "{#DEVTYPE}":"sat" },
                { "{#DEVNAME}":"sde", "{#DEVTYPE}":"sat" }
        ]
}


{#DEVNAME} and {#DEVTYPE} are macros that Zabbix uses for substitution.
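
To make the substitution concrete, here is a small sketch that imitates what Zabbix does with an item prototype key for each discovered row. The function `expand_prototype` is hypothetical, and the prototype key is an assumption modelled on the Trapper keys shown further down; the actual prototypes live in the template.

```shell
#!/bin/bash
# expand_prototype PROTOTYPE DEVNAME DEVTYPE
# Replaces the LLD macros in an item prototype key with concrete values.
# Values containing "/" or "&" would need escaping for sed; plain device
# names such as sg1 or sdb do not.
expand_prototype() {
	printf '%s\n' "$1" | sed -e "s/{#DEVNAME}/$2/g" -e "s/{#DEVTYPE}/$3/g"
}

expand_prototype 'smartctl.smart[{#DEVNAME},5,raw_value]' sg1 sat
# prints: smartctl.smart[sg1,5,raw_value]
```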
The smart2zabbix.sh script produces the data for Zabbix Trapper:
smart2zabbix.sh
#!/bin/bash
# Format smartctl output as zabbix_sender input
# $1 is the path of the device being examined
# $2 is the device type, as used in the smartctl -d parameter
# $3 is the hostname of the monitored system; can be set to '-' when using the -s or -c parameter of zabbix_sender

DEV_PATH=$1
DEV_TYPE=$2
HOSTNAME=$3
HEADERS=(id attribute_name flag value worst thresh type updated when_failed raw_value)
DEVICE=$(basename $DEV_PATH)
SECTION=''
while IFS='' read -r line; do
	case $line in
		'=== START OF INFORMATION SECTION ===')
			SECTION='INFO'
			continue
		;;
		'=== START OF READ SMART DATA SECTION ===')
			SECTION='HEALF'
			continue
		;;
		'ID#'*)
			SECTION='ATTR'
			continue
		;;
	esac
	case $SECTION in
		'INFO')
			if [ -z "$line" ]; then
				SECTION=''
			else
				IFS=':' read -r -a attribute <<< "$line"
				PRE="$HOSTNAME smartctl.info[$DEVICE,"
				ATTR_V=$( echo ${attribute[1]} | sed -e 's/^[ \t]*//' )
				ATTR_N=$(echo ${attribute[0]} | tr '[:upper:]' '[:lower:]' | sed 's/ /_/' )
				case ${attribute[0]} in
					'Model Family')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'Device Model')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'Serial Number')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'Firmware Version')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'User Capacity')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'Sector Size' | 'Sector Sizes')
						ATTR_N=$(echo 'Sector Size' | tr '[:upper:]' '[:lower:]' | sed 's/ /_/' )
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
					'Rotation Rate')
						echo "${PRE}$ATTR_N] \"$ATTR_V\""
					;;
				esac
			fi
			
		;;
		'HEALF')
			if [ -z "$line" ]; then
				SECTION=''
			else
				IFS=':' read -r -a attribute <<< "$line"
				PRE="$HOSTNAME smartctl.smart[$DEVICE,"
				ATTR=$( echo ${attribute[1]} | sed -e 's/^[ \t]*//' )
				case ${attribute[0]} in
					'SMART overall-health self-assessment test result')
						echo "${PRE}test_result] \"$ATTR\""
					;;
				esac				
			fi
		;;
		'ATTR')
			if [ -z "$line" ]; then
				SECTION=''
			else
				read -r -a attribute <<< "$line"
				PRE="$HOSTNAME smartctl.smart[$DEVICE,"
				for i in "${!attribute[@]}";do
					if [[ $i == 0 ]]; then
						continue
					fi
					case ${attribute[$i]} in
						''|*[!0-9]*) ATTR="\"${attribute[$i]}\"" ;;
						*) ATTR="$(echo ${attribute[$i]} | sed 's/0*//')" ;;
					esac
					if [ -z "$ATTR" ]; then
						ATTR=0
					fi
					echo "${PRE}${attribute[0]},${HEADERS[$i]}] $ATTR"
				done				
			fi
		;;
	esac
done < /dev/stdin


The output will look roughly like this:
test.local smartctl.info[sg1,model_family] "Western Digital RE4 (SATA 6Gb/s)"
test.local smartctl.info[sg1,device_model] "WDC WD2000FYYZ-01UL1B1"
test.local smartctl.info[sg1,serial_number] "WD-WCC1P1175320"
test.local smartctl.info[sg1,firmware_version] "01.01K02"
test.local smartctl.info[sg1,user_capacity] "2 000 398 934 016 bytes [2,00 TB]"
test.local smartctl.info[sg1,sector_size] "512 bytes logical/physical"
test.local smartctl.info[sg1,rotation_rate] "7200 rpm"
test.local smartctl.smart[sg1,test_result] "PASSED"
test.local smartctl.smart[sg1,1,attribute_name] "Raw_Read_Error_Rate"
test.local smartctl.smart[sg1,1,flag] "0x002f"
test.local smartctl.smart[sg1,1,value] 200
test.local smartctl.smart[sg1,1,worst] 200
test.local smartctl.smart[sg1,1,thresh] 51
test.local smartctl.smart[sg1,1,type] "Pre-fail"
test.local smartctl.smart[sg1,1,updated] "Always"
test.local smartctl.smart[sg1,1,when_failed] "-"
test.local smartctl.smart[sg1,1,raw_value] 0
test.local smartctl.smart[sg1,3,attribute_name] "Spin_Up_Time"
test.local smartctl.smart[sg1,3,flag] "0x0027"
test.local smartctl.smart[sg1,3,value] 169
test.local smartctl.smart[sg1,3,worst] 169
test.local smartctl.smart[sg1,3,thresh] 21
test.local smartctl.smart[sg1,3,type] "Pre-fail"
test.local smartctl.smart[sg1,3,updated] "Always"
test.local smartctl.smart[sg1,3,when_failed] "-"
test.local smartctl.smart[sg1,3,raw_value] 6508
test.local smartctl.smart[sg1,4,attribute_name] "Start_Stop_Count"
test.local smartctl.smart[sg1,4,flag] "0x0032"
test.local smartctl.smart[sg1,4,value] 100
test.local smartctl.smart[sg1,4,worst] 100
test.local smartctl.smart[sg1,4,thresh] 0
test.local smartctl.smart[sg1,4,type] "Old_age"
test.local smartctl.smart[sg1,4,updated] "Always"
test.local smartctl.smart[sg1,4,when_failed] "-"
test.local smartctl.smart[sg1,4,raw_value] 36
test.local smartctl.smart[sg1,5,attribute_name] "Reallocated_Sector_Ct"
test.local smartctl.smart[sg1,5,flag] "0x0033"
test.local smartctl.smart[sg1,5,value] 200
test.local smartctl.smart[sg1,5,worst] 200
test.local smartctl.smart[sg1,5,thresh] 140
test.local smartctl.smart[sg1,5,type] "Pre-fail"
test.local smartctl.smart[sg1,5,updated] "Always"
test.local smartctl.smart[sg1,5,when_failed] "-"
test.local smartctl.smart[sg1,5,raw_value] 0
test.local smartctl.smart[sg1,7,attribute_name] "Seek_Error_Rate"
test.local smartctl.smart[sg1,7,flag] "0x002e"
test.local smartctl.smart[sg1,7,value] 200
test.local smartctl.smart[sg1,7,worst] 200
test.local smartctl.smart[sg1,7,thresh] 0
test.local smartctl.smart[sg1,7,type] "Old_age"
test.local smartctl.smart[sg1,7,updated] "Always"
test.local smartctl.smart[sg1,7,when_failed] "-"
test.local smartctl.smart[sg1,7,raw_value] 0
test.local smartctl.smart[sg1,9,attribute_name] "Power_On_Hours"
test.local smartctl.smart[sg1,9,flag] "0x0032"
test.local smartctl.smart[sg1,9,value] 79
test.local smartctl.smart[sg1,9,worst] 79
test.local smartctl.smart[sg1,9,thresh] 0
test.local smartctl.smart[sg1,9,type] "Old_age"
test.local smartctl.smart[sg1,9,updated] "Always"
test.local smartctl.smart[sg1,9,when_failed] "-"
test.local smartctl.smart[sg1,9,raw_value] 15927
test.local smartctl.smart[sg1,10,attribute_name] "Spin_Retry_Count"
test.local smartctl.smart[sg1,10,flag] "0x0032"
test.local smartctl.smart[sg1,10,value] 100
test.local smartctl.smart[sg1,10,worst] 253
test.local smartctl.smart[sg1,10,thresh] 0
test.local smartctl.smart[sg1,10,type] "Old_age"
test.local smartctl.smart[sg1,10,updated] "Always"
test.local smartctl.smart[sg1,10,when_failed] "-"
test.local smartctl.smart[sg1,10,raw_value] 0
test.local smartctl.smart[sg1,11,attribute_name] "Calibration_Retry_Count"
test.local smartctl.smart[sg1,11,flag] "0x0032"
test.local smartctl.smart[sg1,11,value] 100
test.local smartctl.smart[sg1,11,worst] 253
test.local smartctl.smart[sg1,11,thresh] 0
test.local smartctl.smart[sg1,11,type] "Old_age"
test.local smartctl.smart[sg1,11,updated] "Always"
test.local smartctl.smart[sg1,11,when_failed] "-"
test.local smartctl.smart[sg1,11,raw_value] 0
test.local smartctl.smart[sg1,12,attribute_name] "Power_Cycle_Count"
test.local smartctl.smart[sg1,12,flag] "0x0032"
test.local smartctl.smart[sg1,12,value] 100
test.local smartctl.smart[sg1,12,worst] 100
test.local smartctl.smart[sg1,12,thresh] 0
test.local smartctl.smart[sg1,12,type] "Old_age"
test.local smartctl.smart[sg1,12,updated] "Always"
test.local smartctl.smart[sg1,12,when_failed] "-"
test.local smartctl.smart[sg1,12,raw_value] 30
test.local smartctl.smart[sg1,183,attribute_name] "Runtime_Bad_Block"
test.local smartctl.smart[sg1,183,flag] "0x0032"
test.local smartctl.smart[sg1,183,value] 100
test.local smartctl.smart[sg1,183,worst] 100
test.local smartctl.smart[sg1,183,thresh] 0
test.local smartctl.smart[sg1,183,type] "Old_age"
test.local smartctl.smart[sg1,183,updated] "Always"
test.local smartctl.smart[sg1,183,when_failed] "-"
test.local smartctl.smart[sg1,183,raw_value] 0
test.local smartctl.smart[sg1,192,attribute_name] "Power-Off_Retract_Count"
test.local smartctl.smart[sg1,192,flag] "0x0032"
test.local smartctl.smart[sg1,192,value] 200
test.local smartctl.smart[sg1,192,worst] 200
test.local smartctl.smart[sg1,192,thresh] 0
test.local smartctl.smart[sg1,192,type] "Old_age"
test.local smartctl.smart[sg1,192,updated] "Always"
test.local smartctl.smart[sg1,192,when_failed] "-"
test.local smartctl.smart[sg1,192,raw_value] 29
test.local smartctl.smart[sg1,193,attribute_name] "Load_Cycle_Count"
test.local smartctl.smart[sg1,193,flag] "0x0032"
test.local smartctl.smart[sg1,193,value] 200
test.local smartctl.smart[sg1,193,worst] 200
test.local smartctl.smart[sg1,193,thresh] 0
test.local smartctl.smart[sg1,193,type] "Old_age"
test.local smartctl.smart[sg1,193,updated] "Always"
test.local smartctl.smart[sg1,193,when_failed] "-"
test.local smartctl.smart[sg1,193,raw_value] 6
test.local smartctl.smart[sg1,194,attribute_name] "Temperature_Celsius"
test.local smartctl.smart[sg1,194,flag] "0x0022"
test.local smartctl.smart[sg1,194,value] 125
test.local smartctl.smart[sg1,194,worst] 96
test.local smartctl.smart[sg1,194,thresh] 0
test.local smartctl.smart[sg1,194,type] "Old_age"
test.local smartctl.smart[sg1,194,updated] "Always"
test.local smartctl.smart[sg1,194,when_failed] "-"
test.local smartctl.smart[sg1,194,raw_value] 25
test.local smartctl.smart[sg1,196,attribute_name] "Reallocated_Event_Count"
test.local smartctl.smart[sg1,196,flag] "0x0032"
test.local smartctl.smart[sg1,196,value] 200
test.local smartctl.smart[sg1,196,worst] 200
test.local smartctl.smart[sg1,196,thresh] 0
test.local smartctl.smart[sg1,196,type] "Old_age"
test.local smartctl.smart[sg1,196,updated] "Always"
test.local smartctl.smart[sg1,196,when_failed] "-"
test.local smartctl.smart[sg1,196,raw_value] 0
test.local smartctl.smart[sg1,197,attribute_name] "Current_Pending_Sector"
test.local smartctl.smart[sg1,197,flag] "0x0032"
test.local smartctl.smart[sg1,197,value] 200
test.local smartctl.smart[sg1,197,worst] 200
test.local smartctl.smart[sg1,197,thresh] 0
test.local smartctl.smart[sg1,197,type] "Old_age"
test.local smartctl.smart[sg1,197,updated] "Always"
test.local smartctl.smart[sg1,197,when_failed] "-"
test.local smartctl.smart[sg1,197,raw_value] 0
test.local smartctl.smart[sg1,198,attribute_name] "Offline_Uncorrectable"
test.local smartctl.smart[sg1,198,flag] "0x0030"
test.local smartctl.smart[sg1,198,value] 200
test.local smartctl.smart[sg1,198,worst] 200
test.local smartctl.smart[sg1,198,thresh] 0
test.local smartctl.smart[sg1,198,type] "Old_age"
test.local smartctl.smart[sg1,198,updated] "Offline"
test.local smartctl.smart[sg1,198,when_failed] "-"
test.local smartctl.smart[sg1,198,raw_value] 0
test.local smartctl.smart[sg1,199,attribute_name] "UDMA_CRC_Error_Count"
test.local smartctl.smart[sg1,199,flag] "0x0032"
test.local smartctl.smart[sg1,199,value] 200
test.local smartctl.smart[sg1,199,worst] 200
test.local smartctl.smart[sg1,199,thresh] 0
test.local smartctl.smart[sg1,199,type] "Old_age"
test.local smartctl.smart[sg1,199,updated] "Always"
test.local smartctl.smart[sg1,199,when_failed] "-"
test.local smartctl.smart[sg1,199,raw_value] 0
test.local smartctl.smart[sg1,200,attribute_name] "Multi_Zone_Error_Rate"
test.local smartctl.smart[sg1,200,flag] "0x0008"
test.local smartctl.smart[sg1,200,value] 200
test.local smartctl.smart[sg1,200,worst] 200
test.local smartctl.smart[sg1,200,thresh] 0
test.local smartctl.smart[sg1,200,type] "Old_age"
test.local smartctl.smart[sg1,200,updated] "Offline"
test.local smartctl.smart[sg1,200,when_failed] "-"
test.local smartctl.smart[sg1,200,raw_value] 0


Then we simply send all of this to Zabbix Trapper:
zabbix_smartctl.sh
#!/bin/bash
# Sending collected data to the zabbix server
# Get device list and type from STDIN, produced by smartdiscovery.sh

PREFIX='/usr/local/bin'
AGENT_CFG='/etc/zabbix/zabbix_agentd.conf'
while IFS=' ' read -r -a attr; do
	smartctl -A -H -i -d ${attr[1]} /dev/${attr[0]} | $PREFIX/smart2zabbix.sh /dev/${attr[0]} ${attr[1]} - | /usr/bin/zabbix_sender -c $AGENT_CFG -i -
done < /dev/stdin
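
When testing by hand, it can be useful to see how many of the sent values the server rejected, since values for keys that have no matching item on the server are normally counted as failed. The sketch below parses that counter from zabbix_sender's summary output. The function name `failed_count` is hypothetical, and the summary line format is an assumption that differs between Zabbix versions, so the pattern accepts both "failed: N" and "Failed N".

```shell
#!/bin/bash
# failed_count
# Reads zabbix_sender output on stdin and prints the sum of the numbers that
# follow "failed" in the server summary lines (0 if there are none).
failed_count() {
	sed -n 's/.*[Ff]ailed:\{0,1\} *\([0-9][0-9]*\).*/\1/p' |
		awk '{ s += $1 } END { print s + 0 }'
}

# Sample summary line in the style of recent zabbix_sender versions (assumed format).
echo 'info from server: "processed: 150; failed: 3; total: 153; seconds spent: 0.001"' | failed_count
```

A non-zero result right after importing the template is expected if only some of the attributes have items on the server.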


All that remains is to allow sudo for some of the scripts, add a cron job, and import the template on the Zabbix Server.
The complete set is available on the official Zabbix Share portal, where it is published for everyone: S.M.A.R.T. monitoring with smartmontools (LLD, Trapper)

The main advantage over other similar templates and scripts is that all attributes are sent, so you can start using any of them at will without changing the scripts, just by adding the corresponding items on the server.

This article is a translation of the original post at habrahabr.ru/post/274391/