Developers Club geek daily blog

1 year, 4 months ago
There are plenty of operating system monitors, but a particular challenge is catching the exact moment a problem appears and identifying the cause of high load or the source of a performance problem. I call this hunting for resource "rodents".

For this purpose I wrote a simple script, ratskill.sh, which you can adapt to your own systems and tasks.

The principle is simple: the script runs at a set interval and checks the load average (you can use other control parameters). If the load exceeds the configured limit, the script runs a set of diagnostic commands, builds a report, and sends it to the email address you specify.
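The article does not say how the script is scheduled. One common option is cron, sketched below. The install path /usr/local/sbin/ratskill.sh, the file name /etc/cron.d/ratskill, the 5-minute interval, and the helper name cron_entry are all assumptions for illustration.

```shell
#!/bin/bash
# Sketch: build a system crontab (cron.d style) entry for the monitoring script.
# cron_entry is a hypothetical helper; the path and interval are assumptions.

# Print an entry that runs SCRIPT as root every MINUTES minutes.
cron_entry() {
  local minutes="$1" script="$2"
  printf '*/%s * * * * root %s\n' "$minutes" "$script"
}

# Show the entry for a 5-minute interval.
cron_entry 5 /usr/local/sbin/ratskill.sh

# To install it (as root), redirect the output into a cron.d file:
#   cron_entry 5 /usr/local/sbin/ratskill.sh > /etc/cron.d/ratskill
```

The interval is a trade-off: a shorter one catches short load spikes but runs the check more often. The report itself takes some tens of seconds to collect because sar, iostat and dstat each sample several times.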

Example script for an OpenVZ server


#!/bin/bash

# avoid problems with localized (Cyrillic) command output
export LC_ALL=C

# your load average limit may differ depending on the number of cores and the type of workload
# for example, I can recommend 75-200 for an OpenVZ server and 15-45 for a KVM hypervisor
LALIMIT="80"

# report recipient
EMAIL="alerts@domain.tld"

# message subject
SUBJECT="WARNING-High load notification"

# Get the 5-minute load average, truncated to an integer
# (field 2 of /proc/loadavg; field 1 is the 1-minute average)
F5M="$(awk '{print int($2)}' /proc/loadavg)"

# Compare with the threshold: RESULT is 1 if the limit is exceeded, otherwise 0
RESULT=$(( F5M > LALIMIT ))


# If the limit is not exceeded, stop and exit.
# If it is exceeded, create and send a report, but do not send it again until
# the load drops below the limit. To do this, create /tmp/ratkill.flag when the
# limit is exceeded and delete it when the load drops, so that monitoring resumes.
#
if (( "$RESULT" == "1" )); then
  if [ -f /tmp/ratkill.flag ]; then
    exit 0
  fi
  touch /tmp/ratkill.flag
else
  if [ -f /tmp/ratkill.flag ]; then
    rm -f /tmp/ratkill.flag
  fi
  exit 0
fi

# Create a temporary file for the report
TEMPFILE="$(mktemp)"

# Write the report header
echo "Load average crossed allowed limit $LALIMIT." >> "$TEMPFILE"
echo "Hostname: $(hostname)" >> "$TEMPFILE"
echo "Local Date & Time: $(date)" >> "$TEMPFILE"

# Memory usage
echo "Memory-----------------------------------" >> "$TEMPFILE"
free -m >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
vmstat -s -Sm >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Number of context switches
echo "context switches:" >> "$TEMPFILE"
sar -w 1 5 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Most active "guests" (sorted by load average, highest last)
echo "Top loaded containers:" >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
/usr/sbin/vzlist \
-o veid,ip,hostname,numproc,numfile,numflock,numtcpsock,physpages,laverage \
-s laverage | tail -20 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Number of network connections per guest
echo "Top containers by net. connections count:" >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
/usr/sbin/vzlist \
-o veid,ip,hostname,numproc,numtcpsock -s numtcpsock | tail -20 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Total number of tracked network connections
echo "conntrack count" >> "$TEMPFILE"
wc -l /proc/net/nf_conntrack >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Disk utilization
echo "I/O statistic:" >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
iostat -x 2 5 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Snapshot of top output (-n 1 makes top exit after a single iteration)
echo "System snapshot from top:" >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
top -b -n 1 | head -30 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# Processes with the highest I/O and CPU load
echo "Report from dstat:" >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"
dstat --net --disk --disk-util --sys --load --proc --top-io-adv \
--top-cpu-adv --nocolor 5 5 >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

# RAID array report (use the command that matches your controller)
echo "RAID Logical device information" >> "$TEMPFILE"
#/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aAll >> "$TEMPFILE"
/usr/local/sbin/arcconf GETCONFIG  1 ld >> "$TEMPFILE"
echo "-------------------------------------------" >> "$TEMPFILE"

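# Send the report by email
# Note: -a attaches a file in heirloom mailx / s-nail; other mail
# implementations differ (GNU mailutils uses -A for attachments)
cat "$TEMPFILE" > /tmp/load.txt
echo "${SUBJECT}-${F5M}" | mail -a /tmp/load.txt -s "$(hostname -s)-${SUBJECT}-${F5M}" "$EMAIL"
rm -f "$TEMPFILE"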
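
The comment on LALIMIT notes that a sensible limit depends on the number of cores. As a rough aid when choosing it, the sketch below divides a load average by the core count. The helper name load_per_core is hypothetical, and reading /proc/loadavg and calling nproc assume a Linux host with coreutils.

```shell
#!/bin/bash
# Sketch: express a load average relative to the number of CPU cores.
# load_per_core is a hypothetical helper, not part of the original script.

# Divide LOAD by CORES and print the result with two decimals.
load_per_core() {
  local load="$1" cores="$2"
  awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example with the current host values, if they are available.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r one five fifteen rest < /proc/loadavg
  echo "5-minute load per core: $(load_per_core "$five" "$(nproc)")"
fi
```

Watching this value for a few days of normal operation gives a baseline from which a limit can be picked.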

To tie the load to a specific guest, you can also add analysis of process PIDs with vzpid and much more, if you need it.
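
A minimal sketch of the vzpid idea, assuming procps ps and the OpenVZ vzpid utility are available on the hardware node. The function names and the default of 10 processes are hypothetical, and the exact vzpid output format may vary between versions.

```shell
#!/bin/bash
# Sketch: map the busiest host processes to the containers that own them.
# top_cpu_pids and pids_to_containers are hypothetical helper names.

# Print the PIDs of the COUNT processes using the most CPU, one per line.
top_cpu_pids() {
  local count="${1:-10}"
  ps -eo pid= --sort=-pcpu | head -n "$count" | awk '{print $1}'
}

# For each PID read from stdin, ask vzpid which container it belongs to.
# Errors (for example, a process that has already exited) are ignored.
pids_to_containers() {
  local pid
  while read -r pid; do
    vzpid "$pid" 2>/dev/null
  done
}

# Usage inside the report script (as root on the hardware node):
#   echo "Containers of top CPU processes:" >> "$TEMPFILE"
#   top_cpu_pids 10 | pids_to_containers >> "$TEMPFILE"
```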

For the script to work you also need to install the sysstat and dstat packages. Use the latest version of dstat available for your distribution, otherwise you will not get the required output.
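
Before scheduling the script it can help to confirm that the tools it calls are present. The sketch below is one way to do that. The helper name missing_tools and the list of tools checked are assumptions based on the commands used in the script above.

```shell
#!/bin/bash
# Sketch: report which required commands are not found in PATH.
# missing_tools is a hypothetical helper name.

# Print the name of every argument that is not an available command.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# sar and iostat come from sysstat; dstat and mail are separate packages.
missing="$(missing_tools sar iostat dstat mail)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
fi
```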

The resulting report should look something like this:

image



This article is a translation of the original post at habrahabr.ru/post/274633/