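A few days ago one of our own employees knocked out the company mail system so that it could neither send nor receive. He had added alerting to a Java program: whenever an alert fired, the company mail server sent an alert email to his 163 mailbox. The program was sloppily written and fell into an infinite loop, so it kept mailing the 163 address nonstop. Once 163's anti-spam filtering kicked in, none of our mail could get out and everything piled up in the queue. By the time anyone noticed, more than 150,000 messages were waiting to be sent. The upshot: the boss was furious that the mail system could break without anyone knowing, and asked why it was not being monitored by Nagios.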

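The boss had spoken, so off I went to get it done. I first thought of writing my own plugin, but why bother when ready-made ones exist? I went to exchange.nagios.org to look for plugins that monitor the Postfix queue. There are several; see http://exchange.nagios.org/index.php?option=com_mtree&task=search&Itemid=74&searchword=postfix for the list. A quick look showed they are all much alike. While searching I also came across another script on the web, which monitors both the total size of the queued mail and the number of queued messages (put bluntly, all of these scripts just build on mailq and postqueue -p):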
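To make the "they just build on mailq" remark concrete, here is a minimal sketch of pulling the two numbers out of the summary line that mailq prints last. The function name `parse_queue_summary` is my own and purely illustrative, and the sketch assumes the summary has the usual Postfix form "-- 23 Kbytes in 2 Requests.".

```shell
#!/bin/bash
# Hypothetical helper: parse the last line of `mailq` / `postqueue -p`.
# Prints "<requests> <kbytes>"; an empty queue yields "0 0".
parse_queue_summary() {
    local summary=$1
    if [ "$summary" = "Mail queue is empty" ]; then
        echo "0 0"
        return 0
    fi
    # Assumed form: "-- 23 Kbytes in 2 Requests."
    # Field 2 is the size in KB, field 5 is the number of requests.
    local kbytes requests
    kbytes=$(echo "$summary" | cut -d ' ' -f 2)
    requests=$(echo "$summary" | cut -d ' ' -f 5)
    echo "$requests $kbytes"
}
```

On a live mail server it could be fed with `parse_queue_summary "$(mailq | tail -n 1)"`.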

#!/bin/bash
# Nagios plugin: checks the Postfix mail queue (total requests and size,
# plus the number of files in each queue directory).

# Nagios plugin exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Default options
postfix_dir=/var/spool/postfix
warning_active=100
critical_active=2000
warning_deferred=500
critical_deferred=1000
warning_other=1
critical_other=100

function usage {
    echo "$0 [--dir|-d postfix_dir] [--wa|-w warning_active] [--ca|-c critical_active] [--wd warning_deferred] [--cd critical_deferred] [--wo warning_other] [--co critical_other]" 1>&2
}

# Every option has a default, so running without arguments is valid.
while test -n "$1"; do
    case "$1" in
        --dir|-d ) postfix_dir=$2
                   shift ;;
        --wa|-w )  warning_active=$2
                   shift ;;
        --ca|-c )  critical_active=$2
                   shift ;;
        --wd )     warning_deferred=$2
                   shift ;;
        --cd )     critical_deferred=$2
                   shift ;;
        --wo )     warning_other=$2
                   shift ;;
        --co )     critical_other=$2
                   shift ;;
        * )        echo "Wrong arguments!" 1>&2
                   usage
                   exit $STATE_UNKNOWN ;;
    esac
    shift
done

# The last line of mailq is either "Mail queue is empty"
# or a summary such as "-- 23 Kbytes in 2 Requests."
queue=$(/usr/bin/mailq | tail -n 1)

if [ "$queue" == "Mail queue is empty" ]; then
    # Queue empty = OK
    perfdata="'req'=0;;; 'size'=0KB;;; 'active'=0;$warning_active;$critical_active; 'bounce'=0;$warning_other;$critical_other; 'corrupt'=0;$warning_other;$critical_other; 'deferred'=0;$warning_deferred;$critical_deferred; 'maildrop'=0;$warning_other;$critical_other; "
    output="$queue"
    echo "OK - ${output} | ${perfdata}"
    exit $STATE_OK
else
    queue_req=$(echo $queue | cut -d ' ' -f 5)
    queue_size=$(echo $queue | cut -d ' ' -f 2)    # in KB
    # Count the message files in each queue directory
    queue_active=$(find "$postfix_dir/active" -type f | wc -l)
    queue_bounce=$(find "$postfix_dir/bounce" -type f | wc -l)
    queue_corrupt=$(find "$postfix_dir/corrupt" -type f | wc -l)
    queue_deferred=$(find "$postfix_dir/deferred" -type f | wc -l)
    queue_maildrop=$(find "$postfix_dir/maildrop" -type f | wc -l)
    perfdata="'req'=$queue_req;;; 'size'=${queue_size}KB;;; 'active'=$queue_active;$warning_active;$critical_active; 'bounce'=$queue_bounce;$warning_other;$critical_other; 'corrupt'=$queue_corrupt;$warning_other;$critical_other; 'deferred'=$queue_deferred;$warning_deferred;$critical_deferred; 'maildrop'=$queue_maildrop;$warning_other;$critical_other; "
fi

returnCrit=0
returnWarn=0
errorString=""

# Check critical and warning state for each queue
if [ "$queue_active" -ge "$critical_active" ]; then
    returnCrit=1
    errorString="$errorString - CRIT $queue_active >= $critical_active actives"
elif [ "$queue_active" -ge "$warning_active" ]; then
    returnWarn=1
    errorString="$errorString - WARN $queue_active >= $warning_active actives"
fi
if [ "$queue_bounce" -ge "$critical_other" ]; then
    returnCrit=1
    errorString="$errorString - CRIT $queue_bounce >= $critical_other bounce"
elif [ "$queue_bounce" -ge "$warning_other" ]; then
    returnWarn=1
    errorString="$errorString - WARN $queue_bounce >= $warning_other bounce"
fi
if [ "$queue_corrupt" -ge "$critical_other" ]; then
    returnCrit=1
    errorString="$errorString - CRIT $queue_corrupt >= $critical_other corrupt"
elif [ "$queue_corrupt" -ge "$warning_other" ]; then
    returnWarn=1
    errorString="$errorString - WARN $queue_corrupt >= $warning_other corrupt"
fi
if [ "$queue_deferred" -ge "$critical_deferred" ]; then
    returnCrit=1
    errorString="$errorString - CRIT $queue_deferred >= $critical_deferred deferred"
elif [ "$queue_deferred" -ge "$warning_deferred" ]; then
    returnWarn=1
    errorString="$errorString - WARN $queue_deferred >= $warning_deferred deferred"
fi
if [ "$queue_maildrop" -ge "$critical_other" ]; then
    returnCrit=1
    errorString="$errorString - CRIT $queue_maildrop >= $critical_other maildrop"
elif [ "$queue_maildrop" -ge "$warning_other" ]; then
    returnWarn=1
    errorString="$errorString - WARN $queue_maildrop >= $warning_other maildrop"
fi

# Critical takes precedence over warning
output="$queue_req request(s) ($queue_size kB)"
if [ "$returnCrit" -eq 0 ] && [ "$returnWarn" -eq 0 ]; then
    echo "OK - ${output} | ${perfdata}"
    returnCode=$STATE_OK
elif [ "$returnCrit" -eq 0 ] && [ "$returnWarn" -eq 1 ]; then
    echo "WARNING - ${output} ${errorString} | ${perfdata}"
    returnCode=$STATE_WARNING
else
    echo "CRITICAL - ${output} ${errorString} | ${perfdata}"
    returnCode=$STATE_CRITICAL
fi

exit $returnCode

Note: the script had problems when I first got it. I have already fixed the faulty checks, so you can take it and use it as-is.
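The five queue checks in the script all follow the same pattern: compare a count against a critical threshold first, then against a warning threshold. As a rough sketch of that logic in isolation, it could be folded into a single helper. `check_queue` is a hypothetical name of mine and is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: classify one queue count against its thresholds.
# Prints OK, WARN or CRIT and returns 0, 1 or 2 (the Nagios state codes).
check_queue() {
    local count=$1 warn=$2 crit=$3
    if [ "$count" -ge "$crit" ]; then
        # The critical threshold is tested first so it wins over warning
        echo "CRIT"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARN"
        return 1
    fi
    echo "OK"
    return 0
}
```

With the script's default deferred thresholds, `check_queue 600 500 1000` would report WARN, while a count of 1000 or more would report CRIT.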

Next, edit nrpe.cfg on the mail server and add the following command definition:

command[check_postque]=/App/nagios/libexec/check_postque -w 50 -c 100 -W 3000000 -C 5000000 -p postfix
# Queue length: warning above 50, critical above 100. Total mail size: warning at 300M, critical at 500M.

Note: my Nagios is installed under /App/nagios. If yours is installed elsewhere, adjust the path in the command above accordingly, or it will not work.
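Before restarting NRPE it can be worth confirming that the command line really ended up in the config file. The following is only a sketch; `nrpe_has_command` is a name I made up, and it simply greps for a `command[name]=` line.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given nrpe.cfg defines command[<name>].
nrpe_has_command() {
    local cfg=$1 name=$2
    # Match e.g. "command[check_postque]=..." with optional leading whitespace;
    # lines commented out with '#' do not match.
    grep -Eq "^[[:space:]]*command\[${name}\][[:space:]]*=" "$cfg"
}
```

For the setup described here it would be called as `nrpe_has_command /App/nagios/etc/nrpe.cfg check_postque`.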

Then kill the nrpe process and start it again with /App/nagios/bin/nrpe -c /App/nagios/etc/nrpe.cfg -d. On the central Nagios server you also need to add a matching service definition:

define service{
        use                             local-service,srv-pnp
        host_name                       XXX.XX.XX.XX
        service_description             check_postque
        check_command                   check_nrpe!check_postque
        }

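I replaced my own IP with XXX above; put in your own. The use line also has to match the templates in your existing Nagios configuration. When that is done, run /App/nagios/bin/nagios -v /App/nagios/etc/nagios.cfg on the central server to check the configuration files for errors. If there are none, go to the /etc/init.d directory and reload the configuration with ./nagios reload.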
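The check-then-reload step can be chained so that a broken configuration is never reloaded. This is a small sketch under my own naming (`reload_if_valid` is hypothetical); the paths are passed in as arguments rather than hard-coded.

```shell
#!/bin/bash
# Hypothetical helper: reload Nagios only if the config check passes.
#   $1 = nagios binary, $2 = main config file, $3 = init script
reload_if_valid() {
    local nagios_bin=$1 nagios_cfg=$2 init_script=$3
    # `nagios -v <cfg>` exits non-zero when the configuration has errors
    if "$nagios_bin" -v "$nagios_cfg" >/dev/null; then
        "$init_script" reload
    else
        echo "config check failed, not reloading" >&2
        return 1
    fi
}
```

With the paths used in this post, the call would be `reload_if_valid /App/nagios/bin/nagios /App/nagios/etc/nagios.cfg /etc/init.d/nagios`.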

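If something goes wrong, you can also run the following on the central server:

./check_nrpe -H XXX.XX.XX.XX -c check_postque

and see whether it produces any output, which tells you whether the NRPE communication itself is working.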

Since no mail was being sent on my side, the result I got was OK : Mail queue is empty.
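Besides the text, a plugin also reports its state through the exit code, using the same values as the STATE_* variables in the script (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A tiny sketch for turning that code back into a name when testing by hand; `state_name` is a hypothetical helper of mine.

```shell
#!/bin/bash
# Hypothetical helper: map a Nagios plugin exit code to its state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;   # 3 and anything unexpected
    esac
}
```

For example, run `./check_nrpe -H XXX.XX.XX.XX -c check_postque; state_name $?` to see both the plugin output and the state Nagios would record.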