Custom Nagios Extensions

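How it works: the monitoring server uses check_nrpe to tell the monitored host which check to run. The monitored host runs the check locally and sends the result back to the monitoring server.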

How can Nagios be extended to implement custom monitoring?

Every valid Nagios check run through a plugin produces a numeric exit status. The possible states are:

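0 -- OK: everything is fine and the check completed successfully.
1 -- WARNING: the resource is in a warning state; something is not quite right.
2 -- CRITICAL: the resource is in a critical state, for example because the host is down or the service is not running.
3 -- UNKNOWN: this does not necessarily indicate a problem, only that the check could not return a clear, definite state.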

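A plugin can also print a text message. By default, this message is shown in the Nagios web interface and in Nagios email alerts. The message is not strictly required, but most available plugins provide one, because it tells the user what went wrong without forcing them to consult the documentation.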
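The exit-status convention described above can be sketched in a few lines of Python 3. This is only an illustration, not part of the original example; the names STATUS and plugin_result are made up for this sketch.

```python
# Exit codes Nagios expects from a plugin.
STATUS = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def plugin_result(state, message):
    """Return (exit_code, output_line) for a given state name."""
    if state not in STATUS:
        # Anything that cannot be classified is reported as UNKNOWN.
        state, message = "UNKNOWN", "unrecognized state requested"
    return STATUS[state], "%s - %s" % (state, message)
```

A real plugin would print the returned line and pass the returned code to sys.exit().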

The following example comes from the web; I modified it slightly and verified that it works:

On the monitored host:

vim /usr/lib64/nagios/plugins/check_file

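#!/bin/bash
# Check that the file given as $1 exists, is readable, and is a regular file.
filename=$1
if [ -z "$filename" ]; then
        # Without an argument the tests below would silently report success.
        echo "UNKNOWN status - usage: check_file <path>"
        exit 3
elif [ ! -e "$filename" ]; then
        echo "CRITICAL status - file $filename doesn't exist"
        exit 2
elif [ ! -r "$filename" ]; then
        echo "WARNING status - file $filename is not readable"
        exit 1
elif [ ! -f "$filename" ]; then
        echo "UNKNOWN status - file $filename is not a file"
        exit 3
else
        echo "OK status - file $filename is OK"
        exit 0
fi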


vim /etc/nagios/nrpe.cfg

command[check_mytestfile]=/usr/lib64/nagios/plugins/check_file  /tmp/jjtest
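Before wiring the plugin into NRPE, it helps to run it by hand and confirm that its exit code and message are what Nagios will see. The following Python 3 sketch does that; run_plugin and STATE_NAMES are hypothetical names, and treating any exit code outside 0-3 as UNKNOWN is an assumption of this sketch.

```python
import subprocess

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run a plugin command and return (state_name, first_line_of_output)."""
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The plugin is missing, not executable, or took too long.
        return "UNKNOWN", "could not run plugin: %s" % exc
    output = proc.stdout.decode("utf-8", "replace").strip()
    first_line = output.splitlines()[0] if output else ""
    # Exit codes outside 0-3 are mapped to UNKNOWN in this sketch.
    return STATE_NAMES.get(proc.returncode, "UNKNOWN"), first_line
```

For the example above, the call would be run_plugin(["/usr/lib64/nagios/plugins/check_file", "/tmp/jjtest"]).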

On the monitoring server:

vim /usr/local/nagios/etc/objects/commands.cfg

define command{
        command_name check_myfile
        command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

vim /usr/local/nagios/etc/objects/service.cfg

define service{
        use                             linux-service
        host_name                       myhost
        service_description             jjtest
        check_interval                  1            ; check interval (the default is one active check every 5 minutes)
        retry_interval                  1            ; retry interval (the default is 1 minute)
        check_command                   check_myfile!check_mytestfile
        }
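The check_command value check_myfile!check_mytestfile in the service definition means "run the command check_myfile with $ARG1$ set to check_mytestfile". The Python 3 sketch below mimics that macro expansion to show the command line Nagios ends up running. It is a simplification (real Nagios also handles escaped "!" characters and many more macros), and the $USER1$ path and the host address used in the usage note are assumptions.

```python
def expand_check_command(check_command, command_line, macros):
    """Split 'name!arg1!arg2' and substitute $ARGn$ plus the given macros."""
    parts = check_command.split("!")
    command_name, args = parts[0], parts[1:]
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values["ARG%d" % position] = arg
    expanded = command_line
    for name, value in values.items():
        expanded = expanded.replace("$%s$" % name, value)
    return command_name, expanded
```

With macros {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"} (both values assumed for illustration), the command line from the definition above expands to a check_nrpe call with -H 192.0.2.10 -c check_mytestfile, and NRPE on the monitored host then looks up check_mytestfile in nrpe.cfg.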

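Parameters in the templates.cfg template file (these also include check_interval and retry_interval from above):

max_check_attempts #Defines how many times Nagios runs the check command when the result is not OK. With a value of 1, Nagios alerts immediately without retrying at all.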

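check_period                    #Names the time period during which active checks of this host are performed. Time periods are defined in another configuration file and are referenced here by name.
contact_groups                  #A list of contact groups, referenced by name. Separate multiple groups with commas.
notification_interval           #Defines how long Nagios waits before re-sending a notification to the contact groups while the host or service is still in a problem state (for a host: down or unreachable). The default is 2 hours.
notification_period             #Names the time period during which notifications may be sent to the contact groups. The time period is referenced by name.
notification_options            #Determines which events trigger a notification. For hosts the options are: d = notify on DOWN states, u = notify on UNREACHABLE states, r = notify on recoveries, f = notify when the host starts and stops flapping. With n, no notifications are ever sent. Services use w, u, c, r, and f instead (WARNING, UNKNOWN, CRITICAL, recovery, flapping).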
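The effect of max_check_attempts is easiest to see with a small simulation. The Python 3 sketch below is a deliberately simplified model: it only counts consecutive non-OK results and ignores flapping, notification periods, and notification_options. The function name first_alert_index is made up for this sketch.

```python
def first_alert_index(results, max_check_attempts):
    """Return the index of the check that triggers the first alert, or None."""
    consecutive_failures = 0
    for index, result in enumerate(results):
        if result == "OK":
            # Any OK result resets the attempt counter.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_check_attempts:
            return index
    return None
```

With max_check_attempts set to 3, three consecutive failures are needed before an alert; with 1, the very first failure alerts.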

-------------------------------------------------------------------------------------------------------------------

yum install nagios-plugins-*    (installs the check_* plugin scripts under /usr/lib64/nagios/plugins/)

yum install nrpe

Detecting a hung Tomcat process

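Under Tomcat's webapps directory, create a directory named jiankong (the name is arbitrary) and place a JSP file in it (test.jsp in the command below). Then edit commands.cfg and add: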

#tomcat1 set
define command{
        command_name check_tomcat_8028
        command_line /usr/local/nagios/libexec/check_http -I $HOSTADDRESS$ -p 8028 -u /jiankong/test.jsp -e 200
        }

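If Tomcat listens on several ports, define one command per port and change only the port number; the one above is for port 8028. Then add the corresponding service in servers.cfg.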
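What the check_http call above tests can be approximated in Python 3: request the page with a timeout and treat anything other than the expected status as CRITICAL. This is a sketch, not a replacement for check_http. The function name is hypothetical and, unlike check_http, urllib follows redirects. The timeout is what catches a hung Tomcat: the process is still there, but the request never completes.

```python
import urllib.error
import urllib.request

# Ignore any proxy set in the environment so the request goes straight to the host.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_http(url, expected_status=200, timeout=10):
    """Return (exit_code, message) for a GET request against url."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            status = response.getcode()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, or timeout.
        return 2, "CRITICAL - no response from %s: %s" % (url, exc)
    if status == expected_status:
        return 0, "OK - %s returned %d" % (url, status)
    return 2, "CRITICAL - %s returned %d, expected %d" % (
        url, status, expected_status)
```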

-----------------------------------------------------------------------------------------------------

For log monitoring, you can use check_log, which ships with the Nagios plugins. See: http://sery.blog.51cto.com/10037/287923/

Appendix:

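The script below scans a given log file for a keyword. You can set thresholds, and it raises an alert when the number of matches reaches them. (It is written in Python 2 and was taken from the web.)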

vim check_mylog

#!/usr/bin/python
# -*- coding: utf-8 -*-
import mmap
import os
import sys
import getopt

def usage():
    print """
    check_log is a Nagios monitor logs Script

    Usage:

    check_log [-h|--help][-l|--log][-s|--string][-w|--warning][-c|--critical]

    Options:
           --help|-h)
                 check_log help.
           --log|-l)
                 sets log file path.
           --string|-s)
                 sets monitor Keywords.
           --warning|-w)
                 sets Keywords quantity.Default is: off
           --critical|-c)
                 sets Keywords quantity.Default is: off
     example:
            ./check_log -l /var/log/nginx.log -s "502 Bad Gateway" -w 5 -c 10 """
    sys.exit(3)

try:
   options,args = getopt.getopt(sys.argv[1:],"hl:s:w:c:",["help","log=","string=","warning=","critical="])
except getopt.GetoptError:
   usage()

for n,v in options:
    if n in ("-h","--help"):
       usage()
    if n in ("-l","--log"):
       log = v
    if n in ("-s","--string"):
       string = v
    if n in ("-w","--warning"):
       warning = v
    if n in ("-c","--critical"):
       critical = v

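if 'log' in dir() and 'string' in dir():
   try:
      # Open read-only; mmap maps the whole file so it can be searched in one pass.
      file = open(log,"r")
      size = os.path.getsize(log)
      data = mmap.mmap(file.fileno(),size,access=mmap.ACCESS_READ)
      text = data.read(size)
      counts = text.count(string)
      data.close()
      file.close()
   except (EnvironmentError, ValueError) as e:
      # Missing or unreadable file, or an empty file that cannot be mapped.
      print "UNKNOWN - cannot read %s: %s" % (log, e)
      sys.exit(3)
else:
   usage()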

if 'warning' in dir() and 'critical' in dir():
   # getopt returns strings; compare the thresholds and the count as integers.
   warning = int(warning)
   critical = int(critical)
   counts = int(counts)
   if warning < critical:
      if counts >= critical:
         print 'CRITICAL - %s views %s' % (string,counts)
         sys.exit(2)
      elif counts >= warning:
         print 'WARNING - %s views %s' % (string,counts)
         sys.exit(1)
      else:
         print 'OK - %s views %s' % (string,counts)
         sys.exit(0)
   else:
     # A misconfigured check should not report OK.
     print "UNKNOWN - critical must be greater than warning"
     sys.exit(3)
else:
    print 'OK - %s views %s' % (string,counts)
    sys.exit(0)


command[check_mylog]=/usr/bin/python /usr/local/nagios/libexec/check_mylog -l /var/logs/nginx.log -s "No such file or directory" -w 2 -c 5
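Since Python 2 is no longer maintained, here is a rough Python 3 sketch of the same idea: count a keyword in a log file and map the count to a Nagios state. It is not the original author's script. count_keyword and classify are hypothetical names, and the file is read line by line, so a keyword that spans a line break would not be counted.

```python
def count_keyword(path, keyword):
    """Count occurrences of keyword in the file at path, line by line."""
    needle = keyword.encode("utf-8")
    total = 0
    with open(path, "rb") as handle:
        for line in handle:
            total += line.count(needle)
    return total


def classify(count, warning, critical):
    """Map a match count to (exit_code, state_name) using two thresholds."""
    if warning >= critical:
        # A misconfigured check is reported as UNKNOWN rather than OK.
        return 3, "UNKNOWN"
    if count >= critical:
        return 2, "CRITICAL"
    if count >= warning:
        return 1, "WARNING"
    return 0, "OK"
```

A complete plugin would parse -l, -s, -w, and -c as in the script above, print a message containing the state name and the count, and exit with the returned code.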

-------------------------------------------------------------------------------------------------------

Check the configuration file templates.cfg to see whether a service uses active or passive checks:

define service{
        name                         passive_service
        use                          generic-service
        max_check_attempts           1
        active_checks_enabled        0 ; disable active checks
        passive_checks_enabled       1 ; enable passive checks
        normal_check_interval        5
        retry_check_interval         1
        check_freshness              1 ; enable freshness checking
        notifications_enabled        1
        notification_interval        5
        notification_period          24x7
        contact_groups               admins
        register                     0 ; required: this is a template, not a real service
        }
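A service that uses a passive template like the one above waits for results to be pushed to Nagios rather than running a check itself. One way to push a result is the PROCESS_SERVICE_CHECK_RESULT external command, written to the Nagios command file. The Python 3 sketch below formats such a line. The function names are made up, and the command file path in the usage note is the usual default for a source install, which may differ on your system.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # The whole command must fit on a single line.
    output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit(command_file, line):
    """Write the command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, submit("/usr/local/nagios/var/rw/nagios.cmd", passive_result_line("myhost", "jjtest", 0, "OK status - file /tmp/jjtest is OK")) would report an OK result, assuming a service named jjtest on myhost is configured for passive checks and external commands are enabled.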

Original article: https://www.cnblogs.com/wjoyxt/p/4936954.html