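Monitoring serves two purposes. One is to let you review historical or current data to understand how a host has been running and how loaded it has been over a period of time; the other is to send a notification promptly when something goes wrong, so the people responsible can deal with it. This post is mainly about the latter. In a Nagios configuration there are three main ways to wire up notifications for host and service states: directly through contacts or contact_groups; through a template-style reference to a define contact; or through a define host template.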

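This post is a follow-up to my earlier post on Nagios grouping, which closed by noting that Nagios configuration references are very flexible. Here I illustrate that flexibility using the different ways of referencing notification contacts.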

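I. Contact reference method one (via contacts or contact_groups)

First define the people to notify and how to notify them with define contact, then reference them in a host or service like this: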

define service{
        use                          windows-service   ; references a defined service template
        host_name                    jjh
        service_description          PING
        check_command                check_ping!100.0,20%!500.0,60%
        contacts                     admin1            ; must already be defined
        }

Note: the use above refers to a template, which corresponds to what we usually keep in templates.cfg. contacts refers to an entry in contacts.cfg.
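For completeness, here is a minimal sketch of what the matching entry in contacts.cfg might look like. The alias, the email address, and the admins group membership are placeholders I made up for illustration; the notification periods, options, and commands are assumed to come from the generic-contact template.

```cfg
# Hypothetical contacts.cfg entry backing "contacts admin1"
define contact{
        contact_name            admin1
        use                     generic-contact     ; inherit periods, options and commands
        alias                   Admin One           ; placeholder
        email                   admin1@example.com  ; placeholder address
        }

# Optional: group the contact so it can also be referenced through contact_groups
define contactgroup{
        contactgroup_name       admins
        alias                   Administrators
        members                 admin1
        }
```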

II. Contact reference method two (via a template reference to define contact)

1. Define the contact first

define contact{
        contact_name                    ZheJiang
        use                             generic-contact ; the contact references a template
        alias                           ZheJiang_Mobile
        service_notification_commands   notify-service-by-email,notify-service-by-sms
        email                           abc@361way.com,def@361way.com
        pager                           "1366XXXXXXX,13819XXXXXX"
        }

2. Reference it through use

define service{
        use                    ZheJiang   ; references the contact
        host_name              ZJ-ZJ-App
        service_description    CPU Load
        low_flap_threshold     0
        high_flap_threshold    0.999
        check_command          check_nrpe!check_load
        }
define service{
        use                    ZheJiang   ; references the contact
        host_name              ZJ-ZJ-App
        service_description    Check_Disk
        check_command          check_nrpe!check_disk
        }

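Note: here the service pulls in the contact definition directly through use. use works somewhat like include in programming: it brings in something that was defined earlier. The contact defined above in turn uses a definition from templates.cfg, which typically sets the notification triggers, time periods, and so on. One caveat: as far as I know, Nagios resolves use by the name directive of a template of the same object type, so for these services to load, a service template named ZheJiang that carries the contact settings has to exist as well; a contact_name on its own is not enough.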
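If use ZheJiang does not resolve in your setup, one hedged alternative is to wrap the contact in a service template and use that instead. The template name ZheJiang-service is hypothetical, and the sketch assumes the generic-service template from templates.cfg is available.

```cfg
# Hypothetical service template that carries the ZheJiang contact
define service{
        name                   ZheJiang-service   ; hypothetical template name
        use                    generic-service    ; inherit the usual check settings
        contacts               ZheJiang           ; the contact defined in step 1
        register               0                  ; template only, not a real service
        }

define service{
        use                    ZheJiang-service
        host_name              ZJ-ZJ-App
        service_description    CPU Load
        check_command          check_nrpe!check_load
        }
```

Because generic-service also sets contact_groups, that group would still be notified alongside ZheJiang unless the template overrides it.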

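III. Contact reference method three (via a define host template)

The method here is essentially method two turned around: define the contacts first, then reference them from templates.cfg through contacts or contact_groups, and have host-xxxx.cfg reference the templates in templates.cfg. Since method two already showed how contacts are defined in contacts.cfg, I skip that here and only list a few common definitions from templates.cfg: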

# Contact template
define contact{
        name                            generic-contact         ; The name of this contact template
        service_notification_period     24x7                    ; service notifications can be sent anytime
        host_notification_period        24x7                    ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s             ; notification triggers (same below)
        host_notification_options       d,u,r,f,s               ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }
# Host templates
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        notification_period             24x7            ; Send host notifications at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
define host{
        name                            JJH-server      ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  5               ; Actively check the host every 5 minutes
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              10              ; Check each Linux host 10 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  admins-jjh      ; references the contact group
        hostgroups                      JJH-servers
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
# Service templates
define service{
        name                            generic-service         ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        failure_prediction_enabled      1                       ; Failure prediction is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10                      ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2                       ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60                      ; Re-notify about service problems every hour
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
# Turn notifications off by setting notifications_enabled to 0
define service{
        name                            no-notice-service       ; The name of this service template
        use                             generic-service         ; Inherit default values from the generic-service definition
        max_check_attempts              4                       ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5                       ; Check the service every 5 minutes under normal conditions
        notifications_enabled           0                       ; Service notifications are disabled
        event_handler_enabled           0
        retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
# The service templates below specify a notification (contact) group
define service{
        name                            windows-service         ; The name of this service template
        use                             generic-service         ; Inherit default values from the generic-service definition
        max_check_attempts              4                       ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5                       ; Check the service every 5 minutes under normal conditions
        retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        contact_groups                  admins-win              ; references the contact group
        }
define service{
        name                            JJH-service             ; The name of this service template
        use                             generic-service         ; Inherit default values from the generic-service definition
        max_check_attempts              4                       ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5                       ; Check the service every 5 minutes under normal conditions
        retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        contact_groups                  admins-jjh              ; references the contact group
        }

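Note: as the templates above show, the syntax is very flexible: you can control whether notifications are sent at all, which contact group receives them, how often they repeat, which conditions trigger them, and so on. Templates are written so that they can later be pulled into xxxhost.cfg with use, which cuts down on how much has to be written. This, too, resembles variables in programming.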
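The JJH-server template refers to a time period named workhours, which has to be defined somewhere (usually timeperiods.cfg). A minimal sketch of such a definition follows; the alias and the weekday hours are assumptions for illustration, not values taken from my configuration.

```cfg
# Hypothetical definition of the "workhours" period used by JJH-server
define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00     ; assumed hours, adjust to your schedule
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }
```

Days that are not listed are simply excluded, so with this sketch no host notifications would go out on weekends.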

In a host file such as 361way.cfg, the templates are referenced like this:

# Template references
define host{
        use                     JJH-server     ; use the template
        host_name               jjh-cc
        parents                 aliyun
        statusmap_image         linux40.gd2
        alias                   jjh-cc
        address                 115.29.161.54
        notification_interval   0
        process_perf_data       1
        action_url              /pnp4nagios/graph?host=$HOSTNAME$
        }
define service{
        use                             JJH-service,srv-pnp         ; Name of service template to use
        host_name                       jjh-cc
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }
define service{
        use                             JJH-service,srv-pnp
        host_name                       jjh-cc
        service_description             check_cpu
        check_command                   check_nrpe!check_cpu
        }
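To make the inheritance chain concrete, the sketch below shows roughly what the jjh-cc host amounts to once JJH-server and generic-host are merged in, limited to the notification-related directives. It is for reading only and should not be added to a config file, since that would define the host twice. It assumes the usual rule that a value set locally wins over an inherited one.

```cfg
# Illustration only: approximate effective notification settings for jjh-cc
define host{
        host_name               jjh-cc
        address                 115.29.161.54
        notifications_enabled   1               ; inherited from generic-host
        notification_period     workhours       ; JJH-server overrides generic-host's 24x7
        notification_options    d,u,r           ; inherited from JJH-server
        contact_groups          admins-jjh      ; inherited from JJH-server
        notification_interval   0               ; set on the host itself, overrides JJH-server's 120
        }
```

With notification_interval set to 0, Nagios sends a single notification per problem instead of re-notifying at intervals.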

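IV. Summary

The examples above try to make clear how flexibly contacts.cfg, templates.cfg, and XXXhost.cfg can reference one another in Nagios. One file is left out here: timeperiods.cfg, which defines the time ranges for notifications (for example working hours versus off hours, or China time versus US time). If reading the configuration or the three methods above only leaves you more confused, the following summary points may help.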

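1. Start from the most naive approach: when you define a check in hostxxx.cfg, you can put notification_options, notification_period, notification_interval, contact_groups, and similar parameters directly into the definition. That meets your notification needs just as well.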

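2. To simplify that naive approach, you treat this set of parameters like a variable: give it a name, define it in templates.cfg, and then call it from hostxxx.cfg with use plus the name defined in templates.cfg. The parameters above now live in the template and can be left out of the host file.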

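3. When there are many contacts, and different applications and hosts need to notify different people, a separate contacts.cfg file is introduced to define and group the people to notify. Whether contacts use templates or templates reference contacts.cfg, the end result is the same: the configuration is gathered into something hostxxx.cfg can use.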

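4. How many configuration files there are and what they are called does not matter; if you like, you can use a single file. Multiple files just make things easier to tell apart, easier to find, and less work. As long as each file is loaded from nagios.cfg (through cfg_file or cfg_dir), Nagios handles it fine.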

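5. You can think of define as declaring a variable, and of use as referencing that variable or including a configuration file. contacts and contact_groups are Nagios directives, which you can think of as built-in functions.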
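As a sketch of point 1, a service that carries its notification settings inline, without any template, might look like the following. The values are borrowed from the JJH-service and generic-service templates shown earlier; a real definition may need further directives depending on your Nagios version.

```cfg
# Naive form: everything spelled out in hostxxx.cfg, no "use" line
define service{
        host_name               jjh-cc
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        notification_options    w,u,c,r         ; warning, unknown, critical, recovery
        notification_period     24x7
        notification_interval   60              ; re-notify every hour
        contact_groups          admins-jjh
        }
```

Repeating these lines for every service is exactly what the templates in points 2 and 3 are meant to avoid.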

Reference: the Nagios online manual