Nagios Systems and Services


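I. Installation
1. The relationship between Nagios, nagios-plugin and nrpe
The actual monitoring is done by nagios-plugin; for example, the executable check_load monitors system load. Nagios schedules and organizes the programs that nagios-plugin provides. Without nagios-plugin, Nagios itself can do nothing.
We install Nagios and nagios-plugin on a server that runs Apache, and install nagios-plugin and nrpe on each of the other monitored machines. On a monitored machine such as 10.15.3.170, local check programs for disk, load and so on (check_disk, check_load) are executed by nrpe, which then sends the results to the Nagios server.
Checks of network services, and local checks on the Nagios server itself, do not need to be relayed through nrpe.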
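
To make this division of labor concrete, here is a minimal sketch of what a plugin looks like from Nagios's point of view: a program that prints one status line and reports the result through its exit code. The function name check_threshold and its messages are made up for illustration and are not part of nagios-plugin; the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is the one Nagios plugins follow.

```shell
#!/bin/sh
# Hypothetical mini-plugin (not part of nagios-plugin): compares a number
# against warning and critical thresholds.
# Exit codes follow the Nagios plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
check_threshold() {
    value=${1:-}; warn=${2:-}; crit=${3:-}

    # Anything that is not a non-negative integer yields UNKNOWN.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - usage: check_threshold VALUE WARN CRIT"
                return 3 ;;
        esac
    done

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"; return 1
    fi
    echo "OK - value is $value"
    return 0
}

check_threshold 3 5 10    # prints "OK - value is 3" and returns 0
```

Whether such a program is started directly by Nagios or by nrpe on a remote machine, the status line and exit code are what reach the Nagios server.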

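2. Installing nagios, nagios-plugin and nrpe
For installing nagios, see the official documentation at http://nagios.sourceforge.net/docs/2_0/toc.html
Installation of nagios, nagios-plugin and nrpe is also covered in the LDF help documentation.

First make sure that apache and gdlib are installed (here apache is installed in /usr/local/apache22, and /usr/lib contains libgd.so.*). Then install nagios.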

2.1 Installing nagios
Unpack the archive:
#tar zxvf nagios-2.8.tar.gz
#cd nagios-2.8
#adduser nagios
#mkdir /usr/local/nagios
#chown nagios:nagios /usr/local/nagios
Determine which user apache runs as:
#grep "^User" /usr/local/apache22/conf/httpd.conf
User daemon
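Add daemon to the nagios group (note that -G without -a replaces any supplementary groups daemon already belongs to):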
#usermod -G nagios daemon
#./configure --prefix=/usr/local/nagios
#make all
#make install
Install the init script into /etc/init.d/:
#make install-init
Install the sample configuration files:
#make install-config
#make install-commandmode
Configure apache:
Add the following to the apache configuration file:

#vi /usr/local/apache22/conf/httpd.conf

ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin

<Directory "/usr/local/nagios/sbin">
    Options ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /usr/local/nagios/etc/htpasswd.users
    Require valid-user
</Directory>

Alias /nagios /usr/local/nagios/share

<Directory "/usr/local/nagios/share">
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /usr/local/nagios/etc/htpasswd.users
    Require valid-user
</Directory>

To avoid conflicts with the existing server configuration, the same directives can be placed in a virtual host instead.

#vi /usr/local/apache22/conf/extra/httpd-vhosts.conf

<VirtualHost 10.15.5.145:80>
    DocumentRoot /usr/local/apache22/htdocs
    ServerName nagios.mj.dalian
    ErrorLog logs/dummy-host2.example.com-error_log
    CustomLog logs/dummy-host2.example.com-access_log common
    ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin

    <Directory "/usr/local/nagios/sbin">
        Options ExecCGI
        AllowOverride None
        Order allow,deny
        Allow from all
        AuthName "Nagios Access"
        AuthType Basic
        AuthUserFile /usr/local/nagios/etc/htpasswd.users
        Require valid-user
    </Directory>

    Alias /nagios /usr/local/nagios/share

    <Directory "/usr/local/nagios/share">
        Options None
        AllowOverride None
        Order allow,deny
        Allow from all
        AuthName "Nagios Access"
        AuthType Basic
        AuthUserFile /usr/local/nagios/etc/htpasswd.users
        Require valid-user
    </Directory>
</VirtualHost>

Use the htpasswd tool that ships with apache to add a user for the nagios web interface:

#htpasswd -c /usr/local/nagios/etc/htpasswd.users MJ

To add more users, omit -c, because htpasswd.users already exists:

#htpasswd /usr/local/nagios/etc/htpasswd.users username

Check that the use_authentication option in /usr/local/nagios/etc/cgi.cfg is set to 1:

#grep use_authentication /usr/local/nagios/etc/cgi.cfg
use_authentication=1

Restart apache and open the nagios URL in a browser. A username and password prompt appears; after logging in, the nagios home page is displayed.

#/usr/local/apache22/bin/apachectl restart



2.2 Installing nagios-plugin

Unpack the archive and enter the nagios-plugin directory:

#tar zxvf nagios-plugin-1.4.tar.gz
#cd nagios-plugin-1.4
#./configure --prefix=/usr/local/nagios/
#make all
#make install

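The nagios-plugin executables are installed in /usr/local/nagios/libexec.

Note that if check_mysql is needed, ./configure for nagios-plugin must be given the option --with-mysql=(MySQL installation path). I compiled nagios-plugin on 172.18.3.173, a machine with MySQL installed, then used scp to copy check_mysql and the MySQL runtime library (/usr/local/mysql/lib/mysql/libmysqlclient.so.15.0.0) to 172.18.3.141 and symlinked the library there as /usr/lib/libmysqlclient.so.15. After that, check_mysql ran normally.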

2.3 Installing nrpe and nagios-plugin on the other machines

Unpack the archive and enter the nrpe source directory:

#tar zxvf nrpe-2.7.1.tar.gz
#cd nrpe-2.7.1
#./configure --enable-ssl --enable-command-args
#make all

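This builds two executables in src: nrpe and check_nrpe. The first is the nrpe daemon itself; the second is a nagios plugin.
check_nrpe must be placed in the libexec directory on the nagios server.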

#scp src/check_nrpe 10.15.3.166:/usr/local/nagios/libexec

#mkdir /usr/local/nagios/{bin,etc} -p
#cp src/nrpe /usr/local/nagios/bin

The sample-config directory also contains a sample nrpe configuration file.
#cp sample-config/* /usr/local/nagios/etc/


Install nagios-plugin as described in 2.2.

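II. Configuration
1. Basic nagios configuration
After nagios is installed, /usr/local/nagios/etc contains sample configuration files such as nagios.cfg-sample. Create a copy of each file without the -sample suffix: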
#cd /usr/local/nagios/etc
#cp nagios.cfg-sample nagios.cfg
Do the same for the other files.
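
Copying the files one at a time gets tedious, so the step can be sketched as a small loop. The helper name activate_samples is hypothetical; it copies every *-sample file in the given directory to the same name without the suffix and leaves existing files untouched.

```shell
#!/bin/sh
# Hypothetical helper: copy every FILE-sample in a directory to FILE,
# without overwriting a FILE that already exists.
#   $1 = configuration directory, e.g. /usr/local/nagios/etc
activate_samples() {
    dir=$1
    for sample in "$dir"/*-sample; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$sample" ] || continue
        target=${sample%-sample}
        [ -e "$target" ] || cp "$sample" "$target"
    done
}

# Usage on the nagios server (run as a user that may write to the directory):
#   activate_samples /usr/local/nagios/etc
```

The originals are kept, so a damaged configuration file can always be restored from its -sample copy.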
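The main nagios configuration file is /usr/local/nagios/etc/nagios.cfg.
The meaning of some parameters in nagios.cfg:
Options whose names end in _file give the locations of other configuration files, and may appear more than once. For example: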
resource_file=/usr/local/nagios/etc/resource.cfg
resource.cfg defines macros, for example $USER1$=/usr/local/nagios/libexec. When commands and monitoring tasks are defined later, $USER1$ refers to the nagios-plugin directory.
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/monitor.cfg
commands.cfg and monitor.cfg define the actual monitoring tasks and their options.
In each cfg_file, every task is defined as an object.
commands.cfg contains the following definition:
define command {
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
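This defines a command named check_local_disk whose command line is $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
$USER1$ stands for the nagios-plugin directory, and the special macros $ARGn$ stand for the arguments passed to the executable. When nagios runs check_local_disk, what it actually executes is the file /usr/local/nagios/libexec/check_disk. You can run it by hand from the linux command line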
#/usr/local/nagios/libexec/check_disk --help
to see its help text.
monitor.cfg contains the following definitions:
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }
This defines a time period named 24x7. The entries that follow show that it covers all 24 hours of every day of the week.
define contact{
        contact_name                    MJ
        alias                           MJ
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           [email protected]
        }
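This defines a contact named MJ. Notifications about service and host problems may be sent during the 24x7 time period. Service notifications are sent for the states warning, unknown, critical and recovery, and they are delivered by email.

Note: in this example MJ is also the user name used for apache authentication. By default, a user has enough permission to view service status only if the name used to log in to nagios matches a contact name.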

define host{
        name                            mj-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        check_period                    24x7            ; By default, Linux hosts are checked round the clock

        max_check_attempts              10              ; Check each Linux host 10 times (max)
        check_command                   check-host-alive
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
        notification_interval           120             ; Resend notification every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  MJ-SYS          ; Notifications get sent to the admins by default
        retain_nonstatus_information    1               ; Retain non-status information across program restarts

        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
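This defines a host template named mj-host. Its check period is 24x7 (defined earlier), its notification period is workhours, and so on.
register 0 marks it as a template only: other objects can inherit from it, but by itself it has no effect on what is monitored.
The check_command attribute names the command that is run to check the host. Here it is check-host-alive, which is defined in commands.cfg.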
define host{
        use                     mj-host              ; Name of host template to use
        host_name               MJ-FRONT
        alias                   MJ-APACHE-TOMCAT
        address                 10.15.3.166
        }
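This defines an actual host. use mj-host means that the host inherits its settings from the mj-host template defined above, which in effect copies every attribute of mj-host except name and register into this definition. In addition to the values set in mj-host, the host has these extra attributes: host_name MJ-FRONT (the name shown on the monitoring pages), alias and address.
register is not inherited automatically and defaults to 1.
Attributes set in a child object override those of its parent.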
define service{
        name                            mj-host                 ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        failure_prediction_enabled      1                       ; Failure prediction is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

define service{
        name                            MJ-FRONT                ; The name of this service template
        use                             mj-host                 ; Inherit default values from the mj-host service template
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              4                       ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5                       ; Check the service every 5 minutes under normal conditions
        retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
        contact_groups                  MJ-SYS                  ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60                      ; Re-notify about service problems every hour
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

define service{
        use                             MJ-FRONT         ; Name of service template to use
        host_name                       MJ-FRONT
        service_description             DISK    
        check_command                   check_local_disk!10%!5%!/
        }
In the check_command value, check_local_disk is the command name defined in commands.cfg. The arguments are separated by ! and are referenced in the command definition as $ARGn$.
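
The substitution can be traced by hand with a small sketch. The function expand_check_command below is a hypothetical helper that only imitates the macro replacement for $USER1$ and $ARGn$; it is a simplification for illustration and not how nagios is implemented internally.

```shell
#!/bin/sh
# Simplified sketch of macro expansion; expand_check_command is a
# hypothetical helper, not something nagios ships.
#   $1 = check_command value, e.g. 'check_local_disk!10%!5%!/'
#   $2 = command_line of the matching command definition
#   $3 = value of $USER1$ from resource.cfg
expand_check_command() {
    check_command=$1
    line=$(printf '%s\n' "$2" | sed 's|\$USER1\$|'"$3"'|g')

    # Everything after the first '!' is the argument list.
    case "$check_command" in
        *!*) args=${check_command#*!} ;;
        *)   args= ;;
    esac

    n=1
    old_ifs=$IFS
    IFS='!'
    set -f                      # keep arguments such as '*' literal
    for arg in $args; do
        line=$(printf '%s\n' "$line" | sed 's|\$ARG'"$n"'\$|'"$arg"'|g')
        n=$((n + 1))
    done
    set +f
    IFS=$old_ifs

    printf '%s\n' "$line"
}

expand_check_command 'check_local_disk!10%!5%!/' \
    '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$' \
    /usr/local/nagios/libexec
# prints: /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /
```

The printed line is the command nagios ends up running for the DISK service: warn when free space on / drops below 10%, go critical below 5%.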
normal_check_interval           5
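means that the service is checked once every 5 time units. The length of a time unit is set by interval_length in nagios.cfg (interval_length=60), which defaults to 60 seconds.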
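
The arithmetic is simply the interval multiplied by interval_length. A tiny sketch, with check_interval_seconds as a made-up helper name:

```shell
#!/bin/sh
# check_interval_seconds is a hypothetical helper for illustration.
#   $1 = interval in time units (normal_check_interval or retry_check_interval)
#   $2 = interval_length in seconds (optional, defaults to 60)
check_interval_seconds() {
    echo $(( $1 * ${2:-60} ))
}

check_interval_seconds 5        # prints 300: a regular check every 5 minutes
check_interval_seconds 1        # prints 60: a retry every minute
check_interval_seconds 5 30     # prints 150 if interval_length were set to 30
```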

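Each service is tied to a host through host_name. Host checks are driven by the service checks on that host: if the http service on host 101 is due to be checked in 5 minutes, then when it checks apache, nagios first checks whether host 101 answers ping. If it does, the host detail item on the monitoring page shows ok; otherwise the host is considered down. If no service is defined on a host, that host is not checked (unless configured otherwise).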

Once nagios is configured, verify the configuration before starting it:
#/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
(some output)...........
Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
If there are no errors, start nagios:
#service nagios start

2. nrpe configuration

#ssh 10.15.3.170
#cd /usr/local/nagios/
#vi etc/nrpe.cfg
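Check that the file contains
dont_blame_nrpe=1 (this parameter allows nagios to pass command arguments to nrpe)
and that allowed_hosts includes the IP address of the nagios server, or that the allowed_hosts line is commented out.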
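
These two conditions can be checked mechanically before nrpe is started. The sketch below assumes the plain key=value layout of the sample nrpe.cfg; the function name check_nrpe_cfg and its messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check an nrpe.cfg before starting nrpe.
#   $1 = path to nrpe.cfg
#   $2 = IP address of the nagios server
check_nrpe_cfg() {
    cfg=$1; server_ip=$2

    # Command arguments are only accepted when dont_blame_nrpe=1.
    if ! grep -q '^dont_blame_nrpe=1[[:space:]]*$' "$cfg"; then
        echo "dont_blame_nrpe is not set to 1"
        return 1
    fi

    # A missing or commented-out allowed_hosts line is accepted here.
    hosts=$(grep '^allowed_hosts=' "$cfg" | tr -d ' ') || true
    [ -n "$hosts" ] || { echo "ok"; return 0; }

    # Wrap the list in commas so that only whole addresses match.
    case ",${hosts#allowed_hosts=}," in
        *",$server_ip,"*) echo "ok"; return 0 ;;
    esac
    echo "allowed_hosts does not list $server_ip"
    return 1
}

# Usage on the monitored machine:
#   check_nrpe_cfg /usr/local/nagios/etc/nrpe.cfg 10.15.3.166
```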
The commands that nrpe accepts are defined at the end of the file. Comment out lines 191-196 and uncomment the last few lines:

    189 # The following examples use hardcoded command arguments...
    190
    191 #command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
    192 #command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
    193 #command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hda1
    194 #command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hdb1
    195 #command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
    196 #command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
    197
    198 # The following examples allow user-supplied arguments and can
    199 # only be used if the NRPE daemon was compiled with support for
    200 # command arguments *AND* the dont_blame_nrpe directive in this
    201 # config file is set to '1'...
    202
    203 command[check_users]=/usr/local/nagios/libexec/check_users -w $ARG1$ -c $ARG2$
    204 command[check_load]=/usr/local/nagios/libexec/check_load -w $ARG1$ -c $ARG2$
    205 command[check_disk]=/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
    206 command[check_procs]=/usr/local/nagios/libexec/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$

Start nrpe:
#/usr/local/nagios/bin/nrpe -n -c /usr/local/nagios/etc/nrpe.cfg -d
-n disables ssl; add this option if you run into ssl errors
-c specifies the nrpe configuration file
-d runs nrpe as a daemon.

Note: nrpe provides the check_nrpe plugin, which is placed in the libexec directory on the nagios host and used like any other plugin.
For example, when nagios runs
check_nrpe -H 192.168.2.3 -c check_mysql
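check_nrpe sends the command check_mysql to port 5666 (the default) on 192.168.2.3, where it is executed by the nrpe running on that machine, and the result is returned to nagios.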

Configure the nagios server side:
#ssh 10.15.3.166
#cd /usr/local/nagios/etc
#vi commands.cfg
Add:
define command{
        command_name    check_nrpe_load
        command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -n -c $ARG1$ -a $ARG2$ $ARG3$
}

#vi monitor.cfg
Add:
define service{
        use                             MJ-FRONT
        host_name                       MJ-Admin
        service_description             LOAD
        check_command                   check_nrpe_load!check_load!5,5,5!10,10,10
        }
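
For the LOAD service, the server-side command expands to check_nrpe -H (host address) -n -c check_load -a 5,5,5 10,10,10. On the monitored machine, nrpe then fills the values given after -a into the $ARGn$ placeholders of its own command[check_load] line. The sketch below imitates that second step; nrpe_command_line is a hypothetical helper, not part of nrpe.

```shell
#!/bin/sh
# Hypothetical helper imitating how the arguments after -a fill the
# $ARGn$ placeholders of a command[...] line in nrpe.cfg.
#   $1    = command line from nrpe.cfg
#   $2... = arguments passed by check_nrpe after -a
nrpe_command_line() {
    template=$1
    shift
    n=1
    for arg in "$@"; do
        template=$(printf '%s\n' "$template" | sed 's|\$ARG'"$n"'\$|'"$arg"'|g')
        n=$((n + 1))
    done
    printf '%s\n' "$template"
}

nrpe_command_line \
    '/usr/local/nagios/libexec/check_load -w $ARG1$ -c $ARG2$' \
    5,5,5 10,10,10
# prints: /usr/local/nagios/libexec/check_load -w 5,5,5 -c 10,10,10
```

This is why dont_blame_nrpe=1 and the argument-taking command definitions in nrpe.cfg are both required: without them the thresholds sent by the server would be ignored or rejected.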

Restart nagios:
#service nagios restart