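I. NSClient++ and NRPE
Nagios can monitor Windows hosts in two main ways: through NSClient++ or through NRPE.
The biggest difference between NSClient++ and NRPE is:
1. With NRPE, the monitored host runs nrpe together with a set of plugins, and the plugins do the actual checking. When the monitoring server sends a check request to nrpe, nrpe calls the matching plugin to carry it out.
2. With NSClient++, only NSClient++ itself is installed on the monitored host, with no plugins. When the monitoring server sends a check request to NSClient++, NSClient++ performs the check directly; every check is handled by NSClient++ itself.
This also points to a significant limitation of NSClient++ as it is used here: it is inflexible and not extensible. It can only run the checks it ships with and is not extended with plugins. Fortunately its built-in checks are good enough to cover essentially all of our monitoring needs.
[Figure: how NSClient++ works]
II. Deployment
1. Install NSClient++ on Windows
(1) Click Next through the installer
(2) Set the IP address of the Nagios server
(3) Check that the NSClient++ port has been opened
If the service is not running: Win+R --> services.msc --> start the NSClient++ service
(4) Open TCP port 12489 in the firewall
2. Configure the Nagios server
(1) Check that the Nagios plugin can query the Windows host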
[root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v UPTIME
System Uptime - 3 day(s) 12 hour(s) 32 minute(s)
[root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v CPULOAD -w 80 -c 90 -l 5,80,90
CPU Load 0% (5 min average) | '5 min avg Load'=0%;80;90;0;100
# -w is the warning threshold and -c the critical threshold; -l (lowercase L) 5,80,90 means: average over the past 5 minutes, warning at 80%, critical at 90%
[root@cacti libexec]# ./check_nt -H 192.168.200.15 -p 12489 -s dianyi123 -v USEDDISKSPACE -w 80 -c 90 -l C
C:\ - total: 100.83 Gb - used: 13.71 Gb (14%) - free 87.12 Gb (86%) | 'C:\ Used Space'=13.71Gb;80.66;90.74;0.00;100.83
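The part after the | in each result is performance data in the form 'label'=value;warn;crit;min;max. As a rough sketch of how those fields can be picked apart in shell (parse_perfdata is a name I made up, it is not part of Nagios, and it only looks at the first performance-data item):

```shell
# parse_perfdata: hypothetical helper, not part of Nagios or its plugins.
# Takes one line of plugin output and prints the fields of the first
# performance-data item ('label'=value;warn;crit;min;max).
parse_perfdata() {
    printf '%s\n' "$1" | awk '{
        bar = index($0, "|")
        if (bar == 0) { exit 1 }            # no performance data present
        perf = substr($0, bar + 1)
        sub(/^[ \t]+/, "", perf)            # trim the space after the bar
        eq = index(perf, "=")
        label = substr(perf, 1, eq - 1)
        gsub("\047", "", label)             # drop the single quotes around the label
        split(substr(perf, eq + 1), f, ";")
        printf "label=%s value=%s warn=%s crit=%s min=%s max=%s\n", label, f[1], f[2], f[3], f[4], f[5]
    }'
}

# Example with the CPULOAD result shown above
parse_perfdata "CPU Load 0% (5 min average) | '5 min avg Load'=0%;80;90;0;100"
```

For the UPTIME result, which carries no performance data, the function prints nothing and returns a non-zero status.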
(2) Define the command, host and services
① Define the command
[root@cacti ~]# vim /usr/local/nagios/etc/objects/commands.cfg
define command{
command_name check_win
command_line $USER1$/check_nt -H "$HOSTADDRESS$" -p 12489 -s dianyi123 -v $ARG1$ $ARG2$
}
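Nagios splits a service's check_command value on the ! character: the first piece is the command name and the following pieces become $ARG1$, $ARG2$ and so on. To make that concrete, here is a minimal sketch of the substitution in shell. It only illustrates the idea and is not how Nagios implements it: expand_args is a made-up helper, it handles $ARG1$ to $ARG9$ only, and it leaves $USER1$ and $HOSTADDRESS$ untouched.

```shell
# expand_args: hypothetical helper that mimics $ARGn$ substitution.
#   $1 = command_line template from the command definition
#   $2 = check_command value from a service (name!arg1!arg2...)
# Note: awk -v interprets backslash escapes, so arguments containing
# backslashes are outside the scope of this sketch.
expand_args() {
    awk -v tpl="$1" -v cc="$2" 'BEGIN {
        n = split(cc, part, "!")            # part[1] is the command name
        for (i = 1; i <= 9; i++) {
            macro = "$ARG" i "$"
            val = (i + 1 <= n) ? part[i + 1] : ""
            out = ""
            # literal (non-regex) replacement of every occurrence
            while ((p = index(tpl, macro)) > 0) {
                out = out substr(tpl, 1, p - 1) val
                tpl = substr(tpl, p + length(macro))
            }
            tpl = out tpl
        }
        print tpl
    }'
}

expand_args '$USER1$/check_nt -H "$HOSTADDRESS$" -p 12489 -s dianyi123 -v $ARG1$ $ARG2$' \
            'check_win!CPULOAD!-l 5,80,90'
```

With a value such as check_win!UPTIME, $ARG2$ is simply replaced by an empty string.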
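② Define the host and services
For convenience, the host and its services are defined in a single configuration file.
First create a directory named servers under /usr/local/nagios/etc to hold the per-server configuration files, and name each file after the server's IP address.
nagios.cfg then needs to have the servers directory enabled with cfg_dir.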
[root@cacti etc]# pwd
/usr/local/nagios/etc
[root@cacti etc]# ls
cgi.cfg  htpasswd.users  nagios.cfg  objects  resource.cfg  servers
[root@cacti etc]# vim nagios.cfg
cfg_dir=/usr/local/nagios/etc/servers
[root@cacti etc]# vim servers/192.168.200.15.cfg
define host{
use windows-server ; Name of host template to use
host_name 192.168.200.15
alias my computer
address 192.168.200.15
}
#define hostgroup{
# hostgroup_name windows-servers ; The name of the hostgroup
# alias Windows Servers ; Long name of the group
# }
define service{
use generic-service
host_name 192.168.200.15
service_description NSClient++ Version
check_command check_win!CLIENTVERSION
}
define service{
use generic-service
host_name 192.168.200.15
service_description Uptime
check_command check_win!UPTIME
}
define service{
use generic-service
host_name 192.168.200.15
service_description CPU Load
check_command check_win!CPULOAD!-l 5,80,90
}
define service{
use generic-service
host_name 192.168.200.15
service_description Memory Usage
check_command check_win!MEMUSE!-w 80 -c 90
}
define service{
use generic-service
host_name 192.168.200.15
service_description C:\ Drive Space
check_command check_win!USEDDISKSPACE!-l c -w 80 -c 90
}
define service{
use generic-service
host_name 192.168.200.15
service_description D:\ Drive Space
check_command check_win!USEDDISKSPACE!-l d -w 80 -c 90
}
define service{
use generic-service
host_name 192.168.200.15
service_description E:\ Drive Space
check_command check_win!USEDDISKSPACE!-l e -w 80 -c 90
}
#define service{
# use generic-service
# host_name 192.168.200.15
# service_description W3SVC
# check_command check_win!SERVICESTATE!-d SHOWALL -l W3SVC
# }
define service{
use generic-service
host_name 192.168.200.15
service_description Explorer
check_command check_win!PROCSTATE!-d SHOWALL -l Explorer.exe
}
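The cfg_dir step is easy to forget when setting up a second monitoring server, so here is a small sketch that adds the line only if it is not already active. The function name ensure_cfg_dir is my own; the paths are the ones used in this post.

```shell
# ensure_cfg_dir: hypothetical helper.
#   $1 = path to nagios.cfg
#   $2 = directory that should be loaded via cfg_dir
# A commented-out "#cfg_dir=..." line does not count as enabled,
# because grep -x requires the whole line to match.
ensure_cfg_dir() {
    cfg=$1
    dir=$2
    if grep -qxF "cfg_dir=$dir" "$cfg"; then
        echo "already enabled"
    else
        printf 'cfg_dir=%s\n' "$dir" >> "$cfg"
        echo "added"
    fi
}

# Usage (run as a user allowed to edit nagios.cfg):
#   ensure_cfg_dir /usr/local/nagios/etc/nagios.cfg /usr/local/nagios/etc/servers
```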
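(3) Check the configuration files for errors
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If the check reports no errors (the summary at the end shows Total Errors: 0), the configuration is fine and Nagios can be restarted.
(4) Restart the Nagios service
[root@cacti ~]# service nagios restart
Stopping nagios: [ OK ]
Starting nagios: [ OK ]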
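Steps (3) and (4) can be chained so that Nagios is only restarted when the configuration check passes. A minimal sketch, assuming nagios -v exits with a non-zero status on configuration errors; restart_if_valid and the three variables are names I chose, with defaults matching the paths in this post:

```shell
# Defaults match the installation used in this post; override as needed.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
RESTART_CMD=${RESTART_CMD:-service nagios restart}

# restart_if_valid: hypothetical helper.
# Runs the configuration check and restarts only if it succeeds.
restart_if_valid() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        $RESTART_CMD                        # intentionally unquoted: command plus arguments
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}

# Usage:
#   restart_if_valid
```

When the check fails, run the nagios -v command by hand to see the actual error messages.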
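III. Viewing hosts and services in the Nagios web interface
1. Host status
2. Service status
IV. Troubleshooting
Two problems came up while setting up Nagios monitoring of the Windows host.
1. The host status is DOWN instead of the expected UP
Cause: this usually means the server blocks ping. The monitoring server uses ping to decide whether a monitored host is online; once ICMP echo requests were allowed on the Windows server, the host check succeeded.
Fix (Windows Server 2008): Server Manager -> Configuration -> Windows Firewall with Advanced Security -> Inbound Rules -> find "File and Printer Sharing (Echo Request - ICMPv4-In)", right-click it and choose "Enable Rule".
2. could not fetch information from server
After the first problem was fixed the host status changed to UP, but every service showed "could not fetch information from server".
Cause: the Nagios server is not getting any data from the monitored host. The direct cause is that the Allowed hosts setting in the NSClient++ configuration file does not contain the right IP.
Fix: in the NSClient++ configuration file, set Allowed hosts = the IP of the Nagios server
When I installed NSClient++ I entered Allowed hosts = 192.168.200.105, which was correct; I do not know why it later changed to 15.
V. Monitoring Linux hosts with Nagios
1. Define the host on the server side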
define host{
use linux-server
host_name 192.168.200.111
alias linux
address 192.168.200.111
}
define service{
use generic-service
host_name 192.168.200.111
service_description root_/
check_command check_nrpe!check_xvda!5%!10%
}
define service{
use generic-service
host_name 192.168.200.111
service_description /dev/xvdb2
check_command check_nrpe!check_xvdb2!5%!10%
}
define service{
use generic-service
host_name 192.168.200.111
service_description Check Swap
check_command check_nrpe!check_swap
}
define service{
use generic-service
host_name 192.168.200.111
service_description total
check_command check_nrpe!check_total_procs
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_load
check_command check_nrpe!check_load
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_tcp_3306
check_command check_tcp!3306
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_users
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_mem
check_command check_nrpe!check_mem
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_mysql
check_command check_nrpe!check_mysql
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_mysql_slave
check_command check_nrpe!check_mysql_slave
}
define service{
use generic-service
host_name 192.168.200.111
service_description check_http 192.168.200.111/test.html
check_command check_http!'-u /test.html'
# Nagios checks the page's HTTP status (e.g. 200); commands.cfg ships with a check_http command, which can also be used to monitor a domain name.
}
2. On the client, edit the NRPE configuration: vim /usr/local/nagios/etc/nrpe.cfg
command[check_users]=/usr/local/nagios/libexec/check_users -w 3 -c 5
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_xvda]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/xvda
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
# Alibaba Cloud
command[check_xvdb2]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/xvdb2
# the /dev/xvdb1 partition is used as swap
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%
command[check_mem]=/usr/bin/sudo /usr/local/nagios/libexec/check_mem -w 20 -c 10
command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H 192.168.200.111 -unagios -dnagios_monitor -p dianyi123
command[check_mysql_slave]=/usr/local/nagios/libexec/check_mysql_slave
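Every name a service passes to check_nrpe on the server side has to exist as a command[...] entry in the client's nrpe.cfg. A quick cross-check can be sketched as follows; missing_nrpe_commands is a hypothetical helper, and it assumes you have a copy of both files on the same machine.

```shell
# missing_nrpe_commands: hypothetical helper.
#   $1 = server-side file with the service definitions
#   $2 = client-side nrpe.cfg
# Prints every command name that is requested via check_nrpe!<name>
# but has no matching command[<name>]= line in nrpe.cfg.
missing_nrpe_commands() {
    services_cfg=$1
    nrpe_cfg=$2
    sed -n 's/^[[:space:]]*check_command[[:space:]]\{1,\}check_nrpe!\([^!]*\).*/\1/p' "$services_cfg" |
    sort -u |
    while read -r name; do
        grep -q "^command\[$name]=" "$nrpe_cfg" || echo "$name"
    done
}

# Usage:
#   missing_nrpe_commands servers/192.168.200.111.cfg /path/to/copy/of/nrpe.cfg
```

An empty result means every check_nrpe service has a matching definition on the client.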
3. Allow the Nagios server's IP in nrpe.cfg
[root@localhost ~]# vim /usr/local/nagios/etc/nrpe.cfg
allowed_hosts=127.0.0.1,192.168.200.105
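A wrong allowed_hosts entry is an easy mistake to make, so it can be worth checking the file instead of eyeballing it. The sketch below only does an exact match on plain IP addresses and does not understand network ranges; host_allowed is a name I made up.

```shell
# host_allowed: hypothetical helper.
#   $1 = IP address of the Nagios server
#   $2 = path to nrpe.cfg
# Returns 0 if the IP appears literally in allowed_hosts, 1 otherwise.
host_allowed() {
    ip=$1
    cfg=$2
    # take the last allowed_hosts line and strip any spaces
    list=$(sed -n 's/^allowed_hosts=//p' "$cfg" | tail -n 1 | tr -d ' ')
    case ",$list," in
        *",$ip,"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage:
#   host_allowed 192.168.200.105 /usr/local/nagios/etc/nrpe.cfg && echo allowed
```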
4. Start nrpe on the client as a standalone daemon
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
5. Add the check_nrpe command to the Nagios command definitions
[root@monitor ~]# vim /usr/local/nagios/etc/objects/commands.cfg    # add the following definition
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Otherwise restarting Nagios fails with:
Error: Service check command 'check_nrpe!check_total_procs' specified in service 'total' for host '192.168.200.105' not defined anywhere!
6. Test from the server side:
/usr/local/nagios/libexec/check_nrpe -H 192.168.200.111 -c check_xvda
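VI. Supplement
1. Monitoring Windows ports with Nagios
Almost any program written on top of sockets listens on a TCP port, so monitoring that port effectively monitors the program. FTP (21), POP (110) and SMTP (25) are common TCP ports, and for common ports like these Nagios usually already has a check command defined (for example check_ftp, check_pop and check_smtp in the sample commands.cfg); for any other port you define a check for the program's TCP port yourself.
Before monitoring a port, confirm that it is open, for example by running telnet against it from CMD:
C:\Users\Administrator>telnet 192.168.200.15 3389
(1) Define the command
[root@cacti objects]# vim /usr/local/nagios/etc/objects/commands.cfg
define command{
command_name tcp3389
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 3389
}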
(2) Define the service
The host is already defined; host and services live in the same configuration file.
[root@cacti servers]# vim /usr/local/nagios/etc/servers/192.168.200.15.cfg
define service{
use generic-service
host_name 192.168.200.15
service_description port3389
check_command tcp3389
}
(3) Restart the Nagios service
(4) Verify the result in the web interface
2. Monitoring Linux ports with Nagios
[root@cacti servers]# pwd
/usr/local/nagios/etc/servers
[root@cacti servers]# vim 192.168.200.18.cfg
define service{
use generic-service
host_name 192.168.200.18
service_description check_tcp_3306
check_command check_tcp!3306
}
define service{
use generic-service
host_name 192.168.200.18
service_description check_tcp_873
check_command check_tcp!873
}
[root@cacti ~]# service nagios restart
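The port services all follow the same pattern, so when a host has many ports to watch the blocks can be generated instead of typed. A minimal sketch, assuming the generic-service template and the check_tcp command used above; gen_tcp_services is a made-up name.

```shell
# gen_tcp_services: hypothetical helper.
#   $1     = host_name as defined in the host block
#   $2...  = TCP ports to monitor
# Prints one "define service" block per port to standard output.
gen_tcp_services() {
    host=$1
    shift
    for port in "$@"; do
        cat <<EOF
define service{
use generic-service
host_name $host
service_description check_tcp_$port
check_command check_tcp!$port
}
EOF
    done
}

# Usage (append to the host's file, then run the config check):
#   gen_tcp_services 192.168.200.18 3306 873 >> /usr/local/nagios/etc/servers/192.168.200.18.cfg
```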
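If the port is bound to one specific address, as 7003 is below, rather than to all addresses like *:5666:
tcp LISTEN 0 50 61.138.78.59:7003 *:*
tcp LISTEN 0 5 *:5666 *:*
then replace $HOSTADDRESS$ in the command definition with 61.138.78.59, give the command a new command_name, and define the service with it.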
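To find out which address a port is bound to without reading the ss output by eye, something like the following works on output in the format shown above. It is a sketch: listen_addr is a made-up name, and it assumes the local address is the fifth column, as in the two sample lines.

```shell
# listen_addr: hypothetical helper.
#   $1 = port number
# Reads ss-style lines on standard input and prints the local address
# of every LISTEN socket on that port ("*" or "::" means all addresses).
listen_addr() {
    awk -v port="$1" '$2 == "LISTEN" {
        n = split($5, a, ":")
        if (a[n] == port) {
            # everything before the final ":port" is the address
            print substr($5, 1, length($5) - length(port) - 1)
        }
    }'
}

# Usage on the monitored host:
#   ss -antl | listen_addr 7003
```

If the result is a specific address rather than * or ::, that is the address the check has to connect to.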
3. Monitoring MySQL master-slave replication with Nagios
Whether replication is healthy mainly comes down to two threads, the Slave_IO thread and the Slave_SQL thread: if both report Yes, replication is working.
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.200.17
Master_User: doteyplay
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000008
Read_Master_Log_Pos: 1277
Relay_Log_File: relay-bin.000025
Relay_Log_Pos: 1486
Relay_Master_Log_File: master-bin.000008
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Part 1: client configuration
(1) Add a user on the monitored slave server
MariaDB [(none)]> grant Replication client on *.* to nagios@localhost identified by 'nagios';
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)
(2) Verify that the command runs
[root@DBSlave ~]# mysql -unagios -pnagios -e "show slave status\G;"
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.200.17
Master_User: doteyplay
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000008
Read_Master_Log_Pos: 1277
Relay_Log_File: relay-bin.000025
Relay_Log_Pos: 1486
Relay_Master_Log_File: master-bin.000008
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
(3) Write the script /usr/local/nagios/libexec/check_mysql_slave (this is the core of the check)
#!/bin/bash
# Collect the values of Slave_IO_Running and Slave_SQL_Running, in that order.
declare -a slave_is
slave_is=($(/usr/local/mysql/bin/mysql -unagios -pnagios -e "show slave status\G" | grep -E 'Slave_(IO|SQL)_Running:' | awk '{print $2}'))
# Both threads must report Yes for replication to be healthy.
if [ "${slave_is[0]}" = "Yes" ] && [ "${slave_is[1]}" = "Yes" ]
then
    echo "OK C2-slave is running"
    exit 0
else
    echo "Critical C2-slave is error"
    exit 2
fi
[root@DBSlave libexec]# chmod +x check_mysql_slave    # make it executable
[root@DBSlave libexec]# chown nagios.nagios check_mysql_slave
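Nagios plugins report their result through the exit status: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. As a variation on the script above, here is a sketch that reads the show slave status\G output from standard input, which makes it testable without a database, and returns UNKNOWN when no replication status comes back at all (for example when the login fails). The function name slave_state and the message texts are my own.

```shell
# slave_state: hypothetical helper.
# Reads "show slave status\G" output on standard input, prints a one-line
# summary and returns a Nagios status code (0 OK, 2 CRITICAL, 3 UNKNOWN).
slave_state() {
    status=$(cat)
    io=$(printf '%s\n' "$status" | awk '$1 == "Slave_IO_Running:" {print $2}')
    sql=$(printf '%s\n' "$status" | awk '$1 == "Slave_SQL_Running:" {print $2}')
    if [ -z "$io" ] || [ -z "$sql" ]; then
        echo "UNKNOWN - no replication status returned"
        return 3
    elif [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
        echo "OK - IO and SQL threads are running"
        return 0
    else
        echo "CRITICAL - Slave_IO_Running=$io Slave_SQL_Running=$sql"
        return 2
    fi
}

# Usage inside a plugin script:
#   /usr/local/mysql/bin/mysql -unagios -pnagios -e "show slave status\G" | slave_state
#   exit $?
```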
(4) Install nrpe on the slave server, then add one line to nrpe.cfg
[root@DBSlave ~]# vim /usr/local/nagios/etc/nrpe.cfg
command[check_mysql_slave]=/usr/local/nagios/libexec/check_mysql_slave
(5) Run the script by hand and check its output
[root@DBSlave libexec]# sh check_mysql_slave
OK C2-slave is running
(6) Check port 5666 on the monitored host
[root@DBSlave libexec]# ss -antulp | grep 5666
tcp LISTEN 0 5 :::5666 :::* users:(("nrpe",26512,5))
tcp LISTEN 0 5 *:5666 *:* users:(("nrpe",26512,4))
[root@DBSlave libexec]# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Part 2: server configuration
(1) On the monitoring server, check that the monitored host can be queried
[root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18 -c check_mysql_slave
NRPE: Command 'check_mysql_slave' not defined    # ran into a problem
Troubleshooting: NRPE: Command 'check_mysql_slave' not defined
[root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18
NRPE v2.15
This shows that NRPE on the monitored host is working and that the monitoring server can talk to it over SSL.
[root@DBSlave libexec]# ps -ef | grep nrpe
root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
root 10522 9639 0 12:30 pts/0 00:00:00 grep nrpe
nagios 26512 1 0 Aug15 ? 00:01:09 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
# nrpe runs here as a standalone process and has been up since Aug15, so it presumably has not picked up the new command in nrpe.cfg; kill it first
[root@DBSlave libexec]# kill -9 26512    # kill the nrpe process
[root@DBSlave libexec]# ps -ef | grep nrpe
root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
root 10524 9639 0 12:31 pts/0 00:00:00 grep nrpe
# the kill succeeded
[root@DBSlave libexec]# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d    # start nrpe again
[root@DBSlave libexec]# ps -ef | grep nrpe
root 10287 9703 0 12:01 pts/1 00:00:00 vim /usr/local/nagios/etc/nrpe.cfg
nagios 10526 1 0 12:31 ? 00:00:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
root 10528 9639 0 12:31 pts/0 00:00:00 grep nrpe
Test again from the monitoring server
[root@cacti ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.200.18 -c check_mysql_slave
OK C2-slave is running
# It finally passes; the running nrpe process was the problem.
(2) Define the host and service
[root@cacti servers]# pwd
/usr/local/nagios/etc/servers
[root@cacti servers]# vim 192.168.200.18.cfg
define host{
use linux-server
host_name 192.168.200.18
alias linux
address 192.168.200.18
}
define service{
use generic-service
host_name 192.168.200.18
service_description check_mysql_slave
check_command check_nrpe!check_mysql_slave
}
(3) Restart the Nagios service
(4) Check the monitoring status
4. Errors when changing a service through the Nagios web interface
The web interface lets you, for example, reschedule a service's next check or disable its notifications. These actions often fail with the following error:
Could not open command file '/usr/local/nagios/var/rw/nagios.cmd' for update! The permissions on the external command file and/or directory may be incorrect. Read the FAQs on how to setup proper permissions. An error occurred while attempting to commit your command for processing.
(1) Change the ownership
[root@monitor ~]# chown -R nagios.nagios /usr/local/nagios/var/rw/
(2) Add the apache user to the nagios group
[root@monitor ~]# usermod -a -G nagios apache
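To confirm that the group change took effect, the membership can be checked with id. A small sketch; in_group is a name I made up, and apache and nagios are the user and group used in this post.

```shell
# in_group: hypothetical helper.
#   $1 = user name
#   $2 = group name
# Returns 0 if the user is a member of the group, 1 otherwise.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Usage:
#   in_group apache nagios && echo "apache is in the nagios group"
```

The running httpd processes only pick up the new membership after they are restarted.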
(3) Restart the services
[root@monitor ~]# service nagios restart
[root@monitor ~]# service httpd restart