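On Windows, HDTune lets you check disk status, so you don't find out about a failing disk only after it has died. On CentOS, SMART (Self-Monitoring, Analysis and Reporting Technology) provides the same kind of disk health checking.
The examples below use a Dell R720 server: /dev/sda is an ordinary 1 TB SCSI disk, and /dev/sdd is a RAID 5 array built from three disks.
# df -h    # list the disk names
# dmesg | grep sdd    # look up the disk in the boot messages
sd 0:2:0:0: [sdd] Attached SCSI disk
# hdparm -I /dev/sda    # show the disk's hardware information and enabled features, in great detail
Now use SMART to check the disk status: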
# yum install smartmontools    # install the SMART tools
# smartctl -H /dev/sdd    # check the disk health status
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
SMART Health Status: OK
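For scripting, the health verdict can be pulled out of the `smartctl -H` output. The sketch below is one possible way to do it, not part of the original setup: `health_from_output` is a hypothetical helper name, and it assumes the two common line formats, `SMART Health Status: OK` for SCSI/SAS devices (as in the output above) and `SMART overall-health self-assessment test result: PASSED` for ATA devices.

```shell
#!/bin/bash
# Hypothetical helper: read smartctl -H output on stdin and print only
# the health verdict (for example OK or PASSED).
# Prints nothing if no health line is found.
health_from_output() {
    awk -F': *' '
        /^SMART Health Status:/ { print $2; exit }
        /^SMART overall-health self-assessment test result:/ { print $2; exit }
    '
}

# Usage (needs root and smartmontools installed):
#   smartctl -H /dev/sdd | health_from_output
```

An empty result is worth treating as a failure in a monitoring script, since it means smartctl returned no health line at all.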
# smartctl -a /dev/sda    (or: smartctl --all /dev/sda)    # show all SMART information for the disk
# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: DELL
Product: PERC H310
Revision: 2.12
User Capacity: 598,879,502,336 bytes [598 GB]
Logical block size: 512 bytes
Logical Unit id:
Serial number:
Device type: disk
Local Time is: Wed Jan 14 15:37:39 2015 CST
Device does not support SMART
Error Counter logging not supported
Device does not support Self Test logging
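The output says "Device does not support SMART": what smartctl sees here is the logical disk presented by the PERC H310 RAID controller, not a physical drive. Query the member disks behind the controller instead.
Check the status of the first disk in the RAID 5 array:
# smartctl -a -d megaraid,0 /dev/sdd
Check the second and third disks the same way, and, depending on how you do monitoring, add the checks to your Nagios or Zabbix alerting:
# smartctl -a -d megaraid,1 /dev/sdd
# smartctl -a -d megaraid,2 /dev/sdd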
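The three commands above differ only in the disk index, so they can be run in a loop. A minimal sketch, assuming the member disks are numbered from 0 as in this example; `check_members` is a hypothetical name, and `SMARTCTL` can be overridden if the binary lives elsewhere.

```shell
#!/bin/bash
# Path to smartctl; override by setting SMARTCTL before calling.
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"

# Hypothetical helper: print the SMART health line of each member disk
# behind a MegaRAID/PERC controller.
#   $1 = block device on the controller (e.g. /dev/sdd)
#   $2 = number of member disks (indices 0 .. $2-1)
check_members() {
    member_dev="$1"
    member_count="$2"
    member_idx=0
    while [ "$member_idx" -lt "$member_count" ]; do
        printf 'disk %d: ' "$member_idx"
        # -H prints the health status; -d megaraid,N selects disk N
        "$SMARTCTL" -H -d "megaraid,$member_idx" "$member_dev" | grep -i 'health' \
            || echo 'no health line returned'
        member_idx=$((member_idx + 1))
    done
}

# Usage (as root):
#   check_members /dev/sdd 3
```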
For everything else smartctl can do, its built-in help is very detailed:
# smartctl -h
Usage: smartctl [options] device
============================================ SHOW INFORMATION OPTIONS =====
  -h, --help, --usage
        Display this help and exit
  -V, --version, --copyright, --license
        Print license, copyright, and version information and exit
  -i, --info
        Show identity information for device
  -g NAME, --get=NAME
        Get device setting: all, aam, apm, lookahead, security, wcache
  -a, --all
        Show all SMART information for device
  -x, --xall
        Show all information for device
  --scan
        Scan for devices
  --scan-open
        Scan for devices and try to open each device
================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
  -q TYPE, --quietmode=TYPE (ATA)
        Set smartctl quiet mode to one of: errorsonly, silent, noserial
  -d TYPE, --device=TYPE
        Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE],
        usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E,
        3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
  -T TYPE, --tolerance=TYPE (ATA)
        Tolerance: normal, conservative, permissive, verypermissive
  -b TYPE, --badsum=TYPE (ATA)
        Set action on bad checksum to one of: warn, exit, ignore
  -r TYPE, --report=TYPE
        Report transactions (see man page)
  -n MODE, --nocheck=MODE (ATA)
        No check if: never, sleep, standby, idle (see man page)
============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
  -s VALUE, --smart=VALUE
        Enable/disable SMART on device (on/off)
  -o VALUE, --offlineauto=VALUE (ATA)
        Enable/disable automatic offline testing on device (on/off)
  -S VALUE, --saveauto=VALUE (ATA)
        Enable/disable Attribute autosave on device (on/off)
  -s NAME[,VALUE], --set=NAME[,VALUE]
        Enable/disable/change device setting: aam,[N|off], apm,[N|off],
        lookahead,[on|off], security-freeze, standby,[N|off|now],
        wcache,[on|off]
======================================= READ AND DISPLAY DATA OPTIONS =====
  -H, --health
        Show device SMART health status
  -c, --capabilities (ATA)
        Show device SMART capabilities
  -A, --attributes
        Show device SMART vendor-specific Attributes and values
  -f FORMAT, --format=FORMAT (ATA)
        Set output format for attributes: old, brief, hex[,id|val]
  -l TYPE, --log=TYPE
        Show device log. TYPE: error, selftest, selective, directory[,g|s],
        xerror[,N][,error], xselftest[,N][,selftest],
        background, sasphy[,reset], sataphy[,reset],
        scttemp[sts,hist], scttempint,N[,p],
        scterc[,N,M], devstat[,N], ssd,
        gplog,N[,RANGE], smartlog,N[,RANGE]
  -v N,OPTION , --vendorattribute=N,OPTION (ATA)
        Set display OPTION for vendor Attribute N (see man page)
  -F TYPE, --firmwarebug=TYPE (ATA)
        Use firmware bug workaround: none, samsung, samsung2,
        samsung3, swapid
  -P TYPE, --presets=TYPE (ATA)
        Drive-specific presets: use, ignore, show, showall
  -B [+]FILE, --drivedb=[+]FILE (ATA)
        Read and replace [add] drive database from FILE
        [default is +/etc/smart_drivedb.h
        and then /usr/share/smartmontools/drivedb.h]
============================================ DEVICE SELF-TEST OPTIONS =====
  -t TEST, --test=TEST
        Run test. TEST: offline, short, long, conveyance, force, vendor,N,
        select,M-N, pending,N, afterselect,[on|off]
  -C, --captive
        Do test in captive mode (along with -t)
  -X, --abort
        Abort any non-captive test on device
=================================================== SMARTCTL EXAMPLES =====
  smartctl --all /dev/hda
        (Prints all SMART information)
  smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
        (Enables SMART on first disk)
  smartctl --test=long /dev/hda
        (Executes extended disk self-test)
  smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
        (Prints Self-Test & Attribute errors)
  smartctl --all --device=3ware,2 /dev/sda
  smartctl --all --device=3ware,2 /dev/twe0
  smartctl --all --device=3ware,2 /dev/twa0
  smartctl --all --device=3ware,2 /dev/twl0
        (Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
  smartctl --all --device=hpt,1/1/3 /dev/sda
        (Prints all SMART info for the SATA disk attached to the 3rd PMPort
        of the 1st channel on the 1st HighPoint RAID controller)
  smartctl --all --device=areca,3/1 /dev/sg2
        (Prints all SMART info for 3rd ATA disk of the 1st enclosure
        on Areca RAID controller)
http://linux-wiki.cn/wiki/zh-hans/SSD_(%E5%9B%BA%E6%80%81%E7%A1%AC%E7%9B%98)
Nagios setup
The script below checks the RAID 5 array, three disks in total.
root@web:/usr/local/nagios/libexec# vim check_disk_status.sh
#!/bin/bash
#
# Nagios plugin exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2    # must be defined, otherwise "exit $STATE_CRITICAL" exits with 0
SMARTCTL="/usr/sbin/smartctl"
CHECK_DISK="/dev/sda"
# First disk in the array
DISK_HEALTH1=`$SMARTCTL -a -d megaraid,0 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH1" = "OK" ] || [ "$DISK_HEALTH1" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 1 status is $DISK_HEALTH1 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH1 "
    exit $STATE_CRITICAL
fi
# Second disk in the array
DISK_HEALTH2=`$SMARTCTL -a -d megaraid,1 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH2" = "OK" ] || [ "$DISK_HEALTH2" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 2 status is $DISK_HEALTH2 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH2 "
    exit $STATE_CRITICAL
fi
# Third disk in the array
DISK_HEALTH3=`$SMARTCTL -a -d megaraid,2 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH3" = "OK" ] || [ "$DISK_HEALTH3" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 3 status is $DISK_HEALTH3 "
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH3 "
    exit $STATE_CRITICAL
fi
# chmod 755 check_disk_status.sh
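The three near-identical blocks in the script can also be collapsed into a single verdict. The sketch below is a variation, not the script used in this post: `summarize_health` is a hypothetical helper that takes one health string per disk, prints one summary line, and returns the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```shell
#!/bin/bash
STATE_OK=0
STATE_CRITICAL=2

# Hypothetical helper: each argument is the health string of one disk,
# in disk order. Prints a single Nagios-style line and returns STATE_OK
# when every disk reports OK or PASSED, otherwise STATE_CRITICAL.
summarize_health() {
    disk_no=0
    bad=""
    for status in "$@"; do
        disk_no=$((disk_no + 1))
        case "$status" in
            OK|PASSED) ;;
            # An empty string means smartctl gave no health line
            *) bad="$bad disk $disk_no=${status:-unknown}," ;;
        esac
    done
    if [ -z "$bad" ]; then
        echo "OK - all $disk_no disks healthy"
        return "$STATE_OK"
    fi
    # Strip the trailing comma from the list of failing disks
    echo "CRITICAL -${bad%,}"
    return "$STATE_CRITICAL"
}
```

In the plugin it could be called as `summarize_health "$DISK_HEALTH1" "$DISK_HEALTH2" "$DISK_HEALTH3"; exit $?`. Unlike the script above, which exits at the first failing disk, this reports every disk that is not healthy.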
vim /usr/local/nagios/etc/nrpe.cfg
command[check_disk_status]=/usr/bin/sudo /usr/local/nagios/libexec/check_disk_status.sh
/usr/sbin/smartctl has to run as root to read the disk status, so let the nagios user run the script through sudo without a password:
vim /etc/sudoers
#Defaults requiretty
nagios ALL=(ALL) NOPASSWD: /usr/local/nagios/libexec/check_disk_status.sh
Run this command on the Nagios server to test:
root@nagios:/usr/local/nagios/libexec# ./check_nrpe -H 192.168.2.2 -c check_disk_status
OK - /dev/sda 1 status is OK
OK - /dev/sda 2 status is OK
OK - /dev/sda 3 status is OK
Define the Nagios service:
define service{
    use                     linux-service
    host_name               192_168_2_2
    service_description     check disk status
    check_command           check_nrpe!check_disk_status
}
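Then set the check interval to once a day. There is no need to scan the disks constantly, and doing so is not good for them either.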
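As a worked example of the once-a-day setting: Nagios expresses check intervals in units of `interval_length` seconds, which defaults to 60, so one day comes to 1440 units. The sketch below only does that arithmetic; `interval_units` is a hypothetical helper, and which service directive takes the result (`check_interval`, or `normal_check_interval` in older releases) is an assumption to verify against your Nagios version.

```shell
#!/bin/bash
# Hypothetical helper: convert a period in seconds into Nagios interval units.
#   $1 = period in seconds
#   $2 = interval_length from nagios.cfg (assumed default: 60)
interval_units() {
    echo $(( $1 / ${2:-60} ))
}

interval_units 86400    # one day with the default interval_length -> 1440
```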
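Reference: http://blog.chinaunix.net/uid-20592013-id-2436813.html
Running a script and sending mail
The simplest approach is to add a script to crontab and read the mail it sends. The script follows.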