不管是SSD硬盘还是机械硬盘, 都有使用寿命.
大多数的硬盘是支持S.M.A.R.T的, 全称(
Self-Monitoring, Analysis and Reporting Technology)
smartmontools这个工具可以用来读取硬盘的smart信息, 进行测试, 改变一些属性等.
操作系统自带的smartmontools版本较低, 所以建议去源站下载一个最新的, 可以支持更多的硬盘.
使用smartmontools可以提前知道哪些硬盘可能要坏了, 做好预先更换的准备. 像EMC的CX4存储就带这个功能, 能够提前侦测出硬盘是否可能会坏了, 提前用hot spare盘顶上去. 这样能提高卷组的可靠性, 减少数据丢失的风险, 在硬盘数量较多的卷组中尤为重要.
我们也能够利用smartmontools这个工具来提前对硬盘的使用情况进行告警.
安装 :
wget http://downloads.sourceforge.net/project/smartmontools/smartmontools/6.0/smartmontools-6.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fsmartmontools%2Ffiles%2Fsmartmontools%2F6.0%2F&ts=1354091701&use_mirror=jaist
tar -zxvf smartmontools-6.0.tar.gz
cd smartmontools-6.0
./configure --prefix=/opt/smartmontools-6.0
make
make install
使用 :
查询你的硬盘是否在对应的smartmontools版本库中 :
硬盘是接在RAID卡上的, 所以这里要加上设备, 后面有一个是PCI-E设备, 不需要加--device
以下是希捷的机械盘
以下是一块DELL的SATA接口SSD硬盘, 注意这行 :
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info --device=megaraid,1 /dev/sda
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Vendor: SEAGATE
Product: ST9146803SS
Revision: FS03
User Capacity: 146,815,733,760 bytes [146 GB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c50016c512bb
Serial number: 3SD17SDR
Device type: disk
Transport protocol: SAS
Local Time is: Wed Nov 28 16:37:48 2012 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
以下是一块DELL的SATA接口SSD硬盘, 注意这行 :
Device is: Not in smartctl database [for details use: -P showall]
表示这个设备不在当前的smartctl库中.
库文件是源码中的drivedb.h, 运行smartctl时可以替换这个文件, 使用-B选项.
以下是OCZ RevoDrive3的PCI-E SSD硬盘 :
根据提示可以列出它的详细信息 :
查看你的硬盘支持哪些SMART特性 :
健康查询 :
-B [+]FILE, --drivedb=[+]FILE (ATA)
Read and replace [add] drive database from FILE
[default is +/opt/smartmontools-6.0/etc/smart_drivedb.h
and then /opt/smartmontools-6.0/share/smartmontools/drivedb.h]
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info --device=sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: MZ-5EA2000-0D3
Serial Number: S0SDNEABB01510
LU WWN Device Id: 5 002538 250005b9c
Firmware Version: AXM17D3Q
User Capacity: 200,049,647,616 bytes [200 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Wed Nov 28 16:39:14 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
以下是OCZ RevoDrive3的PCI-E SSD硬盘 :
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: OCZ-REVODRIVE3
Serial Number: OCZ-Z2134R0TLQBNE659
LU WWN Device Id: 5 e83a97 e827c316e
Firmware Version: 2.15
User Capacity: 240,068,197,888 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Nov 28 16:42:26 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
根据提示可以列出它的详细信息 :
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -P show /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Drive found in smartmontools Database. Drive identity strings:
MODEL: OCZ-REVODRIVE3
FIRMWARE: 2.15
match smartmontools Drive Database entry:
MODEL REGEXP: SandForce 1st Ed\.|ADATA SSD S(510|599) .?..GB|Corsair CSSD-F(40|60|80|115|120|160|240)GBP?2.*|Corsair Force (3 SSD|GT)|FM-25S2S-(60|120|240)GBP2|FTM(06|12|24|48)CT25H|KINGSTON SH100S3(120|240)G|OCZ[ -](AGILITY2([ -]EX)?|COLOSSUS2|ONYX2|VERTEX(2|-LE))( [123]\..*)?|OCZ-NOCTI|OCZ-REVODRIVE3?( X2)?|OCZ[ -](VELO|VERTEX2[ -](EX|PRO))( [123]\..*)?|D2[CR]STK251...-....|OCZ-(AGILITY3|SOLID3|VERTEX3( MI)?)|OCZ Z-DRIVE R4 [CR]M8[48]|(APOC|DENC|DENEVA|FTNC|GFGC|MANG|MMOC|NIMC|TMSC).*|(DENR|DRSAK|EC188|NIMR|PSIR|TRSAK).*|OWC Mercury Electra [36]G SSD|OWC Mercury Extreme Pro (RE )?SSD|Patriot Pyro|SanDisk SDSSDX(60|120|240|480)GG25|(TX32|TX31C1|VN0..GCNMK|VN0...GCNMK).*|(TX22D1|TX21B1).*|TX52D1.*|UGB(88P|99S)GC...H[BF].
FIRMWARE REGEXP: .*
MODEL FAMILY: SandForce Driven SSDs
ATTRIBUTE OPTIONS: 001 Raw_Read_Error_Rate
005 Retired_Block_Count
009 Power_On_Hours_and_Msec
013 Soft_Read_Error_Rate
100 Gigabytes_Erased
170 Reserve_Block_Count
171 Program_Fail_Count
172 Erase_Fail_Count
174 Unexpect_Power_Loss_Ct
177 Wear_Range_Delta
181 Program_Fail_Count
182 Erase_Fail_Count
184 IO_Error_Detect_Code_Ct
195 ECC_Uncorr_Error_Count
198 Uncorrectable_Sector_Ct
199 SATA_CRC_Error_Count
201 Unc_Soft_Read_Err_Rate
204 Soft_ECC_Correct_Rate
230 Life_Curve_Status
231 SSD_Life_Left
233 SandForce_Internal
234 SandForce_Internal
235 SuperCap_Health
241 Lifetime_Writes_GiB
242 Lifetime_Reads_GiB
查看你的硬盘支持哪些SMART特性 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -c /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2097) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0021) SCT Status supported.
SCT Data Table supported.
健康查询 :
DELL SAS 机械盘 :
DELL SATA SSD硬盘 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H -d megaraid,1 /dev/sda
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
SMART Health Status: OK
DELL SATA SSD硬盘 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
注意上面查询DELL SATA SSD盘是有一个告警, 这是一个BUG造成的. 如下
OCZ PCI-E SSD硬盘 :
信息 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
信息 :
列出所有信息使用-a, 或者 -x, 或者-A
测试 :
例如 :
-t TEST, --test=TEST
Run test. TEST: offline, short, long, conveyance, force, vendor,N,
select,M-N, pending,N, afterselect,[on|off]
例如 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -t short -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Wed Nov 28 17:11:37 2012
Use smartctl -X to abort test.
使用 -X 可以中断它.
查看测试结果 :
这里第一条表示当前还有测试未完成.
命令行参数 :
使用例子 :
SMART属性介绍 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -l selftest -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Self-test routine in progress 60% 2376 -
# 2 Short offline Completed without error 00% 2375 -
命令行参数 :
[root@db-172-16-3-150 man8]# /opt/smartmontools-6.0/sbin/smartctl --help
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Usage: smartctl [options] device
============================================ SHOW INFORMATION OPTIONS =====
-h, --help, --usage
Display this help and exit
-V, --version, --copyright, --license
Print license, copyright, and version information and exit
-i, --info
Show identity information for device
--identify[=[w][nvb]]
Show words and bits from IDENTIFY DEVICE data (ATA)
-g NAME, --get=NAME
Get device setting: all, aam, apm, lookahead, security, wcache
-a, --all
Show all SMART information for device
-x, --xall
Show all information for device
--scan
Scan for devices
--scan-open
Scan for devices and try to open each device
================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
-q TYPE, --quietmode=TYPE (ATA)
Set smartctl quiet mode to one of: errorsonly, silent, noserial
-d TYPE, --device=TYPE
Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
-T TYPE, --tolerance=TYPE (ATA)
Tolerance: normal, conservative, permissive, verypermissive
-b TYPE, --badsum=TYPE (ATA)
Set action on bad checksum to one of: warn, exit, ignore
-r TYPE, --report=TYPE
Report transactions (see man page)
-n MODE, --nocheck=MODE (ATA)
No check if: never, sleep, standby, idle (see man page)
============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
-s VALUE, --smart=VALUE
Enable/disable SMART on device (on/off)
-o VALUE, --offlineauto=VALUE (ATA)
Enable/disable automatic offline testing on device (on/off)
-S VALUE, --saveauto=VALUE (ATA)
Enable/disable Attribute autosave on device (on/off)
-s NAME[,VALUE], --set=NAME[,VALUE]
Enable/disable/change device setting: aam,[N|off], apm,[N|off],
lookahead,[on|off], security-freeze, standby,[N|off|now],
wcache,[on|off]
======================================= READ AND DISPLAY DATA OPTIONS =====
-H, --health
Show device SMART health status
-c, --capabilities (ATA)
Show device SMART capabilities
-A, --attributes
Show device SMART vendor-specific Attributes and values
-f FORMAT, --format=FORMAT (ATA)
Set output format for attributes: old, brief, hex[,id|val]
-l TYPE, --log=TYPE
Show device log. TYPE: error, selftest, selective, directory[,g|s],
xerror[,N][,error], xselftest[,N][,selftest],
background, sasphy[,reset], sataphy[,reset],
scttemp[sts,hist], scttempint,N[,p],
scterc[,N,M], devstat[,N], ssd,
gplog,N[,RANGE], smartlog,N[,RANGE]
-v N,OPTION , --vendorattribute=N,OPTION (ATA)
Set display OPTION for vendor Attribute N (see man page)
-F TYPE, --firmwarebug=TYPE (ATA)
Use firmware bug workaround:
none, nologdir, samsung, samsung2, samsung3, xerrorlba, swapid
-P TYPE, --presets=TYPE (ATA)
Drive-specific presets: use, ignore, show, showall
-B [+]FILE, --drivedb=[+]FILE (ATA)
Read and replace [add] drive database from FILE
[default is +/opt/smartmontools-6.0/etc/smart_drivedb.h
and then /opt/smartmontools-6.0/share/smartmontools/drivedb.h]
============================================ DEVICE SELF-TEST OPTIONS =====
-t TEST, --test=TEST
Run test. TEST: offline, short, long, conveyance, force, vendor,N,
select,M-N, pending,N, afterselect,[on|off]
-C, --captive
Do test in captive mode (along with -t)
-X, --abort
Abort any non-captive test on device
使用例子 :
=================================================== SMARTCTL EXAMPLES =====
smartctl --all /dev/hda (Prints all SMART information)
smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
(Enables SMART on first disk)
smartctl --test=long /dev/hda (Executes extended disk self-test)
smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
(Prints Self-Test & Attribute errors)
smartctl --all --device=3ware,2 /dev/sda
smartctl --all --device=3ware,2 /dev/twe0
smartctl --all --device=3ware,2 /dev/twa0
smartctl --all --device=3ware,2 /dev/twl0
(Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
smartctl --all --device=hpt,1/1/3 /dev/sda
(Prints all SMART info for the SATA disk attached to the 3rd PMPort
of the 1st channel on the 1st HighPoint RAID controller)
smartctl --all --device=areca,3/1 /dev/sg2
(Prints all SMART info for 3rd ATA disk of the 1st enclosure
on Areca RAID controller)
SMART属性介绍 :
这里有个表, 但不是每个硬盘厂商的都一样, 可能略有变化.
具体请看smartctl -A的输出.
例如 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -A /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 083 050 Pre-fail Always - 0/72989578
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 122h+57m+28.195s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
171 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 7
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline - 0
181 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
182 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 030 030 000 Old_age Always - 30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline - 0/72989578
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline - 0/72989578
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline - 0/72989578
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always - 100
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always - 0
233 SandForce_Internal 0x0000 000 000 000 Old_age Offline - 0
234 SandForce_Internal 0x0000 000 000 000 Old_age Offline - 0
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 15503
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always - 101
以上是OCZ PCI-E RevoDrive3的输出, 所以要到OCZ厂商索取SMART的attribute信息表, 根据芯片不同, 也有不同的smart属性.
如下 :
其他 :
smartd 可以用来调度硬盘检查计划.
配置文件的写法参考man smartd.conf
【参考】
1.
http://sourceforge.net/projects/smartmontools/