smartmontools monitor storage device's health status

简介:
不管是SSD硬盘还是机械硬盘, 都有使用寿命.
大多数的硬盘是支持S.M.A.R.T的, 全称( Self-Monitoring, Analysis and Reporting Technology)
smartmontools这个工具可以用来读取硬盘的smart信息, 进行测试, 改变一些属性等.
操作系统自带的smartmontools版本较低, 所以建议去源站下载一个最新的, 可以支持更多的硬盘.
使用smartmontools可以提前知道哪些硬盘可能要坏了, 做好预先更换的准备. 像EMC的CX4存储就带这个功能, 能够提前侦测出硬盘是否可能会坏了, 提前用hot spare盘顶上去. 这样能提高卷组的可靠性, 减少数据丢失的风险, 在硬盘数量较多的卷组中尤为重要.

我们也能够利用smartmontools这个工具来提前对硬盘的使用情况进行告警.

安装 :
wget http://downloads.sourceforge.net/project/smartmontools/smartmontools/6.0/smartmontools-6.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fsmartmontools%2Ffiles%2Fsmartmontools%2F6.0%2F&ts=1354091701&use_mirror=jaist
tar -zxvf smartmontools-6.0.tar.gz
cd smartmontools-6.0
./configure --prefix=/opt/smartmontools-6.0
make
make install

使用 : 
查询你的硬盘是否在对应的smartmontools版本库中 : 
硬盘是接在RAID卡上的, 所以这里要加上设备,  后面有一个是PCI-E设备, 不需要加--device
以下是希捷的机械盘
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info --device=megaraid,1 /dev/sda
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

Vendor:               SEAGATE 
Product:              ST9146803SS     
Revision:             FS03
User Capacity:        146,815,733,760 bytes [146 GB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c50016c512bb
Serial number:        3SD17SDR
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Nov 28 16:37:48 2012 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported


以下是一块DELL的SATA接口SSD硬盘, 注意这行 : 
Device is:        Not in smartctl database [for details use: -P showall]

表示这个设备不在当前的smartctl库中.
库文件是源码中的drivedb.h, 运行smartctl时可以替换这个文件, 使用-B选项.
  -B [+]FILE, --drivedb=[+]FILE                                       (ATA)
        Read and replace [add] drive database from FILE
        [default is +/opt/smartmontools-6.0/etc/smart_drivedb.h
         and then    /opt/smartmontools-6.0/share/smartmontools/drivedb.h]


[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info --device=sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     MZ-5EA2000-0D3
Serial Number:    S0SDNEABB01510
LU WWN Device Id: 5 002538 250005b9c
Firmware Version: AXM17D3Q
User Capacity:    200,049,647,616 bytes [200 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Nov 28 16:39:14 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


以下是OCZ RevoDrive3的PCI-E SSD硬盘 : 
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl --info /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     OCZ-REVODRIVE3
Serial Number:    OCZ-Z2134R0TLQBNE659
LU WWN Device Id: 5 e83a97 e827c316e
Firmware Version: 2.15
User Capacity:    240,068,197,888 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 28 16:42:26 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

根据提示可以列出它的详细信息 : 
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -P show /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              OCZ-REVODRIVE3
FIRMWARE:           2.15
match smartmontools Drive Database entry:
MODEL REGEXP:       SandForce 1st Ed\.|ADATA SSD S(510|599) .?..GB|Corsair CSSD-F(40|60|80|115|120|160|240)GBP?2.*|Corsair Force (3 SSD|GT)|FM-25S2S-(60|120|240)GBP2|FTM(06|12|24|48)CT25H|KINGSTON SH100S3(120|240)G|OCZ[ -](AGILITY2([ -]EX)?|COLOSSUS2|ONYX2|VERTEX(2|-LE))( [123]\..*)?|OCZ-NOCTI|OCZ-REVODRIVE3?( X2)?|OCZ[ -](VELO|VERTEX2[ -](EX|PRO))( [123]\..*)?|D2[CR]STK251...-....|OCZ-(AGILITY3|SOLID3|VERTEX3( MI)?)|OCZ Z-DRIVE R4 [CR]M8[48]|(APOC|DENC|DENEVA|FTNC|GFGC|MANG|MMOC|NIMC|TMSC).*|(DENR|DRSAK|EC188|NIMR|PSIR|TRSAK).*|OWC Mercury Electra [36]G SSD|OWC Mercury Extreme Pro (RE )?SSD|Patriot Pyro|SanDisk SDSSDX(60|120|240|480)GG25|(TX32|TX31C1|VN0..GCNMK|VN0...GCNMK).*|(TX22D1|TX21B1).*|TX52D1.*|UGB(88P|99S)GC...H[BF].
FIRMWARE REGEXP:    .*
MODEL FAMILY:       SandForce Driven SSDs
ATTRIBUTE OPTIONS:  001 Raw_Read_Error_Rate
                    005 Retired_Block_Count
                    009 Power_On_Hours_and_Msec
                    013 Soft_Read_Error_Rate
                    100 Gigabytes_Erased
                    170 Reserve_Block_Count
                    171 Program_Fail_Count
                    172 Erase_Fail_Count
                    174 Unexpect_Power_Loss_Ct
                    177 Wear_Range_Delta
                    181 Program_Fail_Count
                    182 Erase_Fail_Count
                    184 IO_Error_Detect_Code_Ct
                    195 ECC_Uncorr_Error_Count
                    198 Uncorrectable_Sector_Ct
                    199 SATA_CRC_Error_Count
                    201 Unc_Soft_Read_Err_Rate
                    204 Soft_ECC_Correct_Rate
                    230 Life_Curve_Status
                    231 SSD_Life_Left
                    233 SandForce_Internal
                    234 SandForce_Internal
                    235 SuperCap_Health
                    241 Lifetime_Writes_GiB
                    242 Lifetime_Reads_GiB

查看你的硬盘支持哪些SMART特性 : 
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -c /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                ( 2097) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0021) SCT Status supported.
                                        SCT Data Table supported.


健康查询 : 
DELL SAS 机械盘 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H -d megaraid,1 /dev/sda
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

SMART Health Status: OK

DELL SATA SSD硬盘 : 
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

注意上面查询DELL SATA SSD盘是有一个告警, 这是一个BUG造成的. 如下
OCZ PCI-E SSD硬盘 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -H /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


信息 : 
列出所有信息使用-a, 或者 -x, 或者-A

测试 :
  -t TEST, --test=TEST
        Run test. TEST: offline, short, long, conveyance, force, vendor,N,
                        select,M-N, pending,N, afterselect,[on|off]

例如 : 
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -t short -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Wed Nov 28 17:11:37 2012

Use smartctl -X to abort test.

使用 -X 可以中断它.

查看测试结果 : 
这里第一条表示当前还有测试未完成.
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -l selftest -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Self-test routine in progress 60%      2376         -
# 2  Short offline       Completed without error       00%      2375         -


命令行参数 : 
[root@db-172-16-3-150 man8]# /opt/smartmontools-6.0/sbin/smartctl --help
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====

  -h, --help, --usage
         Display this help and exit

  -V, --version, --copyright, --license
         Print license, copyright, and version information and exit

  -i, --info
         Show identity information for device

  --identify[=[w][nvb]]
         Show words and bits from IDENTIFY DEVICE data                (ATA)

  -g NAME, --get=NAME
        Get device setting: all, aam, apm, lookahead, security, wcache

  -a, --all
         Show all SMART information for device

  -x, --xall
         Show all information for device

  --scan
         Scan for devices

  --scan-open
         Scan for devices and try to open each device

================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====

  -q TYPE, --quietmode=TYPE                                           (ATA)
         Set smartctl quiet mode to one of: errorsonly, silent, noserial

  -d TYPE, --device=TYPE
         Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test

  -T TYPE, --tolerance=TYPE                                           (ATA)
         Tolerance: normal, conservative, permissive, verypermissive

  -b TYPE, --badsum=TYPE                                              (ATA)
         Set action on bad checksum to one of: warn, exit, ignore

  -r TYPE, --report=TYPE
         Report transactions (see man page)

  -n MODE, --nocheck=MODE                                             (ATA)
         No check if: never, sleep, standby, idle (see man page)

============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====

  -s VALUE, --smart=VALUE
        Enable/disable SMART on device (on/off)

  -o VALUE, --offlineauto=VALUE                                       (ATA)
        Enable/disable automatic offline testing on device (on/off)

  -S VALUE, --saveauto=VALUE                                          (ATA)
        Enable/disable Attribute autosave on device (on/off)

  -s NAME[,VALUE], --set=NAME[,VALUE]
        Enable/disable/change device setting: aam,[N|off], apm,[N|off],
        lookahead,[on|off], security-freeze, standby,[N|off|now],
        wcache,[on|off]

======================================= READ AND DISPLAY DATA OPTIONS =====

  -H, --health
        Show device SMART health status

  -c, --capabilities                                                  (ATA)
        Show device SMART capabilities

  -A, --attributes
        Show device SMART vendor-specific Attributes and values

  -f FORMAT, --format=FORMAT                                          (ATA)
        Set output format for attributes: old, brief, hex[,id|val]

  -l TYPE, --log=TYPE
        Show device log. TYPE: error, selftest, selective, directory[,g|s],
                               xerror[,N][,error], xselftest[,N][,selftest],
                               background, sasphy[,reset], sataphy[,reset],
                               scttemp[sts,hist], scttempint,N[,p],
                               scterc[,N,M], devstat[,N], ssd,
                               gplog,N[,RANGE], smartlog,N[,RANGE]

  -v N,OPTION , --vendorattribute=N,OPTION                            (ATA)
        Set display OPTION for vendor Attribute N (see man page)

  -F TYPE, --firmwarebug=TYPE                                         (ATA)
        Use firmware bug workaround:
        none, nologdir, samsung, samsung2, samsung3, xerrorlba, swapid

  -P TYPE, --presets=TYPE                                             (ATA)
        Drive-specific presets: use, ignore, show, showall

  -B [+]FILE, --drivedb=[+]FILE                                       (ATA)
        Read and replace [add] drive database from FILE
        [default is +/opt/smartmontools-6.0/etc/smart_drivedb.h
         and then    /opt/smartmontools-6.0/share/smartmontools/drivedb.h]

============================================ DEVICE SELF-TEST OPTIONS =====

  -t TEST, --test=TEST
        Run test. TEST: offline, short, long, conveyance, force, vendor,N,
                        select,M-N, pending,N, afterselect,[on|off]

  -C, --captive
        Do test in captive mode (along with -t)

  -X, --abort
        Abort any non-captive test on device


使用例子 : 
=================================================== SMARTCTL EXAMPLES =====

  smartctl --all /dev/hda                    (Prints all SMART information)

  smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
                                              (Enables SMART on first disk)

  smartctl --test=long /dev/hda          (Executes extended disk self-test)

  smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
                                      (Prints Self-Test & Attribute errors)
  smartctl --all --device=3ware,2 /dev/sda
  smartctl --all --device=3ware,2 /dev/twe0
  smartctl --all --device=3ware,2 /dev/twa0
  smartctl --all --device=3ware,2 /dev/twl0
          (Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
  smartctl --all --device=hpt,1/1/3 /dev/sda
          (Prints all SMART info for the SATA disk attached to the 3rd PMPort
           of the 1st channel on the 1st HighPoint RAID controller)
  smartctl --all --device=areca,3/1 /dev/sg2
          (Prints all SMART info for 3rd ATA disk of the 1st enclosure
           on Areca RAID controller)


SMART属性介绍 : 
这里有个表, 但不是每个硬盘厂商的都一样, 可能略有变化. 
具体请看smartctl -A的输出.
例如 :
[root@db-172-16-3-150 smartmontools-6.0]# /opt/smartmontools-6.0/sbin/smartctl -A /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   083   050    Pre-fail  Always       -       0/72989578
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       122h+57m+28.195s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       7
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/72989578
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/72989578
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/72989578
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       0
234 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       0
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       15503
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       101

以上是OCZ PCI-E RevoDrive3的输出, 所以要到OCZ厂商索取SMART的attribute信息表, 根据芯片不同, 也有不同的smart属性.
如下 : 

其他 : 
smartd 可以用来调度硬盘检查计划.
配置文件的写法参考man smartd.conf

【参考】
1.  http://sourceforge.net/projects/smartmontools/
目录
相关文章
|
Kubernetes Linux 容器
【kubernetes】修复 systemctl status sshd Failed to get D-Bus connection: Operation not permitted
【kubernetes】修复 systemctl status sshd Failed to get D-Bus connection: Operation not permitted
599 0
|
7月前
|
Kubernetes 容器 Perl
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is cu
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is cu
121 0
|
7月前
|
负载均衡 Java 应用服务中间件
Client not connected, current status:STARTING
Client not connected, current status:STARTING
580 1
|
网络协议
Job for named.service failed because the control process exited with error code.
Job for named.service failed because the control process exited with error code.
832 0
|
网络协议
Job for named.service failed because the control process exited with error code.怎么解决
本篇内容记录了如何解决Job for named.service failed because the control process exited with error code.的问题。
3734 0
Job for named.service failed because the control process exited with error code.怎么解决
|
Dubbo 应用服务中间件 容器
关于Failed to check the status of the service com.taotao.service.ItemService. No provider available fo
原文:http://www.bubuko.com/infodetail-2250226.html 项目中用dubbo发生:     Failed to check the status of the service com.taotao.service.ItemService. No provider available for the service 原因: Dubbo缺省会在启动时检查依赖的服务是否可用,不可用时会抛出异常,阻止Spring初始化完成,以便上线时,能及早发现问题,默认check=true。
2773 0
asm Normal redundancy storage link is broken(19C)
asm Normal redundancy storage link is broken,Storage Split
3101 0
|
网络协议 关系型数据库 PostgreSQL