前一篇BLOG讲了一下smartmontools 工具的使用.
smartmontools的数据库在不断的更新, 如果你的硬盘还没有加入到smartmontools 的 drivedb.h数据库中, 那就再等等吧.
已经加入到这里的, 也不要欣喜若狂, 读取到的SMART Attributes你都读懂了么?
SMART Attributes的定义是硬盘厂商提供的, 没有一个统一的标准, 所以比较痛苦.
下面以OCZ PCI-E RevoDrive3为例, 来看看SMART Attribute的意思 :
[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 089 089 050 - 0/113484753
5 Retired_Block_Count PO--CK 100 100 003 - 0
9 Power_On_Hours_and_Msec -O--CK 100 100 000 - 170h+08m+29.325s
12 Power_Cycle_Count -O--CK 100 100 000 - 8
171 Program_Fail_Count -O--CK 000 000 000 - 0
172 Erase_Fail_Count -O--CK 000 000 000 - 0
174 Unexpect_Power_Loss_Ct ----CK 000 000 000 - 7
177 Wear_Range_Delta ------ 000 000 000 - 2
181 Program_Fail_Count -O--CK 000 000 000 - 0
182 Erase_Fail_Count -O--CK 000 000 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
194 Temperature_Celsius -O---K 030 030 000 - 30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count --SRC- 120 120 000 - 0/113484753
196 Reallocated_Event_Count PO--CK 100 100 003 - 0
201 Unc_Soft_Read_Err_Rate --SRC- 120 120 000 - 0/113484753
204 Soft_ECC_Correct_Rate --SRC- 120 120 000 - 0/113484753
230 Life_Curve_Status PO--C- 100 100 000 - 100
231 SSD_Life_Left PO--C- 100 100 010 - 0
233 SandForce_Internal ------ 000 000 000 - 0
234 SandForce_Internal ------ 000 000 000 - 0
241 Lifetime_Writes_GiB -O--CK 000 000 000 - 44814
242 Lifetime_Reads_GiB -O--CK 000 000 000 - 363
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
解释一下每个属性的意思 :
请参看smartctl的帮助文件 -A 部分 :
ID# : 属性ID, 从1到255.
ATTRIBUTE_NAME : 属性名.
FLAG : 表示这个属性携带的标记. 使用-f brief可以打印.
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
VALUE : Normalized value, 取值范围1到254. 越低表示越差. 越高表示越好. (with 1 representing the worst case and 254 representing the best)
注意wiki上说的是1到253. 这个值是硬盘厂商根据RAW_VALUE转换来的, smartmontools工具不负责转换工作.
WORST : 表示SMART开启以来的, 所有Normalized values的最低值. (which represents the lowest recorded normalized value.)
THRESH : 阈值, 当Normalized value小于等于THRESH值时, 表示这项指标已经failed了.
Each Attribute has a "Raw" value, printed under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE".
[Note: smartctl prints these values in base-10.] In the example just given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly one year.
Each vendor uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254.
Please keep in mind that smartctl only reports the different Attribute types, values, andthresholds as read from the device.
It does not carry out the conversion between "Raw" and "Normalized values: this is done by the disk's firmware.
The conversion from Raw value to a quantity with physical units is not specified by the SMART standard.
In most cases, the values printed by smartctl are sensible.
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius. However in some cases vendors use unusual conventions.
For example the Hitachi disk on my laptop reports its power-on hours in minutes, not hours.
Some IBM disks track three temperatures rather than one, in their raw values. And so on.
WORST : 表示SMART开启以来的, 所有Normalized values的最低值. (which represents the lowest recorded normalized value.)
Each Attribute also has a "Worst" value shown under the heading "WORST".
This is the smallest (closest to failure) value that the disk has recorded at any time during its lifetime when SMART was enabled.
[Note however that some vendors firmware may actually increase the "Worst" value for some "rate-type" Attributes.]
THRESH : 阈值, 当Normalized value小于等于THRESH值时, 表示这项指标已经failed了.
Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH".
If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed.
If the Attribute is a pre-failure Attribute, then disk failure is imminent.
注意这里提到, 如果这个属性是pre-failure的, 那么这项如果出现Normalized value<=THRESH, 那么磁盘将马上failed掉.
TYPE : 这里存在两种TYPE类型, Pre-failed和Old_age.
Pre-failed 类型的Normalized value可以用来预先知道磁盘是否要坏了. 例如Normalized value接近THRESH时, 就赶紧换硬盘吧.
Old_age 类型的Normalized value是指正常的使用损耗值, 当Normalized value 接近THRESH时, 也需要注意, 但是比Pre-failed要好一点.
UPDATED : 这个字段表示这个属性的值在什么情况下会被更新. 一种是通常的操作和离线测试都更新(Always), 另一种是只在离线测试的情况下更新(Offline).
WHEN_FAILED : 这个字段表示当前这个属性的状态 : failing_now(normalized_value <= THRESH), 或者in_the_past(WORST <= THRESH), 或者 - , 正常(normalized_value以及wrost >= THRESH).
The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute.
Attributes are one of two possible types: Pre-failure or Old age.
Pre-failure Attributes are ones which, if Normalized value less than or equal to their threshold values, indicate pending disk failure.
Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute's Normalized value is less than or equal to the threshold.
Please note: the fact that an Attribute is of type ’Pre-fail’ does not mean that your disk is about to fail!
It only has this meaning if the Attribute′s current Normalized value is less than or equal to the threshold value.
UPDATED : 这个字段表示这个属性的值在什么情况下会被更新. 一种是通常的操作和离线测试都更新(Always), 另一种是只在离线测试的情况下更新(Offline).
The table column labeled "UPDATED" shows if the SMART Attribute values are updated during both normal operation and off-line testing, or only during offline testing.
The former are labeled "Always" and the latter are labeled "Offline".
WHEN_FAILED : 这个字段表示当前这个属性的状态 : failing_now(normalized_value <= THRESH), 或者in_the_past(WORST <= THRESH), 或者 - , 正常(normalized_value以及wrost >= THRESH).
If the Attribute's current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW".
If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past".
If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has also never failed in the past.
RAW_VALUE : 表示这个属性的未转换前的RAW值, 可能是计数, 也可能是温度, 也可能是其他的.
注意RAW_VALUE转换成Normalized value是由厂商的firmware提供的, smartmontools不提供转换.
其他注意事项 :
So to summarize: the Raw Attribute values are the ones that might have a real physical interpretation,
such as "Temperature Celsius", "Hours", or "Start-Stop Cycles".
Each manufacturer converts these, using their detailed knowledge of the disk's operations and failure modes, to Normalized Attribute values in the range 1-254.
The current and worst (lowest measured) of these Normalized Attribute values are stored on the disk, along with a Threshold value that the manufacturer has determined will indicate that the disk is going to fail, or that it has exceeded its design age or aging limit. smartctl does not calculate any of the Attribute values, thresholds, or types, it merely reports them from the SMART data on the device.
其他注意事项 :
1. 当SSD磁盘不在smartmontools 数据库中时, 显示的attribute name可能不准. 因为硬盘厂商可能更改了attribute id的定义.
Solid-state drives use different meanings for some of the attributes. In this case the attribute name printed by smartctl is incorrect unless the drive is already in the smartmontools drive database.
2. 注意有个FLAG是KEEP, 如果不带这个FLAG的属性, 值将不会KEEP在磁盘中, 可能出现WORST值被刷新的情况, 例如这里的ID=1的值, 已经89了, 重新执行又变成91了, 但是WORST的值并不是历史以来的最低89.
遇到这种情况的解决办法是找个地方存储这些值的历史值.
因此监控磁盘的重点在哪里呢?
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 091 091 050 - 0/113746865
5 Retired_Block_Count PO--CK 100 100 003 - 0
9 Power_On_Hours_and_Msec -O--CK 100 100 000 - 170h+43m+38.270s
12 Power_Cycle_Count -O--CK 100 100 000 - 8
171 Program_Fail_Count -O--CK 000 000 000 - 0
172 Erase_Fail_Count -O--CK 000 000 000 - 0
174 Unexpect_Power_Loss_Ct ----CK 000 000 000 - 7
177 Wear_Range_Delta ------ 000 000 000 - 2
181 Program_Fail_Count -O--CK 000 000 000 - 0
182 Erase_Fail_Count -O--CK 000 000 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
194 Temperature_Celsius -O---K 030 030 000 - 30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count --SRC- 120 120 000 - 0/113746865
196 Reallocated_Event_Count PO--CK 100 100 003 - 0
201 Unc_Soft_Read_Err_Rate --SRC- 120 120 000 - 0/113746865
204 Soft_ECC_Correct_Rate --SRC- 120 120 000 - 0/113746865
230 Life_Curve_Status PO--C- 100 100 000 - 100
231 SSD_Life_Left PO--C- 100 100 010 - 0
233 SandForce_Internal ------ 000 000 000 - 0
234 SandForce_Internal ------ 000 000 000 - 0
241 Lifetime_Writes_GiB -O--CK 000 000 000 - 45175
242 Lifetime_Reads_GiB -O--CK 000 000 000 - 363
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
因此监控磁盘的重点在哪里呢?
严重情况从上到下 :
1. 最严重的情况WHEN_FAILED =
FAILING_NOW 并且 TYPE=Pre-failed, 表示现在这个属性已经出问题了. 并且硬盘也已经failed了.
2. 次严重的情况
WHEN_FAILED =
in_the_past 并且 TYPE=Pre-failed, 表示这个属性曾经出问题了. 但是现在是正常的.
3.
WHEN_FAILED =
FAILING_NOW 并且 TYPE=Old_age, 表示现在这个属性已经出问题了. 但是硬盘可能还没有failed.
4.
WHEN_FAILED =
in_the_past 并且 TYPE=Old_age, 表示现在这个属性曾经出问题了. 但是现在是正常的.
为了避免这4种情况的发生.
1. 对于UPDATE=Offline的属性, 应该让smartd定期进行测试(smartd还可以发邮件). 或者crontab进行测试.
2. 应该时刻关注磁盘的Normalized value以及WORST的值是否接近THRESH的值了. 当有值要接近THRESH了, 提前更换硬盘.
3. 温度, 有些磁盘对温度比较敏感, 例如PCI-E SSD硬盘. 如果温度过高可能就挂了. 这里读取RAW_VALUE就比较可靠了.
监控手段,
1. 可以写nagios插件, 通过nagios来监控.
2. smartd, 它可以发邮件告警.
3. 定期收集smart数据, 分析每个属性的Normalized value的变化趋势, 更加精确的推算出磁盘的剩余使用寿命.
另一块SSD硬盘的输出 :
【参考】
【附】
[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate -O-R-K 001 001 000 - 16936932
5 Reallocated_Sector_Ct PO--CK 100 100 001 - 0
9 Power_On_Hours -O--CK 094 094 000 - 2423
12 Power_Cycle_Count -O--CK 099 099 000 - 6
13 Read_Soft_Error_Rate -O-R-K 100 100 000 - 0
175 Program_Fail_Count_Chip -O--CK 100 100 000 - 0
176 Erase_Fail_Count_Chip -O--CK 100 100 000 - 0
177 Wear_Leveling_Count -O--CK 099 099 000 - 3
178 Used_Rsvd_Blk_Cnt_Chip PO--CK 074 074 001 - 43
179 Used_Rsvd_Blk_Cnt_Tot PO--CK 091 091 001 - 895
180 Unused_Rsvd_Blk_Cnt_Tot -O--CK 091 091 000 - 9734
181 Program_Fail_Cnt_Total -O--CK 100 100 000 - 0
182 Erase_Fail_Count_Total -O--CK 100 100 000 - 0
183 Runtime_Bad_Block PO--CK 100 100 001 - 0
194 Temperature_Celsius -O---K 076 071 000 - 24 (Min/Max 22/28)
195 Hardware_ECC_Recovered -O--CK 100 100 000 - 0
198 Offline_Uncorrectable ----CK 100 100 000 - 0
199 UDMA_CRC_Error_Count -O--CK 100 100 000 - 0
202 Unknown_SSD_Attribute PO--CK 100 100 090 - 0
232 Available_Reservd_Space -O--CK 074 074 000 - 123
233 Media_Wearout_Indicator -O--CK 054 054 000 - 1953821162
240 Unknown_SSD_Attribute -O--CK 100 100 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
简单的分析一下上面这个输出,
1. Raw_Read_Error_Rate 这个值已经非常接近THRESH了, 但是它只是一个正常损耗值, 不一定会带来硬盘的failed.
2. 需要注意ID=202的这条, THRESH=90, 当前的Normalized value=100, 也有点接近了, 并且这个属性是Pre-failed的, 所以更加要注意它的变化.
1. (SMART Attributes define : OCZ Vertex2, Agility2, etc) :
补充 :
245TB 写入后
SSD_Life_Left 的VALUE 降低到了96 .
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 089 089 050 - 0/150282048
5 Retired_Block_Count PO--CK 100 100 003 - 0
9 Power_On_Hours_and_Msec -O--CK 100 100 000 - 516h+42m+05.505s
12 Power_Cycle_Count -O--CK 100 100 000 - 10
171 Program_Fail_Count -O--CK 000 000 000 - 0
172 Erase_Fail_Count -O--CK 000 000 000 - 0
174 Unexpect_Power_Loss_Ct ----CK 000 000 000 - 9
177 Wear_Range_Delta ------ 000 000 000 - 9
181 Program_Fail_Count -O--CK 000 000 000 - 0
182 Erase_Fail_Count -O--CK 000 000 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
194 Temperature_Celsius -O---K 030 030 000 - 30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count --SRC- 120 120 000 - 0/150282048
196 Reallocated_Event_Count PO--CK 100 100 003 - 0
201 Unc_Soft_Read_Err_Rate --SRC- 120 120 000 - 0/150282048
204 Soft_ECC_Correct_Rate --SRC- 120 120 000 - 0/150282048
230 Life_Curve_Status PO--C- 100 100 000 - 100
231 SSD_Life_Left PO--C- 096 096 010 - 0
233 SandForce_Internal ------ 000 000 000 - 0
234 SandForce_Internal ------ 000 000 000 - 0
241 Lifetime_Writes_GiB -O--CK 000 000 000 - 245127
242 Lifetime_Reads_GiB -O--CK 000 000 000 - 6946
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning