explain SMART Attributes

简介:
前一篇BLOG讲了一下smartmontools 工具的使用.
smartmontools的数据库在不断的更新, 如果你的硬盘还没有加入到smartmontools 的 drivedb.h数据库中, 那就再等等吧.
已经加入到这里的, 也不要欣喜若狂, 读取到的SMART Attributes你都读懂了么?
SMART Attributes的定义是硬盘厂商提供的, 没有一个统一的标准, 所以比较痛苦.
下面以OCZ PCI-E RevoDrive3为例, 来看看SMART Attribute的意思 :
[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/113484753
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+08m+29.325s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113484753
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113484753
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113484753
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    44814
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

解释一下每个属性的意思 : 
请参看smartctl的帮助文件 -A 部分 : 
ID# :  属性ID, 从1到255.
ATTRIBUTE_NAME : 属性名.
FLAG : 表示这个属性携带的标记. 使用-f brief可以打印.
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

VALUE : Normalized value, 取值范围1到254. 越低表示越差. 越高表示越好. (with 1 representing the worst case and 254 representing the best)
注意wiki上说的是1到253. 这个值是硬盘厂商根据RAW_VALUE转换来的, smartmontools工具不负责转换工作.
Each  Attribute  has  a  "Raw"  value,  printed  under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE".  
[Note: smartctl prints these values in  base-10.]   In  the  example just  given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly  one  year.
Each vendor  uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254.  
Please keep in mind that smartctl only reports  the  different  Attribute  types,  values,  andthresholds as read from the device.  
It does not carry out the conversion between "Raw" and "Normalized values: this is done by the disk's firmware.
The conversion from Raw value to a quantity with physical units is not specified by the SMART  standard.
In  most cases, the values printed by smartctl are sensible.  
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius.  However in some cases vendors use unusual conventions.   
For  example  the  Hitachi  disk  on my laptop reports its power-on hours in minutes, not hours. 
Some IBM disks track three temperatures rather than one, in their raw values.  And so on.

WORST : 表示SMART开启以来的, 所有Normalized values的最低值. (which represents the lowest recorded normalized value.)
                Each Attribute also has a "Worst" value shown under the heading "WORST".  
                This is the smallest  (closest to  failure)  value  that  the disk has recorded at any time during its lifetime when SMART was enabled.
                [Note however that some vendors firmware may actually increase the "Worst" value  for  some  "rate-type" Attributes.]

THRESH : 阈值, 当Normalized value小于等于THRESH值时, 表示这项指标已经failed了.
                 Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under  the  heading "THRESH".   
                 If  the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed.  
                 If the Attribute is a pre-failure Attribute, then disk failure is imminent.

                 注意这里提到, 如果这个属性是pre-failure的, 那么这项如果出现Normalized value<=THRESH, 那么磁盘将马上failed掉.
TYPE : 这里存在两种TYPE类型, Pre-failed和Old_age. 
            Pre-failed 类型的Normalized value可以用来预先知道磁盘是否要坏了. 例如Normalized value接近THRESH时, 就赶紧换硬盘吧.
            Old_age 类型的Normalized value是指正常的使用损耗值, 当Normalized value 接近THRESH时, 也需要注意, 但是比Pre-failed要好一点.
            The  Attribute  table printed out by smartctl also shows the "TYPE" of the Attribute. 
            Attributes are one of two possible types: Pre-failure or Old age.  
            Pre-failure Attributes are ones which, if Normalized value less  than  or equal  to their threshold values, indicate pending disk failure.
            Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute's               Normalized value is less than or equal to the threshold.
            Please note: the fact that an Attribute is of type ’Pre-fail’ does not mean that your disk is about to fail!  
            It only has this meaning if the Attribute′s  current  Normalized value is less than or equal to the threshold value.

UPDATED : 这个字段表示这个属性的值在什么情况下会被更新. 一种是通常的操作和离线测试都更新(Always), 另一种是只在离线测试的情况下更新(Offline).
                 The  table  column  labeled "UPDATED" shows if the SMART Attribute values are updated during both normal operation and off-line testing, or only during offline testing.  
                 The former are labeled "Always" and the latter are labeled "Offline".

WHEN_FAILED : 这个字段表示当前这个属性的状态 : failing_now(normalized_value <= THRESH), 或者in_the_past(WORST <= THRESH), 或者 - , 正常(normalized_value以及wrost >= THRESH).
            If  the  Attribute's  current  Normalized  value  is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". 
            If not, but the worst recorded value is  less  than  or equal  to the threshold value, then this column will display "In_the_past".  
            If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has  also  never failed in the past.

RAW_VALUE : 表示这个属性的未转换前的RAW值, 可能是计数, 也可能是温度, 也可能是其他的.
注意RAW_VALUE转换成Normalized value是由厂商的firmware提供的, smartmontools不提供转换.
So  to  summarize: the Raw Attribute values are the ones that might have a real physical interpretation,
such as "Temperature Celsius", "Hours", or "Start-Stop Cycles".  
Each manufacturer converts these, using their  detailed  knowledge of the disk's operations and failure modes, to Normalized Attribute values in the range 1-254.  
The current and worst (lowest measured)  of  these  Normalized  Attribute  values  are stored on the disk, along with a Threshold value that the manufacturer has determined will indicate that the disk is going to fail, or that it has exceeded its design age or aging  limit. smartctl  does  not calculate  any of the Attribute values, thresholds, or types, it merely reports them from the SMART data on the device.


其他注意事项 : 
1. 当SSD磁盘不在smartmontools 数据库中时, 显示的attribute name可能不准. 因为硬盘厂商可能更改了attribute id的定义.
Solid-state drives use different meanings for some of the attributes.  In this case the  attribute  name printed by smartctl is incorrect unless the drive is already in the smartmontools drive database.

2. 注意有个FLAG是KEEP, 如果不带这个FLAG的属性, 值将不会KEEP在磁盘中, 可能出现WORST值被刷新的情况, 例如这里的ID=1的值, 已经89了, 重新执行又变成91了, 但是WORST的值并不是历史以来的最低89.
遇到这种情况的解决办法是找个地方存储这些值的历史值.
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   091   091   050    -    0/113746865
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+43m+38.270s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113746865
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113746865
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113746865
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    45175
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning


因此监控磁盘的重点在哪里呢?
严重情况从上到下 : 
1. 最严重的情况WHEN_FAILED =  FAILING_NOW 并且 TYPE=Pre-failed, 表示现在这个属性已经出问题了. 并且硬盘也已经failed了.
2. 次严重的情况 WHEN_FAILED =  in_the_past 并且 TYPE=Pre-failed, 表示这个属性曾经出问题了. 但是现在是正常的.
3.  WHEN_FAILED =  FAILING_NOW 并且 TYPE=Old_age, 表示现在这个属性已经出问题了. 但是硬盘可能还没有failed.
4.  WHEN_FAILED =  in_the_past 并且 TYPE=Old_age, 表示现在这个属性曾经出问题了. 但是现在是正常的.
为了避免这4种情况的发生.
1. 对于UPDATE=Offline的属性, 应该让smartd定期进行测试(smartd还可以发邮件). 或者crontab进行测试. 
2. 应该时刻关注磁盘的Normalized value以及WORST的值是否接近THRESH的值了. 当有值要接近THRESH了, 提前更换硬盘.
3. 温度, 有些磁盘对温度比较敏感, 例如PCI-E SSD硬盘. 如果温度过高可能就挂了. 这里读取RAW_VALUE就比较可靠了.

监控手段, 
1. 可以写nagios插件, 通过nagios来监控.
2. smartd, 它可以发邮件告警.
3. 定期收集smart数据, 分析每个属性的Normalized value的变化趋势, 更加精确的推算出磁盘的剩余使用寿命.

另一块SSD硬盘的输出 :
[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O-R-K   001   001   000    -    16936932
  5 Reallocated_Sector_Ct   PO--CK   100   100   001    -    0
  9 Power_On_Hours          -O--CK   094   094   000    -    2423
 12 Power_Cycle_Count       -O--CK   099   099   000    -    6
 13 Read_Soft_Error_Rate    -O-R-K   100   100   000    -    0
175 Program_Fail_Count_Chip -O--CK   100   100   000    -    0
176 Erase_Fail_Count_Chip   -O--CK   100   100   000    -    0
177 Wear_Leveling_Count     -O--CK   099   099   000    -    3
178 Used_Rsvd_Blk_Cnt_Chip  PO--CK   074   074   001    -    43
179 Used_Rsvd_Blk_Cnt_Tot   PO--CK   091   091   001    -    895
180 Unused_Rsvd_Blk_Cnt_Tot -O--CK   091   091   000    -    9734
181 Program_Fail_Cnt_Total  -O--CK   100   100   000    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   000    -    0
183 Runtime_Bad_Block       PO--CK   100   100   001    -    0
194 Temperature_Celsius     -O---K   076   071   000    -    24 (Min/Max 22/28)
195 Hardware_ECC_Recovered  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Unknown_SSD_Attribute   PO--CK   100   100   090    -    0
232 Available_Reservd_Space -O--CK   074   074   000    -    123
233 Media_Wearout_Indicator -O--CK   054   054   000    -    1953821162
240 Unknown_SSD_Attribute   -O--CK   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

简单的分析一下上面这个输出,
1. Raw_Read_Error_Rate 这个值已经非常接近THRESH了, 但是它只是一个正常损耗值, 不一定会带来硬盘的failed. 
2. 需要注意ID=202的这条, THRESH=90, 当前的Normalized value=100, 也有点接近了, 并且这个属性是Pre-failed的, 所以更加要注意它的变化.

【参考】 【附】
1. (SMART Attributes define : OCZ Vertex2, Agility2, etc) : 
explain SMART Attributes - 德哥@Digoal - The Heart,The World.
 

补充 : 
245TB 写入后
SSD_Life_Left 的VALUE 降低到了96 .
[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/150282048
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    516h+42m+05.505s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    10
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    9
177 Wear_Range_Delta        ------   000   000   000    -    9
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/150282048
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/150282048
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/150282048
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   096   096   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    245127
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    6946
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

相关文章
|
4月前
simple-query
simple-query
23 0
Neither Quantity object nor its magnitude supports indexing
Neither Quantity object nor its magnitude supports indexing
1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'information_schema.PROFILING.SEQ' which is not functionally dependent on columns in GROUP BY clause
1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'information_schema.PROFILING.SEQ' which is not functionally dependent on columns in GROUP BY clause
172 0
1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'information_schema.PROFILING.SEQ' which is not functionally dependent on columns in GROUP BY clause
|
SQL Oracle 关系型数据库
Troubleshooting: Missing Automatic Workload Repository (AWR) Snapshots and Other Collection Issues (Doc ID 1301503.1)
Troubleshooting: Missing Automatic Workload Repository (AWR) Snapshots and Other Collection Issues (Doc ID 1301503.1)
623 0
Value &#39;EN&#39; violates facet information &#39;maxlength=1&#39;
Created by Wang, Jerry, last modified on Jan 15, 2015
127 0
Value &#39;EN&#39; violates facet information &#39;maxlength=1&#39;
Value 'EN' violates facet information 'maxlength=1'
Value 'EN' violates facet information 'maxlength=1'
145 0
Value 'EN' violates facet information 'maxlength=1'
|
前端开发 Go
smart filter无法从smart business应用获得值的问题分析
smart filter无法从smart business应用获得值的问题分析
smart filter无法从smart business应用获得值的问题分析
|
SQL 数据库
Database specific hint in One order search
Database specific hint in One order search
109 0
Database specific hint in One order search
How SAP concrete schema id is got based on transaction type plus catalog type
How SAP concrete schema id is got based on transaction type plus catalog type
100 0
How SAP concrete schema id is got based on transaction type plus catalog type
UDO report generate DDIC table
UDO report generate DDIC table
118 0
UDO report generate DDIC table

热门文章

最新文章