replace offline or FAULTED device in ZPOOL

This morning I found that one of the disks in the zpool of a zfsonlinux host had accumulated too many write errors and gone into the FAULTED state; since the vdev is a raidz1, the pool dropped to DEGRADED.
The hot spare, however, was still in the AVAIL state, meaning it had not been pulled in automatically.
(It later turned out that this is indeed the case: a hot spare is not used automatically, manual intervention is required.)
Current zpool status, with sdl already faulted:
[root@db-192-168-173-219 ~]# zpool status zp1
  pool: zp1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            sdl                                         FAULTED     11   586     0  too many errors
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        AVAIL   

errors: No known data errors
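
Before doing anything else it is worth mapping the kernel name sdl back to a persistent /dev/disk/by-id name (and from there to a physical slot), since the log device and the spare in this pool are already addressed by their scsi- WWNs. A minimal sketch; the grep patterns are just examples for this host:

ls -l /dev/disk/by-id/ | grep -w sdl                                # by-id aliases pointing at the faulted disk
ls -l /dev/disk/by-id/ | grep 36c81f660eb18e8001b32c5c61a48318a     # locate the hot spare by its WWN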

A large number of errors for this disk can be read from dmesg:
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 93 be f0 00 00 08 00
INFO: task txg_sync:25712 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync      D 0000000000000000     0 25712      2 0x00000080
 ffff8804ddaa7b70 0000000000000046 0000000000000001 ffff880232635530
 0000000000000000 0000000000000000 ffff8804ddaa7af0 ffffffff81065e02
 ffff8804ddb8b058 ffff8804ddaa7fd8 000000000000fbc8 ffff8804ddb8b058
Call Trace:
 [<ffffffff81065e02>] ? default_wake_function+0x12/0x20
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffffa02e7480>] ? zio_execute+0x0/0x140 [zfs]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02787e3>] dsl_pool_sync+0xb3/0x440 [zfs]
 [<ffffffffa028c67b>] spa_sync+0x40b/0xae0 [zfs]
 [<ffffffffa02a0bb4>] txg_sync_thread+0x384/0x5e0 [zfs]
 [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
 [<ffffffffa02a0830>] ? txg_sync_thread+0x0/0x5e0 [zfs]
 [<ffffffffa016f948>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa016f8e0>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff8109aef6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
INFO: task nfsd:28814 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000006     0 28814      2 0x00000080
 ffff8806d77adab0 0000000000000046 0000000000000000 0000000000000003
 0000000000000001 0000000000000086 ffff8806d77ada60 ffffffff81058d53
 ffff8806d7497098 ffff8806d77adfd8 000000000000fbc8 ffff8806d7497098
Call Trace:
 [<ffffffff81058d53>] ? __wake_up+0x53/0x70
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02e8473>] ? zio_nowait+0xb3/0x170 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02e39e0>] zil_commit+0x3b0/0x700 [zfs]
 [<ffffffffa02d83e2>] zfs_fsync+0x92/0x120 [zfs]
 [<ffffffffa02ee8ee>] zpl_commit_metadata+0x3e/0x60 [zfs]
 [<ffffffffa04f8e10>] commit_metadata+0x40/0x70 [nfsd]
 [<ffffffff8119775e>] ? fsnotify_create+0x5e/0x80
 [<ffffffff811983dc>] ? vfs_create+0xfc/0x110
 [<ffffffffa04fc444>] nfsd_create_v3+0x444/0x530 [nfsd]
 [<ffffffffa0503c13>] nfsd3_proc_create+0x123/0x1b0 [nfsd]
 [<ffffffffa04f4425>] nfsd_dispatch+0xe5/0x230 [nfsd]
 [<ffffffffa04a37e4>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa04a3e20>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f4b52>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f4a90>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff8109aef6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
INFO: task postgres:46313 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000004     0 46313  46300 0x00000080
 ffff8803806519b8 0000000000000082 0000000000000000 ffff880380651a08
 0000000000000001 ffff8800404c5930 ffff880623f52040 0000000000000000
 ffff8804270d3058 ffff880380651fd8 000000000000fbc8 ffff8804270d3058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff8119938a>] ? __link_path_walk+0x7ca/0xff0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:46554 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000005     0 46554  46541 0x00000080
 ffff88048fc799b8 0000000000000082 0000000000000000 ffff88048fc79a08
 0000000000000001 ffff880240a8e4a0 ffff880606aa5750 0000000000000000
 ffff8806d775f058 ffff88048fc79fd8 000000000000fbc8 ffff8806d775f058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff814c5b8a>] ? inet_sendmsg+0x4a/0xb0
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:46555 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000006     0 46555  46541 0x00000080
 ffff8807c1cab9b8 0000000000000086 0000000000000000 ffff8807c1caba08
 0000000000000001 ffff8801e8ea73e0 ffff88063991f230 0000000000000000
 ffff880806b9c638 ffff8807c1cabfd8 000000000000fbc8 ffff880806b9c638
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff8119938a>] ? __link_path_walk+0x7ca/0xff0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa017936f>] ? tsd_exit+0x5f/0x2b0 [spl]
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:47162 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000005     0 47162  47129 0x00000080
 ffff8802d1dbf9b8 0000000000000086 0000000000000000 ffff8802d1dbfa08
 0000000000000001 ffff8800bb012870 ffff880673838a80 0000000000000000
 ffff8804218f9058 ffff8802d1dbffd8 000000000000fbc8 ffff8804218f9058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff814c5b8a>] ? inet_sendmsg+0x4a/0xb0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:47184 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000004     0 47184  47129 0x00000080
 ffff8802afd519b8 0000000000000082 0000000000000000 ffff8802afd51a08
 0000000000000001 ffff88010ea124a0 ffff880373f1aa80 0000000000000000
 ffff8804260a9058 ffff8802afd51fd8 000000000000fbc8 ffff8804260a9058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffff8108b16e>] ? send_signal+0x3e/0x90
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:39751 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 39751  39591 0x00000080
 ffff8805dbd1b9b8 0000000000000082 0000000000000040 ffff8805dbd1ba08
 0000000000000001 ffff88020b036c40 ffff8804c9b02e58 0000000000000000
 ffff8808270125f8 ffff8805dbd1bfd8 000000000000fbc8 ffff8808270125f8
Call Trace:
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:12310 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 12310  39591 0x00000080
 ffff8804844bd9b8 0000000000000086 0000000000000040 ffff8804844bda08
 0000000000000001 ffff8803319737b0 ffff8803b97b06a8 0000000000000000
 ffff880583917af8 ffff8804844bdfd8 000000000000fbc8 ffff880583917af8
Call Trace:
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:19243 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 19243  46300 0x00000080
 ffff8801c2fe59b8 0000000000000086 0000000000000000 ffff8801c2fe5a08
 0000000000000001 ffff88001beec250 ffff88021e804fa0 0000000000000000
 ffff880421903af8 ffff8801c2fe5fd8 000000000000fbc8 ffff880421903af8
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffff8108b16e>] ? send_signal+0x3e/0x90
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 a4 ae 60 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 75 e4 08 00 00 08 00
scanning ...
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10)
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 50 4a 60 00 00 10 00
: 28 00 00 00 0a 10 00 00 10 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 67 e5 6d f8 00 00 08 00
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 66 95 36 98 00 00 10 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 7b fe 48 00 00 08 00

sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(16): 88 00 00 00 00 01 d1 af b4 10 00 00 00 10 00 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 93 bd 78 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Write(10): 2a 00 68 ad 3d 20 00 00 28 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 5f 5c 66 48 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 5f 5f 44 90 00 00 10 00
end_request: I/O error, dev sdl, sector 1715667304
end_request: I/O error, dev sdl, sector 7812920848
end_request: I/O error, dev sdl, sector 1716598240
end_request: I/O error, dev sdl, sector 2576
end_request: I/O error, dev sdl, sector 7812920336
end_request: I/O error, dev sdl, sector 7812920848
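
To pull only the lines for the failing disk out of a busy dmesg buffer, a simple filter helps (a sketch that assumes the device is still named sdl):

dmesg | grep -E '\[sdl\]|dev sdl' | tail -n 50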


Manually replace the faulted disk with the hot spare:
[root@db-192-168-173-219 ~]# zpool replace zp1 sdl scsi-36c81f660eb18e8001b32c5c61a48318a
[root@db-192-168-173-219 ~]# zpool status -v
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 31 08:31:53 2014
    3.41G scanned out of 8.62T at 17.4M/s, 144h42m to go
    300M resilvered, 0.04% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    FAULTED      0     0     0
              sdl                                       FAULTED     11   586     0  too many errors
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0  (resilvering)
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors

The resilver time depends on the scan speed and on how much of the zpool is already in use. In this example 8.62T is in use and the scan speed is only 17.4M/s, so the estimate is 144 hours 42 minutes:
 scan: resilver in progress since Thu Jul 31 08:31:53 2014
    3.41G scanned out of 8.62T at 17.4M/s, 144h42m to go
    300M resilvered, 0.04% done
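
Progress can be watched without retyping, and on this generation of zfsonlinux (0.6.x) the resilver throttle is exposed as module parameters. The parameter name used below is an assumption based on that era's code; check what /sys/module/zfs/parameters actually contains on your build before changing anything:

watch -n 60 'zpool status zp1 | grep -A 3 scan:'      # re-check the resilver estimate every minute
ls /sys/module/zfs/parameters/ | grep -i resilver      # which throttling knobs this build exposes
# make the resilver more aggressive (at the cost of foreground I/O), if zfs_resilver_delay exists:
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay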


If the disk is going to be physically swapped, it is best to take it offline first:
[root@db-192-168-173-219 test1]# zpool offline zp1 sdl
[root@db-192-168-173-219 opt]# zpool status -v zp1
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 31 09:43:21 2014
    246M scanned out of 8.63T at 12.3M/s, 204h27m to go
    22.1M resilvered, 0.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    DEGRADED     0     0     0
              sdl                                       OFFLINE     11   586     0
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0  (resilvering)
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors

Then physically replace the disk. Once the new disk is in place, use zpool replace to bring it in and free the spare:
zpool replace zp1 sdl sdl

The disks are very large, so the resilvering takes a long time; zpool status -v shows how much time is still needed.
The replacement went as follows:
[root@digoal ~]# MegaCli -CfgLdAdd -r0 [32:10] WB Direct -a0
                                     

Adapter 0: Configure Adapter Failed

FW error description: 
  The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.  

Exit Code: 0x54

[root@digoal ~]# MegaCli -GetPreservedCacheList -aALL
                                     
Adapter #0

Virtual Drive(Target ID 11): Missing.

Exit Code: 0x00
[root@digoal ~]# MegaCli -DiscardPreservedCache -L11 -a0
                                     
Adapter #0

Virtual Drive(Target ID 11): Preserved Cache Data Cleared.

Exit Code: 0x00
[root@digoal ~]# MegaCli -CfgLdAdd -r0 [32:10] WT Direct -a0
                                     
Adapter 0: Created VD 11

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

[root@digoal ~]# zpool replace zp1 /dev/sdl /dev/sdl
invalid vdev specification
use '-f' to override the following errors:
/dev/sdl does not contain an EFI label but it may contain partition
information in the MBR.
[root@digoal ~]# zpool replace -f zp1 /dev/sdl /dev/sdl
[root@digoal ~]# zpool status -v
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug  5 15:47:35 2014
    100M scanned out of 9.45T at 16.7M/s, 165h6m to go
    9.08M resilvered, 0.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    DEGRADED     0     0     0
              replacing-0                               OFFLINE      0     0     0
                old                                     OFFLINE     11   586     0
                sdl                                     ONLINE       0     0     0  (resilvering)
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors
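
The 'does not contain an EFI label' warning above appears because the freshly created virtual drive still carries stale MBR/partition data from its previous life. Using -f overrides the check, as done here; an alternative is to wipe the old metadata before the replace. A sketch only, and destructive, so make absolutely sure the device really is the newly created sdl:

dd if=/dev/zero of=/dev/sdl bs=1M count=10    # zero the first megabytes, destroying the old MBR/partition table
zpool replace zp1 /dev/sdl /dev/sdl           # the -f override should no longer be needed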


Next, let's simulate the whole process: offline a disk, bring the hot spare in to cover for it, replace the disk, have the replacement disk take over from the bad one, and the hot spare is automatically released back to the AVAIL state.
Create three files:
# dd if=/dev/zero of=/opt/zfs.disk1 bs=8192 count=102400
# dd if=/dev/zero of=/opt/zfs.disk2 bs=8192 count=102400
# dd if=/dev/zero of=/opt/zfs.disk3 bs=8192 count=102400
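
If the files are only throwaway test vdevs, sparse files created with truncate are faster than writing zeros with dd (the 800M size simply matches the dd commands above):

truncate -s 800M /opt/zfs.disk1 /opt/zfs.disk2 /opt/zfs.disk3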

Create the zpool:
[root@db-192-168-173-219 opt]# zpool create -o ashift=12 -o autoreplace=off zp2 mirror /opt/zfs.disk1 /opt/zfs.disk2 spare /opt/zfs.disk3

Check the current status:
[root@db-192-168-173-219 opt]# zpool status zp2
  pool: zp2
 state: ONLINE
  scan: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /opt/zfs.disk1  ONLINE       0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors

Manually offline one disk:
[root@db-192-168-173-219 opt]# zpool offline zp2 /opt/zfs.disk1

Check the current status:
[root@db-192-168-173-219 test1]# zpool status -v zp2
  pool: zp2
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Jul 31 09:20:25 2014
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            /opt/zfs.disk1  OFFLINE      0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors

Use the hot spare to stand in for the offlined disk:
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk3
[root@db-192-168-173-219 opt]# zpool status zp2
  pool: zp2
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1000K in 0h0m with 0 errors on Thu Jul 31 09:26:25 2014
config:

        NAME                  STATE     READ WRITE CKSUM
        zp2                   DEGRADED     0     0     0
          mirror-0            DEGRADED     0     0     0
            spare-0           OFFLINE      0     0     0
              /opt/zfs.disk1  OFFLINE      0     0     0
              /opt/zfs.disk3  ONLINE       0     0     0
            /opt/zfs.disk2    ONLINE       0     0     0
        spares
          /opt/zfs.disk3      INUSE     currently in use

Trying to replace directly with the original disk now fails, because the disk still carries zpool metadata. This acts as a safeguard:
[root@db-192-168-173-219 test1]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
invalid vdev specification
use '-f' to override the following errors:
/opt/zfs.disk1 is part of active pool 'zp2'

Delete the file backing the original disk and create a new file to take its place:
[root@db-192-168-173-219 opt]# rm -f zfs.disk1
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
cannot resolve path '/opt/zfs.disk1'
[root@db-192-168-173-219 opt]# dd if=/dev/zero of=/opt/zfs.disk1 bs=8192 count=102400 
102400+0 records in
102400+0 records out
838860800 bytes (839 MB) copied, 1.48687 s, 564 MB/s

Use the newly created file to replace the offlined disk. The hot spare goes back to the AVAIL state:
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
[root@db-192-168-173-219 opt]# zpool status -v zp2
  pool: zp2
 state: ONLINE
  scan: resilvered 1.05M in 0h0m with 0 errors on Thu Jul 31 09:27:28 2014
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /opt/zfs.disk1  ONLINE       0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors
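
Once the spare is back to AVAIL, a scrub is a cheap way to confirm that everything resilvered cleanly (the man-page excerpt in the notes below explains how scrub differs from resilver):

zpool scrub zp2
zpool status zp2     # look for "scrub repaired 0 ... with 0 errors"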


[Notes]
1. The zpool autoreplace property has nothing to do with hot spares; it only concerns the original device slot: when a new disk is inserted in the same physical location as a device that previously belonged to the pool, it is automatically formatted and joins the vdev the original device was in (see the sketch after the man-page excerpt below).
       autoreplace=on | off

           Controls automatic device replacement. If set to "off", device replacement must be initiated by the  admin-
           istrator  by  using the "zpool replace" command. If set to "on", any new device, found in the same physical
           location as a device that previously belonged to the pool, is automatically  formatted  and  replaced.  The
           default behavior is "off". This property can also be referred to by its shortened column name, "replace".
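
For completeness, the property can be inspected and toggled at any time with the standard zpool property commands, for example on the pool from this article:

zpool get autoreplace zp1
zpool set autoreplace=on zp1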

2. Judging from the man-page excerpt below, if the hot spare is still resilvering, it is best to wait for the resilver to finish before replacing onto the swapped-in disk.
       zpool scrub [-s] pool ...

           Begins  a  scrub. The scrub examines all data in the specified pools to verify that it checksums correctly.
           For replicated (mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.
           The  "zpool  status" command reports the progress of the scrub and summarizes the results of the scrub upon
           completion.

           Scrubbing and resilvering are very similar operations. The difference is  that  resilvering  only  examines
           data that ZFS knows to be out of date (for example, when attaching a new device to a mirror or replacing an
           existing device), whereas scrubbing examines all data to discover silent errors due to hardware  faults  or
           disk failure.

           Because  scrubbing  and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a scrub
           is already in progress, the "zpool scrub" command terminates it and starts a new scrub. If a resilver is in
           progress, ZFS does not allow a scrub to be started until the resilver completes.

           -s    Stop scrubbing.
       Example 11 Managing Hot Spares

       The following command creates a new pool with an available hot spare:

         # zpool create tank mirror sda sdb spare sdc

       If one of the disks were to fail, the pool would be reduced to the degraded state. The  failed  device  can  be
       replaced using the following command:

         # zpool replace tank sda sdd

       Once  the  data  has  been  resilvered, the spare is automatically removed and is made available for use should
       another device fails. The hot spare can be permanently removed from the pool using the following command:

         # zpool remove tank sdc
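
Related to this: if a spare remains INUSE even though the real replacement has already resilvered, it can be returned to AVAIL by detaching it from the vdev it temporarily joined. A sketch against the simulation pool above:

zpool detach zp2 /opt/zfs.disk3     # release the spare back to the AVAIL list
zpool status zp2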

