Replacing an OFFLINE or FAULTED device in a ZPOOL

Introduction:
This morning I found that one disk in a zpool on a zfsonlinux host had accumulated too many write errors and entered the FAULTED state, which left the raidz1 vdev DEGRADED.
However, the hot spare was still shown as AVAIL, i.e. the hot spare had not been pulled in automatically.
(It turned out this is expected: a hot spare is not used automatically; manual intervention is required.)
Current zpool status; sdl is already FAULTED:
[root@db-192-168-173-219 ~]# zpool status zp1
  pool: zp1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            sdl                                         FAULTED     11   586     0  too many errors
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        AVAIL   

errors: No known data errors
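As an aside, a status check like the one above is easy to script for monitoring. The sketch below is my own assumption, not part of the original workflow: it runs awk over a hardcoded snippet of `zpool status` output to flag any member whose state is not healthy; on a live host you would pipe `zpool status <pool>` in instead of the heredoc.

```shell
#!/bin/sh
# Minimal sketch: flag pool members that are not in a healthy state.
# The sample text is hardcoded for illustration; on a real host use:
#   bad=$(zpool status zp1 | awk '...')
bad=$(awk '$2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL)$/ { print $1, $2 }' <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        zp1         DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdl     FAULTED     11   586     0  too many errors
EOF
)
echo "$bad"
```

On the sample input this prints the pool, the vdev, and the faulted disk, and skips the healthy `sdb` — enough to feed an alerting hook.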

dmesg shows a large number of error messages:
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 93 be f0 00 00 08 00
INFO: task txg_sync:25712 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync      D 0000000000000000     0 25712      2 0x00000080
 ffff8804ddaa7b70 0000000000000046 0000000000000001 ffff880232635530
 0000000000000000 0000000000000000 ffff8804ddaa7af0 ffffffff81065e02
 ffff8804ddb8b058 ffff8804ddaa7fd8 000000000000fbc8 ffff8804ddb8b058
Call Trace:
 [<ffffffff81065e02>] ? default_wake_function+0x12/0x20
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffffa02e7480>] ? zio_execute+0x0/0x140 [zfs]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02787e3>] dsl_pool_sync+0xb3/0x440 [zfs]
 [<ffffffffa028c67b>] spa_sync+0x40b/0xae0 [zfs]
 [<ffffffffa02a0bb4>] txg_sync_thread+0x384/0x5e0 [zfs]
 [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
 [<ffffffffa02a0830>] ? txg_sync_thread+0x0/0x5e0 [zfs]
 [<ffffffffa016f948>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa016f8e0>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff8109aef6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
INFO: task nfsd:28814 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000006     0 28814      2 0x00000080
 ffff8806d77adab0 0000000000000046 0000000000000000 0000000000000003
 0000000000000001 0000000000000086 ffff8806d77ada60 ffffffff81058d53
 ffff8806d7497098 ffff8806d77adfd8 000000000000fbc8 ffff8806d7497098
Call Trace:
 [<ffffffff81058d53>] ? __wake_up+0x53/0x70
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02e8473>] ? zio_nowait+0xb3/0x170 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02e39e0>] zil_commit+0x3b0/0x700 [zfs]
 [<ffffffffa02d83e2>] zfs_fsync+0x92/0x120 [zfs]
 [<ffffffffa02ee8ee>] zpl_commit_metadata+0x3e/0x60 [zfs]
 [<ffffffffa04f8e10>] commit_metadata+0x40/0x70 [nfsd]
 [<ffffffff8119775e>] ? fsnotify_create+0x5e/0x80
 [<ffffffff811983dc>] ? vfs_create+0xfc/0x110
 [<ffffffffa04fc444>] nfsd_create_v3+0x444/0x530 [nfsd]
 [<ffffffffa0503c13>] nfsd3_proc_create+0x123/0x1b0 [nfsd]
 [<ffffffffa04f4425>] nfsd_dispatch+0xe5/0x230 [nfsd]
 [<ffffffffa04a37e4>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffffa04a3e20>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f4b52>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f4a90>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff8109aef6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
INFO: task postgres:46313 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000004     0 46313  46300 0x00000080
 ffff8803806519b8 0000000000000082 0000000000000000 ffff880380651a08
 0000000000000001 ffff8800404c5930 ffff880623f52040 0000000000000000
 ffff8804270d3058 ffff880380651fd8 000000000000fbc8 ffff8804270d3058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff8119938a>] ? __link_path_walk+0x7ca/0xff0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:46554 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000005     0 46554  46541 0x00000080
 ffff88048fc799b8 0000000000000082 0000000000000000 ffff88048fc79a08
 0000000000000001 ffff880240a8e4a0 ffff880606aa5750 0000000000000000
 ffff8806d775f058 ffff88048fc79fd8 000000000000fbc8 ffff8806d775f058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff814c5b8a>] ? inet_sendmsg+0x4a/0xb0
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:46555 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000006     0 46555  46541 0x00000080
 ffff8807c1cab9b8 0000000000000086 0000000000000000 ffff8807c1caba08
 0000000000000001 ffff8801e8ea73e0 ffff88063991f230 0000000000000000
 ffff880806b9c638 ffff8807c1cabfd8 000000000000fbc8 ffff880806b9c638
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff8119938a>] ? __link_path_walk+0x7ca/0xff0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8122752f>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa017936f>] ? tsd_exit+0x5f/0x2b0 [spl]
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:47162 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000005     0 47162  47129 0x00000080
 ffff8802d1dbf9b8 0000000000000086 0000000000000000 ffff8802d1dbfa08
 0000000000000001 ffff8800bb012870 ffff880673838a80 0000000000000000
 ffff8804218f9058 ffff8802d1dbffd8 000000000000fbc8 ffff8804218f9058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff814c5b8a>] ? inet_sendmsg+0x4a/0xb0
 [<ffffffff81528f0e>] ? mutex_lock+0x1e/0x50
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:47184 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000004     0 47184  47129 0x00000080
 ffff8802afd519b8 0000000000000082 0000000000000000 ffff8802afd51a08
 0000000000000001 ffff88010ea124a0 ffff880373f1aa80 0000000000000000
 ffff8804260a9058 ffff8802afd51fd8 000000000000fbc8 ffff8804260a9058
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffff8108b16e>] ? send_signal+0x3e/0x90
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:39751 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 39751  39591 0x00000080
 ffff8805dbd1b9b8 0000000000000082 0000000000000040 ffff8805dbd1ba08
 0000000000000001 ffff88020b036c40 ffff8804c9b02e58 0000000000000000
 ffff8808270125f8 ffff8805dbd1bfd8 000000000000fbc8 ffff8808270125f8
Call Trace:
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81447e03>] ? sock_sendmsg+0x123/0x150
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:12310 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 12310  39591 0x00000080
 ffff8804844bd9b8 0000000000000086 0000000000000040 ffff8804844bda08
 0000000000000001 ffff8803319737b0 ffff8803b97b06a8 0000000000000000
 ffff880583917af8 ffff8804844bdfd8 000000000000fbc8 ffff880583917af8
Call Trace:
 [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task postgres:19243 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
postgres      D 0000000000000000     0 19243  46300 0x00000080
 ffff8801c2fe59b8 0000000000000086 0000000000000000 ffff8801c2fe5a08
 0000000000000001 ffff88001beec250 ffff88021e804fa0 0000000000000000
 ffff880421903af8 ffff8801c2fe5fd8 000000000000fbc8 ffff880421903af8
Call Trace:
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffffa0177bcc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa024dde9>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
 [<ffffffffa0177cf8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02e76bb>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa0264805>] dmu_tx_count_write+0x695/0x6f0 [zfs]
 [<ffffffff8116fc6c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa016e20f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
 [<ffffffffa02648af>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa02dada6>] zfs_write+0x406/0xcf0 [zfs]
 [<ffffffff81449ab3>] ? sock_recvmsg+0x133/0x160
 [<ffffffff8108b16e>] ? send_signal+0x3e/0x90
 [<ffffffffa02ef3f2>] zpl_write_common+0x52/0x80 [zfs]
 [<ffffffffa02ef488>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 a4 ae 60 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 75 e4 08 00 00 08 00
scanning ...
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10)
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 50 4a 60 00 00 10 00
: 28 00 00 00 0a 10 00 00 10 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 67 e5 6d f8 00 00 08 00
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 66 95 36 98 00 00 10 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 7b fe 48 00 00 08 00

sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(16): 88 00 00 00 00 01 d1 af b4 10 00 00 00 10 00 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 68 93 bd 78 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Write(10): 2a 00 68 ad 3d 20 00 00 28 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 5f 5c 66 48 00 00 08 00
sd 0:2:11:0: [sdl] Unhandled error code
sd 0:2:11:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:2:11:0: [sdl] CDB: Read(10): 28 00 5f 5f 44 90 00 00 10 00
end_request: I/O error, dev sdl, sector 1715667304
end_request: I/O error, dev sdl, sector 7812920848
end_request: I/O error, dev sdl, sector 1716598240
end_request: I/O error, dev sdl, sector 2576
end_request: I/O error, dev sdl, sector 7812920336
end_request: I/O error, dev sdl, sector 7812920848


Manually replace the faulted disk with the hot spare:
[root@db-192-168-173-219 ~]# zpool replace zp1 sdl scsi-36c81f660eb18e8001b32c5c61a48318a
[root@db-192-168-173-219 ~]# zpool status -v
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 31 08:31:53 2014
    3.41G scanned out of 8.62T at 17.4M/s, 144h42m to go
    300M resilvered, 0.04% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    FAULTED      0     0     0
              sdl                                       FAULTED     11   586     0  too many errors
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0  (resilvering)
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors

The resilver time depends on the scan rate and on how much space is already used in the zpool. In this example, 8.62T is in use and the scan rate is only 17.4M/s, so the estimate comes out to 144 hours 42 minutes:
 scan: resilver in progress since Thu Jul 31 08:31:53 2014
    3.41G scanned out of 8.62T at 17.4M/s, 144h42m to go
    300M resilvered, 0.04% done
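That ETA can be sanity-checked by hand: remaining data divided by the scan rate. A rough calculation, assuming the T/G/M figures are binary units (TiB/GiB/MiB) and the rate stays constant:

```shell
#!/bin/sh
# Sanity-check the resilver ETA: (total - already scanned) / scan rate.
# 8.62T total, 3.41G scanned, 17.4M/s -- binary units assumed.
eta=$(awk 'BEGIN {
  remaining = 8.62 * 1024 * 1024 - 3.41 * 1024    # MiB left to scan
  secs = remaining / 17.4                         # seconds at 17.4 MiB/s
  h = int(secs / 3600); m = int((secs - h * 3600) / 60)
  printf "%dh%dm", h, m
}')
echo "$eta"
```

This yields roughly 144h14m, in the same ballpark as the reported 144h42m; the small gap is because `zpool status` works from exact byte counts while the printed figures are rounded.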


If a replacement disk is going to be swapped in, it's best to take the failed disk offline first:
[root@db-192-168-173-219 test1]# zpool offline zp1 sdl
[root@db-192-168-173-219 opt]# zpool status -v zp1
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 31 09:43:21 2014
    246M scanned out of 8.63T at 12.3M/s, 204h27m to go
    22.1M resilvered, 0.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    DEGRADED     0     0     0
              sdl                                       OFFLINE     11   586     0
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0  (resilvering)
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors

Then physically replace the disk. Once the new disk is in place, run replace again to release the spare back to the pool:
zpool replace zp1 sdl sdl
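The manual steps so far can be collected into a small helper. This is a hypothetical sketch (the function name and dry-run behavior are mine, not from the original procedure): it only echoes the zpool commands it would run, so the sequence can be reviewed before touching a live pool.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the command sequence for swapping out a
# failed disk. It echoes the commands instead of executing them.
replace_failed_disk() {
  pool=$1; bad=$2; spare=$3
  echo "zpool replace $pool $bad $spare"   # step 1: put the hot spare in service
  echo "zpool offline $pool $bad"          # step 2: offline before pulling the disk
  echo "# (physically swap the disk here)"
  echo "zpool replace $pool $bad $bad"     # step 3: resilver onto the new disk;
                                           # the spare returns to AVAIL afterwards
}

replace_failed_disk zp1 sdl scsi-36c81f660eb18e8001b32c5c61a48318a
```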

The disks are very large, so the resilvering process takes a long time; zpool status -v shows how much time remains.
The replacement procedure went as follows:
[root@digoal ~]# MegaCli -CfgLdAdd -r0 [32:10] WB Direct -a0
                                     

Adapter 0: Configure Adapter Failed

FW error description: 
  The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.  

Exit Code: 0x54

[root@digoal ~]# MegaCli -GetPreservedCacheList -aALL
                                     
Adapter #0

Virtual Drive(Target ID 11): Missing.

Exit Code: 0x00
[root@digoal ~]# MegaCli -DiscardPreservedCache -L11 -a0
                                     
Adapter #0

Virtual Drive(Target ID 11): Preserved Cache Data Cleared.

Exit Code: 0x00
[root@digoal ~]# MegaCli -CfgLdAdd -r0 [32:10] WT Direct -a0
                                     
Adapter 0: Created VD 11

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

[root@digoal ~]# zpool replace zp1 /dev/sdl /dev/sdl
invalid vdev specification
use '-f' to override the following errors:
/dev/sdl does not contain an EFI label but it may contain partition
information in the MBR.
[root@digoal ~]# zpool replace -f zp1 /dev/sdl /dev/sdl
[root@digoal ~]# zpool status -v
  pool: zp1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug  5 15:47:35 2014
    100M scanned out of 9.45T at 16.7M/s, 165h6m to go
    9.08M resilvered, 0.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zp1                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            sdb                                         ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
            sdj                                         ONLINE       0     0     0
            sdk                                         ONLINE       0     0     0
            spare-10                                    DEGRADED     0     0     0
              replacing-0                               OFFLINE      0     0     0
                old                                     OFFLINE     11   586     0
                sdl                                     ONLINE       0     0     0  (resilvering)
              scsi-36c81f660eb18e8001b32c5c61a48318a    ONLINE       0     0     0
        logs
          scsi-36c81f660eb18e8001af8e4ec0420e21f-part4  ONLINE       0     0     0
        spares
          scsi-36c81f660eb18e8001b32c5c61a48318a        INUSE     currently in use

errors: No known data errors


Next, let's simulate the whole procedure: offline a disk, substitute the hot spare for it, replace the disk, then have the replacement take over from the bad disk so the hot spare is automatically released back to the AVAIL state.
Create three files:
# dd if=/dev/zero of=/opt/zfs.disk1 bs=8192 count=102400
# dd if=/dev/zero of=/opt/zfs.disk2 bs=8192 count=102400
# dd if=/dev/zero of=/opt/zfs.disk3 bs=8192 count=102400

Create the zpool:
[root@db-192-168-173-219 opt]# zpool create -o ashift=12 -o autoreplace=off zp2 mirror /opt/zfs.disk1 /opt/zfs.disk2 spare /opt/zfs.disk3

Check the current status:
[root@db-192-168-173-219 opt]# zpool status zp2
  pool: zp2
 state: ONLINE
  scan: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /opt/zfs.disk1  ONLINE       0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors

Manually offline one disk:
[root@db-192-168-173-219 opt]# zpool offline zp2 /opt/zfs.disk1

Check the current status:
[root@db-192-168-173-219 test1]# zpool status -v zp2
  pool: zp2
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Jul 31 09:20:25 2014
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            /opt/zfs.disk1  OFFLINE      0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors

Substitute the hot spare for the offlined disk:
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk3
[root@db-192-168-173-219 opt]# zpool status zp2
  pool: zp2
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1000K in 0h0m with 0 errors on Thu Jul 31 09:26:25 2014
config:

        NAME                  STATE     READ WRITE CKSUM
        zp2                   DEGRADED     0     0     0
          mirror-0            DEGRADED     0     0     0
            spare-0           OFFLINE      0     0     0
              /opt/zfs.disk1  OFFLINE      0     0     0
              /opt/zfs.disk3  ONLINE       0     0     0
            /opt/zfs.disk2    ONLINE       0     0     0
        spares
          /opt/zfs.disk3      INUSE     currently in use

Trying to replace with the original disk directly fails at this point, because the disk still carries this zpool's metadata. The error acts as a safeguard:
[root@db-192-168-173-219 test1]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
invalid vdev specification
use '-f' to override the following errors:
/opt/zfs.disk1 is part of active pool 'zp2'

Delete the file backing the original disk, and create a new file to take its place:
[root@db-192-168-173-219 opt]# rm -f zfs.disk1
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
cannot resolve path '/opt/zfs.disk1'
[root@db-192-168-173-219 opt]# dd if=/dev/zero of=/opt/zfs.disk1 bs=8192 count=102400 
102400+0 records in
102400+0 records out
838860800 bytes (839 MB) copied, 1.48687 s, 564 MB/s

Replace the offlined disk with the newly created file. The hot spare then returns to the AVAIL state:
[root@db-192-168-173-219 opt]# zpool replace zp2 /opt/zfs.disk1 /opt/zfs.disk1
[root@db-192-168-173-219 opt]# zpool status -v zp2
  pool: zp2
 state: ONLINE
  scan: resilvered 1.05M in 0h0m with 0 errors on Thu Jul 31 09:27:28 2014
config:

        NAME                STATE     READ WRITE CKSUM
        zp2                 ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /opt/zfs.disk1  ONLINE       0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0
        spares
          /opt/zfs.disk3    AVAIL   

errors: No known data errors
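To spot unhealthy devices without eyeballing the whole table, the STATE column of `zpool status` can be scanned with awk. A minimal sketch, run against a captured sample of the earlier degraded output (the variable stands in for a live `zpool status zp2` call):

```shell
# Print any row whose STATE column is not ONLINE/AVAIL (sample input).
status='        NAME                STATE     READ WRITE CKSUM
        zp2                 DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            /opt/zfs.disk1  OFFLINE      0     0     0
            /opt/zfs.disk2  ONLINE       0     0     0'

echo "$status" | awk 'NR > 1 && $2 != "ONLINE" && $2 != "AVAIL" {print $1, $2}'
```

On a live system the same filter can hang off `zpool status zp2 | awk ...` in a monitoring script.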


[Note]
1. The zpool autoreplace property has nothing to do with hot spares; it only concerns the original device's slot: when a new disk is inserted in the same physical location as a disk that previously belonged to the pool, it is automatically added to that disk's vdev.
       autoreplace=on | off

           Controls automatic device replacement. If set to "off", device replacement must be initiated by the  admin-
           istrator  by  using the "zpool replace" command. If set to "on", any new device, found in the same physical
           location as a device that previously belonged to the pool, is automatically  formatted  and  replaced.  The
           default behavior is "off". This property can also be referred to by its shortened column name, "replace".
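Checking and flipping the property is straightforward. The parsing below runs against a hypothetical sample of `zpool get` output (sample values are assumptions, not from the live pool above):

```shell
# Sample output of `zpool get autoreplace zp1` (hypothetical values):
sample='NAME  PROPERTY     VALUE  SOURCE
zp1   autoreplace  off    default'

# Extract the VALUE column of the data row.
value=$(echo "$sample" | awk 'NR == 2 {print $3}')
echo "autoreplace is $value"

# To enable it on a live pool (requires root):
#   zpool set autoreplace=on zp1
```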

2. Judging from the excerpts below, if a hot spare is still resilvering, it is best to wait for the resilver to finish before running zpool replace against the replacement disk.
       zpool scrub [-s] pool ...

           Begins  a  scrub. The scrub examines all data in the specified pools to verify that it checksums correctly.
           For replicated (mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.
           The  "zpool  status" command reports the progress of the scrub and summarizes the results of the scrub upon
           completion.

           Scrubbing and resilvering are very similar operations. The difference is  that  resilvering  only  examines
           data that ZFS knows to be out of date (for example, when attaching a new device to a mirror or replacing an
           existing device), whereas scrubbing examines all data to discover silent errors due to hardware  faults  or
           disk failure.

           Because  scrubbing  and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a scrub
           is already in progress, the "zpool scrub" command terminates it and starts a new scrub. If a resilver is in
           progress, ZFS does not allow a scrub to be started until the resilver completes.

           -s    Stop scrubbing.
       Example 11 Managing Hot Spares

       The following command creates a new pool with an available hot spare:

         # zpool create tank mirror sda sdb spare sdc

       If one of the disks were to fail, the pool would be reduced to the degraded state. The  failed  device  can  be
       replaced using the following command:

         # zpool replace tank sda sdd

       Once  the  data  has  been  resilvered, the spare is automatically removed and is made available for use should
       another device fail. The hot spare can be permanently removed from the pool using the following command:

         # zpool remove tank sdc
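A small guard for point 2 can key off the `scan:` line of `zpool status`: replace only after the line no longer says a resilver is in progress. A sketch using two sample scan lines modeled on the outputs above (not a live check):

```shell
# Succeed while a resilver is still running, judged from the scan: line.
is_resilvering() {
    echo "$1" | grep -q 'resilver in progress'
}

busy='  scan: resilver in progress since Thu Jul 31 09:26:00 2014'
idle='  scan: resilvered 1000K in 0h0m with 0 errors on Thu Jul 31 09:26:25 2014'

is_resilvering "$busy" && echo 'wait before zpool replace'
is_resilvering "$idle" || echo 'safe to zpool replace'
```

On a live system the argument would come from `zpool status <pool>`, polled in a loop with a sleep between checks.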


[References]