Ceph OSD failure record



Failure start time: 2015-11-05 20:30

Failure resolution time: 2015-11-05 20:52:33

 

Symptom: a failed disk on hh-yun-ceph-cinder016-128056.vclound.com caused the Ceph cluster to raise abnormal health warnings.

 

Handling: the Ceph cluster migrated the affected data automatically and no data was lost; once the IDC colleagues replace the disk, the data will be rebalanced back onto it.

 

Log analysis:

1. The Ceph cluster was running a deep scrub. During the scrub, the failed disk on hh-yun-ceph-cinder016-128056.vclound.com (10.199.128.56) left the OSD process unresponsive, and the failure was then reported:

2015-11-05 20:30:26.084840 mon.0 240.30.128.55:6789/0 7291068 : cluster [INF] pgmap v7319699: 20544 pgs: 20542 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 142 kB/s rd, 880 kB/s wr, 232 op/s

2015-11-05 20:30:27.034297 mon.0 240.30.128.55:6789/0 7291071 : cluster [INF] osd.14 240.30.128.56:6820/137420 failed (3 reports from 3 peers after 20.000246 >= grace 20.000000)

2015-11-05 20:30:27.087421 mon.0 240.30.128.55:6789/0 7291072 : cluster [INF] pgmap v7319700: 20544 pgs: 20542 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 1001 kB/s rd, 1072 kB/s wr, 256 op/s

2015-11-05 20:30:27.142073 mon.0 240.30.128.55:6789/0 7291073 : cluster [INF] osdmap e503: 70 osds: 69 up, 70 in
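
The "failed (3 reports from 3 peers after 20.000246 >= grace 20.000000)" line means peer OSDs had stopped receiving heartbeats from osd.14 for longer than the heartbeat grace (osd_heartbeat_grace, 20 seconds by default), so mon.0 marked it down and the osdmap dropped to 69 up / 70 in. A minimal checking sketch, assuming the commands run on a node with the admin keyring; the osd.0 in the last command is only an example of a running OSD and must be queried on its own host:

# overall state and current health warnings
ceph -s
ceph health detail

# locate the down OSD in the CRUSH tree (osd.14 on 240.30.128.56 in this incident)
ceph osd tree | grep -i down

# confirm the heartbeat grace in effect via an OSD admin socket (20s by default)
ceph daemon osd.0 config get osd_heartbeat_grace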

 

2. After the failure, Ceph recalculated PG placement and counted the objects stored on the failed OSD:

2015-11-05 20:30:34.230435 mon.0 240.30.128.55:6789/0 7291081 : cluster [INF] pgmap v7319706: 20544 pgs: 19595 active+clean, 1 active+undersized+degraded+inconsistent, 948 active+undersized+degraded; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 6903 kB/s wr, 287 op/s; 160734/10441218 objects degraded (1.539%)
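
The degraded ratio is simply the object replicas that lived on the down OSD over the total replica count: 160734/10441218 ≈ 1.539%. A short sketch of how that figure and the affected PGs can be followed while the OSD is down (standard Ceph CLI, nothing cluster-specific assumed):

# degraded/misplaced counters appear in the status summary
ceph -s

# list the PGs that are currently undersized/degraded
ceph health detail | grep -i degraded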

               


 

3. The Ceph cluster then marked the failed OSD out of the cluster:

2015-11-05 20:35:28.839639 mon.0 240.30.128.55:6789/0 7291328 : cluster [INF] pgmap v7319909: 20544 pgs: 19590 active+clean, 1 active+undersized+degraded+inconsistent, 5 active+clean+scrubbing, 948 active+undersized+degraded; 13358 GB data, 40049 GB used, 215 TB / 254 TB avail; 5988 kB/s rd, 21084 kB/s wr, 1406 op/s; 160742/10441470 objects degraded (1.539%)

2015-11-05 20:35:31.419279 mon.0 240.30.128.55:6789/0 7291329 : cluster [INF] osd.14 out (down for 304.292431)
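
The "out (down for 304.292431)" message matches the default mon_osd_down_out_interval of 300 seconds: an OSD that stays down past that window is marked out and its data is re-replicated elsewhere. A sketch of how this can be inspected, or temporarily suppressed during planned disk maintenance (the monitor id is a placeholder, not taken from this cluster):

# read the down->out interval from a monitor admin socket (300s by default)
ceph daemon mon.<id> config get mon_osd_down_out_interval

# during planned maintenance, stop OSDs from being marked out automatically...
ceph osd set noout
# ...and restore the normal behaviour afterwards
ceph osd unset noout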

 

4. Ceph automatically started data recovery; no data was lost:

2015-11-05 20:35:37.483627 mon.0 240.30.128.55:6789/0 7291338 : cluster [INF] pgmap v7319914: 20544 pgs: 19590 active+clean, 1 active+undersized+degraded+inconsistent, 5 active+clean+scrubbing, 948 active+undersized+degraded; 13358 GB data, 39433 GB used, 212 TB / 250 TB avail; 252 kB/s wr, 31 op/s; 160742/10441473 objects degraded (1.539%)

2015-11-05 20:35:39.345830 mon.0 240.30.128.55:6789/0 7291340 : cluster [INF] pgmap v7319915: 20544 pgs: 19599 active+clean, 5 undersized+degraded+remapped, 557 active+undersized+degraded+remapped, 62 active+recovering+degraded, 1 active+undersized+degraded+inconsistent, 5 active+undersized+degraded+remapped+backfilling, 3 active+undersized+degraded+remapped+wait_backfill, 232 remapped+peering, 80 active+undersized+degraded; 13358 GB data, 39437 GB used, 212 TB / 250 TB avail; 5607 kB/s rd, 17453 kB/s wr, 1028 op/s; 136092/10556428 objects degraded (1.289%); 249227/10556428 objects misplaced (2.361%); 2413 MB/s, 6 keys/s, 627 objects/s recovering

2015-11-05 20:35:40.045989 mon.0 240.30.128.55:6789/0 7291341 : cluster [INF] pgmap v7319916: 20544 pgs: 19599 active+clean, 5 undersized+degraded+remapped, 576 active+undersized+degraded+remapped, 67 active+recovering+degraded, 5 active+undersized+degraded+remapped+backfilling, 3 active+undersized+degraded+remapped+wait_backfill, 273 remapped+peering, 15 active+undersized+degraded, 1 active+undersized+degraded+remapped+inconsistent; 13358 GB data, 39437 GB used, 212 TB / 250 TB avail; 4825 kB/s rd, 18997 kB/s wr, 913 op/s; 130350/10567081 objects degraded (1.234%); 261108/10567081 objects misplaced (2.471%); 2136 MB/s, 5 keys/s, 555 objects/s recovering

...............

2015-11-05 20:52:28.966672 mon.0 240.30.128.55:6789/0 7293152 : cluster [INF] pgmap v7321045: 20544 pgs: 20542 active+clean, 1 active+undersized+degraded+remapped+backfilling, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 3360 kB/s rd, 5667 kB/s wr, 490 op/s; 243/10441812 objects degraded (0.002%); 14/10441812 objects misplaced (0.000%)

2015-11-05 20:52:30.039527 mon.0 240.30.128.55:6789/0 7293154 : cluster [INF] pgmap v7321046: 20544 pgs: 20542 active+clean, 1 active+undersized+degraded+remapped+backfilling, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2491 kB/s rd, 6139 kB/s wr, 302 op/s; 243/10441812 objects degraded (0.002%); 14/10441812 objects misplaced (0.000%)

2015-11-05 20:52:31.087910 mon.0 240.30.128.55:6789/0 7293155 : cluster [INF] pgmap v7321047: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 3439 kB/s rd, 6470 kB/s wr, 356 op/s; 16398 kB/s, 4 objects/s recovering

2015-11-05 20:52:32.066947 mon.0 240.30.128.55:6789/0 7293156 : cluster [INF] pgmap v7321048: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2529 kB/s rd, 8932 kB/s wr, 388 op/s; 16853 kB/s, 4 objects/s recovering

2015-11-05 20:52:33.054558 mon.0 240.30.128.55:6789/0 7293157 : cluster [INF] pgmap v7321049: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2389 kB/s rd, 13650 kB/s wr, 380 op/s
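
Recovery ran from about 20:35 to 20:52, with the degraded ratio dropping from 1.539% to zero as backfill finished. A sketch of how the progress is usually followed, and how backfill can be throttled if client I/O suffers (the throttle values below are illustrative, not this cluster's configuration):

# follow recovery live; this prints the same [INF] pgmap stream quoted above
ceph -w

# one-line snapshot of PG states and recovery rates
ceph pg stat

# slow down backfill/recovery if client latency becomes a problem (example values)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'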


 

5. Verification showed that the Ceph cluster had recalculated the PG mappings and no PG is scheduled to store data on osd.14 anymore, which means this failure caused no data loss and, for now, does not affect service:


[root@hh-yun-ceph-cinder015-128055 tmp]# ceph pg dump | awk '{ print $15}' | grep 14

dumped all in format plain
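
The awk/grep above assumes that column 15 of ceph pg dump is the acting OSD set in this release; an empty result means no PG still maps anything onto osd.14. Once recovery is confirmed complete, the dead OSD is normally removed so the replacement disk can be prepared as a fresh OSD. A sketch of the usual steps for this era of Ceph (the device name is a placeholder; run only after ceph -s shows recovery has finished):

# confirm recovery is done before touching the cluster further
ceph -s

# remove the failed OSD from CRUSH, auth and the osdmap
ceph osd crush remove osd.14
ceph auth del osd.14
ceph osd rm 14

# after the IDC replaces the disk, prepare and activate it as a new OSD (ceph-disk era)
ceph-disk prepare /dev/sdX
ceph-disk activate /dev/sdX1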

