一、现象
1. 报警:我们的redis私有云有几百个redis实例,对于每个实例的存活,都有定时任务会每隔5分钟,重试多次ping命令去检查它的存活性,比如多次都ping不通,这时候就会收到报警短信: xxx系统-实例(10.10.xx.xx:xxxx)-由运行中变为心跳停止。这时候我就会去线上看一下出现了哪些问题。
2. 检测代码:比较简易,但是比较有效。
public boolean isRun(final String ip, final int port) { boolean isRun = new IdempotentConfirmer() { private int timeOutFactor = 1; @Override public boolean execute() { Jedis jedis = new Jedis(ip, port); try { jedis.getClient().setConnectionTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++)); jedis.getClient().setSoTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++)); String pong = jedis.ping(); return pong != null && pong.equalsIgnoreCase("PONG"); } catch (JedisDataException e) { String message = e.getMessage(); logger.warn(e.getMessage()); if (StringUtils.isNotBlank(message) && message.startsWith("LOADING")) { return true; } return false; } catch (Exception e) { logger.warn("{}:{} error message is {} ", ip, port, e.getMessage()); return false; } finally { jedis.close(); } } }.run(); return isRun; }一个简易的幂等操作器:
package com.sohu.cache.util; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** * 幂等操作器 */ public abstract class IdempotentConfirmer { private final Logger logger = LoggerFactory.getLogger(this.getClass()); private int retry = 3; protected IdempotentConfirmer(int retry) { this.retry = retry; } public IdempotentConfirmer() { } public abstract boolean execute(); public boolean run() { while (retry-- > 0) { try { boolean isOk = execute(); if (isOk){ return true; } } catch (Exception e) { logger.error(e.getMessage(), e); continue; } } return false; } }
二、Redis-Cluster的拓扑结构
1. 拓扑图:
2. 拓扑表
nodeid | ip:port | replication status |
b6073bedf256d45e1dce97cd9242bb4789d52343 | 10.10.81.94:7497 | master |
efc1dfb19e120a2f41e0cec3d03d02f43e168238 | 10.10.79.150:6389 | slave |
792a356d3cd1474b8f1374aba1e5ceced1c6868d | 10.10.81.95:7497 | master |
472823a5f04ad10027b3cde2672cc33f5accdf99 | 10.10.79.157:6390 | slave |
19544b078897d8999b68a41f2538a20cbf84288d | 10.10.83.180:6382 | master |
74d5cda02571651bba185cbc1b10d5810d048ac3 | 10.10.81.96:6391 | slave |
b95759d39b6714544917e4aefa383b4de80f871c | 10.10.81.97:7499 | master |
3379f2da275825c9ce4e476820254ccaf7b997d1 | 10.10.78.52:6398 | slave |
5711268a4881b723273e618894ba9276b27a297f | 10.10.81.98:7496 | master |
5b2aa80014f168fd26c246104b5f6adf6426275e | 10.10.78.53:6396 | slave |
三、日志分析:
1. [10:35:09] 10.10.81.97:7499(b95759d39b6714544917e4aefa383b4de80f871c)发现10.10.81.94:7497(b6073bedf256d45e1dce97cd9242bb4789d52343)不可达:
4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
2.10.10.81.97:7499(b95759d39b6714544917e4aefa383b4de80f871c)通过gossip协议广播10.10.81.94:7497(b6073bedf256d45e1dce97cd9242bb4789d52343)失败的信息,其他实例收到了这个流言:
29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
3.投票:(在redisCluster中,只有Master会参与投票,因为从节点只是负责备份用),进入新的epoch
9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
redis源码:
// 主机已经是一个出错节点了,自己作为从机可以升级为主机 void clusterHandleSlaveFailover(void) { ...... // 故障修复超时,重新启动故障修复 if (auth_age > auth_retry_time) { // 两次故障修复间隔不能过短 // 更新一些时间 ...... redisLog(REDIS_WARNING, "Start of election delayed for %lld milliseconds " "(rank #%d, offset %lld).", server.cluster->failover_auth_time - mstime(), server.cluster->failover_auth_rank, replicationGetSlaveOffset()); // 告知其他从机 /* Now that we have a scheduled election, broadcast our offset * to all the other slaves so that they'll updated their offsets * if our offset is better. */ clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES); return; } ...... // 开头投票 /* Ask for votes if needed. */ if (server.cluster->failover_auth_sent == 0) { server.cluster->currentEpoch++; server.cluster->failover_auth_epoch = server.cluster->currentEpoch; redisLog(REDIS_WARNING,"Starting a failover election for epoch %llu.", (unsigned long long) server.cluster->currentEpoch); clusterRequestFailoverAuth(); server.cluster->failover_auth_sent = 1; clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG| CLUSTER_TODO_UPDATE_STATE| CLUSTER_TODO_FSYNC_CONFIG); return; /* Wait for replies. */ } }
4. 投票结果:
(1) 10.10.79.150:6389是新的master, 成功执行了failover
3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352). 3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18. 3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master. 3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover
(2) 原来的master:10.10.81.94:7497(主) ,收到了判决书:节点失败(fail), 授权10.10.79.150:6389为它的master, 修改自己的cluster nodes, 没有了slave, cluster status ok, 开始主从复制...
3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail 3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up 3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238 3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost. 3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok .................
(3)新的主从通信成功,主从关系确定
5. 其他机器开始,认定新的slave
7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached). 7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
6. 完整日志
(1). 10.10.81.94:7497(主)
3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail 3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up 3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238 3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost. 3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok 3022:S 26 Nov 10:35:16.061 * Connecting to MASTER 10.10.79.150:6389 3022:S 26 Nov 10:35:16.061 * MASTER <-> SLAVE sync started 3022:S 26 Nov 10:35:16.062 * Non blocking connect for SYNC fired the event. 3022:S 26 Nov 10:35:16.062 * Master replied to PING, replication can continue... 3022:S 26 Nov 10:35:16.062 * Partial resynchronization not possible (no cached master) 3022:S 26 Nov 10:35:16.062 * Full resync from master: 80ac45a2b236fc44ebbef3d851e535bcbdd6367b:21516181353 3022:S 26 Nov 10:35:45.273 * MASTER <-> SLAVE sync: receiving 494097946 bytes from master 3022:S 26 Nov 10:35:51.052 * MASTER <-> SLAVE sync: Flushing old data 3022:S 26 Nov 10:36:22.165 * MASTER <-> SLAVE sync: Loading DB in memory 3022:S 26 Nov 10:37:44.355 * MASTER <-> SLAVE sync: Finished with success 3022:S 26 Nov 10:37:44.440 * Background append only file rewriting started by pid 4036 3022:S 26 Nov 10:38:34.988 * AOF rewrite child asks to stop sending diffs. 4036:C 26 Nov 10:38:34.988 * Parent agreed to stop sending diffs. Finalizing AOF... 4036:C 26 Nov 10:38:34.988 * Concatenating 0.02 MB of AOF diff received from parent. 4036:C 26 Nov 10:38:34.989 * SYNC append only file rewrite performed 4036:C 26 Nov 10:38:35.029 * AOF rewrite: 5 MB of memory used by copy-on-write 3022:S 26 Nov 10:38:35.127 * Background AOF rewrite terminated with success 3022:S 26 Nov 10:38:35.127 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB) 3022:S 26 Nov 10:38:35.128 * Background AOF rewrite finished successfully
(2) 10.10.79.150:6389(从)
3173:S 26 Nov 10:35:10.934 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352). 3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18. 3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master. 3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover 3173:M 26 Nov 10:35:11.660 # Connection with master lost. 3173:M 26 Nov 10:35:11.660 * Caching the disconnected master state. 3173:M 26 Nov 10:35:11.660 * Discarding previously cached master state. 3173:M 26 Nov 10:35:15.605 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 3173:M 26 Nov 10:35:16.071 * Slave 10.10.81.94:7497 asks for synchronization 3173:M 26 Nov 10:35:16.071 * Full resync requested by slave 10.10.81.94:7497 3173:M 26 Nov 10:35:16.071 * Starting BGSAVE for SYNC with target: disk 3173:M 26 Nov 10:35:16.188 * Background saving started by pid 23531 3173:M 26 Nov 10:35:18.091 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 3173:M 26 Nov 10:35:31.056 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 3173:M 26 Nov 10:35:40.040 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. 23531:C 26 Nov 10:35:45.048 * DB saved on disk 23531:C 26 Nov 10:35:45.113 * RDB: 6 MB of memory used by copy-on-write 3173:M 26 Nov 10:35:45.281 * Background saving terminated with success 3173:M 26 Nov 10:35:51.059 * Synchronization with slave 10.10.81.94:7497 succeeded 3173:M 26 Nov 10:36:11.107 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 3173:M 26 Nov 10:36:22.277 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(3) 10.10.81.95:7497(主)
14795:M 26 Nov 10:35:10.309 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 14795:M 26 Nov 10:35:11.034 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18 14795:M 26 Nov 10:35:14.979 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 14795:M 26 Nov 10:36:20.480 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached). 14795:M 26 Nov 10:36:31.652 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(4) 10.10.79.157:6390(从)
29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 29177:S 26 Nov 10:35:15.609 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 29177:S 26 Nov 10:36:11.111 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 29177:S 26 Nov 10:36:22.281 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(5) 10.10.83.180:6382(主)
9611:M 26 Nov 10:35:11.342 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18 9611:M 26 Nov 10:35:16.013 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 9611:M 26 Nov 10:36:11.516 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 9611:M 26 Nov 10:36:22.686 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(6) 10.10.81.96:6391(从)
7925:S 26 Nov 10:35:10.260 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 7925:S 26 Nov 10:35:14.930 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached). 7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(7) 10.10.81.97:7499(主)
4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached). 4051:M 26 Nov 10:35:10.507 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18 4051:M 26 Nov 10:35:14.452 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 4051:M 26 Nov 10:36:09.955 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 4051:M 26 Nov 10:36:21.125 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(8) 10.10.78.52:6398(从)
22972:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 22972:S 26 Nov 10:35:15.635 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 22972:S 26 Nov 10:36:11.116 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 22972:S 26 Nov 10:36:22.286 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(9) 10.10.81.98:7496(主)
2062:M 26 Nov 10:35:10.908 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 2062:M 26 Nov 10:35:11.634 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18 2062:M 26 Nov 10:35:15.599 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 2062:M 26 Nov 10:36:11.080 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 2062:M 26 Nov 10:36:22.250 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(10) 10.10.78.53:6396(从)
19527:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343 19527:S 26 Nov 10:35:15.615 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again. 19527:S 26 Nov 10:36:11.117 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343 19527:S 26 Nov 10:36:22.287 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
四、原因查找
redis节点被认为失败,可能有如下原因:
1. 机器本身问题:负载过高,网络阻塞,cpu过高等等。
通过tsar命令,查询机器当时各个指标,未发现有什么问题:
Time ---cpu-- ---mem-- ---tcp-- -----traffic---- --sda--- --sdb--- --sdc--- --sdd--- --sde--- --sdf--- --sdg--- --sdh--- --sdi--- --sdj--- --sdk--- --sdl--- --sdm--- ---load- Time util util retran bytin bytout util util util util util util util util util util util util util load1 26/11/15-10:20 4.07 67.89 0.00 1.7M 1.7M 0.05 0.00 1.75 0.00 8.07 0.00 0.00 0.00 0.00 0.00 0.38 0.25 0.00 1.77 26/11/15-10:25 4.11 67.89 0.00 1.7M 1.6M 0.01 0.00 1.95 0.00 7.45 0.00 0.00 0.00 0.00 0.00 3.47 0.27 0.00 0.94 26/11/15-10:30 4.37 67.89 0.00 1.7M 1.7M 0.18 0.00 2.00 0.00 7.79 0.00 0.00 0.00 0.00 0.00 5.31 0.15 0.00 0.94 26/11/15-10:35 4.86 68.38 0.00 1.8M 1.7M 0.47 0.00 2.40 0.00 7.99 0.00 0.00 0.00 0.00 0.00 3.06 0.15 0.00 1.22 26/11/15-10:40 7.29 67.66 0.00 2.9M 1.6M 1.55 0.00 2.92 0.00 8.46 0.00 0.00 0.00 0.00 0.00 0.87 0.11 0.00 1.14 26/11/15-10:45 4.08 67.68 0.00 1.2M 1.6M 0.09 0.00 0.34 0.00 9.12 0.00 0.00 0.00 0.00 0.00 0.25 1.35 0.00 0.46
2. cluster-node-timeout: 这个我们设置的是15秒,一直未发现问题,从别人的分享看,这个数字已经足够了。
3. 被判断失败的redis本身的问题:阻塞操作(比较常见的有: aof阻塞、慢查询):从redis的日志看没有发生aof阻塞的情况,但是从慢查询的日志中,发现了如下,瞬间就明白了,许多慢查询,redis的单线程什么都堵住了。
五、注意:
本文的日志时间做过纠正,实际有些机器没有做时间同步,造成日志实际对不上,不利排查,而且在分布式系统中,时间的一致性通常非常重要,所以这个要注意。
参考文献:
- Random and small clusterdown(github redis)
- redis集群定时clusterdown分析