Redis cluster reports cluster_state:fail: how to fix it and bring the cluster back (IP mismatch / incompletely assigned slots)


Error

127.0.0.1:6379> set name tom        ---》writing a test key to the Redis cluster fails
-> Redirected to slot [5798] located at 192.168.3.2:6379
(error) CLUSTERDOWN The cluster is down
192.168.3.2:6379> cluster info
cluster_state:fail       ---》the cluster state is fail, so the cluster is down
cluster_slots_assigned:16384
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:5461
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:2
cluster_stats_messages_ping_sent:2203
cluster_stats_messages_pong_sent:392
cluster_stats_messages_meet_sent:4
cluster_stats_messages_fail_sent:4
cluster_stats_messages_sent:2603
cluster_stats_messages_ping_received:391
cluster_stats_messages_pong_received:310
cluster_stats_messages_meet_received:1
cluster_stats_messages_fail_received:1
cluster_stats_messages_received:703
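
In this output the fields worth reading first are cluster_state and the cluster_slots_* counters: 5461 of the 16384 slots are in the fail state, which is about one master's share in a three-master cluster. As a minimal sketch, the helper below condenses `cluster info` output into one line. The function name `cluster_health` is made up for illustration; only the field names come from Redis.

```shell
# Summarize the health fields of `CLUSTER INFO` output read from stdin.
# cluster_health is a hypothetical helper name, not a Redis command.
cluster_health() {
  # redis-cli output may carry CR line endings, so strip them first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    $1 == "cluster_slots_ok"       { ok = $2 }
    $1 == "cluster_slots_fail"     { fail = $2 }
    END { printf "state=%s assigned=%s ok=%s fail=%s\n", state, assigned, ok, fail }
  '
}

# Example against a live node (address taken from the session above):
#   redis-cli -h 192.168.3.2 -p 6379 cluster info | cluster_health
```

A non-zero fail count together with assigned=16384 suggests that every slot has an owner but one owner is unreachable, which is the situation investigated below.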

Solution


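Checking the logs of all the Redis containers shows that the redis-5 container keeps retrying its connection to 192.168.3.1:6379, the master node node-1, and is refused every time.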

[root@es-node22 ~]# docker logs -f redis-5
......
1:S 28 May 2022 13:07:53.233 # Cluster state changed: fail
1:S 28 May 2022 13:07:53.442 * Connecting to MASTER 192.168.3.1:6379
1:S 28 May 2022 13:07:53.442 * MASTER <-> REPLICA sync started
1:S 28 May 2022 13:07:53.442 # Error condition on socket for SYNC: Connection refused
1:S 28 May 2022 13:07:54.481 * Connecting to MASTER 192.168.3.1:6379
1:S 28 May 2022 13:07:54.481 * MASTER <-> REPLICA sync started
......

Check the node IP set in node-1's Redis configuration file, redis.conf:

[root@es-node22 ~]# cat /root/redis/node-1/conf/redis.conf
port 6379
bind 0.0.0.0
cluster-enabled yes
cluster-config-file nodes.conf   ---》the Redis cluster node configuration file
cluster-node-timeout 5000
cluster-announce-ip 192.168.3.11   ---》node-1's configuration file announces the IP 192.168.3.11
cluster-announce-port 6379
cluster-announce-bus-port 16379
appendonly yes

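Check the cluster's current slot-to-node mapping, which cluster slots returns as nested arrays:

192.168.3.2:6379> cluster slots    ---》current slot layout of the cluster, shown as arrays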
1) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "192.168.3.3"
      2) (integer) 6379
      3) "ff0d1d636f94d9b092e6012408c1d0918e00e6ed"
   4) 1) "192.168.3.4"
      2) (integer) 6379
      3) "2113cf366ad27ebd73585f03d368e77f03b1a2e1"
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "192.168.3.1"           ---》the cluster knows this node as 192.168.3.1
      2) (integer) 6379
      3) "c856c94ba8d2c55a0d176831bc85aa34a96fde88"
   4) 1) "192.168.3.5"
      2) (integer) 6379
      3) "d92ff5984ab29370af0adeaca71e7938c0287ca5"
3) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "192.168.3.2"
      2) (integer) 6379
      3) "8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3"
   4) 1) "192.168.3.6"
      2) (integer) 6379
      3) "2108a90495c147c675328f9b8b4fa49e2b856faf"

Check the cluster node configuration file, nodes.conf:

[root@es-node22 ~]# cat /root/redis/node-1/data/nodes.conf
c856c94ba8d2c55a0d176831bc85aa34a96fde88 192.168.3.1:6379@16379 myself,master - 0 1653743266000 1 connected 0-5460
d92ff5984ab29370af0adeaca71e7938c0287ca5 192.168.3.5:6379@16379 slave c856c94ba8d2c55a0d176831bc85aa34a96fde88 0 1653743274000 5 connected
2108a90495c147c675328f9b8b4fa49e2b856faf 192.168.3.6:6379@16379 slave 8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 0 1653743275531 6 connected
2113cf366ad27ebd73585f03d368e77f03b1a2e1 192.168.3.4:6379@16379 slave ff0d1d636f94d9b092e6012408c1d0918e00e6ed 0 1653743275531 4 connected
8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 192.168.3.2:6379@16379 master - 0 1653743275531 2 connected 5461-10922
ff0d1d636f94d9b092e6012408c1d0918e00e6ed 192.168.3.3:6379@16379 master - 0 1653743275000 3 connected 10923-16383
vars currentEpoch 6 lastVoteEpoch 0
[root@es-node22 ~]# cat /root/redis/node-2/data/nodes.conf
ff0d1d636f94d9b092e6012408c1d0918e00e6ed 192.168.3.3:6379@16379 master - 0 1653743273233 3 connected 10923-16383
2113cf366ad27ebd73585f03d368e77f03b1a2e1 192.168.3.4:6379@16379 slave ff0d1d636f94d9b092e6012408c1d0918e00e6ed 0 1653743271151 4 connected
c856c94ba8d2c55a0d176831bc85aa34a96fde88 192.168.3.1:6379@16379 master,fail - 1653743267074 1653743266961 1 connected 0-5460
d92ff5984ab29370af0adeaca71e7938c0287ca5 192.168.3.5:6379@16379 slave c856c94ba8d2c55a0d176831bc85aa34a96fde88 0 1653743272000 1 connected
8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 192.168.3.2:6379@16379 myself,master - 0 1653743271000 2 connected 5461-10922
2108a90495c147c675328f9b8b4fa49e2b856faf 192.168.3.6:6379@16379 slave 8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 0 1653743272194 6 connected
vars currentEpoch 6 lastVoteEpoch 0

Every node's nodes.conf records node-1 as 192.168.3.1:6379, which does not match the cluster-announce-ip of 192.168.3.11 in node-1's redis.conf.
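
The same comparison can be scripted. The sketch below assumes the directory layout used in this article (`<node dir>/conf/redis.conf` and `<node dir>/data/nodes.conf`); the function names are made up for illustration.

```shell
# Print the IP that redis.conf announces via cluster-announce-ip.
announced_ip() {
  awk '$1 == "cluster-announce-ip" { print $2; exit }' "$1"
}

# Print the IP that nodes.conf records for the node itself ("myself" flag).
recorded_ip() {
  awk '$3 ~ /(^|,)myself(,|$)/ { split($2, a, ":"); print a[1]; exit }' "$1"
}

# Compare both values for one node directory (hypothetical helper).
check_node_ip() {
  conf_ip=$(announced_ip "$1/conf/redis.conf")
  node_ip=$(recorded_ip "$1/data/nodes.conf")
  if [ "$conf_ip" = "$node_ip" ]; then
    echo "match $conf_ip"
  else
    echo "mismatch redis.conf=$conf_ip nodes.conf=$node_ip"
  fi
}

# Example for the layout used here:
#   for i in $(seq 1 6); do check_node_ip /root/redis/node-$i; done
```

This only checks each node's own entry. The entries that the other nodes keep for node-1 still have to be inspected as shown in the listings above.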


Update this node's IP in the nodes.conf file of every Redis node in one pass:

[root@es-node22 ~]# for i in $(seq 1 6); do \
> sed -i 's/192\.168\.3\.1:6379/192.168.3.11:6379/' /root/redis/node-${i}/data/nodes.conf
> done
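
A slightly more defensive variant of this loop keeps a backup of each file and only rewrites the address when it is followed by a colon and port, so that running it a second time cannot turn 192.168.3.11 into 192.168.3.111. This is a sketch, assuming a sed that supports `-i` (GNU or BusyBox); `rewrite_node_ip` is a hypothetical helper name.

```shell
# Replace OLD_IP with NEW_IP in one nodes.conf file.
# Usage: rewrite_node_ip OLD_IP NEW_IP FILE
rewrite_node_ip() {
  old=$1
  new=$2
  file=$3
  # Escape the dots so they match literally instead of "any character".
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  # Keep the first backup only, so a re-run does not overwrite it.
  [ -e "$file.bak" ] || cp -p "$file" "$file.bak"
  # Match " OLD_IP:" (address field followed by the port separator).
  sed -i "s/ ${old_re}:/ ${new}:/" "$file"
}

# Example for the layout used here:
#   for i in $(seq 1 6); do
#     rewrite_node_ip 192.168.3.1 192.168.3.11 /root/redis/node-$i/data/nodes.conf
#   done
```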

Check the nodes.conf files after the change:

[root@es-node22 ~]# cat /root/redis/node-1/data/nodes.conf                                                  
c856c94ba8d2c55a0d176831bc85aa34a96fde88 192.168.3.11:6379@16379 myself,master - 0 1653743266000 1 connected 0-5460
d92ff5984ab29370af0adeaca71e7938c0287ca5 192.168.3.5:6379@16379 slave c856c94ba8d2c55a0d176831bc85aa34a96fde88 0 1653743274000 5 connected
2108a90495c147c675328f9b8b4fa49e2b856faf 192.168.3.6:6379@16379 slave 8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 0 1653743275531 6 connected
2113cf366ad27ebd73585f03d368e77f03b1a2e1 192.168.3.4:6379@16379 slave ff0d1d636f94d9b092e6012408c1d0918e00e6ed 0 1653743275531 4 connected
8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 192.168.3.2:6379@16379 master - 0 1653743275531 2 connected 5461-10922
ff0d1d636f94d9b092e6012408c1d0918e00e6ed 192.168.3.3:6379@16379 master - 0 1653743275000 3 connected 10923-16383
vars currentEpoch 6 lastVoteEpoch 0
[root@es-node22 ~]# cat /root/redis/node-2/data/nodes.conf
ff0d1d636f94d9b092e6012408c1d0918e00e6ed 192.168.3.3:6379@16379 master - 0 1653743273233 3 connected 10923-16383
2113cf366ad27ebd73585f03d368e77f03b1a2e1 192.168.3.4:6379@16379 slave ff0d1d636f94d9b092e6012408c1d0918e00e6ed 0 1653743271151 4 connected
c856c94ba8d2c55a0d176831bc85aa34a96fde88 192.168.3.11:6379@16379 master,fail - 1653743267074 1653743266961 1 connected 0-5460
d92ff5984ab29370af0adeaca71e7938c0287ca5 192.168.3.5:6379@16379 slave c856c94ba8d2c55a0d176831bc85aa34a96fde88 0 1653743272000 1 connected
8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 192.168.3.2:6379@16379 myself,master - 0 1653743271000 2 connected 5461-10922
2108a90495c147c675328f9b8b4fa49e2b856faf 192.168.3.6:6379@16379 slave 8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 0 1653743272194 6 connected
vars currentEpoch 6 lastVoteEpoch 0
...
...

Restart all Redis cluster node containers in one pass:

[root@es-node22 ~]# docker restart $(docker ps | grep redis | awk '{print $1}')
dcd802a160c6
6e2f628457f6
f05d3dfb9c8b
220df78836e9
31e7b232f1d1
1de91b4d4e68
[root@es-node22 ~]# docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED       STATUS       PORTS                                                                                      NAMES
6e2f628457f6   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6376->6379/tcp, :::6376->6379/tcp, 0.0.0.0:16376->16379/tcp, :::16376->16379/tcp   redis-6
f05d3dfb9c8b   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6375->6379/tcp, :::6375->6379/tcp, 0.0.0.0:16375->16379/tcp, :::16375->16379/tcp   redis-5
220df78836e9   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6374->6379/tcp, :::6374->6379/tcp, 0.0.0.0:16374->16379/tcp, :::16374->16379/tcp   redis-4
31e7b232f1d1   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6373->6379/tcp, :::6373->6379/tcp, 0.0.0.0:16373->16379/tcp, :::16373->16379/tcp   redis-3
1de91b4d4e68   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6372->6379/tcp, :::6372->6379/tcp, 0.0.0.0:16372->16379/tcp, :::16372->16379/tcp   redis-2
dcd802a160c6   redis:5.0.9-alpine3.11   "docker-entrypoint.s…"   3 hours ago   Up 2 hours   0.0.0.0:6371->6379/tcp, :::6371->6379/tcp, 0.0.0.0:16371->16379/tcp, :::16371->16379/tcp   redis-1
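
After a restart the nodes need a moment to reconnect over the cluster bus, so the state may not be ok immediately. As a sketch, the helper below polls until `cluster_state:ok` appears. The name `wait_cluster_ok` is made up, and the command that fetches `cluster info` is passed in as arguments rather than hard-coded.

```shell
# Poll until the given command prints a line "cluster_state:ok".
# Usage: wait_cluster_ok ATTEMPTS COMMAND [ARGS...]
# Returns 0 once the state is ok, 1 if it never becomes ok.
wait_cluster_ok() {
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    # Strip CR line endings, then look for the exact status line.
    if "$@" 2>/dev/null | tr -d '\r' | grep -qx 'cluster_state:ok'; then
      return 0
    fi
    tries=$((tries - 1))
    # Wait one second between attempts, but not after the last one.
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}

# Example with the container name used in this article:
#   wait_cluster_ok 30 docker exec redis-1 redis-cli cluster info
```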

Check the Redis cluster state again:

[root@es-node22 ~]# docker exec -it redis-1 /bin/sh    ---》this Redis image has no bash by default, so use /bin/sh
/data # redis-cli -c
127.0.0.1:6379> cluster info
cluster_state:ok      ---》the Redis cluster state is now ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:236
cluster_stats_messages_pong_sent:233
cluster_stats_messages_sent:469
cluster_stats_messages_ping_received:233
cluster_stats_messages_pong_received:232
cluster_stats_messages_received:465
127.0.0.1:6379> cluster nodes
c856c94ba8d2c55a0d176831bc85aa34a96fde88 192.168.3.11:6379@16379 master - 0 1653752958838 1 connected 0-5460
8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 192.168.3.2:6379@16379 myself,master - 0 1653752957000 2 connected 5461-10922
2113cf366ad27ebd73585f03d368e77f03b1a2e1 192.168.3.4:6379@16379 slave ff0d1d636f94d9b092e6012408c1d0918e00e6ed 0 1653752957804 4 connected
2108a90495c147c675328f9b8b4fa49e2b856faf 192.168.3.6:6379@16379 slave 8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3 0 1653752957086 6 connected
ff0d1d636f94d9b092e6012408c1d0918e00e6ed 192.168.3.3:6379@16379 master - 0 1653752958000 3 connected 10923-16383
d92ff5984ab29370af0adeaca71e7938c0287ca5 192.168.3.5:6379@16379 slave c856c94ba8d2c55a0d176831bc85aa34a96fde88 0 1653752958529 1 connected
127.0.0.1:6379> cluster slots
1) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "192.168.3.2"
      2) (integer) 6379
      3) "8b01b1bc6202e1dc7ff9f15013d8200b10ecb3f3"
   4) 1) "192.168.3.6"
      2) (integer) 6379
      3) "2108a90495c147c675328f9b8b4fa49e2b856faf"
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "192.168.3.11"      ---》the cluster now lists this node under the corrected IP
      2) (integer) 6379
      3) "c856c94ba8d2c55a0d176831bc85aa34a96fde88"
   4) 1) "192.168.3.5"
      2) (integer) 6379
      3) "d92ff5984ab29370af0adeaca71e7938c0287ca5"
3) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "192.168.3.3"
      2) (integer) 6379
      3) "ff0d1d636f94d9b092e6012408c1d0918e00e6ed"
   4) 1) "192.168.3.4"
      2) (integer) 6379
      3) "2113cf366ad27ebd73585f03d368e77f03b1a2e1"

Another case


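 A cluster can also report cluster_state:fail because its slots are not fully assigned. To protect cluster integrity, Redis by default (cluster-require-full-coverage yes) makes the whole cluster unavailable as soon as any one of the 16384 slots is not assigned to a node. This guarantees that every slot is served by an online Redis node. In this case, assigning the missing slots again restores the cluster.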


 For this case, see the article: Fixing unassigned slots
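
To tell this case apart from the IP problem above, it helps to count how many slots the masters actually own. The sketch below does that from `cluster nodes` output, where slot fields start at the ninth column and are either a range such as 0-5460 or a single number. `assigned_slots` is a hypothetical helper name.

```shell
# Count the slots owned by master entries in `CLUSTER NODES` output (stdin).
# Bracketed fields such as "[93->-<node id>]" describe slots being
# migrated or imported and are skipped.
assigned_slots() {
  tr -d '\r' | awk '
    $3 ~ /master/ {
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[/) continue
        n = split($i, r, "-")
        total += (n == 2) ? r[2] - r[1] + 1 : 1
      }
    }
    END { print total + 0 }
  '
}

# Example against a live node:
#   redis-cli -h 192.168.3.2 -p 6379 cluster nodes | assigned_slots
```

A result below 16384 means some slots have no owner. In Redis 5 and later, `redis-cli --cluster check <host>:<port>` reports slot coverage and `redis-cli --cluster fix <host>:<port>` attempts to repair it; missing slots can also be assigned by hand with `CLUSTER ADDSLOTS`.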

