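Background
Recently, in one of our projects, we deployed an akka (2.6.8) cluster on k8s. We ran into the following problem: if an unreachable node is never restarted manually, other nodes can no longer join the cluster.
Concretely, the pod of one non-seed node restarted and was rescheduled onto a different k8s node, so it came back with a new IP. The cluster kept trying to connect to the previous node (IP), and that is what caused the failure.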
Root cause analysis
Let's first look at the concept of Gossip Convergence, which the akka docs describe as follows:
Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section). This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.
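In short: as long as any node is unreachable, gossip cannot converge. The node has to become reachable again, or be moved to the down and removed states. Only the leader's cluster membership management is blocked; the application running on top of the cluster is not affected. During a network partition, new nodes can still join, but they are not moved to the up state until the partition has healed or the unreachable nodes have been downed.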
Clearly, akka requires every node to be either reachable or down before convergence can happen.
The membership-lifecycle section makes the same point:
If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to move forward, the node must become reachable again or the node must be explicitly “downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.
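That is, an unreachable node has to either become reachable again or be downed. An unreachable node may have crashed, or it may only be temporarily unreachable because of a network glitch or a GC pause. akka cannot tell these cases apart, so it keeps trying to reconnect indefinitely.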
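The blocking behaviour described above can be illustrated with a toy model. This is only a sketch in plain Java, not akka code: the class `ConvergenceSketch`, its methods, and the IP addresses are all hypothetical, and downing is simplified to dropping the member from the list.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the akka implementation): the leader may only promote
// JOINING members to UP while no member is unreachable.
final class ConvergenceSketch {
    enum Status { JOINING, UP }

    static final class Member {
        final String address;
        Status status;
        boolean reachable;

        Member(String address, Status status, boolean reachable) {
            this.address = address;
            this.status = status;
            this.reachable = reachable;
        }
    }

    // Simplification: convergence is possible only if no member is unreachable.
    static boolean canConverge(List<Member> members) {
        for (Member m : members) {
            if (!m.reachable) return false;
        }
        return true;
    }

    // Leader action: move JOINING members to UP, but only when convergence
    // is possible. Returns how many members were promoted.
    static int leaderPromote(List<Member> members) {
        if (!canConverge(members)) return 0;
        int promoted = 0;
        for (Member m : members) {
            if (m.status == Status.JOINING) {
                m.status = Status.UP;
                promoted++;
            }
        }
        return promoted;
    }

    // Assumption of this sketch: downing simply drops the member from the list.
    static void down(List<Member> members, String address) {
        members.removeIf(m -> m.address.equals(address));
    }

    public static void main(String[] args) {
        List<Member> members = new ArrayList<>();
        members.add(new Member("10.0.0.1", Status.UP, true));
        members.add(new Member("10.0.0.2", Status.UP, false));     // old pod IP, now gone
        members.add(new Member("10.0.0.3", Status.JOINING, true)); // rescheduled pod
        System.out.println(leaderPromote(members)); // prints 0: blocked by 10.0.0.2
        down(members, "10.0.0.2");
        System.out.println(leaderPromote(members)); // prints 1: 10.0.0.3 becomes UP
    }
}
```

This mirrors our situation: the stale IP stays unreachable forever, so the rescheduled pod stays stuck in the joining state until the stale member is downed.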
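Solution
Now that the problem is clear, we need to fix it. The answer is in the official docs: move unreachable nodes to the down state automatically. There are two ways to do that:
Trigger the state change actively via HTTP requests
Introduce the split-brain-resolver (SBR)
We will not go into the first option here and go with the second.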
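For orientation only, the HTTP option comes down to a single request against the Akka Management cluster HTTP endpoint. The sketch below merely builds that request with the JDK HTTP client (Java 11+) and does not send it. It assumes the akka-management-cluster-http module is running on its default port 8558 with write operations enabled (`akka.management.http.route-providers-read-only = false`); the class name, the system name `MySystem`, and the address are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: build the "down this member" request for Akka Management.
// Assumed endpoint: PUT /cluster/members/{address} with form field operation=Down.
final class DownRequestSketch {
    static HttpRequest downRequest(String managementBaseUrl, String memberAddress) {
        URI uri = URI.create(managementBaseUrl + "/cluster/members/" + memberAddress);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .PUT(HttpRequest.BodyPublishers.ofString("operation=Down"))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical member address: the stale pod IP left behind after rescheduling.
        HttpRequest req = downRequest("http://localhost:8558", "MySystem@10.0.0.5:25520");
        System.out.println(req.method() + " " + req.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```

The drawback of this route is that something outside the cluster has to notice the unreachable node and issue the call, which is why an automatic downing provider is attractive.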
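SBR offers five strategies: static-quorum, keep-majority, keep-oldest, down-all, and lease-majority
We use the keep-majority strategy. For the pros, cons, and use cases of each of the five strategies, see the strategies section of the official docs.
Here is the akka configuration for the keep-majority strategy: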
akka.coordinated-shutdown.exit-jvm = on
akka.coordinated-shutdown.exit-code = 0
akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
akka.cluster.split-brain-resolver.down-all-when-unstable = off
akka.cluster.split-brain-resolver.stable-after = 20s
akka.cluster.split-brain-resolver.active-strategy = keep-majority
akka.cluster.split-brain-resolver.keep-majority.role = "admin"
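Note on akka.cluster.split-brain-resolver.keep-majority.role: when this is set, only members with that role are counted when deciding which side holds the majority. So if the cluster is, for whatever reason, left with only a minority of its nodes (fewer than half of the cluster), but that remaining side holds the majority of the nodes with the configured role, those nodes will not exit.
If the setting is left out, all members are counted, the minority side shuts itself down completely, and the whole cluster goes down.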
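The effect of the role setting can be shown with a small decision function. This is a simplified sketch of the keep-majority rule, not the akka source: the class `KeepMajoritySketch`, the role name `worker`, and the addresses are hypothetical, addresses are compared as plain strings for the tie-break, and the case where no member has the role is decided arbitrarily.

```java
import java.util.List;
import java.util.Set;

// Simplified keep-majority decision, as seen from one side of a partition.
final class KeepMajoritySketch {
    enum Decision { DOWN_UNREACHABLE, DOWN_REACHABLE }

    static final class Node {
        final String address;
        final Set<String> roles;
        final boolean reachable; // reachable from the deciding side

        Node(String address, Set<String> roles, boolean reachable) {
            this.address = address;
            this.roles = roles;
            this.reachable = reachable;
        }
    }

    // role == null means "no role configured": every member is counted.
    static Decision decide(List<Node> members, String role) {
        int reachable = 0;
        int unreachable = 0;
        Node lowest = null;
        for (Node n : members) {
            if (role != null && !n.roles.contains(role)) continue; // not counted
            if (n.reachable) reachable++; else unreachable++;
            if (lowest == null || n.address.compareTo(lowest.address) < 0) lowest = n;
        }
        if (reachable > unreachable) return Decision.DOWN_UNREACHABLE; // we are the majority
        if (reachable < unreachable) return Decision.DOWN_REACHABLE;   // we are the minority
        // Tie: keep the side that contains the lowest address.
        // Sketch assumption: with no counted members at all, the own side goes down.
        return (lowest != null && lowest.reachable)
                ? Decision.DOWN_UNREACHABLE
                : Decision.DOWN_REACHABLE;
    }

    public static void main(String[] args) {
        // Five nodes; our side only sees the two admin nodes.
        List<Node> members = List.of(
                new Node("10.0.0.1", Set.of("admin"), true),
                new Node("10.0.0.2", Set.of("admin"), true),
                new Node("10.0.0.3", Set.of("worker"), false),
                new Node("10.0.0.4", Set.of("worker"), false),
                new Node("10.0.0.5", Set.of("worker"), false));
        System.out.println(decide(members, null));    // DOWN_REACHABLE: 2 of 5, own side exits
        System.out.println(decide(members, "admin")); // DOWN_UNREACHABLE: 2 of 2 admins, own side stays
    }
}
```

With the role set to "admin", the two surviving admin nodes keep running and down the other three; without it, the same two nodes would shut themselves down.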