Akka Cluster Split Brain Resolver (SBR)


Background


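In a recent project we deployed an Akka (2.6.8) cluster on Kubernetes. We ran into the following problem: once a node became unreachable, other nodes could no longer join the cluster until someone restarted things manually.

Concretely, the pod of a non-seed node restarted and was rescheduled onto a different Kubernetes node, so it came back with a new IP. The cluster kept trying to connect to the old node (the old IP), and that caused the failure.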


Root cause analysis


First, let's look at the concept of Gossip Convergence, as described in the Akka documentation:

 Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section).    
 This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network    
 partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.

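In other words, gossip convergence cannot be reached while any node is unreachable. The node has to become reachable again or be moved to the down and removed states. This only blocks the leader from doing its membership management; applications running on top of the cluster are not affected. During a network partition new nodes can still join, but they are not moved to the up state until the partition heals or the unreachable nodes are downed.

So Akka requires every member to be either reachable or down before the cluster can reach agreement on its membership.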


The Membership Lifecycle section makes the same point:

 If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to    
 move forward, the node must become reachable again or the node must be explicitly “downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if 
 the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.

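That is, an unreachable node must either become reachable again or be explicitly downed before the cluster can move forward. A node may look unreachable because it crashed, but also because of network jitter or because GC pauses put the server under heavy load. Akka cannot tell these cases apart, so it keeps trying to reconnect indefinitely.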
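The rule quoted above can be illustrated with a small model. This is a minimal sketch in plain Java, not Akka's implementation: the class names, the `leaderCanAct` method, and the addresses are all hypothetical, and the member states are reduced to three. It only shows why a stale unreachable entry (the old pod IP) keeps a newly joined node stuck in the joining state until that entry is downed.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the convergence rule described above. NOT Akka's implementation.
final class ConvergenceSketch {
    // The leader may act (for example move JOINING members to UP) only when
    // every unreachable member has already been marked DOWN.
    static boolean leaderCanAct(List<GossipMember> members) {
        return members.stream()
                .allMatch(m -> m.reachable || m.status == MemberStatus.DOWN);
    }

    public static void main(String[] args) {
        // Hypothetical addresses: 10.0.0.2 is the old pod IP, 10.0.0.9 the rescheduled pod.
        List<GossipMember> beforeDowning = Arrays.asList(
                new GossipMember("10.0.0.1:25520", MemberStatus.UP, true),
                new GossipMember("10.0.0.2:25520", MemberStatus.UP, false),
                new GossipMember("10.0.0.9:25520", MemberStatus.JOINING, true));
        List<GossipMember> afterDowning = Arrays.asList(
                new GossipMember("10.0.0.1:25520", MemberStatus.UP, true),
                new GossipMember("10.0.0.2:25520", MemberStatus.DOWN, false),
                new GossipMember("10.0.0.9:25520", MemberStatus.JOINING, true));

        System.out.println("before downing: " + leaderCanAct(beforeDowning)); // false
        System.out.println("after downing:  " + leaderCanAct(afterDowning));  // true
    }
}

// Reduced set of member states, enough for this illustration.
enum MemberStatus { JOINING, UP, DOWN }

final class GossipMember {
    final String address;       // hypothetical host:port
    final MemberStatus status;
    final boolean reachable;    // as observed by the failure detector

    GossipMember(String address, MemberStatus status, boolean reachable) {
        this.address = address;
        this.status = status;
        this.reachable = reachable;
    }
}
```

In this model the joining node can only be promoted once the entry for the old IP has been marked down, which is exactly the step that was missing in our deployment.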


Solution


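The fix can be found in the official documentation: unreachable nodes have to be moved to the down state automatically. There are two ways to do this:

1. Trigger the state change explicitly with an HTTP request

2. Introduce the Split Brain Resolver (SBR)

We leave the first option for readers to explore and use the second one: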

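SBR provides five strategies: static-quorum, keep-majority, keep-oldest, down-all, and lease-majority.

We use the keep-majority strategy. For the pros, cons, and intended use cases of each of the five strategies, see the strategies section of the official documentation.

Here is the Akka configuration for the keep-majority strategy: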

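 # Exit the JVM once coordinated shutdown completes, so the process of a downed node actually terminates
 akka.coordinated-shutdown.exit-jvm = on
 akka.coordinated-shutdown.exit-code = 0
 # Use the Split Brain Resolver as the downing provider
 akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
 akka.cluster.split-brain-resolver.down-all-when-unstable = off
 # Wait until the membership view has been stable for 20s before making a decision
 akka.cluster.split-brain-resolver.stable-after = 20s
 akka.cluster.split-brain-resolver.active-strategy = keep-majority
 # Count only members with this role when determining the majority
 akka.cluster.split-brain-resolver.keep-majority.role = "admin"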


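Note on akka.cluster.split-brain-resolver.keep-majority.role: when a role is configured, the majority is determined only among the members that have that role. So if, for whatever reason, only a minority of the cluster's nodes (fewer than half) remain, but those nodes hold the majority of the members with this role, they stay up instead of shutting down. Without this setting, the majority is counted across all members, so the remaining minority downs itself and the whole cluster goes down.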
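The effect of the role setting is easier to see in a small model of the keep-majority decision. The following is a minimal sketch in plain Java under stated assumptions, not the SBR implementation: the class names, the `decide` method, and the "worker" role are hypothetical, addresses are compared as plain strings for the tie-break, and an empty role stands for "no role configured".

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Simplified model of the keep-majority decision. NOT the SBR implementation.
final class KeepMajoritySketch {
    // Decide from the point of view of one side of the partition.
    // role == "" means no role is configured, so all members are counted.
    static Decision decide(List<SbrMember> members, String role) {
        List<SbrMember> counted = members.stream()
                .filter(m -> role.isEmpty() || m.role.equals(role))
                .collect(Collectors.toList());
        long reachable = counted.stream().filter(m -> m.reachable).count();
        long unreachable = counted.size() - reachable;

        if (reachable > unreachable) return Decision.DOWN_UNREACHABLE; // we are the majority
        if (reachable < unreachable) return Decision.DOWN_REACHABLE;   // we are the minority

        // Equal sizes: keep the side that contains the lowest address.
        // Sketch simplification: string ordering, and an empty count downs our own side.
        Optional<SbrMember> lowest =
                counted.stream().min(Comparator.comparing((SbrMember m) -> m.address));
        return lowest.isPresent() && lowest.get().reachable
                ? Decision.DOWN_UNREACHABLE
                : Decision.DOWN_REACHABLE;
    }

    public static void main(String[] args) {
        // View from a side that still sees 2 "admin" nodes and has lost 3 "worker" nodes.
        List<SbrMember> view = Arrays.asList(
                new SbrMember("10.0.0.1:25520", "admin", true),
                new SbrMember("10.0.0.2:25520", "admin", true),
                new SbrMember("10.0.0.3:25520", "worker", false),
                new SbrMember("10.0.0.4:25520", "worker", false),
                new SbrMember("10.0.0.5:25520", "worker", false));

        System.out.println("no role:      " + decide(view, ""));      // DOWN_REACHABLE
        System.out.println("role = admin: " + decide(view, "admin")); // DOWN_UNREACHABLE
    }
}

enum Decision { DOWN_UNREACHABLE, DOWN_REACHABLE }

final class SbrMember {
    final String address;    // hypothetical host:port
    final String role;       // single role per member, for simplicity
    final boolean reachable; // as seen from the deciding side

    SbrMember(String address, String role, boolean reachable) {
        this.address = address;
        this.role = role;
        this.reachable = reachable;
    }
}
```

With no role, the two surviving nodes are a minority of five and down themselves. With role = "admin", only the two admin members are counted, both are reachable, and the surviving side stays up and downs the unreachable nodes instead.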

