Business Background and Requirements
To give stateful applications the highest possible availability when deployed in Kubernetes, their multi-availability-zone deployment has to meet stricter requirements:
- For higher availability, business pods should be spread as evenly as possible across multiple availability zones.
- Business pods must be able to mount cloud disks for persistence in their respective availability zones, and a pod and its disk must always stay in the same AZ (cloud disks cannot be attached across zones).
- If a pod or a node fails, the pod must be rescheduled as quickly as possible onto another available machine in the same availability zone. That machine can be created on demand by elastic scaling, but for the fastest self-healing it should ideally already be pre-warmed in the node pool, so the pod can be scheduled immediately onto an available node in the same zone and re-attach the cloud disk it was using before.
- When the workload needs to scale out, resource expansion and cloud disk creation and attachment should be fully automatic, with the deployment kept as balanced as possible across availability zones.
Solution Design
Stateful workloads usually choose block storage as their storage device for higher I/O performance. Because a block storage device can only be attached to ECS instances in the same availability zone, a cross-zone deployment of a stateful workload has to guarantee that an application pod and its block storage always stay in the same zone; otherwise, if a business pod migrates to another zone, it can no longer attach the block device it was using and the application breaks. To guarantee availability for the workload and keep operations manageable, we plan the solution as follows:
- Create an elastic node pool in each of the three availability zones, all with the same labels, taints, and instance type, and make sure the same disk type (for example ESSD) is available in all three zones. This ensures that only the StatefulSet workload can be scheduled onto this group of resources, avoiding performance or availability interference from co-located workloads.
- Define a StorageClass with type set to cloud_essd,cloud_ssd,cloud_efficiency (ESSD is tried first), and set volumeBindingMode to WaitForFirstConsumer, so that in a multi-availability-zone cluster the pod is scheduled first and the cloud disk is then created in the zone where that pod landed:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sts-sc
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd,cloud_ssd,cloud_efficiency
  #regionId: cn-beijing
  #zoneId: cn-beijing-h,cn-beijing-g,cn-beijing-k
  #fstype: ext4
  #diskTags: "a:b,b:c"
  #encrypted: "false"
  #performanceLevel: PL1
  #volumeExpandAutoSnapshot: "forced" # only takes effect when type is "cloud_essd"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
allowVolumeExpansion: true
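A quick way to apply and sanity-check the StorageClass (assuming the manifest above is saved as sts-sc.yaml; the filename is only an example):
k apply -f sts-sc.yaml
# the output should show WaitForFirstConsumer in the VOLUMEBINDINGMODE column
k get sc sts-sc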
- On top of that, use the placeholder-pod capability: a placeholder pod scales out and pre-warms a compute node in advance, the stateful pod can then be scheduled straight onto the node the placeholder pod occupies, and once the placeholder pod is evicted it automatically triggers one more node to be scaled out as a new reserve (see the PriorityClass sketch right after this list).
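The eviction works through pod priority: placeholder pods run at a negative priority while business pods default to priority 0, so the scheduler preempts the placeholders to make room. The PriorityClass used by the placeholder component is roughly equivalent to the following sketch (illustrative; the name and value match the priorityClassDefault settings in the values shown later):
# Illustrative sketch of the low-priority class used by placeholder pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority-class
value: -1
globalDefault: false
description: "Placeholder pods run at this priority so that business pods (priority 0) can preempt them."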
Implementation: Fast Elastic Scale-out Across Multiple Availability Zones
For how to achieve fast, simultaneous elastic scale-out across multiple availability zones, see: https://help.aliyun.com/document_detail/410611.html
- Create three elastic node pools, H, K, and G, one per availability zone, each configured with the following labels and taints:
# node pool K
label:
  avaliable_zone: k
  usage: sts
taints:
- effect: NoSchedule
  key: usage
  value: sts
# node pool G
label:
  avaliable_zone: g
  usage: sts
taints:
- effect: NoSchedule
  key: usage
  value: sts
# node pool H
label:
  avaliable_zone: h
  usage: sts
taints:
- effect: NoSchedule
  key: usage
  value: sts
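Once the node pools have scaled up nodes (which happens after the placeholder pods below are deployed), the labels and taints can be spot-checked with ordinary kubectl commands (a quick sanity check; node names will differ in your cluster):
# show every node together with its zone and usage labels
k get nodes -L avaliable_zone -L usage
# check the taint on a specific node (substitute a real node name)
k describe node <node-name> | grep Taints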
- Deploy the placeholder component and its placeholder deployments, one per availability zone, with the following values:
deployments:
- affinity: {}
  annotations: {}
  containers:
  - image: registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.1
    imagePullPolicy: IfNotPresent
    name: placeholder
    resources:
      requests:
        cpu: 3000m
        memory: 6Gi  # Gi assumed; a bare number would be interpreted as bytes
  imagePullSecrets: {}
  labels: {}
  name: ack-place-holder-h
  nodeSelector:
    avaliable_zone: h
  replicaCount: 1
  tolerations:
  - effect: NoSchedule
    key: usage
    operator: Equal
    value: sts
- affinity: {}
  annotations: {}
  containers:
  - image: registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.1
    imagePullPolicy: IfNotPresent
    name: placeholder
    resources:
      requests:
        cpu: 3000m
        memory: 6Gi
  imagePullSecrets: {}
  labels: {}
  name: ack-place-holder-k
  nodeSelector:
    avaliable_zone: k
  replicaCount: 1
  tolerations:
  - effect: NoSchedule
    key: usage
    operator: Equal
    value: sts
- affinity: {}
  annotations: {}
  containers:
  - image: registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.1
    imagePullPolicy: IfNotPresent
    name: placeholder
    resources:
      requests:
        cpu: 3000m
        memory: 6Gi
  imagePullSecrets: {}
  labels: {}
  name: ack-place-holder-g
  nodeSelector:
    avaliable_zone: g
  replicaCount: 1
  tolerations:
  - effect: NoSchedule
    key: usage
    operator: Equal
    value: sts
fullnameOverride: ""
nameOverride: ""
podSecurityContext: {}
priorityClassDefault:
  enabled: true
  name: default-priority-class
  value: -1
As we can see, each of the three node pools defined earlier has now scaled up one node, and each node runs the pod of its placeholder deployment:
k get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
ack-place-holder-g 1/1 1 1 18m
ack-place-holder-h 1/1 1 1 18m
ack-place-holder-k 1/1 1 1 18m
# the placeholder pods are spread across three nodes in the three availability zones
k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ack-place-holder-g-7b87d547d6-2h92f 1/1 Running 0 14m 10.4.48.187 worker.10.4.48.186.g7ne <none> <none>
ack-place-holder-h-5cd56f49ff-qpw44 1/1 Running 0 14m 10.1.86.119 worker.10.0.63.207.g7ne <none> <none>
ack-place-holder-k-6b5c96457d-vjzhp 1/1 Running 0 14m 10.2.57.72 worker.10.2.57.70.g7ne <none> <none>
Now we deploy a StatefulSet application:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-dynamic
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
        usage: sts
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: usage
                operator: In
                values:
                - sts
            topologyKey: kubernetes.io/hostname
            #topologyKey: topology.kubernetes.io/zone
      nodeSelector:
        usage: sts
      tolerations:
      - key: "usage"
        operator: "Equal"
        value: "sts"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        resources:
          requests:
            cpu: 2
          limits:
            cpu: 4
        volumeMounts:
        - name: disk-essd
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: disk-essd
    spec:
      accessModes: [ "ReadWriteOnce" ]  # cloud disks support single-node attach only
      storageClassName: "sts-sc"
      resources:
        requests:
          storage: 20Gi
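The StatefulSet references serviceName: nginx, so a matching headless Service is also needed; a minimal sketch, assuming nothing beyond the pod label app: nginx:
# Minimal headless Service backing serviceName: nginx (sketch)
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None
  selector:
    app: nginx
  ports:
  - name: web
    port: 80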
After the StatefulSet above is deployed, we can see that its pods immediately evict the placeholder pods and start being created in their place:
k get po -w
NAME READY STATUS RESTARTS AGE
ack-place-holder-g-7b87d547d6-vsv8z 0/1 Pending 0 9s
ack-place-holder-h-5cd56f49ff-qpw44 1/1 Running 0 18m
ack-place-holder-k-6b5c96457d-vjzhp 1/1 Running 0 18m
nginx-dynamic-0 0/1 ContainerCreating 0 9s
Each evicted placeholder pod immediately triggers scale-out of the elastic node pool in its availability zone, pre-warming one more machine to replenish the reserve:
k get po
NAME READY STATUS RESTARTS AGE
ack-place-holder-g-7b87d547d6-vsv8z 1/1 Running 0 4m49s
ack-place-holder-h-5cd56f49ff-dbhns 1/1 Running 0 4m17s
ack-place-holder-k-6b5c96457d-xxjqs 0/1 ContainerCreating 0 3m45s
nginx-dynamic-0 1/1 Running 0 4m49s
nginx-dynamic-1 1/1 Running 0 4m17s
nginx-dynamic-2 1/1 Running 0 3m45s
Finally, the StatefulSet business pods have been scheduled onto the three machines previously held by the placeholder pods (worker.10.4.48.186.g7ne, worker.10.0.63.207.g7ne, worker.10.2.57.70.g7ne), while the placeholder pods have moved onto the three newly scaled-out nodes:
k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ack-place-holder-g-7b87d547d6-vsv8z 1/1 Running 0 7m26s 10.4.48.190 worker.10.4.48.189.g7ne <none> <none>
ack-place-holder-h-5cd56f49ff-dbhns 1/1 Running 0 6m54s 10.1.86.121 worker.10.0.63.208.g7ne <none> <none>
ack-place-holder-k-6b5c96457d-xxjqs 1/1 Running 0 6m22s 10.2.57.74 worker.10.2.57.73.g7ne <none> <none>
nginx-dynamic-0 1/1 Running 0 7m26s 10.4.48.187 worker.10.4.48.186.g7ne <none> <none>
nginx-dynamic-1 1/1 Running 0 6m54s 10.1.86.119 worker.10.0.63.207.g7ne <none> <none>
nginx-dynamic-2 1/1 Running 0 6m22s 10.2.57.72 worker.10.2.57.70.g7ne <none> <none>
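To double-check that each dynamically provisioned disk really lives in the same zone as its pod, the node affinity that the CSI driver writes onto the PV can be inspected (PV names will differ in your cluster):
# list the PVCs created from the volumeClaimTemplates
k get pvc
# the PV's node affinity should pin the disk to a single availability zone
k describe pv <pv-name> | grep -A 5 "Node Affinity"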
Summary
With the approach above, we have achieved the following:
- Pods are distributed across multiple availability zones.
- Each cloud disk is created and attached in the availability zone where its pod was first scheduled.
- When a pod is later recreated, it is automatically scheduled back to the correct availability zone based on the zone of its cloud disk.
- When scaling out, newly added pods are brought up extremely fast on the nodes held by placeholder pods, and the placeholder pods in turn scale out more resources as a new reserve.
In addition, a few points deserve special attention:
- This solution only improves the availability of stateful applications at the deployment level. For stateful workloads such as middleware and databases, the consistency of the state data itself cannot be guaranteed this way; the application cluster's own data consistency mechanism remains the core of achieving state consistency.
- We currently use a relatively simple podAntiAffinity rule for multi-zone placement, which does not guarantee a strictly balanced distribution across zones. Combining it with the Kubernetes topologySpread mechanism would make the multi-zone distribution more even, as sketched below.
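For reference, this is roughly what adding topologySpreadConstraints to the StatefulSet's pod template could look like (an illustrative sketch, not part of the walkthrough above):
# Illustrative snippet to add under the pod template's spec
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      usage: sts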