This article describes how to mount cloud disks in a multi-zone Kubernetes cluster. The core idea: the volumeBindingMode parameter of the StorageClass selects between two mounting schemes, "Pod follows the disk" and "disk follows the Pod".
volumeBindingMode configuration reference: see the volume binding mode documentation
Multi-Zone Clusters
It is common consensus to deploy clusters across multiple availability zones for high availability: if one zone fails, the cluster can still serve traffic normally. Alibaba Cloud ACK provides a multi-zone cluster deployment option; see its documentation.
Container storage volumes persist user data by mounting external storage into containers; after a container is destroyed, the data in the volume remains available to other applications. Common volume types are block storage, file storage, and object storage, with the following zone sensitivity:
Block storage (e.g. Alibaba Cloud disk): the ECS instance and the disk must be in the same availability zone;
File storage (e.g. Alibaba Cloud NAS, CPFS): ECS can mount NAS across zones, but both must be in the same VPC;
Object storage (e.g. Alibaba Cloud OSS): ECS can access OSS across zones and even across regions;
Only cloud-disk volumes therefore impose a hard zone requirement: the Pod must be scheduled, according to the volume's zone, into the same zone as its PV (disk), or the mount fails. Kubernetes implements volume-aware scheduling that places Pods into a suitable zone based on the PVs they use.
The following sections use Alibaba Cloud disks to show how zone-aware scheduling of block-storage volumes works in a multi-zone environment.
Schemes for Using Cloud Disks Across Multiple Zones
Mounting cloud disks with a StatefulSet in a multi-zone cluster
Workloads whose Pods mount cloud-disk volumes have the following characteristics:
As block storage, a cloud disk can only be attached to nodes in the same availability zone, and it is not shared storage: it can be mounted and used by only one Pod at a time;
Among common workload templates, Deployment and DaemonSet run multiple replicas with identical Pod specs, so if the Pods mount volumes, all replicas must share the same volume definition; since a disk can only be mounted by one Pod, Deployment and DaemonSet are unsuitable for cloud-disk volumes.
A StatefulSet offers two ways to configure volumes. One, as in a Deployment, defines a persistentVolumeClaim directly under volumes; this gives the whole StatefulSet a single fixed PVC (disk) and cannot be used with multiple replicas.
The other uses volumeClaimTemplates, which creates a PVC with a rule-based name for each Pod, with different PVCs corresponding to different disks. This allows multiple replicas, each mounting its own disk.
In short: workloads that use cloud disks should be orchestrated as a StatefulSet and mount the disks via volumeClaimTemplates. The same holds in multi-zone clusters.
The PVC name that volumeClaimTemplates generates for each Pod follows this rule:
PVC name = {volumeClaimTemplates name} + "-" + {StatefulSet name} + "-" + ordinal
For example, given the following StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  ***
  volumeClaimTemplates:
  - metadata:
      name: disk-ssd
    spec:
      ***
The PVCs created for the three Pods are named disk-ssd-web-0, disk-ssd-web-1, and disk-ssd-web-2.
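The naming rule above can be expressed as a small helper (an illustrative sketch; `pvc_name` is a hypothetical function, not part of any Kubernetes API):

```python
def pvc_name(template_name: str, statefulset_name: str, ordinal: int) -> str:
    """Name of the PVC that volumeClaimTemplates creates for the Pod with this ordinal."""
    return f"{template_name}-{statefulset_name}-{ordinal}"

# For the StatefulSet above: template "disk-ssd", StatefulSet "web", 3 replicas.
names = [pvc_name("disk-ssd", "web", i) for i in range(3)]
# → ['disk-ssd-web-0', 'disk-ssd-web-1', 'disk-ssd-web-2']
```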
Scheduling schemes for mounting cloud disks in a multi-zone cluster
Without a cloud-disk volume, a Pod starts as follows:
The Pod is created and enters the scheduling flow;
The scheduler checks the PVCs the Pod references: if a PVC is unbound, it waits for it to become bound; if it is bound, scheduling continues;
The most suitable target node is chosen based on nodeSelector, node status, and other constraints;
The Pod is started on the target node.
Scheme 1 for scheduling Pods that mount cloud-disk volumes:
The Pod is created and enters the scheduling flow;
The scheduler checks the referenced PVCs: if a PVC is unbound, it waits; if it is bound, scheduling continues;
The set of candidate nodes is narrowed using the topology information in the PV;
A target node is chosen from that set based on nodeSelector, node status, and other constraints;
The Pod is started on the target node.
Scheme 2 for scheduling Pods that mount cloud-disk volumes:
The Pod is created and enters the scheduling flow;
The scheduler checks the referenced PVCs and finds them unbound;
A target node is chosen based on nodeSelector, node status, and other constraints;
A cloud disk (PV) is then dynamically created in the target node's availability zone, so the PV's zone matches the node's;
The Pod is started on the target node.
Comparing the two schemes:
Scheme 1 determines the PV (disk) zone first and feeds it into Pod scheduling as a constraint: the Pod follows the disk;
Scheme 2 determines the target node first, then dynamically creates and binds the PV (disk): the disk follows the Pod.
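The zone filter at the heart of scheme 1 can be sketched as a predicate over candidate nodes (a simplified illustration, not the actual scheduler code; the dicts stand in for real Node/PV objects):

```python
ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"

def zone_of(obj: dict) -> str:
    # In this sketch both nodes and PVs expose their zone as a topology label.
    return obj["labels"][ZONE_LABEL]

def filter_nodes_by_pv(nodes: list, pv: dict) -> list:
    # Scheme 1: only nodes in the same zone as the PV (disk) remain candidates.
    return [n for n in nodes if zone_of(n) == zone_of(pv)]

nodes = [
    {"name": "node-1", "labels": {ZONE_LABEL: "cn-beijing-a"}},
    {"name": "node-2", "labels": {ZONE_LABEL: "cn-beijing-b"}},
]
pv = {"labels": {ZONE_LABEL: "cn-beijing-a"}}
candidates = filter_nodes_by_pv(nodes, pv)  # only node-1 is left
```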
Pod follows the disk:
The key point of this scheme is that the PV is created, and its zone determined, before the Pod is scheduled. It can be implemented with either static or dynamic volumes.
Before deployment, plan the availability zones the application should run in and create the disks and PV objects manually (static volumes) or automatically (dynamic volumes). If an application should span multiple zones, request disks and create PVs in each of those zones. Every PV object must carry the corresponding zone scheduling information.
The scheduling information can be added as labels on the PV (evaluated by the VolumeZonePredicate):
labels:
  failure-domain.beta.kubernetes.io/zone: cn-hangzhou-b
  failure-domain.beta.kubernetes.io/region: cn-hangzhou
It can also be added to the PV's nodeAffinity (evaluated by the VolumeBindingPredicate):
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.diskplugin.csi.alibabacloud.com/zone
        operator: In
        values:
        - cn-shenzhen-a
The examples below run in a cluster whose nodes span three availability zones:
# kubectl describe node | grep failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
Static volumes
With static volumes, the administrator creates the PV objects and the disk instances manually and writes the zone and other information into the volume. The disks and PVs must exist before use.
This example runs a StatefulSet with three Pods, each mounting one cloud-disk volume, in zones cn-beijing-a, cn-beijing-b, and cn-beijing-c respectively. The StatefulSet is named web and the volumeClaimTemplates entry is named disk-ssd.
Create the PVCs and PVs from the following template:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-ssd-web-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 25Gi
  selector:
    matchLabels:
      alicloud-pvname: pv-disk-ssd-web-0
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-disk-ssd-web-0
  labels:
    alicloud-pvname: pv-disk-ssd-web-0
spec:
  capacity:
    storage: 25Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: diskplugin.csi.alibabacloud.com
    volumeHandle: d-2zeeujx1zexxkbc8ny4b
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.diskplugin.csi.alibabacloud.com/zone
          operator: In
          values:
          - cn-beijing-a
The mapping between PVC, PV, availability zone, and disk ID:
disk-ssd-web-0 ==> pv-disk-ssd-web-0 ==> cn-beijing-a ==> d-2zeeujx1zexxkbc8ny4b
disk-ssd-web-1 ==> pv-disk-ssd-web-1 ==> cn-beijing-b ==> d-2ze4n7co1x8w8xs95sqk
disk-ssd-web-2 ==> pv-disk-ssd-web-2 ==> cn-beijing-c ==> d-2zeaed32ln6d8sbpichh
# kubectl get pvc | grep disk-ssd-web
disk-ssd-web-0 Bound pv-disk-ssd-web-0 25Gi RWO 59s
disk-ssd-web-1 Bound pv-disk-ssd-web-1 25Gi RWO 56s
disk-ssd-web-2 Bound pv-disk-ssd-web-2 25Gi RWO 54s
# kubectl get pv | grep disk-ssd-web
pv-disk-ssd-web-0 25Gi RWO Retain Bound default/disk-ssd-web-0 2m43s
pv-disk-ssd-web-1 25Gi RWO Retain Bound default/disk-ssd-web-1 2m40s
pv-disk-ssd-web-2 25Gi RWO Retain Bound default/disk-ssd-web-2 2m38s
Deploy the following StatefulSet template:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: disk-ssd
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: disk-ssd
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ""
      resources:
        requests:
          storage: 25Gi
Run the following commands to inspect the Pods; the three Pods from the template above run in zones a, b, and c respectively:
# kubectl get pod
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 14m
web-1 1/1 Running 0 13m
web-2 1/1 Running 0 13m
# kubectl describe pod | grep Node
Node: cn-beijing.172.16.1.101/172.16.1.101
Node: cn-beijing.172.16.2.87/172.16.2.87
Node: cn-beijing.172.16.3.197/172.16.3.197
# kubectl describe node cn-beijing.172.16.1.101 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
# kubectl describe node cn-beijing.172.16.2.87 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
# kubectl describe node cn-beijing.172.16.3.197 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
Delete the three Pods and verify that the restarted Pods still land in their original zones:
# kubectl delete pod --all
pod "web-0" deleted
pod "web-1" deleted
pod "web-2" deleted
# kubectl get pod
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 61s
web-1 1/1 Running 0 41s
web-2 1/1 Running 0 21s
# kubectl describe pod | grep Node
Node: cn-beijing.172.16.1.101/172.16.1.101
Node: cn-beijing.172.16.2.87/172.16.2.87
Node: cn-beijing.172.16.3.197/172.16.3.197
# kubectl describe node cn-beijing.172.16.1.101 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
# kubectl describe node cn-beijing.172.16.2.87 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
# kubectl describe node cn-beijing.172.16.3.197 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
Dynamic volumes
Create a StorageClass that supports multiple zones, for example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk-multizone
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_ssd
  zoneId: cn-beijing-a,cn-beijing-b,cn-beijing-c
reclaimPolicy: Delete
zoneId: lists multiple availability zones; when several disks are provisioned, they are created in these zones in round-robin order;
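The round-robin behavior described above can be sketched as follows (an illustration of the documented behavior, not the provisioner's actual implementation):

```python
from itertools import count

class ZoneRoundRobin:
    """Cycle through the zones listed in the StorageClass zoneId parameter."""

    def __init__(self, zone_id: str):
        self.zones = zone_id.split(",")
        self._n = count()

    def next_zone(self) -> str:
        # Each newly provisioned disk takes the next zone in the list, wrapping around.
        return self.zones[next(self._n) % len(self.zones)]

rr = ZoneRoundRobin("cn-beijing-a,cn-beijing-b,cn-beijing-c")
picked = [rr.next_zone() for _ in range(4)]
# → ['cn-beijing-a', 'cn-beijing-b', 'cn-beijing-c', 'cn-beijing-a']
```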
Create the application from the following StatefulSet configuration:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: disk-ssd
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: disk-ssd
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: csi-disk-multizone
      resources:
        requests:
          storage: 20Gi
Inspect the resulting Pod, PVC, and PV objects:
# kubectl get pod
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 2m2s
web-1 1/1 Running 0 84s
web-2 1/1 Running 0 52s
# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
disk-ssd-web-0 Bound disk-9e6a6f65-f3fc-11e9-a7a7-00163e165b60 20Gi RWO csi-disk-multizone 2m6s
disk-ssd-web-1 Bound disk-b5071f37-f3fc-11e9-a7a7-00163e165b60 20Gi RWO csi-disk-multizone 88s
disk-ssd-web-2 Bound disk-c81b6163-f3fc-11e9-a7a7-00163e165b60 20Gi RWO csi-disk-multizone 56s
# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
disk-9e6a6f65-f3fc-11e9-a7a7-00163e165b60 20Gi RWO Delete Bound default/disk-ssd-web-0 csi-disk-multizone 116s
disk-b5071f37-f3fc-11e9-a7a7-00163e165b60 20Gi RWO Delete Bound default/disk-ssd-web-1 csi-disk-multizone 85s
disk-c81b6163-f3fc-11e9-a7a7-00163e165b60 20Gi RWO Delete Bound default/disk-ssd-web-2 csi-disk-multizone 39s
Check the zones of the Pods and PVs; they are spread across the three zones:
# kubectl describe pod web-0 | grep Node
Node: cn-beijing.172.16.1.101/172.16.1.101
# kubectl describe node cn-beijing.172.16.1.101 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
# kubectl describe pod web-1 | grep Node
Node: cn-beijing.172.16.2.87/172.16.2.87
# kubectl describe node cn-beijing.172.16.2.87 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
# kubectl describe pod web-2 | grep Node
Node: cn-beijing.172.16.3.197/172.16.3.197
# kubectl describe node cn-beijing.172.16.3.197 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
# kubectl describe pv disk-9e6a6f65-f3fc-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-a]
# kubectl describe pv disk-b5071f37-f3fc-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-b]
# kubectl describe pv disk-c81b6163-f3fc-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-c]
Disk follows the Pod:
"Disk follows the Pod" means that the cloud disk and the PV are created dynamically only after the Pod has been scheduled and the zone of its node is known. This scheme therefore only works with dynamic volumes.
Create a StorageClass with volumeBindingMode set to WaitForFirstConsumer:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk-topology
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_ssd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
WaitForFirstConsumer: for PVCs that use this StorageClass, the Pod is scheduled first when it starts, and only then are the PV and the disk provisioned.
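The effect of the two binding modes can be summarized as an ordering of events (a conceptual sketch; `provisioning_order` is a hypothetical helper, not a Kubernetes API):

```python
def provisioning_order(volume_binding_mode: str) -> list:
    """Order of events for a PVC under each volumeBindingMode (conceptual sketch)."""
    if volume_binding_mode == "WaitForFirstConsumer":
        # Disk follows the Pod: scheduling happens first, then the disk is
        # provisioned in the zone of the chosen node.
        return ["schedule pod", "provision disk in pod's zone", "bind PVC", "start pod"]
    # Default "Immediate": the disk is provisioned as soon as the PVC is created,
    # and the Pod must later be scheduled into the disk's zone.
    return ["provision disk", "bind PVC", "schedule pod into disk's zone", "start pod"]

order = provisioning_order("WaitForFirstConsumer")
```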
Create the following StatefulSet application:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: disk-ssd
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: disk-ssd
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: csi-disk-topology
      resources:
        requests:
          storage: 20Gi
Get the Pod and PV information:
# kubectl get pod
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 2m5s
web-1 1/1 Running 0 100s
web-2 1/1 Running 0 74s
# kubectl describe pod web-0 | grep Node
Node: cn-beijing.172.16.3.197/172.16.3.197
# kubectl describe node cn-beijing.172.16.3.197 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-c
# kubectl describe pod web-1 | grep Node
Node: cn-beijing.172.16.1.101/172.16.1.101
# kubectl describe node cn-beijing.172.16.1.101 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-a
# kubectl describe pod web-2 | grep Node
Node: cn-beijing.172.16.2.87/172.16.2.87
# kubectl describe node cn-beijing.172.16.2.87 | grep zone
failure-domain.beta.kubernetes.io/zone=cn-beijing-b
# kubectl describe pv disk-d4b08afa-f3fe-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-c]
# kubectl describe pv disk-e32d5fcf-f3fe-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-a]
# kubectl describe pv disk-f2cec31a-f3fe-11e9-a7a7-00163e165b60 | grep zone
Term 0: topology.diskplugin.csi.alibabacloud.com/zone in [cn-beijing-b]
One clarification about the "disk follows the Pod" scheme:
Scheduling the Pod before creating the PV and disk only happens the first time a Pod starts with this policy configured. When the Pod restarts later, scheduling follows the zone information already recorded on the PV, i.e. it falls back to "Pod follows the disk".
Summary:
This article presented two schemes for using cloud-disk volumes in a multi-zone cluster:
Pod follows the disk: the disks and PV objects are created first and the workload is scheduled according to the disks' zone information. This fits mounting pre-existing disks best; at the scheduling level it relies heavily on the disks' zone information and weakens the other scheduling policies.
Disk follows the Pod: the workload is scheduled first, then the disk and PV objects are created in the Pod's zone. This scheme fully honors all scheduling policies; the PV is created and mounted after scheduling completes, so the disk's zone does not participate in scheduling.
For workloads that use cloud-disk volumes, "disk follows the Pod" is the recommended scheme.