Background
I have two k8s clusters here; their environments are as follows:
[root@k81 ~]# kubectl get node -owide
NAME   STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k81    Ready    control-plane,master   31d     v1.23.1   10.10.21.249   <none>        CentOS Linux 7 (Core)   6.1.10-1.el7.elrepo.x86_64   docker://23.0.0
k82    Ready    <none>                 5h20m   v1.23.1   10.10.21.250   <none>        CentOS Linux 7 (Core)   6.1.10-1.el7.elrepo.x86_64   docker://23.0.0
k83    Ready    <none>                 5h19m   v1.23.1   10.10.21.251   <none>        CentOS Linux 7 (Core)   6.1.10-1.el7.elrepo.x86_64   docker://23.0.0
[zettakit@rocky1 ~]$ kubectl get node -owide
NAME     STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                KERNEL-RUNTIME
rocky1   Ready    control-plane,master   15d     v1.23.6   10.10.21.58   <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
rocky2   Ready    <none>                 5h13m   v1.23.6   10.10.21.59   <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
rocky3   Ready    <none>                 5h12m   v1.23.6   10.10.21.60   <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
Because of some things that came up, I reshuffled the two clusters: I ran kubeadm reset on every worker node and then kubeadm join'ed each worker into its new master (the join steps are sketched after the node listings below). After the reshuffle the environments looked like this:
[root@k81 ~]# kubectl get node -owide
NAME     STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                CONTAINER-RUNTIME
k81      Ready    control-plane,master   31d     v1.23.1   10.10.21.249   <none>        CentOS Linux 7 (Core)         6.1.10-1.el7.elrepo.x86_64    docker://23.0.0
rocky2   Ready    <none>                 5h13m   v1.23.6   10.10.21.59    <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
rocky3   Ready    <none>                 5h12m   v1.23.6   10.10.21.60    <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
[zettakit@rocky1 ~]$ kubectl get node -owide
NAME     STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                CONTAINER-RUNTIME
k82      Ready    <none>                 5h20m   v1.23.1   10.10.21.250   <none>        CentOS Linux 7 (Core)         6.1.10-1.el7.elrepo.x86_64    docker://23.0.0
k83      Ready    <none>                 5h19m   v1.23.1   10.10.21.251   <none>        CentOS Linux 7 (Core)         6.1.10-1.el7.elrepo.x86_64    docker://23.0.0
rocky1   Ready    control-plane,master   15d     v1.23.6   10.10.21.58    <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
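For reference, the reshuffle itself was just the standard kubeadm workflow. The sketch below is from memory rather than a copy of my terminal; the master address matches rocky1 above, while the token and CA-cert hash are placeholders that have to be generated on the target master.

# on each worker being moved: wipe the old cluster membership
kubeadm reset -f

# on the new master: print a fresh join command (token + discovery hash)
kubeadm token create --print-join-command

# back on the worker: join the new control plane (placeholder values)
kubeadm join 10.10.21.58:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>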
Symptoms
Of course the reshuffled clusters needed some validation, so on each cluster I ran kubectl run --image=busybox:1.28 mybusybox. The pod on the first cluster was created normally, but on the second cluster it was stuck in ContainerCreating, and the metrics-server pod on that cluster was stuck in the same state. The natural next step was a kubectl describe to look at the details, which turned up the following error:
Normal   Scheduled               2m42s   default-scheduler   Successfully assigned kube-system/metrics-server-744f6976b-6rspj to k83
Warning  FailedCreatePodSandBox  2m42s   kubelet             Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "4cf99c45e8c559c93d4fcdc71d7d4656a69f9ba4c6ae771e592bbd16ec8224ea" network for pod "metrics-server-744f6976b-6rspj": networkPlugin cni failed to set up pod "metrics-server-744f6976b-6rspj_kube-system" network: error getting ClusterInformation: Get "https://192.168.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), failed to clean up sandbox container "4cf99c45e8c559c93d4fcdc71d7d4656a69f9ba4c6ae771e592bbd16ec8224ea" network for pod "metrics-server-744f6976b-6rspj": networkPlugin cni failed to teardown pod "metrics-server-744f6976b-6rspj_kube-system" network: error getting ClusterInformation: Get "https://192.168.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
Troubleshooting
The https://192.168.0.1:443 in the error is the ClusterIP of my kubernetes Service. Since the error clearly involved certificates, my first instinct was to check whether the nodes were still Ready:
[zettakit@rocky1 ~]$ kubectl get svc -owide
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE   SELECTOR
kubernetes   ClusterIP   192.168.0.1   <none>        443/TCP   15d   <none>
[zettakit@rocky1 ~]$ kubectl get node -owide
NAME     STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                CONTAINER-RUNTIME
k82      Ready    <none>                 5h20m   v1.23.1   10.10.21.250   <none>        CentOS Linux 7 (Core)         6.1.10-1.el7.elrepo.x86_64    docker://23.0.0
k83      Ready    <none>                 5h19m   v1.23.1   10.10.21.251   <none>        CentOS Linux 7 (Core)         6.1.10-1.el7.elrepo.x86_64    docker://23.0.0
rocky1   Ready    control-plane,master   15d     v1.23.6   10.10.21.58    <none>        Rocky Linux 9.1 (Blue Onyx)   5.14.0-162.6.1.el9_1.x86_64   docker://23.0.1
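As a side note, one hypothetical way to confirm that the x509 error really is a CA mismatch (not something I ran at the time) is to hit the kubernetes Service address from the affected worker and verify the serving certificate against the CA that kubeadm drops on every node; the paths below assume a standard kubeadm layout.

# on the worker (e.g. k83): does the API server behind 192.168.0.1 verify
# against this node's copy of the cluster CA? A 401/403 response body is
# fine here; the interesting part is whether the TLS handshake succeeds.
curl --cacert /etc/kubernetes/pki/ca.crt https://192.168.0.1:443/version

# or inspect the certificate actually being served
openssl s_client -connect 192.168.0.1:443 </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer -subject -dates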
At this point I suspected that the Kubernetes and Docker versions on the master did not match those on the workers. But then again, the other cluster has the same kind of mismatch and works fine, and all the major versions are actually identical, with only the patch versions differing.
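The per-node versions can be compared at a glance straight from the node objects; a minimal sketch using custom-columns (field paths per the standard NodeStatus):

kubectl get node -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion,OS:.status.nodeInfo.osImage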
So I went back and read the error more carefully and noticed it was network related. Maybe the network plugin wasn't installed properly? I focused my checks there, starting with the flannel pods; sure enough, the nodes were Ready and the pods looked fine too:
[zettakit@rocky1 ~]$ kubectl -n kube-flannel get pod -owide
NAME                    READY   STATUS    RESTARTS     AGE     IP             NODE     NOMINATED NODE   READINESS GATES
kube-flannel-ds-ccx2w   1/1     Running   0            5h34m   10.10.21.251   k83      <none>           <none>
kube-flannel-ds-grnb7   1/1     Running   0            5h35m   10.10.21.250   k82      <none>           <none>
kube-flannel-ds-jsz8f   1/1     Running   4 (9d ago)   15d     10.10.21.58    rocky1   <none>           <none>
Resolution
Here comes the key part.
It suddenly occurred to me that installing flannel creates directories and files under both /opt and /etc. I checked, and sure enough the number of files on the worker nodes did not match the master. Since I couldn't be bothered to go through them one by one, I simply moved the worker's directories out of the way and scp'ed a fresh copy over from the master:
# run on the worker node
[root@k82 ~]# mv /opt/cni /tmp/
[root@k82 ~]# mv /etc/cni/ /tmp/
# run on the master node
[zettakit@rocky1 ~]$ sudo scp -r /opt/cni/ root@k82:/opt/
[zettakit@rocky1 ~]$ sudo scp -r /etc/cni/ root@k82:/etc/
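In hindsight, instead of just comparing file counts, the two nodes could have been diffed directly. A rough sketch, assuming root ssh access from the master to k82:

# run from the master: compare CNI configs and plugin binaries with the worker
diff <(ssh root@k82 'ls /etc/cni/net.d /opt/cni/bin') <(ls /etc/cni/net.d /opt/cni/bin)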
Then I checked whether the pods were now created successfully on the worker nodes:
[zettakit@rocky1 ~]$ kubectl -n kube-system get pod metrics-server-744f6976b-6rspj -owide
NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
metrics-server-744f6976b-6rspj   1/1     Running   0          70m   172.18.4.4   k83    <none>           <none>
[zettakit@rocky1 ~]$ kubectl get pod mybusybox -owide
NAME        READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
mybusybox   1/1     Running   0          48m   172.18.4.3   k83    <none>           <none>
Analysis
Although the problem was solved, I was still puzzled that some pods had not been affected at all. A quick look showed that every unaffected pod is controlled by a DaemonSet: kube-proxy, ingress (I converted the ingress controller to a DaemonSet when I deployed it; by default it is a Deployment), and flannel.
[zettakit@rocky1 ~]$ kubectl -n kube-system describe pod kube-proxy-kq8kn |grep -i ^controlled
Controlled By:  DaemonSet/kube-proxy
[zettakit@rocky1 ~]$ kubectl -n kube-flannel describe pod kube-flannel-ds-jsz8f |grep -i ^controlled
Controlled By:  DaemonSet/kube-flannel-ds
[zettakit@rocky1 ~]$ kubectl -n ingress-nginx describe pod ingress-nginx-controller-p222x |grep -i ^controlled
Controlled By:  DaemonSet/ingress-nginx-controller
[zettakit@rocky1 ~]$ kubectl get daemonsets.apps -A
NAMESPACE       NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
ingress-nginx   ingress-nginx-controller   3         3         3       3            3           <none>                   2d7h
kube-flannel    kube-flannel-ds            3         3         3       3            3           <none>                   15d
kube-system     kube-proxy                 3         3         3       3            3           kubernetes.io/os=linux   15d
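Another way to get the same owner information for every pod at once, rather than describing them one by one, is to read ownerReferences directly; a small sketch:

# list each pod with the kind of its first owner (DaemonSet, ReplicaSet, ...)
kubectl get pod -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind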
And even though the problem is solved, I still haven't found the root cause, nor the reason why the DaemonSet-controlled pods were unaffected. If anyone knows, please explain in the comments.