1.
The etcd version installed here is 3.4; with etcd 3.3 or earlier the configuration below would not have caused this error.
Querying the etcd status reports the error: conflicting environment variable "ETCD_NAME" is shadowed by corresponding command-line flag (either unset environment variable or disable flag)
Roughly translated: the environment variable ETCD_NAME conflicts with, and is shadowed by, the corresponding command-line flag; the fix is to either unset the environment variable or drop the flag.
So where is this environment variable set?
The startup unit:
[root@master bin]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
  --advertise-client-urls http://${THIS_IP}:${THIS_PORT_API} --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
  --initial-cluster ${CLUSTER} \
  --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN} --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/server.pem \
  --peer-key-file=/opt/etcd/ssl/server-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
The etcd configuration file:
[root@master bin]# cat /opt/etcd/cfg/etcd.conf
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.16:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.16:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.16:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.16:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.16:2380,etcd-2=https://192.168.217.17:2380,etcd-3=https://192.168.217.18:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
The root cause is that etcd 3.4 only wants this configuration from one place: both files define the initial-* bootstrap settings (and the listen/advertise URLs), so they conflict, whereas etcd 3.3 and earlier tolerated the duplicated configuration without complaint. The fix is therefore to delete the duplicated flags from the startup unit, namely these:
--initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
--advertise-client-urls http://${THIS_IP}:${THIS_PORT_API} --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
--initial-cluster ${CLUSTER} \
--initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}
After deleting those flags, reload systemd and restart the service, and etcd returns to normal.
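For reference, the trimmed ExecStart ends up looking roughly like this (a sketch based on the unit above; every other setting is supplied via /opt/etcd/cfg/etcd.conf):

# /usr/lib/systemd/system/etcd.service — trimmed [Service] section (sketch)
ExecStart=/opt/etcd/bin/etcd \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/server.pem \
  --peer-key-file=/opt/etcd/ssl/server-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem

# reload systemd and restart etcd after editing the unit
systemctl daemon-reload && systemctl restart etcd

The service status after the restart: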
[root@master bin]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-08-25 17:25:08 CST; 2h 31min ago
 Main PID: 3998 (etcd)
   Memory: 43.7M
   CGroup: /system.slice/etcd.service
           └─3998 /opt/etcd/bin/etcd --cert-file=/opt/etcd/ssl/server.pem --key-file=/opt/etcd/ssl/server-key.pem --peer-cert-file=/opt/etcd/ssl/server.pem --peer-key-file=/opt/etc...

Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 became follower at term 81
Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 [logterm: 1, index: 3, vote: 0] cast MsgVote for e078026890aff6e3 [logterm: 2, index: 5] at term 81
Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: raft.node: 1a58a86408898c44 elected leader e078026890aff6e3 at term 81
Aug 25 17:25:08 master etcd[3998]: published {Name:etcd-1 ClientURLs:[https://192.168.217.16:2379]} to cluster e4c1916e49e5defc
Aug 25 17:25:08 master etcd[3998]: ready to serve client requests
Aug 25 17:25:08 master systemd[1]: Started Etcd Server.
Aug 25 17:25:08 master etcd[3998]: serving client requests on 192.168.217.16:2379
Aug 25 17:25:08 master etcd[3998]: set the initial cluster version to 3.4
Aug 25 17:25:08 master etcd[3998]: enabled capabilities for version 3.4
2. The coredns pod stays in CrashLoopBackOff
The pod log shows the error: /etc/coredns/Corefile:3 - Error during parsing: Unknown directive 'proxy'
Check the related ConfigMap; its contents are as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        log
        health
        kubernetes cluster.local 10.254.0.0/18
        proxy . /etc/resolv.conf
        cache 30
    }
Change the second-to-last line to forward . /etc/resolv.conf, then delete the coredns pod so a new one is created; the error disappears and the pod returns to a normal Running state.
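For reference, the corrected Corefile stanza would look roughly like this (a sketch: only the proxy line changes, the rest is exactly the ConfigMap shown above; edit it with kubectl -n kube-system edit cm coredns before deleting the pod):

.:53 {
    errors
    log
    health
    kubernetes cluster.local 10.254.0.0/18
    forward . /etc/resolv.conf
    cache 30
}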
3.
Running kubectl exec -it <pod name> reports: error: unable to upgrade connection: Forbidden (user=k8s-apiserver, verb=create, resource=nodes, sub
The cause is that the user shown in the parentheses, k8s-apiserver, lacks the required permission (in your setup a different user may be the one missing it; just look at whatever follows user= in the parentheses). Granting that user cluster-admin fixes it:
kubectl create clusterrolebinding k8s-apiserver --clusterrole=cluster-admin --user=k8s-apiserver
Alternatively, since cluster-admin grants full cluster administrator rights, a narrower binding is usually enough for kubectl exec: the built-in system:kubelet-api-admin ClusterRole grants access to the kubelet API.
kubectl create clusterrolebinding k8s-apiserver --clusterrole=system:kubelet-api-admin --user=k8s-apiserver
4. The kubelet service starts and looks fine for a short while, then stops after a minute or two.
The kubelet service status shows it failed, and the system log /var/log/messages contains the following:
F0827 15:18:26.995457 29538 server.go:274] failed to run Kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
This means the cgroup driver Docker uses does not match the one configured for kubelet, so kubelet cannot start (this only happens with a binary install, of course; kubeadm would have aligned the two automatically).
The kubelet configuration file:
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
address: 0.0.0.0
port: 10250
readOnlyPort: 10255
cgroupDriver: cgroupfs
clusterDNS:
- 10.0.0.2
The Docker configuration file:
[root@slave1 ~]# cat /etc/docker/daemon.json
{
  "registry-mirrors": ["http://bc437cce.m.daocloud.io"],
  "exec-opts":["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
Change either side so the two match: for example, change the Docker config to "exec-opts": ["native.cgroupdriver=cgroupfs"] and restart the Docker service, or set cgroupDriver: systemd for kubelet and restart the kubelet service. Either way, the two just have to agree.
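A minimal sketch of the "both sides on systemd" variant (assuming the file paths shown above; the reverse, both on cgroupfs, works the same way):

# 1) leave /etc/docker/daemon.json with:   "exec-opts": ["native.cgroupdriver=systemd"]
# 2) change the kubelet config file to:    cgroupDriver: systemd
# 3) restart both services
systemctl restart docker
systemctl daemon-reload && systemctl restart kubelet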
5.
Symptom:
The cluster status looks normal and the nodes look normal, but the kube-flannel pods keep flipping between Running and CrashLoopBackOff.
[root@master ~]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running            2          23h   10.244.0.12      k8s-master   <none>           <none>
default       dns-test                   0/1     Error              1          27h   <none>           k8s-node1    <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running            2          29h   10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-4b5vg     0/1     Completed          1          29h   <none>           k8s-node1    <none>           <none>
default       nginx1                     0/1     Error              1          27h   <none>           k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-lb75g   0/1     Completed          1          24h   <none>           k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      0/1     CrashLoopBackOff   11         29h   192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      0/1     CrashLoopBackOff   11         29h   192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv      1/1     Running            4          29h   192.168.217.16   k8s-master   <none>           <none>
Troubleshooting approach:
The master node is fine, so some service must differ between the master and the worker nodes. Checking the key services shows that kube-proxy was not running on the worker nodes. Start it, then delete the unhealthy pods (deleting is optional; they recover on their own, it just takes longer):
Run on node1 and node2:
systemctl start kube-proxy
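To keep kube-proxy from being missed again after a reboot, it is probably worth enabling it as well (assuming the same systemd unit name):

systemctl enable --now kube-proxy      # start it and enable it at boot
systemctl status kube-proxy            # confirm it is active (running)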
Then wait on the master; the flannel pods recover by themselves, their restart count incremented by one to 12:
[root@master ~]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running            2          23h   10.244.0.12      k8s-master   <none>           <none>
default       dns-test                   0/1     ImagePullBackOff   1          27h   10.244.1.7       k8s-node1    <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running            2          29h   10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-4b5vg     1/1     Running            2          29h   10.244.1.6       k8s-node1    <none>           <none>
default       nginx1                     1/1     Running            2          27h   10.244.2.10      k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-lb75g   1/1     Running            2          24h   10.244.2.11      k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      1/1     Running            12         30h   192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      1/1     Running            12         30h   192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv      1/1     Running            4          30h   192.168.217.16   k8s-master   <none>           <none>
6.
Symptom:
After the cluster was set up, logging in to the dashboard with a token works, but logging in with a kubeconfig file does not. The dashboard shows this error in the top-right corner: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "kubelet-bootstrap" cannot list resources, and no resources are displayed in the dashboard.
Solution:
Query the clusterrolebinding:
[root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
NAME                ROLE                                   AGE
kubelet-bootstrap   ClusterRole/system:node-bootstrapper   16s
The user clearly does not have admin rights, so grant it cluster-admin:
[root@master ~]# kubectl delete clusterrolebindings kubelet-bootstrap
clusterrolebinding.rbac.authorization.k8s.io "kubelet-bootstrap" deleted
[root@master ~]# kubectl create clusterrolebinding kubelet-bootstrap --clusterrole=cluster-admin --user=kubelet-bootstrap
clusterrolebinding.rbac.authorization.k8s.io/kubelet-bootstrap created
Query again:
[root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
NAME                ROLE                        AGE
kubelet-bootstrap   ClusterRole/cluster-admin   4m38s
A quick summary: when an RBAC error names a user, check that user's permissions first; if it is not an admin (or simply lacks the needed role), granting the right role solves the problem.
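One handy way to check a user's permissions before (and after) granting anything is kubectl auth can-i with impersonation; the verb and resource here are just an illustration matching the error above:

# can kubelet-bootstrap list clusterrolebindings?
kubectl auth can-i list clusterrolebindings.rbac.authorization.k8s.io --as=kubelet-bootstrap
# after binding cluster-admin this should print: yes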
7.
Symptom:
An error occurs while testing ingress-nginx. The test file is:
[root@master ~]# cat ingress-nginx.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: "nginx"
  name: example
spec:
  rules:                          # one Ingress can define multiple rules
  - host: foo.bar.com             # host; may be omitted (matches *) or written as a wildcard such as *.bar.com
    http:
      paths:                      # equivalent to nginx locations; one host can have multiple paths
      - backend:
          serviceName: svc-demo   # the Service to proxy to
          servicePort: 8080       # the Service port
        path: /
Applying the file gives:
[root@master ~]# k apply -f ingress-nginx.yaml
Error from server (InternalError): error when creating "ingress-nginx.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress3-ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: EOF
Solution:
[root@master ~]# kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
ingress3-ingress-nginx-admission   1          43m
[root@master ~]# kubectl delete -A ValidatingWebhookConfiguration ingress3-ingress-nginx-admission
validatingwebhookconfiguration.admissionregistration.k8s.io "ingress3-ingress-nginx-admission" deleted
List the validating webhooks first, delete the ingress-nginx one, and re-applying the test file then succeeds.
8.
Symptom: the helm 3 binary is unusable; even helm list fails. The errors are:
[root@master ~]# helm list
Error: Kubernetes cluster unreachable
[root@master ~]# helm repo list
Error: no repositories to show
Solution:
Set the environment variable: export KUBECONFIG=<your kubeconfig file>. My config file is /opt/kubernetes/cfg/bootstrap.kubeconfig, so:
export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig
Once the environment variable is set, helm works normally.
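Since an export only lives in the current shell, it may be worth persisting it, for example (assuming bash and the same file path):

echo 'export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig' >> ~/.bash_profile
source ~/.bash_profile
helm list        # should now reach the cluster in any new shell as well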
9.
Symptom: running kubectl get nodes on the master shows only node2; worker node node1 is missing.
[root@master cfg]# k get no -owide
NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-master   NotReady   <none>   60m   v1.18.3   192.168.217.16   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node2    NotReady   <none>   17m   v1.18.3   192.168.217.18   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
On node1, the system log shows the following:
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.037318   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.138272   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.239285   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.340365   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.441356   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.542351   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.643332   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.744277   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.845217   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.946301   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.047337   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.593145   21865 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "k8s-node1" is forbidden: User "system:node:k8s-node2" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
Solution:
The log above points to the kubelet service: its configuration was written incorrectly. It should have said node1 but was written as node2, and node2 had already registered successfully under that name.
The concrete fix is to correct the node name in the kubelet configuration on node1, delete the certificate kubelet-client-current.pem there, and restart the kubelet service.
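The steps on node1 look roughly like this (a sketch; the certificate path is an assumption based on a typical binary-install layout, so adjust it to wherever kubelet keeps its client certs in your setup):

# on node1: fix the node name (hostname-override) in the kubelet config first,
# then remove the client cert issued under the wrong name and restart kubelet
rm -f /opt/kubernetes/ssl/kubelet-client-current.pem     # assumed path
systemctl restart kubelet
# back on the master: approve the new CSR if your bootstrap setup requires it
kubectl get csr

After the restart, the master can list the node again: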
[root@master cfg]# k get no -owide
NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-master   NotReady   <none>   64m   v1.18.3   192.168.217.16   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node1    NotReady   <none>   9s    v1.18.3   192.168.217.17   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node2    NotReady   <none>   20m   v1.18.3   192.168.217.18   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
10. Error when creating a PV
Symptom:
Creating the PV reports the following error:
[root@master mysql]# k apply -f mysql-pv.yaml
The PersistentVolume "mysql-pv" is invalid: nodeAffinity: Invalid value: core.VolumeNodeAffinity{Required:(*core.NodeSelector)(0xc002d9bf20)}: field is immutable
The PV status looks like this (note that the PV is in the wrong state: Available):
[root@master mysql]# k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
mysql-pv   15Gi       RWO            Delete           Available           local-storage            12m
Cause analysis:
The PV manifest is as follows:
[root@master mysql]# cat mysql-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  capacity:
    storage: 15Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:                          # this is a Local Persistent Volume
    path: /mnt/mysql-data         # the local disk path backing the PV
  nodeAffinity:                   # node affinity
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-node1             # must be placed on node1
The values entry had been changed after the PV was first created; in other words, the PV already existed, and assigning a new node-affinity value fails because the field is immutable.
Solution:
Delete the previously created PV and apply the manifest again.
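The sequence is simply delete-and-recreate, something like:

kubectl delete pv mysql-pv          # remove the old PV (its nodeAffinity is immutable)
kubectl apply -f mysql-pv.yaml      # recreate it with the new values
kubectl get pv,pvc                  # verify it binds to the claim

After re-applying, the PV and PVC are both healthy: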
[root@master mysql]# k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS    REASON   AGE
mysql-pv   15Gi       RWO            Delete           Bound    default/mysql-pv-claim   local-storage            9m5s
[root@master mysql]# k get pvc
NAME             STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
mysql-pv-claim   Bound    mysql-pv   15Gi       RWO            local-storage   57m
11.
Symptom: creating an Ingress resource fails with the following error:
[root@master ~]# k apply -f ingress-http.yaml
Error from server (InternalError): error when creating "ingress-http.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: context deadline exceeded
vim ingress-http.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-http
  namespace: dev
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: nginx.test.com
    http:
      paths:
      - path: /
        backend:
          serviceName: nginx-service
          servicePort: 80
  - host: tomcat.test.com
    http:
      paths:
      - path: /
        backend:
          serviceName: tomcat-service
          servicePort: 80
Cause analysis: thinking back, the service port at the end of this file had been changed — it was 8080 at first and was later changed to 80 — but the Ingress could not simply be updated, because creating it goes through an admission webhook that had been created implicitly. The solution is therefore to delete the ValidatingWebhookConfiguration.
Solution:
First list the ValidatingWebhookConfigurations:
kubectl get ValidatingWebhookConfiguration
Then delete the offending one:
kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission
12.
Symptom:
Pods will not start, and both kube-proxy and kube-controller-manager report errors; the system log is a sea of red:
kube-proxy and kube-controller-manager on the master node:
Oct 4 19:11:00 master kube-controller-manager: E1004 19:11:00.930275 22282 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
The kubelet log on node2:
Oct 4 19:11:10 node2 kubelet: E1004 19:11:10.285790 31170 pod_workers.go:191] Error syncing pod 84a93201-5bee-4a40-85c1-b581c1faefa7 ("calico-kube-controllers-57546b46d6-sf26n_kube-system(84a93201-5bee-4a40-85c1-b581c1faefa7)"), skipping: failed to "KillPodSandbox" for "84a93201-5bee-4a40-85c1-b581c1faefa7" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-sf26n_kube-system\" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout"
The kube-proxy log on node2:
E1004 18:54:49.280262 16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://192.168.217.16:6443/api/v1/services?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64344&timeout=6m50s&timeoutSeconds=410&watch=true: dial tcp 192.168.217.16:6443: connect: connection refused
E1004 18:54:49.280330 16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Endpoints: Get https://192.168.217.16:6443/api/v1/endpoints?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64886&timeout=8m58s&timeoutSeconds=538&watch=true: dial tcp 192.168.217.16:6443: connect: connection r
Essentially no pod is healthy and calico-node keeps restarting; only the pods on the master are fine, everything else is not:
[root@k8s-master cfg]# k get po -A -owide
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-57546b46d6-sf26n   0/1     ContainerCreating   0          12m     <none>           k8s-node2    <none>           <none>
kube-system   calico-node-fskfk                          0/1     CrashLoopBackOff    7          12m     192.168.217.18   k8s-node2    <none>           <none>
kube-system   calico-node-gbv9d                          1/1     Running             0          12m     192.168.217.16   k8s-master   <none>           <none>
kube-system   calico-node-vb88h                          0/1     Error               7          12m     192.168.217.17   k8s-node1    <none>           <none>
kube-system   coredns-76648cbfc9-8f45v                   1/1     Running             2          3h36m   10.244.235.193   k8s-master   <none>           <none>
Events of a failing pod:
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned kube-system/calico-kube-controllers-57546b46d6-sf26n to k8s-node2
Warning FailedCreatePodSandBox 1s kubelet, k8s-node2 Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to set up pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout, failed to clean up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout]
Normal SandboxChanged 0s kubelet, k8s-node2 Pod sandbox changed, it will be killed and re-created.
Troubleshooting process:
On node2, telnet 10.0.0.1 443 indeed fails. I checked that the firewall was off and that the apiserver service looked basically normal, then restarted every service on every node — the errors persisted.
I then went through the configuration files of each service looking for a misconfiguration, and there it was:
[root@k8s-master cfg]# grep -r -i "10.244" ./
./calico.yaml:                 value: "10.244.0.0/16"
./kube-flannel.yml:      "Network": "10.244.0.0/16",
./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
./kube-proxy-config.yml:clusterCIDR: 10.0.0.0/16
The kube-proxy clusterCIDR was 10.0.0.0/16, different from the controller-manager's 10.244.0.0/16. So fix it (edit kube-proxy-config.yml) and restart the kube-proxy and kubelet services — literally a one-field difference!
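The fix itself is a one-line edit plus restarts on every node running kube-proxy, roughly like this (the cfg path is an assumption based on where the grep above was run; adjust it if yours differs):

# correct the clusterCIDR in the kube-proxy config, then restart the affected services
sed -i 's#clusterCIDR: 10.0.0.0/16#clusterCIDR: 10.244.0.0/16#' /opt/kubernetes/cfg/kube-proxy-config.yml
systemctl restart kube-proxy kubelet

After the change, the grep output is consistent: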
[root@k8s-master cfg]# grep -r -i "10.244" ./
./calico.yaml:                 value: "10.244.0.0/16"
./kube-flannel.yml:      "Network": "10.244.0.0/16",
./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
./kube-proxy-config.yml:clusterCIDR: 10.244.0.0/16
The logs are clean again and the pods are back to normal:
[root@k8s-master cfg]# k get po -A -owide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-57546b46d6-9xt7k   1/1     Running   0          32m     10.244.36.66     k8s-node1    <none>           <none>
kube-system   calico-node-5tzdt                          1/1     Running   0          40m     192.168.217.16   k8s-master   <none>           <none>
kube-system   calico-node-pllx6                          1/1     Running   4          40m     192.168.217.17   k8s-node1    <none>           <none>
kube-system   calico-node-tpjc9                          1/1     Running   4          40m     192.168.217.18   k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-8f45v                   1/1     Running   2          4h18m   10.244.235.193   k8s-master   <none>           <none>
The whole world is finally quiet again!
13.
Symptom:
After deploying Metrics Server, the pod starts normally but kubectl top node cannot retrieve anything and reports:
[root@k8s-master kis]# k top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
The kube-controller-manager service reports the following:
[root@k8s-master kis]# systemctl status kube-controller-manager.service -l
● kube-controller-manager.service - Kubernetes Controller Manager
   Loaded: loaded (/usr/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-10-04 22:01:55 CST; 35min ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 757 (kube-controller)
   Memory: 115.1M
   CGroup: /system.slice/kube-controller-manager.service
           └─757 /opt/kubernetes/bin/kube-controller-manager --logtostderr=false --v=2 --log-dir=/opt/kubernetes/logs --leader-elect=true --master=127.0.0.1:8080 --bind-address=127.0.0.1 --allocate-node-cidrs=true --cluster-cidr=10.244.0.0/16 --service-cluster-ip-range=10.0.0.0/16 --cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem --cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem --root-ca-file=/opt/kubernetes/ssl/ca.pem --service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem --experimental-cluster-signing-duration=87600h0m0s

Oct 04 22:32:23 k8s-master kube-controller-manager[757]: E1004 22:32:23.089005     757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Oct 04 22:32:53 k8s-master kube-controller-manager[757]: E1004 22:32:53.842126     757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Several other services report similar errors, and the system log looks like this (a wall of red):
Oct 4 23:20:03 master kube-apiserver: E1004 23:20:03.661075 27870 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
Oct 4 23:20:03 master kubelet: E1004 23:20:03.665400 1343 cni.go:385] Error deleting kube-system_calico-kube-controllers-57546b46d6-zq5ds/a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502 from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: Unauthorized
Oct 4 23:20:03 master kubelet: E1004 23:20:03.666450 1343 remote_runtime.go:128] StopPodSandbox "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-zq5ds_kube-system" network: error getting ClusterInformation: connection is unauthorized: Unauthorized
Oct 4 23:20:03 master kubelet: E1004 23:20:03.666506 1343 kuberuntime_manager.go:895] Failed to stop sandbox {"docker" "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502"}
Oct 4 23:20:03 master kubelet: E1004 23:20:03.666567 1343 kuberuntime_manager.go:674] killPodWithSyncResult failed: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Oct 4 23:20:03 master kubelet: E1004 23:20:03.666598 1343 pod_workers.go:191] Error syncing pod 556b1eeb-27a4-4b3d-bbc5-6bc9b172dced ("calico-kube-controllers-57546b46d6-zq5ds_kube-system(556b1eeb-27a4-4b3d-bbc5-6bc9b172dced)"), skipping: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.082579376+08:00" level=info msg="shim reaped" id=82ad6e61e556d31761bef3ebb390519e747baf37a5bb10c614cac79746ec1600
Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.093751545+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.663889369+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878/shim.sock" debug=false pid=8400
Oct 4 23:20:04 master systemd: Started libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
Oct 4 23:20:04 master systemd: Starting libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.929696386+08:00" level=info msg="shim reaped" id=25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878
Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.940353322+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 4 23:20:05 master dockerd: time="2022-10-04T23:20:05.689201068+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146/shim.sock" debug=false pid=8470
Oct 4 23:20:05 master systemd: Started libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
Oct 4 23:20:05 master systemd: Starting libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
Oct 4 23:20:10 master kube-apiserver: E1004 23:20:10.456368 27870 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: Get https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Metrics Server is deployed on node2; pinging that pod's IP from the master fails.
Solution:
Given the logs above, the prime suspect is the calico network plugin — pinging pod IPs on the other nodes from the master all failed — so I checked the calico deployment file and found it was using the vxlan backend:
typha_service_name: "none"
# Configure the backend to use.
calico_backend: "vxlan"
So change calico_backend to "bird" and redeploy. (Calico's BGP mode can also leave pods on different nodes isolated from each other and unable to ping, and switching the calico network to CrossSubnet mode can fix that problem too; alternatively, deploying Metrics Server on the same node as the master, i.e. the apiserver, is a barely-acceptable workaround — I tried it and it does work, though whether it holds up long term is another matter.)
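Concretely, that is a one-line change in the calico-config ConfigMap inside calico.yaml followed by a re-apply (a sketch):

# in the calico-config ConfigMap within calico.yaml
calico_backend: "bird"      # was "vxlan"

# then re-apply and watch the calico pods come back up
kubectl apply -f calico.yaml
kubectl get po -n kube-system -w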
14.
Background:
The etcd cluster felt sluggish and the whole Kubernetes cluster kept stuttering; nothing ran smoothly. While investigating another issue I opened the system log and spotted a streak of red:
etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6
A closely related excerpt of the log (taken on 192.168.217.20, which is currently the leader):
Nov 1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520)
Nov 1 22:55:27 master2 etcd: server is likely overloaded
Nov 1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.235065ms, to 3d70d11f824a5d8f)
Nov 1 22:55:27 master2 etcd: server is likely overloaded
Nov 1 22:55:36 master2 etcd: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:483" took too long (102.762895ms) to execute
Nov 1 22:55:42 master2 etcd: request "header:<ID:17663263095897001410 > txn:<compare:<target:MOD key:\"/registry/masterleases/192.168.217.19\" mod_revision:369490 > success:<request_put:<key:\"/registry/masterleases/192.168.217.19\" value_size:70 lease:5373221187761963523 >> failure:<>>" with result "size:18" took too long (122.623655ms) to execute
Nov 1 22:55:42 master2 etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6" took too long (103.679383ms) to execute
Analysis:
The key line is this one:
It says a heartbeat could not be sent out on time — the 100 ms window was exceeded — to member f5b8cb45a0dcf520, which, according to the member list below, is etcd-3 at 192.168.217.21.
master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520
[root@master2 ~]# etct_serch member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 |      false |
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
So the cause of this problem is that the etcd cluster's heartbeat interval and election timeout were set too low for this environment.
Fix:
Modify the etcd configuration on all three nodes (the defaults for heartbeat-interval and election-timeout are 100 ms and 1000 ms respectively), then restart every etcd service:
[root@master2 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/server.pem \
  --peer-key-file=/opt/etcd/ssl/server-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem --wal-dir=/var/lib/etcd \      # WAL log path
  --snapshot-count=50000 \                      # number of committed transactions that trigger a snapshot to disk, releasing the WAL; default 100000
  --auto-compaction-retention=1 \               # first compaction after 1 hour; subsequent compactions every 10% of that, i.e. roughly every 6 minutes
  --auto-compaction-mode=periodic \             # periodic compaction
  --max-request-bytes=$((10*1024*1024)) \       # maximum request size in bytes; a key defaults to 1.5 MB, the official recommended maximum is 10 MB
  --quota-backend-bytes=$((8*1024*1024*1024)) \
  --heartbeat-interval="5000" \
  --election-timeout="25000"
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
A brief summary:
Because this Kubernetes deployment runs etcd and the apiserver on the same nodes, and it is a three-master setup, the two end up competing for network resources — that is the fundamental cause of this symptom. In real production, the etcd cluster is best deployed on its own nodes, separate from the apiservers; with that, the two default values should normally be sufficient.
Network stutter should ultimately be traced back to its root cause; tuning these parameters is only a stopgap. If you do set them, keeping the election timeout about five times the heartbeat interval is enough.
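If you prefer to keep these settings in /opt/etcd/cfg/etcd.conf rather than on the command line, etcd exposes the same two knobs as environment variables (values in milliseconds, matching the flags above):

ETCD_HEARTBEAT_INTERVAL="5000"
ETCD_ELECTION_TIMEOUT="25000"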
15.
Background:
Installing the flannel plugin fails and the pod never reaches the Running state. The cluster is minikube:
Error registering network: failed to acquire lease: node "node3" pod cidr not assigned
Analysis:
The error means no pod CIDR is assigned, so the plugin cannot come up. minikube uses kubeadm under the hood, so its static pod manifests live in the same place as kubeadm's: /etc/kubernetes/manifests. Checking the kube-controller-manager manifest confirms there is indeed no pod CIDR defined there, while the flannel I deployed uses its default 10.244.0.0/16 network.
Strangely, the initialization command did specify the CIDR, yet the manifest does not contain it:
The initialization command:
minikube start \
  --extra-config=controller-manager.allocate-node-cidrs=true \
  --extra-config=controller-manager.cluster-cidr=10.244.0.0/16 \
  --extra-config=kubelet.network-plugin=cni \
  --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
  --network-plugin=cni \
  --kubernetes-version=1.18.8 \
  --vm-driver=none
Solution:
Edit the /etc/kubernetes/manifests/kube-controller-manager.yaml manifest and add the following three lines:
    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --service-cluster-ip-range=10.96.0.0/12
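After the controller-manager static pod restarts, one way to confirm the node really received a pod CIDR is:

kubectl get node node3 -o jsonpath='{.spec.podCIDR}{"\n"}'
# expected output: something like 10.244.0.0/24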
On closer inspection, the initialization command above was itself flawed. A correct initialization looks more like the following; this configuration installs flannel directly, which saves applying its YAML manifest separately:
minikube start pod-network-cidr='10.244.0.0/16' \
  --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
  --network-plugin=cni \
  --image-repository='registry.aliyuncs.com/google_containers' \
  --cni=flannel \
  --apiserver-ips=192.168.217.23 \
  --kubernetes-version=1.18.8 \
  --vm-driver=none
After a short wait, redeploy flannel; it comes back to normal:
[root@node3 manifests]# kubectl get po -A -owide
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-66bff467f8-9glkl        0/1     Running   0          58m     10.244.0.3       node3   <none>           <none>
kube-system   etcd-node3                      1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>
kube-system   kube-apiserver-node3            1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>
kube-system   kube-controller-manager-node3   1/1     Running   0          15m     192.168.217.23   node3   <none>           <none>
kube-system   kube-flannel-ds-thjml           1/1     Running   9          80m     192.168.217.23   node3   <none>           <none>
kube-system   kube-proxy-j6j8c                1/1     Running   0          6m57s   192.168.217.23   node3   <none>           <none>
kube-system   kube-scheduler-node3            1/1     Running   0          11m     192.168.217.23   node3   <none>           <none>
kube-system   storage-provisioner             1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>
The flannel pod's log:
[root@node3 manifests]# kubectl logs kube-flannel-ds-thjml -n kube-system
...(earlier output omitted)...
I1102 12:25:02.177604       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
I1102 12:25:02.178853       1 iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
I1102 12:25:02.276100       1 iptables.go:155] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
I1102 12:25:02.276598       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1102 12:25:02.375648       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I1102 12:25:02.379296       1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/24 -j RETURN
I1102 12:25:02.476040       1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
The flannel plugin is working again: the flannel.1 virtual NIC has appeared and the subnet configuration file has been generated automatically:
[root@node3 manifests]# ls /run/flannel/subnet.env
/run/flannel/subnet.env
[root@node3 manifests]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:70:12:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.217.23/24 brd 192.168.217.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe70:1212/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:59:55:e5:7f brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:59ff:fe55:e57f/64 scope link
       valid_lft forever preferred_lft forever
12: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    link/ether 2e:b4:f1:da:9b:d9 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 brd 10.244.0.0 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::2cb4:f1ff:feda:9bd9/64 scope link
       valid_lft forever preferred_lft forever
Problem solved!
16.
Background:
Checking the pods shows that coredns is unhealthy: it is in the Running state but not actually usable (0/1 Ready):
[root@node3 ~]# kubectl get po -A -owide
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-7ff77c879f-55z6k   0/1     Running   0          5m49s   10.244.0.4   node3   <none>           <none>
Check the pod's log:
[INFO] plugin/ready: Still waiting on: "kubernetes"
E1102 15:04:04.718632       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:05.719791       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:06.721905       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:07.723040       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:08.724991       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
The key part: dial tcp 10.96.0.1:443: connect: no route to host
Analysis:
Install telnet and try that IP and port:
[root@node3 ~]# telnet 10.96.0.1 443
Trying 10.96.0.1...
Connected to 10.96.0.1.
Escape character is '^]'.
That suggests the port is reachable and open, which is rather puzzling.
[root@node3 ~]# curl -k https://10.96.0.1:443
curl: (7) Failed connect to 10.96.0.1:443; Connection refused
The curl above effectively times out — it hangs for tens of seconds before producing that error — which looks very much like a firewall problem.
Solution:
Stop the firewall and curl again: the connection now works. (The endpoint expects a client certificate and I did not present one, so the request is rejected with 403 Forbidden, but it is reachable — only a permissions issue remains.)
[root@node3 ~]# systemctl stop firewalld
[root@node3 ~]# curl -k https://10.96.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}
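If the firewall is not actually needed on the node, it is worth making the change permanent (or, alternatively, keep firewalld and open the required ports instead):

systemctl disable --now firewalld    # stop it now and keep it off across reboots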
Check the pod again; it has recovered:
[root@node3 ~]# kubectl get po -A -owide
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-7ff77c879f-55z6k   1/1     Running   0          21m   10.244.0.4   node3   <none>           <none>
The pod's log also looks normal now:
I1102 15:13:35.813427       1 trace.go:116] Trace[1852186258]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:08.790440794 +0000 UTC m=+301.345800939) (total time: 27.022935709s):
Trace[1852186258]: [27.022876523s] [27.022876523s] Objects listed
I1102 15:13:35.813725       1 trace.go:116] Trace[1616138287]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:24.782953268 +0000 UTC m=+317.338313414) (total time: 11.030727079s):
Trace[1616138287]: [11.030681851s] [11.030681851s] Objects listed
A functional test of coredns passes as well:
[root@node3 ~]# kubectl run -it --image busybox:1.28.3 dns-test --restart=Never --rm
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
Fixed!
17.
Background:
Deploying with a DaemonSet: kubectl apply -f keeps failing with the following error:
root@k8s-master:~# kubectl apply -f 4.yaml
error: error validating "4.yaml": error validating data: [ValidationError(DaemonSet.status): missing required field "currentNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberMisscheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "desiredNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberReady" in io.k8s.api.apps.v1.DaemonSetStatus]; if you choose to ignore these errors, turn validation off with --validate=false
Analysis:
The error message is fairly explicit: it complains about DaemonSet.status. Opening the manifest, there is indeed a status: {} at the end. The full file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  creationTimestamp: null
  labels:
    app: nginx
  name: nginx
  namespace: project-tiger
spec:
  selector:
    matchLabels:
      app: nginx
  #strategy: {}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: httpd:2.4-alpine
        name: nginx
        resources: {}
status: {}
Solution:
Per the error message, either add the listed required fields (currentNumberScheduled, numberMisscheduled, desiredNumberScheduled, numberReady), or delete status: {}.
Obviously there is no need to describe status in a deployment manifest, so simply delete status: {}.
A small summary:
This DaemonSet was adapted from a Deployment template, which is usually generated by a kubectl command; the status: {} is auto-generated boilerplate. Just delete it — like an appendix, it serves no purpose here.
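That is typically what happens when the manifest starts life as a generated Deployment template, for example something like this (an illustrative command, not necessarily the exact one used here):

kubectl create deployment nginx --image=httpd:2.4-alpine --dry-run=client -o yaml > 4.yaml
# then 'kind: Deployment' is hand-edited to 'kind: DaemonSet', and the leftover
# replicas/strategy/status fields have to be removed by hand as well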
18.
The kubelet service reports errors
Captured from the system log:
Jan 20 13:44:18 k8s-master kubelet[1210]: E0120 13:44:18.882767    1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
Jan 20 13:44:26 k8s-master kubelet[1210]: E0120 13:44:26.799795    1210 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
Jan 20 13:44:28 k8s-master kubelet[1210]: E0120 13:44:28.906733    1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
Analysis:
When kubelet starts, it gathers resource statistics for system services, and that requires the corresponding accounting options to be enabled in systemd.
Solution:
Modify 10-kubeadm.conf, the kubelet drop-in file, and add the following two lines:
CPUAccounting=true
MemoryAccounting=true
The complete file then looks like this:
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
CPUAccounting=true
MemoryAccounting=true
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
Restart kubelet and the error messages are gone.
Restart the kubelet service:
systemctl daemon-reload && systemctl restart kubelet
Check the system log again; it is back to normal:
" Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638874 62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-proxy\" (UniqueName: \"kubernetes.io/configmap/6decd6cc-a931-46b9-92fe-b3b1a03f9ea4-kube-proxy\") pod \"kube-proxy-5nj6l\" (UID: \"6decd6cc-a931-46b9-92fe-b3b1a03f9ea4\") " Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638918 62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-local-net-dir\" (UniqueName: \"kubernetes.io/host-path/5ef5e743-ee71-4c80-a543-76e18a232a45-host-local-net-dir\") pod \"calico-node-4l4ll\" (UID: \"5ef5e743-ee71-4c80-a543-76e18a232a45\") " Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638951 62687 reconciler.go:157] "Reconciler: start to sync state" Jan 20 13:45:36 k8s-master kubelet[62687]: I0120 13:45:36.601380 62687 request.go:665] Waited for 1.151321922s due to client-side throttling, not priority and fair
19.
Installing the NFS storage plugin fails with the following error:
Mounting arguments: -t nfs 192.168.123.11:/data/nfs-sc /var/lib/kubelet/pods/4a0ead87-4932-4a9a-9fc0-2b89aac94b1a/volumes/kubernetes.io~nfs/nfs-client-root
Output: mount: wrong fs type, bad option, bad superblock on 192.168.123.11:/data/nfs-sc,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try dmesg | tail or so.
Analysis:
The NFS mount helper program is missing on the node.
Solution:
yum install nfs-utils -y
There are three nodes and only the master had it installed — nfs-utils must be installed on the other nodes as well!
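A quick way to confirm each node is ready (run on every node; the server address is the one from the error above):

yum install -y nfs-utils             # provides the /sbin/mount.nfs helper
showmount -e 192.168.123.11          # should list /data/nfs-sc if the NFS export is reachable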