Cloud Native | A Collection of Kubernetes Deployment and Operations Errors (updated from time to time)

1.

The etcd version installed here is 3.4; with etcd 3.3 the configuration below would not have triggered any error.

Querying the etcd status fails with: conflicting environment variable "ETCD_NAME" is shadowed by corresponding command-line flag (either unset environment variable or disable flag)

Roughly translated: the environment variable ETCD_NAME conflicts with, and is shadowed by, the corresponding command-line flag; the fix is to either unset the environment variable or drop the flag.

So where is this environment variable set?

The systemd startup script:

[root@master bin]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
        --initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
        --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
        --advertise-client-urls http://${THIS_IP}:${THIS_PORT_API} \
        --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
        --initial-cluster ${CLUSTER} \
        --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN} \
        --cert-file=/opt/etcd/ssl/server.pem \
        --key-file=/opt/etcd/ssl/server-key.pem \
        --peer-cert-file=/opt/etcd/ssl/server.pem \
        --peer-key-file=/opt/etcd/ssl/server-key.pem \
        --trusted-ca-file=/opt/etcd/ssl/ca.pem \
        --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target

The etcd configuration file:

[root@master bin]# cat  /opt/etcd/cfg/etcd.conf 
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.16:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.16:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.16:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.16:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.16:2380,etcd-2=https://192.168.217.17:2380,etcd-3=https://192.168.217.18:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

The root cause: etcd 3.4 wants each of these settings from only one place. Both the unit file and the configuration file carry the initial-* bootstrap settings (and the listen/advertise URLs), so they conflict; etcd 3.3 and earlier tolerated the duplication. Deleting the duplicated flags from the startup script fixes it, i.e. these lines:

 --initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
        --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
        --advertise-client-urls http://${THIS_IP}:${THIS_PORT_API} \
        --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
        --initial-cluster ${CLUSTER} \
        --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}

After deleting them, restart the service and etcd returns to normal.

[root@master bin]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-08-25 17:25:08 CST; 2h 31min ago
 Main PID: 3998 (etcd)
   Memory: 43.7M
   CGroup: /system.slice/etcd.service
           └─3998 /opt/etcd/bin/etcd --cert-file=/opt/etcd/ssl/server.pem --key-file=/opt/etcd/ssl/server-key.pem --peer-cert-file=/opt/etcd/ssl/server.pem --peer-key-file=/opt/etc...
Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 became follower at term 81
Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 [logterm: 1, index: 3, vote: 0] cast MsgVote for e078026890aff6e3 [logterm: 2, index: 5] at term 81
Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: raft.node: 1a58a86408898c44 elected leader e078026890aff6e3 at term 81
Aug 25 17:25:08 master etcd[3998]: published {Name:etcd-1 ClientURLs:[https://192.168.217.16:2379]} to cluster e4c1916e49e5defc
Aug 25 17:25:08 master etcd[3998]: ready to serve client requests
Aug 25 17:25:08 master systemd[1]: Started Etcd Server.
Aug 25 17:25:08 master etcd[3998]: serving client requests on 192.168.217.16:2379
Aug 25 17:25:08 master etcd[3998]: set the initial cluster version to 3.4
Aug 25 17:25:08 master etcd[3998]: enabled capabilities for version 3.4
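
For reference, the restart sequence plus a quick health check on each node can look like this (the endpoint and certificate paths follow the configuration shown above; adjust the IP per node):

systemctl daemon-reload && systemctl restart etcd
ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
    --endpoints=https://192.168.217.16:2379 \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    endpoint health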

2. The coredns pod keeps going into CrashLoopBackOff

The pod log shows the error: /etc/coredns/Corefile:3 - Error during parsing: Unknown directive 'proxy'

The coredns ConfigMap contains:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        log
        health
        kubernetes cluster.local 10.254.0.0/18
        proxy . /etc/resolv.conf
        cache 30
    }

Change the second-to-last line of the server block to forward . /etc/resolv.conf, then delete the coredns pod so it gets recreated; the error disappears and the pod returns to Running.
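
A quick way to apply the change from the command line (the k8s-app=kube-dns label is the conventional coredns selector and may differ in your manifest):

kubectl -n kube-system edit configmap coredns      # change "proxy . /etc/resolv.conf" to "forward . /etc/resolv.conf"
kubectl -n kube-system delete pod -l k8s-app=kube-dns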

3.

kubectl exec -it <pod-name> fails with: error: unable to upgrade connection: Forbidden (user=k8s-apiserver, verb=create, resource=nodes, sub

The cause is that the user named in the parentheses, k8s-apiserver, lacks the permission; other users can hit the same thing, so just look at whatever follows user= there. Binding that user to cluster-admin, the cluster administrator role, fixes it:

kubectl create clusterrolebinding k8s-apiserver   --clusterrole=cluster-admin   --user=k8s-apiserver

If full administrator rights are more than you want to hand out for this, the built-in system:kubelet-api-admin ClusterRole, which only covers the kubelet API that exec/logs/attach go through, is a narrower alternative:

kubectl create clusterrolebinding k8s-apiserver --clusterrole=system:kubelet-api-admin --user=k8s-apiserver
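
To confirm the binding took effect, kubectl auth can-i can impersonate the user; treating the truncated subresource in the error message as nodes/proxy is an assumption here:

kubectl auth can-i create nodes/proxy --as=k8s-apiserver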

4. The kubelet service starts and looks fine at first, then stops after a minute or two.

Checking the kubelet service shows it has failed, and the system log /var/log/messages contains:

F0827 15:18:26.995457   29538 server.go:274] failed to run Kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"

This says the cgroup driver Docker uses and the one defined for the kubelet are different, so the kubelet cannot run (naturally this only happens with a binary installation; kubeadm aligns the two automatically).

The kubelet configuration file:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
address: 0.0.0.0
port: 10250
readOnlyPort: 10255
cgroupDriver: cgroupfs
clusterDNS:
  - 10.0.0.2

The Docker configuration file:

[root@slave1 ~]# cat /etc/docker/daemon.json 
{
  "registry-mirrors": ["http://bc437cce.m.daocloud.io"],
  "exec-opts":["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}

Either side can be modified, as long as the two end up identical. For example, change the Docker config to "exec-opts":["native.cgroupdriver=cgroupfs"] and restart the Docker service, or set cgroupDriver: systemd in the kubelet config and restart the kubelet. What matters is that both sides agree.
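
For example, if Docker is the side being changed, the sequence could be as simple as this sketch (pick whichever side is easier to touch in your environment):

# switch Docker to cgroupfs so it matches the kubelet
sed -i 's/native.cgroupdriver=systemd/native.cgroupdriver=cgroupfs/' /etc/docker/daemon.json
systemctl restart docker
systemctl restart kubelet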

5.

Symptom:

The cluster status looks normal and the nodes look normal, but the kube-flannel pods flip between Running and CrashLoopBackOff.

[root@master ~]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running            2          23h   10.244.0.12      k8s-master   <none>           <none>
default       dns-test                   0/1     Error              1          27h   <none>           k8s-node1    <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running            2          29h   10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-4b5vg     0/1     Completed          1          29h   <none>           k8s-node1    <none>           <none>
default       nginx1                     0/1     Error              1          27h   <none>           k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-lb75g   0/1     Completed          1          24h   <none>           k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      0/1     CrashLoopBackOff   11         29h   192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      0/1     CrashLoopBackOff   11         29h   192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv      1/1     Running            4          29h   192.168.217.16   k8s-master   <none>           <none>

Approach:

Everything on the master is healthy, so some service must differ between the master and the worker nodes. Checking the key services shows that kube-proxy is not running on the worker nodes. Start it and delete the unhealthy pods (deleting is optional, it just shortens the wait):

Run on node1 and node2:

systemctl start kube-proxy
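
If kube-proxy runs as a systemd unit, as here, enabling it at the same time keeps it from going missing again after a reboot (optional):

systemctl enable --now kube-proxy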

On the master, just wait; the restart count increments to 12 on its own and everything recovers:

[root@master ~]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running            2          23h   10.244.0.12      k8s-master   <none>           <none>
default       dns-test                   0/1     ImagePullBackOff   1          27h   10.244.1.7       k8s-node1    <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running            2          29h   10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-4b5vg     1/1     Running            2          29h   10.244.1.6       k8s-node1    <none>           <none>
default       nginx1                     1/1     Running            2          27h   10.244.2.10      k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-lb75g   1/1     Running            2          24h   10.244.2.11      k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      1/1     Running            12         30h   192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      1/1     Running            12         30h   192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv      1/1     Running            4          30h   192.168.217.16   k8s-master   <none>           <none>

6.

Symptom:

After the cluster was built, logging in to the dashboard with a token works, but logging in with a config file does not; the top-right corner of the dashboard shows clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "kubelet-bootstrap" cannot list resouces and no resources are displayed.

Fix:

Check the clusterrolebinding:

[root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
NAME                ROLE                                   AGE
kubelet-bootstrap   ClusterRole/system:node-bootstrapper   16s

This user does not have admin rights, so rebind it to cluster-admin:

[root@master ~]# kubectl delete clusterrolebindings kubelet-bootstrap
clusterrolebinding.rbac.authorization.k8s.io "kubelet-bootstrap" deleted
[root@master ~]# kubectl create clusterrolebinding kubelet-bootstrap --clusterrole=cluster-admin  --user=kubelet-bootstrap
clusterrolebinding.rbac.authorization.k8s.io/kubelet-bootstrap created

Check again:

[root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
NAME                ROLE                        AGE
kubelet-bootstrap   ClusterRole/cluster-admin   4m38s

A quick takeaway: start by checking the permissions of the user named in the error; if it is not an admin, granting it the right role resolves the problem immediately.

7.

Symptom:

Testing ingress-nginx fails; the test manifest is:

[root@master ~]# cat ingress-nginx.yaml 
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: "nginx"
  name: example
spec:
  rules: # one Ingress can define multiple rules
  - host: foo.bar.com # optional; omit to match *, or use a wildcard such as *.bar.com
    http:
      paths: # equivalent to an nginx location; one host can carry multiple paths
      - backend:
          serviceName: svc-demo  # the Service to proxy to
          servicePort: 8080 # the Service port
        path: /

Applying the file:

[root@master ~]# k apply -f ingress-nginx.yaml 
Error from server (InternalError): error when creating "ingress-nginx.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress3-ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: EOF

Fix:

[root@master ~]# kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
ingress3-ingress-nginx-admission   1          43m
[root@master ~]# kubectl delete -A ValidatingWebhookConfiguration ingress3-ingress-nginx-admission
validatingwebhookconfiguration.admissionregistration.k8s.io "ingress3-ingress-nginx-admission" deleted

List the validating webhook first, then delete it; re-applying the test manifest then succeeds.

8.

Symptom: helm 3 is unusable; even helm list fails. The errors are:

[root@master ~]# helm list
Error: Kubernetes cluster unreachable
[root@master ~]# helm repo list
Error: no repositories to show

Fix:

Set the KUBECONFIG environment variable to your config file. Mine is /opt/kubernetes/cfg/bootstrap.kubeconfig, so:

export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig

With the variable set, helm works normally.
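
To keep the setting across new shells, it can be appended to the shell profile (assuming bash; the path is the one above):

echo 'export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig' >> ~/.bashrc
source ~/.bashrc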

9.

Symptom: on the master, kubectl get nodes does not show worker node1; only node2 is visible.

[root@master cfg]# k get no -owide
NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-master   NotReady   <none>   60m   v1.18.3   192.168.217.16   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node2    NotReady   <none>   17m   v1.18.3   192.168.217.18   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7

On node1, the system log shows:

Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.037318   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.138272   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.239285   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.340365   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.441356   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.542351   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.643332   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.744277   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.845217   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.946301   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.047337   21865 kubelet.go:2267] node "k8s-node1" not found
Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.593145   21865 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "k8s-node1" is forbidden: User "system:node:k8s-node2" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node

Fix:

The log points at the kubelet: its configuration file was written with the wrong node name, node2 instead of node1, and node2 had already registered under that name.

The concrete fix is to correct the name, delete the kubelet-client-current.pem certificate on node1, and restart the kubelet (a sketch of the steps follows the listing below); the master can then list the node normally again:

[root@master cfg]# k get no -owide
NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-master   NotReady   <none>   64m   v1.18.3   192.168.217.16   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node1    NotReady   <none>   9s    v1.18.3   192.168.217.17   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
k8s-node2    NotReady   <none>   20m   v1.18.3   192.168.217.18   <none>        CentOS Linux 7 (Core)   5.16.9-1.el7.elrepo.x86_64   docker://20.10.7
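
A sketch of the recovery steps on node1 (the file paths and the hostname fix via sed are assumptions based on the /opt/kubernetes layout used elsewhere in this article):

# correct the node name in the kubelet configuration
sed -i 's/k8s-node2/k8s-node1/g' /opt/kubernetes/cfg/kubelet.conf
# drop the certificate issued under the wrong name and re-bootstrap
rm -f /opt/kubernetes/ssl/kubelet-client-current.pem
systemctl restart kubelet
# on the master: approve the fresh CSR if it stays Pending
kubectl get csr
kubectl certificate approve <csr-name>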

10. Error when creating a PV

Symptom:

Creating the PV fails with:

[root@master mysql]# k apply -f mysql-pv.yaml 
The PersistentVolume "mysql-pv" is invalid: nodeAffinity: Invalid value: core.VolumeNodeAffinity{Required:(*core.NodeSelector)(0xc002d9bf20)}: field is immutable

The PV status looks like this (note the state: it is stuck at Available):

[root@master mysql]# k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
mysql-pv   15Gi       RWO            Delete           Available           local-storage            12m

Analysis:

The PV manifest is:

[root@master mysql]# cat mysql-pv.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  capacity:
    storage: 15Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:       # marks this as a Local Persistent Volume
    path: /mnt/mysql-data  # local disk path backing the PV
  nodeAffinity:     # node affinity
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - k8s-node1   # must land on node1

The values entry had been changed after the PV was already created once; a PV's nodeAffinity is immutable, so applying it again with a different value fails.

Fix:

Delete the previously created PV and re-apply the manifest (see the sketch after the listings below); the PV and PVC then both look healthy:

[root@master mysql]# k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS    REASON   AGE
mysql-pv   15Gi       RWO            Delete           Bound    default/mysql-pv-claim   local-storage            9m5s
[root@master mysql]# k get pvc
NAME             STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
mysql-pv-claim   Bound    mysql-pv   15Gi       RWO            local-storage   57m
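
The delete-and-recreate step referenced above is just two commands (a sketch; the file name is the one used earlier):

kubectl delete pv mysql-pv
kubectl apply -f mysql-pv.yaml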

11.

Symptom: creating an Ingress resource fails with:

[root@master ~]# k apply -f ingress-http.yaml 
Error from server (InternalError): error when creating "ingress-http.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: context deadline exceeded

vim ingress-http.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-http
  namespace: dev
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: nginx.test.com
    http:
      paths:
      - path: /
        backend:
          serviceName: nginx-service
          servicePort: 80
  - host: tomcat.test.com
    http:
      paths:
      - path: /
        backend:
          serviceName: tomcat-service
          servicePort: 80

Analysis: recalling the history of this file, the backend port at the end had been changed, first 8080, later 80. The Ingress cannot simply be re-applied, though, because Ingress creation goes through an admission webhook that was created implicitly when the controller was installed, and that webhook is what is failing, so the fix is to delete the ValidatingWebhookConfiguration.

Fix:

First list the ValidatingWebhookConfiguration objects:

kubectl get ValidatingWebhookConfiguration

Then delete the offending one:

kubectl delete -A  ValidatingWebhookConfiguration ingress-nginx-admission

12.

Symptom:

Pods will not start, kube-proxy and kube-controller-manager both report errors, and the system log is a genuine sea of red.

On the master, kube-controller-manager:

Oct  4 19:11:00 master kube-controller-manager: E1004 19:11:00.930275   22282 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource

The kubelet log on node2:

Oct  4 19:11:10 node2 kubelet: E1004 19:11:10.285790   31170 pod_workers.go:191] Error syncing pod 84a93201-5bee-4a40-85c1-b581c1faefa7 ("calico-kube-controllers-57546b46d6-sf26n_kube-system(84a93201-5bee-4a40-85c1-b581c1faefa7)"), skipping: failed to "KillPodSandbox" for "84a93201-5bee-4a40-85c1-b581c1faefa7" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-sf26n_kube-system\" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout"

 

The kube-proxy log on node2:

E1004 18:54:49.280262   16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://192.168.217.16:6443/api/v1/services?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64344&timeout=6m50s&timeoutSeconds=410&watch=true: dial tcp 192.168.217.16:6443: connect: connection refused
E1004 18:54:49.280330   16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Endpoints: Get https://192.168.217.16:6443/api/v1/endpoints?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64886&timeout=8m58s&timeoutSeconds=538&watch=true: dial tcp 192.168.217.16:6443: connect: connection r

None of the pods are healthy and calico-node keeps restarting; only the master's pods are normal, everything else is not:

[root@k8s-master cfg]# k get po -A -owide
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-57546b46d6-sf26n   0/1     ContainerCreating   0          12m     <none>           k8s-node2    <none>           <none>
kube-system   calico-node-fskfk                          0/1     CrashLoopBackOff    7          12m     192.168.217.18   k8s-node2    <none>           <none>
kube-system   calico-node-gbv9d                          1/1     Running             0          12m     192.168.217.16   k8s-master   <none>           <none>
kube-system   calico-node-vb88h                          0/1     Error               7          12m     192.168.217.17   k8s-node1    <none>           <none>
kube-system   coredns-76648cbfc9-8f45v                   1/1     Running             2          3h36m   10.244.235.193   k8s-master   <none>           <none>

Events from one of the failing pods:

 ----     ------                  ----       ----                -------
 Normal   Scheduled               <unknown>  default-scheduler   Successfully assigned kube-system/calico-kube-controllers-57546b46d6-sf26n to k8s-node2
 Warning  FailedCreatePodSandBox  1s         kubelet, k8s-node2  Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to set up pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout, failed to clean up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout]
 Normal   SandboxChanged          0s         kubelet, k8s-node2  Pod sandbox changed, it will be killed and re-created.

 

Troubleshooting:

On node2, telnet 10.0.0.1 443 indeed fails. The firewall is checked (it is off) and the apiserver looks basically fine; restarting every service on every node still leaves the errors.

Digging through each service's configuration files for a misconfiguration finally turns it up:

[root@k8s-master cfg]# grep -r -i "10.244" ./
./calico.yaml:              value: "10.244.0.0/16"
./kube-flannel.yml:      "Network": "10.244.0.0/16",
./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
./kube-proxy-config.yml:clusterCIDR: 10.0.0.0/16

kube-proxy's clusterCIDR is 10.0.0.0/16, which differs from the controller-manager's 10.244.0.0/16. Change kube-proxy-config.yml to match and restart kube-proxy and kubelet (truly a one-field difference!):

[root@k8s-master cfg]# grep -r -i "10.244" ./
./calico.yaml:              value: "10.244.0.0/16"
./kube-flannel.yml:      "Network": "10.244.0.0/16",
./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
./kube-proxy-config.yml:clusterCIDR: 10.244.0.0/16
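
On each node, the restart plus a re-check of the service VIP (the same test used during troubleshooting) might look like:

systemctl restart kube-proxy kubelet
telnet 10.0.0.1 443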

The logs are clean again and the pods are healthy:

[root@k8s-master cfg]# k get po -A -owide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-57546b46d6-9xt7k   1/1     Running   0          32m     10.244.36.66     k8s-node1    <none>           <none>
kube-system   calico-node-5tzdt                          1/1     Running   0          40m     192.168.217.16   k8s-master   <none>           <none>
kube-system   calico-node-pllx6                          1/1     Running   4          40m     192.168.217.17   k8s-node1    <none>           <none>
kube-system   calico-node-tpjc9                          1/1     Running   4          40m     192.168.217.18   k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-8f45v                   1/1     Running   2          4h18m   10.244.235.193   k8s-master   <none>           <none>

The whole world is finally quiet again!

13.

Symptom:

After deploying the Metrics Server the pod starts normally, but kubectl top node cannot retrieve any information:

[root@k8s-master kis]# k top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

The kube-controller-manager service shows errors:

[root@k8s-master kis]# systemctl status kube-controller-manager.service -l
● kube-controller-manager.service - Kubernetes Controller Manager
   Loaded: loaded (/usr/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-10-04 22:01:55 CST; 35min ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 757 (kube-controller)
   Memory: 115.1M
   CGroup: /system.slice/kube-controller-manager.service
           └─757 /opt/kubernetes/bin/kube-controller-manager --logtostderr=false --v=2 --log-dir=/opt/kubernetes/logs --leader-elect=true --master=127.0.0.1:8080 --bind-address=127.0.0.1 --allocate-node-cidrs=true --cluster-cidr=10.244.0.0/16 --service-cluster-ip-range=10.0.0.0/16 --cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem --cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem --root-ca-file=/opt/kubernetes/ssl/ca.pem --service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem --experimental-cluster-signing-duration=87600h0m0s
Oct 04 22:32:23 k8s-master kube-controller-manager[757]: E1004 22:32:23.089005     757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Oct 04 22:32:53 k8s-master kube-controller-manager[757]: E1004 22:32:53.842126     757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

A few other services log similar errors, and the system log is once again a wall of red:

Oct  4 23:20:03 master kube-apiserver: E1004 23:20:03.661075   27870 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
Oct  4 23:20:03 master kubelet: E1004 23:20:03.665400    1343 cni.go:385] Error deleting kube-system_calico-kube-controllers-57546b46d6-zq5ds/a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502 from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: Unauthorized
Oct  4 23:20:03 master kubelet: E1004 23:20:03.666450    1343 remote_runtime.go:128] StopPodSandbox "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-zq5ds_kube-system" network: error getting ClusterInformation: connection is unauthorized: Unauthorized
Oct  4 23:20:03 master kubelet: E1004 23:20:03.666506    1343 kuberuntime_manager.go:895] Failed to stop sandbox {"docker" "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502"}
Oct  4 23:20:03 master kubelet: E1004 23:20:03.666567    1343 kuberuntime_manager.go:674] killPodWithSyncResult failed: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Oct  4 23:20:03 master kubelet: E1004 23:20:03.666598    1343 pod_workers.go:191] Error syncing pod 556b1eeb-27a4-4b3d-bbc5-6bc9b172dced ("calico-kube-controllers-57546b46d6-zq5ds_kube-system(556b1eeb-27a4-4b3d-bbc5-6bc9b172dced)"), skipping: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Oct  4 23:20:04 master dockerd: time="2022-10-04T23:20:04.082579376+08:00" level=info msg="shim reaped" id=82ad6e61e556d31761bef3ebb390519e747baf37a5bb10c614cac79746ec1600
Oct  4 23:20:04 master dockerd: time="2022-10-04T23:20:04.093751545+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct  4 23:20:04 master dockerd: time="2022-10-04T23:20:04.663889369+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878/shim.sock" debug=false pid=8400
Oct  4 23:20:04 master systemd: Started libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
Oct  4 23:20:04 master systemd: Starting libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
Oct  4 23:20:04 master dockerd: time="2022-10-04T23:20:04.929696386+08:00" level=info msg="shim reaped" id=25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878
Oct  4 23:20:04 master dockerd: time="2022-10-04T23:20:04.940353322+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct  4 23:20:05 master dockerd: time="2022-10-04T23:20:05.689201068+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146/shim.sock" debug=false pid=8470
Oct  4 23:20:05 master systemd: Started libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
Oct  4 23:20:05 master systemd: Starting libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
Oct  4 23:20:10 master kube-apiserver: E1004 23:20:10.456368   27870 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: Get https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

The Metrics Server is deployed on node2; its pod IP cannot be pinged from the master.

Fix:

From these logs the biggest suspect is the calico network plugin: pinging the pod IPs on the other nodes from the master all fail. Checking the calico deployment file shows it is using vxlan mode:

  typha_service_name: "none"
  # Configure the backend to use.
  calico_backend: "vxlan"

Change calico_backend to bird and redeploy. (Calico's BGP mode can also leave pods on different nodes unable to reach each other in some environments; switching to the CrossSubnet mode can address that too, and as a stopgap the Metrics Server can simply be co-located with the master, i.e. the apiserver. I tried that last option and it works, though it is not necessarily a long-term answer.)
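
A sketch of the redeploy step, assuming the edit was made directly in calico.yaml:

kubectl apply -f calico.yaml
kubectl -n kube-system rollout restart daemonset calico-node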

14.

Background:

The etcd cluster feels sluggish and the whole Kubernetes cluster stutters; every operation is laggy. While chasing an unrelated problem, a streak of red shows up in the system log:

etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6

A closely related slice of the log (taken from 192.168.217.20, which is currently the leader):

Nov  1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520)
Nov  1 22:55:27 master2 etcd: server is likely overloaded
Nov  1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.235065ms, to 3d70d11f824a5d8f)
Nov  1 22:55:27 master2 etcd: server is likely overloaded
Nov  1 22:55:36 master2 etcd: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:483" took too long (102.762895ms) to execute
Nov  1 22:55:42 master2 etcd: request "header:<ID:17663263095897001410 > txn:<compare:<target:MOD key:\"/registry/masterleases/192.168.217.19\" mod_revision:369490 > success:<request_put:<key:\"/registry/masterleases/192.168.217.19\" value_size:70 lease:5373221187761963523 >> failure:<>>" with result "size:18" took too long (122.623655ms) to execute
Nov  1 22:55:42 master2 etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6" took too long (103.679383ms) to execute

Analysis:

The key line is this one: a heartbeat could not be sent out within the 100 ms window; the target, f5b8cb45a0dcf520, is etcd-3 at 192.168.217.21 (see the member table below):

master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520
[root@master2 ~]# etct_serch member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 |      false |
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+

So the cause of this problem is that the etcd heartbeat interval and election timeout are set too low for this environment.

Fix:

Modify the etcd configuration on all three nodes (the defaults for heartbeat-interval and election-timeout are 100 ms and 1000 ms respectively) and restart every etcd service afterwards. Besides the two timeouts, the unit below also carries a few housekeeping flags: --wal-dir gives the write-ahead log its own path; --snapshot-count is the number of committed transactions that triggers a snapshot to disk and releases the WAL (default 100000); --auto-compaction-mode=periodic together with --auto-compaction-retention=1 compacts every hour at first and then every 10% of that period, roughly every 6 minutes; --max-request-bytes caps the request size (a single key defaults to about 1.5 MB, 10 MB is the recommended ceiling); --quota-backend-bytes sets the backend storage quota (8 GiB here):

[root@master2 ~]# cat /usr/lib/systemd/system/etcd.service 
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
        --cert-file=/opt/etcd/ssl/server.pem \
        --key-file=/opt/etcd/ssl/server-key.pem \
        --peer-cert-file=/opt/etcd/ssl/server.pem \
        --peer-key-file=/opt/etcd/ssl/server-key.pem \
        --trusted-ca-file=/opt/etcd/ssl/ca.pem \
        --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \
        --wal-dir=/var/lib/etcd \
        --snapshot-count=50000 \
        --auto-compaction-retention=1 \
        --auto-compaction-mode=periodic \
        --max-request-bytes=10485760 \
        --quota-backend-bytes=8589934592 \
        --heartbeat-interval=5000 \
        --election-timeout=25000
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
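
After the restart, etcdctl's built-in benchmark is a quick way to confirm whether the cluster now keeps up (the flags assume the same certificate layout as the unit file; check perf writes test data to the cluster, so use it with that in mind):

ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
    --endpoints=https://192.168.217.20:2379 \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    check perf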

A short summary:

This cluster co-locates etcd with the apiserver across three master nodes, which is the root of the issue: etcd and the apiserver compete for the same network resources. In real production, etcd is best run on dedicated nodes rather than mixed with the apiserver, and with a sane network the two default timeouts are normally sufficient.

Network sluggishness should ultimately be traced back to its root cause; tuning these parameters is only a stopgap. If you do change them, keeping the election timeout about five times the heartbeat interval is enough.

15.

Background:

Installing the flannel plugin fails and the pod never reaches Running; the cluster is minikube:

Error registering network: failed to acquire lease: node "node3" pod cidr not assigned

Analysis:

The error means no pod CIDR has been assigned to the node, so flannel cannot set itself up. Minikube uses kubeadm under the hood, so its static pod manifests live in the same place, /etc/kubernetes/manifests. Checking the kube-controller-manager manifest there shows that no pod CIDR is defined, while the flannel I deployed assumes the default 10.244.0.0/16 network.

Strangely, the start command did specify the CIDR, yet it never made it into the manifest:

The start command:

minikube start \
    --extra-config=controller-manager.allocate-node-cidrs=true \
    --extra-config=controller-manager.cluster-cidr=10.244.0.0/16 \
    --extra-config=kubelet.network-plugin=cni \
    --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
    --network-plugin=cni \
    --kubernetes-version=1.18.8 \
    --vm-driver=none

Fix:

Edit /etc/kubernetes/manifests/kube-controller-manager.yaml and add these three lines to the container command:

    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --service-cluster-ip-range=10.96.0.0/12
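
The kubelet re-creates the static pod by itself once the manifest changes; after the controller-manager comes back, the node should have a CIDR assigned, which can be verified with:

kubectl get node node3 -o jsonpath='{.spec.podCIDR}'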

Looking more carefully, the original start command was itself broken; a correct invocation looks like the one below. It also installs flannel as the CNI directly, saving the separate YAML:

minikube start \
    --extra-config=kubeadm.pod-network-cidr='10.244.0.0/16' \
    --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
    --network-plugin=cni \
    --image-repository='registry.aliyuncs.com/google_containers' \
    --cni=flannel \
    --apiserver-ips=192.168.217.23 \
    --kubernetes-version=1.18.8 \
    --vm-driver=none

A short while later, after redeploying flannel, it is healthy again:

[root@node3 manifests]# kubectl get po -A -owide
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-66bff467f8-9glkl        0/1     Running   0          58m     10.244.0.3       node3   <none>           <none>
kube-system   etcd-node3                      1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>
kube-system   kube-apiserver-node3            1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>
kube-system   kube-controller-manager-node3   1/1     Running   0          15m     192.168.217.23   node3   <none>           <none>
kube-system   kube-flannel-ds-thjml           1/1     Running   9          80m     192.168.217.23   node3   <none>           <none>
kube-system   kube-proxy-j6j8c                1/1     Running   0          6m57s   192.168.217.23   node3   <none>           <none>
kube-system   kube-scheduler-node3            1/1     Running   0          11m     192.168.217.23   node3   <none>           <none>
kube-system   storage-provisioner             1/1     Running   0          86m     192.168.217.23   node3   <none>           <none>

The flannel pod log:

[root@node3 manifests]# kubectl logs  kube-flannel-ds-thjml -n kube-system
...(output trimmed)
I1102 12:25:02.177604       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
I1102 12:25:02.178853       1 iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
I1102 12:25:02.276100       1 iptables.go:155] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
I1102 12:25:02.276598       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1102 12:25:02.375648       1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I1102 12:25:02.379296       1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/24 -j RETURN
I1102 12:25:02.476040       1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

The flannel plugin is working again; the flannel.1 virtual interface has appeared and the subnet config file has been generated automatically:

[root@node3 manifests]# ls  /run/flannel/subnet.env 
/run/flannel/subnet.env
[root@node3 manifests]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:70:12:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.217.23/24 brd 192.168.217.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe70:1212/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:59:55:e5:7f brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:59ff:fe55:e57f/64 scope link 
       valid_lft forever preferred_lft forever
12: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN 
    link/ether 2e:b4:f1:da:9b:d9 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 brd 10.244.0.0 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::2cb4:f1ff:feda:9bd9/64 scope link 
       valid_lft forever preferred_lft forever

The error is resolved!

16.

Background:

Checking the pods shows coredns is not right: it is in Running state but unusable:

[root@node3 ~]# kubectl get po -A -owide
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-7ff77c879f-55z6k        0/1     Running   0          5m49s   10.244.0.4       node3   <none>           <none>

The pod log:

[INFO] plugin/ready: Still waiting on: "kubernetes"
E1102 15:04:04.718632       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:05.719791       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:06.721905       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:07.723040       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1102 15:04:08.724991       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host

The key part: dial tcp 10.96.0.1:443: connect: no route to host

Analysis:

Install telnet and try the IP and port:

[root@node3 ~]# telnet 10.96.0.1 443
Trying 10.96.0.1...
Connected to 10.96.0.1.
Escape character is '^]'.

So the port itself is up and reachable, which makes this quite confusing.

[root@node3 ~]# curl -k https://10.96.0.1:443
curl: (7) Failed connect to 10.96.0.1:443; Connection refused

curl, by contrast, hangs for tens of seconds before failing with the error above, which looks very much like a firewall problem.

Fix:

Stop the firewall and curl again; it now gets through (the endpoint expects a client certificate and none is supplied, so the request is rejected, but that 403 is purely an authorization issue; the connection itself works):

[root@node3 ~]# systemctl stop firewalld
[root@node3 ~]# curl -k https://10.96.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403

再次查看pod,发现恢复正常了:

}[root@node3 ~]#kubectl get po -A -owide
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-7ff77c879f-55z6k        1/1     Running   0          21m   10.244.0.4       node3   <none>           <none>

The pod log looks normal as well:

I1102 15:13:35.813427       1 trace.go:116] Trace[1852186258]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:08.790440794 +0000 UTC m=+301.345800939) (total time: 27.022935709s):
Trace[1852186258]: [27.022876523s] [27.022876523s] Objects listed
I1102 15:13:35.813725       1 trace.go:116] Trace[1616138287]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:24.782953268 +0000 UTC m=+317.338313414) (total time: 11.030727079s):
Trace[1616138287]: [11.030681851s] [11.030681851s] Objects listed

A functional test of coredns also passes:

[root@node3 ~]# kubectl run -it --image busybox:1.28.3  dns-test --restart=Never --rm
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name:      kubernetes
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
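
Stopping firewalld only helps until the next reboot; if the firewall is not needed on these hosts, disabling it permanently (or, better, adding explicit rules for the cluster ports) prevents a repeat. A minimal sketch assuming it can simply stay off:

systemctl disable --now firewalld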

Fixed!

17.

Background:

A DaemonSet deployment cannot be applied; kubectl apply -f keeps failing with:

root@k8s-master:~# kubectl apply -f 4.yaml 
error: error validating "4.yaml": error validating data: [ValidationError(DaemonSet.status): missing required field "currentNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberMisscheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "desiredNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberReady" in io.k8s.api.apps.v1.DaemonSetStatus]; if you choose to ignore these errors, turn validation off with --validate=false

Analysis:

The message is fairly explicit: validation of DaemonSet.status fails. Opening the manifest, there is a status: {} stanza at the end. The full file:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  creationTimestamp: null
  labels:
    app: nginx
  name: nginx
  namespace: project-tiger
spec:
  selector:
    matchLabels:
      app: nginx
      #strategy: {}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: httpd:2.4-alpine
        name: nginx
        resources: {}
status: {}

Fix:

Per the error, either add the missing fields (currentNumberScheduled, numberMisscheduled, desiredNumberScheduled, numberReady, and so on) or delete status: {}.

Obviously there is no need to describe status at all when the goal is just to deploy, so deleting status: {} is the way to go.

A quick note:

This DaemonSet was adapted from a Deployment, so the manifest came out of a command-generated template, and status: {} is part of that auto-generated output. Like an appendix, it serves no purpose here and can simply be removed. A typical way such a template is produced is sketched below.
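
For reference, a template like the one above is typically produced with a dry-run create and then edited by hand (image, namespace and file name taken from this example):

# generate a Deployment template, then change kind to DaemonSet and
# drop the replicas/strategy/status fields before applying
kubectl create deployment nginx --image=httpd:2.4-alpine -n project-tiger \
    --dry-run=client -o yaml > 4.yaml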

18.

The kubelet service reports errors

Captured from the system log:

Jan 20 13:44:18 k8s-master kubelet[1210]: E0120 13:44:18.882767    1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
Jan 20 13:44:26 k8s-master kubelet[1210]: E0120 13:44:26.799795    1210 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
Jan 20 13:44:28 k8s-master kubelet[1210]: E0120 13:44:28.906733    1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"

Analysis:

When the kubelet starts, it collects node resource statistics, which requires the corresponding accounting options to be enabled in systemd.

Fix:

Edit 10-kubeadm.conf, the kubelet's systemd drop-in, and add the following two lines to its [Service] section:

CPUAccounting=true
MemoryAccounting=true

The complete drop-in then looks like this:

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
CPUAccounting=true
MemoryAccounting=true
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS

Restart the kubelet and the error messages stop. The restart command:

systemctl daemon-reload && systemctl restart kubelet

Checking the system log again, it is back to normal:

Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638874   62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-proxy\" (UniqueName: \"kubernetes.io/configmap/6decd6cc-a931-46b9-92fe-b3b1a03f9ea4-kube-proxy\") pod \"kube-proxy-5nj6l\" (UID: \"6decd6cc-a931-46b9-92fe-b3b1a03f9ea4\") "
Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638918   62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-local-net-dir\" (UniqueName: \"kubernetes.io/host-path/5ef5e743-ee71-4c80-a543-76e18a232a45-host-local-net-dir\") pod \"calico-node-4l4ll\" (UID: \"5ef5e743-ee71-4c80-a543-76e18a232a45\") "
Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638951   62687 reconciler.go:157] "Reconciler: start to sync state"
Jan 20 13:45:36 k8s-master kubelet[62687]: I0120 13:45:36.601380   62687 request.go:665] Waited for 1.151321922s due to client-side throttling, not priority and fair

19.

Installing the NFS storage provisioner fails with:

Mounting arguments: -t nfs 192.168.123.11:/data/nfs-sc /var/lib/kubelet/pods/4a0ead87-4932-4a9a-9fc0-2b89aac94b1a/volumes/kubernetes.io~nfs/nfs-client-root
Output: mount: wrong fs type, bad option, bad superblock on 192.168.123.11:/data/nfs-sc,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Analysis:

The mount helper binary is missing.

Fix:

yum install nfs-utils -y

There are three nodes and only the master had it installed; the other nodes need nfs-utils as well!
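
To verify from each node that the client tooling is in place and the export is visible (the server address is the one from the error above):

showmount -e 192.168.123.11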
