Today I opened the Kubernetes Dashboard and noticed a red block, as shown below:
Then I found the following error from the command line:
```
[root@k8s0 server]# kubectl get all -n kube-system
NAME                                           READY   STATUS    RESTARTS   AGE
pod/calico-kube-controllers-798cc86c47-k6x4g   1/1     Running   0          30m
pod/calico-node-cttlt                          1/1     Running   0          30m
pod/calico-node-mnp54                          1/1     Running   0          30m
pod/calico-node-smvvn                          0/1     Running   0          30m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   3         3         2       3            2           kubernetes.io/os=linux   30m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           30m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/calico-kube-controllers-798cc86c47   1         1         1       30m
[root@k8s0 server]#
```
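In a larger namespace it can be handy to filter for pods whose READY column (e.g. `0/1`) shows missing containers. A minimal sketch, run here against a captured sample of the output above; on a live cluster you would pipe `kubectl get pods -n kube-system` in instead:

```shell
# Captured sample of `kubectl get pods`-style output (pod name, READY, STATUS, RESTARTS, AGE).
sample='pod/calico-node-mnp54   1/1   Running   0   30m
pod/calico-node-smvvn   0/1   Running   0   30m'

# Print only pods whose READY column starts with "0/" (no container ready).
not_ready=$(echo "$sample" | awk '$2 ~ /^0\// {print $1}')
echo "$not_ready"   # prints: pod/calico-node-smvvn
```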
Running `kubectl describe pod/calico-node-smvvn -n kube-system` revealed the following error:
```
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.3.xxx,192.168.3.xxx
```
Next, I ran `kubectl logs -f calico-node-smvvn -n kube-system` (to inspect the problematic node) and saw the following log entries:
```
2022-11-21 12:20:58.373 [INFO][98] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface nerdctl0: 10.4.0.1/24
2022-11-21 12:21:47.330 [INFO][97] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.1s: avg=5ms longest=10ms (resync-filter-v4)
2022-11-21 12:21:58.375 [INFO][98] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface nerdctl0: 10.4.0.1/24
2022-11-21 12:22:50.288 [INFO][97] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m3s: avg=6ms longest=13ms (resync-filter-v4)
2022-11-21 12:22:58.376 [INFO][98] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface nerdctl0: 10.4.0.1/24
2022-11-21 12:23:52.746 [INFO][97] felix/summary.go 100: Summarising 7 dataplane reconciliation loops over 1m2.5s: avg=3ms longest=3ms (resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
```
Then I ran `kubectl logs -f calico-node-mnp54 -n kube-system` (to inspect a healthy node); its logs look like this:
```
2022-11-21 12:22:58.963 [INFO][94] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface enp2s0f0: 192.168.3.102/24
2022-11-21 12:23:54.232 [INFO][96] felix/summary.go 100: Summarising 7 dataplane reconciliation loops over 1m3.9s: avg=3ms longest=3ms (resync-ipsets-v4)
2022-11-21 12:23:58.966 [INFO][94] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface enp2s0f0: 192.168.3.102/24
2022-11-21 12:24:57.809 [INFO][96] felix/summary.go 100: Summarising 8 dataplane reconciliation loops over 1m3.6s: avg=6ms longest=19ms (resync-filter-v4,resync-mangle-v4,resync-nat-v4)
```
- Notice that the two pods report different subnets: 10.4.0.1/24 on the broken node versus 192.168.3.102/24 on the healthy one.
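The mismatch is easy to check mechanically. A minimal sketch that compares the /24 prefixes of two IPv4 addresses by stripping the last octet (it assumes /24 masks, which matches both networks seen in the logs above):

```shell
# Return success if two IPv4 addresses share the same /24 prefix.
# Assumes dotted-quad addresses with a /24 mask.
same_subnet24() {
  [ "${1%.*}" = "${2%.*}" ]   # ${x%.*} drops the last octet
}

same_subnet24 10.4.0.1 192.168.3.102 && echo same || echo different   # → different
```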
Now run `ip addr` to inspect the interfaces on the node hosting the problematic pod:
```
[root@k8s0 server]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:bf:2b:20:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.100/24 brd 192.168.3.255 scope global noprefixroute enp2s0f0
       valid_lft forever preferred_lft forever
    inet 192.168.3.250/24 scope global secondary enp2s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::c44d:4c26:e656:ae28/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:bf:2b:20:6b brd ff:ff:ff:ff:ff:ff
4: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 172.17.144.64/32 scope global tunl0
       valid_lft forever preferred_lft forever
6: nerdctl0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 6e:c2:85:5f:0f:20 brd ff:ff:ff:ff:ff:ff
    inet 10.4.0.1/24 brd 10.4.0.255 scope global nerdctl0
       valid_lft forever preferred_lft forever
    inet6 fe80::6cc2:85ff:fe5f:f20/64 scope link
       valid_lft forever preferred_lft forever
```
- We can see that on this node, the problematic pod picked up the nerdctl0 bridge created by buildkitd, which is wrong.
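To see all the candidate interface/address pairs that autodetection has to choose from, the one-line form of `ip` is convenient: on a live node you could run `ip -o -4 addr show | awk '{print $2, $4}'`. A sketch of the same extraction against a captured sample of the output above:

```shell
# Captured sample of `ip -o -4 addr show`-style output (index, interface, "inet", address).
sample='2: enp2s0f0 inet 192.168.3.100/24
4: tunl0 inet 172.17.144.64/32
6: nerdctl0 inet 10.4.0.1/24'

# Keep only the interface name ($2) and its IPv4 address ($4).
candidates=$(echo "$sample" | awk '{print $2, $4}')
echo "$candidates"
# prints:
#   enp2s0f0 192.168.3.100/24
#   tunl0 172.17.144.64/32
#   nerdctl0 10.4.0.1/24
```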
Now we know the cause: a conflict between network interfaces during autodetection. Opening the calico.yaml file, we can see:
```yaml
# Auto-detect the BGP IP address.
- name: IP
  value: "autodetect"
```
- So the BGP IP is obtained by autodetection.
Solution:
Change Calico's configuration by setting the IP_AUTODETECTION_METHOD environment variable to a specific interface (in my environment the NIC is enp2s0f0), as shown below:
```
[root@k8s0 cni]# kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=enp2s0f0
daemonset.apps/calico-node env updated
```
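The same change can also be made persistent in calico.yaml itself. A sketch of the relevant env entries in the calico-node container spec (the rest of the file is elided; check the method syntax against your Calico version's docs):

```yaml
# In the calico-node container's env list in calico.yaml:
- name: IP
  value: "autodetect"
- name: IP_AUTODETECTION_METHOD   # restrict autodetection to one NIC
  value: "interface=enp2s0f0"
```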
Problem solved
We can see that all three calico-node pods are back to normal:
The root cause in my case: on the Kubernetes cluster I had also installed buildkitd for building images, and it automatically created a network interface, which made Calico's autodetection pick the wrong address. The lesson is that it's best not to install extra services on a Kubernetes cluster (this cluster doesn't need to build images anyway; pulling them is enough).
One more note: the fix above pins a specific interface name, which is not ideal. When nodes have multiple NICs, the IP is not necessarily bound to the same interface name on every node, and in that case interface-based detection would fail again. So the better fix is to remove the extra service rather than change Calico's configuration.
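If removing the extra service is not an option, Calico also documents autodetection methods that avoid pinning a single NIC name. A sketch of the alternatives (verify the exact method names and availability against your Calico version's documentation; the addresses below are from this environment):

```yaml
- name: IP_AUTODETECTION_METHOD
  value: "skip-interface=nerdctl0"    # exclude the buildkitd bridge by name
# other documented methods:
#   value: "can-reach=192.168.3.1"    # pick the NIC that can reach this address
#   value: "cidr=192.168.3.0/24"      # newer releases: pick the NIC in this subnet
```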