Preface:

Four VMs in VMware, with IPs 192.168.217.19/20/21/22, run a highly available kubernetes cluster deployed with kubeadm. The cluster uses an external etcd cluster: etcd runs on 19, 20, and 21, the masters are also 19, 20, and 21, and 22 is the worker node.

The detailed configuration and installation steps are in the previous article: 云原生|kubernetes|kubeadm部署高可用集群(二)---kube-apiserver高可用+etcd外部集群+haproxy+keepalived_晚风_END的博客-CSDN博客

Due to a mis-operation (well, not exactly a mis-operation, just a badly used ionice), server 19 is completely dead. A snapshot exists, but rather than restoring it, let's see how to get the lost master node back.

Check on server 20:
[root@master2 ~]# kubectl get po -A -owide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-796cc7f49d-5fz47   1/1     Running   3 (88m ago)    2d5h    10.244.166.146   node1     <none>           <none>
kube-system   calico-node-49qd6                          1/1     Running   4 (29h ago)    2d5h    192.168.217.19   master1   <none>           <none>
kube-system   calico-node-l9kbj                          1/1     Running   3 (88m ago)    2d5h    192.168.217.21   master3   <none>           <none>
kube-system   calico-node-nsknc                          1/1     Running   3 (88m ago)    2d5h    192.168.217.20   master2   <none>           <none>
kube-system   calico-node-pd8v2                          1/1     Running   6 (88m ago)    2d5h    192.168.217.22   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7c85v                   1/1     Running   15 (88m ago)   5d16h   10.244.166.143   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-h9wtb                   1/1     Running   15 (88m ago)   5d16h   10.244.166.144   node1     <none>           <none>
kube-system   kube-apiserver-master1                     1/1     Running   19 (29h ago)   5d18h   192.168.217.19   master1   <none>           <none>
kube-system   kube-apiserver-master2                     1/1     Running   1 (88m ago)    16h     192.168.217.20   master2   <none>           <none>
kube-system   kube-apiserver-master3                     1/1     Running   1 (88m ago)    16h     192.168.217.21   master3   <none>           <none>
kube-system   kube-controller-manager-master1            1/1     Running   14             4d2h    192.168.217.19   master1   <none>           <none>
kube-system   kube-controller-manager-master2            1/1     Running   12 (88m ago)   4d4h    192.168.217.20   master2   <none>           <none>
kube-system   kube-controller-manager-master3            1/1     Running   13 (88m ago)   4d4h    192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-69w6c                           1/1     Running   2 (29h ago)    2d5h    192.168.217.19   master1   <none>           <none>
kube-system   kube-proxy-vtz99                           1/1     Running   4 (88m ago)    2d6h    192.168.217.22   node1     <none>           <none>
kube-system   kube-proxy-wldcc                           1/1     Running   4 (88m ago)    2d6h    192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-x6w6l                           1/1     Running   4 (88m ago)    2d6h    192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master1                     1/1     Running   11 (29h ago)   4d2h    192.168.217.19   master1   <none>           <none>
kube-system   kube-scheduler-master2                     1/1     Running   11 (88m ago)   4d4h    192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master3                     1/1     Running   10 (88m ago)   4d4h    192.168.217.21   master3   <none>           <none>
kube-system   metrics-server-55b9b69769-j9c7j            1/1     Running   1 (88m ago)    16h     10.244.166.147   node1     <none>           <none>
Check the etcd status:

Here an alias etcd_search is defined for the etcdctl command with all of its certificates (on server 20):
alias etcd_search="ETCDCTL_API=3 \
/opt/etcd/bin/etcdctl \
--endpoints=https://192.168.217.19:2379,https://192.168.217.20:2379,https://192.168.217.21:2379 \
--cacert=/opt/etcd/ssl/ca.pem \
--cert=/opt/etcd/ssl/server.pem \
--key=/opt/etcd/ssl/server-key.pem"
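An alias defined at the prompt disappears when the shell exits. A minimal sketch for persisting it (assuming bash and root's ~/.bashrc; the paths are the same as above):

cat >> /root/.bashrc <<'EOF'
# etcdctl with the cluster endpoints and TLS certificates baked in
alias etcd_search="ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
--endpoints=https://192.168.217.19:2379,https://192.168.217.20:2379,https://192.168.217.21:2379 \
--cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem"
EOF
source /root/.bashrc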
The failed etcd node 19 has already been kicked out of the member list. Checking the cluster status still reports an error, because the endpoints in the alias still include the unreachable 19:
[root@master2 ~]# etcd_search member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
[root@master2 ~]# etcd_search endpoint status -w table
{"level":"warn","ts":"2022-11-01T17:13:39.004+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://192.168.217.19:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.217.19:2379: connect: no route to host\""}
Failed to get the status of endpoint https://192.168.217.19:2379 (context deadline exceeded)
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.217.20:2379 | ef2fee107aafca91 |   3.4.9 |  4.0 MB |      true |      false |       229 |     453336 |             453336 |        |
| https://192.168.217.21:2379 | f5b8cb45a0dcf520 |   3.4.9 |  4.0 MB |     false |      false |       229 |     453336 |             453336 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The etcd configuration file (the working one, on server 20):
[root@master2 ~]# cat /opt/etcd/cfg/etcd.conf
#[Member]
ETCD_NAME="etcd-2"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.20:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.20:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.20:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.20:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
The etcd systemd unit (the working one, on server 20):
[root@master2 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
--cert-file=/opt/etcd/ssl/server.pem \
--key-file=/opt/etcd/ssl/server-key.pem \
--peer-cert-file=/opt/etcd/ssl/server.pem \
--peer-key-file=/opt/etcd/ssl/server-key.pem \
--trusted-ca-file=/opt/etcd/ssl/ca.pem \
--peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \
--wal-dir=/var/lib/etcd \                      # WAL (write-ahead log) directory
--snapshot-count=50000 \                       # number of committed transactions that triggers a snapshot to disk, freeing the WAL; default 100000
--auto-compaction-retention=1 \                # first compaction after 1 hour; subsequent compactions every 10% of that, i.e. every 6 minutes
--auto-compaction-mode=periodic \              # periodic compaction
--max-request-bytes=$((10*1024*1024)) \        # maximum request size; a key defaults to 1.5 MB, the officially recommended maximum is 10 MB
--quota-backend-bytes=$((8*1024*1024*1024)) \
--heartbeat-interval="500" \
--election-timeout="1000"
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
I. Recovery plan

From the information above, the etcd cluster needs to be brought back to three members. The failed node 19 will be rebuilt as a fresh VM keeping its original IP, so the etcd certificates do not need to be regenerated; the node only has to join the existing etcd cluster.

Because the IP is unchanged and etcd is an external cluster, the second step is equally simple: once the etcd cluster has recovered, run kubeadm reset on each node and rejoin them all, and the whole cluster is restored.
II. Recovering the etcd cluster

1. Remove the damaged member

The remove command (the hex string is the ID reported by etcdctl member list):
etcd_search member remove 97c1c1003e0d4bf
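If you would rather not copy the ID by hand, the lookup can be scripted. A small sketch, assuming the default comma-separated output of member list (not the -w table form) and that the dead member still shows up under the name etcd-1:

# grab the ID of the member named etcd-1, then remove it
DEAD_ID=$(etcd_search member list | awk -F', ' '$3 == "etcd-1" {print $1}')
etcd_search member remove "$DEAD_ID"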
2. Copy the etcd files to node 19

Copy the configuration file and the programs from a healthy etcd node to node 19 (/opt/etcd holds the cluster certificates, the two executables, and the main configuration file; the configuration file is adjusted in the next step):
scp -r /opt/etcd 192.168.217.19:/opt/
scp /usr/lib/systemd/system/etcd.service 192.168.217.19:/usr/lib/systemd/system/
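A quick sanity check (assuming the same ssh access used for the scp above) that everything landed on node 19:

ssh 192.168.217.19 "ls -l /opt/etcd/bin /opt/etcd/cfg /opt/etcd/ssl /usr/lib/systemd/system/etcd.service"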
3. Edit the etcd configuration file on node 19

The changes are the IP addresses, the ETCD_NAME, and ETCD_INITIAL_CLUSTER_STATE, which becomes "existing"; compare with the configuration file above.
[root@master ~]# cat !$
cat /opt/etcd/cfg/etcd.conf
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.19:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.19:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.19:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.19:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
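For reference, the same edit can be scripted. A sketch with GNU sed, assuming the file is the verbatim copy from node 20; the IP substitution is deliberately limited to the listen/advertise lines so that node 20's address inside ETCD_INITIAL_CLUSTER stays untouched:

# rename the node, repoint its listen/advertise URLs to 19, mark the cluster state as existing
sed -i \
  -e 's/^ETCD_NAME=.*/ETCD_NAME="etcd-1"/' \
  -e '/^ETCD_\(LISTEN\|ADVERTISE\|INITIAL_ADVERTISE\)/s/192\.168\.217\.20/192.168.217.19/' \
  -e 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' \
  /opt/etcd/cfg/etcd.conf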
4. Register the new member on server 20

Run the member add command on server 20. This step is essential:
etcd_search member add etcd-1 --peer-urls=https://192.168.217.19:2380
5. Restart etcd

Restart the etcd service on all three nodes:
systemctl daemon-reload && systemctl restart etcd
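The new member has to catch up on the raft log before it reports healthy, so this can take a moment. A small sketch that polls with the etcd_search alias defined earlier (run it at an interactive prompt where the alias exists):

# etcdctl endpoint health exits non-zero while any endpoint is unhealthy
until etcd_search endpoint health >/dev/null 2>&1; do
  echo "waiting for the etcd cluster to become healthy..."
  sleep 5
done
etcd_search endpoint status -w table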
6. Logs of the etcd cluster recovery

The system log on server 20 shows the new node joining and the subsequent leader election. The relevant entries are excerpted below (the ID beginning with 3d is the new node, 192.168.217.19):
Nov  1 22:41:47 master2 etcd: raft2022/11/01 22:41:47 INFO: ef2fee107aafca91 switched to configuration voters=(4427268366965300623 17235256053515405969 17706125434919122208)
Nov  1 22:41:47 master2 etcd: added member 3d70d11f824a5d8f [https://192.168.217.19:2380] to cluster b459890bbabfc99f
Nov  1 22:41:47 master2 etcd: starting peer 3d70d11f824a5d8f...
Nov  1 22:41:47 master2 etcd: started HTTP pipelining with peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
Nov  1 22:41:47 master2 etcd: started peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: added peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream Message reader)
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:41:47 master2 etcd: peer 3d70d11f824a5d8f became active
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:41:47 master2 etcd: ef2fee107aafca91 initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message writer)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 writer)
Nov  1 22:41:47 master2 etcd: published {Name:etcd-2 ClientURLs:[https://192.168.217.20:2379]} to cluster b459890bbabfc99f
Nov  1 22:41:47 master2 etcd: ready to serve client requests
Nov  1 22:41:47 master2 systemd: Started Etcd Server.
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [term 639] received MsgTimeoutNow from 3d70d11f824a5d8f and starts an election to get leadership.
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became candidate at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from ef2fee107aafca91 at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to f5b8cb45a0dcf520 at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 lost leader 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 has received 2 MsgVoteResp votes and 0 vote rejections
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became leader at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 elected leader ef2fee107aafca91 at term 640
Nov  1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)
7. Testing the etcd cluster

Server 20 is now the etcd leader, which matches the election seen in the log above.
[root@master2 ~]# etcd_search member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 |      false |
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
[root@master2 ~]# etcd_search endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.217.19:2379 | 3d70d11f824a5d8f |   3.4.9 |  4.0 MB |     false |      false |       640 |     494999 |             494999 |        |
| https://192.168.217.20:2379 | ef2fee107aafca91 |   3.4.9 |  4.1 MB |      true |      false |       640 |     495000 |             495000 |        |
| https://192.168.217.21:2379 | f5b8cb45a0dcf520 |   3.4.9 |  4.0 MB |     false |      false |       640 |     495000 |             495000 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@master2 ~]# etcd_search endpoint health -w table
+-----------------------------+--------+-------------+-------+
|          ENDPOINT           | HEALTH |    TOOK     | ERROR |
+-----------------------------+--------+-------------+-------+
| https://192.168.217.19:2379 |   true | 26.210272ms |       |
| https://192.168.217.20:2379 |   true | 26.710558ms |       |
| https://192.168.217.21:2379 |   true | 27.903774ms |       |
+-----------------------------+--------+-------------+-------+
III. Recovering the kubernetes nodes

Since master1 of the three masters died, haproxy and keepalived have to be reinstalled on it. There is not much to say here: copy the configuration files over from the healthy node 20 and adjust them.

Likewise, reinstall kubelet, kubeadm, and kubectl, as sketched below.
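A sketch of the reinstall, assuming the CentOS yum repository set up in the original deployment guide and pinning to the cluster's existing version:

# match the version the rest of the cluster runs (kubernetesVersion: 1.22.2)
yum install -y kubelet-1.22.2 kubeadm-1.22.2 kubectl-1.22.2
systemctl enable kubelet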
Then reset all four nodes with kubeadm reset -f and re-initialize. The init file is as follows (the deployment guide linked at the top of this article covers the installation details, so they are not repeated here):
[root@master ~]# cat kubeadm-init-ha.yaml
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: "0"
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.217.19
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  imagePullPolicy: IfNotPresent
  name: master1
  taints: null
---
controlPlaneEndpoint: "192.168.217.100"
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
  external:
    endpoints:    # the addresses of the external etcd cluster
    - https://192.168.217.19:2379
    - https://192.168.217.20:2379
    - https://192.168.217.21:2379
    caFile: /etc/kubernetes/pki/etcd/ca.pem
    certFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client.pem
    keyFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client-key.pem
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: 1.22.2
networking:
  dnsDomain: cluster.local
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
scheduler: {}
The init command (run on server 19, since that is the server written into the init configuration):
kubeadm init --config=kubeadm-init-ha.yaml --upload-certs
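When the init finishes it prints join commands for the remaining nodes. They have the shape below; the token, hash, and certificate key are placeholders, use the values from your own init output:

# on masters 20 and 21 (control-plane join):
kubeadm join 192.168.217.100:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <certificate-key>

# on worker 22:
kubeadm join 192.168.217.100:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>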
Because an external etcd cluster is used, node recovery is fairly simple; reinstall the network plugin and the other add-ons and the result looks like this:

Note that some pods, such as kube-apiserver, show freshly reset ages, while pods like kube-proxy keep their old ages. Both are consequences of the cluster state living in the external etcd cluster.
[root@master ~]# kubectl get po -A -owide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS        AGE     IP               NODE      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-796cc7f49d-k586w   1/1     Running   0               153m    10.244.166.131   node1     <none>           <none>
kube-system   calico-node-7x86d                          1/1     Running   0               153m    192.168.217.21   master3   <none>           <none>
kube-system   calico-node-dhxcq                          1/1     Running   0               153m    192.168.217.19   master1   <none>           <none>
kube-system   calico-node-jcq6p                          1/1     Running   0               153m    192.168.217.20   master2   <none>           <none>
kube-system   calico-node-vjtv6                          1/1     Running   0               153m    192.168.217.22   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7c85v                   1/1     Running   16              5d23h   10.244.166.129   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7xm62                   1/1     Running   0               152m    10.244.166.132   node1     <none>           <none>
kube-system   kube-apiserver-master1                     1/1     Running   0               107m    192.168.217.19   master1   <none>           <none>
kube-system   kube-apiserver-master2                     1/1     Running   0               108m    192.168.217.20   master2   <none>           <none>
kube-system   kube-apiserver-master3                     1/1     Running   1 (107m ago)    107m    192.168.217.21   master3   <none>           <none>
kube-system   kube-controller-manager-master1            1/1     Running   5 (108m ago)    3h26m   192.168.217.19   master1   <none>           <none>
kube-system   kube-controller-manager-master2            1/1     Running   15 (108m ago)   4d10h   192.168.217.20   master2   <none>           <none>
kube-system   kube-controller-manager-master3            1/1     Running   15              4d10h   192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-69w6c                           1/1     Running   2 (131m ago)    2d12h   192.168.217.19   master1   <none>           <none>
kube-system   kube-proxy-vtz99                           1/1     Running   5               2d12h   192.168.217.22   node1     <none>           <none>
kube-system   kube-proxy-wldcc                           1/1     Running   5               2d12h   192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-x6w6l                           1/1     Running   5               2d12h   192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master1                     1/1     Running   4 (108m ago)    3h26m   192.168.217.19   master1   <none>           <none>
kube-system   kube-scheduler-master2                     1/1     Running   14 (108m ago)   4d10h   192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master3                     1/1     Running   13 (46m ago)    4d10h   192.168.217.21   master3   <none>           <none>
kube-system   metrics-server-55b9b69769-j9c7j            1/1     Running   13 (127m ago)   23h     10.244.166.130   node1     <none>           <none>
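As a final sanity check (not captured above), all four nodes should report Ready:

kubectl get nodes -o wide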