Cloud Native | Kubernetes | Recovering a lost etcd cluster node --- re-adding an etcd member, scaling the cluster back out, and re-initializing the k8s master nodes

Summary: Cloud Native | Kubernetes | Recovering a lost etcd cluster node --- re-adding an etcd member, scaling the cluster back out, and re-initializing the k8s master nodes

Preface:

Four virtual machines were installed in VMware, with IPs 192.168.217.19/20/21/22. A highly available Kubernetes cluster was deployed on them with kubeadm, using an external etcd cluster. The etcd cluster runs on 19, 20 and 21; the masters are also 19, 20 and 21; 22 is the worker node.

For the detailed configuration and installation steps, see the previous article: 云原生|kubernetes|kubeadm部署高可用集群(二)---kube-apiserver高可用+etcd外部集群+haproxy+keepalived_晚风_END的博客-CSDN博客

Due to a botched operation (well, not exactly a mistake, just a bad ionice invocation), server 19 died completely. There is a VM snapshot, but rather than restore it, let's see how to bring this lost master node back.

Check the pods on server 20:

[root@master2 ~]# kubectl get po -A -owide
NAMESPACE     NAME                                       READY   STATUS      RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-796cc7f49d-5fz47   1/1     Running     3 (88m ago)    2d5h    10.244.166.146   node1     <none>           <none>
kube-system   calico-node-49qd6                          1/1     Running     4 (29h ago)    2d5h    192.168.217.19   master1   <none>           <none>
kube-system   calico-node-l9kbj                          1/1     Running     3 (88m ago)    2d5h    192.168.217.21   master3   <none>           <none>
kube-system   calico-node-nsknc                          1/1     Running     3 (88m ago)    2d5h    192.168.217.20   master2   <none>           <none>
kube-system   calico-node-pd8v2                          1/1     Running     6 (88m ago)    2d5h    192.168.217.22   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7c85v                   1/1     Running     15 (88m ago)   5d16h   10.244.166.143   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-h9wtb                   1/1     Running     15 (88m ago)   5d16h   10.244.166.144   node1     <none>           <none>
kube-system   kube-apiserver-master1                     1/1     Running     19 (29h ago)   5d18h   192.168.217.19   master1   <none>           <none>
kube-system   kube-apiserver-master2                     1/1     Running     1 (88m ago)    16h     192.168.217.20   master2   <none>           <none>
kube-system   kube-apiserver-master3                     1/1     Running     1 (88m ago)    16h     192.168.217.21   master3   <none>           <none>
kube-system   kube-controller-manager-master1            1/1     Running     14             4d2h    192.168.217.19   master1   <none>           <none>
kube-system   kube-controller-manager-master2            1/1     Running     12 (88m ago)   4d4h    192.168.217.20   master2   <none>           <none>
kube-system   kube-controller-manager-master3            1/1     Running     13 (88m ago)   4d4h    192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-69w6c                           1/1     Running     2 (29h ago)    2d5h    192.168.217.19   master1   <none>           <none>
kube-system   kube-proxy-vtz99                           1/1     Running     4 (88m ago)    2d6h    192.168.217.22   node1     <none>           <none>
kube-system   kube-proxy-wldcc                           1/1     Running     4 (88m ago)    2d6h    192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-x6w6l                           1/1     Running     4 (88m ago)    2d6h    192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master1                     1/1     Running     11 (29h ago)   4d2h    192.168.217.19   master1   <none>           <none>
kube-system   kube-scheduler-master2                     1/1     Running     11 (88m ago)   4d4h    192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master3                     1/1     Running     10 (88m ago)   4d4h    192.168.217.21   master3   <none>           <none>
kube-system   metrics-server-55b9b69769-j9c7j            1/1     Running     1 (88m ago)    16h     10.244.166.147   node1     <none>           <none>

Check the etcd status:

First, an alias for etcdctl with all the certificates baked in (defined on server 20):

alias etct_serch="ETCDCTL_API=3 \
/opt/etcd/bin/etcdctl \
--endpoints=https://192.168.217.19:2379,https://192.168.217.20:2379,https://192.168.217.21:2379 \
--cacert=/opt/etcd/ssl/ca.pem \
--cert=/opt/etcd/ssl/server.pem \
--key=/opt/etcd/ssl/server-key.pem"

The failed etcd node 19 has already been removed from the membership, but querying the cluster status still reports an error, because the dead endpoint 19 is still listed in the alias:

[root@master2 ~]# etct_serch member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
[root@master2 ~]# etct_serch endpoint status -w table
{"level":"warn","ts":"2022-11-01T17:13:39.004+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://192.168.217.19:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.217.19:2379: connect: no route to host\""}
Failed to get the status of endpoint https://192.168.217.19:2379 (context deadline exceeded)
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.217.20:2379 | ef2fee107aafca91 |   3.4.9 |  4.0 MB |      true |      false |       229 |     453336 |             453336 |        |
| https://192.168.217.21:2379 | f5b8cb45a0dcf520 |   3.4.9 |  4.0 MB |     false |      false |       229 |     453336 |             453336 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The etcd configuration file (the working config on server 20):

[root@master2 ~]# cat  /opt/etcd/cfg/etcd.conf 
#[Member]
ETCD_NAME="etcd-2"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.20:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.20:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.20:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.20:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

The etcd systemd unit (the working unit file on server 20):

[root@master2 ~]# cat /usr/lib/systemd/system/etcd.service 
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
        --cert-file=/opt/etcd/ssl/server.pem \
        --key-file=/opt/etcd/ssl/server-key.pem \
        --peer-cert-file=/opt/etcd/ssl/server.pem \
        --peer-key-file=/opt/etcd/ssl/server-key.pem \
        --trusted-ca-file=/opt/etcd/ssl/ca.pem \
        --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \
        --wal-dir=/var/lib/etcd \
        --snapshot-count=50000 \
        --auto-compaction-retention=1 \
        --auto-compaction-mode=periodic \
        --max-request-bytes=10485760 \
        --quota-backend-bytes=8589934592 \
        --heartbeat-interval="500" \
        --election-timeout="1000"
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target

A few notes on the tuning flags: --wal-dir sets the write-ahead-log directory; --snapshot-count is the number of committed transactions that triggers a snapshot to disk and releases the WAL (default 100000); --auto-compaction-mode=periodic together with --auto-compaction-retention=1 enables periodic compaction with an initial period of one hour, after which compaction runs at 10% of that period, roughly every 6 minutes; --max-request-bytes caps the request size (a single key defaults to 1.5 MB, and 10 MB is the recommended maximum); --quota-backend-bytes caps the backend database at 8 GB. The two byte limits are written out literally (10485760 = 10 MB, 8589934592 = 8 GB) because systemd does not perform shell arithmetic such as $((10*1024*1024)); note also that a continuation backslash must be the last character on its line, so comments cannot be placed inline in the unit file.

I. Recovery plan

From the information above, the etcd cluster needs to be brought back to three members. The original node 19 will be rebuilt as a fresh VM with the same IP, so the etcd certificates do not need to be regenerated; the node only has to rejoin the existing etcd cluster.

Because the IP is unchanged and etcd runs as an external cluster, the second step is simple: once the etcd cluster is healthy again, run kubeadm reset on every node and have all nodes rejoin, and the whole cluster is restored.

II. Restoring the etcd cluster

1. Remove the failed member

The removal command is (the argument is the member ID of node 19 as reported by etcdctl member list):

etct_serch member remove 97c1c1003e0d4bf

2. Copy the configuration, binaries and certificates from a healthy etcd node to node 19 (the /opt/etcd directory holds the cluster certificates, the two executables and the main configuration file; the configuration file will be edited in a moment):

scp -r /opt/etcd  192.168.217.19:/opt/
scp /usr/lib/systemd/system/etcd.service 192.168.217.19:/usr/lib/systemd/system/
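
Before editing anything, it is worth a quick check on node 19 that the files arrived intact and that the etcd data directory is empty; a member joining an existing cluster must not carry stale data, and on a freshly rebuilt VM the directory should not exist yet anyway. A minimal sketch, using the same paths as the config above:

# On node 19: confirm the certificates, binaries and config were copied
ls -l /opt/etcd/bin /opt/etcd/cfg /opt/etcd/ssl
# Make sure there is no leftover data directory from a previous life
rm -rf /var/lib/etcd/default.etcd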

3. Edit the etcd configuration file on node 19. The changes are the IPs, the member name, and ETCD_INITIAL_CLUSTER_STATE, which becomes "existing"; compare with the working configuration shown earlier:

[root@master ~]# cat !$
cat /opt/etcd/cfg/etcd.conf
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.217.19:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.217.19:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.19:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.19:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
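
If you prefer not to edit by hand, the three changes (member name, this node's IPs, and the cluster state) can be scripted. A rough sketch, assuming the file was copied from etcd-2 / 192.168.217.20 as above:

# On node 19: turn the copied etcd-2 config into the etcd-1 config
sed -i '/^ETCD_NAME=/s/etcd-2/etcd-1/' /opt/etcd/cfg/etcd.conf
sed -i '/^ETCD_INITIAL_CLUSTER=/!s/192.168.217.20/192.168.217.19/g' /opt/etcd/cfg/etcd.conf
sed -i 's/ETCD_INITIAL_CLUSTER_STATE="new"/ETCD_INITIAL_CLUSTER_STATE="existing"/' /opt/etcd/cfg/etcd.conf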

4. On server 20, run the member add command. This step is essential; without it the rebuilt node cannot join the existing cluster:

etct_serch member add etcd-1 --peer-urls=https://192.168.217.19:2380
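
If you want to double-check before touching node 19, the member list should now contain an etcd-1 entry again; until etcd is actually started on node 19, that entry normally shows up as unstarted:

etct_serch member list -w table    # the new etcd-1 entry stays unstarted until node 19 starts etcd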

5. Restart the etcd service on all three nodes:

systemctl daemon-reload  && systemctl restart etcd
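
The order matters here: member add must have been run before etcd is started on node 19, and node 19 must start with the "existing" cluster state set above. A quick way to watch the new member come up and catch up, assuming etcd is managed by systemd on node 19 exactly as on 20 and 21:

# On node 19: enable and start the service, then follow its log
systemctl daemon-reload && systemctl enable --now etcd
journalctl -u etcd -f    # watch for peer connections and raft activity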

6. etcd cluster recovery log

The system log on server 20 shows the new member being added and the subsequent leader election. The relevant entries are excerpted below (the ID starting with 3d is the new node, 192.168.217.19):

Nov  1 22:41:47 master2 etcd: raft2022/11/01 22:41:47 INFO: ef2fee107aafca91 switched to configuration voters=(4427268366965300623 17235256053515405969 17706125434919122208)
Nov  1 22:41:47 master2 etcd: added member 3d70d11f824a5d8f [https://192.168.217.19:2380] to cluster b459890bbabfc99f
Nov  1 22:41:47 master2 etcd: starting peer 3d70d11f824a5d8f...
Nov  1 22:41:47 master2 etcd: started HTTP pipelining with peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
Nov  1 22:41:47 master2 etcd: started peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: added peer 3d70d11f824a5d8f
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream Message reader)
Nov  1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:41:47 master2 etcd: peer 3d70d11f824a5d8f became active
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:41:47 master2 etcd: ef2fee107aafca91 initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message writer)
Nov  1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 writer)
Nov  1 22:41:47 master2 etcd: published {Name:etcd-2 ClientURLs:[https://192.168.217.20:2379]} to cluster b459890bbabfc99f
Nov  1 22:41:47 master2 etcd: ready to serve client requests
Nov  1 22:41:47 master2 systemd: Started Etcd Server.
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [term 639] received MsgTimeoutNow from 3d70d11f824a5d8f and starts an election to get leadership.
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became candidate at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from ef2fee107aafca91 at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to f5b8cb45a0dcf520 at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 lost leader 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from 3d70d11f824a5d8f at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 has received 2 MsgVoteResp votes and 0 vote rejections
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became leader at term 640
Nov  1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 elected leader ef2fee107aafca91 at term 640
Nov  1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
Nov  1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)

 

7. Testing the etcd cluster

Server 20 is now the etcd leader, which also matches the election seen in the log above.

[root@master2 ~]# etct_serch member list -w table
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 |      false |
| ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 |      false |
| f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 |      false |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
[root@master2 ~]# etct_serch endpoint status  -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.217.19:2379 | 3d70d11f824a5d8f |   3.4.9 |  4.0 MB |     false |      false |       640 |     494999 |             494999 |        |
| https://192.168.217.20:2379 | ef2fee107aafca91 |   3.4.9 |  4.1 MB |      true |      false |       640 |     495000 |             495000 |        |
| https://192.168.217.21:2379 | f5b8cb45a0dcf520 |   3.4.9 |  4.0 MB |     false |      false |       640 |     495000 |             495000 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@master2 ~]# etct_serch endpoint health  -w table
+-----------------------------+--------+-------------+-------+
|          ENDPOINT           | HEALTH |    TOOK     | ERROR |
+-----------------------------+--------+-------------+-------+
| https://192.168.217.19:2379 |   true | 26.210272ms |       |
| https://192.168.217.20:2379 |   true | 26.710558ms |       |
| https://192.168.217.21:2379 |   true | 27.903774ms |       |
+-----------------------------+--------+-------------+-------+
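
Before moving on to the Kubernetes nodes, it does no harm to take an etcd snapshot as a safety net; this is just a personal habit, not something the procedure requires. snapshot save expects a single endpoint, so the alias is not used here. A sketch:

ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
  --endpoints=https://192.168.217.20:2379 \
  --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  snapshot save /opt/etcd-backup-$(date +%F).db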

III. Restoring the Kubernetes nodes

Since it is master1 (one of the three masters) that died, haproxy and keepalived have to be reinstalled on it. There is nothing special about this: copy the configuration files from the healthy node 20 and tweak them a little, as sketched below.
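
A rough sketch of that copy, assuming the distribution-default config paths; the keepalived state and priority on node 19 usually need adjusting so it does not fight node 20 for the VIP 192.168.217.100, and the exact values depend on the original deployment:

# On node 19: install the packages (yum-based hosts assumed, as in the original deployment)
yum install -y haproxy keepalived
# On node 20: copy the working configs over, then adjust keepalived's state/priority on node 19
scp /etc/haproxy/haproxy.cfg        192.168.217.19:/etc/haproxy/
scp /etc/keepalived/keepalived.conf 192.168.217.19:/etc/keepalived/
# On node 19: bring both services up
systemctl enable --now haproxy keepalived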

Likewise, reinstall kubelet, kubeadm and kubectl; an example follows.
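
For example, on a yum-based node with the Kubernetes repo already configured, pinning the versions to match the rest of the cluster (1.22.2, per the init file below):

# On node 19: reinstall the Kubernetes tooling at the cluster's version
yum install -y kubelet-1.22.2-0 kubeadm-1.22.2-0 kubectl-1.22.2-0 --disableexcludes=kubernetes
systemctl enable kubelet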

Then reset all four nodes, i.e. kubeadm reset -f, and re-initialize (the deployment tutorial linked at the beginning of this article covers node setup in detail, so it is not repeated here). A cleanup sketch and the init configuration file follow below.
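
A minimal reset sketch, run on every node; the CNI and iptables cleanup is optional but avoids stale state, and note that with an external etcd cluster kubeadm reset does not touch the etcd data at all, which is exactly why the cluster state survives this step:

# On every node: tear down the old kubeadm state
kubeadm reset -f
rm -rf /etc/cni/net.d $HOME/.kube/config
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X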

[root@master ~]# cat kubeadm-init-ha.yaml 
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: "0"
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.217.19
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  imagePullPolicy: IfNotPresent
  name: master1
  taints: null
---
controlPlaneEndpoint: "192.168.217.100"
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
  external:
    endpoints:     # the external etcd cluster endpoints
    - https://192.168.217.19:2379
    - https://192.168.217.20:2379
    - https://192.168.217.21:2379
    caFile: /etc/kubernetes/pki/etcd/ca.pem
    certFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client.pem
    keyFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client-key.pem
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: 1.22.2
networking:
  dnsDomain: cluster.local
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
scheduler: {}

The init command (run on server 19, since that is the server the init file points at):

kubeadm init --config=kubeadm-init-ha.yaml --upload-certs
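
After init succeeds on 19, the other nodes rejoin with the join commands printed by kubeadm init. The token below is the one fixed in the init file; the CA cert hash and the certificate key are placeholders to be taken from your own init output:

# On masters 20 and 21 (control-plane join)
kubeadm join 192.168.217.100:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash-from-init-output> \
    --control-plane --certificate-key <key-from-upload-certs>
# On worker node 22
kubeadm join 192.168.217.100:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash-from-init-output>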

Because the cluster uses an external etcd cluster, node recovery is quite simple: reinstall the network plugin and other add-ons and that is it. The result looks like this:

Note that some pods, such as kube-apiserver, show freshly reset ages, while pods such as kube-proxy keep their original ages. That is precisely because the cluster state lives in the external etcd cluster.

[root@master ~]# kubectl get po -A -owide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS        AGE     IP               NODE      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-796cc7f49d-k586w   1/1     Running   0               153m    10.244.166.131   node1     <none>           <none>
kube-system   calico-node-7x86d                          1/1     Running   0               153m    192.168.217.21   master3   <none>           <none>
kube-system   calico-node-dhxcq                          1/1     Running   0               153m    192.168.217.19   master1   <none>           <none>
kube-system   calico-node-jcq6p                          1/1     Running   0               153m    192.168.217.20   master2   <none>           <none>
kube-system   calico-node-vjtv6                          1/1     Running   0               153m    192.168.217.22   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7c85v                   1/1     Running   16              5d23h   10.244.166.129   node1     <none>           <none>
kube-system   coredns-7f6cbbb7b8-7xm62                   1/1     Running   0               152m    10.244.166.132   node1     <none>           <none>
kube-system   kube-apiserver-master1                     1/1     Running   0               107m    192.168.217.19   master1   <none>           <none>
kube-system   kube-apiserver-master2                     1/1     Running   0               108m    192.168.217.20   master2   <none>           <none>
kube-system   kube-apiserver-master3                     1/1     Running   1 (107m ago)    107m    192.168.217.21   master3   <none>           <none>
kube-system   kube-controller-manager-master1            1/1     Running   5 (108m ago)    3h26m   192.168.217.19   master1   <none>           <none>
kube-system   kube-controller-manager-master2            1/1     Running   15 (108m ago)   4d10h   192.168.217.20   master2   <none>           <none>
kube-system   kube-controller-manager-master3            1/1     Running   15              4d10h   192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-69w6c                           1/1     Running   2 (131m ago)    2d12h   192.168.217.19   master1   <none>           <none>
kube-system   kube-proxy-vtz99                           1/1     Running   5               2d12h   192.168.217.22   node1     <none>           <none>
kube-system   kube-proxy-wldcc                           1/1     Running   5               2d12h   192.168.217.21   master3   <none>           <none>
kube-system   kube-proxy-x6w6l                           1/1     Running   5               2d12h   192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master1                     1/1     Running   4 (108m ago)    3h26m   192.168.217.19   master1   <none>           <none>
kube-system   kube-scheduler-master2                     1/1     Running   14 (108m ago)   4d10h   192.168.217.20   master2   <none>           <none>
kube-system   kube-scheduler-master3                     1/1     Running   13 (46m ago)    4d10h   192.168.217.21   master3   <none>           <none>
kube-system   metrics-server-55b9b69769-j9c7j            1/1     Running   13 (127m ago)   23h     10.244.166.130   node1     <none>           <none>
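
For completeness: "reinstall the network plugin" above simply means re-applying the same Calico manifest that was used when the cluster was first built (the file name here is a placeholder for whatever you installed from last time), then confirming that all four nodes go Ready:

kubectl apply -f calico.yaml    # the same manifest used in the original deployment
kubectl get nodes -owide        # all four nodes should report Ready once the calico-node pods are up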

