K8s Cluster Installation
These notes document the process of building a custom K8S cluster.

The deployment uses kubeadm, with 3 nodes in total: one master and two nodes. The process breaks down into roughly 3 steps:
- Preparation work on all nodes
- Initializing the master node
- Joining the nodes to the master
1. Preparation work on all nodes
- Add entries for all the nodes to every node's hosts file so they can reach each other by hostname directly:

```
cat >> /etc/hosts << EOF
8.130.17.55 master
8.130.136.89 node1
8.130.103.97 node2
EOF
```

This was pushed to all nodes with batch execution; I didn't want to edit them one node at a time.
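A minimal sanity check (my addition, not in the original steps) that the names resolve on each node:

```bash
# each hostname should resolve to the IP from /etc/hosts and answer
ping -c 1 master
ping -c 1 node1
ping -c 1 node2
```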
- Time synchronization: following an online tutorial, chronyd is used so that there is no clock drift between the servers.

```
# run these 3 commands in order; the servers need Internet access
systemctl start chronyd
systemctl enable chronyd
date
```
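To verify the sync is actually working (an optional check I added), chrony can be queried directly:

```bash
# list the NTP sources chronyd is tracking and their reachability
chronyc sources -v
# "System clock synchronized: yes" confirms the clock is in sync
timedatectl status
```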
- Disable iptables and firewalld

Kubernetes and Docker generate a large number of iptables rules at runtime. To keep the system's own firewall rules from getting mixed up with them, the system firewall services are disabled outright:

```
systemctl stop firewalld
systemctl disable firewalld
systemctl stop iptables
systemctl disable iptables
```
This step may fail with:

```
Failed to stop iptables.service: Unit iptables.service not loaded.
```

Fix: install the service package first:

```
yum install iptables-services
```

This happens because CentOS 7 and later use firewalld (itself built on top of iptables) as the default firewall, so the standalone iptables service is not installed. Run `systemctl disable iptables` again afterwards and it succeeds.
- Disable selinux

SELinux is a security service in Linux. If it is not turned off, all sorts of unpredictable problems can come up while installing the cluster. I have not tried leaving it on; the official docs recommend disabling it.

Edit /etc/selinux/config and change the SELINUX value to disabled:

```
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
```
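The config change only takes effect after a reboot. To also stop enforcement in the current session (a common companion step; my addition):

```bash
setenforce 0   # switch to permissive mode immediately
getenforce     # prints "Permissive" now, "Disabled" after the reboot
```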
- Disable the swap partition

The swap partition is virtual memory: once physical memory is used up, disk space is treated as memory. An enabled swap device can have a very negative impact on system performance, so Kubernetes requires swap to be disabled on every node. If for some reason swap really cannot be turned off, that has to be declared explicitly with configuration parameters during cluster installation.

```
vim /etc/fstab
# comment out the swap entry:
# /dev/mapper/centos-swap swap
```

Note that the server must be rebooted after this change for it to take effect.
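To turn swap off immediately without waiting for the reboot (my addition, not in the original notes):

```bash
swapoff -a   # disable all swap devices for the current boot
free -h      # the Swap row should now show 0B
```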
- Adjust the Linux kernel parameters

Adjust the kernel parameters to add bridge traffic filtering and IP forwarding.

Edit the /etc/sysctl.d/kubernetes.conf file:

```
cat > /etc/sysctl.d/kubernetes.conf << EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
```

Then load the bridge filter module and reload the configuration:

```
# load the bridge netfilter module first; the net.bridge.* keys depend on it
modprobe br_netfilter
# check it loaded
lsmod | grep br_netfilter
# reload the configuration
sysctl -p /etc/sysctl.d/kubernetes.conf
```
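br_netfilter does not survive a reboot on its own; one way to persist it (my addition) is a systemd modules-load entry:

```bash
# have systemd-modules-load load br_netfilter on every boot
cat > /etc/modules-load.d/br_netfilter.conf << EOF
br_netfilter
EOF
```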
- Enable IPVS

Service in Kubernetes has two proxy modes, one based on iptables and one based on ipvs. Of the two, ipvs performs noticeably better, but using it requires loading the ipvs kernel modules manually:

```
# install ipset and ipvsadm
yum install ipset ipvsadm -y

# write the modules to load into a script file
# (this configuration has a problem; the fix follows below)
cat << EOF > /etc/sysconfig/modules/ipvs.modules
#!/bin/bash
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
EOF

# make the script executable
chmod +x /etc/sysconfig/modules/ipvs.modules

# run the script
/bin/bash /etc/sysconfig/modules/ipvs.modules

# check whether the modules loaded
lsmod | grep -e ip_vs -e nf_conntrack_ipv4
```
Running the script fails with:

```
modprobe: FATAL: Module nf_conntrack_ipv4 not found.
```

Cause: on recent kernels, nf_conntrack_ipv4 has been replaced by nf_conntrack. The correct configuration is:

```
cat << EOF > /etc/sysconfig/modules/ipvs.modules
#!/bin/bash
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack
EOF
```
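Rerun the script and adjust the verification grep to the new module name:

```bash
/bin/bash /etc/sysconfig/modules/ipvs.modules
lsmod | grep -e ip_vs -e nf_conntrack
```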
- Install docker

```
# switch the package mirror to Aliyun
wget https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker-ce.repo

# list the available versions
yum list docker-ce --showduplicates

# install a specific docker-ce version
# --setopt=obsoletes=0 must be specified, otherwise yum installs a newer version
yum install --setopt=obsoletes=0 docker-ce-19.03.13-3.el8 -y

# add the daemon configuration
mkdir /etc/docker
cat << EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "registry-mirrors": ["https://n6mm3rim.mirror.aliyuncs.com"]
}
EOF

systemctl restart docker
systemctl enable docker
```
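A quick check (my addition) that the daemon picked up the systemd cgroup driver:

```bash
docker info | grep -i "cgroup driver"   # expect: Cgroup Driver: systemd
```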
- Install the k8s components

```
# 1. The kubernetes packages are hosted abroad and download slowly,
#    so switch to a domestic mirror.
# 2. Edit /etc/yum.repos.d/kubernetes.repo and add the following:
[kubernetes]
name=Kubernetes
baseurl=http://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=0
repo_gpgcheck=0
gpgkey=http://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg http://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg

# 3. Install kubeadm, kubelet and kubectl
yum install --setopt=obsoletes=0 kubeadm-1.26.4-0 kubelet-1.26.4-0 kubectl-1.26.4-0 -y

# 4. Configure the kubelet cgroup driver
#    Edit /etc/sysconfig/kubelet and add:
KUBELET_CGROUP_ARGS="--cgroup-driver=systemd"
KUBE_PROXY_MODE="ipvs"

systemctl enable kubelet
```
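To confirm everything landed at the expected version (an optional check I added):

```bash
kubeadm version -o short   # expect v1.26.4
kubelet --version          # expect Kubernetes v1.26.4
kubectl version --client
```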
- Prepare the cluster images. This step comes from the tutorial I followed: the K8S components are slow to pull their images at startup, so the images are pulled manually in advance.

```
# Before installing the kubernetes cluster, the images it needs must be
# prepared; the required list can be shown with:
kubeadm config images list
```

Prepare a script that pulls them in a loop:

```
# These images live in the kubernetes registry, which is unreachable for
# network reasons; the loop below substitutes an Aliyun mirror.
images=(
  kube-apiserver:v1.26.12
  kube-controller-manager:v1.26.12
  kube-scheduler:v1.26.12
  kube-proxy:v1.26.12
  pause:3.9
  etcd:3.5.6-0
  coredns:v1.9.3
)

for imageName in ${images[@]}; do
  docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/$imageName
  docker tag  registry.cn-hangzhou.aliyuncs.com/google_containers/$imageName k8s.gcr.io/$imageName
  docker rmi  registry.cn-hangzhou.aliyuncs.com/google_containers/$imageName
done
```
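Afterwards the retagged images should show up locally:

```bash
docker images | grep k8s.gcr.io
```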
2. Initialize the master node
- Run the initialization:

```
kubeadm init \
  --apiserver-advertise-address=8.130.17.55 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version=v1.26.12 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16
```
The first run fails:
```
[init] Using Kubernetes version: v1.26.12
[preflight] Running pre-flight checks
	[WARNING FileExisting-tc]: tc not found in system path
	[WARNING Hostname]: hostname "iz0jlep7hqqxlc1jvoa43xz" could not be reached
	[WARNING Hostname]: hostname "iz0jlep7hqqxlc1jvoa43xz": lookup iz0jlep7hqqxlc1jvoa43xz on 100.100.2.136:53: no such host
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR CRI]: container runtime is not running: output: time="2024-01-07T17:47:30+08:00" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
```
Fix:

```
sudo vim /etc/containerd/config.toml
# comment out the line:  disabled_plugins = ["cri"]

# restart containerd so the change takes effect
systemctl restart containerd
```
Then clear the state left by the failed attempt:

```
kubeadm reset
rm -fr $HOME/.kube/config
```
Running the initialization again fails with:
"CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 64.233.188.82:443: i/o timeout" pod="kube-system/etcd-iz0jlep7hqqxlc1jvoa43xz"
That is, pulling the image registry.k8s.io/pause:3.6 failed.

Fix:

```
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6 registry.k8s.io/pause:3.6
docker rmi registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6
```
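An alternative I did not use here: since the error comes from containerd's CRI plugin, the sandbox image can be pointed at a reachable mirror directly in containerd's config instead of retagging. The key is `sandbox_image` in /etc/containerd/config.toml (assumed containerd 1.6-style layout; verify against `containerd config default` for your version):

```bash
# regenerate a full default config, then override the sandbox image
# (assumption: the config contains a sandbox_image key, as in containerd 1.6)
containerd config default > /etc/containerd/config.toml
sed -i 's#sandbox_image = ".*"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.6"#' /etc/containerd/config.toml
systemctl restart containerd
```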
Run the initialization again. Because these are Aliyun machines, the advertise address is switched to the internal IP for this attempt:

```
kubeadm init \
  --apiserver-advertise-address=172.30.96.3 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version=v1.26.12 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16
```
Success. Run the post-initialization steps:

```
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```
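kubectl should now be able to reach the new control plane; the master shows NotReady until a network plugin is installed, which is handled during the troubleshooting in section 4:

```bash
kubectl get nodes   # the master appears, NotReady for now
```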
3. Join the nodes to the master
- When the master is created it prints a join token, but if it was not saved in time it has to be fetched again or recreated.

```
# list tokens
kubeadm token list
# if there is none, create a new one
kubeadm token create
```
List again:
```
[root@iZ0jlep7hqqxlc1jvoa43xZ ~]# kubeadm token list
TOKEN                     TTL   EXPIRES                USAGES                   DESCRIPTION   EXTRA GROUPS
774tup.rjay53649futq3vs   23h   2024-01-09T13:00:08Z   authentication,signing   <none>        system:bootstrappers:kubeadm:default-node-token
```
Compute the discovery-token-ca-cert hash:

```
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
```

Copy the value and prepend `sha256:` to it yourself.
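A shortcut that avoids assembling the token and hash by hand (kubeadm has supported this flag for a long time, so it works on 1.26):

```bash
# prints a complete, ready-to-run join command
kubeadm token create --print-join-command
```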
Run the join command on the node:

```
kubeadm join 172.30.96.3:6443 --token 774tup.rjay53649futq3vs --discovery-token-ca-cert-hash sha256:d39ec7b385e957934b902f7c72d2a454cf937d1b45bf70ea1c832cd296533504
```
It fails:
```
[preflight] Running pre-flight checks
	[WARNING FileExisting-tc]: tc not found in system path
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR CRI]: container runtime is not running: output: time="2024-01-08T21:43:55+08:00" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
```
The same fix as on the master is needed here — re-enable the cri plugin:

```
sudo vim /etc/containerd/config.toml
# comment out the line:  disabled_plugins = ["cri"]
```

Restart containerd:

```
systemctl restart containerd
```
Run it again:

```
kubeadm join 172.30.96.3:6443 --token 774tup.rjay53649futq3vs --discovery-token-ca-cert-hash sha256:d39ec7b385e957934b902f7c72d2a454cf937d1b45bf70ea1c832cd296533504
```
The join succeeds:
```
[preflight] Running pre-flight checks
	[WARNING FileExisting-tc]: tc not found in system path
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
```
Run the same steps on node2.

Check from the master:
```
[root@iZ0jlep7hqqxlc1jvoa43xZ /]# kubectl get no
NAME                      STATUS     ROLES           AGE     VERSION
iz0jlep7hqqxlc1jvoa43vz   NotReady   <none>          5m56s   v1.26.4
iz0jlep7hqqxlc1jvoa43wz   NotReady   <none>          4s      v1.26.4
iz0jlep7hqqxlc1jvoa43xz   NotReady   control-plane   26h     v1.26.4
```
All of them are in NotReady status, though.
4. Troubleshooting the node status

The nodes have joined the master, but every node still shows NotReady, so check the status of each component.

- Check the control-plane component status:
```
[root@iZ0jlep7hqqxlc1jvoa43xZ /]# kubectl get ComponentStatus
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy
```
The same with the short name:
```
[root@iZ0jlep7hqqxlc1jvoa43xZ /]# kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy
```
Nothing wrong there.
- Check whether the pods have problems:
```
[root@iZ0jlep7hqqxlc1jvoa43xZ /]# kubectl get namespace
NAME              STATUS   AGE
default           Active   26h
kube-node-lease   Active   26h
kube-public       Active   26h
kube-system       Active   26h

[root@iZ0jlep7hqqxlc1jvoa43xZ /]# kubectl get po -n kube-system
NAME                                              READY   STATUS              RESTARTS   AGE
coredns-5bbd96d687-29fq2                          0/1     Pending             0          26h
coredns-5bbd96d687-7zr2m                          0/1     Pending             0          26h
etcd-iz0jlep7hqqxlc1jvoa43xz                      1/1     Running             0          26h
kube-apiserver-iz0jlep7hqqxlc1jvoa43xz            1/1     Running             0          26h
kube-controller-manager-iz0jlep7hqqxlc1jvoa43xz   1/1     Running             2          26h
kube-proxy-9l2nc                                  0/1     ContainerCreating   0          13m
kube-proxy-fj9km                                  1/1     Running             0          26h
kube-proxy-prdtv                                  0/1     ContainerCreating   0          7m41s
kube-scheduler-iz0jlep7hqqxlc1jvoa43xz            1/1     Running             2          26h
```
List the namespaces first, then the pods under kube-system: the Pending coredns pods and the kube-proxy pods stuck in ContainerCreating point to a network problem.
- Install a network plugin

```
# create a working directory
mkdir /opt/k8s
cd /opt/k8s

# download the manifest
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
# if the network blocks the download, fetch it elsewhere and upload it

[root@iZ0jlep7hqqxlc1jvoa43xZ k8s]# grep image kube-flannel.yml
        image: docker.io/flannel/flannel-cni-plugin:v1.2.0
        image: docker.io/flannel/flannel:v0.24.0
        image: docker.io/flannel/flannel:v0.24.0
```
Check which images it needs, then pull them:

```
docker pull docker.io/flannel/flannel-cni-plugin:v1.2.0
docker pull docker.io/flannel/flannel:v0.24.0
```
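The manifest still has to be applied for flannel to actually run; this step is implied above but worth writing out:

```bash
kubectl apply -f kube-flannel.yml
# recent flannel manifests deploy into their own kube-flannel namespace
kubectl get po -n kube-flannel
```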
- Node troubleshooting

The master was up and the images had been pulled, but the worker nodes kept reporting NotReady. Notes from tracking this down.

First run kubectl get po -n kube-system to see which pod has the problem:
```
[root@master ~]# kubectl get po -n kube-system
NAME                                              READY   STATUS    RESTARTS     AGE
coredns-5bbd96d687-29fq2                          1/1     Running   1 (9h ago)   37h
coredns-5bbd96d687-7zr2m                          1/1     Running   1 (9h ago)   37h
etcd-iz0jlep7hqqxlc1jvoa43xz                      1/1     Running   1 (9h ago)   37h
kube-apiserver-iz0jlep7hqqxlc1jvoa43xz            1/1     Running   1 (9h ago)   37h
kube-controller-manager-iz0jlep7hqqxlc1jvoa43xz   1/1     Running   3 (9h ago)   37h
kube-proxy-9l2nc                                  1/1     Running   0            10h
kube-proxy-fj9km                                  1/1     Running   1 (9h ago)   37h
kube-proxy-prdtv                                  1/1     Running   0            10h
kube-scheduler-iz0jlep7hqqxlc1jvoa43xz            1/1     Running   3 (9h ago)   37h
```
Adding -owide shows which host each pod is on. (The listing above was taken after the problem had already been fixed.)

Once the failing pod's name is known, kubectl describe po kube-proxy-prdtv -n kube-system shows the detailed error:
```
d to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6": dial tcp 64.233.188.82:443: i/o timeout
Warning  FailedCreatePodSandBox  2s (x3 over 2m58s)  kubelet  Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to get sandbox image "registry.k8s.io/pause:3.6": failed to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6": dial tcp 64.233.189.82:443: i/o timeout
```
It kept reporting that one image could not be pulled. After pulling the image on the worker node (the same pause:3.6 workaround as before) and restarting, the check shows the node in Ready status.
The commands that did most of the work:

```
# check node status
kubectl get no
# see which pod under the kube-system namespace has a problem
kubectl get po -n kube-system
kubectl get po -n kube-system -owide
# view the detailed error for a pod
kubectl describe po kube-proxy-prdtv -n kube-system
```
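Two more commands that often help at this stage (my additions, not from the original notes):

```bash
# recent events in kube-system, sorted by time — useful right after a failure
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp
# the kubelet's own log; run this on the NotReady node itself
journalctl -u kubelet -f
```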