Pitfalls from deploying k8s for container management

Understanding the basic concepts

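k8s is a container orchestration tool. It resembles docker-compose, but docker-compose targets a single host, while k8s manages containers across a cluster of machines and is much more widely used.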

Horizontal scaling in k8s essentially means adding pods; the scheduler tries to spread the new pods across different machines.

Hierarchy of concepts: k8s cluster -> node (one machine, physical or virtual) -> pod.

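Container runtimes include docker and rocket (rkt). The combination of k8s and docker is the most common. Note that since v1.24 k8s no longer includes dockershim, so the kubelet talks to a CRI runtime such as containerd rather than to docker directly.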

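A pod is the smallest scheduling unit in k8s. Each pod has its own IP, and all docker containers inside one pod share that IP. A pod can run several containers, although in most cases it runs only one. When several containers run in the same pod, they cooperate by talking to each other over localhost.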

The kubelet on each node acts as the node agent: it reports the state of the pods on that node.

The relationship among kubelet, pod, and docker container is similar to the relationship among scheduler, process, and thread in an operating system.

All data exchange goes through the API server component (kube-apiserver).

By default etcd runs on the master node.

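A service gives a group of pods a single, stable access point. It does not decide which node a pod runs on (that is the scheduler's job); to the outside it looks like one entry point, and it can be thought of as a gateway that load-balances across the pods behind it.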

Reference book: 《Kubernetes权威指南》 (Kubernetes: The Definitive Guide)

Hands-on

Preparation

1. Disable swap

sudo swapoff -a  # no output

Comment out the swap entry in /etc/fstab so that swap stays off after a reboot
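Commenting out the swap entry can also be scripted. The following is a minimal sketch, assuming GNU sed and an fstab whose swap entries contain the word swap as a whitespace-separated field; the function name disable_swap_entries is made up for this example and is not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: comment out active swap entries in an fstab-style file.
# disable_swap_entries is a hypothetical helper name, not a standard command.
disable_swap_entries() {
    local fstab="$1"
    # Keep a backup next to the original file.
    cp "$fstab" "$fstab.bak"
    # Skip lines that are already comments; prefix '#' to every other
    # line that has "swap" as a whitespace-delimited field.
    sed -i -E '/^[[:space:]]*#/!s/^([^#]*[[:space:]]swap[[:space:]].*)$/#\1/' "$fstab"
}

# Example invocation on the real file (needs root):
#   sudo bash -c "$(declare -f disable_swap_entries); disable_swap_entries /etc/fstab"
```

The backup copy makes it easy to restore the file if the pattern matches a line it should not have touched.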

2. Turn off the firewall

master@ubuntu:~$ sudo systemctl stop firewalld  # reports that the firewalld service is not loaded
Failed to stop firewalld.service: Unit firewalld.service not loaded.
master@ubuntu:~$ sudo systemctl disable firewalld
Failed to disable unit: Unit file firewalld.service does not exist.

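3. Disable selinux

Update the Ubuntu sources

sudo apt install selinux-utils  # installation fails; related to mysql
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

This is a leftover problem on this machine; see my other post for the fix. Once it is resolved, switch the software sources to a faster mirror so that selinux-utils installs more quickly.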

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak

Replace the file contents with the Aliyun mirror

sudo gedit /etc/apt/sources.list
#  Aliyun mirror
deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse

After selinux-utils is installed, run

master@ubuntu:~$ sudo setenforce 0
setenforce: SELinux is disabled

4. Change the hostname

master@ubuntu:~$ sudo vim /etc/hosts
master@ubuntu:~$ sudo /etc/init.d/networking restart 
[ ok ] Restarting networking (via systemctl): networking.service.

Set the hostname

sudo hostnamectl set-hostname qiu-k8s-m

Running hostnamectl shows that the hostname has been changed

master@ubuntu:~$ hostnamectl
   Static hostname: qiu-k8s-m
         Icon name: computer-vm
           Chassis: vm
               ... ...

After a reboot, the terminal prompt shows the new hostname

master@qiu-k8s-m:~$

5. Install docker

It is already installed; docker -v prints the version

master@qiu-k8s-m:~$ docker -v
Docker version 20.10.17, build 100c701

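Add the current user to the docker group so that docker commands no longer need sudo. The change to /etc/group only takes effect for new login sessions, so log out and back in (or reboot) afterwards.

sudo groupadd docker  # the last line of /etc/group will then be docker:x:GID:
sudo usermod -aG docker $USER  # appends the current user after GID:

Test whether docker can pull an image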

master@qiu-k8s-m:~$ docker run -it ubuntu bash
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
2b55860d4c66: Pull complete 
Digest: sha256:20fa2d7bb4de7723f542be5923b06c4d704370f0390e4ae9e1c833c8785644c1
Status: Downloaded newer image for ubuntu:latest
root@32cfec738c47:/#

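If the output above appears, docker is installed correctly.

If pulling is too slow, that is because images are downloaded from Docker Hub; a registry mirror can be configured by editing the following file

sudo vim /etc/docker/daemon.json
# add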
{ "registry-mirrors": ["https://alzgoonw.mirror.aliyuncs.com"], "live-restore": true }

Then run

sudo systemctl daemon-reload 
sudo systemctl restart docker

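to restart the docker service. After this change, pulling images is much faster.

If the docker service fails to start, /etc/docker/daemon.json was probably edited incorrectly; use dockerd --debug to investigate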

master@qiu-k8s-m:~$ sudo dockerd --debug
unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives don't match any configuration option: __registry-mirrors
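The error above complains about a key named __registry-mirrors, which suggests stray characters ended up in the option name. One quick way to spot this kind of slip is to print the option names found in the file and read them over. The following is a rough sketch using grep and sed only; list_daemon_keys is a name made up for this example, and the pattern is a heuristic that lists every quoted string followed by a colon, not a real JSON parser.

```shell
#!/usr/bin/env bash
# Sketch: print every "key": name found in a JSON file, one per line.
# list_daemon_keys is a hypothetical helper; it is a heuristic, not a JSON parser.
list_daemon_keys() {
    local file="$1"
    # Match a quoted string directly followed by a colon, then strip
    # the surrounding quotes and the colon.
    grep -oE '"[^"]+"[[:space:]]*:' "$file" \
        | sed -E 's/^"//; s/"[[:space:]]*:$//'
}

# Example invocation:
#   list_daemon_keys /etc/docker/daemon.json
```

For the daemon.json shown earlier, the expected names are registry-mirrors and live-restore; anything else in the list deserves a second look.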

Preparing to install kubectl, kubelet, kubeadm

Add the signing key

master@qiu-k8s-m:~$ curl https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2537  100  2537    0     0    898      0  0:00:02  0:00:02 --:--:--   898
OK

Add the Kubernetes software source, using the ustc mirror

sudo vim /etc/apt/sources.list.d/kubernetes.list

Add the following line

deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main

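xenial is the codename of Ubuntu 16.04, and the test machine runs 18.04, so in principle xenial should be replaced with bionic. After that change, however, the source could not be found, so I switched it back.

Install kubelet, kubectl, kubeadm

Update the sources, then install with apt-get install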

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl enable kubelet

Try listing the nodes

master@qiu-k8s-m:~$ kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?

or it may report this error instead

master@qiu-k8s-m:~$ sudo kubectl get nodes
Unable to connect to the server: dial tcp 192.168.230.246:6443: connect: no route to host

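The cluster has not been initialized yet, so kubectl has no API server to connect to and no nodes are shown.

Repeat the same steps on the other two nodes.

Configuring the master

Set the environment variable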

sudo vim /etc/profile

Append at the end

export KUBECONFIG=/etc/kubernetes/admin.conf

Apply it immediately

source /etc/profile

Restart kubelet

sudo systemctl daemon-reload 
sudo systemctl restart kubelet

Initialize with kubeadm

sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=192.168.230.246 --ignore-preflight-errors=NumCPU

This fails with

W0921 11:13:33.282440    3157 version.go:104] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get "https://dl.k8s.io/release/stable-1.txt": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W0921 11:13:33.282636    3157 version.go:105] falling back to the local client version: v1.25.1
[init] Using Kubernetes version: v1.25.1
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR CRI]: container runtime is not running: output: E0921 11:13:33.473505    3198 remote_runtime.go:948] "Status from runtime service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService"
time="2022-09-21T11:13:33+08:00" level=fatal msg="getting status of runtime: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

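Fix: delete /etc/containerd/config.toml and restart the container runtime service (the config.toml installed alongside docker most likely disables containerd's CRI plugin, which kubeadm needs)

sudo rm -rf /etc/containerd/config.toml 
sudo systemctl restart containerd

init now runs, but very slowly; its output suggests running sudo kubeadm config images pull first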

master@qiu-k8s-m:~$ sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=192.168.230.246 --ignore-preflight-errors=NumCPU
[sudo] password for master: 
[init] Using Kubernetes version: v1.25.1
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
^C

So pull the images first. This is just as slow, but it does finish eventually

master@qiu-k8s-m:~$ sudo kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-apiserver:v1.25.1
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-controller-manager:v1.25.1
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-scheduler:v1.25.1
[config/images] Pulled registry.aliyuncs.com/google_containers/kube-proxy:v1.25.1
[config/images] Pulled registry.aliyuncs.com/google_containers/pause:3.8
[config/images] Pulled registry.aliyuncs.com/google_containers/etcd:3.5.4-0
[config/images] Pulled registry.aliyuncs.com/google_containers/coredns:v1.9.3

Go back and run the init command again, only to hit another error

...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
  timed out waiting for the condition
This error is likely caused by:
  - The kubelet is not running
  - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
  - 'systemctl status kubelet'
  - 'journalctl -xeu kubelet'
...

It says the kubelet is not running? Then run the reload again and rerun init

error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
  [ERROR Port-10250]: Port 10250 is in use

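Still an error: the port is already in use and the manifest files left by the earlier failed init already exist. Run kubeadm reset, which removes that leftover state. Rerunning init then stalls at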

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed

for quite a while, and then the same message appears again

Unfortunately, an error has occurred:
  timed out waiting for the condition
...

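After much searching, the cause may be the earlier deletion of /etc/containerd/config.toml, so recreate the file and add some settings to it

First generate the file in the current directory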

containerd config default > config.toml

Open the file and replace k8s.gcr.io with registry.cn-hangzhou.aliyuncs.com/google_containers

    restrict_oom_score_adj = false
    sandbox_image = "k8s.gcr.io/pause:3.6"
    selinux_category_range = 1024

After the change

    restrict_oom_score_adj = false
    sandbox_image = "registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6"
    selinux_category_range = 1024

Move the edited config.toml into /etc/containerd/ (you can of course also create the file directly under /etc/containerd/ and edit it there), then restart containerd so that the new configuration is loaded
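The manual edit can also be done with sed. This is a minimal sketch, assuming GNU sed and a config.toml that contains a single sandbox_image line as in the default output; set_sandbox_image is a name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: point containerd's sandbox_image at a different registry.
# set_sandbox_image is a hypothetical helper, not part of containerd.
set_sandbox_image() {
    local config="$1" image="$2"
    # Keep the original indentation and the "sandbox_image = " prefix,
    # and replace only the quoted value. '|' is used as the sed delimiter
    # because image names contain '/'.
    sed -i -E "s|^([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*).*$|\1\"${image}\"|" "$config"
}

# Example invocation, using the image name from this post:
#   containerd config default > config.toml
#   set_sandbox_image config.toml registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6
#   sudo mv config.toml /etc/containerd/config.toml
#   sudo systemctl restart containerd
```

Editing a copy in the current directory first, as above, means a bad substitution never reaches the live /etc/containerd/config.toml.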

Run the init command again; it still fails

....
This error is likely caused by:
  - The kubelet is not running
  - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
  - 'systemctl status kubelet'
  - 'journalctl -xeu kubelet'
....

So, following the hint, append --v=6 to the init command to get more detailed output, which shows

...
I0926 22:36:31.414225   86362 round_trippers.go:553] GET https://192.168.230.246:6443/healthz?timeout=10s  in 2720 milliseconds
I0926 22:36:34.485995   86362 round_trippers.go:553] GET https://192.168.230.246:6443/healthz?timeout=10s  in 2792 milliseconds
I0926 22:36:37.557797   86362 round_trippers.go:553] GET https://192.168.230.246:6443/healthz?timeout=10s  in 2863 milliseconds
[kubelet-check] Initial timeout of 40s passed.
I0926 22:36:40.630317   86362 round_trippers.go:553] GET https://192.168.230.246:6443/healthz?timeout=10s  in 2935 milliseconds
I0926 22:36:43.701762   86362 round_trippers.go:553] GET https://192.168.230.246:6443/healthz?timeout=10s  in 3007 milliseconds
...

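These lines repeat over and over: the health check is clearly timing out while connecting to the host address. ifconfig showed that the machine's IP had changed, so the --apiserver-advertise-address passed to init was no longer usable. After correcting the address, running sudo kubeadm reset and then init again finally produced the expected output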
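A quick check before running init would have caught the stale address. The sketch below assumes the iproute2 ip command is available and relies on the one-line format of ip -4 -o addr show, where the fourth field is the address with its prefix length; has_ipv4 is a name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the given IPv4 address appears in "ip -4 -o addr show"
# style output read from stdin. has_ipv4 is a hypothetical helper name.
has_ipv4() {
    local want="$1"
    # Field 4 looks like 192.168.230.130/24; drop the prefix length and
    # compare whole lines literally.
    awk '{print $4}' | cut -d/ -f1 | grep -Fxq "$want"
}

# Example invocation with the address used in the init command of this post:
#   ip -4 -o addr show | has_ipv4 192.168.230.246 \
#       || echo "advertise address is not configured on this machine"
```

Reading the address list from stdin keeps the check easy to try against saved output from another machine.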

...
[addons] Applied essential addon: kube-proxy
I0926 22:45:05.364595   88564 loader.go:374] Config loaded from file:  /etc/kubernetes/admin.conf
I0926 22:45:05.366180   88564 loader.go:374] Config loaded from file:  /etc/kubernetes/admin.conf
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
  export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.230.130:6443 --token 65rht0.rs2weic09flv5o8s \
  --discovery-token-ca-cert-hash sha256:193cc3031acdde64264aba3b186e25462a2c8707b6d9bc27ba92393962d5eece

If you forgot to save the token for joining nodes, run

sudo kubeadm token create --print-join-command

As the output above instructs, run

mkdir -p $HOME/.kube  
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config  
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Next, install the flannel network add-on

sudo kubectl apply -f  https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

After a short wait, the node shows the Ready status

master@qiu-k8s-m:~$ sudo kubectl get node
NAME        STATUS   ROLES           AGE   VERSION
qiu-k8s-m   Ready    control-plane   20m   v1.25.1

If init reports ports in use because of an earlier run, a reset clears them

[init] Using Kubernetes version: v1.25.2
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR Port-6443]: Port 6443 is in use
  [ERROR Port-10259]: Port 10259 is in use
  [ERROR Port-10257]: Port 10257 is in use
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
  [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
  [ERROR Port-10250]: Port 10250 is in use
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

The ports are occupied, so try a reset

sudo kubeadm reset

Resetting kubeadm resolves this problem.

Configuring the worker nodes

Running the join command fails with

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR CRI]: container runtime is not running: output: E0927 21:58:31.253378   27048 remote_runtime.go:948] "Status from runtime service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService"
time="2022-09-27T21:58:31+08:00" level=fatal msg="getting status of runtime: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

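At first glance this is a container runtime error, the same one seen on the master. According to a reference I found, the fix is

sudo rm -rf /etc/containerd/config.toml
sudo systemctl restart containerd

After that, run the join command again; at the end it prints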

...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
...

This means the current node has joined the cluster, but running get nodes on it still fails

The connection to the server localhost:8080 was refused - did you specify the right host or port?

Following a reference article, copy /etc/kubernetes/admin.conf from the master node to the worker node and place it in the expected directory

sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Run the command again and the node list appears

NAME         STATUS     ROLES           AGE   VERSION
qiu-k8s-m    Ready      control-plane   82m   v1.25.1
qiu-k8s-n2   NotReady   <none>          10m   v1.25.1

Finally, install the flannel network add-on

kubectl apply -f  https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

This may fail with a network connection error

The connection to the server raw.githubusercontent.com was refused - did you specify the right host or port?

In that case, take a different approach: copy /etc/cni from the master node into the same directory on the worker node

cp -r /etc/cni /mnt/
sudo mv  /mnt/cni /etc/

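That wraps up this stage.

Commonly used values of kind in the yaml files include Pod, ReplicationController, Deployment, and Service.

yaml does not allow tabs for indentation, so do not mix tabs and spaces; indent with spaces only. NodePort is the port exposed to the outside.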
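Since a stray tab is easy to miss in an editor, a quick scan before kubectl apply can help. This is a minimal sketch using grep; find_tabs is a name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list the lines of a file that contain a tab character.
# find_tabs is a hypothetical helper name.
find_tabs() {
    local file="$1"
    # printf produces a literal tab; grep -n prefixes each match with its
    # line number. The exit status is 0 only if at least one tab is found.
    grep -n "$(printf '\t')" "$file"
}

# Example invocation (the file name is a placeholder):
#   if find_tabs my-deployment.yaml; then echo "replace the tabs above with spaces"; fi
```

The function prints nothing and returns a non-zero status when the file is tab-free, so it can be used directly in an if condition.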

After reconfiguring the master and worker nodes, the following error appears

Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

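This happens because sudo kubeadm reset does not delete ~/.kube/config (the file was copied there by hand), so it still holds the credentials of the previous cluster. The fix is to copy a fresh one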

sudo cp /etc/kubernetes/admin.conf ~/.kube/config

The ~/.kube/config on the worker nodes also needs to be copied over from the master node again.
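To tell whether a hand-copied kubeconfig has gone stale after a reset and a new init, the two files can simply be compared. This is a minimal sketch using cmp; kubeconfig_is_current is a name made up for this example, and reading /etc/kubernetes/admin.conf normally requires root.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the user's kubeconfig is byte-identical to the
# cluster's admin.conf. kubeconfig_is_current is a hypothetical helper name.
kubeconfig_is_current() {
    local admin_conf="${1:-/etc/kubernetes/admin.conf}"
    local user_conf="${2:-$HOME/.kube/config}"
    # cmp -s prints nothing and reports the result through its exit status.
    cmp -s "$admin_conf" "$user_conf"
}

# Example invocation from a root shell, with explicit paths
# (the home directory below is a placeholder):
#   kubeconfig_is_current /etc/kubernetes/admin.conf /home/master/.kube/config \
#       || echo "kubeconfig is out of date; copy admin.conf again"
```

A byte-for-byte comparison is stricter than necessary, but it is enough to detect the case described here, where the copy predates the latest init.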

Why does the configured worker node have no internal IP?

Look at the pod details

sudo kubectl describe pod httpd

Error details

Too many pitfalls... to be continued
