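Prerequisites for deploying metrics-server
- The node running apiserver and the metrics-server pod must be able to reach each other over the network [a cluster deployed with kubeadm also installs the corresponding worker-node components]
- Aggregation must be enabled in the apiserver configuration [a cluster deployed with kubeadm enables aggregation by default]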
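For a cluster that was not built with kubeadm, one way to check the second prerequisite is to look for the aggregation-layer flags in the kube-apiserver manifest or unit file. The sketch below is an assumption-laden helper, not an official tool: the function name `missing_aggregation_flags` is made up for illustration, and the flag list is taken from the Kubernetes aggregation-layer documentation as I understand it.

```shell
# Print every aggregation-layer flag that is absent from the given
# kube-apiserver manifest or unit file (hypothetical helper).
# Prints nothing and returns 0 when all flags are present.
missing_aggregation_flags() {
  local manifest="$1" flag status=0
  for flag in \
    --requestheader-client-ca-file \
    --requestheader-allowed-names \
    --requestheader-extra-headers-prefix \
    --requestheader-group-headers \
    --requestheader-username-headers \
    --proxy-client-cert-file \
    --proxy-client-key-file
  do
    # "--" stops grep from treating the pattern as an option
    if ! grep -q -- "${flag}=" "$manifest"; then
      echo "$flag"
      status=1
    fi
  done
  return "$status"
}

# Usage (kubeadm's default manifest path; adjust for a binary install):
# missing_aggregation_flags /etc/kubernetes/manifests/kube-apiserver.yaml
```

The check only confirms that the flags are present; it does not validate the certificates they point to.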
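Things to watch out for when deploying metrics-server
Change the image registry
- The image referenced by the official manifest is hosted on a registry outside China and is hard to pull from inside China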
sed -i 's#k8s.gcr.io#registry.cn-hangzhou.aliyuncs.com/google_containers#g' components.yaml
Before
image: k8s.gcr.io/metrics-server-amd64:v0.3.6
After
image: registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server-amd64:v0.3.6
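Before rewriting components.yaml in place, the substitution can be previewed. The sketch below uses the same sed expression as above, with the dots escaped and without `-i`, so the file is left untouched and only the lines that would change are printed. The function name `preview_mirror_rewrite` is hypothetical.

```shell
# Show the lines of a manifest that the registry substitution would change,
# without modifying the file (hypothetical helper).
preview_mirror_rewrite() {
  # -n suppresses normal output; the trailing "p" prints only rewritten lines
  sed -n 's#k8s\.gcr\.io#registry.cn-hangzhou.aliyuncs.com/google_containers#gp' "$1"
}

# Usage:
# preview_mirror_rewrite components.yaml
```

If the preview looks right, the in-place `sed -i` command above applies the same change. Note that BSD/macOS sed expects `-i ''` instead of a bare `-i`.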
Change the startup arguments
Before
- The official manifest has only two startup arguments

    args:
      - --cert-dir=/tmp
      - --secure-port=4443

After
- metric-resolution: the interval at which metrics are collected from the kubelet; defaults to 60s
- kubelet-preferred-address-types: prefer InternalIP when connecting to the kubelet. This avoids failures when calling a node's kubelet API by node name while the node name has no DNS record. The default is Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP
- kubelet-insecure-tls: do not verify the serving certificate presented by the kubelet

    args:
      - --cert-dir=/tmp
      - --secure-port=4443
      - --metric-resolution=10s
      - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
      - --kubelet-insecure-tls
An incomplete collection of errors
--kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP is not configured
- metrics-server reports errors similar to the following
E0907 14:29:51.774592 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host]
E0907 14:30:10.517886 1 reststorage.go:112] unable to fetch node metrics for node "<node_name>": no metrics known for node "<node_name>"
- Alternatively, hosts entries for the node names can be added inside metrics-server
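One possible way to add such hosts entries is the pod-level `hostAliases` field, which Kubernetes writes into the pod's /etc/hosts. This is my assumption about how to do it; the original note does not say which mechanism was used. The sketch below only builds the patch document; the function name `host_alias_patch` and the example IP and node name are hypothetical.

```shell
# Print a strategic-merge patch that adds one hostAliases entry
# (IP -> hostname) to a Deployment's pod template (hypothetical helper).
host_alias_patch() {
  local ip="$1" hostname="$2"
  printf '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"%s","hostnames":["%s"]}]}}}}\n' \
    "$ip" "$hostname"
}

# Usage (placeholder IP and node name; assumes metrics-server runs in kube-system):
# kubectl -n kube-system patch deployment metrics-server \
#   --patch "$(host_alias_patch 192.168.1.10 node-1)"
```

Every node needs its own entry, so for anything beyond a handful of nodes, setting --kubelet-preferred-address-types to prefer InternalIP is likely the simpler fix.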
--kubelet-insecure-tls is not configured
x509: certificate signed by unknown authority
No network connectivity between the apiserver node and the metrics-server pod
- metrics-server reports errors similar to the following
unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to get CPU for container "metrics-server" in pod kube-system/metrics-server-7db5b7cb7c-pkcjb on node "<node_name>", discarding data: missing cpu usage metric
- The apiserver log shows errors similar to the following
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": dial tcp 172.30.1.16:4443: i/o timeout
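The apiserver errors above refer to the aggregated API v1beta1.metrics.k8s.io, so its APIService object is a quick place to confirm whether the apiserver can reach metrics-server. A minimal sketch, assuming kubectl is configured for the cluster; the function name `metrics_api_available` is hypothetical.

```shell
# Succeed only if the APIService for metrics-server reports Available=True
# (hypothetical helper; requires a working kubectl context).
metrics_api_available() {
  local status
  status=$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
  [ "$status" = "True" ]
}

# Usage:
# if metrics_api_available; then
#   kubectl top nodes
# else
#   kubectl describe apiservice v1beta1.metrics.k8s.io
# fi
```

When the condition is False, the message shown by `kubectl describe` usually repeats the same "failing or missing response" text seen in the apiserver log.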
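My own scenario
- The k8s cluster was originally deployed from binaries. The plan at the time was that master nodes would not run pods, so the flannel plugin was not installed on them
- In this deployment flannel runs as pods. Deploying flannel on a master node would mean the master also has to act as a worker node, which contradicts the original plan
- So, with the master node doubling as a worker node, mark the node as unschedulable and evict all workloads from it
Mark the node as unschedulable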
kubectl cordon <node name>
Evict the pods on the node, keeping pods that belong to a daemonset
kubectl drain <node name> --ignore-daemonsets
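The two commands above can be combined into a small wrapper so that draining only runs if cordoning succeeded. This is an illustrative sketch; the function name `isolate_node` is hypothetical, and it assumes kubectl is already configured for the cluster.

```shell
# Mark a node unschedulable, then evict its pods while leaving
# daemonset-managed pods (such as flannel) in place (hypothetical helper).
isolate_node() {
  local node="$1"
  kubectl cordon "$node" && kubectl drain "$node" --ignore-daemonsets
}

# Usage:
# isolate_node <node name>
# To allow scheduling on the node again later:
# kubectl uncordon <node name>
```

`kubectl drain` cordons the node itself before evicting, so the explicit cordon step is redundant but harmless; it is kept here to mirror the two steps described above.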