metrics-server - unable to fully collect metrics

本文涉及的产品
容器服务 Serverless 版 ACK Serverless,317元额度 多规格
容器服务 Serverless 版 ACK Serverless,952元额度 多规格
简介: metrics-server - unable to fully collect metrics

部署 metrics-server 的前提条件

  • 要保证 apiserver 所在节点和 metrics-serevrpod 之间网络可以互通 [ kubeadm 部署的集群会部署相应的 work 节点组件 ]
  • 要保证 apiserver 配置中开启了聚合配置 [ kubeadm 部署的集群,默认开启了聚合 ]

部署 metrics-server 需要注意的地方

修改镜像的 tag
  • 官方下载下来的镜像是国外仓库的,国内很难拉取
sed -i 's#k8s.gcr.io#registry.cn-hangzhou.aliyuncs.com/google_containers#g' components.yaml

修改前

image: k8s.gcr.io/metrics-server-amd64:v0.3.6

修改后

image: registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server-amd64:v0.3.6
修改启动参数

修改前

  • 官方只有两个启动参数
args:
  - --cert-dir=/tmp
  - --secure-port=4443

修改后

  • metric-resolution : 从 kubelet 采集数据的周期,默认为 60s

kubelet-preferred-address-types : 优先使用 InternalIP 来访问 kubelet,这样可以避免节点名称没有 DNS 解析记录时,通过节点名称调用节点 kubelet API 失败的情况

  • 默认为 Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP
  • kubelet-insecure-tls : 不要验证 kubelet 提供的服务证书
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --metric-resolution=10s
  - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
  - --kubelet-insecure-tls

不完整的报错合集

没有配置 --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
  • metrics-server 会有类似如下的报错
E0907 14:29:51.774592       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to 
fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scr
ape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup 
<node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): 
Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host]
E0907 14:30:10.517886       1 reststorage.go:112] unable to fetch node metrics for node "<node_name>": no metrics known for node "<node_name>"
  • 当然,也可以在 metrics-server 里面增加 hosts 解析
没有配置 --kubelet-insecure-tls
x509: certificate signed by unknown authority
apiserver 节点与 metrics-server pod 之间网络不通
  • metrics-server 会有类似如下的报错
unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to get CPU for container "metrics-server" in pod kube-system/metrics-server-7db5b7cb7c-pkcjb on node "<node_name>", discarding data: missing cpu usage metric
  • 在 apiserver 里可以看到类似如下的报错
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": dial tcp 172.30.1.16:4443: i/o timeout

个人场景

  • 前期使用的二进制部署的 k8s 集群,当时的规划是 master 节点不运行 pod,于是没有安装 flannel 插件
  • 整体部署中,flannel 采用了 pod 的形式部署,如果 master 节点要部署 flannel,等同于 master 节点需要复用 work 节点,与原先的期望不符合
  • 于是在 master 节点复用 node 节点的情况下,将节点标记为不可调度驱逐所有负载

将节点标记为不可调度

kubectl cordon <node name>

驱逐节点 pod ,保留 daemonset 类型的 pod

kubectl drain <node name> --ignore-daemonsets


相关实践学习
通过Ingress进行灰度发布
本场景您将运行一个简单的应用,部署一个新的应用用于新的发布,并通过Ingress能力实现灰度发布。
容器应用与集群管理
欢迎来到《容器应用与集群管理》课程,本课程是“云原生容器Clouder认证“系列中的第二阶段。课程将向您介绍与容器集群相关的概念和技术,这些概念和技术可以帮助您了解阿里云容器服务ACK/ACK Serverless的使用。同时,本课程也会向您介绍可以采取的工具、方法和可操作步骤,以帮助您了解如何基于容器服务ACK Serverless构建和管理企业级应用。 学习完本课程后,您将能够: 掌握容器集群、容器编排的基本概念 掌握Kubernetes的基础概念及核心思想 掌握阿里云容器服务ACK/ACK Serverless概念及使用方法 基于容器服务ACK Serverless搭建和管理企业级网站应用
目录
相关文章
|
Kubernetes 监控 网络协议
|
Java
Appium问题解决方案(8)- selenium.common.exceptions.WebDriverException: Message: An unknown server-side error occurred while processing the command. Original error: Could not sign with default certificate.
Appium问题解决方案(8)- selenium.common.exceptions.WebDriverException: Message: An unknown server-side error occurred while processing the command. Original error: Could not sign with default certificate.
1097 0
Appium问题解决方案(8)- selenium.common.exceptions.WebDriverException: Message: An unknown server-side error occurred while processing the command. Original error: Could not sign with default certificate.
|
资源调度
yarn 错误:There appears to be trouble with your network connection. Retrying…
yarn 错误:There appears to be trouble with your network connection. Retrying…
1356 0
|
2月前
|
监控
{"level":"warn","ts":"2023-11-07T00:35:53.400+0800","caller":"etcdserver/server.go:2048",&
{"level":"warn","ts":"2023-11-07T00:35:53.400+0800","caller":"etcdserver/server.go:2048",&
|
6月前
|
Kubernetes 容器 Perl
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is cu
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is cu
112 0
|
6月前
|
Prometheus Kubernetes 监控
metrics-server
Metrics Server 是一个 Kubernetes 集群的附加组件,用于收集和暴露 Kubernetes 集群的运行时指标。Metrics Server 提供了 Kubernetes 集群的详细信息,包括节点、pod、service 等资源的资源使用情况、性能指标等。这些指标对于监控、诊断和优化 Kubernetes 集群的运行状况非常有用。
336 4
|
Linux
ERROR: 2 matches found based on name: network product-server_default is ambiguous
ERROR: 2 matches found based on name: network product-server_default is ambiguous
163 0
|
索引
问题复盘:Kibana did not load properly. Check the server output for more information
问题复盘:Kibana did not load properly. Check the server output for more information
429 0
问题复盘:Kibana did not load properly. Check the server output for more information