metrics-server - unable to fully collect metrics


Prerequisites for deploying metrics-server

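  • The node running kube-apiserver must be able to reach the metrics-server pod over the network [ in a kubeadm-deployed cluster, the control-plane nodes already run the corresponding worker-node components ]
  • The API aggregation layer must be enabled in the kube-apiserver configuration [ kubeadm-deployed clusters enable aggregation by default ]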
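For a binary-deployed cluster, one quick way to check the second prerequisite is to look for the aggregation-layer flags on the kube-apiserver command line. The sketch below greps a file for the flags listed in the upstream "Configure the Aggregation Layer" documentation. The function name check_aggregation_flags and the file you pass it (for example a systemd unit) are assumptions, and a match only shows that a flag is present, not that its value is correct.

```shell
# Hypothetical helper: print the aggregation-layer flags that do not appear
# in a file holding the kube-apiserver command line; returns 1 if any are missing.
check_aggregation_flags() {
  local conf_file=$1 missing=0 flag
  for flag in \
    --requestheader-client-ca-file \
    --requestheader-allowed-names \
    --requestheader-extra-headers-prefix \
    --requestheader-group-headers \
    --requestheader-username-headers \
    --proxy-client-cert-file \
    --proxy-client-key-file; do
    # -F: fixed-string match; -e: needed because the pattern starts with "--"
    if ! grep -qF -e "$flag" "$conf_file"; then
      echo "missing: $flag"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage (the path is hypothetical): check_aggregation_flags /etc/systemd/system/kube-apiserver.service. The same upstream page also notes that --enable-aggregator-routing=true should be set when kube-proxy does not run on the apiserver host.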

Things to watch out for when deploying metrics-server

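Change the image registry
  • The official manifest pulls the image from k8s.gcr.io, a registry outside China that is hard to pull from inside China; point it at the Alibaba Cloud mirror instead: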
sed -i 's#k8s.gcr.io#registry.cn-hangzhou.aliyuncs.com/google_containers#g' components.yaml

Before:

image: k8s.gcr.io/metrics-server-amd64:v0.3.6

After:

image: registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server-amd64:v0.3.6
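The substitution can be previewed before the manifest is applied. This minimal sketch (the helper name swap_registry is hypothetical) runs the same sed expression with the dots escaped, keeps the original file as components.yaml.bak, and prints the resulting image: lines.

```shell
# Hypothetical helper: point every k8s.gcr.io image in a manifest at the
# Alibaba Cloud mirror, keep the original as <file>.bak, print the image lines.
swap_registry() {
  local manifest=$1
  # Same substitution as above, with the dots escaped so they match literally;
  # -i.bak edits in place and works with both GNU and BSD sed.
  sed -i.bak 's#k8s\.gcr\.io#registry.cn-hangzhou.aliyuncs.com/google_containers#g' "$manifest"
  grep 'image:' "$manifest"
}
```

Usage: swap_registry components.yaml, then kubectl apply -f components.yaml once the printed image line looks right.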

Modify the startup arguments

Before:

  • The official manifest passes only two startup arguments:
args:
  - --cert-dir=/tmp
  - --secure-port=4443

After:

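  • --metric-resolution: the interval at which metrics are scraped from the kubelets; the default is 60s
  • --kubelet-preferred-address-types: the order in which node address types are tried when connecting to the kubelet. Putting InternalIP first means the kubelet is reached by IP, which avoids failed kubelet API calls when a node name has no DNS record. The default order is Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP
  • --kubelet-insecure-tls: do not verify the serving certificate presented by the kubelet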
args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --metric-resolution=10s
  - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
  - --kubelet-insecure-tls
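If metrics-server is already running, the same arguments can also be appended to the live Deployment with a JSON patch instead of editing components.yaml and re-applying it. This is a sketch assuming the Deployment is kube-system/metrics-server and metrics-server is the first container (index 0) in the pod template, as in the official manifest; the function name is hypothetical.

```shell
# Hypothetical helper: append startup arguments to the metrics-server container
# of a running Deployment via a JSON patch.
# Assumes Deployment kube-system/metrics-server with metrics-server as container 0.
add_metrics_server_args() {
  local ops="" arg
  for arg in "$@"; do
    # "args/-" is JSON Patch syntax for "append to the end of the args array"
    ops="${ops:+$ops,}{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"$arg\"}"
  done
  kubectl -n kube-system patch deployment metrics-server --type=json -p "[$ops]"
}
```

Usage: add_metrics_server_args --metric-resolution=10s --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP --kubelet-insecure-tls. Each run appends again, so running it twice duplicates the arguments; check kubectl -n kube-system get deployment metrics-server -o yaml first.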

Common errors (not an exhaustive list)

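Without --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
  • metrics-server logs errors similar to the following, because the node names cannot be resolved through the cluster DNS at 10.96.0.10: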
E0907 14:29:51.774592       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to fetch metrics from Kubelet <node_name> (<node_name>): Get https://<node_name>:10250/stats/summary/: dial tcp: lookup <node_name> on 10.96.0.10:53: no such host]
E0907 14:30:10.517886       1 reststorage.go:112] unable to fetch node metrics for node "<node_name>": no metrics known for node "<node_name>"
  • Alternatively, you can add hosts entries for the node names inside the metrics-server pod so that they resolve
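To see which node names need hosts entries, the failing names can be pulled out of the metrics-server log. A small sketch, assuming the log text follows the "lookup <name> on <dns>: no such host" pattern shown above; the function name is hypothetical. The names it prints are the ones that would need entries, for example through the pod's hostAliases field.

```shell
# Hypothetical helper: read metrics-server log text on stdin and print each
# host name that failed DNS resolution, once.
unresolved_hosts() {
  # Failures look like: "lookup <name> on <dns-server>: no such host"
  grep -o 'lookup [^ ]* on [^ ]*: no such host' | awk '{ print $2 }' | sort -u
}
```

Usage: kubectl -n kube-system logs deploy/metrics-server | unresolved_hosts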

Without --kubelet-insecure-tls
  • metrics-server reports the following error when scraping the kubelets:
x509: certificate signed by unknown authority
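This error usually means the kubelet's serving certificate does not chain to a CA that metrics-server trusts: when the kubelet is not given a certificate, it generates a self-signed one. Before disabling verification, you can confirm this with openssl. The sketch assumes kubeadm-style default paths (/etc/kubernetes/pki/ca.crt for the cluster CA, /var/lib/kubelet/pki/kubelet.crt for the kubelet's self-generated certificate); binary deployments may use different paths, and the function name is hypothetical.

```shell
# Hypothetical helper: check whether a kubelet serving certificate chains to
# the given CA. Prints openssl's result; returns 0 only on successful verification.
kubelet_cert_trusted() {
  local ca=$1 cert=$2 out
  out=$(openssl verify -CAfile "$ca" "$cert" 2>&1) || true
  echo "$out"
  # On success openssl prints "<cert>: OK"; anything else is a failure
  case "$out" in
    *": OK") return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage (run on the node; paths are assumptions): kubelet_cert_trusted /etc/kubernetes/pki/ca.crt /var/lib/kubelet/pki/kubelet.crt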

No network connectivity between the apiserver node and the metrics-server pod
  • metrics-server logs errors similar to the following:
unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:<node_name>: unable to get CPU for container "metrics-server" in pod kube-system/metrics-server-7db5b7cb7c-pkcjb on node "<node_name>", discarding data: missing cpu usage metric
  • kube-apiserver logs errors similar to the following:
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
v1beta1.metrics.k8s.io failed with: failing or missing response from https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1: Get "https://172.30.1.16:4443/apis/metrics.k8s.io/v1beta1": dial tcp 172.30.1.16:4443: i/o timeout
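The timeouts above come from kube-apiserver failing to open a connection to the pod IP on port 4443. A quick way to confirm this from the apiserver node, before changing the CNI setup, is a raw TCP connect. The sketch uses bash's /dev/tcp pseudo-device and coreutils timeout; the function name and the 3-second default are assumptions.

```shell
# Hypothetical helper: test TCP reachability of host:port within a timeout.
# Opens the connection in a child bash via /dev/tcp; returns 1 if it fails.
check_port() {
  local host=$1 port=$2 secs=${3:-3}
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "NOT reachable: $host:$port"
    return 1
  fi
}
```

Usage: check_port 172.30.1.16 4443, using the pod IP and port from the log above.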

My scenario

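  • The cluster was originally deployed from binaries. The plan was for master nodes not to run pods, so the flannel plugin was not installed on them, which left the apiserver nodes without a route to the pod network
  • In this deployment flannel runs as pods, so running flannel on the master nodes means the masters must also act as worker nodes, which did not match the original plan
  • The compromise: let the master nodes also act as worker nodes (so flannel runs on them), then mark them unschedulable and evict all workloads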

Mark the node as unschedulable:

kubectl cordon <node name>

Evict the pods on the node, keeping DaemonSet pods (such as flannel):

kubectl drain <node name> --ignore-daemonsets

