Hello folks! Today we'll walk through quickly deploying the Prometheus Stack with Helm. In this article, we'll discuss Prometheus and Grafana and how to set up monitoring for any Kubernetes cluster using Helm charts. We'll also see how to connect Prometheus and Grafana, and set up a basic dashboard in Grafana to monitor resources on a Kubernetes cluster.
Before getting into the main topic, let's briefly go over the concepts around the Prometheus Stack ecosystem.
Prometheus Stack Overview
Prometheus Stack usually refers collectively to Prometheus, Grafana, and their associated integration components. In real-world scenarios, Prometheus and Grafana work together for monitoring and visualization: Prometheus acts as the data source, collecting metrics and serving them to Grafana, while Grafana visualizes that data with its appealing dashboards.
Regarding Prometheus:
1. Prometheus is an open-source systems monitoring and alerting toolkit;
2. Prometheus collects metrics and stores them as time-series data, and it provides out-of-the-box monitoring for container orchestration platforms such as Kubernetes.
Regarding Grafana:
1. Grafana is a multi-platform, open-source analytics and interactive visualization web application;
2. When connected to supported data sources, Grafana provides charts, graphs, and alerts over the web;
3. Grafana lets us query, visualize, alert on, and understand our metrics no matter where they are stored. Besides Prometheus, supported data sources include AWS CloudWatch, Azure Monitor, PostgreSQL, Elasticsearch, and more;
4. Beyond that, we can build our own dashboards for specific needs or personalize the existing dashboards that Grafana ships with (a small sketch of wiring Prometheus into Grafana follows this list).
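To make the "Prometheus as a Grafana data source" idea concrete, here is a minimal sketch that registers Prometheus through Grafana's HTTP API. The addresses and credentials (localhost:3000, admin/admin, and the in-cluster URL http://prometheus-k8s.monitoring.svc:9090) are assumptions for illustration; in the kube-prometheus setup used later in this article, this wiring is already provisioned for you (note the grafana-datasources Secret in the apply output).

```bash
# Sketch: add Prometheus as a Grafana data source via Grafana's HTTP API.
# Assumes Grafana at localhost:3000 (admin/admin) and Prometheus reachable at
# http://prometheus-k8s.monitoring.svc:9090 -- adjust both to your environment.
curl -s -u admin:admin -X POST http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus-k8s.monitoring.svc:9090",
        "access": "proxy",
        "isDefault": true
      }'
```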
Having graduated from the CNCF in August 2018, Prometheus is a next-generation open-source monitoring solution whose design philosophy closely matches Google's SRE approach. Let's take a look at its architecture diagram:
Based on the architecture diagram above, here is a brief breakdown of the core components:
1. Prometheus Server: the main server that scrapes and stores time-series data.
2. TSDB (time-series database): metrics are key to understanding the health and behavior of any system, so every system needs to collect, store, and report them to reveal its pulse. Because the data is stored across a series of time intervals, an efficient database is needed to store and retrieve it; Prometheus ships with its own embedded TSDB for exactly this purpose (external time-series databases such as OpenTSDB can also be integrated via remote storage).
3. PromQL: Prometheus defines a rich query language, PromQL, for querying data from the time-series database (a small query sketch follows this list).
4. Pushgateway: supports short-lived jobs by letting them push their metrics for Prometheus to scrape.
5. Exporters: used to expose metrics from third-party systems so that the Prometheus server can scrape them.
6. Alertmanager: sends notifications to various communication channels such as Slack and email to alert users.
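As promised next to the PromQL item above, here is a small sketch of running a PromQL query against Prometheus' own HTTP API. The /api/v1/query endpoint is part of Prometheus; the local address and the metric chosen are assumptions, and the port-forward presumes the stack deployed later in this article is already running.

```bash
# Sketch: run a PromQL instant query against Prometheus' HTTP API.
# Assumes the kube-prometheus Service prometheus-k8s exists; run the
# port-forward in a separate terminal (or background it, as here).
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &

# CPU usage rate over the last 5 minutes, aggregated per namespace.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'
```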
The interaction between these ecosystem components inside a Kubernetes cluster is shown in the following architecture diagram:
Kube-Prometheus-Stack Explained
The Kube-Prometheus-Stack repository collects Kubernetes manifests, Grafana dashboards, and Prometheus rules, combined with documentation and scripts, to provide easy-to-operate, end-to-end Kubernetes cluster monitoring based on the Prometheus Operator.
The project is written in jsonnet and can be described both as a package and as a library. It mainly includes the following components:
- The Prometheus Operator
- Highly available Prometheus
- Highly available Alertmanager
- Prometheus node-exporter
- Prometheus Adapter for Kubernetes Metrics APIs
- kube-state-metrics
- Grafana
kube-prometheus-stack is primarily intended for cluster monitoring, so it is pre-configured to collect metrics from all Kubernetes components. On top of that, it ships with a default set of dashboards and alerting rules, and it exposes composable jsonnet as a library that users can customize for their own needs.
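To make the customization point concrete: with the Prometheus Operator, additional scrape targets are usually declared as ServiceMonitor objects rather than edits to prometheus.yml. The sketch below registers a hypothetical application for scraping; the app name, namespace, labels, and port are all placeholder assumptions.

```bash
# Sketch: have the Operator-managed Prometheus scrape a hypothetical app.
# The app name/labels/port below are placeholders -- adapt to your Service.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus   # match whatever label selector your Prometheus uses
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app              # must match the labels on the target Service
  endpoints:
    - port: metrics            # named Service port exposing /metrics
      interval: 30s
EOF
```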
Before deploying, we need to check the version compatibility between the current Kubernetes cluster and the Kube-Prometheus-Stack components, summarized in the table below (a small version-check sketch follows it):
kube-prometheus stack | Kubernetes 1.20 | Kubernetes 1.21 | Kubernetes 1.22 | Kubernetes 1.23 | Kubernetes 1.24 |
release-0.8 | ✔ | ✔ | ✗ | ✗ | ✗ |
release-0.9 | ✗ | ✔ | ✔ | ✗ | ✗ |
release-0.10 | ✗ | ✗ | ✔ | ✔ | ✗ |
release-0.11 | ✗ | ✗ | ✗ | ✔ | ✔ |
main | ✗ | ✗ | ✗ | ✗ | ✔ |
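As a small convenience (not part of the project itself), the cluster's server version can be read like this before choosing a release branch; it assumes jq is installed:

```bash
# Sketch: print the API server version, then pick the matching release branch
# from the table above (e.g. Kubernetes 1.23 -> release-0.10 or release-0.11).
kubectl version -o json | jq -r '"\(.serverVersion.major).\(.serverVersion.minor)"'
```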
Deployment and Installation
We will deploy the full Kube-Prometheus-Stack suite quickly and efficiently with Helm. Before that, we need a Kubernetes cluster; here we use Minikube to stand up a single-node environment (a sketch of the start command follows the Pod listing), as shown below:
```
[leonli@Leon k8s ] % kubectl get po -A -o wide
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE   IP             NODE          NOMINATED NODE   READINESS GATES
kube-system   coredns-64897985d-v9jcf               1/1     Running   0          38s   172.17.0.2     k8s-cluster   <none>           <none>
kube-system   etcd-k8s-cluster                      1/1     Running   0          51s   192.168.49.2   k8s-cluster   <none>           <none>
kube-system   kube-apiserver-k8s-cluster            1/1     Running   0          51s   192.168.49.2   k8s-cluster   <none>           <none>
kube-system   kube-controller-manager-k8s-cluster   1/1     Running   0          51s   192.168.49.2   k8s-cluster   <none>           <none>
kube-system   kube-proxy-rpvg8                      1/1     Running   0          38s   192.168.49.2   k8s-cluster   <none>           <none>
kube-system   kube-scheduler-k8s-cluster            1/1     Running   0          51s   192.168.49.2   k8s-cluster   <none>           <none>
kube-system   storage-provisioner                   1/1     Running   0          50s   192.168.49.2   k8s-cluster   <none>           <none>
```
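For reference, a single-node cluster like the one above could be started with a command along these lines; the resource sizes and Kubernetes version are illustrative assumptions (a generous --memory value also helps avoid the scheduling issue we hit later):

```bash
# Sketch: start a single-node Minikube cluster sized for the monitoring stack.
# Profile name, memory/cpus and the Kubernetes version are only examples.
minikube start -p k8s-cluster \
  --kubernetes-version=v1.23.3 \
  --cpus=4 \
  --memory=8192
```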
Next, let's install Helm. To make sure the rest of the stack deploys correctly, the Helm version must be compatible with the current Kubernetes cluster. The mapping between Helm versions and supported Kubernetes versions is as follows (an installation sketch follows the table):
Helm Version | Supported Kubernetes Versions |
3.9.x | 1.24.x - 1.21.x |
3.8.x | 1.23.x - 1.20.x |
3.7.x | 1.22.x - 1.19.x |
3.6.x | 1.21.x - 1.18.x |
3.5.x | 1.20.x - 1.17.x |
3.4.x | 1.19.x - 1.16.x |
... | ... |
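With the table in mind, one way to install a pinned Helm release is the official installer script; the exact version below is just an example that pairs with a 1.23 cluster:

```bash
# Sketch: install a specific Helm version with the official install script.
# DESIRED_VERSION is an example; pick one compatible with your cluster.
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
  -o get_helm.sh
chmod +x get_helm.sh
DESIRED_VERSION=v3.8.1 ./get_helm.sh
```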
Let's check the environment we have prepared, as shown below:
```
[leonli@Leon kube-prometheus ] % helm version
version.BuildInfo{Version:"v3.8.1", GitCommit:"5cb9af4b1b271d11d7a97a71df3ac337dd94ad37", GitTreeState:"clean", GoVersion:"go1.17.8"}
[leonli@Leon kube-prometheus ] % kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:30:48Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/arm64"}
```
Next, we move on to installing kube-prometheus-stack itself, as shown below:
```
[leonli@Leon minikube ] % kubectl create ns monitoring
namespace/monitoring created
```
```
[leonli@Leon minikube ] % helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
[leonli@Leon minikube ] % helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "traefik" chart repository
...Successfully got an update from the "komodorio" chart repository
...Successfully got an update from the "traefik-hub" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
[leonli@Leon minikube ] % helm install prometheus-community/kube-prometheus-stack --namespace monitoring --generate-name
Error: INSTALLATION FAILED: failed to download "prometheus-community/kube-prometheus-stack"
```
The `helm install` step may fail because of network issues when downloading the chart. We can retry a few times, or simply clone the project from GitHub and install from there, as shown below:
```
[leonli@Leon minikube ] % git clone https://github.com/prometheus-operator/kube-prometheus.git -b release-0.10
Cloning into 'kube-prometheus'...
remote: Enumerating objects: 17291, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 17291 (delta 126), reused 146 (delta 91), pack-reused 17094
Receiving objects: 100% (17291/17291), 9.18 MiB | 6.19 MiB/s, done.
Resolving deltas: 100% (11319/11319), done.
```
Now, change into the kube-prometheus directory and apply all the YAML files under the manifests/setup directory:
```
[leonli@Leon kube-prometheus ] % kubectl apply --server-side -f manifests/setup --force-conflicts
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com serverside-applied
namespace/monitoring serverside-applied
```
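After the server-side apply, it can help to wait until the new CRDs are actually established before applying the rest of the manifests; a sketch is below (the upstream project uses an equivalent `until kubectl get servicemonitors ...` loop, which you will also see later):

```bash
# Sketch: block until the freshly created monitoring CRDs are established,
# so the follow-up `kubectl apply -f manifests/` does not race against them.
kubectl wait --for=condition=Established --timeout=120s \
  crd/servicemonitors.monitoring.coreos.com \
  crd/prometheuses.monitoring.coreos.com \
  crd/alertmanagers.monitoring.coreos.com
```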
kube-prometheus installs into the monitoring namespace by default. Because some of the images are hosted overseas, the installation can be very slow and may even fail when images cannot be pulled.
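If pulls are slow or keep failing, one workaround is to pre-pull the heavier images directly on the Minikube node; a sketch (the tags below come from this deployment, and a docker runtime on the node is an assumption):

```bash
# Sketch: pre-pull images inside the Minikube node to work around slow pulls.
# Assumes the docker runtime; add `-p <profile>` if you use a named profile.
minikube ssh -- docker pull quay.io/prometheus/prometheus:v2.32.1
minikube ssh -- docker pull quay.io/prometheus-operator/prometheus-config-reloader:v0.53.1
```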
```
[leonli@Leon kube-prometheus ] % cd manifests/setup
[leonli@Leon setup ] % ls -l
total 3040
-rw-r--r-- 1 leonli admin 169131 Dec 2 14:53 0alertmanagerConfigCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 377495 Dec 2 14:53 0alertmanagerCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 30361 Dec 2 14:53 0podmonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 31477 Dec 2 14:53 0probeCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 502646 Dec 2 14:53 0prometheusCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 4101 Dec 2 14:53 0prometheusruleCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 31881 Dec 2 14:53 0servicemonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 385790 Dec 2 14:53 0thanosrulerCustomResourceDefinition.yaml
-rw-r--r-- 1 leonli admin 60 Dec 2 14:53 namespace.yaml
[leonli@Leon setup ] % until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
[leonli@Leon setup ] % cd ../..
[leonli@Leon kube-prometheus ] % kubectl apply -f manifests/
alertmanager.monitoring.coreos.com/main created
poddisruptionbudget.policy/alertmanager-main created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager-main created
clusterrole.rbac.authorization.k8s.io/blackbox-exporter created
clusterrolebinding.rbac.authorization.k8s.io/blackbox-exporter created
configmap/blackbox-exporter-configuration created
deployment.apps/blackbox-exporter created
service/blackbox-exporter created
serviceaccount/blackbox-exporter created
servicemonitor.monitoring.coreos.com/blackbox-exporter created
secret/grafana-config created
secret/grafana-datasources created
configmap/grafana-dashboard-alertmanager-overview created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kube-state-metrics-rules created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kubernetes-monitoring-rules created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
prometheusrule.monitoring.coreos.com/node-exporter-rules created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
poddisruptionbudget.policy/prometheus-k8s created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-prometheus-rules created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-k8s created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
poddisruptionbudget.policy/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
servicemonitor.monitoring.coreos.com/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
prometheusrule.monitoring.coreos.com/prometheus-operator-rules created
service/prometheus-operator created
serviceaccount/prometheus-operator created
servicemonitor.monitoring.coreos.com/prometheus-operator created
```
Now, let's check the progress of the resources we just created:
```
[leonli@Leon kube-prometheus ] % kubectl get all -n monitoring
NAME                                       READY   STATUS              RESTARTS   AGE
pod/blackbox-exporter-6b79c4588b-xxjkm     0/3     ContainerCreating   0          14s
pod/grafana-7fd69887fb-rn65j               0/1     ContainerCreating   0          14s
pod/kube-state-metrics-55f67795cd-xlxw6    0/3     ContainerCreating   0          13s
pod/node-exporter-kxdrw                    0/2     ContainerCreating   0          13s
pod/prometheus-adapter-5565cc8d76-6tnfr    0/1     ContainerCreating   0          13s
pod/prometheus-adapter-5565cc8d76-jkcwp    0/1     ContainerCreating   0          13s
pod/prometheus-operator-6dc9f66cb7-lkfdl   0/2     ContainerCreating   0          13s

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)              AGE
service/alertmanager-main     ClusterIP   10.101.159.117   <none>        9093/TCP,8080/TCP    14s
service/blackbox-exporter     ClusterIP   10.96.196.202    <none>        9115/TCP,19115/TCP   14s
service/grafana               ClusterIP   10.102.241.149   <none>        3000/TCP             14s
service/kube-state-metrics    ClusterIP   None             <none>        8443/TCP,9443/TCP    14s
service/node-exporter         ClusterIP   None             <none>        9100/TCP             13s
service/prometheus-adapter    ClusterIP   10.103.77.245    <none>        443/TCP              13s
service/prometheus-k8s        ClusterIP   10.107.208.95    <none>        9090/TCP,8080/TCP    13s
service/prometheus-operator   ClusterIP   None             <none>        8443/TCP             13s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/node-exporter   1         1         0       1            0           kubernetes.io/os=linux   13s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/blackbox-exporter     0/1     1            0           14s
deployment.apps/grafana               0/1     1            0           14s
deployment.apps/kube-state-metrics    0/1     1            0           14s
deployment.apps/prometheus-adapter    0/2     2            0           13s
deployment.apps/prometheus-operator   0/1     1            0           13s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/blackbox-exporter-6b79c4588b     1         1         0       14s
replicaset.apps/grafana-7fd69887fb               1         1         0       14s
replicaset.apps/kube-state-metrics-55f67795cd    1         1         0       14s
replicaset.apps/prometheus-adapter-5565cc8d76    2         2         0       13s
replicaset.apps/prometheus-operator-6dc9f66cb7   1         1         0       13s
```
```
[leonli@Leon manifests ] % kubectl get pods -n monitoring -o wide
NAME                                   READY   STATUS             RESTARTS   AGE     IP             NODE             NOMINATED NODE   READINESS GATES
alertmanager-main-0                    2/2     Running            0          13m     172.17.0.11    devops-cluster   <none>           <none>
alertmanager-main-1                    2/2     Running            0          13m     172.17.0.9     devops-cluster   <none>           <none>
alertmanager-main-2                    2/2     Running            0          13m     172.17.0.10    devops-cluster   <none>           <none>
blackbox-exporter-6b79c4588b-xxjkm     3/3     Running            0          14m     172.17.0.2     devops-cluster   <none>           <none>
grafana-7fd69887fb-rn65j               1/1     Running            0          14m     172.17.0.3     devops-cluster   <none>           <none>
kube-state-metrics-55f67795cd-xlxw6    2/3     ImagePullBackOff   0          14m     172.17.0.5     devops-cluster   <none>           <none>
kube-state-metrics-7ff75cff8b-qg57k    2/3     ImagePullBackOff   0          2m24s   172.17.0.7     devops-cluster   <none>           <none>
node-exporter-kxdrw                    2/2     Running            0          14m     192.168.49.2   devops-cluster   <none>           <none>
prometheus-adapter-5698bb779b-2p5mm    1/1     Running            0          5m17s   172.17.0.6     devops-cluster   <none>           <none>
prometheus-adapter-5698bb779b-2ptn4    1/1     Running            0          5m17s   172.17.0.12    devops-cluster   <none>           <none>
prometheus-k8s-0                       0/2     Pending            0          13m     <none>         <none>           <none>           <none>
prometheus-k8s-1                       0/2     Pending            0          13m     <none>         <none>           <none>           <none>
prometheus-operator-6dc9f66cb7-lkfdl   2/2     Running            0          14m     172.17.0.8     devops-cluster   <none>           <none>
```
During installation, Pods may fail to reach a healthy state for various reasons. We can run `kubectl describe pod [pod_name] -n [namespace]` to inspect the details and find out exactly why a Pod is not running. Let's take prometheus-k8s-0, which is stuck in the Pending state, as an example:
```
[leonli@Leon manifests ] % kubectl describe pod prometheus-k8s-0 -n monitoring
Name:           prometheus-k8s-0
Namespace:      monitoring
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=prometheus
                app.kubernetes.io/instance=k8s
                app.kubernetes.io/managed-by=prometheus-operator
                app.kubernetes.io/name=prometheus
                app.kubernetes.io/part-of=kube-prometheus
                app.kubernetes.io/version=2.32.1
                controller-revision-hash=prometheus-k8s-5f9554b8cd
                operator.prometheus.io/name=k8s
                operator.prometheus.io/shard=0
                prometheus=k8s
                statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations:    kubectl.kubernetes.io/default-container: prometheus
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/prometheus-k8s
Init Containers:
  init-config-reloader:
    Image:      quay.io/prometheus-operator/prometheus-config-reloader:v0.53.1
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --watch-interval=0
      --listen-address=:8080
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nv7qs (ro)
Containers:
  prometheus:
    Image:      quay.io/prometheus/prometheus:v2.32.1
    Port:       9090/TCP
    Host Port:  0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --web.route-prefix=/
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
    Requests:
      memory:     400Mi
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup:      http-get http://:web/-/ready delay=0s timeout=3s period=15s #success=1 #failure=60
    Environment:  <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
      /prometheus from prometheus-k8s-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nv7qs (ro)
  config-reloader:
    Image:      quay.io/prometheus-operator/prometheus-config-reloader:v0.53.1
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=:8080
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nv7qs (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          prometheus-k8s-tls-assets-0
    SecretOptionalName:  <nil>
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  web-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-web-config
    Optional:    false
  prometheus-k8s-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-nv7qs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  7s (x13 over 13m)  default-scheduler  0/1 nodes are available: 1 Insufficient memory.
```
As we can see, prometheus-k8s-0 cannot be scheduled because there is not enough memory available to allocate to it. In this case, we can set the replica count of every component to 1 to reduce resource consumption.
We can inspect the current replica settings from the command line, as shown below:
```
[leonli@Leon manifests ] % grep -rn 'replicas:' *
alertmanager-alertmanager.yaml:23: replicas: 3
blackboxExporter-deployment.yaml:12: replicas: 1
grafana-deployment.yaml:12: replicas: 1
kubeStateMetrics-deployment.yaml:12: replicas: 1
prometheus-prometheus.yaml:35: replicas: 2
prometheusAdapter-deployment.yaml:12: replicas: 2
prometheusOperator-deployment.yaml:12: replicas: 1
setup/0alertmanagerCustomResourceDefinition.yaml:3547: replicas:
setup/0alertmanagerCustomResourceDefinition.yaml:6010: replicas:
setup/0thanosrulerCustomResourceDefinition.yaml:3670: replicas:
setup/0thanosrulerCustomResourceDefinition.yaml:6172: replicas:
setup/0prometheusCustomResourceDefinition.yaml:5119: replicas:
setup/0prometheusCustomResourceDefinition.yaml:8271: replicas:
```
Then set the replicas to 1 and re-apply the YAML files; one quick way to make the edit is sketched below, and the resulting Pod list follows:
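A minimal sketch of the edit, assuming GNU sed (on macOS, use `sed -i ''` instead of `sed -i`); the file names come from the grep output above:

```bash
# Sketch: reduce replica counts to 1 to fit a small single-node cluster,
# then re-apply the manifests (as shown below).
sed -i 's/replicas: 3/replicas: 1/' manifests/alertmanager-alertmanager.yaml
sed -i 's/replicas: 2/replicas: 1/' manifests/prometheus-prometheus.yaml
sed -i 's/replicas: 2/replicas: 1/' manifests/prometheusAdapter-deployment.yaml
```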
```
[leonli@Leon kube-prometheus ] % kubectl apply -f manifests/
[leonli@Leon manifests ] % kubectl get pods -n monitoring -o wide
NAME                                   READY   STATUS    RESTARTS   AGE     IP             NODE             NOMINATED NODE   READINESS GATES
alertmanager-main-0                    2/2     Running   0          7m23s   172.17.0.9     devops-cluster   <none>           <none>
blackbox-exporter-6b79c4588b-xxjkm     3/3     Running   0          26m     172.17.0.2     devops-cluster   <none>           <none>
grafana-7fd69887fb-rn65j               1/1     Running   0          26m     172.17.0.3     devops-cluster   <none>           <none>
kube-state-metrics-9d449c7f4-8kh8x     3/3     Running   0          16s     172.17.0.5     devops-cluster   <none>           <none>
node-exporter-kxdrw                    2/2     Running   0          26m     192.168.49.2   devops-cluster   <none>           <none>
prometheus-adapter-5698bb779b-2p5mm    1/1     Running   0          17m     172.17.0.6     devops-cluster   <none>           <none>
prometheus-k8s-0                       2/2     Running   0          24m     172.17.0.10    devops-cluster   <none>           <none>
prometheus-operator-6dc9f66cb7-lkfdl   2/2     Running   0          26m     172.17.0.8     devops-cluster   <none>           <none>
```
At this point, we can open the Minikube Dashboard to check the deployed components and verify that everything is running properly, as shown below:
Now, let's look at the Services once more, as shown below:
```
[leonli@Leon kube-prometheus ] % kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
alertmanager-main       ClusterIP   10.105.71.195   <none>        9093:30076/TCP,8080:31251/TCP   27h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP      27h
blackbox-exporter       ClusterIP   10.110.173.86   <none>        9115/TCP,19115/TCP              27h
grafana                 ClusterIP   10.102.213.39   <none>        3000:32540/TCP                  27h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP               27h
node-exporter           ClusterIP   None            <none>        9100/TCP                        27h
prometheus-adapter      ClusterIP   10.100.47.174   <none>        443/TCP                         27h
prometheus-k8s          ClusterIP   10.104.130.93   <none>        9090/TCP,8080/TCP               27h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                        27h
prometheus-operator     ClusterIP   None            <none>        8443/TCP                        27h
```
At this point, all the Prometheus Stack components have been deployed successfully.
Web Access
As the Service list shows, all Services are of type ClusterIP by default, so every resource is only reachable from inside the cluster. To access the web UIs from outside, we need to change the Service type to NodePort so that the node can expose them.
There are a couple of ways to do this:
1. Through the Kubernetes Dashboard UI
2. By editing the YAML files
Either way, the essence is to redefine `type: NodePort` so the Services can be reached from outside the cluster. A command-line sketch of the change is shown below, followed by the resulting Service list:
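For example, a purely CLI-based variant of option 2 is `kubectl patch`; this sketch only switches Grafana and Alertmanager, matching the listing that follows:

```bash
# Sketch: switch the externally-facing Services to NodePort from the CLI.
kubectl -n monitoring patch svc grafana           -p '{"spec": {"type": "NodePort"}}'
kubectl -n monitoring patch svc alertmanager-main -p '{"spec": {"type": "NodePort"}}'
```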
```
[leonli@Leon kube-prometheus ] % kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
alertmanager-main       NodePort    10.105.71.195   <none>        9093:30076/TCP,8080:31251/TCP   27h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP      27h
blackbox-exporter       ClusterIP   10.110.173.86   <none>        9115/TCP,19115/TCP              27h
grafana                 NodePort    10.102.213.39   <none>        3000:32540/TCP                  27h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP               27h
node-exporter           ClusterIP   None            <none>        9100/TCP                        27h
prometheus-adapter      ClusterIP   10.100.47.174   <none>        443/TCP                         27h
prometheus-k8s          ClusterIP   10.104.130.93   <none>        9090/TCP,8080/TCP               27h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                        27h
prometheus-operator     ClusterIP   None            <none>        8443/TCP                        27h
```
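If the NodePort is not directly reachable from the workstation (common with Minikube's docker driver), a port-forward gives the same localhost:3000 experience used in the next step; a sketch:

```bash
# Sketch: expose Grafana on localhost:3000 regardless of Service type.
kubectl -n monitoring port-forward svc/grafana 3000:3000

# Alternatively, Minikube can print a reachable URL for the NodePort Service:
# minikube service grafana -n monitoring --url
```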
Now we can access Grafana from a browser, for example at http://localhost:3000, and log in with the default admin/admin credentials, as shown below:
After logging in, the home page shows the following information:
That wraps up this hands-on walkthrough of deploying the Prometheus Stack with Helm. I hope you find it useful. If there is anything more you'd like to know, you're welcome to reach out and discuss!
Adiós!