Analysis of the Metrics-Server Metrics Collection Pipeline

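Summary: Metrics-server collects metric data that originates from cAdvisor, formats it, and exposes it through the apiserver as the Metrics API. Its core role is to provide the metrics that kubectl top and components such as HPA base their decisions on.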

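This article is based on Alibaba Cloud Container Service for Kubernetes (ACK). Its goal is to help quickly resolve common problems in container metrics monitoring, such as:

Why is the resource usage shown by kubectl top node much higher than what top shows?

Why does kubectl top node show a resource percentage above 100%?

Why does HPA scale abnormally on the CPU metric?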


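The Metrics-Server metrics collection pipeline

Metrics-server collects the basic metrics (CPU/Memory) through the following pipeline: cgroup is the data source, cAdvisor collects the data, kubelet computes and aggregates it, and the apiserver exposes it as an API for clients (HPA / kubectl top).

The data request flow in the diagram above (the reverse path):

Step 1. kubectl top sends a request to the Metrics API of the APIServer:
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx
Step 2. The aggregation layer forwards the request to the backend svc metrics-server, as defined by the APIService "metrics.k8s.io":
# kubectl get apiservices v1beta1.metrics.k8s.io
Step 3. The metrics-server pod returns the most recently collected metrics.
Note: the metrics shown in the Alibaba Cloud CloudMonitor container monitoring console are obtained through port 8093 of the sink provider configured on metrics-server.
Also check whether the legacy heapster svc points to the backend metrics-server pod as well:
# kubectl get svc -n kube-system heapster -oyaml
Step 4. Metrics-server periodically collects data from the endpoint exposed by kubelet, converts it into the Kubernetes API format, and exposes it through the Metrics API.
Metrics-server itself does not sample the data and does not store it permanently; it essentially converts the kubelet data.
kubelet computes and aggregates on top of the data collected by cAdvisor and provides cgroup data at the container, pod, and node levels:
older version: # curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true
v0.6.0+: # curl 127.0.0.1:10255/metrics/resource
Step 5. cAdvisor periodically collects data from cgroup at the container cgroup level. Its endpoint is /metrics/cadvisor, which provides only container and machine data and does not compute pod- or node-level metrics.
# curl http://127.0.0.1:10255/metrics/cadvisor
Step 6. Inspect the data in the metric files under the cgroup directory of the node or pod.
# cd /sys/fs/cgroup/xxxx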

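Viewing metrics in different ways

This part briefly summarizes the ways to view the metrics; each step is analyzed in detail later.

1. Determine the cgroup root directory of the node

mount |grep cgroup    shows the cgroup root directory of the node: /sys/fs/cgroup


2. Find the directory of the pod/container cgroup metric files

  • Locate the metric files by pod UID:
    Get the pod UID: kubectl get pod -n xxxx  xxxxx  -oyaml |grep Uid -i -B10
    Get the container ID: kubectl describe pod -n xxxx  xxxxx  |grep id -i
    With the UID and container ID obtained above you can enter the cgroup directory of a container in the pod (with the systemd cgroup driver, the dashes in the UID appear as underscores in the directory name):
/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod-uid>.slice/cri-containerd-<container-id>.scope
  • Locate the metric source files by container PID
    Get the cgroup directory for the PID that belongs to the pod:
crictl pods |grep pod-name    returns the pod-id
crictl ps |grep container-name    or    crictl ps |grep pod-id    returns the container-id
crictl inspect <container-id> |grep -i pid
cat /proc/${CPID}/cgroup    or    cat /proc/${CPID}/mountinfo  |grep cgroup

For example, the cgroup file shows the specific cgroup subdirectories of the pod:

"cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope",
"memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope",
  • The state.json file of the pod also shows the cgroup information of the pod

# crictl pods |grep pod-name

This returns the pod-id; note that the pod-id is not the pod UID.

# cat /run/containerd/runc/k8s.io//state.json  |jq .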

"cgroup_paths": {
  "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  "cpuset": "/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  ...
  "memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  ...
},
"namespace_paths": {
  "NEWCGROUP": "/proc/8443/ns/cgroup",
  "NEWNET": "/proc/8443/ns/net",
  "NEWNS": "/proc/8443/ns/mnt",
  "NEWPID": "/proc/8443/ns/pid",
  "NEWUTS": "/proc/8443/ns/uts"
  ...
},
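The /proc/${CPID}/cgroup lookup above can also be done programmatically. The following is a minimal Go sketch, assuming the cgroup v1 line format "hierarchy-ID:controller-list:path"; the function name cgroupPaths and the sample content are illustrative and were not captured from a real node.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cgroupPaths parses the text of /proc/<pid>/cgroup (cgroup v1 layout:
// "hierarchy-ID:controller-list:path") and returns a controller -> path map.
// Lines without a controller name (for example the cgroup v2 "0::/path" line)
// are ignored.
func cgroupPaths(procCgroup string) map[string]string {
	paths := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(procCgroup))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 {
			continue // skip malformed lines
		}
		for _, ctrl := range strings.Split(parts[1], ",") {
			if ctrl != "" {
				paths[ctrl] = parts[2]
			}
		}
	}
	return paths
}

func main() {
	// Illustrative sample; on a node, read /proc/<pid>/cgroup instead.
	sample := "11:memory:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope\n" +
		"4:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope\n"
	p := cgroupPaths(sample)
	// Prefix the v1 mount point of each controller to get the metric directory.
	fmt.Println("/sys/fs/cgroup/memory" + p["memory"])
	fmt.Println("/sys/fs/cgroup/cpu,cpuacct" + p["cpuacct"])
}
```

The returned path is relative to the controller mount point, which is why the sketch prefixes /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu,cpuacct.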

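3. Computing metrics from cgroup

Enter the cgroup directory of the node/pod/container and look at the files for each metric. The meaning of every metric file is not covered in detail here.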

//pod cgroup for cpu
# cd /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
...
cpuacct.block_latency        cpuacct.stat             cpuacct.usage_percpu_sys   cpuacct.wait_latency  cpu.cfs_period_us       cpu.stat
cpuacct.cgroup_wait_latency  cpuacct.usage            cpuacct.usage_percpu_user  cpu.bvt_warp_ns       cpu.cfs_quota_us        notify_on_release
// Note: the cpu and cpuacct cgroup directories are both symlinks to cpu,cpuacct
lrwxrwxrwx  1 root root  11 3月   5 05:10 cpu -> cpu,cpuacct
lrwxrwxrwx  1 root root  11 3月   5 05:10 cpuacct -> cpu,cpuacct
dr-xr-xr-x  7 root root   0 4月  28 15:41 cpu,cpuacct
dr-xr-xr-x  3 root root   0 4月   3 17:33 cpuset
//pod cgroup for memory
# cd  /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
memory.direct_swapout_global_latency  memory.kmem.tcp.failcnt             memory.min                       memory.thp_reclaim          tasks
memory.direct_swapout_memcg_latency   memory.kmem.tcp.limit_in_bytes      memory.move_charge_at_immigrate  memory.thp_reclaim_ctrl
memory.events                         memory.kmem.tcp.max_usage_in_bytes  memory.numa_stat                 memory.thp_reclaim_stat
memory.events.local                   memory.kmem.tcp.usage_in_bytes      memory.oom_control               memory.usage_in_bytes
...


  • Memory calculation:
    Formula:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

    For example:

usage_in_bytes=$(cat memory.usage_in_bytes)
total_inactive_file=$(grep -w total_inactive_file memory.stat | awk '{print $2}')
workingset=$((usage_in_bytes - total_inactive_file))
Convert to MiB: echo $((workingset / 1024 / 1024))


  • CPU calculation:
    Formula:
cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
// The cgroup source file cpuacct.usage holds the cumulative CPU value (usageCoreNanoSeconds), so the formula above must be applied to obtain the CPU usage.
// The average CPU cores used over the period from startTime to endTime are:
(usageCoreNanoSeconds at endTime - usageCoreNanoSeconds at startTime) / (endTime - startTime)


For example, to compute the average usage over the past ten seconds (in CPU cores, not as a percentage):

tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;


4. View cAdvisor metrics

  • Get the cAdvisor metrics through the raw API:
    kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor
  • Get the data on the node through the local cAdvisor endpoint /metrics/cadvisor:
    curl http://127.0.0.1:10255/metrics/cadvisor


5. Get metrics on the node through the kubelet endpoints


curl 127.0.0.1:10255/metrics/resource
curl '127.0.0.1:10255/stats/summary?only_cpu_and_memory=true' |jq '.node'


6. Get data through the metrics-server API (the way kubectl top and HPA call it)

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx |jq .
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx |jq .



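The metrics pipeline explained in five parts

If a client fails to get metric data, you can query the interface exposed by each stage of the pipeline (cgroup -> cAdvisor -> kubelet -> metrics-server pod -> apiserver/Metrics API) to locate where the problem occurs.


Part 1: Data source: the Linux cgroup hierarchy

From the outermost level inward: node cgroup => QoS-level cgroup => pod-level cgroup => container-level cgroup

The node cgroup consists of the kubepods, user, and system parts.

The following diagram shows the cgroup containment hierarchy:

You can also use systemd-cgls to view the hierarchy; the output below is taken at the node-level root directory /sys/fs/cgroup: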



Part 2: cAdvisor collects container-level metrics from cgroup

Note: cAdvisor is only a metrics collector; its data comes from the cgroup files.


2.1 How cAdvisor computes the working_set memory value

inactiveFileKeyName := "total_inactive_file"
if cgroups.IsCgroup2UnifiedMode() {
	inactiveFileKeyName = "inactive_file"
}

workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
	if workingSet < v {
		workingSet = 0
	} else {
		workingSet -= v
	}
}
ret.Memory.WorkingSet = workingSet
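To check a working-set value by hand, the same logic can be applied to the raw cgroup files. The sketch below is a standalone approximation under the cgroup v1 assumption (key total_inactive_file); statValue and workingSet are hypothetical helper names, and the memory.stat content in main is illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// statValue returns the value of key in memory.stat-style text,
// where each line has the form "key value".
func statValue(memoryStat, key string) (uint64, bool) {
	sc := bufio.NewScanner(strings.NewReader(memoryStat))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 2 && f[0] == key {
			v, err := strconv.ParseUint(f[1], 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

// workingSet computes usage - total_inactive_file, clamped at zero.
// If the key is missing, the usage is returned unchanged.
func workingSet(usage uint64, memoryStat string) uint64 {
	inactive, ok := statValue(memoryStat, "total_inactive_file")
	if !ok {
		return usage
	}
	if usage < inactive {
		return 0
	}
	return usage - inactive
}

func main() {
	// Illustrative values; on a node, read memory.usage_in_bytes and memory.stat.
	stat := "cache 8388608\ntotal_inactive_file 4194304\ntotal_active_file 2097152\n"
	ws := workingSet(30605312, stat)
	fmt.Printf("working set: %d bytes (%.1f MiB)\n", ws, float64(ws)/1024/1024)
}
```

On a cgroup v2 node the key would be inactive_file instead, as the cAdvisor excerpt above shows.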

2.2 Analysis of the cAdvisor endpoint /metrics/cadvisor

kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor
or: curl http://127.0.0.1:10255/metrics/cadvisor


Note: /metrics/cadvisor is also one of the data sources of Prometheus.

https://github.com/google/cadvisor/blob/master/info/v1/container.go#L320


The endpoint returns the following metrics:

# curl http://127.0.0.1:10255/metrics/cadvisor | awk -F"\{" '{print $1}' | sort | uniq | grep "#" | grep -v TYPE
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# HELP container_blkio_device_usage_total Blkio Device bytes usage
# HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
# HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
# HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled.
# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# HELP container_file_descriptors Number of open file descriptors for the container.
# HELP container_fs_inodes_free Number of available Inodes
# HELP container_fs_inodes_total Number of Inodes
# HELP container_fs_io_current Number of I/Os currently in progress
# HELP container_fs_io_time_seconds_total Cumulative count of seconds spent doing I/Os
# HELP container_fs_io_time_weighted_seconds_total Cumulative weighted I/O time in seconds
# HELP container_fs_limit_bytes Number of bytes that can be consumed by the container on this filesystem.
# HELP container_fs_reads_bytes_total Cumulative count of bytes read
# HELP container_fs_read_seconds_total Cumulative count of seconds spent reading
# HELP container_fs_reads_merged_total Cumulative count of reads merged
# HELP container_fs_reads_total Cumulative count of reads completed
# HELP container_fs_sector_reads_total Cumulative count of sector reads completed
# HELP container_fs_sector_writes_total Cumulative count of sector writes completed
# HELP container_fs_usage_bytes Number of bytes that are consumed by the container on this filesystem.
# HELP container_fs_writes_bytes_total Cumulative count of bytes written
# HELP container_fs_write_seconds_total Cumulative count of seconds spent writing
# HELP container_fs_writes_merged_total Cumulative count of writes merged
# HELP container_fs_writes_total Cumulative count of writes completed
# HELP container_last_seen Last time a container was seen by the exporter
# HELP container_memory_cache Number of bytes of page cache memory.
# HELP container_memory_failcnt Number of memory usage hits limits
# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# HELP container_memory_mapped_file Size of memory mapped files in bytes.
# HELP container_memory_max_usage_bytes Maximum memory usage recorded in bytes
# HELP container_memory_rss Size of RSS in bytes.
# HELP container_memory_swap Container swap usage in bytes.
# HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
# HELP container_memory_working_set_bytes Current working set in bytes.
# HELP container_network_receive_bytes_total Cumulative count of bytes received
# HELP container_network_receive_errors_total Cumulative count of errors encountered while receiving
# HELP container_network_receive_packets_dropped_total Cumulative count of packets dropped while receiving
# HELP container_network_receive_packets_total Cumulative count of packets received
# HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
# HELP container_network_transmit_errors_total Cumulative count of errors encountered while transmitting
# HELP container_network_transmit_packets_dropped_total Cumulative count of packets dropped while transmitting
# HELP container_network_transmit_packets_total Cumulative count of packets transmitted
# HELP container_processes Number of processes running inside the container.
# HELP container_scrape_error 1 if there was an error while getting container metrics, 0 otherwise
# HELP container_sockets Number of open sockets for the container.
# HELP container_spec_cpu_period CPU period of the container.
# HELP container_spec_cpu_quota CPU quota of the container.
# HELP container_spec_cpu_shares CPU share of the container.
# HELP container_spec_memory_limit_bytes Memory limit for the container.
# HELP container_spec_memory_reservation_limit_bytes Memory reservation limit for the container.
# HELP container_spec_memory_swap_limit_bytes Memory swap limit for the container.
# HELP container_start_time_seconds Start time of the container since unix epoch in seconds.
# HELP container_tasks_state Number of tasks in given state
# HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero
# HELP container_threads Number of threads running inside the container
# HELP container_ulimits_soft Soft ulimit values for the container root process. Unlimited if -1, except priority and nice
# HELP machine_cpu_cores Number of logical CPU cores.
# HELP machine_cpu_physical_cores Number of physical CPU cores.
# HELP machine_cpu_sockets Number of CPU sockets.
# HELP machine_memory_bytes Amount of memory installed on the machine.
# HELP machine_nvm_avg_power_budget_watts NVM power budget.
# HELP machine_nvm_capacity NVM capacity value labeled by NVM mode (memory mode or app direct mod


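2.3 Question: does kubectl top pod include the metrics of the pause container?

Experiments show that the cAdvisor endpoint does return data for the pause container, but kubectl top pod --containers does not include pause, at least for now.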

# curl http://127.0.0.1:10255/metrics/cadvisor | grep csi-plugin | grep container_cpu_usage_seconds_total
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice",image="",name="",namespace="kube-system",pod="csi-plugin-tfrc6"} 675.44788393 1665548285204
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.5",name="9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f",namespace="kube-system",pod="csi-plugin-tfrc6"} 0.019981257 1665548276989
container_cpu_usage_seconds_total{container="csi-plugin",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.22.11-abbb810e-aliyun",name="eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0",namespace="kube-system",pod="csi-plugin-tfrc6"} 452.018439495 1665548286584
container_cpu_usage_seconds_total{container="csi-provisioner",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod3ef3dfea_a052_4ed5_8b21_647e0ac42817.slice/cri-containerd-9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82",namespace="kube-system",pod="csi-provisioner-66d47b7f64-88lzs"} 533.027518847 1665548276678
container_cpu_usage_seconds_total{container="disk-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec",namespace="kube-system",pod="csi-plugin-tfrc6"} 79.695711837 1665548287771
container_cpu_usage_seconds_total{container="nas-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.727112206 1665548288055
container_cpu_usage_seconds_total{container="oss-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.986891847 1665548274774

# curl http://127.0.0.1:10255/metrics/cadvisor | grep csi-plugin-cvjwm | grep container_memory_working_set_bytes
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice",image="",name="",namespace="kube-system",pod="csi-plugin-cvjwm"} 5.6066048e+07 1651810984383
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e.scope",image="sha256:ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459",name="2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e",namespace="kube-system",pod="csi-plugin-cvjwm"} 40960 1651810984964
container_memory_working_set_bytes{container="csi-plugin",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.316288e+07 1651810988657
container_memory_working_set_bytes{container="disk-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1063296e+07 1651810991117
container_memory_working_set_bytes{container="nas-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651810981728
container_memory_working_set_bytes{container="oss-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651810991024
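To tell the pod-level, pause, and application-container samples apart in output like this, the labels can be inspected programmatically. The following Go sketch parses one exposition line and classifies it; the classification rule (empty container label with a non-empty image label means the sandbox/pause container) is an assumption drawn only from the output shown in this section, and the names sample, parseSample, and kind are hypothetical. The regular expression does not handle escaped quotes inside label values.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sample is one parsed line of Prometheus text exposition output.
type sample struct {
	Name        string
	Labels      map[string]string
	Value       float64
	TimestampMs int64
}

// labelRe matches key="value" pairs; escaped quotes in values are not supported.
var labelRe = regexp.MustCompile(`(\w+)="([^"]*)"`)

// parseSample parses a line of the form: name{k="v",...} value [timestamp_ms]
func parseSample(line string) (sample, error) {
	open := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if open < 0 || end < open {
		return sample{}, fmt.Errorf("no label set in %q", line)
	}
	s := sample{Name: line[:open], Labels: map[string]string{}}
	for _, m := range labelRe.FindAllStringSubmatch(line[open+1:end], -1) {
		s.Labels[m[1]] = m[2]
	}
	fields := strings.Fields(line[end+1:])
	if len(fields) == 0 {
		return sample{}, fmt.Errorf("no value in %q", line)
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return sample{}, err
	}
	s.Value = v
	if len(fields) > 1 {
		ts, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return sample{}, err
		}
		s.TimestampMs = ts
	}
	return s, nil
}

// kind classifies a cAdvisor sample by its labels (rule assumed from the output above).
func kind(s sample) string {
	switch {
	case s.Labels["container"] != "":
		return "container"
	case s.Labels["image"] != "":
		return "sandbox" // the pause container
	default:
		return "pod"
	}
}

func main() {
	// Shortened ids; values taken from the output above.
	lines := []string{
		`container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/pod.slice",image="",name="",pod="csi-plugin-tfrc6"} 675.44788393 1665548285204`,
		`container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/pod.slice/pause.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.5",name="9f87",pod="csi-plugin-tfrc6"} 0.019981257 1665548276989`,
		`container_cpu_usage_seconds_total{container="csi-plugin",cpu="total",id="/kubepods.slice/pod.slice/app.scope",image="csi-plugin",name="eab2",pod="csi-plugin-tfrc6"} 452.018439495 1665548286584`,
	}
	for _, l := range lines {
		s, err := parseSample(l)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-9s pod=%s value=%v\n", kind(s), s.Labels["pod"], s.Value)
	}
}
```

Filtering on kind(s) == "container" keeps only the named containers, which matches the set that kubectl top pod --containers displayed in the experiment above.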


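Part 3: The kubelet API endpoints provide node/pod/container-level metrics

kubelet provides two endpoints whose values are computed by kubelet, because cAdvisor collects raw container-level data and does not include the aggregated sums for pods and nodes.


The endpoint requested by metrics-server depends on its version:

older version: /stats/summary?only_cpu_and_memory=true

v0.6.0+: /metrics/resource


Metrics-server v0.6.0 gets data from the kubelet endpoint /metrics/resource: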

// GetMetrics implements client.KubeletMetricsGetter
func (kc *kubeletClient) GetMetrics(ctx context.Context, node *corev1.Node) (*storage.MetricsBatch, error) {
	port := kc.defaultPort
	nodeStatusPort := int(node.Status.DaemonEndpoints.KubeletEndpoint.Port)
	if kc.useNodeStatusPort && nodeStatusPort != 0 {
		port = nodeStatusPort
	}
	addr, err := kc.addrResolver.NodeAddress(node)
	if err != nil {
		return nil, err
	}
	url := url.URL{
		Scheme: kc.scheme,
		Host:   net.JoinHostPort(addr, strconv.Itoa(port)),
		Path:   "/metrics/resource",
	}
	return kc.getMetrics(ctx, url.String(), node.Name)
}


3.1 Metric data provided by the kubelet endpoints

The /metrics/resource endpoint:

This endpoint returns the following metrics:

$ curl 127.0.0.1:10255/metrics/resource | awk -F"{" '{print $1}' | grep "#" | grep HELP
# HELP container_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the container in core-seconds
# HELP container_memory_working_set_bytes [ALPHA] Current working set of the container in bytes
# HELP container_start_time_seconds [ALPHA] Start time of the container since unix epoch in seconds
# HELP node_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the node in core-seconds
# HELP node_memory_working_set_bytes [ALPHA] Current working set of the node in bytes
# HELP pod_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the pod in core-seconds
# HELP pod_memory_working_set_bytes [ALPHA] Current working set of the pod in bytes
# HELP scrape_error [ALPHA] 1 if there was an error while getting container metrics, 0 otherwise

Pod metrics and container metrics:
# curl 127.0.0.1:10255/metrics/resource | grep csi-plugin-cvjwm
container_cpu_usage_seconds_total{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 787.211455926 1651651825597
container_cpu_usage_seconds_total{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.58508124 1651651825600
container_cpu_usage_seconds_total{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.820329754 1651651825602
container_cpu_usage_seconds_total{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.016630434 1651651825605
container_memory_working_set_bytes{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.312192e+07 1651651825597
container_memory_working_set_bytes{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1071488e+07 1651651825600
container_memory_working_set_bytes{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651651825602
container_memory_working_set_bytes{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651651825605
container_start_time_seconds{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.64639996363012e+09 1646399963630
container_start_time_seconds{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399924462264e+09 1646399924462
container_start_time_seconds{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399937591126e+09 1646399937591
container_start_time_seconds{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.6463999541537158e+09 1646399954153
pod_cpu_usage_seconds_total{namespace="kube-system",pod="csi-plugin-cvjwm"} 836.633497354 1651651825616
pod_memory_working_set_bytes{namespace="kube-system",pod="csi-plugin-cvjwm"} 5.5980032e+07 1651651825643


The /stats/summary endpoint:

The endpoint returns the following metrics, covering node, pod, and container:

# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.node'
{
  "nodeName": "ack.dedicated00009yibei",
  "systemContainers": [
    {
      "name": "pods",
      "startTime": "2022-03-04T13:17:01Z",
      "cpu": {...},
      "memory": {...}
    },
    {
      "name": "kubelet",
      "startTime": "2022-03-04T13:17:22Z",
      "cpu": {...},
      "memory": {...}
    }
  ],
  "startTime": "2022-03-04T13:11:06Z",
  "cpu": {
    "time": "2022-05-05T13:54:02Z",
    "usageNanoCores": 182358783,             =======> HPA / kubectl top
    "usageCoreNanoSeconds": 979924257151862  ====> cumulative value; usage is derived from it over a window
  },
  "memory": {
    "time": "2022-05-05T13:54:02Z",
    "availableBytes": 1296015360,
    "usageBytes": 3079581696,
    "workingSetBytes": 2592722944,           =======> kubectl top node
    "rssBytes": 1459187712,
    "pageFaults": 9776943,
    "majorPageFaults": 1782
  }
}
############################
# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.pods[1]'
{
  "podRef": {
    "name": "kube-flannel-ds-s6mrk",
    "namespace": "kube-system",
    "uid": "5b328994-c4a1-421d-9ab0-68992ca79807"
  },
  "startTime": "2022-03-04T13:18:41Z",
  "containers": [{...}],
  "cpu": {
    "time": "2022-05-05T13:53:03Z",
    "usageNanoCores": 2817176,
    "usageCoreNanoSeconds": 11613876607138
  },
  "memory": {
    "time": "2022-05-05T13:53:03Z",
    "availableBytes": 237830144,
    "usageBytes": 30605312,
    "workingSetBytes": 29876224,
    "rssBytes": 26116096,
    "pageFaults": 501002073,
    "majorPageFaults": 1716
  }
}
############################
# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.pods[1].containers[0]'
{
  "name": "kube-scheduler",
  "startTime": "2022-05-05T08:27:55Z",
  "cpu": {
    "time": "2022-05-05T13:56:16Z",
    "usageNanoCores": 1169892,
    "usageCoreNanoSeconds": 29353035680
  },
  "memory": {
    "time": "2022-05-05T13:56:16Z",
    "availableBytes": 9223372036817981000,
    "usageBytes": 36790272,
    "workingSetBytes": 32735232,
    "rssBytes": 24481792,
    "pageFaults": 5511,
    "majorPageFaults": 165
  }
}

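For scripted checks, the node section of /stats/summary can be decoded and converted into the units that kubectl top displays. The sketch below is illustrative only: the struct types cover just the fields used here, nodeUsage is a hypothetical helper, and the JSON in main is trimmed from the node output shown in this section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpuStats and memoryStats mirror only the summary fields used in this sketch.
type cpuStats struct {
	UsageNanoCores       uint64 `json:"usageNanoCores"`
	UsageCoreNanoSeconds uint64 `json:"usageCoreNanoSeconds"`
}

type memoryStats struct {
	WorkingSetBytes uint64 `json:"workingSetBytes"`
}

type summary struct {
	Node struct {
		NodeName string      `json:"nodeName"`
		CPU      cpuStats    `json:"cpu"`
		Memory   memoryStats `json:"memory"`
	} `json:"node"`
}

// nodeUsage decodes a /stats/summary document and returns the node CPU usage
// in cores and the node working set in MiB.
func nodeUsage(raw []byte) (cores, workingSetMiB float64, err error) {
	var s summary
	if err = json.Unmarshal(raw, &s); err != nil {
		return 0, 0, err
	}
	cores = float64(s.Node.CPU.UsageNanoCores) / 1e9
	workingSetMiB = float64(s.Node.Memory.WorkingSetBytes) / 1024 / 1024
	return cores, workingSetMiB, nil
}

func main() {
	// Trimmed copy of the node output shown above.
	raw := []byte(`{
	  "node": {
	    "nodeName": "ack.dedicated00009yibei",
	    "cpu": {"usageNanoCores": 182358783, "usageCoreNanoSeconds": 979924257151862},
	    "memory": {"workingSetBytes": 2592722944}
	  }
	}`)
	cores, mib, err := nodeUsage(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cpu: %.3f cores, working set: %.0f MiB\n", cores, mib)
}
```

Comparing these two numbers with the kubectl top node output is a quick way to confirm that the kubelet stage of the pipeline is returning sensible data.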

3.2 How kubelet computes CPU usage:

Key CPU metrics:

usageCoreNanoSeconds: a cumulative value, in nano core * seconds. CPU usage is derived from it over a time window (together with the total number of cores when a percentage is needed).

usageNanoCores: a computed value, derived from two cumulative usageCoreNanoSeconds samples taken over a default time period.

CPU usage usageNanoCores = (usageCoreNanoSeconds at endTime - usageCoreNanoSeconds at startTime) / (endTime - startTime)

That is, the value is the average number of CPU cores used over the period from startTime to endTime.


For example, to compute the average usage over the past ten seconds (in CPU cores, not as a percentage):

tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;


// MetricsPoint represents a set of specific metrics at some point in time.
type MetricsPoint struct {
	// StartTime is the start time of container/node. Cumulative CPU usage at that moment should be equal zero.
	StartTime time.Time
	// Timestamp is the time when metric point was measured. If CPU and Memory was measured at different time it should equal CPU time to allow accurate CPU calculation.
	Timestamp time.Time
	// CumulativeCpuUsed is the cumulative cpu used at Timestamp from the StartTime of container/node. Unit: nano core * seconds.
	CumulativeCpuUsed uint64
	// MemoryUsage is the working set size. Unit: bytes.
	MemoryUsage uint64
}

func resourceUsage(last, prev MetricsPoint) (corev1.ResourceList, api.TimeInfo, error) {
	if last.CumulativeCpuUsed < prev.CumulativeCpuUsed {
		return corev1.ResourceList{}, api.TimeInfo{}, fmt.Errorf("unexpected decrease in cumulative CPU usage value")
	}
	window := last.Timestamp.Sub(prev.Timestamp)
	cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
	return corev1.ResourceList{
			corev1.ResourceCPU:    uint64Quantity(uint64(cpuUsage), resource.DecimalSI, -9),
			corev1.ResourceMemory: uint64Quantity(last.MemoryUsage, resource.BinarySI, 0),
		}, api.TimeInfo{
			Timestamp: last.Timestamp,
			Window:    window,
		}, nil
}


3.3 Script reproducing how kubelet computes node memory (the logic used for eviction):

#!/bin/bash
#!/usr/bin/env bash

# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"


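Part 4: Metrics-Server explained

4.1 About the apiservice metrics.k8s.io

When the apiservice is deployed, it registers the Metrics API with the aggregation layer of the apiserver. After that, when the apiserver receives a request for the metrics API (/apis/metrics.k8s.io/), it forwards the request to the metrics-server backend defined there.

kubectl get apiservices v1beta1.metrics.k8s.io    the backend svc it points to is metrics-server.

If the apiservice points to the wrong backend and metrics cannot be retrieved, clean up the resources and reinstall metrics-server:

1. Delete the ApiService v1beta1.metrics.k8s.io:
kubectl delete apiservice v1beta1.metrics.k8s.io
2. Delete metrics-server:
kubectl delete deployment metrics-server -n kube-system
3. Reinstall metrics-server.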


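4.2 About the heapster svc

Heapster was used to collect metrics in older clusters. Basic CPU/Memory metrics are now handled by metrics-server, and custom metrics can be handled by Prometheus. In ACK clusters, the heapster and metrics-server svcs both point to the metrics-server pod.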



4.3 About the metrics-server pod startup parameters


The startup parameters of the official metrics-server component in an ACK cluster are:

containers:
- command:
  - /metrics-server
  - --source=kubernetes.hybrid:''
  - --sink=socket:tcp://monitor.csk.xxx.aliyuncs.com:8093?clusterId=xxx&public=true
  image: registry-vpc.cn-beijing.aliyuncs.com/acs/metrics-server:v0.3.8.5-307cf45-aliyun


Note that port 8093 defined in the sink is used by the metrics-server pod to provide metric data to Alibaba Cloud CloudMonitor.

Parameters that are not specified at startup use their default values, for example:

--metric-resolution duration         The resolution at which metrics-server will retain metrics. (default 30s)


Definition of the metrics cache sink:

https://github.com/kubernetes-sigs/metrics-server/blob/v0.3.5/pkg/provider/sink/sinkprov.go#L134

// sinkMetricsProvider is a provider.MetricsProvider that also acts as a sink.MetricSink
type sinkMetricsProvider struct {
	mu    sync.RWMutex
	nodes map[string]sources.NodeMetricsPoint
	pods  map[apitypes.NamespacedName]sources.PodMetricsPoint
}


4.4 Metric data provided by the metrics-server API

Metrics-server essentially converts the data from the kubelet endpoint:

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei | jq .
{
  "kind": "NodeMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "ack.dedicated00009yibei",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei",
    "creationTimestamp": "2022-05-05T14:28:27Z"
  },
  "timestamp": "2022-05-05T14:27:32Z",
  "window": "30s",            === window used for the cpu calculation
  "usage": {
    "cpu": "157916713n",      ==== CPU in nanocores
    "memory": "2536012Ki"     ==== cgroup.working_set
  }
}
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj | jq .
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "centos7-74cd758d98-wcwnj",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj",
    "creationTimestamp": "2022-05-05T14:32:39Z"
  },
  "timestamp": "2022-05-05T14:32:04Z",
  "window": "30s",
  "containers": [
    {
      "name": "centos7",
      "usage": {
        "cpu": "0",
        "memory": "224Ki"
      }
    }
  ]
}

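The usage values in the Metrics API are Kubernetes quantity strings such as "157916713n" and "2536012Ki". The sketch below converts the few suffixes seen in this article by hand so that it stays self-contained; cpuQuantityToCores and memQuantityToBytes are hypothetical helpers that handle only these suffixes, and in real code the apimachinery function resource.ParseQuantity would be the usual choice.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuQuantityToCores converts a CPU quantity such as "157916713n", "250m" or "2"
// to cores. Only the n (nano), u (micro) and m (milli) suffixes are handled.
func cpuQuantityToCores(q string) (float64, error) {
	scale := 1.0
	switch {
	case strings.HasSuffix(q, "n"):
		scale, q = 1e-9, strings.TrimSuffix(q, "n")
	case strings.HasSuffix(q, "u"):
		scale, q = 1e-6, strings.TrimSuffix(q, "u")
	case strings.HasSuffix(q, "m"):
		scale, q = 1e-3, strings.TrimSuffix(q, "m")
	}
	v, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, err
	}
	return v * scale, nil
}

// memQuantityToBytes converts a memory quantity such as "2536012Ki" to bytes.
// Only the binary suffixes Ki, Mi, Gi and Ti (or no suffix) are handled.
func memQuantityToBytes(q string) (uint64, error) {
	units := []struct {
		suffix string
		factor uint64
	}{
		{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
	}
	for _, u := range units {
		if strings.HasSuffix(q, u.suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(q, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.factor, nil
		}
	}
	return strconv.ParseUint(q, 10, 64)
}

func main() {
	// Values from the NodeMetrics output above.
	cores, _ := cpuQuantityToCores("157916713n")
	bytes, _ := memQuantityToBytes("2536012Ki")
	fmt.Printf("cpu: %.3f cores, memory: %d bytes (%.0f MiB)\n",
		cores, bytes, float64(bytes)/1024/1024)
}
```

Converted this way, the node above is using roughly 0.158 cores, which is the figure kubectl top node would show in millicores.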

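Part 5: How client components call the Metrics API

5.1 kubectl top

// With -v=6 you can see that the API request is sent to the metrics API:


5.2 HPA

HPA requests data from the metrics API and, once it has the CPU/Memory values, computes and displays the result:


Calculation logic:

Note that the HPA percentage is usage/request. If the request is set low, the computed percentage can be very high, which is easily misread.

DesiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
CurrentUtilization = int32((metricsTotal * 100) / requestsTotal)
requestsTotal is accumulated in a loop over pod.Spec.Containers from each container request; a container without a request makes the calculation return an error, and init containers are not included.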
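A worked example makes the effect of a low request visible. The Go sketch below implements only the two formulas above; it leaves out the tolerance, readiness handling, and min/max bounds that the real HPA controller applies, and the function names and the workload numbers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// currentUtilization mirrors int32((metricsTotal*100)/requestsTotal).
// Both arguments are in millicores, summed over the pods of the workload.
func currentUtilization(metricsTotal, requestsTotal int64) (int32, error) {
	if requestsTotal == 0 {
		return 0, fmt.Errorf("requests total is zero")
	}
	return int32((metricsTotal * 100) / requestsTotal), nil
}

// desiredReplicas mirrors ceil(currentReplicas * currentUtil / targetUtil).
// targetUtil must be greater than zero.
func desiredReplicas(currentReplicas, currentUtil, targetUtil int32) int32 {
	ratio := float64(currentUtil) / float64(targetUtil)
	return int32(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	// Hypothetical workload: 2 pods, each requesting 100m CPU and using 300m.
	util, err := currentUtilization(600, 200)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("utilization: %d%%\n", util)                          // 300%
	fmt.Printf("desired replicas: %d\n", desiredReplicas(2, util, 50)) // 12
}
```

With a 100m request, 300m of actual usage already counts as 300% utilization, so a 50% target asks for 12 replicas. Raising the request to a realistic value brings the percentage, and the scaling behavior, back to something intuitive.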


*** A related point ***

A deployment with HPA enabled must define a request for every container; init containers are exempt. The code:

func calculatePodRequests(pods []*v1.Pod, container string, resource v1.ResourceName) (map[string]int64, error) {
	requests := make(map[string]int64, len(pods))
	for _, pod := range pods {
		podSum := int64(0)
		for _, c := range pod.Spec.Containers {
			if container == "" || container == c.Name {
				if containerRequest, ok := c.Resources.Requests[resource]; ok {
					podSum += containerRequest.MilliValue()
				} else {
					return nil, fmt.Errorf("missing request for %s", resource)
				}
			}
		}
		requests[pod.Name] = podSum
	}
	return requests, nil
}

// GetResourceUtilizationRatio takes in a set of metrics, a set of matching requests,
// and a target utilization percentage, and calculates the ratio of
// desired to actual utilization (returning that, the actual utilization, and the raw average value)
func GetResourceUtilizationRatio(metrics PodMetricsInfo, requests map[string]int64, targetUtilization int32) (utilizationRatio float64, currentUtilization int32, rawAverageValue int64, err error) {
	metricsTotal := int64(0)
	requestsTotal := int64(0)
	numEntries := 0

	for podName, metric := range metrics {
		request, hasRequest := requests[podName]
		if !hasRequest {
			// we check for missing requests elsewhere, so assuming missing requests == extraneous metrics
			continue
		}

		metricsTotal += metric.Value
		requestsTotal += request
		numEntries++
	}

	// if the set of requests is completely disjoint from the set of metrics,
	// then we could have an issue where the requests total is zero
	if requestsTotal == 0 {
		return 0, 0, 0, fmt.Errorf("no metrics returned matched known pods")
	}

	currentUtilization = int32((metricsTotal * 100) / requestsTotal)

	return float64(currentUtilization) / float64(targetUtilization), currentUtilization, metricsTotal / int64(numEntries), nil
}


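Appendix

How common metrics are calculated

With the pipeline analysed, the formulas behind the common ways of obtaining metrics are summarised below. The point of tracing the mechanism is to understand how each metric is computed, so that problems can be located quickly when something looks abnormal. The formulas also answer several frequently asked questions.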


Metric data sources (columns: memory usage, memory utilization, CPU usage, CPU utilization)

Node level

Memory source: /proc/meminfo

In top/free output, "used" excludes cache, "free" excludes buffer/cache, and "available" includes buffer/cache.

Formulas (mapped to Prometheus metrics):

    • total: node_memory_MemTotal_bytes
    • available: node_memory_MemAvailable_bytes ≈ node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_SReclaimable_bytes (MemAvailable is a kernel estimate, so the sum is approximate)
    • used: node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes - node_memory_SReclaimable_bytes
    • used (alternative): node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes (excludes cache)
    • shared: node_memory_Shmem_bytes
    • free: node_memory_MemFree_bytes
    • buff/cache: node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_SReclaimable_bytes

CPU source: /proc/cpuinfo
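The node-level memory formulas above can be checked against /proc/meminfo directly. The following is a minimal sketch, assuming the usual "Key: value kB" line format. The helper names and the sample values are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo reads "Key:   value kB" lines and returns the values in kB.
// Lines that do not match this shape are skipped.
func parseMeminfo(text string) map[string]int64 {
	out := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		out[strings.TrimSuffix(fields[0], ":")] = v
	}
	return out
}

// buffCache = Buffers + Cached + SReclaimable.
func buffCache(m map[string]int64) int64 {
	return m["Buffers"] + m["Cached"] + m["SReclaimable"]
}

// usedExcludingCache = MemTotal - MemFree - buff/cache.
func usedExcludingCache(m map[string]int64) int64 {
	return m["MemTotal"] - m["MemFree"] - buffCache(m)
}

func main() {
	// Hypothetical /proc/meminfo excerpt (values in kB).
	sample := `MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
SReclaimable:     300000 kB`
	m := parseMeminfo(sample)
	fmt.Println(buffCache(m))          // 3500000
	fmt.Println(usedExcludingCache(m)) // 3500000
}
```

On a real node the input would be the contents of /proc/meminfo rather than the inline sample.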

Node-level cgroup

Memory directory: /sys/fs/cgroup/memory/

Value: memory_total_inactive_file = cat memory.stat | grep total_inactive_file

Value: memory_usage_in_bytes = cat memory.usage_in_bytes

// Also expressed as: container_memory_usage_bytes = container_memory_rss + container_memory_cache + kernel memory

Formula: memory_working_set_bytes = memory_usage_in_bytes - memory_total_inactive_file

// cache should be roughly total_active_file + total_inactive_file, so the working set still includes part of the cache (the active file pages) and is therefore larger than the used value shown by top/free

CPU directory: /sys/fs/cgroup/cpuxxx
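The working-set formula for the cgroup files above can be sketched as follows. The helper names and sample numbers are hypothetical, and the floor at zero is a defensive assumption rather than something stated in this article.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statValue returns the value for key from memory.stat-style text,
// where each line has the form "key value".
func statValue(stat, key string) (uint64, bool) {
	for _, line := range strings.Split(stat, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == key {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

// workingSet = usage_in_bytes - total_inactive_file, floored at zero
// (the floor is an assumption added here to avoid unsigned underflow).
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile > usage {
		return 0
	}
	return usage - inactiveFile
}

func main() {
	// Hypothetical memory.stat excerpt and memory.usage_in_bytes value.
	stat := "cache 500\ntotal_inactive_file 300\ntotal_active_file 200"
	inactive, ok := statValue(stat, "total_inactive_file")
	if !ok {
		return
	}
	fmt.Println(workingSet(1000, inactive)) // 700
}
```

In this sample the 200 bytes of active file cache remain inside the working set, which is why the result is larger than a cache-free "used" figure.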

Pod/container-level cgroup

Memory directory: /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXXXX.slice/cri-containerd-XXXXX.scope

CPU directory: /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXXXX.slice/cri-containerd-XXXXXX.scope

Ways of obtaining metrics (columns: memory usage, memory utilization, CPU usage, CPU utilization)

kubectl top node

// Based on metrics-server, aggregated values

Memory usage = node_memory_working_set_bytes

// Roughly matches the value in the node cgroup

// Roughly matches the metrics exposed by kubelet on port 10255

Memory percentage = node_memory_working_set_bytes / node_Allocatable_memory

CPU usage is calculated as:

cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()

CPU percentage = cpuUsage / node_Allocatable_cpu

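kubectl top pod

// Based on metrics-server, aggregated values

Memory usage = pod_memory_working_set_bytes

// Roughly matches the value in the pod cgroup

// Roughly matches the metrics exposed by kubelet on port 10255

Memory percentage: N/A

CPU usage: top pod takes usageNanoCores from the Summary API, which kubelet computes from CumulativeCpuUsed

CPU percentage: N/A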

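kubectl top pod --containers

// Based on metrics-server, aggregated values

Memory usage = container_memory_working_set_bytes

// Roughly matches the value in the container cgroup

// Roughly matches the metrics exposed by kubelet on port 10255

Memory percentage: N/A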

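HPA

// Based on metrics-server, aggregated values

Set desiredMetricValue. HPA compares it with CurrentUtilization to compute the number of replicas to scale to.

CurrentUtilization = int32((metricsUsageTotal * 100) / requestsTotal). The totals are computed by looping over the containers and summing. If any container has no request, an error is returned. Init containers are not counted.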

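kubectl describe node

// Does not call the Metrics API

node_Allocatable_memory = capacity_memory - system_reserved_memory - kubelet_reserved_memory - eviction_hard_memory

node_Capacity_memory = the instance type's total_memory - memory reserved at system boot (this value is not visible)

request: the resource requests shown by kubectl describe node also count containers that have already run to completion and exited, such as init containers.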
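The allocatable formula above, combined with the memory percentage used by kubectl top node, can be sketched with plain arithmetic. All figures are hypothetical values in MiB and the function names are illustrative only.

```go
package main

import "fmt"

// allocatable = capacity - system reserved - kubelet reserved - hard eviction threshold.
func allocatable(capacity, systemReserved, kubeletReserved, evictionHard int64) int64 {
	return capacity - systemReserved - kubeletReserved - evictionHard
}

// percentOf returns part as a percentage of whole.
func percentOf(part, whole int64) float64 {
	return float64(part) / float64(whole) * 100
}

func main() {
	// Hypothetical node, all values in MiB.
	capacity := int64(8192)
	alloc := allocatable(capacity, 256, 512, 100)
	fmt.Println(alloc) // 7324

	workingSet := int64(6000)
	fmt.Printf("%.1f%% of allocatable, %.1f%% of capacity\n",
		percentOf(workingSet, alloc), percentOf(workingSet, capacity))
}
```

The same working set yields a higher percentage against allocatable than against capacity, which is one reason kubectl top node figures look higher than those derived from total memory.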



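Closing:

This article traces the metric collection pipeline of metrics-server. It also serves as my own study notes and draws on many excellent summaries written by others. The goal is to be able to analyse and locate problems quickly when metric collection in a k8s cluster behaves abnormally. References: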

https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/cri-container-stats.md
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
https://github.com/kubernetes/kubernetes/blob/65178fec72df6275ed0aa3ede12c785ac79ab97a/pkg/controller/podautoscaler/replica_calculator.go#L424
https://github.com/kubernetes-sigs/metrics-server
https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-cpu-usage-is-calculated
cAdvisor:
https://cloud.tencent.com/developer/article/1096375
https://github.com/google/cadvisor/blob/d6b0ddb07477b17b2f3ef62b032d815b1cb6884e/machine/machine.go
https://github.com/google/cadvisor/tree/3beb265804ea4b00dc8ed9125f1f71d3328a7a94/container/libcontainer
https://www.jianshu.com/p/7c18075aa735
https://www.cnblogs.com/gaorong/p/11716907.html
Pod OOM and shmem:
https://developer.aliyun.com/article/1040230?spm=a2c6h.13262185.profile.46.994e2382PrPuO5


