本文环境基于阿里云容器集群ACK,目的在于快速解决容器指标监控中的常见问题,比如:
为何kubectl top node 看到的资源使用率远高于top看到的?
为何kubectl top node看到资源百分比>100%?
为何hpa基于cpu指标伸缩异常?
Metrics-Server指标获取链路
以下是metrics-server收集基础指标(CPU/Memory)的链路:从cgroup的数据源,到cadvisor负责数据收集,kubelet负责数据计算汇总,再到apiserver中以api方式暴露出去供客户端(HPA/kubectl top)访问。
上图中数据请求流程(逆向路径):
Step 1. kubectl top 向 APIServer 的 Metrics API 发起请求:
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx

Step 2. Aggregation 层根据 APIService "metrics.k8s.io" 的定义,把请求转发给后端 svc:metrics-server
# kubectl get apiservices v1beta1.metrics.k8s.io

Step 3. Metrics-server pod 返回最近一次采集到的指标数据。
注:阿里云云监控容器监控控制台展示的指标,基于 metrics-server 配置的 sink provider 8093 端口获取数据。
# kubectl get svc -n kube-system heapster -oyaml
此处需要注意历史遗留的 heapster svc 是否也指向后端 metrics-server pod。

Step 4. Metrics-server 定期向 kubelet 暴露的 endpoint 收集数据,转换成 k8s API 格式后通过 Metrics API 暴露出去。Metrics-server 本身不做数据采集,也不做永久存储,相当于只是把 kubelet 的数据做了一次转换。kubelet 在 cAdvisor 采集数据的基础上做了计算汇总,提供 container + pod + node 级别的 cgroup 数据。
older version: # curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true
v0.6.0+: # curl 127.0.0.1:10255/metrics/resource

Step 5. cAdvisor 定期从 cgroup 采集数据,粒度是 container cgroup 级别。cAdvisor 的 endpoint 是 /metrics/cadvisor,仅提供 container + machine 数据,不计算 pod/node 级别的指标。
# curl http://127.0.0.1:10255/metrics/cadvisor

Step 6. 在 Node/Pod 对应的 cgroup 目录查看指标文件中的数据。
# cd /sys/fs/cgroup/xxxx
通过不同方式查看指标
此处简单总结一下各种方式查看指标,后续会对每一步做详细分析。
1. 确定节点cgroup根目录
mount |grep cgroup 查看节点cgroup根目录:/sys/fs/cgroup
2. 查看pod/container cgroup指标文件目录路径
- 通过pod uid定位指标文件:
获取pod uid: kubectl get pod -n xxxx xxxxx -oyaml |grep Uid -i -B10
获取containerid: kubectl describe pod -n xxxx xxxxx |grep id -i
比如可以根据以上得到的uid进入pod中container对应的cgroup目录:
/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod-<pod-uid>.slice/cri-containerd-<container-id>.scope
- 通过container pid定位指标源文件
获取pod对应pid的cgroup文件目录
crictl pods | grep pod-name          # 可以拿到 pod-id
crictl ps | grep container-name      # 或者 crictl ps | grep pod-id,可以拿到 container-id
crictl inspect <container-id> | grep -i pid
cat /proc/${CPID}/cgroup
# 或者
cat /proc/${CPID}/mountinfo | grep cgroup
比如cgroup文件可以看到具体的pod cgroup子目录:
"cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope", "memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope",
- pod的state.json文件也可以看到pod对应的cgroup信息
# crictl pods |grep pod-name
可以获取pod-id,注意pod-id不是pod uid。
# cat /run/containerd/runc/k8s.io//state.json |jq .
"cgroup_paths": { "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope", "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope", "cpuset": "/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope", ... "memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope", ... }, "namespace_paths": { "NEWCGROUP": "/proc/8443/ns/cgroup", "NEWNET": "/proc/8443/ns/net", "NEWNS": "/proc/8443/ns/mnt", "NEWPID": "/proc/8443/ns/pid", "NEWUTS": "/proc/8443/ns/uts" ... },
3. cgroup 指标计算
进入到node/pod/container对应的cgroup目录中,查看指标对应的文件,此处不过多解读每个指标文件的含义。
//pod cgroup for cpu
# cd /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
...
cpuacct.block_latency        cpuacct.stat   cpuacct.usage_percpu_sys   cpuacct.wait_latency  cpu.cfs_period_us  cpu.stat
cpuacct.cgroup_wait_latency  cpuacct.usage  cpuacct.usage_percpu_user  cpu.bvt_warp_ns       cpu.cfs_quota_us   notify_on_release

//注意: CPU的cgroup目录其实都是指向了cpu,cpuacct
lrwxrwxrwx 1 root root 11 3月   5 05:10 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 3月   5 05:10 cpuacct -> cpu,cpuacct
dr-xr-xr-x 7 root root  0 4月  28 15:41 cpu,cpuacct
dr-xr-xr-x 3 root root  0 4月   3 17:33 cpuset

//pod cgroup for memory
# cd /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
memory.direct_swapout_global_latency  memory.kmem.tcp.failcnt             memory.min                       memory.thp_reclaim       tasks
memory.direct_swapout_memcg_latency   memory.kmem.tcp.limit_in_bytes      memory.move_charge_at_immigrate  memory.thp_reclaim_ctrl
memory.events                         memory.kmem.tcp.max_usage_in_bytes  memory.numa_stat                 memory.thp_reclaim_stat
memory.events.local                   memory.kmem.tcp.usage_in_bytes      memory.oom_control               memory.usage_in_bytes
...
- 内存计算:
公式:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file
比如:
usage_in_bytes=$(cat memory.usage_in_bytes)
total_inactive_file=$(cat memory.stat | grep total_inactive_file | awk '{print $2}')
workingset=$((usage_in_bytes - total_inactive_file))
# 转换为兆(MB)为单位
echo $((workingset / 1024 / 1024))
- CPU计算:
公式:
cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
//其中,cgroup源文件 cpuacct.usage 显示的是CPU累计值usageCoreNanoSeconds(单位 nano core * seconds),需要按照以上公式做计算才能得到CPU使用量
//对一段时间(startTime ~ endTime)内瞬时CPU Core使用量的计算公式是:
//(endTime的usageCoreNanoSeconds - startTime的usageCoreNanoSeconds) / (endTime - startTime)
比如计算过去十秒的平均使用量(不是百分比,是cpu core的使用量):
tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;
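同样的算法也可以直接套在某个pod/container的cgroup目录上,下面是一个示意脚本(POD_CG为占位路径,请替换成上文定位到的cgroup目录):

POD_CG=/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice   # 占位路径
t1=$(date +%s%N); c1=$(cat ${POD_CG}/cpuacct.usage)
sleep 10
t2=$(date +%s%N); c2=$(cat ${POD_CG}/cpuacct.usage)
# (累计值之差)/(时间差),得到该pod过去10秒平均使用的CPU core数
awk 'BEGIN{printf "%.3f core\n", ('"$((c2-c1))"')/('"$((t2-t1))"')}'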
4. 查看cadvisor指标
- 如何通过raw api获取cadvisor指标
kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor

- 在节点内部通过cAdvisor的本地接口/metrics/cadvisor获取数据
curl http://127.0.0.1:10255/metrics/cadvisor
5. 在节点内部通过kubelet接口获取指标
curl 127.0.0.1:10255/metrics/resource
curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.node'
6. 通过metrics-server API获取数据 (kubectl top或者HPA的调用方式)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx |jq .
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx |jq .
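也可以一次性列出所有节点的指标做横向对比(jq表达式仅为示意):

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq -r '.items[] | "\(.metadata.name)  cpu=\(.usage.cpu)  memory=\(.usage.memory)"'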
指标链路分五步解读
如果客户端获取指标数据失败,可以沿着指标获取链路(cgroup -> cAdvisor -> kubelet -> Metrics-Server Pod -> apiserver/Metrics API),分别通过各自暴露的接口获取数据,定位问题发生在哪一环。
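下面是一个按链路逐跳自检的示意脚本(NODE_NAME为占位变量;示例使用kubelet 10255只读端口,若集群关闭了只读端口,需改用10250并携带认证信息):

NODE_NAME=xxxx                         # 占位变量,替换成实际节点名
# 1) Metrics API层(任意可访问apiserver的机器上执行)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/${NODE_NAME} >/dev/null && echo "metrics API ok"
# 2) APIService注册状态
kubectl get apiservices v1beta1.metrics.k8s.io
# 3) kubelet计算汇总层(在该节点上执行)
curl -s 127.0.0.1:10255/metrics/resource | head -n 5
# 4) cAdvisor采集层(在该节点上执行)
curl -s http://127.0.0.1:10255/metrics/cadvisor | head -n 5
# 5) cgroup数据源
cat /sys/fs/cgroup/memory/memory.usage_in_bytes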
一 数据源:Linux cgroup 层级结构:
最外层是node cgroup => QoS级别cgroup => pod级别cgroup => container级别cgroup
其中node cgroup包含kubepods、user、system几部分。
下图可直观显示cgroup层级包含关系:
也可以使用systemd-cgls查看层级结构(以下在node的cgroup根目录 /sys/fs/cgroup 下执行):
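如果手边没有图,也可以直接列目录看这四层包含关系(以cgroup v1的cpu子系统为例,输出层级仅为示意):

find /sys/fs/cgroup/cpu/kubepods.slice -maxdepth 3 -type d | head -n 20
# 预期可以看到类似的层级:
# kubepods.slice                             -> node级别(kubepods)
# kubepods.slice/kubepods-burstable.slice    -> QoS级别
# .../kubepods-burstable-podxxxx.slice       -> pod级别
# .../cri-containerd-xxxx.scope              -> container级别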
二 cAdvisor从cgroup中收集container级别的指标数据
注意:cAdvisor也只是个指标收集者,它的数据来自于cgroup文件。
2.1 cadvisor如何计算working_set内存值
inactiveFileKeyName := "total_inactive_file"
if cgroups.IsCgroup2UnifiedMode() {
	inactiveFileKeyName = "inactive_file"
}
workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
	if workingSet < v {
		workingSet = 0
	} else {
		workingSet -= v
	}
}
ret.Memory.WorkingSet = workingSet
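上面代码里cgroup v1与v2读取的是不同的文件,手工核对时大致对应如下(示意命令,以kubepods.slice为例,具体文件名以实际内核/cgroup版本为准):

# cgroup v1:usage来自memory.usage_in_bytes,减去memory.stat中的total_inactive_file
cat /sys/fs/cgroup/memory/kubepods.slice/memory.usage_in_bytes
grep -w total_inactive_file /sys/fs/cgroup/memory/kubepods.slice/memory.stat
# cgroup v2(统一层级):usage来自memory.current,减去memory.stat中的inactive_file
cat /sys/fs/cgroup/kubepods.slice/memory.current
grep -w inactive_file /sys/fs/cgroup/kubepods.slice/memory.stat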
2.2 cAdvisor的接口/metrics/cadvisor 数据分析
kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor
或者 curl http://127.0.0.1:10255/metrics/cadvisor
注意:/metrics/cadvisor 也是prometheus的其中一个数据源。
https://github.com/google/cadvisor/blob/master/info/v1/container.go#L320
该接口返回以下指标:
# curl http://127.0.0.1:10255/metrics/cadvisor | awk -F"\{" '{print $1}' | sort | uniq | grep "#" | grep -v TYPE
1 # HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
1 # HELP container_blkio_device_usage_total Blkio Device bytes usage
1 # HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
1 # HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
1 # HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled.
1 # HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
1 # HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
1 # HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
1 # HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
1 # HELP container_file_descriptors Number of open file descriptors for the container.
1 # HELP container_fs_inodes_free Number of available Inodes
1 # HELP container_fs_inodes_total Number of Inodes
1 # HELP container_fs_io_current Number of I/Os currently in progress
1 # HELP container_fs_io_time_seconds_total Cumulative count of seconds spent doing I/Os
1 # HELP container_fs_io_time_weighted_seconds_total Cumulative weighted I/O time in seconds
1 # HELP container_fs_limit_bytes Number of bytes that can be consumed by the container on this filesystem.
1 # HELP container_fs_reads_bytes_total Cumulative count of bytes read
1 # HELP container_fs_read_seconds_total Cumulative count of seconds spent reading
1 # HELP container_fs_reads_merged_total Cumulative count of reads merged
1 # HELP container_fs_reads_total Cumulative count of reads completed
1 # HELP container_fs_sector_reads_total Cumulative count of sector reads completed
1 # HELP container_fs_sector_writes_total Cumulative count of sector writes completed
1 # HELP container_fs_usage_bytes Number of bytes that are consumed by the container on this filesystem.
1 # HELP container_fs_writes_bytes_total Cumulative count of bytes written
1 # HELP container_fs_write_seconds_total Cumulative count of seconds spent writing
1 # HELP container_fs_writes_merged_total Cumulative count of writes merged
1 # HELP container_fs_writes_total Cumulative count of writes completed
1 # HELP container_last_seen Last time a container was seen by the exporter
1 # HELP container_memory_cache Number of bytes of page cache memory.
1 # HELP container_memory_failcnt Number of memory usage hits limits
1 # HELP container_memory_failures_total Cumulative count of memory allocation failures.
1 # HELP container_memory_mapped_file Size of memory mapped files in bytes.
1 # HELP container_memory_max_usage_bytes Maximum memory usage recorded in bytes
1 # HELP container_memory_rss Size of RSS in bytes.
1 # HELP container_memory_swap Container swap usage in bytes.
1 # HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
1 # HELP container_memory_working_set_bytes Current working set in bytes.
1 # HELP container_network_receive_bytes_total Cumulative count of bytes received
1 # HELP container_network_receive_errors_total Cumulative count of errors encountered while receiving
1 # HELP container_network_receive_packets_dropped_total Cumulative count of packets dropped while receiving
1 # HELP container_network_receive_packets_total Cumulative count of packets received
1 # HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
1 # HELP container_network_transmit_errors_total Cumulative count of errors encountered while transmitting
1 # HELP container_network_transmit_packets_dropped_total Cumulative count of packets dropped while transmitting
1 # HELP container_network_transmit_packets_total Cumulative count of packets transmitted
1 # HELP container_processes Number of processes running inside the container.
1 # HELP container_scrape_error 1 if there was an error while getting container metrics, 0 otherwise
1 # HELP container_sockets Number of open sockets for the container.
1 # HELP container_spec_cpu_period CPU period of the container.
1 # HELP container_spec_cpu_quota CPU quota of the container.
1 # HELP container_spec_cpu_shares CPU share of the container.
1 # HELP container_spec_memory_limit_bytes Memory limit for the container.
1 # HELP container_spec_memory_reservation_limit_bytes Memory reservation limit for the container.
1 # HELP container_spec_memory_swap_limit_bytes Memory swap limit for the container.
1 # HELP container_start_time_seconds Start time of the container since unix epoch in seconds.
1 # HELP container_tasks_state Number of tasks in given state
1 # HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero
1 # HELP container_threads Number of threads running inside the container
1 # HELP container_ulimits_soft Soft ulimit values for the container root process. Unlimited if -1, except priority and nice
1 # HELP machine_cpu_cores Number of logical CPU cores.
1 # HELP machine_cpu_physical_cores Number of physical CPU cores.
1 # HELP machine_cpu_sockets Number of CPU sockets.
1 # HELP machine_memory_bytes Amount of memory installed on the machine.
1 # HELP machine_nvm_avg_power_budget_watts NVM power budget.
1 # HELP machine_nvm_capacity NVM capacity value labeled by NVM mode (memory mode or app direct mode)
2.3 问题:kubectl top pod包含pause容器的指标么?
实验发现cAdvisor接口其实也返回了pause container的数据,但是kubectl top pod --containers的结果中不包含pause,至少目前版本是不包含的。
# curl http://127.0.0.1:10255/metrics/cadvisor | grep csi-plugin | grep container_cpu_usage_seconds_total
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice",image="",name="",namespace="kube-system",pod="csi-plugin-tfrc6"} 675.44788393 1665548285204
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.5",name="9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f",namespace="kube-system",pod="csi-plugin-tfrc6"} 0.019981257 1665548276989
container_cpu_usage_seconds_total{container="csi-plugin",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.22.11-abbb810e-aliyun",name="eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0",namespace="kube-system",pod="csi-plugin-tfrc6"} 452.018439495 1665548286584
container_cpu_usage_seconds_total{container="csi-provisioner",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod3ef3dfea_a052_4ed5_8b21_647e0ac42817.slice/cri-containerd-9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82",namespace="kube-system",pod="csi-provisioner-66d47b7f64-88lzs"} 533.027518847 1665548276678
container_cpu_usage_seconds_total{container="disk-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec",namespace="kube-system",pod="csi-plugin-tfrc6"} 79.695711837 1665548287771
container_cpu_usage_seconds_total{container="nas-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.727112206 1665548288055
container_cpu_usage_seconds_total{container="oss-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.986891847 1665548274774

# curl http://127.0.0.1:10255/metrics/cadvisor | grep csi-plugin-cvjwm | grep container_memory_working_set_bytes
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice",image="",name="",namespace="kube-system",pod="csi-plugin-cvjwm"} 5.6066048e+07 1651810984383
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e.scope",image="sha256:ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459",name="2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e",namespace="kube-system",pod="csi-plugin-cvjwm"} 40960 1651810984964
container_memory_working_set_bytes{container="csi-plugin",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.316288e+07 1651810988657
container_memory_working_set_bytes{container="disk-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1063296e+07 1651810991117
container_memory_working_set_bytes{container="nas-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651810981728
container_memory_working_set_bytes{container="oss-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651810991024
三 Kubelet的API endpoint 提供Node/Pod/Container级别指标
kubelet提供两个endpoint,返回的是kubelet计算汇总后的数据。因为cAdvisor采集的只是container级别的原始数据,不包含pod以及node级别的加和值。
不同版本的metrics-server请求的endpoint如下:
older version: /stats/summary?only_cpu_and_memory=true
v0.6.0+: /metrics/resource
Metrics-server v0.6.0从kubelet的/metrics/resource接口获取数据:
// GetMetrics implements client.KubeletMetricsGetter
func (kc *kubeletClient) GetMetrics(ctx context.Context, node *corev1.Node) (*storage.MetricsBatch, error) {
	port := kc.defaultPort
	nodeStatusPort := int(node.Status.DaemonEndpoints.KubeletEndpoint.Port)
	if kc.useNodeStatusPort && nodeStatusPort != 0 {
		port = nodeStatusPort
	}
	addr, err := kc.addrResolver.NodeAddress(node)
	if err != nil {
		return nil, err
	}
	url := url.URL{
		Scheme: kc.scheme,
		Host:   net.JoinHostPort(addr, strconv.Itoa(port)),
		Path:   "/metrics/resource",
	}
	return kc.getMetrics(ctx, url.String(), node.Name)
}
3.1 Kubelet接口提供的指标数据
/metrics/resource 接口:
该接口可以提供以下指标返回:
$ curl 127.0.0.1:10255/metrics/resource | awk -F"{" '{print $1}' | grep "#" | grep HELP
# HELP container_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the container in core-seconds
# HELP container_memory_working_set_bytes [ALPHA] Current working set of the container in bytes
# HELP container_start_time_seconds [ALPHA] Start time of the container since unix epoch in seconds
# HELP node_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the node in core-seconds
# HELP node_memory_working_set_bytes [ALPHA] Current working set of the node in bytes
# HELP pod_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the pod in core-seconds
# HELP pod_memory_working_set_bytes [ALPHA] Current working set of the pod in bytes
# HELP scrape_error [ALPHA] 1 if there was an error while getting container metrics, 0 otherwise

比如,查询特定pod的metrics,也会包含container metrics:

# curl 127.0.0.1:10255/metrics/resource | grep csi-plugin-cvjwm
container_cpu_usage_seconds_total{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 787.211455926 1651651825597
container_cpu_usage_seconds_total{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.58508124 1651651825600
container_cpu_usage_seconds_total{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.820329754 1651651825602
container_cpu_usage_seconds_total{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.016630434 1651651825605
container_memory_working_set_bytes{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.312192e+07 1651651825597
container_memory_working_set_bytes{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1071488e+07 1651651825600
container_memory_working_set_bytes{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651651825602
container_memory_working_set_bytes{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651651825605
container_start_time_seconds{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.64639996363012e+09 1646399963630
container_start_time_seconds{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399924462264e+09 1646399924462
container_start_time_seconds{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399937591126e+09 1646399937591
container_start_time_seconds{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.6463999541537158e+09 1646399954153
pod_cpu_usage_seconds_total{namespace="kube-system",pod="csi-plugin-cvjwm"} 836.633497354 1651651825616
pod_memory_working_set_bytes{namespace="kube-system",pod="csi-plugin-cvjwm"} 5.5980032e+07 1651651825643
/stats/summary 接口:
接口返回指标如下,包含node/pod/container:
# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.node'
{
  "nodeName": "ack.dedicated00009yibei",
  "systemContainers": [
    {
      "name": "pods",
      "startTime": "2022-03-04T13:17:01Z",
      "cpu": {...},
      "memory": {...}
    },
    {
      "name": "kubelet",
      "startTime": "2022-03-04T13:17:22Z",
      "cpu": {...},
      "memory": {...}
    }
  ],
  "startTime": "2022-03-04T13:11:06Z",
  "cpu": {
    "time": "2022-05-05T13:54:02Z",
    "usageNanoCores": 182358783,              =======> HPA/kubectl top的取值来源,也是基于下面累计值的计算结果
    "usageCoreNanoSeconds": 979924257151862   =======> 累计值,使用时间窗口window做后续计算
  },
  "memory": {
    "time": "2022-05-05T13:54:02Z",
    "availableBytes": 1296015360,
    "usageBytes": 3079581696,
    "workingSetBytes": 2592722944,            =======> 与kubectl top node看到的值大体一致
    "rssBytes": 1459187712,
    "pageFaults": 9776943,
    "majorPageFaults": 1782
  }
}

############################

# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.pods[1]'
{
  "podRef": {
    "name": "kube-flannel-ds-s6mrk",
    "namespace": "kube-system",
    "uid": "5b328994-c4a1-421d-9ab0-68992ca79807"
  },
  "startTime": "2022-03-04T13:18:41Z",
  "containers": [{...}],
  "cpu": {
    "time": "2022-05-05T13:53:03Z",
    "usageNanoCores": 2817176,
    "usageCoreNanoSeconds": 11613876607138
  },
  "memory": {
    "time": "2022-05-05T13:53:03Z",
    "availableBytes": 237830144,
    "usageBytes": 30605312,
    "workingSetBytes": 29876224,
    "rssBytes": 26116096,
    "pageFaults": 501002073,
    "majorPageFaults": 1716
  }
}

############################

# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true | jq '.pods[1].containers[0]'
{
  "name": "kube-scheduler",
  "startTime": "2022-05-05T08:27:55Z",
  "cpu": {
    "time": "2022-05-05T13:56:16Z",
    "usageNanoCores": 1169892,
    "usageCoreNanoSeconds": 29353035680
  },
  "memory": {
    "time": "2022-05-05T13:56:16Z",
    "availableBytes": 9223372036817981000,
    "usageBytes": 36790272,
    "workingSetBytes": 32735232,
    "rssBytes": 24481792,
    "pageFaults": 5511,
    "majorPageFaults": 165
  }
}
3.2 Kubelet如何计算CPU使用率:
CPU重点指标解析:
usageCoreNanoSeconds: //累计值,单位是nano core * seconds . 根据时间窗口window跟总核数计算出cpu usage.
usageNanoCores://计算值,利用某个默认时间段的两个累计值usageCoreNanoSeconds做计算的结果.
CPU使用量 usageNanoCores = (endTime的usageCoreNanoSeconds - startTime的usageCoreNanoSeconds) / (endTime - startTime)
即对一段时间(startTime ~ endTime)内的瞬时CPU Core使用量,用两个时间点的累计值之差除以时间差。
比如计算过去十秒的平均使用量(不是百分比,是cpu core的使用量):
tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;
metrics-server中对应的计算逻辑如下(CumulativeCpuUsed即上面的usageCoreNanoSeconds累计值):

// MetricsPoint represents a set of specific metrics at some point in time.
type MetricsPoint struct {
	// StartTime is the start time of container/node. Cumulative CPU usage at that moment should be equal zero.
	StartTime time.Time
	// Timestamp is the time when metric point was measured. If CPU and Memory was measured at different time it should equal CPU time to allow accurate CPU calculation.
	Timestamp time.Time
	// CumulativeCpuUsed is the cumulative cpu used at Timestamp from the StartTime of container/node. Unit: nano core * seconds.
	CumulativeCpuUsed uint64
	// MemoryUsage is the working set size. Unit: bytes.
	MemoryUsage uint64
}

func resourceUsage(last, prev MetricsPoint) (corev1.ResourceList, api.TimeInfo, error) {
	if last.CumulativeCpuUsed < prev.CumulativeCpuUsed {
		return corev1.ResourceList{}, api.TimeInfo{}, fmt.Errorf("unexpected decrease in cumulative CPU usage value")
	}
	window := last.Timestamp.Sub(prev.Timestamp)
	cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
	return corev1.ResourceList{
		corev1.ResourceCPU:    uint64Quantity(uint64(cpuUsage), resource.DecimalSI, -9),
		corev1.ResourceMemory: uint64Quantity(last.MemoryUsage, resource.BinarySI, 0),
	}, api.TimeInfo{
		Timestamp: last.Timestamp,
		Window:    window,
	}, nil
}
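kubelet返回的usageNanoCores也可以用同样的思路自己验证:对/stats/summary相隔一段时间取两次usageCoreNanoSeconds再求差(示意脚本,假设10255只读端口可用):

c1=$(curl -s '127.0.0.1:10255/stats/summary?only_cpu_and_memory=true' | jq '.node.cpu.usageCoreNanoSeconds')
sleep 10
c2=$(curl -s '127.0.0.1:10255/stats/summary?only_cpu_and_memory=true' | jq '.node.cpu.usageCoreNanoSeconds')
# 差值/时间窗口(秒) ≈ node的usageNanoCores(单位nano core)
echo $(( (c2 - c1) / 10 ))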
3.3 Kubelet计算节点内存逻辑脚本(计算驱逐时的逻辑):
#!/usr/bin/env bash
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
四 Metrics-Server解读
4.1 关于apiservice metrics.k8s.io
部署apiservice的时候,会向apiserver的aggregation layer注册Metrics API。这样apiserver收到特定的metrics API(/apis/metrics.k8s.io/)请求后,会转发给后端定义的metrics-server处理。
通过 kubectl get apiservices v1beta1.metrics.k8s.io 可以看到其后端指向的svc是metrics-server。
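排查时可以先确认该APIService的可用状态,以及后端svc/endpoint是否正常(示意命令):

kubectl get apiservices v1beta1.metrics.k8s.io                                     # Available列应为True
kubectl get apiservices v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions}'
kubectl get svc -n kube-system metrics-server
kubectl get ep -n kube-system metrics-server                                       # 确认svc有对应的endpoint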
如果此处apiservice指向有问题导致指标获取异常,可清理资源重装metrics-server:
1,删除 v1beta1.metrics.k8s.io ApiService:kubectl delete apiservice v1beta1.metrics.k8s.io
2,卸载 metrics-server:kubectl delete deployment metrics-server -n kube-system
3,重新安装 metrics-server 组件
4.2 关于heapster svc
Heapster是老版本集群中用来收集指标的组件;后来基础的CPU/Memory指标由metrics-server负责,自定义指标可以由prometheus负责。ACK集群中heapster跟metrics-server两个svc共同指向metrics-server pod。
4.3 关于 metrics-server pod 启动参数
ACK集群中官方metrics-server组件的启动参数如下:
containers:
- command:
- /metrics-server
- --source=kubernetes.hybrid:''
- --sink=socket:tcp://monitor.csk.xxx.aliyuncs.com:8093?clusterId=xxx&public=true
image: registry-vpc.cn-beijing.aliyuncs.com/acs/metrics-server:v0.3.8.5-307cf45-aliyun
注意,此处sink中定义的8093端口,用于metrics-server pod向阿里云云监控提供指标数据。
另外,启动参数中没指定的,采用默认值,比如:
--metric-resolution duration The resolution at which metrics-server will retain metrics. (default 30s)
指标缓存sink的定义:
https://github.com/kubernetes-sigs/metrics-server/blob/v0.3.5/pkg/provider/sink/sinkprov.go#L134
// sinkMetricsProvider is a provider.MetricsProvider that also acts as a sink.MetricSink
type sinkMetricsProvider struct {
	mu    sync.RWMutex
	nodes map[string]sources.NodeMetricsPoint
	pods  map[apitypes.NamespacedName]sources.PodMetricsPoint
}
4.4 Metrics-server API提供的指标数据
Metrics-server相当于对kubelet endpoint的数据做了一次转换:
# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei | jq .
{
  "kind": "NodeMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "ack.dedicated00009yibei",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei",
    "creationTimestamp": "2022-05-05T14:28:27Z"
  },
  "timestamp": "2022-05-05T14:27:32Z",
  "window": "30s",                 ===> 计算CPU累计值差值的时间窗口
  "usage": {
    "cpu": "157916713n",           ===> nanocore,基于CPU累计值的计算结果值
    "memory": "2536012Ki"          ===> cgroup的working_set
  }
}

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj | jq .
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "centos7-74cd758d98-wcwnj",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj",
    "creationTimestamp": "2022-05-05T14:32:39Z"
  },
  "timestamp": "2022-05-05T14:32:04Z",
  "window": "30s",
  "containers": [
    {
      "name": "centos7",
      "usage": {
        "cpu": "0",
        "memory": "224Ki"
      }
    }
  ]
}
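顺带看一下单位换算:上面"cpu"里的157916713n是nanocore(1 core = 10^9 nanocore),除以节点Allocatable CPU就是kubectl top node显示的百分比(示意命令,节点名为示例值):

NODE=ack.dedicated00009yibei                       # 示例节点名
usage_n=$(kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/${NODE} | jq -r '.usage.cpu' | tr -d 'n')
awk "BEGIN{printf \"cpu usage = %.3f core\n\", ${usage_n}/1e9}"
# kubectl top node百分比的分母:节点Allocatable CPU
kubectl get node ${NODE} -o jsonpath='{.status.allocatable.cpu}{"\n"}'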
五 客户端组件对Metrics API的调用
5.1 kubectl top
kubectl top 加上 -v=6 参数,可以看到API请求是发送给了metrics API:
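示意命令(详细日志输出到stderr,具体格式随kubectl版本略有不同):

kubectl top node -v=6 2>&1 | grep metrics.k8s.io
# 预期能看到对 /apis/metrics.k8s.io/v1beta1/nodes 的GET请求记录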
5.2 HPA
向metrics API请求数据拿到CPU/Memory值后做计算与展示:
计算逻辑:
需要注意,HPA的百分比是usage/request;因此若request设置得比较低,计算出来的使用率百分比就会偏高,容易引发误解。
DesiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
CurrentUtilization = int32((metricsTotal * 100) / requestsTotal)
其中,计算requestsTotal时,是将pod.Spec.Containers loop相加,一旦某一个container没有request就return报错,不累计init container。
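用一组假设的数字走一遍这个公式(示意计算,非真实集群数据):

currentReplicas=3; target=60            # 当前3个副本,目标利用率60%
metricsTotal=1200; requestsTotal=1500   # 3个pod的CPU使用量之和1200m,request之和1500m
currentUtilization=$(( metricsTotal * 100 / requestsTotal ))                 # = 80
# desiredReplicas = ceil(currentReplicas * currentUtilization / target)
echo $(( (currentReplicas * currentUtilization + target - 1) / target ))     # = 4,即扩容到4个副本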
***衍生出一个知识点***
开启HPA的deployment必须给每个container定义request值,不过init container不做要求。代码:
func calculatePodRequests(pods []*v1.Pod, container string, resource v1.ResourceName) (map[string]int64, error) {
	requests := make(map[string]int64, len(pods))
	for _, pod := range pods {
		podSum := int64(0)
		for _, c := range pod.Spec.Containers {
			if container == "" || container == c.Name {
				if containerRequest, ok := c.Resources.Requests[resource]; ok {
					podSum += containerRequest.MilliValue()
				} else {
					return nil, fmt.Errorf("missing request for %s", resource)
				}
			}
		}
		requests[pod.Name] = podSum
	}
	return requests, nil
}

// GetResourceUtilizationRatio takes in a set of metrics, a set of matching requests,
// and a target utilization percentage, and calculates the ratio of
// desired to actual utilization (returning that, the actual utilization, and the raw average value)
func GetResourceUtilizationRatio(metrics PodMetricsInfo, requests map[string]int64, targetUtilization int32) (utilizationRatio float64, currentUtilization int32, rawAverageValue int64, err error) {
	metricsTotal := int64(0)
	requestsTotal := int64(0)
	numEntries := 0

	for podName, metric := range metrics {
		request, hasRequest := requests[podName]
		if !hasRequest {
			// we check for missing requests elsewhere, so assuming missing requests == extraneous metrics
			continue
		}
		metricsTotal += metric.Value
		requestsTotal += request
		numEntries++
	}

	// if the set of requests is completely disjoint from the set of metrics,
	// then we could have an issue where the requests total is zero
	if requestsTotal == 0 {
		return 0, 0, 0, fmt.Errorf("no metrics returned matched known pods")
	}

	currentUtilization = int32((metricsTotal * 100) / requestsTotal)

	return float64(currentUtilization) / float64(targetUtilization), currentUtilization, metricsTotal / int64(numEntries), nil
}
附录
常见指标统计方式的计算
在分析链路之后,将一些常见指标获取方式的计算公式总结如下,毕竟梳理原理也是为了理解每种指标的计算方式,便于出现问题异常时快速定位。从公式总结中可以回答一些常见的疑问。
指标数据源(分别看内存使用量/使用率与CPU使用量/使用率):

- Node 节点级别
  - 内存:数据源为 /proc/meminfo。top/free 输出的 mem used 不包含缓存;free 不包含 buffer/cache,available 包含 buffer/cache。公式(对标prometheus指标):
  - CPU:数据源为 /proc/cpuinfo。
- Node 级别 cgroup
  - 内存:目录 /sys/fs/cgroup/memory/。取值:memory_total_inactive_file = cat memory.stat | grep total_inactive_file;memory_usage_in_bytes = cat memory.usage_in_bytes(也有这种表达:container_memory_usage_bytes = container_memory_rss + container_memory_cache + kernel memory)。公式:memory_working_set_bytes = memory_usage_in_bytes - memory_total_inactive_file。cache 应该是 total_active_file + total_inactive_file 等,所以 working_set 是包含部分缓存的,因此会比 top/free 看到的使用值大。
  - CPU:目录 /sys/fs/cgroup/cpuxxx。
- Pod/Container 级别 cgroup
  - 内存:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXXXX.slice/cri-containerd-XXXXX.scope
  - CPU:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXXXX.slice/cri-containerd-XXXXXX.scope
指标获取方式(分别看内存使用量/使用率与CPU使用量/使用率):

- kubectl top node(基于metrics-server,聚合值)
  - 内存使用量 = node_memory_working_set_bytes(跟node cgroup中数值大体一致,也跟kubelet 10255暴露的接口指标大体一致)
  - 内存使用率 = node_memory_working_set_bytes / node_Allocatable_memory
  - CPU使用量:cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
  - CPU使用率 = cpuUsage / node_Allocatable_cpu
- kubectl top pod(基于metrics-server,聚合值)
  - 内存使用量 = pod_memory_working_set_bytes(跟pod cgroup中数值大体一致,也跟kubelet 10255暴露的接口指标大体一致)
  - 内存使用率:N/A
  - CPU使用量:取自summary API的usageNanoCores,是kubelet依据CumulativeCpuUsed计算后的结果
  - CPU使用率:N/A
- kubectl top pod --containers(基于metrics-server,聚合值)
  - 内存使用量 = container_memory_working_set_bytes(跟container cgroup中数值大体一致,也跟kubelet 10255暴露的接口指标大体一致)
  - 使用率:N/A
- HPA(基于metrics-server,聚合值)
  - 设置desiredMetricValue,跟CurrentUtilization对比计算扩缩数量:CurrentUtilization = int32((metricsUsageTotal*100) / requestsTotal)。其中,计算requestsTotal时,是将container的request loop相加,一旦某一个container没有request就报错,不累计init container。
- kubectl describe node(不调用Metrics API)
  - node_Allocatable_memory = capacity_memory - system_reserved_memory - kubelet_reserved_memory - eviction_hard_memory
  - node_Capacity_memory = 机型total_memory - 系统启动预留(该值不可见)
  - request:kubectl describe node里的resource request,把init-container等已经执行完毕退出的container也计算在内
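可以用下面的命令直接核对这些分母与request汇总(示意命令,<node-name>为占位):

kubectl describe node <node-name> | grep -A 6 "Allocatable"
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
kubectl get node <node-name> -o jsonpath='{.status.capacity.memory} {.status.allocatable.memory}{"\n"}'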
结尾:
本篇文章主要针对metrics-server的指标获取链路做了梳理,也是自己的学习笔记,期间也参考了诸多前辈的精华总结,旨在掌握后可以在k8s集群指标获取异常时快速分析定位。参考:
- https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/
- https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/cri-container-stats.md
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- https://github.com/kubernetes/kubernetes/blob/65178fec72df6275ed0aa3ede12c785ac79ab97a/pkg/controller/podautoscaler/replica_calculator.go#L424
- https://github.com/kubernetes-sigs/metrics-server
- https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-cpu-usage-is-calculated
- cAdvisor源码分析:https://cloud.tencent.com/developer/article/1096375
- https://github.com/google/cadvisor/blob/d6b0ddb07477b17b2f3ef62b032d815b1cb6884e/machine/machine.go
- https://github.com/google/cadvisor/tree/3beb265804ea4b00dc8ed9125f1f71d3328a7a94/container/libcontainer
- https://www.jianshu.com/p/7c18075aa735
- https://www.cnblogs.com/gaorong/p/11716907.html
- 《记一次pod oom的异常shmem输出》:https://developer.aliyun.com/article/1040230?spm=a2c6h.13262185.profile.46.994e2382PrPuO5