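Translator's Preface
This article is a translation of the 2021 Linux 5.10 kernel document Control Group v2, the authoritative documentation on the design, interface and conventions of cgroup v2 as seen from userspace.
The original is comprehensive and detailed. Only the parts of current interest were translated; the remaining parts are kept in the original English. Because a technical specification tends to be abstract, the translation also adds some system test output, kernel code snippets and links to make it easier to follow.
The translator's ability is limited, so omissions and mistakes are likely. When in doubt, consult the original.
The translation follows.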
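- 1 Introduction
- 2 Basic Operations
- 3 Resource Distribution Models
- 4 Interface Files
- 5 Controllers
- 6 Namespace
- 7 Information on Kernel Programming
- 8 Deprecated v1 Core Features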
- 9 Issues with v1 and Rationales for v2
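This is the authoritative documentation on the design, interface and conventions of cgroup v2. All future changes must be reflected in this document. Documentation for v1 is available under cgroup-v1.
It describes all userland-visible aspects of cgroup v2, including the cgroup core and the individual controllers.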
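1 Introduction
1.1 Terminology
“cgroup” stands for “control group” and is never capitalized.
- The singular form (cgroup) designates the whole feature and is also used as a qualifier, as in “cgroup controllers”.
- The plural form (cgroups) explicitly refers to multiple individual control groups.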
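1.2 What is cgroup?
cgroup is a mechanism to organize processes hierarchically and to distribute system resources along the hierarchy in a controlled and configurable manner.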
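1.2.1 Components of cgroup
cgroup is largely composed of two parts:
- The core: primarily responsible for hierarchically organizing processes.
- Controllers: most controllers are responsible for distributing a specific type of system resource along the hierarchy; a few utility controllers serve other purposes.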
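1.2.2 How cgroups relate to processes and threads
cgroups form a tree structure:
- Every process in the system belongs to one and only one cgroup;
- All threads of a process belong to the same cgroup;
- On creation, a child process is put in the cgroup its parent belongs to at that time;
- A process can be migrated to another cgroup;
- Migrating a process does not move its existing child (descendant) processes along with it.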
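1.2.3 Controllers
- Following certain structural constraints, controllers may be enabled or disabled selectively on a cgroup.
- All controller behaviors are hierarchical:
  - If a controller is enabled on a cgroup, it affects all processes in that cgroup and in its sub-hierarchy.
  - Resource restrictions set closer to the root of the hierarchy cannot be overridden by settings further down.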
2 Basic Operations
2.1 Mounting
Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2 hierarchy (which is equivalent to a filesystem) can be mounted with the following command:
# mount -t <fstype> <device> <dir>
mount -t cgroup2 none $MOUNT_POINT
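2.1.1 Controller binding between v1 and v2
The cgroup2 filesystem has the magic number 0x63677270 (“cgrp”).
- All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root.
- Controllers which are not in active use in the v2 hierarchy can be bound to other hierarchies.
This means that v2 and v1 can be mixed in a fully backward compatible way.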
Example: Ubuntu 20.04 mounting cgroup v1 and cgroup v2 side by side (translator's note)
List the cgroup-related mount points on Ubuntu 20.04 (5.11 kernel):
~$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,misc)
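As shown, the system mounts both cgroup and cgroup2:
- cgroup v2 has a single hierarchy and therefore a single mount point (the second line of the output), /sys/fs/cgroup/unified. This is the root level mentioned in the previous subsection.
- cgroup v1 is mounted at separate locations, one per controller type (cpuset, cpu,cpuacct, hugetlb, ...).
Next, check which controllers are bound to cgroup v2: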
$ ls -ahlp /sys/fs/cgroup/unified/
total 0
-r--r--r-- 1 root root 0 cgroup.controllers
-rw-r--r-- 1 root root 0 cgroup.max.depth
-rw-r--r-- 1 root root 0 cgroup.max.descendants
-rw-r--r-- 1 root root 0 cgroup.procs
-r--r--r-- 1 root root 0 cgroup.stat
-rw-r--r-- 1 root root 0 cgroup.subtree_control
-rw-r--r-- 1 root root 0 cgroup.threads
-rw-r--r-- 1 root root 0 cpu.pressure
-r--r--r-- 1 root root 0 cpu.stat
drwxr-xr-x 2 root root 0 init.scope/
-rw-r--r-- 1 root root 0 io.pressure
-rw-r--r-- 1 root root 0 memory.pressure
drwxr-xr-x 121 root root 0 system.slice/
drwxr-xr-x 3 root root 0 user.slice/
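Only a handful of cpu/io/memory interface files appear here (most controllers are still bound to cgroup v1, which this system uses by default).
Finally, look at the contents of a few of these files to get a feel for them; later sections explain what each one means: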
$ cd /sys/fs/cgroup/unified
$ cat cpu.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=2501067303
$ cat cpu.stat
usage_usec 44110960000
user_usec 29991256000
system_usec 14119704000
$ cat io.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=299044042
full avg10=0.00 avg60=0.00 avg300=0.00 total=271257559
$ cat memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=298215
full avg10=0.00 avg60=0.00 avg300=0.00 total=229843
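The check above can also be scripted. The following is a minimal sketch, not part of the original document: the function name list_cgroup2_mounts is hypothetical, and it assumes input in the /proc/mounts layout (device, mount point, filesystem type, options). It prints the mount point and options of every cgroup2 filesystem it finds.

```shell
# Hypothetical helper: list cgroup v2 mount points and their mount options.
# Reads mount-table text in /proc/mounts format from stdin:
#   <device> <mountpoint> <fstype> <options> <dump> <pass>
list_cgroup2_mounts() {
    # Field 3 is the filesystem type; cgroup v2 is registered as "cgroup2".
    awk '$3 == "cgroup2" { print $2, $4 }'
}

# Typical use on a live system:
#   list_cgroup2_mounts < /proc/mounts
```

On the hybrid setup shown above this would report /sys/fs/cgroup/unified together with its options, which is also a quick way to see whether nsdelegate is active. With GNU coreutils, stat -fc %T on a mount point reports cgroup2fs for a v2 filesystem, which is another way to make the same check.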
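2.1.2 Moving controllers between v1 and v2
- A controller can be moved to another hierarchy only after it is no longer referenced in its current hierarchy.
- Because per-cgroup controller states are destroyed asynchronously, references may linger for a while after the controller is unmounted from v1, so it may not show up in the v2 hierarchy immediately.
- Similarly, a controller must be fully disabled before it can be moved out of the v2 hierarchy, and it may take some time before it becomes available to v1 hierarchies.
- In addition, because of inter-controller dependencies, other controllers may need to be disabled too.
Dynamically moving controllers between v2 and v1 is useful for development and manual configuration, but is strongly discouraged in production. It is recommended to decide the hierarchies and controller associations before starting to use the controllers after system boot, and not to change them afterwards.
During the transition to v2, system management software might still automount the v1 cgroup filesystem and so hijack all controllers during boot, before manual intervention is possible. To make testing easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and making them always available in v2; cgroup_no_v1=all disables all v1 controllers (forcing the use of v2).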
2.1.3 cgroup v2 mount options
The mount command above did not specify any special options. The following mount options are currently supported:
- nsdelegate
Consider cgroup namespaces as delegation boundaries. This option is system wide and can only be set on mount or modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts.
- memory_localevents
Only populate memory.events with data for the current cgroup, and not any subtrees. This is legacy behaviour, the default behaviour without this option is to include subtree counts. This option is system wide and can only be set on mount or modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts.
- memory_recursiveprot
Recursively apply memory.min and memory.low protection to entire subtrees, without requiring explicit downward propagation into leaf cgroups. This allows protecting entire subtrees from one another, while retaining free competition within those subtrees. This should have been the default behavior but is a mount-option to avoid regressing setups relying on the original semantics (e.g. specifying bogusly high ‘bypass’ protection values at higher tree levels).
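2.2 Organizing Processes and Threads
2.2.1 Processes
Initially, only the root cgroup exists and all processes belong to it.
- Creating a child cgroup only requires creating a sub-directory:
# mkdir $CGROUP_NAME
- A cgroup can have multiple child cgroups, forming a tree structure.
- Each cgroup has a read-writable interface file, “cgroup.procs”.
  - Reading it lists the PIDs of all processes which belong to the cgroup.
  - The PIDs are not ordered.
  - The same PID may show up more than once, for example if the process was moved to another cgroup and then back, or if the PID was recycled while the file was being read.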
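- Migrating a process: write its PID to the target cgroup's “cgroup.procs” file.
  - Only one process can be migrated per write(2) call;
  - If a process has multiple threads, writing the PID of any of its threads migrates all threads of the process;
  - When a process forks a child, the new child is born into the cgroup that the forking process belongs to at the time of the fork;
  - After exit, a process stays associated with the cgroup it belonged to at the time of exit until it is reaped;
  - A zombie process does not appear in “cgroup.procs” and cannot be moved to another cgroup.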
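- Destroying a cgroup: remove the corresponding directory:
# rmdir $CGROUP_NAME
  - A cgroup which has no child cgroups and no live processes can be destroyed by removing its directory;
  - A cgroup which has no child cgroups and is associated only with zombie processes is considered empty and can also be removed.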
- Inspecting cgroup membership: “cat /proc/$PID/cgroup” lists the cgroups the process belongs to.
  - If cgroup v1 is in use, the file contains multiple lines, one per hierarchy. The cgroup v2 entry always has the format “0::$PATH”:
# cat /proc/842/cgroup
...
0::/test-cgroup/test-cgroup-nested
  The output on a system that also mounts cgroup v1 (translator's note):
$ cat /proc/$$/cgroup
13:blkio:/user.slice
12:pids:/user.slice/user-1000.slice/user@1000.service
11:cpu,cpuacct:/user.slice
10:freezer:/
9:devices:/user.slice
8:misc:/
7:net_cls,net_prio:/
6:perf_event:/
5:rdma:/
4:memory:/user.slice/user-1000.slice/user@1000.service
3:cpuset:/
2:hugetlb:/
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-12708505-7a30-4f4a-99a6-682552c4e4b3.scope
0::/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-12708505-7a30-4f4a-99a6-682552c4e4b3.scope
  - If the process becomes a zombie and the cgroup it was associated with is subsequently removed, “ (deleted)” is appended to the path:
# cat /proc/842/cgroup
...
0::/test-cgroup/test-cgroup-nested (deleted)
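Since the v2 entry always starts with 0::, extracting a process's v2 cgroup path is a one-line filter. A minimal sketch follows; the function name cgroup2_path is hypothetical and not part of the original document.

```shell
# Hypothetical helper: print the cgroup v2 path of a process.
# Reads /proc/<pid>/cgroup-style text from stdin. v1 lines have a non-zero
# hierarchy ID, so they do not match the "0::" prefix and are skipped.
cgroup2_path() {
    sed -n 's/^0:://p'
}

# Typical use on a live system:
#   cgroup2_path < /proc/self/cgroup
```

A trailing “ (deleted)” marker, if present, is left untouched so the caller can tell that the cgroup has already been removed.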
2.2.2 线程
cgroup v2 supports thread granularity for a subset of controllers to support use cases requiring hierarchical resource distribution across the threads of a group of processes. By default, all threads of a process belong to the same cgroup, which also serves as the resource domain to host resource consumptions which are not specific to a process or thread. The thread mode allows threads to be spread across a subtree while still maintaining the common resource domain for them.
Controllers which support thread mode are called threaded controllers. The ones which don’t are called domain controllers.
Marking a cgroup threaded makes it join the resource domain of its parent as a threaded cgroup. The parent may be another threaded cgroup whose resource domain is further up in the hierarchy. The root of a threaded subtree, that is, the nearest ancestor which is not threaded, is called threaded domain or thread root interchangeably and serves as the resource domain for the entire subtree.
Inside a threaded subtree, threads of a process can be put in different cgroups and are not subject to the no internal process constraint - threaded controllers can be enabled on non-leaf cgroups whether they have threads in them or not.
As the threaded domain cgroup hosts all the domain resource consumptions of the subtree, it is considered to have internal resource consumptions whether there are processes in it or not and can’t have populated child cgroups which aren’t threaded. Because the root cgroup is not subject to no internal process constraint, it can serve both as a threaded domain and a parent to domain cgroups.
The current operation mode or type of the cgroup is shown in the “cgroup.type” file which indicates whether the cgroup is a normal domain, a domain which is serving as the domain of a threaded subtree, or a threaded cgroup.
On creation, a cgroup is always a domain cgroup and can be made threaded by writing “threaded” to the “cgroup.type” file. The operation is single direction:
# echo threaded > cgroup.type
Once threaded, the cgroup can’t be made a domain again. To enable the thread mode, the following conditions must be met.
- As the cgroup will join the parent’s resource domain, the parent must either be a valid (threaded) domain or a threaded cgroup.
- When the parent is an unthreaded domain, it must not have any domain controllers enabled or populated domain children. The root is exempt from this requirement.
Topology-wise, a cgroup can be in an invalid state. Please consider the following topology:
A (threaded domain) - B (threaded) - C (domain, just created)
C is created as a domain but isn’t connected to a parent which can host child domains. C can’t be used until it is turned into a threaded cgroup. “cgroup.type” file will report “domain (invalid)” in these cases. Operations which fail due to invalid topology use EOPNOTSUPP as the errno.
A domain cgroup is turned into a threaded domain when one of its child cgroup becomes threaded or threaded controllers are enabled in the “cgroup.subtree_control” file while there are processes in the cgroup. A threaded domain reverts to a normal domain when the conditions clear.
When read, “cgroup.threads” contains the list of the thread IDs of all threads in the cgroup. Except that the operations are per-thread instead of per-process, “cgroup.threads” has the same format and behaves the same way as “cgroup.procs”. While “cgroup.threads” can be written to in any cgroup, as it can only move threads inside the same threaded domain, its operations are confined inside each threaded subtree.
The threaded domain cgroup serves as the resource domain for the whole subtree, and, while the threads can be scattered across the subtree, all the processes are considered to be in the threaded domain cgroup. “cgroup.procs” in a threaded domain cgroup contains the PIDs of all processes in the subtree and is not readable in the subtree proper. However, “cgroup.procs” can be written to from anywhere in the subtree to migrate all threads of the matching process to the cgroup.
Only threaded controllers can be enabled in a threaded subtree. When a threaded controller is enabled inside a threaded subtree, it only accounts for and controls resource consumptions associated with the threads in the cgroup and its descendants. All consumptions which aren’t tied to a specific thread belong to the threaded domain cgroup.
Because a threaded subtree is exempt from no internal process constraint, a threaded controller must be able to handle competition between threads in a non-leaf cgroup and its child cgroups. Each threaded controller defines how such competitions are handled.
2.3 [Un]populated Notification
Each non-root cgroup has a “cgroup.events” file which contains “populated” field indicating whether the cgroup’s sub-hierarchy has live processes in it. Its value is 0 if there is no live process in the cgroup and its descendants; otherwise, 1. poll and [id]notify events are triggered when the value changes. This can be used, for example, to start a clean-up operation after all processes of a given sub-hierarchy have exited. The populated state updates and notifications are recursive. Consider the following sub-hierarchy where the numbers in the parentheses represent the numbers of processes in each cgroup:
A(4) - B(0) - C(1)
            \ D(0)
A, B and C’s “populated” fields would be 1 while D’s 0. After the one process in C exits, B and C’s “populated” fields would flip to “0” and file modified events will be generated on the “cgroup.events” files of both cgroups.
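Reading the “populated” field is a simple lookup in a flat-keyed file. The sketch below assumes a cgroup directory path passed as the first argument; the function name cgroup_populated is hypothetical. It reads the current value only. To react to changes, a monitor would wait for the poll or inotify events described above, which this sketch leaves out.

```shell
# Hypothetical helper: print the "populated" field (0 or 1) of a cgroup.
# $1 is the cgroup directory; its cgroup.events file is flat-keyed,
# one "KEY VALUE" pair per line.
cgroup_populated() {
    awk '$1 == "populated" { print $2 }' "$1/cgroup.events"
}
```

In the example hierarchy, this would print 1 for A, B and C and 0 for D until the process in C exits.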
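2.4 Controlling Controllers
2.4.1 Enabling and Disabling
Each cgroup has a “cgroup.controllers” file which lists all controllers available to the cgroup:
$ cat cgroup.controllers
cpu io memory
No controller is enabled by default. Controllers are enabled and disabled by writing to the “cgroup.subtree_control” file:
$ echo "+cpu +memory -io" > cgroup.subtree_control
Only controllers which are listed in “cgroup.controllers” can be enabled.
- When multiple operations are specified at once, as in the command above, either they all succeed or they all fail;
- If multiple operations are specified for the same controller, the last one is effective.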
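Enabling a controller in a cgroup means that the distribution of the target resource across its children will be controlled. Consider the following sub-hierarchy, where the enabled controllers are listed in parentheses:
A(cpu,memory) - B(memory) - C()
                          \ D()
As A has “cpu” and “memory” enabled, A controls the distribution of CPU cycles and memory to its children, in this case B. As B has only “memory” enabled, the memory usage of C and D is controlled by B, but C and D compete freely for CPU cycles.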
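Because a controller regulates the distribution of the target resource to the cgroup's children, enabling it creates the controller's interface files in the child cgroups. In the example above:
- Enabling “cpu” on B would create the “cpu.” prefixed interface files in the directories of C and D;
- Likewise, disabling “memory” on B would remove the “memory.” prefixed interface files from C and D.
This also means that the controller interface files, that is, everything in a cgroup directory which does not start with “cgroup.”, are administratively owned by the parent cgroup rather than by the cgroup itself.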
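2.4.2 Top-down Constraint
Resources are distributed top-down, and a cgroup can further distribute a resource only if the resource has been distributed to it from its parent. This means that:
- A controller can be enabled in a cgroup only if the parent has it enabled;
- In implementation terms, the “cgroup.subtree_control” file of a non-root cgroup can only contain controllers which are enabled in the parent's “cgroup.subtree_control” file;
- Conversely, a controller cannot be disabled in a cgroup while one or more children still have it enabled.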
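The rules in sections 2.4.1 and 2.4.2 suggest checking “cgroup.controllers” before writing to “cgroup.subtree_control”. The following is a minimal sketch under that assumption; the function name cgroup_enable_controllers is hypothetical, and the kernel remains the final authority on whether the write is accepted.

```shell
# Hypothetical helper: enable controllers for the children of a cgroup.
# $1 is the cgroup directory, the remaining arguments are controller names.
# Every name must be listed in $1/cgroup.controllers, else nothing is written.
cgroup_enable_controllers() {
    dir=$1
    shift
    # Pad with spaces so each name can be matched as a whole word.
    available=" $(cat "$dir/cgroup.controllers") "
    request=""
    for ctrl in "$@"; do
        case "$available" in
            *" $ctrl "*) request="$request +$ctrl" ;;
            *) echo "controller not available: $ctrl" >&2; return 1 ;;
        esac
    done
    # A single write, so the operations are applied all together or not at all.
    printf '%s\n' "${request# }" > "$dir/cgroup.subtree_control"
}
```

For example, cgroup_enable_controllers "$CGROUP_DIR" cpu memory would write “+cpu +memory” in one operation, where CGROUP_DIR stands for whatever cgroup directory is being configured.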
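2.4.3 No Internal Process Constraint
A non-root cgroup can distribute domain resources to its children only when it has no processes of its own. In other words, only domain cgroups which do not contain any processes can have domain controllers enabled in their own “cgroup.subtree_control” files.
This guarantees that, within the part of the hierarchy a domain controller is looking at, processes are always only on the leaves. It rules out situations where processes in child cgroups compete against processes in the parent, and it makes the hierarchy easier for a domain controller to walk.
The root cgroup is exempt from this restriction.
- For most types of controllers, the root contains processes and anonymous resource consumption which cannot be associated with any other cgroup and which require special treatment.
- How resource consumption in the root cgroup is governed is up to each controller (for more information, refer to the Controllers section).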
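Note that the restriction does not get in the way as long as no controller is enabled in the cgroup's own “cgroup.subtree_control”. This is important, as otherwise it would not be possible to create children of a populated cgroup. To control the resource distribution of a cgroup, the cgroup must:
- create child cgroups,
- transfer all of its processes to the children,
- then enable controllers in its own “cgroup.subtree_control” file.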
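The three steps above can be sketched as a shell function. This is an illustration under stated assumptions rather than a production tool: the function name cgroup_push_down_and_enable and the child name are hypothetical, and the sketch moves all processes into a single child.

```shell
# Hypothetical helper illustrating the three steps for cgroup directory $1.
#   $2     name of the child cgroup to create (for example "leaf")
#   $3...  controllers to enable once the parent holds no processes
cgroup_push_down_and_enable() {
    parent=$1
    child=$2
    shift 2
    # Step 1: create the child cgroup (-p tolerates an existing one).
    mkdir -p "$parent/$child" || return 1
    # Step 2: snapshot the PID list first, then move one PID per write.
    pids=$(cat "$parent/cgroup.procs")
    for pid in $pids; do
        echo "$pid" > "$parent/$child/cgroup.procs" 2>/dev/null ||
            echo "could not move PID $pid (already exited?)" >&2
    done
    # Step 3: enable the controllers in the parent's own subtree_control.
    request=""
    for ctrl in "$@"; do
        request="$request +$ctrl"
    done
    echo "${request# }" > "$parent/cgroup.subtree_control"
}
```

Failed moves are only reported, because a listed PID may belong to a process that has exited in the meantime. If a live process is still left in the parent, the final write is expected to fail, so the function's exit status reflects whether the controllers were actually enabled.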
2.5 Delegation
2.5.1 Model of Delegation
A cgroup can be delegated in two ways. First, to a less privileged user by granting write access of the directory and its “cgroup.procs”, “cgroup.threads” and “cgroup.subtree_control” files to the user. Second, if the “nsdelegate” mount option is set, automatically to a cgroup namespace on namespace creation.
Because the resource control interface files in a given directory control the distribution of the parent’s resources, the delegatee shouldn’t be allowed to write to them. For the first method, this is achieved by not granting access to these files. For the second, the kernel rejects writes to all files other than “cgroup.procs” and “cgroup.subtree_control” on a namespace root from inside the namespace.
The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, organize processes inside it as it sees fit and further distribute the resources it received from the parent. The limits and other settings of all resource controllers are hierarchical and regardless of what happens in the delegated sub-hierarchy, nothing can escape the resource restrictions imposed by the parent.
Currently, cgroup doesn’t impose any restrictions on the number of cgroups in or nesting depth of a delegated sub-hierarchy; however, this may be limited explicitly in the future.
2.5.2 Delegation Containment
A delegated sub-hierarchy is contained in the sense that processes can’t be moved into or out of the sub-hierarchy by the delegatee.
For delegations to a less privileged user, this is achieved by requiring the following conditions for a process with a non-root euid to migrate a target process into a cgroup by writing its PID to the “cgroup.procs” file.
- The writer must have write access to the “cgroup.procs” file.
- The writer must have write access to the “cgroup.procs” file of the common ancestor of the source and destination cgroups.
The above two constraints ensure that while a delegatee may migrate processes around freely in the delegated sub-hierarchy it can’t pull in from or push out to outside the sub-hierarchy.
For an example, let’s assume cgroups C0 and C1 have been delegated to user U0 who created C00, C01 under C0 and C10 under C1 as follows and all processes under C0 and C1 belong to U0:
~~~~~~~~~~~~~ - C0 - C00
~ cgroup    ~      \ C01
~ hierarchy ~
~~~~~~~~~~~~~ - C1 - C10
Let’s also say U0 wants to write the PID of a process which is currently in C10 into “C00/cgroup.procs”. U0 has write access to the file; however, the common ancestor of the source cgroup C10 and the destination cgroup C00 is above the points of delegation and U0 would not have write access to its “cgroup.procs” files and thus the write will be denied with -EACCES.
For delegations to namespaces, containment is achieved by requiring that both the source and destination cgroups are reachable from the namespace of the process which is attempting the migration. If either is not reachable, the migration is rejected with -ENOENT.
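The common-ancestor rule for delegation to a less privileged user can be made concrete with a small path computation. The sketch below is illustrative only: the function name cgroup_common_ancestor is hypothetical, and it works on cgroup paths relative to the hierarchy root, such as /C0/C00, not on the kernel's internal data structures.

```shell
# Hypothetical helper: print the deepest common ancestor of two cgroup paths.
# Both arguments are absolute paths relative to the cgroup root, e.g. /C0/C00.
cgroup_common_ancestor() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, pa, "/")
        m = split(b, pb, "/")
        out = ""
        # Element 1 is the empty string before the leading slash.
        for (i = 2; i <= n && i <= m; i++) {
            if (pa[i] != pb[i] || pa[i] == "") break
            out = out "/" pa[i]
        }
        print (out == "" ? "/" : out)
    }'
}
```

For the example above, cgroup_common_ancestor /C1/C10 /C0/C00 prints /, which lies above the delegation points C0 and C1, so U0 lacks write access to its “cgroup.procs” and the migration is denied. A move from /C0/C00 to /C0/C01 yields /C0, which U0 can write to.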
2.6 Guidelines
2.6.1 Organize Once and Control
Migrating a process across cgroups is a relatively expensive operation and stateful resources such as memory are not moved together with the process. This is an explicit design decision as there often exist inherent trade-offs between migration and various hot paths in terms of synchronization cost.
As such, migrating processes across cgroups frequently as a means to apply different resource restrictions is discouraged. A workload should be assigned to a cgroup according to the system’s logical and resource structure once on start-up. Dynamic adjustments to resource distribution can be made by changing controller configuration through the interface files.
2.6.2 Avoid Name Collisions
Interface files for a cgroup and its children cgroups occupy the same directory and it is possible to create children cgroups which collide with interface files.
All cgroup core interface files are prefixed with “cgroup.” and each controller’s interface files are prefixed with the controller name and a dot. A controller’s name is composed of lower case alphabets and ‘_’s but never begins with an ‘_’ so it can be used as the prefix character for collision avoidance. Also, interface file names won’t start or end with terms which are often used in categorizing workloads such as job, service, slice, unit or workload.
cgroup doesn’t do anything to prevent name collisions and it’s the user’s responsibility to avoid them.
3 Resource Distribution Models
cgroup controllers implement several resource distribution schemes depending on the resource type and expected use cases. This section describes major schemes in use along with their expected behaviors.
3.1 Weights
A parent’s resource is distributed by adding up the weights of all active children and giving each the fraction matching the ratio of its weight against the sum. As only children which can make use of the resource at the moment participate in the distribution, this is work-conserving. Due to the dynamic nature, this model is usually used for stateless resources.
All weights are in the range [1, 10000] with the default at 100. This allows symmetric multiplicative biases in both directions at fine enough granularity while staying in the intuitive range.
As long as the weight is in range, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
“cpu.weight” proportionally distributes CPU cycles to active children and is an example of this type.
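The arithmetic of the weight model is each child's weight divided by the sum of the weights of all active children. A minimal sketch follows; the function name weight_shares is hypothetical, and it assumes that every weight passed in belongs to a child that is currently active.

```shell
# Hypothetical helper: print each active child's share of the parent's
# resource as a percentage, given the children's weights as arguments.
weight_shares() {
    printf '%s\n' "$@" | awk '
        { w[NR] = $1; sum += $1 }
        END { for (i = 1; i <= NR; i++) printf "%.2f\n", 100 * w[i] / sum }'
}
```

For three active children with weights 100, 200 and 100, weight_shares 100 200 100 prints 25.00, 50.00 and 25.00. If the second child goes idle, the remaining two split the resource evenly, which is the work-conserving behavior described above.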
3.2 Limits
A child can only consume up to the configured amount of the resource. Limits can be over-committed - the sum of the limits of children can exceed the amount of resource available to the parent.
Limits are in the range [0, max] and defaults to “max”, which is noop.
As limits can be over-committed, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
“io.max” limits the maximum BPS and/or IOPS that a cgroup can consume on an IO device and is an example of this type.
3.3 Protections
A cgroup is protected up to the configured amount of the resource as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case only up to the amount available to the parent is protected among children.
Protections are in the range [0, max] and defaults to 0, which is noop.
As protections can be over-committed, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
“memory.low” implements best-effort memory protection and is an example of this type.
3.4 Allocations
A cgroup is exclusively allocated a certain amount of a finite resource. Allocations can’t be over-committed - the sum of the allocations of children can not exceed the amount of resource available to the parent.
Allocations are in the range [0, max] and defaults to 0, which is no resource.
As allocations can’t be over-committed, some configuration combinations are invalid and should be rejected. Also, if the resource is mandatory for execution of processes, process migrations may be rejected.
“cpu.rt.max” hard-allocates realtime slices and is an example of this type.
4 Interface Files
4.1 Format
All interface files should be in one of the following formats whenever possible:
New-line separated values
(when only one value can be written at once)
VAL0\n
VAL1\n
...
Space separated values
(when read-only or multiple values can be written at once)
VAL0 VAL1 ...\n
Flat keyed
KEY0 VAL0\n
KEY1 VAL1\n
...
Nested keyed
KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
...
For a writable file, the format for writing should generally match reading; however, controllers may allow omitting later fields or implement restricted shortcuts for most common use cases.
For both flat and nested keyed files, only the values for a single key can be written at a time. For nested keyed files, the sub key pairs may be specified in any order and not all pairs have to be specified.
4.2 Conventions
- Settings for a single feature should be contained in a single file.
- The root cgroup should be exempt from resource control and thus shouldn’t have resource control interface files.
- The default time unit is microseconds. If a different unit is ever used, an explicit unit suffix must be present.
- A parts-per quantity should use a percentage decimal with at least two digit fractional part - e.g. 13.40.
- If a controller implements weight based resource distribution, its interface file should be named “weight” and have the range [1, 10000] with 100 as the default. The values are chosen to allow enough and symmetric bias in both directions while keeping it intuitive (the default is 100%).
- If a controller implements an absolute resource guarantee and/or limit, the interface files should be named “min” and “max” respectively. If a controller implements best effort resource guarantee and/or limit, the interface files should be named “low” and “high” respectively.
In the above four control files, the special token “max” should be used to represent upward infinity for both reading and writing.
- If a setting has a configurable default value and keyed specific overrides, the default entry should be keyed with “default” and appear as the first entry in the file.
The default value can be updated by writing either “default $VAL” or “$VAL”.
When writing to update a specific override, “default” can be used as the value to indicate removal of the override. Override entries with “default” as the value must not appear when read.
For example, a setting which is keyed by major:minor device numbers with integer values may look like the following:
# cat cgroup-example-interface-file
default 150
8:0 300
- The default value can be updated by:
# echo 125 > cgroup-example-interface-file
- or
# echo "default 125" > cgroup-example-interface-file
- An override can be set by:
# echo "8:16 170" > cgroup-example-interface-file
- and cleared by:
# echo "8:0 default" > cgroup-example-interface-file
# cat cgroup-example-interface-file
default 125
8:16 170
- For events which are not very high frequency, an interface file “events” should be created which lists event key value pairs. Whenever a notifiable event happens, file modified event should be generated on the file.
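For a file that follows the default-plus-overrides convention above, the effective value for a key is its own override if one exists, and the “default” entry otherwise. A minimal lookup sketch follows; the function name keyed_effective_value is hypothetical, and the file is assumed to be flat-keyed with one “KEY VALUE” pair per line, as in the example file.

```shell
# Hypothetical helper: print the effective value for key $2 in keyed file $1.
# An explicit override wins; otherwise the "default" entry applies.
keyed_effective_value() {
    awk -v key="$2" '
        $1 == "default" { def = $2 }
        $1 == key       { val = $2; found = 1 }
        END { print (found ? val : def) }' "$1"
}
```

Against the first listing of the example file (default 150, 8:0 300), a lookup of 8:0 gives 300 and a lookup of 8:16 falls back to 150.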
4.3 Core Interface Files
All cgroup core files are prefixed with “cgroup.”
- cgroup.type
A read-write single value file which exists on non-root cgroups.
When read, it indicates the current type of the cgroup, which can be one of the following values.
  - “domain” : A normal valid domain cgroup.
  - “domain threaded” : A threaded domain cgroup which is serving as the root of a threaded subtree.
  - “domain invalid” : A cgroup which is in an invalid state. It can’t be populated or have controllers enabled. It may be allowed to become a threaded cgroup.
  - “threaded” : A threaded cgroup which is a member of a threaded subtree.
A cgroup can be turned into a threaded cgroup by writing “threaded” to this file.
- cgroup.procs
A read-write new-line separated values file which exists on all cgroups.
When read, it lists the PIDs of all processes which belong to the cgroup one-per-line. The PIDs are not ordered and the same PID may show up more than once if the process got moved to another cgroup and then back or the PID got recycled while reading.
A PID can be written to migrate the process associated with the PID to the cgroup. The writer should match all of the following conditions.
  - It must have write access to the “cgroup.procs” file.
  - It must have write access to the “cgroup.procs” file of the common ancestor of the source and destination cgroups.
When delegating a sub-hierarchy, write access to this file should be granted along with the containing directory.
In a threaded cgroup, reading this file fails with EOPNOTSUPP as all the processes belong to the thread root. Writing is supported and moves every thread of the process to the cgroup.
- cgroup.threads
A read-write new-line separated values file which exists on all cgroups.
When read, it lists the TIDs of all threads which belong to the cgroup one-per-line. The TIDs are not ordered and the same TID may show up more than once if the thread got moved to another cgroup and then back or the TID got recycled while reading.
A TID can be written to migrate the thread associated with the TID to the cgroup. The writer should match all of the following conditions.
  - It must have write access to the “cgroup.threads” file.
  - The cgroup that the thread is currently in must be in the same resource domain as the destination cgroup.
  - It must have write access to the “cgroup.procs” file of the common ancestor of the source and destination cgroups.
When delegating a sub-hierarchy, write access to this file should be granted along with the containing directory.
- cgroup.controllers
A read-only space separated values file which exists on all cgroups.
It shows space separated list of all controllers available to the cgroup. The controllers are not ordered.
- cgroup.subtree_control
A read-write space separated values file which exists on all cgroups. Starts out empty.
When read, it shows space separated list of the controllers which are enabled to control resource distribution from the cgroup to its children.
Space separated list of controllers prefixed with ‘+’ or ‘-’ can be written to enable or disable controllers. A controller name prefixed with ‘+’ enables the controller and ‘-’ disables. If a controller appears more than once on the list, the last one is effective. When multiple enable and disable operations are specified, either all succeed or all fail.
- cgroup.events
A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
  - populated
  1 if the cgroup or its descendants contains any live processes; otherwise, 0.
  - frozen
  1 if the cgroup is frozen; otherwise, 0.
- cgroup.max.descendants
A read-write single value file. The default is “max”.
Maximum allowed number of descendant cgroups. If the actual number of descendants is equal or larger, an attempt to create a new cgroup in the hierarchy will fail.
- cgroup.max.depth
A read-write single value file. The default is “max”.
Maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail.
- cgroup.stat
A read-only flat-keyed file with the following entries:
  - nr_descendants
  Total number of visible descendant cgroups.
  - nr_dying_descendants
  Total number of dying descendant cgroups. A cgroup becomes dying after being deleted by a user. The cgroup will remain in dying state for some undefined time (which can depend on system load) before being completely destroyed.
  A process can’t enter a dying cgroup under any circumstances, and a dying cgroup can’t revive.
  A dying cgroup can consume system resources not exceeding the limits which were active at the moment of cgroup deletion.
- cgroup.freeze
A read-write single value file which exists on non-root cgroups. Allowed values are “0” and “1”. The default is “0”.
Writing “1” to the file causes freezing of the cgroup and all descendant cgroups. This means that all belonging processes will be stopped and will not run until the cgroup will be explicitly unfrozen. Freezing of the cgroup may take some time; when this action is completed, the “frozen” value in the cgroup.events control file will be updated to “1” and the corresponding notification will be issued.
A cgroup can be frozen either by its own settings, or by settings of any ancestor cgroups. If any of ancestor cgroups is frozen, the cgroup will remain frozen.
Processes in the frozen cgroup can be killed by a fatal signal. They also can enter and leave a frozen cgroup: either by an explicit move by a user, or if freezing of the cgroup races with fork(). If a process is moved to a frozen cgroup, it stops. If a process is moved out of a frozen cgroup, it becomes running.
Frozen status of a cgroup doesn’t affect any cgroup tree operations: it’s possible to delete a frozen (and empty) cgroup, as well as create new sub-cgroups.
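Because freezing may take some time, a caller that needs the cgroup to be fully frozen has to wait for the “frozen” field in “cgroup.events” to become 1. The following is a minimal sketch: the function name cgroup_freeze_wait is hypothetical, and it polls at a fixed interval as a simplification, whereas a real monitor would rather wait for the file modified event.

```shell
# Hypothetical helper: freeze a cgroup and wait until it reports frozen.
# $1 is the cgroup directory; $2 is the maximum number of 0.1 s polls
# (default 50). Returns 0 once "frozen 1" is seen, 1 on timeout.
cgroup_freeze_wait() {
    dir=$1
    tries=${2:-50}
    echo 1 > "$dir/cgroup.freeze" || return 1
    while [ "$tries" -gt 0 ]; do
        state=$(awk '$1 == "frozen" { print $2 }' "$dir/cgroup.events")
        [ "$state" = "1" ] && return 0
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}
```

Writing 0 to “cgroup.freeze” afterwards unfreezes the cgroup again. The fractional sleep assumes a sleep implementation that accepts it, which is the usual case on Linux.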
5 Controllers
5.1 CPU
The “cpu” controller regulates distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy.
In all the above models, cycles distribution is defined only on a temporal base and it does not account for the frequency at which tasks are executed. The (optional) utilization clamping support allows to hint the schedutil cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU.
WARNING: cgroup2 doesn’t yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already have placed RT processes into nonroot cgroups during the system boot process, and these processes may need to be moved to the root cgroup before the cpu controller can be enabled.
5.1.1 CPU Interface Files
All time durations are in microseconds.
- cpu.stat
A read-only flat-keyed file. This file exists whether the controller is enabled or not.
It always reports the following three stats:
- usage_usec
- user_usec
- system_usec
and the following three when the controller is enabled:
- nr_periods
- nr_throttled
- throttled_usec
- cpu.weight
A read-write single value file which exists on non-root cgroups. The default is “100”.
The weight is in the range [1, 10000].
- cpu.weight.nice
A read-write single value file which exists on non-root cgroups. The default is “0”.
The nice value is in the range [-20, 19].
This interface file is an alternative interface for “cpu.weight” and allows reading and setting weight using the same values used by nice(2). Because the range is smaller and granularity is coarser for the nice values, the read value is the closest approximation of the current weight.
- cpu.max
A read-write two value file which exists on non-root cgroups. The default is “max 100000”.
The maximum bandwidth limit. It’s in the following format:
$MAX $PERIOD
which indicates that the group may consume up to $MAX in each $PERIOD duration. “max” for $MAX indicates no limit. If only one number is written, $MAX is updated.
- cpu.pressure
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for CPU. See PSI - Pressure Stall Information for details.
- cpu.uclamp.min
A read-write single value file which exists on non-root cgroups. The default is “0”, i.e. no utilization boosting.
The requested minimum utilization (protection) as a percentage rational number, e.g. 12.34 for 12.34%.
This interface allows reading and setting minimum utilization clamp values similar to sched_setattr(2). This minimum utilization value is used to clamp the task specific minimum utilization clamp.
The requested minimum utilization (protection) is always capped by the current value for the maximum utilization (limit), i.e. cpu.uclamp.max.
- cpu.uclamp.max
A read-write single value file which exists on non-root cgroups. The default is “max”, i.e. no utilization capping.
The requested maximum utilization (limit) as a percentage rational number, e.g. 98.76 for 98.76%.
This interface allows reading and setting maximum utilization clamp values similar to sched_setattr(2). This maximum utilization value is used to clamp the task specific maximum utilization clamp.
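As a rough illustration of how the cpu.max and cpu.stat files above can be interpreted, the following Python sketch parses their contents from strings. It assumes the upstream semantics of cpu.max (the group may consume up to $MAX microseconds in each $PERIOD, with “max” meaning no limit); the function names are made up for this example, and reading the actual files from a cgroup directory is left to the caller.

```python
def parse_cpu_max(text):
    """Parse cpu.max ("$MAX $PERIOD") into (quota_usec or None, period_usec)."""
    max_str, period_str = text.split()
    quota = None if max_str == "max" else int(max_str)
    return quota, int(period_str)


def cpu_limit_in_cpus(text):
    """Express the bandwidth limit as a number of CPUs, or None when unlimited."""
    quota, period = parse_cpu_max(text)
    return None if quota is None else quota / period


def throttled_fraction(cpu_stat_text):
    """Fraction of enforcement periods in which the cgroup was throttled.

    Returns 0.0 when nr_periods is missing or zero, which is the case
    when the cpu controller is not enabled for the cgroup.
    """
    stats = dict(
        line.split() for line in cpu_stat_text.splitlines() if line.strip()
    )
    periods = int(stats.get("nr_periods", 0))
    if periods == 0:
        return 0.0
    return int(stats.get("nr_throttled", 0)) / periods
```

For example, a cpu.max content of “50000 100000” corresponds to half a CPU, and a steadily growing throttled fraction suggests that the limit is too tight for the workload.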
5.2 Memory
The “memory” controller regulates distribution of memory. Memory is stateful and implements both limit and protection models. Due to the intertwining between memory usage and reclaim pressure and the stateful nature of memory, the distribution model is relatively complex.
While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. Currently, the following types of memory usages are tracked.
- Userland memory - page cache and anonymous memory.
- Kernel data structures such as dentries and inodes.
- TCP socket buffers.
The above list may expand in the future for better coverage.
5.2.1 Memory Interface Files
All memory amounts are in bytes. If a value which is not aligned to PAGE_SIZE is written, the value may be rounded up to the closest PAGE_SIZE multiple when read back.
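The rounding rule can be illustrated with a small Python sketch. The 4096-byte default page size is an assumption for the example (the real value depends on the architecture and can be obtained with `os.sysconf("SC_PAGE_SIZE")`), and both helper names are hypothetical.

```python
def round_up_to_page(value, page_size=4096):
    """Round a byte count up to the closest multiple of page_size."""
    return -(-value // page_size) * page_size


def parse_memory_value(text):
    """Parse a memory interface value: "max" becomes None, otherwise int bytes."""
    text = text.strip()
    return None if text == "max" else int(text)
```

Under these assumptions, writing 4097 to a limit file may read back as 8192, while a value that is already page-aligned is unchanged.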
memory.current A read-only single value file which exists on non-root cgroups.
The total amount of memory currently being used by the cgroup and its descendants.
memory.min A read-write single value file which exists on non-root cgroups. The default is “0”.
Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup’s memory won’t be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent’s protection proportional to its actual memory usage below memory.min.
Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs.
If a memory cgroup is not populated with processes, its memory.min is ignored.
memory.low A read-write single value file which exists on non-root cgroups. The default is “0”.
Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup’s memory won’t be reclaimed unless there is no reclaimable memory available in unprotected cgroups. Above the effective low boundary (or effective min boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent’s protection proportional to its actual memory usage below memory.low.
Putting more memory than generally available under this protection is discouraged.
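The overcommitment rule shared by memory.min and memory.low can be sketched as follows. This is a simplified model of the text above, not the kernel's exact computation, which handles additional cases; the function name and the input layout are made up for this example.

```python
def distribute_protection(parent_protection, children):
    """Split a parent's effective protection among its children.

    children maps a name to (configured_protection, usage), in bytes.
    Each child's claim is its usage below its configured protection.
    If the claims fit within the parent's protection they are granted
    as-is; otherwise each child gets a share proportional to its claim.
    """
    claims = {
        name: min(configured, usage)
        for name, (configured, usage) in children.items()
    }
    total = sum(claims.values())
    if total <= parent_protection:
        return claims
    return {
        name: parent_protection * claim / total
        for name, claim in claims.items()
    }
```

With a parent protection of 90 and two children that claim 100 and 50, the children end up with 60 and 30 respectively, so a child's usage above its configured protection never increases its share.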
memory.high A read-write single value file which exists on non-root cgroups. The default is “max”.
Memory usage throttle limit. This is the main mechanism to control memory usage of a cgroup. If a cgroup’s usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure.
Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached.
memory.max A read-write single value file which exists on non-root cgroups. The default is “max”.
Memory usage hard limit. This is the final protection mechanism. If a cgroup’s memory usage reaches this limit and can’t be reduced, the OOM killer is invoked in the cgroup. Under certain circumstances, the usage may go over the limit temporarily.
In the default configuration, regular 0-order allocations always succeed unless the OOM killer chooses the current task as a victim.
Some kinds of allocations don’t invoke the OOM killer. The caller could retry them differently, return to userspace with -ENOMEM, or silently ignore the failure in cases like disk readahead.
This is the ultimate protection mechanism. As long as the high limit is used and monitored properly, this limit’s utility is limited to providing the final safety net.
memory.oom.group A read-write single value file which exists on non-root cgroups. The default value is “0”.
Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.
Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed.
If the OOM killer is invoked in a cgroup, it’s not going to kill any tasks outside of this cgroup, regardless of the memory.oom.group values of ancestor cgroups.
memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
Note that all fields in this file are hierarchical and the file modified event can be generated due to an event down the hierarchy. For the local events at the cgroup level see memory.events.local.
- low: The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under the low boundary. This usually indicates that the low boundary is over-committed.
- high: The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event’s occurrences are expected.
- max: The number of times the cgroup’s memory usage was about to go over the max boundary. If direct reclaim fails to bring it down, the cgroup goes to OOM state.
- oom: The number of times the cgroup’s memory usage reached the limit and allocation was about to fail.
This event is not raised if the OOM killer is not considered as an option, e.g. for failed high-order allocations or if caller asked to not retry attempts.
- oom_kill: The number of processes belonging to this cgroup killed by any kind of OOM killer.
memory.events.local Similar to memory.events but the fields in the file are local to the cgroup, i.e. not hierarchical. The file modified event generated on this file reflects only the local events.
memory.stat A read-only flat-keyed file which exists on non-root cgroups.
This breaks down the cgroup’s memory footprint into different types of memory, type-specific details, and other information on the state and past events of the memory management system.
All memory amounts are in bytes.
The entries are ordered to be human readable, and new entries can show up in the middle. Don’t rely on items remaining in a fixed position; use the keys to look up specific values!
If an entry has no per-node counter (that is, it does not show up in memory.numa_stat), it is tagged with ‘npn’ (non-per-node) to indicate that it will not appear in memory.numa_stat.
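Since entries must be looked up by key rather than by position, a reader of memory.stat would typically load it into a dictionary first. The Python sketch below shows one way to do that and derives one illustrative value; the helper names are hypothetical, and the subtraction relies only on the descriptions of the file and shmem entries given in this section.

```python
def parse_memory_stat(text):
    """Parse memory.stat ("key value" per line) into a dict keyed by entry name."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def page_cache_excluding_shmem(stats):
    """Bytes of cached filesystem data that are not swap-backed.

    "file" includes tmpfs and shared memory, while "shmem" counts the
    swap-backed part, so the difference approximates regular page cache.
    """
    return stats.get("file", 0) - stats.get("shmem", 0)
```

Looking values up by key keeps the code working when new entries appear in the middle of the file.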
- anon: Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS).
- file: Amount of memory used to cache filesystem data, including tmpfs and shared memory.
- kernel_stack: Amount of memory allocated to kernel stacks.
- percpu(npn): Amount of memory used for storing per-cpu kernel data structures.
- sock(npn): Amount of memory used in network transmission buffers.
- shmem: Amount of cached filesystem data that is swap-backed, such as tmpfs, shm segments, shared anonymous mmap()s.
- file_mapped: Amount of cached filesystem data mapped with mmap().
- file_dirty: Amount of cached filesystem data that was modified but not yet written back to disk.
- file_writeback: Amount of cached filesystem data that was modified and is currently being written back to disk.
- anon_thp: Amount of memory used in anonymous mappings backed by transparent hugepages.
- inactive_anon, active_anon, inactive_file, active_file, unevictable: Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the page reclaim algorithm.
As these represent internal list state (eg. shmem pages are on anon memory management lists), inactive_foo + active_foo may not be equal to the value for the foo counter, since the foo counter is type-based, not list-based.
- slab_reclaimable: Part of “slab” that might be reclaimed, such as dentries and inodes.
- slab_unreclaimable: Part of “slab” that cannot be reclaimed on memory pressure.
- slab(npn): Amount of memory used for storing in-kernel data structures.
- workingset_refault_anon: Number of refaults of previously evicted anonymous pages.
- workingset_refault_file: Number of refaults of previously evicted file pages.
- workingset_activate_anon: Number of refaulted anonymous pages that were immediately activated.
- workingset_activate_file: Number of refaulted file pages that were immediately activated.
- workingset_restore_anon: Number of restored anonymous pages which have been detected as an active workingset before they got reclaimed.
- workingset_restore_file: Number of restored file pages which have been detected as an active workingset before they got reclaimed.
- workingset_nodereclaim: Number of times a shadow node has been reclaimed.
- pgfault(npn): Total number of page faults incurred.
- pgmajfault(npn): Number of major page faults incurred.
- pgrefill(npn): Amount of scanned pages (in an active LRU list).
- pgscan(npn): Amount of scanned pages (in an inactive LRU list).
- pgsteal(npn): Amount of reclaimed pages.
- pgactivate(npn): Amount of pages moved to the active LRU list.
- pgdeactivate(npn): Amount of pages moved to the inactive LRU list.
- pglazyfree(npn): Amount of pages postponed to be freed under memory pressure.
- pglazyfreed(npn): Amount of reclaimed lazyfree pages.
- thp_fault_alloc(npn): Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
- thp_collapse_alloc(npn): Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups.
This breaks down the cgroup’s memory footprint into different types of memory, type-specific details, and other information per node on the state of the memory management system.
This is useful for providing visibility into the NUMA locality information within a memcg, since the pages are allowed to be allocated from any physical node. One use case is evaluating application performance by combining this information with the application’s CPU allocation.
All memory amounts are in bytes.
The output format of memory.numa_stat is:
type N0=<bytes in node 0> N1=<bytes in node 1> ...
The entries are ordered to be human readable, and new entries can show up in the middle. Don’t rely on items remaining in a fixed position; use the keys to look up specific values!
For the meaning of each entry, refer to memory.stat.
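The nested-keyed format shown above can be parsed into a per-type, per-node mapping. The following Python sketch assumes every field after the type has the form `N<node>=<bytes>`, as in the format line; the function name and the sample values used with it are made up for illustration.

```python
def parse_numa_stat(text):
    """Parse memory.numa_stat into {type: {node_id: bytes}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        per_node = {}
        for item in fields[1:]:
            node, _, value = item.partition("=")
            per_node[int(node[1:])] = int(value)  # "N0" -> node id 0
        result[fields[0]] = per_node
    return result
```

A caller interested in locality could then compare, for each type, the bytes held on the node where the workload's CPUs live against the bytes held elsewhere.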
memory.swap.current A read-only single value file which exists on non-root cgroups.
The total amount of swap currently being used by the cgroup and its descendants.
memory.swap.high A read-write single value file which exists on non-root cgroups. The default is “max”.
Swap usage throttle limit. If a cgroup’s swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures.
This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed.
Healthy workloads are not expected to reach this limit.
memory.swap.max A read-write single value file which exists on non-root cgroups. The default is “max”.
Swap usage hard limit. If a cgroup’s swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out.
memory.swap.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
- high: The number of times the cgroup’s swap usage was over the high threshold.
- max: The number of times the cgroup’s swap usage was about to go over the max boundary and swap allocation failed.
- fail: The number of times swap allocation failed either because of running out of swap system-wide or max limit.
When reduced under the current usage, the existing swap entries are reclaimed gradually and the swap usage may stay higher than the limit for an extended period of time. This reduces the impact on the workload and memory management.
memory.pressure A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for memory. See PSI - Pressure Stall Information for details.
5.2.2 Usage Guidelines
“memory.high” is the main mechanism to control memory usage. Over-committing on the high limit (sum of high limits > available memory) and letting global memory pressure distribute memory according to usage is a viable strategy.
Because breach of the high limit doesn’t trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.
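A management agent of the kind mentioned above could compare successive snapshots of memory.events and react to counters that grew. The Python sketch below is only an outline under stated assumptions: the action names returned by `suggest_action` are a hypothetical policy, and obtaining the snapshots (by polling or by waiting for file modified events) is left to the caller.

```python
def parse_events(text):
    """Parse memory.events ("key value" per line) into a dict of ints."""
    return {
        key: int(value)
        for key, value in (
            line.split() for line in text.splitlines() if line.strip()
        )
    }


def events_delta(previous, current):
    """Counters that grew between two snapshots, mapped to their increase."""
    return {
        key: value - previous.get(key, 0)
        for key, value in current.items()
        if value > previous.get(key, 0)
    }


def suggest_action(delta):
    """Hypothetical policy: map newly observed events to a coarse action."""
    if delta.get("oom_kill", 0) or delta.get("oom", 0):
        return "investigate-oom"
    if delta.get("high", 0):
        return "grant-more-memory-or-stop-workload"
    return "no-action"
```

Working on deltas rather than absolute counters matters because the values in memory.events only ever describe how often something has happened since the cgroup was created.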
Determining whether a cgroup has enough memory is not trivial as memory usage doesn’t indicate whether the workload can benefit from more memory. For example, a workload which writes data received from network to a file can use all available memory but can also perform just as well with a small amount of memory. A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn’t implemented yet.
5.2.3 Memory Ownership
A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn’t move the memory usages that it instantiated while in the previous cgroup to the new cgroup.
A memory area may be used by processes belonging to different cgroups. Which cgroup the area will be charged to is indeterminate; however, over time, the memory area is likely to end up in a cgroup which has enough memory allowance to avoid high reclaim pressure.
If a cgroup sweeps a considerable amount of memory which is expected to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership.
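In Python, the POSIX_FADV_DONTNEED hint is available through `os.posix_fadvise`. The sketch below applies it to a whole file; the function name is made up for this example. Note that the hint is advisory and that pages which have not yet been written back are not dropped, so a writer may want to flush the file first.

```python
import os


def relinquish_page_cache(path):
    """Advise the kernel that cached pages of the file are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The file contents are unaffected; only the cached copies, and with them the memory charge, are released where possible.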