5.3 IO
The “io” controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched is in use and neither scheme is available for blk-mq devices.
5.3.1 IO Interface Files
- io.stat
A read-only nested-keyed file.
Lines are keyed by $MAJ:$MIN device numbers and are not ordered. The following nested keys are defined.
| Key | Description |
| --- | --- |
| rbytes | Bytes read |
| wbytes | Bytes written |
| rios | Number of read IOs |
| wios | Number of write IOs |
| dbytes | Bytes discarded |
| dios | Number of discard IOs |
An example read output follows:

```
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
```
- io.cost.qos
A read-write nested-keyed file which exists only on the root cgroup.
This file configures the Quality of Service of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements “io.weight” proportional control. Lines are keyed by $MAJ:$MIN device numbers and are not ordered. The line for a given device is populated on the first write for the device on “io.cost.qos” or “io.cost.model”. The following nested keys are defined.
| Key | Description |
| --- | --- |
| enable | Weight-based control enable |
| ctrl | “auto” or “user” |
| rpct | Read latency percentile [0, 100] |
| rlat | Read latency threshold |
| wpct | Write latency percentile [0, 100] |
| wlat | Write latency threshold |
| min | Minimum scaling percentage [1, 10000] |
| max | Maximum scaling percentage [1, 10000] |
The controller is disabled by default and can be enabled by setting “enable” to 1. “rpct” and “wpct” parameters default to zero and the controller uses internal device saturation state to adjust the overall IO rate between “min” and “max”.
When a better control quality is needed, latency QoS parameters can be configured. For example:

```
8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00
```

shows that on sdb, the controller is enabled, will consider the device saturated if the 95th percentile of read completion latencies is above 75ms or write 150ms, and adjust the overall IO issue rate between 50% and 150% accordingly.
The lower the saturation point, the better the latency QoS at the cost of aggregate bandwidth. The narrower the allowed adjustment range between “min” and “max”, the more conformant to the cost model the IO behavior. Note that the IO issue base rate may be far off from 100% and setting “min” and “max” blindly can lead to a significant loss of device capacity or control quality. “min” and “max” are useful for regulating devices which show wide temporary behavior changes - e.g. a ssd which accepts writes at the line speed for a while and then completely stalls for multiple seconds.
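For instance, a sketch of enabling the controller with explicit latency QoS parameters on device 8:16 (the thresholds mirror the example above; the cgroup2 mount point /sys/fs/cgroup is an assumption, and the values are illustrative, not recommendations):

```shell
# Enable iocost control on device 8:16 with explicit latency QoS targets.
# This is written on the root cgroup, where io.cost.qos lives.
echo "8:16 enable=1 rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00" \
  > /sys/fs/cgroup/io.cost.qos
```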
When “ctrl” is “auto”, the parameters are controlled by the kernel and may change automatically. Setting “ctrl” to “user” or setting any of the percentile and latency parameters puts it into “user” mode and disables the automatic changes. The automatic mode can be restored by setting “ctrl” to “auto”.
- io.cost.model
A read-write nested-keyed file which exists only on the root cgroup.
This file configures the cost model of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements “io.weight” proportional control. Lines are keyed by $MAJ:$MIN device numbers and are not ordered. The line for a given device is populated on the first write for the device on “io.cost.qos” or “io.cost.model”. The following nested keys are defined.
| Key | Description |
| --- | --- |
| ctrl | “auto” or “user” |
| model | The cost model in use - “linear” |
When “ctrl” is “auto”, the kernel may change all parameters dynamically. When “ctrl” is set to “user” or any other parameters are written to, “ctrl” becomes “user” and the automatic changes are disabled.
When “model” is “linear”, the following model parameters are defined.
| Key | Description |
| --- | --- |
| [r\|w]bps | The maximum sequential IO throughput |
| [r\|w]seqiops | The maximum 4k sequential IOs per second |
| [r\|w]randiops | The maximum 4k random IOs per second |
From the above, the builtin linear model determines the base costs of a sequential and random IO and the cost coefficient for the IO size. While simple, this model can cover most common device classes acceptably.
The IO cost model isn’t expected to be accurate in absolute sense and is scaled to the device behavior dynamically.
If needed, tools/cgroup/iocost_coef_gen.py can be used to generate device-specific coefficients.
- io.weight
A read-write flat-keyed file which exists on non-root cgroups. The default is “default 100”.
The first line is the default weight applied to devices without specific override. The rest are overrides keyed by $MAJ:$MIN device numbers and are not ordered. The weights are in the range [1, 10000] and specify the relative amount of IO time the cgroup can use in relation to its siblings.
The default weight can be updated by writing either “default $WEIGHT” or simply “$WEIGHT”. Overrides can be set by writing “$MAJ:$MIN $WEIGHT” and unset by writing “$MAJ:$MIN default”.
An example read output follows:

```
default 100
8:16 200
8:0 50
```
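Weights are proportional among contending siblings: for example, sibling cgroups with weights 200 and 50 would split contended device time 80%/20%. A quick shell sanity check of that arithmetic (the cgroup names are hypothetical):

```shell
# Relative share of each sibling = weight / sum of sibling weights.
fast=200   # e.g. echo 200 > fast/io.weight
slow=50    # e.g. echo 50  > slow/io.weight
echo $(( fast * 100 / (fast + slow) ))   # prints 80
echo $(( slow * 100 / (fast + slow) ))   # prints 20
```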
- io.max
A read-write nested-keyed file which exists on non-root cgroups.
BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN device numbers and are not ordered. The following nested keys are defined.
| Key | Description |
| --- | --- |
| rbps | Max read bytes per second |
| wbps | Max write bytes per second |
| riops | Max read IO operations per second |
| wiops | Max write IO operations per second |
When writing, any number of nested key-value pairs can be specified in any order. “max” can be specified as the value to remove a specific limit. If the same key is specified multiple times, the outcome is undefined.
BPS and IOPS are measured in each IO direction and IOs are delayed if limit is reached. Temporary bursts are allowed.
Setting read limit at 2M BPS and write at 120 IOPS for 8:16:

```
echo "8:16 rbps=2097152 wiops=120" > io.max
```

Reading returns the following:

```
8:16 rbps=2097152 wbps=max riops=max wiops=120
```

Write IOPS limit can be removed by writing the following:

```
echo "8:16 wiops=max" > io.max
```

Reading now returns the following:

```
8:16 rbps=2097152 wbps=max riops=max wiops=max
```
- io.pressure
A read-only nested-keyed file which exists on non-root cgroups.
Shows pressure stall information for IO. See PSI - Pressure Stall Information for details.
5.3.2 Writeback
Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback mechanism. Writeback sits between the memory and IO domains and regulates the proportion of dirty memory by balancing dirtying and write IOs.
The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs. The memory controller defines the memory domain that dirty memory ratio is calculated and maintained for and the io controller defines the io domain which writes out dirty pages for the memory domain. Both system-wide and per-cgroup dirty memory states are examined and the more restrictive of the two is enforced.
cgroup writeback requires explicit support from the underlying filesystem. Currently, cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are attributed to the root cgroup.
There are inherent differences in memory and writeback management which affects how cgroup ownership is tracked. Memory is tracked per page while writeback per inode. For the purpose of writeback, an inode is assigned to a cgroup and all IO requests to write dirty pages from the inode are attributed to that cgroup.
As cgroup ownership for memory is tracked per page, there can be pages which are associated with different cgroups than the one the inode is associated with. These are called foreign pages. The writeback constantly keeps track of foreign pages and, if a particular foreign cgroup becomes the majority over a certain period of time, switches the ownership of the inode to that cgroup.
While this model is enough for most use cases where a given inode is mostly dirtied by a single cgroup even when the main writing cgroup changes over time, use cases where multiple cgroups write to a single inode simultaneously are not supported well. In such circumstances, a significant portion of IOs are likely to be attributed incorrectly. As memory controller assigns page ownership on the first use and doesn’t update it until the page is released, even if writeback strictly follows page ownership, multiple cgroups dirtying overlapping areas wouldn’t work as expected. It’s recommended to avoid such usage patterns.
The sysctl knobs which affect writeback behavior are applied to cgroup writeback as follows.
- vm.dirty_background_ratio, vm.dirty_ratio
These ratios apply the same to cgroup writeback with the amount of available memory capped by limits imposed by the memory controller and system-wide clean memory.
- vm.dirty_background_bytes, vm.dirty_bytes
For cgroup writeback, this is calculated into a ratio against total available memory and applied the same way as vm.dirty[_background]_ratio.
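For illustration (the 8 GiB figure for available memory is an assumption), a vm.dirty_bytes of 1 GiB in a memory domain with 8 GiB available behaves like a ~12% dirty ratio:

```shell
# vm.dirty_bytes is converted into a ratio against available memory
# (integer math here; the kernel's internal calculation is more precise).
dirty_bytes=$(( 1 * 1024 * 1024 * 1024 ))    # 1 GiB configured in vm.dirty_bytes
avail_bytes=$(( 8 * 1024 * 1024 * 1024 ))    # 8 GiB available to the memory domain (assumed)
echo $(( dirty_bytes * 100 / avail_bytes ))  # prints 12
```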
5.3.3 IO Latency
This is a cgroup v2 controller for IO workload protection. You provide a group with a latency target, and if the average latency exceeds that target the controller will throttle any peers that have a lower latency target than the protected workload.
The limits are only applied at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody:

```
        [root]
       /  |   \
      A   B    C
     / \  |
    D   F G
```
So the ideal way to configure this is to set io.latency in groups A, B, and C. Generally you do not want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload. Start at higher than the expected latency for your device and watch the avg_lat value in io.stat for your workload group to get an idea of the latency you see during normal operation. Use the avg_lat value as a basis for your real setting, setting at 10-15% higher than the value in io.stat.
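Following that guidance, the 10-15% headroom can be computed directly; a sketch assuming an observed avg_lat of 2000 microseconds (a hypothetical reading):

```shell
# Suppose io.stat reports avg_lat=2000 (usec) for the workload group
# during normal operation; pad it by 15% to get the io.latency target.
avg_lat=2000
target=$(( avg_lat * 115 / 100 ))
echo "$target"   # prints 2300
```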
5.3.4 How IO Latency Throttling Works
io.latency is work conserving; so as long as everybody is meeting their latency target the controller doesn’t do anything. Once a group starts missing its target it begins throttling any peer group that has a higher target than itself. This throttling takes 2 forms:
- Queue depth throttling. This is the number of outstanding IOs a group is allowed to have. We will clamp down relatively quickly, starting at no limit and going all the way down to 1 IO at a time.
- Artificial delay induction. There are certain types of IO that cannot be throttled without possibly adversely affecting higher priority groups. This includes swapping and metadata IO. These types of IO are allowed to occur normally, however they are “charged” to the originating group. If the originating group is being throttled you will see the use_delay and delay fields in io.stat increase. The delay value is how many microseconds that are being added to any process that runs in this group. Because this number can grow quite large if there is a lot of swapping or metadata IO occurring we limit the individual delay events to 1 second at a time.
Once the victimized group starts meeting its latency target again it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO the global counter will unthrottle appropriately.
5.3.5 IO Latency Interface Files
io.latency
This takes a similar format as the other controllers.
“MAJOR:MINOR target=<target time in microseconds>”
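For example (device 8:16, the 3000 usec target, and the cgroup path are all illustrative), a target could be set with:

```shell
# Give the cgroup "protected" a 3ms latency target on device 8:16
# (assumes a cgroup2 mount at /sys/fs/cgroup).
echo "8:16 target=3000" > /sys/fs/cgroup/protected/io.latency
```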
io.stat
If the controller is enabled you will see extra stats in io.stat in addition to the normal ones.
depth
This is the current queue depth for the group.
avg_lat
This is an exponential moving average with a decay rate of 1/exp bound by the sampling interval. The decay rate interval can be calculated by multiplying the win value in io.stat by the corresponding number of samples based on the win value.
win
The sampling window size in milliseconds. This is the minimum duration of time between evaluation events. Windows only elapse with IO activity. Idle periods extend the most recent window.
5.4 PID
The process number controller is used to allow a cgroup to stop any new tasks from being fork()’d or clone()’d after a specified limit is reached.
The number of tasks in a cgroup can be exhausted in ways which other controllers cannot prevent, thus warranting its own controller. For example, a fork bomb is likely to exhaust the number of tasks before hitting memory restrictions.
Note that PIDs used in this controller refer to TIDs, process IDs as used by the kernel.
5.4.1 PID Interface Files
pids.max
A read-write single value file which exists on non-root cgroups. The default is “max”.
Hard limit of number of processes.
pids.current
A read-only single value file which exists on all cgroups.
The number of processes currently in the cgroup and its descendants.
Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough processes to the cgroup such that pids.current is larger than pids.max. However, it is not possible to violate a cgroup PID policy through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated.
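A minimal usage sketch (the cgroup name "jobs" and the /sys/fs/cgroup mount point are assumptions):

```shell
# Cap a cgroup at 16 processes; once pids.current reaches the limit,
# further fork()/clone() inside it fail with -EAGAIN.
mkdir -p /sys/fs/cgroup/jobs
echo 16 > /sys/fs/cgroup/jobs/pids.max
echo $$ > /sys/fs/cgroup/jobs/cgroup.procs   # move this shell into the cgroup
cat /sys/fs/cgroup/jobs/pids.current
```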
5.5 Cpuset
The “cpuset” controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task’s current cgroup. This is especially valuable on large NUMA systems where placing jobs on properly sized subsets of the systems with careful processor and memory placement to reduce cross-node memory access and contention can improve overall system performance.
The “cpuset” controller is hierarchical. That means the controller cannot use CPUs or memory nodes not allowed in its parent.
5.5.1 Cpuset Interface Files
cpuset.cpus
A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
It lists the requested CPUs to be used by tasks within this cgroup. The actual list of CPUs to be granted, however, is subjected to constraints imposed by its parent and can differ from the requested CPUs.
The CPU numbers are comma-separated numbers or ranges. For example:

```
# cat cpuset.cpus
0-4,6,8-10
```

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty “cpuset.cpus” or all the available CPUs if none is found.
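Writes use the same list syntax; a sketch (the cgroup name "rt" is hypothetical):

```shell
# Restrict tasks in the cgroup to CPUs 0-3 and CPU 6.
echo "0-3,6" > /sys/fs/cgroup/rt/cpuset.cpus
# The granted subset, constrained by the parent, appears here:
cat /sys/fs/cgroup/rt/cpuset.cpus.effective
```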
The value of “cpuset.cpus” stays constant until the next update and won’t be affected by any CPU hotplug events.
cpuset.cpus.effective
A read-only multiple values file which exists on all cpuset-enabled cgroups.
It lists the onlined CPUs that are actually granted to this cgroup by its parent. These CPUs are allowed to be used by tasks within the current cgroup.
If “cpuset.cpus” is empty, the “cpuset.cpus.effective” file shows all the CPUs from the parent cgroup that can be available to be used by this cgroup. Otherwise, it should be a subset of “cpuset.cpus” unless none of the CPUs listed in “cpuset.cpus” can be granted. In this case, it will be treated just like an empty “cpuset.cpus”.
Its value will be affected by CPU hotplug events.
cpuset.mems
A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
It lists the requested memory nodes to be used by tasks within this cgroup. The actual list of memory nodes granted, however, is subjected to constraints imposed by its parent and can differ from the requested memory nodes.
The memory node numbers are comma-separated numbers or ranges. For example:

```
# cat cpuset.mems
0-1,3
```

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty “cpuset.mems” or all the available memory nodes if none is found.
The value of “cpuset.mems” stays constant until the next update and won’t be affected by any memory nodes hotplug events.
cpuset.mems.effective
A read-only multiple values file which exists on all cpuset-enabled cgroups.
It lists the onlined memory nodes that are actually granted to this cgroup by its parent. These memory nodes are allowed to be used by tasks within the current cgroup.
If “cpuset.mems” is empty, it shows all the memory nodes from the parent cgroup that will be available to be used by this cgroup. Otherwise, it should be a subset of “cpuset.mems” unless none of the memory nodes listed in “cpuset.mems” can be granted. In this case, it will be treated just like an empty “cpuset.mems”.
Its value will be affected by memory nodes hotplug events.
cpuset.cpus.partition
A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
It accepts only the following input values when written to.
“root” - a partition root
“member” - a non-root member of a partition
When set to be a partition root, the current cgroup is the root of a new partition or scheduling domain that comprises itself and all its descendants except those that are separate partition roots themselves and their descendants. The root cgroup is always a partition root.
There are constraints on where a partition root can be set. It can only be set in a cgroup if all the following conditions are true.
- The “cpuset.cpus” is not empty and the list of CPUs are exclusive, i.e. they are not shared by any of its siblings.
- The parent cgroup is a partition root.
- The “cpuset.cpus” is also a proper subset of the parent’s “cpuset.cpus.effective”.
- There are no child cgroups with cpuset enabled. This is for eliminating corner cases that have to be handled if such a condition is allowed.
Setting it to partition root will take the CPUs away from the effective CPUs of the parent cgroup. Once it is set, this file cannot be reverted back to “member” if there are any child cgroups with cpuset enabled.
A parent partition cannot distribute all its CPUs to its child partitions. There must be at least one cpu left in the parent partition.
Once becoming a partition root, changes to “cpuset.cpus” are generally allowed as long as the first condition above is true, the change will not take away all the CPUs from the parent partition and the new “cpuset.cpus” value is a superset of its children’s “cpuset.cpus” values.
Sometimes, external factors like changes to ancestors’ “cpuset.cpus” or cpu hotplug can cause the state of the partition root to change. On read, the “cpuset.cpus.partition” file can show the following values.
“member” - Non-root member of a partition
“root” - Partition root
“root invalid” - Invalid partition root
It is a partition root if the first 2 partition root conditions above are true and at least one CPU from “cpuset.cpus” is granted by the parent cgroup.
A partition root can become invalid if none of the CPUs requested in “cpuset.cpus” can be granted by the parent cgroup or the parent cgroup is no longer a partition root itself. In this case, it is not a real partition even though the restriction of the first partition root condition above will still apply. The cpu affinity of all the tasks in the cgroup will then be associated with CPUs in the nearest ancestor partition.
An invalid partition root can be transitioned back to a real partition root if at least one of the requested CPUs can now be granted by its parent. In this case, the cpu affinity of all the tasks in the formerly invalid partition will be associated to the CPUs of the newly formed partition. Changing the partition state of an invalid partition root to “member” is always allowed even if child cpusets are present.
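Putting the conditions together, a hedged sketch of carving out a partition under the root cgroup (the cgroup name and CPU list are illustrative):

```shell
# Parent must itself be a partition root; the root cgroup always is.
mkdir -p /sys/fs/cgroup/isolated
# The CPUs must be exclusive and a proper subset of the parent's effective CPUs.
echo "2-3" > /sys/fs/cgroup/isolated/cpuset.cpus
# Turn the cgroup into a partition root, removing CPUs 2-3 from the parent.
echo root > /sys/fs/cgroup/isolated/cpuset.cpus.partition
# Read back the state; "root invalid" would indicate an unsatisfied condition.
cat /sys/fs/cgroup/isolated/cpuset.cpus.partition
```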
5.6 Device Controller

The device controller manages access to device files. It includes both the creation of new device files (using mknod) and access to existing device files.

The cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user creates a bpf program of type BPF_CGROUP_DEVICE and attaches it to the corresponding cgroup. On an attempt to access a device file, the corresponding BPF program is executed, and depending on its return value the attempt succeeds or fails with -EPERM.

A BPF_CGROUP_DEVICE program takes a pointer to a bpf_cgroup_dev_ctx structure, which describes the device access attempt: access type (mknod/read/write) and device (type, major and minor numbers). If the program returns 0, the attempt fails with -EPERM, otherwise it succeeds.

An example of a BPF_CGROUP_DEVICE program may be found in the kernel source tree at tools/testing/selftests/bpf/dev_cgroup.c.
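A hedged sketch of loading and attaching such a program with bpftool, assuming the selftest has been compiled to dev_cgroup.o (the object path, pin path, and cgroup name are illustrative):

```shell
# Load the device-controller program and pin it in bpffs.
bpftool prog load dev_cgroup.o /sys/fs/bpf/dev_cgroup type cgroup/dev
# Attach it to a cgroup; device-file access in that cgroup now consults the program.
bpftool cgroup attach /sys/fs/cgroup/app device pinned /sys/fs/bpf/dev_cgroup
```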
5.7 RDMA
The “rdma” controller regulates the distribution and accounting of RDMA resources.
5.7.1 RDMA Interface Files
rdma.max
A read-write nested-keyed file that exists for all the cgroups except root that describes current configured resource limit for a RDMA/IB device.
Lines are keyed by device name and are not ordered. Each line contains space separated resource name and its configured limit that can be distributed.
The following nested keys are defined.
| Key | Description |
| --- | --- |
| hca_handle | Maximum number of HCA Handles |
| hca_object | Maximum number of HCA Objects |
An example for mlx4 and ocrdma device follows:

```
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
```
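Limits are set by writing one device line at a time; a sketch matching the example output above (the cgroup name "app" is an assumption):

```shell
# Cap HCA handles/objects per device; "max" leaves a resource unlimited.
echo "mlx4_0 hca_handle=2 hca_object=2000" > /sys/fs/cgroup/app/rdma.max
echo "ocrdma1 hca_handle=3 hca_object=max" > /sys/fs/cgroup/app/rdma.max
```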
rdma.current
A read-only file that describes current resource usage. It exists for all the cgroup except root.
An example for mlx4 and ocrdma device follows:

```
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
```
5.8 HugeTLB
The HugeTLB controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault.
5.8.1 HugeTLB Interface Files
hugetlb.<hugepagesize>.current
Show current usage for “hugepagesize” hugetlb. It exists for all the cgroup except root.
hugetlb.<hugepagesize>.max
Set/show the hard limit of “hugepagesize” hugetlb usage. The default value is “max”. It exists for all the cgroup except root.
hugetlb.<hugepagesize>.events
A read-only flat-keyed file which exists on non-root cgroups.
max
The number of allocation failures due to the HugeTLB limit
hugetlb.<hugepagesize>.events.local
Similar to hugetlb.<hugepagesize>.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event generated on this file reflects only the local events.
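The limit is written in bytes; a sketch for 2MB huge pages, capping the cgroup at 10 pages (the cgroup name "app" is an assumption):

```shell
# 10 huge pages of 2MB each = 20971520 bytes.
echo $(( 10 * 2 * 1024 * 1024 )) > /sys/fs/cgroup/app/hugetlb.2MB.max
# Current usage for the same page size:
cat /sys/fs/cgroup/app/hugetlb.2MB.current
```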
5.9 Misc
5.9.1 perf_event
perf_event controller, if not mounted on a legacy hierarchy, is automatically enabled on the v2 hierarchy so that perf events can always be filtered by cgroup v2 path. The controller can still be moved to a legacy hierarchy after v2 hierarchy is populated.
5.10 Non-normative Information
This section contains information that isn’t considered to be a part of the stable kernel API and so is subject to change.
5.10.1 CPU controller root cgroup process behaviour
When distributing CPU cycles in the root cgroup each thread in this cgroup is treated as if it was hosted in a separate child cgroup of the root cgroup. This child cgroup weight is dependent on its thread nice level.
For details of this mapping see sched_prio_to_weight array in kernel/sched/core.c file (values from this array should be scaled appropriately so the neutral - nice 0 - value is 100 instead of 1024).
5.10.2 IO controller root cgroup process behaviour
Root cgroup processes are hosted in an implicit leaf child node. When distributing IO resources this implicit child node is taken into account as if it was a normal child cgroup of the root cgroup with a weight value of 200.
6 Namespaces

In container environments, cgroups together with other namespaces are used to isolate processes, but the /proc/$PID/cgroup file may leak potential system-level information. For example:

```
$ cat /proc/self/cgroup
0::/batchjobs/container_id1   # <-- the absolute cgroup path is system-level information that should not be exposed to the isolated processes
```

For this reason the cgroup namespace was introduced, abbreviated below as cgroupns (just as network namespace is abbreviated as netns).
6.1 Basics

cgroup namespace provides a mechanism to virtualize the view of the /proc/$PID/cgroup file and cgroup mount points. The clone(2) and unshare(2) system calls with the CLONE_NEWCGROUP clone flag create a new cgroup namespace. A process running inside such a namespace has its /proc/$PID/cgroup output confined to the cgroupns root, which is the cgroup of the process at the time the cgroup namespace was created.
Without a cgroup namespace, the /proc/$PID/cgroup file shows the complete path of the cgroup the process belongs to. When configuring a container, for example, a set of cgroups and namespaces is set up to isolate processes, yet /proc/$PID/cgroup may still leak potential system information to the isolated processes. For example:

```
# cat /proc/self/cgroup
0::/batchjobs/container_id1
```

The path /batchjobs/container_id1 is considered system data that should not be exposed to the isolated processes; cgroup namespaces are the way to hide such path information. For example, before a cgroup namespace is created:
```
# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
# cat /proc/self/cgroup
0::/batchjobs/container_id1
```
After creating a new namespace with unshare, only the root path is visible:

```
# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
# cat /proc/self/cgroup
0::/
```
For a multi-threaded process, when any one thread creates a new cgroupns with unshare, the whole process (all threads) enters the new cgroupns. This is natural for the v2 hierarchy, but for v1 it may be an undesired behavior.

Lifetime of a cgroup namespace: the namespace stays alive as long as either of the following holds:

- there are still live processes inside the namespace
- there are still objects in mounted filesystems pinned to this cgroup namespace

When the last process exits or the mounted filesystem is unmounted, the cgroup namespace is destroyed, but the cgroupns root and the real cgroups remain.
6.2 The Root and Views

The cgroupns root of a cgroup namespace is the cgroup in which the process calling unshare(2) resides. For example, if a process in the cgroup /batchjobs/container_id1 calls unshare, /batchjobs/container_id1 becomes the cgroupns root. For the init_cgroup_ns, this is the real root (/) cgroup.

The cgroupns root cgroup does not change even if the namespace-creating process later migrates to a different cgroup:

```
# ~/unshare -c   # unshare the cgroup namespace from within some cgroup
# cat /proc/self/cgroup
0::/
# mkdir sub_cgrp_1
# echo 0 > sub_cgrp_1/cgroup.procs
# cat /proc/self/cgroup
0::/sub_cgrp_1
```
Each process gets its namespace-specific view, which can be inspected via /proc/$PID/cgroup.

Processes running inside the namespace can only see their own root cgroup. From within an unshared cgroup namespace:

```
# sleep 100000 &
[1] 7353
# echo 7353 > sub_cgrp_1/cgroup.procs
# cat /proc/7353/cgroup
0::/sub_cgrp_1
```
From the initial cgroup namespace, the real cgroup path is visible:

```
$ cat /proc/7353/cgroup
0::/batchjobs/container_id1/sub_cgrp_1
```
From a sibling cgroup namespace (that is, a namespace rooted at a different cgroup), the cgroup path relative to that namespace's own root is shown. For instance, from a namespace rooted at /batchjobs/container_id2, process 7353 appears as:

```
# cat /proc/7353/cgroup
0::/../container_id1/sub_cgrp_1
```

Note that the relative path always starts with /, to indicate that it is relative to the cgroup namespace root of the caller.
6.3 Migrating Processes Between cgroup Namespaces
A process inside a cgroup namespace can move into and out of the namespace root if it has proper access to external cgroups. Suppose the namespace root is at /batchjobs/container_id1 and the global hierarchy is still accessible from inside the namespace:

```
# cat /proc/7353/cgroup
0::/sub_cgrp_1
# echo 7353 > batchjobs/container_id2/cgroup.procs
# cat /proc/7353/cgroup
0::/../container_id2
```

Note that this kind of setup is discouraged. A task inside a cgroup namespace should only be exposed to its own cgroupns hierarchy.
A process can also be migrated to another cgroup namespace with setns(2), provided that:

- the process has the CAP_SYS_ADMIN capability against its current user namespace
- the process has the CAP_SYS_ADMIN capability against the target cgroup namespace's user namespace

Attaching a process to another cgroup namespace does not imply any cgroup change. It is expected that the attaching process is moved to the root of the target cgroup namespace.
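As an illustrative sketch, nsenter(1) (which wraps setns(2)) can enter another process's cgroup namespace, given the capabilities above (PID 7353 follows the earlier examples):

```shell
# Run a command inside PID 7353's cgroup namespace; no cgroup migration
# happens implicitly, only the namespace view changes.
nsenter --cgroup --target 7353 -- cat /proc/self/cgroup
```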
6.4 Interaction with Other Namespaces

A namespace-specific cgroup hierarchy can be mounted by a process running inside a non-init cgroup namespace:

```
# mount -t <fstype> <device> <dir>
$ mount -t cgroup2 none $MOUNT_POINT
```

This mounts the default unified cgroup hierarchy (i.e. the filesystem view), with the cgroup namespace root as the filesystem root. The operation requires the CAP_SYS_ADMIN capability against the user and mount namespaces.
The virtualization of the /proc/self/cgroup file, combined with the restricted view of the cgroup hierarchy (via a namespace-private cgroupfs mount), provides an isolated cgroup view inside the container.
7 Information on Kernel Programming
This section contains kernel programming information in the areas where interacting with cgroup is necessary. cgroup core and controllers are not covered.
7.1 Filesystem Support for Writeback
A filesystem can support cgroup writeback by updating address_space_operations->writepages to annotate bio’s using the following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and associates the bio with the inode’s owner cgroup and the corresponding request queue. This must be called after a queue (device) has been associated with the bio and before submission.
wbc_account_cgroup_owner(@wbc, @page, @bytes)
Should be called for each data segment being written out. While this function doesn’t care exactly when it’s called during the writeback session, it’s the easiest and most natural to call it as data segments are added to a bio.
With writeback bio’s annotated, cgroup support can be enabled per super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for selective disabling of cgroup writeback support which is helpful when certain filesystem features, e.g. journaled data mode, are incompatible.
wbc_init_bio() binds the specified bio to its cgroup. Depending on the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem cases by skipping wbc_init_bio() and using bio_associate_blkg() directly.
8 Deprecated v1 Core Features

- Multiple hierarchies are not supported
- All v1 mount options are not supported
- The “tasks” file is removed; use “cgroup.procs”, which is not sorted
- The “cgroup.clone_children” file is removed
- /proc/cgroups is meaningless for v2; use the “cgroup.controllers” file at the root instead
9 Issues with v1 and Rationales for v2
9.1 Issues with Multiple Hierarchies

v1 allowed an arbitrary number of hierarchies, and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, it wasn’t useful in practice.
For example, as there is only one instance of each controller, utility type controllers such as freezer which can be useful in all hierarchies could only be used in one. The issue is exacerbated by the fact that controllers couldn’t be moved to another hierarchy once hierarchies were populated. Another issue was that all controllers bound to a hierarchy were forced to have exactly the same view of the hierarchy. It wasn’t possible to vary the granularity depending on the specific controller.
In practice, these issues heavily limited which controllers could be put on the same hierarchy and most configurations resorted to putting each controller on its own hierarchy. Only closely related ones, such as the cpu and cpuacct controllers, made sense to be put on the same hierarchy. This often meant that userland ended up managing multiple similar hierarchies repeating the same steps on each hierarchy whenever a hierarchy management operation was necessary.
Furthermore, support for multiple hierarchies came at a steep cost. It greatly complicated cgroup core implementation but more importantly the support for multiple hierarchies restricted how cgroup could be used in general and what controllers was able to do.
There was no limit on how many hierarchies there might be, which meant that a thread’s cgroup membership couldn’t be described in finite length. The key might contain any number of entries and was unlimited in length, which made it highly awkward to manipulate and led to addition of controllers which existed only to identify membership, which in turn exacerbated the original problem of proliferating number of hierarchies.
Also, as a controller couldn’t have any expectation regarding the topologies of hierarchies other controllers might be on, each controller had to assume that all other controllers were attached to completely orthogonal hierarchies. This made it impossible, or at least very cumbersome, for controllers to cooperate with each other.
In most use cases, putting controllers on hierarchies which are completely orthogonal to each other isn’t necessary. What usually is called for is the ability to have differing levels of granularity depending on the specific controller. In other words, hierarchy may be collapsed from leaf towards root when viewed from specific controllers. For example, a given configuration might not care about how memory is distributed beyond a certain level while still wanting to control how CPU cycles are distributed.
9.2 Thread Granularity
cgroup v1 allowed threads of a process to belong to different cgroups. This didn’t make sense for some controllers and those controllers ended up implementing different ways to ignore such situations but much more importantly it blurred the line between API exposed to individual applications and system management interface.
Generally, in-process knowledge is available only to the process itself; thus, unlike service-level organization of processes, categorizing threads of a process requires active participation from the application which owns the target process.
cgroup v1 had an ambiguously defined delegation model which got abused in combination with thread granularity. cgroups were delegated to individual applications so that they can create and manage their own sub-hierarchies and control resource distributions along them. This effectively raised cgroup to the status of a syscall-like API exposed to lay programs.
First of all, cgroup has a fundamentally inadequate interface to be exposed this way. For a process to access its own knobs, it has to extract the path on the target hierarchy from /proc/self/cgroup, construct the path by appending the name of the knob to the path, open and then read and/or write to it. This is not only extremely clunky and unusual but also inherently racy. There is no conventional way to define transaction across the required steps and nothing can guarantee that the process would actually be operating on its own sub-hierarchy.
cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details. These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.
This was painful for both userland and the kernel. Userland ended up with misbehaving and poorly abstracted interfaces, and the kernel found itself inadvertently exposing, and locked into, such constructs.
9.3 Competition Between Inner Nodes and Threads
cgroup v1 allowed threads to be in any cgroup, which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.
The cpu controller considered threads and cgroups as equivalents and mapped nice levels to cgroup weights. This worked for some cases but fell flat when children wanted to be allocated specific ratios of CPU cycles and the number of internal threads fluctuated - the ratios constantly changed as the number of competing entities fluctuated. There also were other issues. The mapping from nice level to weight wasn’t obvious or universal, and there were various other knobs which simply weren’t available for threads.
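The ratio instability described above can be made concrete with a toy calculation. The numbers below are illustrative assumptions, not kernel code: they rely on the v1 behavior that a nice-0 thread competes at roughly the same default weight (1024) as a default cgroup at the same level.

```python
# Toy model of parent-internal threads competing with child cgroups under
# the v1 cpu controller: a child's effective share depends on how many
# internal threads happen to be runnable, so configured ratios drift.

def child_share(child_shares: int, sibling_shares: list[int],
                internal_threads: int, thread_weight: int = 1024) -> float:
    """Fraction of CPU the child gets at its level of the hierarchy."""
    total = child_shares + sum(sibling_shares) + internal_threads * thread_weight
    return child_shares / total

# A child configured for 2/3 of the CPU against a 1024-share sibling:
print(round(child_share(2048, [1024], internal_threads=0), 3))  # 0.667
# Same configuration, but 5 nice-0 threads are runnable in the parent:
print(round(child_share(2048, [1024], internal_threads=5), 3))  # 0.25
```

As the number of competing internal threads fluctuates, the child's share swings with it, which is exactly why fixed ratios could not be guaranteed.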
The io controller implicitly created a hidden leaf node for each cgroup to host the threads. The hidden leaf had its own copies of all the knobs with leaf_ prefixed. While this allowed equivalent control over internal threads, it came with serious drawbacks: it always added an extra layer of nesting which wouldn’t be necessary otherwise, made the interface messy and significantly complicated the implementation.
The memory controller didn’t have a way to control what happened between internal tasks and child cgroups and the behavior was not clearly defined. There were attempts to add ad-hoc behaviors and knobs to tailor the behavior to specific workloads which would have led to problems extremely difficult to resolve in the long term.
Multiple controllers struggled with internal tasks and came up with different ways to deal with it; unfortunately, all the approaches were severely flawed and, furthermore, the widely different behaviors made cgroup as a whole highly inconsistent.
This clearly is a problem which needs to be addressed from cgroup core in a uniform way.
9.4 Other Interface Issues
cgroup v1 grew without oversight and developed a large number of idiosyncrasies and inconsistencies. One issue on the cgroup core side was how an empty cgroup was notified: a userland helper binary was forked and executed for each event. The event delivery wasn’t recursive or delegatable, and the limitations of the mechanism also led to an in-kernel event-delivery filtering mechanism, further complicating the interface.
Controller interfaces were problematic too. An extreme example is controllers completely ignoring hierarchical organization and treating all cgroups as if they were all located directly under the root cgroup. Some controllers exposed a large amount of inconsistent implementation details to userland.
There also was no consistency across controllers. When a new cgroup was created, some controllers defaulted to not imposing extra restrictions while others disallowed any resource usage until explicitly configured. Configuration knobs for the same type of control used widely differing naming schemes and formats. Statistics and information knobs were named arbitrarily and used different formats and units even in the same controller.
cgroup v2 establishes common conventions where appropriate and updates controllers so that they expose minimal and consistent interfaces.
9.5 Controller Issues and Remedies
9.5.1 Memory
The original lower boundary, the soft limit, is defined as a limit that is unset by default. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless of where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft-limit reclaim pass is so aggressive that it not only introduces high allocation latencies into the system but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it’s within its effective low, which makes delegation of subtrees possible. It also enjoys having reclaim pressure proportional to its overage when above its effective low.
The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources.
The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found.
In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it’s mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications.
Setting the original memory.limit_in_bytes below the current usage was subject to a race condition, where concurrent charges could cause the limit setting to fail. memory.max on the other hand will first set the limit to prevent new charges, and then reclaim and OOM kill until the new limit is met - or the task writing to memory.max is killed.
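The two-step semantics of writing to memory.max can be illustrated with a simplified model. This is a sketch of the described behavior, not kernel code; the function name, byte quantities, and event list are invented for the example, and real reclaim and OOM selection are far more involved.

```python
# Model of lowering memory.max: the limit takes effect first (so concurrent
# charges can no longer race past it), then reclaim and OOM kills work the
# existing usage down until it fits under the new limit.

def set_max(usage: int, new_limit: int, reclaimable: int,
            task_sizes: list[int]) -> tuple[int, list[tuple[str, int]]]:
    limit = new_limit          # step 1: new charges beyond `limit` now fail
    events = []
    if usage > limit:          # step 2: reclaim works off what it can
        freed = min(usage - limit, reclaimable)
        usage -= freed
        events.append(("reclaim", freed))
    tasks = list(task_sizes)   # step 3: OOM-kill until usage fits
    while usage > limit and tasks:
        victim = tasks.pop()
        usage -= victim
        events.append(("oom_kill", victim))
    return usage, events

# 900 bytes in use, limit lowered to 400, only 200 reclaimable:
print(set_max(usage=900, new_limit=400, reclaimable=200,
              task_sizes=[300, 350]))
```

Contrast this with the v1 behavior, where the write itself simply failed if concurrent charges kept usage above the requested limit.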
The combined memory+swap accounting and limiting is replaced by real control over swap space.
The main argument for a combined memory+swap facility in the original cgroup design was that global or parental pressure would always be able to swap all anonymous memory of a child group, regardless of the child’s own (possibly untrusted) configuration. However, untrusted groups can sabotage swapping by other means - such as referencing their anonymous memory in a tight loop - and an admin cannot assume full swappability when overcommitting untrusted jobs.
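The tight-loop sabotage mentioned above can be sketched as follows. The buffer size, page size, and pass count are made-up illustration values; the point is only that continuously re-referencing anonymous pages keeps them looking "hot" to reclaim, so they are never good swap-out candidates.

```python
# Sketch of a task defeating swap-out by keeping its anonymous memory
# referenced: dirtying one byte per page on every pass keeps each page
# recently used from the kernel's point of view.

PAGE = 4096
buf = bytearray(64 * PAGE)       # anonymous memory (64 pages, for illustration)

touched = 0
for _ in range(3):               # a real saboteur would loop indefinitely
    for off in range(0, len(buf), PAGE):
        buf[off] ^= 1            # one write per page keeps the page active
        touched += 1
print(touched)                   # 192 references: 64 pages x 3 passes
```

Because such behavior is always possible from inside the group, the combined memory+swap counter never actually delivered the swappability guarantee it was designed around.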
For trusted jobs, on the other hand, a combined counter is not an intuitive userspace interface, and it flies in the face of the idea that cgroup controllers should account and limit specific physical resources. Swap space is a resource like all others in the system, and that’s why unified hierarchy allows distributing it separately.