7.5.5 cgroups

Introduction

Control groups, usually called cgroups, are a kernel feature that lets you group processes and apply resource limits, priorities, and accounting to the entire group. They are one of the key building blocks behind containers, because they make it possible to confine how much CPU, memory, I/O, and other resources a set of processes can use, without needing separate virtual machines.

From a practical point of view, cgroups provide a set of controllers that manage different kinds of resources, and a unified kernel interface that exposes those controllers. User space tools and higher level systems like systemd, Docker, and Kubernetes build on this interface.

A cgroup is a kernel construct that associates one or more processes with a set of resource controllers, which apply limits, priorities, and accounting to that group of processes.

In this chapter we will focus on what is unique to cgroups as an internal kernel mechanism, how they are structured, how controllers work, and how they relate to containers and other isolation features.

cgroup v1 and cgroup v2

Linux has two major generations of the cgroup interface. Many systems still run both in parallel in a so called hybrid setup, but new designs and current distributions prefer the second version, cgroup v2.

In cgroup v1, each controller is mounted as its own virtual filesystem, or as a combination of controllers on a single mount. This leads to a forest of unrelated hierarchies, one per controller or controller set. For example, you might have one hierarchy for CPU control and another for memory control, with different group layouts and independent paths. This flexibility made the interface complex and awkward to manage programmatically.

In cgroup v2, often called the unified hierarchy, there is a single cgroup filesystem mount. All enabled controllers share the same directory tree. A cgroup directory represents one group of processes, and all active controllers apply to that group. This makes it easier to reason about what resources are applied to which processes, and it enforces consistent rules on how leaf and internal nodes can contain processes.

The Linux kernel allows you to configure which mode is used. Modern distributions usually mount cgroup v2 by default, but some controllers might still rely on v1 for compatibility. Tools like systemd can manage both, but increasingly target the unified hierarchy.

Key distinction: cgroup v1 can have multiple independent hierarchies per controller, while cgroup v2 uses a single unified hierarchy that all controllers share.

Hierarchies, cgroups, and processes

Internally, cgroups are organized into a tree. At the top of the tree, the root cgroup contains every process by default. Each subdirectory in the cgroup filesystem represents a child cgroup. Processes can be moved between cgroups by writing their process IDs into special files in these directories.

Each cgroup node in the hierarchy can contain:

Text files that expose controller settings and statistics.

A list of processes that belong to that cgroup.

Child cgroups represented as subdirectories.

When a process is placed into a cgroup, any children it creates remain in that same cgroup unless they are explicitly moved. This makes it natural to manage entire process trees by placing the initial process into a group and letting inheritance handle the rest.

In cgroup v2, there is an important rule about where processes can live relative to controllers. A cgroup that enables controllers for its children, via cgroup.subtree_control, normally cannot hold processes itself: it acts as an internal node that groups child cgroups, while the leaf cgroups hold the processes. This is often called the no internal process rule. There is also a threaded mode to support certain use cases, but the general idea is to keep a clean separation between internal nodes that distribute control and leaves that contain processes. This structure resolves many ambiguities that existed in v1.

From a user space perspective, you typically interact with the cgroup hierarchies through paths under /sys/fs/cgroup or via higher level managers. Internally, the kernel tracks which cgroup each task belongs to, and each controller consults that association when making scheduling or accounting decisions.
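As a small illustration, the following Python sketch assumes a cgroup v2 mount at /sys/fs/cgroup. It reads /proc/self/cgroup to find which cgroup the current process belongs to and resolves it to a directory under the unified mount.

```python
# Minimal sketch: locate the cgroup (v2) of the current process.
# Assumes the unified hierarchy is mounted at /sys/fs/cgroup.

CGROUP_ROOT = "/sys/fs/cgroup"

def current_cgroup_path():
    # On cgroup v2 the file contains a line of the form "0::/some/path".
    with open("/proc/self/cgroup") as f:
        for line in f:
            hier_id, _controllers, path = line.strip().split(":", 2)
            if hier_id == "0":          # the unified (v2) hierarchy
                return CGROUP_ROOT + path
    return None                          # no v2 hierarchy found

if __name__ == "__main__":
    print("current cgroup directory:", current_cgroup_path())
```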

Controllers and resource domains

Cgroups by themselves are just groupings. The interesting behavior comes from controllers that attach to the cgroup hierarchy. Each controller understands a specific resource domain and provides knobs and counters for that domain.

Common controllers include CPU, memory, I/O, PIDs, and others. Each controller exports controls in a defined set of files within each cgroup directory. For example, the memory controller might provide files that let you set a memory limit, another that shows current usage, and another that reports failures due to hitting a limit.
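As a hedged example, the sketch below reads the v2 memory controller files for one cgroup; the directory /sys/fs/cgroup/mygroup is purely illustrative.

```python
# Sketch: inspect memory usage and limit of one cgroup (cgroup v2).
# The example path is hypothetical; adjust it to a cgroup that exists on your system.

CGROUP = "/sys/fs/cgroup/mygroup"            # hypothetical cgroup directory

def read_value(name):
    with open(f"{CGROUP}/{name}") as f:
        return f.read().strip()

usage = int(read_value("memory.current"))    # current usage in bytes
limit = read_value("memory.max")             # a byte count, or "max" for no limit

print(f"usage: {usage} bytes, limit: {limit}")
```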

When you set a limit or configuration in one of these files, the controller logic in the kernel applies that setting to the cgroup and propagates the constraints down the tree according to its semantics. For instance, a memory limit on a parent cgroup becomes an upper bound for the aggregate of its children. Controllers must respect the hierarchy, so a child cgroup can be further restricted but cannot exceed its parent.

Hierarchy rule: for any resource controller, a child cgroup cannot exceed the effective limit of its parent. Effective limits are therefore monotone nonincreasing as you move from the root toward the leaves, even if a child's configured value is larger than its parent's.
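The rule can be made concrete with a small sketch that walks from a cgroup up to the root and takes the minimum of all memory.max values along the way; the starting path is again a hypothetical example.

```python
# Sketch: compute the effective memory limit of a cgroup under the hierarchy rule.
# The effective limit is the minimum memory.max of the cgroup and all of its ancestors.
import os

CGROUP_ROOT = "/sys/fs/cgroup"

def effective_memory_limit(cgroup_dir):
    limit = float("inf")
    path = os.path.abspath(cgroup_dir)
    while path.startswith(CGROUP_ROOT):
        max_file = os.path.join(path, "memory.max")
        if os.path.exists(max_file):          # the root cgroup has no memory.max file
            with open(max_file) as f:
                value = f.read().strip()
            if value != "max":                # "max" means unlimited at this level
                limit = min(limit, int(value))
        if path == CGROUP_ROOT:
            break
        path = os.path.dirname(path)
    return limit                              # inf means no limit anywhere on the path

print(effective_memory_limit("/sys/fs/cgroup/mygroup/child"))  # hypothetical path
```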

Each controller can also provide statistics and events. For example, a memory controller might emit an event when a cgroup hits its limit, or a CPU controller might track the exact amount of CPU time used. These statistics are crucial for orchestration systems that need to monitor real usage and adjust configurations dynamically.

CPU control and scheduling

The CPU controller integrates with the scheduler to control which processes run and for how long. At the kernel level, the scheduler assigns runnable tasks to CPUs based on priorities. With cgroups, the scheduler considers not only individual processes, but also their group membership and each group’s weight or quota.

There are two central concepts in CPU control.

A proportional share, often called weight, determines how much CPU time a cgroup receives relative to other sibling cgroups. If one group has weight $w_1$ and another has $w_2$, then over time, the CPU time each receives is approximately in proportion to those weights. The share of CPU time for group 1 compared to the total of both is

$$
\text{share}_1 = \frac{w_1}{w_1 + w_2}.
$$

The absolute timeslices still depend on total demand and number of CPUs, but the scheduler uses these ratios to guide distribution.
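A small numeric sketch of the proportional share idea, using the cgroup v2 convention that cpu.weight defaults to 100, with two hypothetical sibling groups:

```python
# Sketch: proportional CPU shares among sibling cgroups under contention.
# In cgroup v2, cpu.weight ranges from 1 to 10000 and defaults to 100.

weights = {"batch": 100, "web": 300}          # hypothetical sibling cgroups

total = sum(weights.values())
for name, w in weights.items():
    print(f"{name}: {w / total:.0%} of CPU time when both are busy")
# batch: 25% of CPU time when both are busy
# web: 75% of CPU time when both are busy
```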

A quota, combined with a period, defines a hard cap on CPU usage within each accounting window. The period is the length of the window in microseconds, and the quota is how much CPU time the cgroup may consume within that window. The effective fraction of one CPU that a cgroup can use is

$$
\text{CPU fraction} = \frac{\text{quota}}{\text{period}}.
$$

For example, if the period is 100000 microseconds (0.1 seconds) and the quota is 50000, the cgroup can use at most half of one CPU on average within each period.

The scheduler considers these parameters when placing tasks on run queues. If a cgroup exceeds its quota during the current period, its tasks are throttled until the next period begins. This is critical for container platforms that need to enforce CPU budgets per container.
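In cgroup v2, the quota and period live together in the cpu.max file, whose first field is the quota (or the word max for no cap) and whose second field is the period in microseconds. The sketch below, with a hypothetical cgroup path, reads that file and computes the allowed CPU fraction.

```python
# Sketch: read cpu.max and compute the CPU fraction it allows (cgroup v2).
# The cgroup path is hypothetical; writing cpu.max requires appropriate privileges.

CGROUP = "/sys/fs/cgroup/mygroup"

with open(f"{CGROUP}/cpu.max") as f:
    quota_str, period_str = f.read().split()

if quota_str == "max":
    print("no CPU cap configured")
else:
    fraction = int(quota_str) / int(period_str)
    print(f"capped at {fraction:.2f} CPUs")   # e.g. 50000 / 100000 -> 0.50 CPUs

# To cap the group at half a CPU, one would write "50000 100000" into cpu.max,
# for example: open(f"{CGROUP}/cpu.max", "w").write("50000 100000")
```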

Memory control and reclaim

The memory controller tracks how much memory a cgroup’s processes use and tries to keep that usage within configured limits. Internally, it cooperates with the kernel memory management subsystem that already handles page allocation, caching, and swapping.

Each memory controller instance maintains counters for anonymous pages, file cache pages, kernel memory charged to the cgroup, and sometimes swap usage. When a process in a cgroup allocates memory, the controller charges that usage to the cgroup and updates its counters. If a limit is in place and the allocation would push the cgroup over that limit, the controller triggers reclaim for that group.

Reclaim operates by identifying pages associated with that cgroup that can be freed. This usually starts with the reclaimable page cache charged to that group. If that is not enough, and depending on configuration, the kernel may swap out anonymous pages belonging to the cgroup, or, if usage still cannot be brought under a hard limit, kill processes in that cgroup with an out of memory decision that is confined to the group.

This behavior isolates memory pressure so that one group experiencing heavy allocations mostly disturbs its own caches and processes instead of the entire system. From the kernel’s point of view, global memory pressure is still a concern, but cgroup limits provide an inner accounting layer that refines decisions.

The memory controller also tracks events such as crossing the low and high thresholds or hitting the hard limit, and exposes files that record how often reclaim and out of memory handling occurred. Orchestration layers read these values to decide whether to adjust limits, reschedule work, or move processes.
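For example, the v2 memory.events file is a flat list of key and value pairs; the sketch below, again with a hypothetical cgroup path, parses it into a dictionary that a monitoring layer could act on. The exact set of keys varies between kernel versions.

```python
# Sketch: parse memory.events (cgroup v2) into a dict of counters.
# Typical keys include low, high, max, oom, and oom_kill.

CGROUP = "/sys/fs/cgroup/mygroup"            # hypothetical cgroup directory

def read_memory_events(cgroup_dir):
    events = {}
    with open(f"{cgroup_dir}/memory.events") as f:
        for line in f:
            key, value = line.split()
            events[key] = int(value)
    return events

events = read_memory_events(CGROUP)
if events.get("oom_kill", 0) > 0:
    print("processes in this cgroup have been OOM-killed")
```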

I/O, PIDs, and other controllers

Beyond CPU and memory, cgroups host a variety of other controllers. Each brings its own rules and integration points with kernel subsystems.

The I/O controller interacts with block devices and I/O schedulers. It can enforce bandwidth caps or I/O operation limits per cgroup on specific disks. Internally, the kernel maintains per cgroup queues or tokens tied to devices. When a process in a cgroup submits I/O requests, those requests are tagged and then scheduled according to the cgroup’s configured rates or weights. This avoids pathological scenarios like a single container saturating a disk with write requests and starving the rest of the system.
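As a hedged example, in cgroup v2 the io.max file accepts lines that name a device by major and minor number followed by rbps, wbps, riops, or wiops settings. The device numbers, the limit, and the cgroup path below are assumptions, and the write requires root.

```python
# Sketch: cap write bandwidth for a cgroup on one block device (cgroup v2 io.max).
# "8:0" is a hypothetical device (major:minor); requires root and the io controller enabled.

CGROUP = "/sys/fs/cgroup/mygroup"

# Limit writes to 10 MiB/s on device 8:0; fields not mentioned keep their current value.
with open(f"{CGROUP}/io.max", "w") as f:
    f.write("8:0 wbps=10485760\n")
```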

The PIDs controller limits how many processes and threads a cgroup can contain. It does this by maintaining a counter of tasks in the group, including tasks in descendant cgroups. When a process in that cgroup calls fork or clone, the kernel consults the PIDs controller. If creating a new task would exceed the configured limit, the kernel denies the creation and returns an error code to the caller. This provides a defense against fork bombs that could otherwise consume all process slots on the system. A short sketch of the v2 PIDs controller interface follows; the cgroup path is an assumption and writing pids.max requires appropriate privileges.
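```python
# Sketch: limit the number of tasks in a cgroup and inspect the current count (cgroup v2).

CGROUP = "/sys/fs/cgroup/mygroup"            # hypothetical cgroup directory

# Allow at most 128 tasks (processes and threads) in this cgroup and its descendants.
with open(f"{CGROUP}/pids.max", "w") as f:
    f.write("128")

with open(f"{CGROUP}/pids.current") as f:
    print("tasks currently in the group:", int(f.read()))

# Once the limit is reached, fork()/clone() in the group fails, typically with EAGAIN.
```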

Other controllers, such as those for huge pages, network class, or device access, plug into their respective kernel areas. For example, a device controller can associate a set of allowed major and minor device numbers with a cgroup and have the kernel check those rules whenever a process tries to open or create device nodes. This lets user space define which devices a container can see and use.

The cgroup filesystem interface

All the configuration of cgroups occurs through special pseudo filesystems that the kernel exposes, not through traditional syscalls. At the core, there is a cgroup filesystem that behaves like a tree of directories and files, backed by internal kernel objects instead of disk blocks.

On systems using cgroup v2 as the unified hierarchy, this filesystem is usually mounted at /sys/fs/cgroup. Inside that mount you see a tree of directories representing cgroups, and a fixed set of files in each directory, such as cgroup.procs, cgroup.subtree_control, and specific controller files like memory.max or cpu.max.

Writing a process ID into cgroup.procs moves that process into the cgroup represented by the directory that contains the file. The kernel then updates the internal association between that task and the cgroup object. Likewise, writing controller parameters into controller specific files updates the controller’s configuration for that cgroup.
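A minimal sketch, assuming a hypothetical target cgroup that already exists and write permission on its cgroup.procs file, looks like this:

```python
# Sketch: move the calling process into a cgroup by writing its PID to cgroup.procs.
import os

CGROUP = "/sys/fs/cgroup/mygroup"            # hypothetical, must already exist

with open(f"{CGROUP}/cgroup.procs", "w") as f:
    f.write(str(os.getpid()))                # all threads of the process move together
```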

The unified hierarchy introduces a mechanism to enable or disable controllers for subtrees by writing to cgroup.subtree_control. When you enable a controller in a parent, that controller becomes available in its children. This activation mechanism ensures that controllers are only active on parts of the tree where they are intended to be used, and that internal nodes and leaf nodes observe the required constraints of the v2 model.
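Enabling a controller is also just a write. The sketch below makes the cpu and memory controllers available to the children of a hypothetical parent cgroup, again assuming sufficient privileges.

```python
# Sketch: enable the cpu and memory controllers for the children of a parent cgroup.
# Controllers listed in cgroup.controllers can be enabled with "+name", disabled with "-name".

PARENT = "/sys/fs/cgroup/mygroup"            # hypothetical parent cgroup

with open(f"{PARENT}/cgroup.controllers") as f:
    print("controllers available here:", f.read().strip())

with open(f"{PARENT}/cgroup.subtree_control", "w") as f:
    f.write("+cpu +memory")                  # children now expose cpu.* and memory.* files
```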

Reading files under the cgroup filesystem provides statistics and status. For example, memory controllers typically export memory.current and memory.max for usage and limits, and CPU controllers export cpu.stat for accumulated times. The format is text, which makes direct inspection convenient, although serious orchestration software reads and parses these fields programmatically.
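These files share a simple line oriented text format, so parsing them is straightforward. As an example, cpu.stat can be turned into a dictionary like this, with the usual v2 field names such as usage_usec; the cgroup path is hypothetical.

```python
# Sketch: parse cpu.stat (cgroup v2), which contains lines like "usage_usec 1234567".

CGROUP = "/sys/fs/cgroup/mygroup"            # hypothetical cgroup directory

with open(f"{CGROUP}/cpu.stat") as f:
    stats = dict(line.split() for line in f)

print("total CPU time (s):", int(stats["usage_usec"]) / 1_000_000)
print("times throttled:", stats.get("nr_throttled", "n/a"))
```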

Interaction with namespaces and containers

Although cgroups and namespaces are separate kernel features, they are combined heavily in container systems. Namespaces isolate what a process can see, like process IDs, mounts, and network interfaces. Cgroups control how much of each resource the process can use. Together they create an environment that looks and feels like a separate machine from inside, yet shares the same kernel.

In a typical container platform, each container corresponds to a cgroup subtree. The container runtime creates a cgroup for the container, configures CPU, memory, and other resources according to the desired limits, and then starts the container's processes inside that cgroup. The kernel enforces limits via cgroup controllers, while namespaces ensure those processes do not see or affect unrelated processes and mounts.
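To make this division of labor concrete, here is a simplified sketch of the cgroup side of what a runtime does for one container. Real runtimes do considerably more, including namespace setup, device rules, and error handling, and the paths, limits, and PID below are assumptions.

```python
# Simplified sketch: create a per-container cgroup, apply limits, and attach a process.
# Requires root; assumes cgroup v2 at /sys/fs/cgroup with controllers enabled in the parent.
import os

def setup_container_cgroup(name, pid, cpu_quota_us=50000, cpu_period_us=100000,
                           memory_bytes=256 * 1024 * 1024, max_pids=256):
    path = f"/sys/fs/cgroup/{name}"
    os.makedirs(path, exist_ok=True)                     # create the cgroup

    def write(filename, value):
        with open(os.path.join(path, filename), "w") as f:
            f.write(str(value))

    write("cpu.max", f"{cpu_quota_us} {cpu_period_us}")  # hard CPU cap
    write("memory.max", memory_bytes)                    # hard memory limit
    write("pids.max", max_pids)                          # task count limit
    write("cgroup.procs", pid)                           # move the container's init process in

# setup_container_cgroup("demo-container", pid=12345)    # hypothetical PID
```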

Systemd integrates tightly with cgroups as well. Each systemd unit, such as a service, gets its own cgroup. Systemd becomes a top level manager for the cgroup hierarchy, creating and destroying cgroups as services start and stop. It uses the cgroup filesystem to adjust resource limits of services, and you can see this layout in the directories under /sys/fs/cgroup where each unit has its own path.

The important point from an internals angle is that cgroups are the underlying resource accounting and limiting layer. Namespaces, init systems, and container runtimes all interface with this layer, but they do not replace it. They rely on cgroups to enforce the constraints that make container style isolation practical.

cgroup life cycle and overhead

From the kernel’s perspective, cgroups are reference counted objects. When a user space manager creates a new cgroup directory, the kernel allocates the necessary structures for that level in the hierarchy. As processes are attached, references increase, and controllers start tracking usage. When a cgroup is emptied and its directory is removed from the filesystem, the kernel removes its internal representation once no more references remain.
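Seen from user space, this life cycle is just a pair of directory operations, as in the sketch below; the rmdir only succeeds once the cgroup contains no more processes.

```python
# Sketch: cgroup life cycle from user space (cgroup v2, requires appropriate privileges).
import os

path = "/sys/fs/cgroup/transient-group"      # hypothetical cgroup

os.mkdir(path)      # kernel allocates the cgroup and its controller state
# ... attach processes, set limits, run the workload ...
os.rmdir(path)      # only succeeds when the cgroup is empty; the kernel frees the
                    # internal representation once the last reference is dropped
```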

Each controller adds some overhead in terms of bookkeeping. For example, the memory controller maintains per cgroup counters and possibly per page charge metadata. The CPU controller needs per cgroup scheduling statistics. For a small number of cgroups, this overhead is negligible. However, building systems with extremely deep or wide cgroup trees, or with very frequent creation and destruction of cgroups, can increase accounting costs.

Kernel developers design controllers to scale to typical container workloads, where you might have dozens or hundreds of cgroups. For very large scale use, tradeoffs appear between granularity of control and monitoring costs. This is part of why the unified hierarchy simplifies things. By keeping a single tree and aligning controllers, the kernel can optimize traversal and reduce duplicated structures.

From a behavior standpoint, once controllers are configured, they operate continuously in the background. There is no active polling loop in user space that enforces limits. Instead, the kernel hooks into allocation paths, scheduler paths, or I/O submission paths and consults cgroup data there. This makes enforcement efficient and reliable, even if no user space component is running.

Summary

cgroups are a fundamental Linux kernel feature that group processes and apply resource control policies to those groups. They form a hierarchical tree, with each cgroup represented as a directory in a special filesystem, and each process can belong to one cgroup per hierarchy. Controllers attach to this tree and implement specific resource domains, such as CPU, memory, I/O, and PIDs.

The move from cgroup v1 to v2 unified the hierarchy and made control more consistent across controllers. Internally, the kernel tightly integrates cgroups with the scheduler, memory manager, I/O subsystem, and process management, so that resource limits and accounting are enforced directly where resources are allocated and consumed.

Cgroups, in combination with namespaces and higher level managers like systemd and container runtimes, provide the building blocks for modern workload isolation on Linux. Understanding their structure and internal behavior is key to understanding how Linux enforces resource constraints and supports containers at scale.
