Features/ControlGroups

ControlGroups

Summary

Control Groups consist of two parts:

* an upstream kernel feature that allows system resources to be partitioned among individual processes or groups of processes.
* user-space tools that handle the kernel control groups mechanism. We want to improve them where necessary and feasible, and to create new ones, e.g. to create or modify cgroups configuration or to display control groups data (using the libcgroup package).

Owner

Linda Wang
- email: lwang@redhat.com
Nils Philippsen
- email: nphilipp@redhat.com
Ivana Varekova
- email: varekova@redhat.com
Jan Šafránek
- email: jsafrane@redhat.com

Current status

Targeted release: Fedora 11
- kernel part:

  * Overall CGROUP infrastructure [completed, in Fedora 10]
  * Sub-CGROUP features:
    * CPUSET [completed, in Fedora 10]
    * CPUACCT [completed, in Fedora 10]
    * MEMCTL [completed, in Fedora 10]
    * DEVICE [completed, in Fedora 10]
    * NETWORKING [new, targeted for Fedora 11]

- tools part

  * man-pages
  * code review
  * spec-file cleanup
  * make file cleanup
  * ps option for cgroups
  * new tools

Last updated: 2009-02-19
Percentage of completion: 65%

Detailed Description

Kernel Part

Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a set of parameters for one or more subsystems.

A *subsystem* is a module that makes use of the task grouping facilities provided by cgroups to treat groups of tasks in particular ways. A subsystem is typically a "resource controller" that schedules a resource or applies per-cgroup limits, but it may be anything that wants to act on a group of processes, e.g. a virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached to each cgroup in the hierarchy. Each hierarchy has an instance of the cgroup virtual filesystem associated with it.

At any one time there may be multiple active hierarchies of task cgroups. Each hierarchy is a partition of all tasks in the system.

User level code may create and destroy cgroups by name in an instance of the cgroup virtual file system, specify and query to which cgroup a task is assigned, and list the task pids assigned to a cgroup. Those creations and assignments only affect the hierarchy associated with that instance of the cgroup file system.
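
As a rough illustration of the operations described above, the following shell sketch wraps them in small helper functions. The function names (cg_create, cg_attach, cg_tasks, cg_destroy) are hypothetical and not part of any existing tool. The first argument is assumed to be the mount point of one cgroup file system instance.

```shell
# Hypothetical helpers; $1 is always the mount point of a cgroup hierarchy.

# Create a cgroup named $2 by making a directory in the hierarchy.
cg_create() { mkdir -p "$1/$2"; }

# Assign the task with pid $3 to cgroup $2 by writing to its tasks file.
cg_attach() { echo "$3" > "$1/$2/tasks"; }

# List the pids of the tasks currently assigned to cgroup $2.
cg_tasks() { cat "$1/$2/tasks"; }

# Destroy cgroup $2; on a real hierarchy this only succeeds once it is empty.
cg_destroy() { rmdir "$1/$2"; }

# Example (as root, with a hierarchy mounted on /cgroups):
#   cg_create /cgroups g1
#   cg_attach /cgroups g1 $$
#   cg_tasks  /cgroups g1
```

Because every helper takes the mount point as a parameter, the same calls affect only the hierarchy mounted there, which matches the per-instance behaviour described above.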

On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets (see Documentation/cpusets.txt) allows you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.

User space tools

Libcgroup makes that functionality available to programmers and contains two tools, cgexec and cgclassify, to start processes in a control group or to move existing processes from one control group to another. The libcgroup package is already in Fedora, but its overall quality is very poor: there is almost no documentation, no man pages and no configuration file samples. The code needs a review, further necessary tools need to be created, and the installation needs to be improved.

The goal for Fedora 11 is to improve this package where necessary, i.e.:

* fix bugs
* add/fix documentation and man pages
* add examples
* fix error handling
* rework logging
* create a display tool (to see which control group a given process is in)
* provide a way to start a service daemon in a given control group

The long term goal is to create new tools to e.g. create or modify persistent cgroups configuration and display control groups data. At the beginning the focus will be on command line tools, but we'll keep in mind that in the long term we'll likely want to have graphical tools. These would offer similar functionality and we should try to make sure that any non-UI code written is usable from both kinds of frontends.

Benefit to Fedora

The implementation of the "control groups" scheme and its improvement should enable users to partition resources among individual processes or groups of processes. Libcgroup should help them create a persistent partitioning configuration and handle cgroups from the user's point of view. This project should help users make the best of the control groups kernel feature.

Scope

Kernel Part:

There are several sub-features under control group:

* CGROUPS (grouping infrastructure mechanism)
* CPUSET (cpuset controller, in F10)
* CPUACCT (cpu account controller, in F10)
* SCHED (schedule controller, in F10)
* MEMCTL (memory controller, in F10)
* DEVICE (device controller, in F10)
* NETCTL (network controller, New)

Tools Part:

The libcgroup package requires extended testing and fixing. Once libcgroup is stable enough, we will start writing further parts based on the existing ones.

How To Test

There are multiple ways to test and use the control group features in Fedora, depending on the feature set you are interested in.

For CPUSET:

0. Targeted mostly for x86, x86_64.
1. See Documentation/cgroups/cpusets.txt, section 2, "Usage Examples and Syntax".

To start a new job that is to be contained within a cpuset, the steps are:

1) mkdir /dev/cpuset
2) mount -t cgroup -ocpuset cpuset /dev/cpuset
3) Create the new cpuset by doing mkdir's and write's (or echo's) in
   the /dev/cpuset virtual file system.
4) Start a task that will be the "founding father" of the new job.
5) Attach that task to the new cpuset by writing its pid to the
   /dev/cpuset tasks file for that cpuset.
6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, and then start a subshell 'sh' in that cpuset:

 mount -t cgroup -ocpuset cpuset /dev/cpuset
 cd /dev/cpuset
 mkdir Charlie
 cd Charlie
 /bin/echo 2-3 > cpus
 /bin/echo 1 > mems
 /bin/echo $$ > tasks
 sh
 # The subshell 'sh' is now running in cpuset Charlie
 # The next line should display '/Charlie'
 cat /proc/self/cpuset
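
The value written to cpus above ("2-3") is a list of CPU numbers in which ranges are written with a dash and entries are separated by commas. The following sketch expands such a list into individual CPU numbers, which can help when checking what a cpuset actually contains. The helper name expand_cpulist is hypothetical, and the sketch assumes only plain numbers and a-b ranges.

```shell
# Expand a CPU list such as "0-2,5" into one number per CPU.
# Assumption: entries are either a single number or a range "first-last".
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        # A single number has no "last" part, so fall back to "first".
        seq "$first" "${last:-$first}"
    done | paste -sd ' ' -
}

expand_cpulist 2-3      # prints: 2 3
expand_cpulist 0-2,5    # prints: 0 1 2 5
```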

For CPUACCT:

The CPU accounting controller is used to group tasks using cgroups and account the CPU usage of these groups of tasks.

The CPU accounting controller supports multi-hierarchy groups. An accounting group accumulates the CPU usage of all of its child groups and the tasks directly present in its group.

Accounting groups can be created by first mounting the cgroup filesystem.

mkdir /cgroups
mount -t cgroup -ocpuacct none /cgroups

With the above step, the initial or the parent accounting group becomes visible at /cgroups. At bootup, this group includes all the tasks in the system. /cgroups/tasks lists the tasks in this cgroup. /cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by this group which is essentially the CPU time obtained by all the tasks in the system.

New accounting groups can be created under the parent group /cgroups.

cd /cgroups
mkdir g1
echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell process (bash) into it. CPU time consumed by this bash and its children can be obtained from g1/cpuacct.usage and the same is accumulated in /cgroups/cpuacct.usage also.
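
Since cpuacct.usage reports nanoseconds, the raw number is hard to read. A minimal sketch of converting it to seconds follows; the helper name ns_to_seconds is hypothetical.

```shell
# Convert a CPU time in nanoseconds (as read from cpuacct.usage) to seconds.
ns_to_seconds() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v ns="$1" 'BEGIN { printf "%.3f\n", ns / 1000000000 }'
}

ns_to_seconds 1500000000    # prints: 1.500

# On a system with the hierarchy from the example mounted:
#   ns_to_seconds "$(cat /cgroups/g1/cpuacct.usage)"
```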

For Memory Controller:

0. Configuration

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_CGROUP_MEM_RES_CTLR (still valid??)

1. Prepare the cgroups

mkdir -p /cgroups
mount -t cgroup none /cgroups -o memory

2. Make the new group and move bash into it

mkdir /cgroups/0
echo $$ > /cgroups/0/tasks

Since we are now in the 0 cgroup, we can alter the memory limit:

echo 4M > /cgroups/0/memory.limit_in_bytes

NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or gigabytes.

cat /cgroups/0/memory.limit_in_bytes

4194304
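
The conversion from a suffixed value to the byte count shown above can be reproduced in the shell. This is only a sketch of the arithmetic, assuming that each suffix is a power of 1024 (which agrees with 4M being reported as 4194304); the helper name to_bytes is hypothetical.

```shell
# Convert a value with an optional k/K, m/M or g/G suffix to bytes.
to_bytes() {
    value=$1
    number=${value%[kKmMgG]}     # strip one trailing suffix letter, if any
    suffix=${value#"$number"}    # the letter that was stripped (may be empty)
    case $suffix in
        k|K) echo $((number * 1024)) ;;
        m|M) echo $((number * 1024 * 1024)) ;;
        g|G) echo $((number * 1024 * 1024 * 1024)) ;;
        *)   echo "$number" ;;
    esac
}

to_bytes 4M    # prints: 4194304
```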

NOTE: The interface has now changed to display the usage in bytes instead of pages

We can check the usage:

cat /cgroups/0/memory.usage_in_bytes

1216512

A successful write to this file does not guarantee a successful set of this limit to the value written into the file. This can be due to a number of factors, such as rounding up to page boundaries or the total availability of memory on the system. The user is required to re-read this file after a write to guarantee the value committed by the kernel.

echo 1 > /cgroups/0/memory.limit_in_bytes
cat /cgroups/0/memory.limit_in_bytes

4096
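
The result above is consistent with the limit being rounded up to the next page boundary. The following sketch shows that rounding; it assumes a page size of 4096 bytes unless another is given, and the helper name round_up_to_page is hypothetical. It illustrates the arithmetic only and is not a statement of what the kernel does in every case, so re-reading the file remains the reliable check.

```shell
# Round a byte count up to the next multiple of the page size.
# $2 is the page size; 4096 is an assumption (see `getconf PAGESIZE`).
round_up_to_page() {
    bytes=$1
    page=${2:-4096}
    echo $(( (bytes + page - 1) / page * page ))
}

round_up_to_page 1       # prints: 4096
round_up_to_page 4097    # prints: 8192
```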

The memory.failcnt field gives the number of times that the cgroup limit was exceeded.

The memory.stat file gives accounting information. Now, the number of caches, RSS and Active pages/Inactive pages are shown.
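
memory.stat is a plain-text file with one "name value" pair per line, so a single counter can be picked out with awk. A minimal sketch follows; the helper name stat_field is hypothetical, and the field names used in the usage comment (cache, rss) are the ones mentioned above.

```shell
# Print the value of one counter from a file in "name value" format.
stat_field() {
    field=$1    # counter name, e.g. rss
    file=$2     # path to a memory.stat file
    awk -v f="$field" '$1 == f { print $2; exit }' "$file"
}

# On a system with the memory controller mounted as in the steps above:
#   stat_field rss   /cgroups/0/memory.stat
#   stat_field cache /cgroups/0/memory.stat
```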

For Control Group tools:

All of the following tests require a kernel with cgroups support and the libcgroup package.

1. yum install libcgroup

Creating cgroups:

* Configure the /etc/cgconfig.conf file; a good example and a man page should be packaged.
* Start/stop the cgconfig service and test whether the created groups are as expected.

Moving task to groups:

* Prepare some cgroups, i.e. prepare /etc/cgconfig.conf and start the cgconfig service.
* Start/stop a new process using cgexec and check that it is in the appropriate cgroup.
* Prepare the cgrules.conf file; a sample and a man page should be available.
* Test the cgrulesengd daemon (it should automatically move processes as written in cgrules.conf).
* Configure the cgroup PAM module and test that it works when a user logs in (again, driven by cgrules.conf).

Finding out which cgroup a task is in:

ps -o cgroup
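
The kernel also exposes this information in /proc/<pid>/cgroup, with one line per hierarchy in the form hierarchy-id:subsystems:path. A minimal sketch of extracting the path for one subsystem from that format follows; the helper name cgroup_of is hypothetical.

```shell
# Print the cgroup path of one subsystem from a file in the
# "hierarchy-id:subsystem[,subsystem...]:path" format.
cgroup_of() {
    subsys=$1    # subsystem name, e.g. cpuset
    file=$2      # e.g. /proc/self/cgroup
    awk -F: -v s="$subsys" '{
        n = split($2, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == s) {
                sub(/^[^:]*:[^:]*:/, "")    # keep what follows the second colon
                print
                exit
            }
    }' "$file"
}

# In a shell that was moved into the cpuset "Charlie":
#   cgroup_of cpuset /proc/self/cgroup
```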

User Experience

End users of this feature will hopefully find it useful for partitioning their server or machine resources into functional units to which those resources can be dedicated.

The control group user interfaces are very straightforward and consist of a set of common, easy-to-use command-line operations. Allocating system resources such as the number of CPUs, the amount of memory, and network bandwidth should be easy.

The libcgroup package should help users create a persistent configuration and should significantly lower the barrier to entry for using control groups on Linux.

Dependencies

The majority of the implementation is done inside the kernel. The tools part is implemented in the libcgroup package.

Contingency Plan

The contingency plan for any sub-feature still under development is simply not to enable its kernel option at development freeze, so the incomplete sub-feature is not exposed to the Fedora community. Currently, nothing depends on libcgroup or on the tools that would use it. If things go really wrong, we can always go back to the last working version of libcgroup.

Documentation

kernel documentation:
- Documentation/cgroups

libcg:
- upstream site
- LWN.net article: libcg: design and plans
- documentation from source tarball (directories doc and samples)

Release Notes

libcgroup is a tool that helps to manipulate, control and administer control groups and the associated controllers. With it, you can aggregate or partition sets of tasks and their future children into hierarchical groups with specialized access to resources.

The tool consists of two parts:

The first part lets the user create a persistent cgroups configuration using a configuration file and a service that creates the configured groups on startup.

The second part lets the user define which group a given process or set of processes belongs to. This division is based on the uid or gid of the processes. The user can start a service that places processes into the right group, or use a tool to move an existing process into the right group or to start a new process in it.

Comments and Discussion

See Talk:Features/ControlGroups
