- Luca Cavallin
barco is a project I worked on to learn more about Linux containers and the Linux kernel. It's a simple implementation of a container runtime in C, which I wrote from scratch (based on other guides on the Internet) using just C, libseccomp for seccomp filters, libcap for container capabilities, libcunit1 for unit tests with CUnit, argtable for handling the CLI and another third-party library for logging. It's not meant to be used in production, but rather as a learning tool.
Linux containers are made up of a set of Linux kernel features:
namespaces: used to group kernel objects into different sets that can be accessed by specific process trees. There are different types of namespaces: for example, the PID namespace is used to isolate the process tree, while the network namespace is used to isolate the network stack.
seccomp: a Linux kernel mechanism used to limit the system calls that a process can make (handled via syscalls)
capabilities: a Linux kernel mechanism used to set limits on what the root user can do in the container (handled via syscalls)
cgroups: used to limit the resources (e.g. memory, disk I/O, CPU time) that a process can use (handled via cgroupfs)
barco uses all of these features to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work.
barco can be used to run /bin/sh . from the / directory as root (-u 0) with the following command (optional -v for verbose output):
$ sudo ./bin/barco -u 0 -m / -c /bin/sh -a . [-v]
22:08:41 INFO ./src/barco.c:96: initializing socket pair...
22:08:41 INFO ./src/barco.c:103: setting socket flags...
22:08:41 INFO ./src/barco.c:112: initializing container stack...
22:08:41 INFO ./src/barco.c:120: initializing container...
22:08:41 INFO ./src/barco.c:131: initializing cgroups...
22:08:41 INFO ./src/cgroups.c:73: setting memory.max to 1G...
22:08:41 INFO ./src/cgroups.c:73: setting cpu.weight to 256...
22:08:41 INFO ./src/cgroups.c:73: setting pids.max to 64...
22:08:41 INFO ./src/cgroups.c:73: setting cgroup.procs to 1458...
22:08:41 INFO ./src/barco.c:139: configuring user namespace...
22:08:41 INFO ./src/barco.c:147: waiting for container to exit...
22:08:41 INFO ./src/container.c:43: ### BARCONTAINER STARTING - type 'exit' to quit ###
# ls
bin   home            lib32       media  root  sys  vmlinuz
boot  initrd.img      lib64       mnt    run   tmp  vmlinuz.old
dev   initrd.img.old  libx32      opt    sbin  usr
etc   lib             lost+found  proc   srv   var
# echo "i am a container"
i am a container
# exit
22:08:55 INFO ./src/barco.c:153: freeing resources...
22:08:55 INFO ./src/barco.c:168: so long and thanks for all the fish
If you want to build and run barco from source, you can do so by cloning the repository and following the instructions in the README file. In short, barco provides configuration for development on GitHub Codespaces using Visual Studio Code, and the included Makefile can be used to build and run the project as follows:
$ sudo apt install -y make
$ make build
There are also other targets in the Makefile that can be used to run the unit tests, build the documentation, run the formatter and the linter, etc. (most of the tools are native to the Clang compiler). Furthermore, while working on barco I investigated best practices for the structure of C projects and adopted the following:
├── .devcontainer - configuration for GitHub Codespaces
├── .github - configuration for GitHub Actions and other GitHub features
├── .vscode - configuration for Visual Studio Code
├── bin - the executable (created by make)
├── build - intermediate build files e.g. *.o (created by make)
├── docs - documentation
├── include - header files
├── lib - third-party libraries
├── scripts - scripts for setup and other tasks
├── src - C source files
│   ├── barco.c - (main) Entry point for the CLI
│   └── *.c
├── tests - contains tests
├── .clang-format - configuration for the formatter
├── .clang-tidy - configuration for the linter
├── .gitignore
├── LICENSE
├── Makefile
└── README.md
How does it work?
The barco executable is the entry point for the CLI. It's responsible for parsing the CLI arguments, setting up the container and running the container process. It all starts in barco.c, where I used argtable to parse the CLI arguments and to set up, start and ultimately clean up the container and other resources. The first two steps towards running a container are creating a pair of sockets (to communicate with the container process) and initializing the container stack (the memory the container process will use as its stack).
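As a rough illustration of those first two steps, here is a minimal sketch. The helper names (make_socket_pair, make_stack), the STACK_SIZE value, the socket type and the close-on-exec flag are assumptions made for this example and are not taken from barco's source.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed size, not taken from barco */

/* Creates a connected pair of sockets for parent/child messages.
 * Returns 0 on success, -1 on failure. */
int make_socket_pair(int fds[2])
{
    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, fds) == -1)
        return -1;

    /* Close the first end automatically if this process calls exec. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}

/* Allocates the memory the child process will use as its stack.
 * Returns NULL if the allocation fails. */
char *make_stack(void)
{
    return malloc(STACK_SIZE);
}
```

Whatever is written to one end of the pair can be read from the other, which is what lets the parent and the container exchange short messages during setup.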
The Call to container_init
After the initial setup, the container_init function, defined in container.c, is called to start the container process. The function is relatively simple: all it does is call the clone system call with a function to run (container_start), the stack configuration, and the appropriate flags to create the container process. The flags CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS allow some control over mounts, cgroups, pids, IPC data structures, network devices and hostname.
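A call like this could look roughly as follows. This is a sketch under assumptions: spawn_child, example_child, CONTAINER_FLAGS and STACK_SIZE are hypothetical names for this example, and barco's real container_init may be organized differently. Creating the namespaces requires root, so the function takes the namespace flags as a parameter and can also be tried with 0.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h> /* for waitpid() in the caller */

#define STACK_SIZE (1024 * 1024) /* assumed size, not taken from barco */

/* Namespace flags listed in the text above. */
#define CONTAINER_FLAGS                                            \
    (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | \
     CLONE_NEWNET | CLONE_NEWUTS)

/* Example child entry point: exits with the code passed in arg. */
int example_child(void *arg)
{
    return *(int *)arg;
}

/* Starts fn(arg) in a child created with clone().
 * ns_flags selects the namespaces to create: CONTAINER_FLAGS for the full
 * set (needs root), or 0 for a plain child process.
 * On success returns the child's PID and stores the stack in *stack_out so
 * the caller can free it after the child exits; returns -1 on failure. */
pid_t spawn_child(int (*fn)(void *), void *arg, int ns_flags, char **stack_out)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* On x86-64 and ARM the stack grows downwards, so clone() receives a
     * pointer to the top of the allocation. SIGCHLD lets the parent wait
     * for the child with waitpid(). */
    pid_t pid = clone(fn, stack + STACK_SIZE, ns_flags | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    *stack_out = stack;
    return pid;
}
```

With root privileges, passing CONTAINER_FLAGS instead of 0 would place the child in new namespaces of each listed type.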
The container_start Function
The resulting process is a child of the barco process, and it's the one that will run the container. The container_start function is the entry point for the container process, and it's defined in container.c. It's responsible for setting up the container, and it does so by:
setting the hostname
setting the root directory (mount namespace) to the one specified by the user on the command line
setting the user namespace
setting capabilities and seccomp filters (for security)
The Mount Namespace
The mount namespace is used to isolate the filesystem mount points seen by the container process. The namespace itself is created by the CLONE_NEWNS flag passed to clone. The mount_set function is then responsible for setting the root directory for the container process, and it does so by calling the mount system call with the appropriate flags (MS_BIND | MS_REC | MS_PRIVATE) to bind-mount the directory specified by the user as the new root. The result is that the container process sees that directory as its root directory, and it is able to access the files in that directory and its subdirectories.
The User Namespace
The user namespace is used to isolate the user and group IDs seen by the container process. The user_namespace_init function is responsible for setting the user namespace, and it does so by calling the unshare system call with the appropriate flag (CLONE_NEWUSER) to create a new user namespace.
user_namespace_init relies on user_namespace_prepare_mappings, a function called in barco.c and used by the parent process (barco) to listen for the child process (the container) to request setting the uid and gid before updating the uid_map and gid_map for the container. The uid_map and gid_map files are a Linux kernel mechanism for mapping uids and gids between the parent's and the container's user namespaces. The result is that the container process will see the user and group IDs as specified by the user.
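Each line in a uid_map or gid_map file has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below formats and writes one such line. The helper names (format_id_map, write_id_map) are hypothetical, and the way barco chooses the IDs and coordinates with the child over the socket pair is not shown.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Formats one mapping line: "<id inside> <id outside> <count>\n".
 * Returns 0 on success, -1 if the buffer is too small. */
int format_id_map(char *buf, size_t size, unsigned inside, unsigned outside,
                  unsigned count)
{
    int n = snprintf(buf, size, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Writes one mapping line to path, e.g. /proc/<pid>/uid_map.
 * Returns 0 on success, -1 on failure. */
int write_id_map(const char *path, unsigned inside, unsigned outside,
                 unsigned count)
{
    char line[64];
    if (format_id_map(line, sizeof line, inside, outside, count) == -1)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* The mapping is submitted in a single write() call. */
    size_t len = strlen(line);
    ssize_t written = write(fd, line, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

For example, mapping uid 0 inside the container to uid 1000 outside it, for a range of one ID, produces the line "0 1000 1".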
Capabilities and Seccomp Filters
The container process runs as root (uid 0), but it's not a full root user: it's a root user with limited capabilities and seccomp filters. The sec_set_caps function uses libcap to set the capabilities for the container process, and it does so by calling the cap_set_proc function with the appropriate flags (e.g. CAP_SYSLOG...). With this configuration, the container process can perform only the privileged actions allowed by the capabilities it still holds.
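barco uses libcap for this step. To keep the example free of external libraries, the sketch below uses a different but related mechanism instead: the prctl system call, which can read and drop capabilities from the bounding set. The helper names (cap_in_bounding_set, drop_bounding_caps) are hypothetical, and this is not how sec_set_caps is implemented.

```c
#include <linux/capability.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it is not,
 * and -1 if cap is not a valid capability. */
int cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}

/* Drops every capability in caps from the bounding set.
 * Dropping requires CAP_SETPCAP. Returns 0 on success, -1 on the first
 * failure. */
int drop_bounding_caps(const int *caps, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (prctl(PR_CAPBSET_DROP, caps[i], 0, 0, 0) == -1)
            return -1;
    }
    return 0;
}
```

A caller could pass an array such as { CAP_SYSLOG } to drop_bounding_caps and then confirm the result with cap_in_bounding_set(CAP_SYSLOG).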
Seccomp filters, instead, are used to limit the system calls that a process (in my case, the container) can make. The sec_set_seccomp function blocks sensitive system calls based on Docker's default seccomp profile, along with other obsolete or dangerous system calls. It does so by calling the seccomp_rule_add function with the appropriate parameters to specify how calls to a system call should be handled. Let's use the following as an example:
seccomp_rule_add(ctx, SEC_SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
This instruction tells the Linux kernel to block calls to fchmod if the S_ISGID bit is set in the mode parameter. The mode parameter is the second parameter of the fchmod system call (hence SCMP_A1, since arguments are counted from zero), and it's used to set the permissions of a file. The S_ISGID bit is the setgid bit, which is used to set the group ID on execution. SCMP_CMP_MASKED_EQ masks the argument with S_ISGID and compares the result with S_ISGID, so the rule matches any mode that includes the setgid bit, for example S_ISGID | S_IRWXU | S_IRWXG | S_IRWXO, whatever the other permission bits are. A call to fchmod that only sets ordinary permission bits does not match and is still allowed. This is a good example of how seccomp filters can be used to block a dangerous use of a system call without blocking the system call entirely.
While the container is starting, the parent process (barco) sets up cgroups for the child process. The cgroups_init function is responsible for setting up cgroups (version 2), and it does so by creating the /sys/fs/cgroup/barcontainer directory (barcontainer is the hostname for the container) for the new cgroup and then writing the cgroup settings to the appropriate files in the cgroupfs. The cgroup settings are:
memory.max: the maximum amount of memory that the container process can use
cpu.weight: the relative weight of the container process compared to other processes
pids.max: the maximum number of processes that the container process can spawn
By setting these values, the container process will be limited in the amount of memory, CPU time and processes that it can use.
Waiting for the Container to Exit
The container is set up, and it's now running. The parent process (barco) waits for the container process to exit by calling the waitpid system call with the PID (process ID) of the container. The container I used as an example in the README.md file runs /bin/sh and waits for user input. The user can type commands, and the container process will execute them. For example, the user can type ls and the container process will execute it, listing the contents of the current directory. The container process continues to run until the user types exit, which causes it to exit. The parent process (barco) will then exit as well.
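Waiting for a child and decoding how it ended can be sketched as follows. The helper name wait_for_exit and the convention of returning 128 plus the signal number are choices made for this example, not barco's behavior.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Blocks until the child with the given pid terminates.
 * Returns its exit code, 128 + the signal number if it was killed by a
 * signal (a shell-style convention chosen for this sketch), or -1 on error. */
int wait_for_exit(pid_t pid)
{
    int status = 0;

    /* Retry if the call is interrupted by a signal handler. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

The parent stays blocked in waitpid for as long as the shell inside the container is running.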
Once the container process exits, the parent process (barco) cleans up the resources used by the container before exiting itself. Starting at the cleanup label in barco.c, the parent process is responsible for:
freeing the container stack
closing the sockets used to communicate with the container process
removing the cgroup it created
freeing the data structures used by
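A cleanup label is a common C idiom: every failure path jumps to one place that releases whatever was acquired so far. The sketch below shows the pattern for the stack, the sockets and the cgroup directory. The function name run_with_cleanup, its parameter and STACK_SIZE are hypothetical, and the actual container start is left as a comment.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed size, not taken from barco */

/* Acquires the resources, then releases them at a single cleanup label.
 * cgroup_dir may be NULL when no cgroup directory was created.
 * Returns 0 on success, 1 on failure. */
int run_with_cleanup(const char *cgroup_dir)
{
    int exit_code = 1;
    int sockets[2] = {-1, -1};
    char *stack = NULL;

    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets) == -1)
        goto cleanup;

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        goto cleanup;

    /* ... start the container and wait for it to exit here ... */
    exit_code = 0;

cleanup:
    if (sockets[0] != -1)
        close(sockets[0]);
    if (sockets[1] != -1)
        close(sockets[1]);
    free(stack); /* free(NULL) is a no-op */
    if (cgroup_dir != NULL)
        rmdir(cgroup_dir); /* a cgroup is removed by deleting its directory */
    return exit_code;
}
```

Initializing the descriptors to -1 and the stack to NULL is what makes the cleanup code safe to run no matter how far the setup got.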
Linux containers are made up of a set of Linux kernel features, and barco uses them to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work. I learned a lot while working on this project, and I hope that this post can be useful to others who want to learn more about Linux containers and the Linux kernel. If you have any questions, feedback or improvements you'd like to make, feel free to reach out to me directly or on the barco repository!