barco: Linux Containers From Scratch in C

barco is a project I worked on to learn more about Linux containers and the Linux kernel. It's a simple implementation of a container runtime in C, which I wrote from scratch (based on other guides on the Internet) using just C, libseccomp for seccomp filters, libcap for container capabilities, libcunit1 for unit tests with CUnit, argtable for handling the CLI and another third-party library for logging. It's not meant to be used in production, but rather as a learning tool.

Linux containers are made up of a set of Linux kernel features:

namespaces: group kernel objects into different sets that can be accessed by specific process trees. There are different types of namespaces: for example, the PID namespace isolates the process tree, while the network namespace isolates the network stack.
seccomp: a Linux kernel mechanism used to limit the system calls that a process can make (handled via syscalls)
capabilities: a Linux kernel mechanism used to set limits on what the root user can do in the container (handled via syscalls)
cgroups: limit the resources (e.g. memory, disk I/O, CPU time) that a process can use (handled via cgroupfs)

barco uses all of these features to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work.

barco can be used to run /bin/sh with the argument . (-c /bin/sh -a .), using / as the root directory (-m /) and running as root (-u 0), with the following command (the optional -v flag enables verbose output):

$ sudo ./bin/barco -u 0 -m / -c /bin/sh -a . [-v]

22:08:41 INFO  ./src/barco.c:96: initializing socket pair...
22:08:41 INFO  ./src/barco.c:103: setting socket flags...
22:08:41 INFO  ./src/barco.c:112: initializing container stack...
22:08:41 INFO  ./src/barco.c:120: initializing container...
22:08:41 INFO  ./src/barco.c:131: initializing cgroups...
22:08:41 INFO  ./src/cgroups.c:73: setting memory.max to 1G...
22:08:41 INFO  ./src/cgroups.c:73: setting cpu.weight to 256...
22:08:41 INFO  ./src/cgroups.c:73: setting pids.max to 64...
22:08:41 INFO  ./src/cgroups.c:73: setting cgroup.procs to 1458...
22:08:41 INFO  ./src/barco.c:139: configuring user namespace...
22:08:41 INFO  ./src/barco.c:147: waiting for container to exit...
22:08:41 INFO  ./src/container.c:43: ### BARCONTAINER STARTING - type 'exit' to quit ###

# ls
bin         home                lib32       media       root        sys         vmlinuz
boot        initrd.img          lib64       mnt         run         tmp         vmlinuz.old
dev         initrd.img.old      libx32      opt         sbin        usr
etc         lib                 lost+found  proc        srv         var
# echo "i am a container"
i am a container
# exit

22:08:55 INFO  ./src/barco.c:153: freeing resources...
22:08:55 INFO  ./src/barco.c:168: so long and thanks for all the fish

Development

If you want to build and run barco from source, you can do so by cloning the repository and following the instructions in the README file. In short, barco provides configuration for development on GitHub Codespaces using Visual Studio Code and the included Makefile can be used to build and run the project as follows:

$ sudo apt install -y make
$ make build

There are also other targets in the Makefile that can be used to run the unit tests, build the documentation, run the formatter and the linter, etc. (most of the tools are part of the Clang/LLVM toolchain). Furthermore, while working on barco I investigated best practices for the structure of C projects and adopted the following layout:

├── .devcontainer - configuration for GitHub Codespaces
├── .github - configuration for GitHub Actions and other GitHub features
├── .vscode - configuration for Visual Studio Code
├── bin - the executable (created by make)
├── build - intermediate build files e.g. *.o (created by make)
├── docs - documentation
├── include - header files
├── lib - third-party libraries
├── scripts - scripts for setup and other tasks
├── src - C source files
│     ├── barco.c - (main) Entry point for the CLI
│     └── *.c
├── tests - contains tests
├── .clang-format - configuration for the formatter
├── .clang-tidy - configuration for the linter
├── .gitignore
├── LICENSE
├── Makefile
└── README.md

How does it work?

The barco executable is the entry point for the CLI. It's responsible for parsing the CLI arguments, setting up the container and running the container process. It all starts in barco.c, where I used argtable to parse the CLI arguments and where the container and other resources are set up, started and ultimately cleaned up. The first two steps towards running a container are creating a pair of sockets (to communicate with the container process) and initializing the container stack (the memory that the container process will use as its stack).
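As a rough illustration of these first two steps, the sketch below creates a socket pair and allocates a stack. The helper names (channel_init, stack_init), the SOCK_SEQPACKET socket type, the FD_CLOEXEC flag and the 1 MiB stack size are assumptions made for this example, not details taken from barco.c.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical size: the post does not say how large barco's stack is. */
#define STACK_SIZE (1024 * 1024)

/* Create a connected pair of sockets and mark the first one close-on-exec
 * (assumed flag). Returns 0 on success, -1 on failure. */
int channel_init(int sockets[2])
{
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sockets) != 0) {
        perror("socketpair");
        return -1;
    }
    if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC) != 0) {
        perror("fcntl");
        close(sockets[0]);
        close(sockets[1]);
        return -1;
    }
    return 0;
}

/* Allocate the memory that the child process will use as its stack.
 * Returns NULL if the allocation fails. */
char *stack_init(void)
{
    return malloc(STACK_SIZE);
}
```

Whatever is written to one socket can be read from the other, which is what lets the parent and the container exchange messages later on.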

The Call to container_init

After the initial setup, the container_init function, defined in container.c, is called to start the container process. The function is relatively simple: all it does is call the clone system call with a function to run (container_start), the stack configuration, and the appropriate flags to create the container process. The flags CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS create new mount, cgroup, PID, IPC, network and UTS namespaces, which allow some control over mounts, cgroups, pids, IPC data structures, network devices and the hostname.
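A minimal sketch of such a clone call might look like the following. spawn_child and demo_child are hypothetical names, the stack size is an assumption, and SIGCHLD is added so that the parent can wait for the child. Creating the namespaces requires root, so the helper takes the flags as a parameter.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h> /* waitpid, for the caller */

#define STACK_SIZE (1024 * 1024) /* hypothetical size */

/* The namespace flags listed in the post. */
#define CONTAINER_FLAGS                                            \
    (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | \
     CLONE_NEWNET | CLONE_NEWUTS)

/* Run fn(arg) in a child created with clone(). On success the child's
 * stack is stored in *stack_out so the caller can free it after the
 * child has exited. Returns the child's PID, or -1 on failure. */
pid_t spawn_child(int (*fn)(void *), void *arg, int flags, char **stack_out)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downwards on most architectures, so pass its top. */
    pid_t pid = clone(fn, stack + STACK_SIZE, flags | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    *stack_out = stack;
    return pid;
}

/* Stand-in for container_start: exits with the status passed in arg. */
int demo_child(void *arg)
{
    return *(int *)arg;
}
```

Passing CONTAINER_FLAGS as the flags argument (as root) would start the child in new namespaces; passing 0 needs no privileges and is enough to try the mechanics.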

The container_start Function

The resulting process is a child of the barco process, and it's the one that will run the container. The container_start function is the entry point for the container process, and it's defined in container.c. It's responsible for setting up the container, and it does so by:

setting the hostname
setting the root directory (mount namespace) to the one specified by the user (via the -m flag)
setting the user namespace
setting capabilities and seccomp filters (for security)

The Mount Namespace

The mount namespace is used to isolate the filesystem mount points seen by the container process. The namespace itself is created by the CLONE_NEWNS flag passed to clone; the mount_set function is then responsible for setting the root directory for the container process. It does so by calling the mount system call with the appropriate flags (MS_BIND | MS_REC | MS_PRIVATE) to keep mount changes private to the container and to bind-mount the directory specified by the user (via the -m flag) as the new root. The result is that the container process sees the specified directory as its root directory, and it can access the files in that directory and its subdirectories.
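To make the role of the mount flags more concrete, here is a hedged sketch of the two mount calls described above. mount_prepare_root is a hypothetical name and this is not the mount_set implementation: the step that actually switches the root directory is not described in the text and is left out. The calls need CAP_SYS_ADMIN and should only be run inside a new mount namespace.

```c
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Prepare new_root as a bind mount inside the current mount namespace.
 * Returns 0 on success, -1 on failure. */
int mount_prepare_root(const char *new_root)
{
    struct stat st;

    /* Validate the target before touching any mounts. */
    if (stat(new_root, &st) != 0 || !S_ISDIR(st.st_mode)) {
        fprintf(stderr, "%s is not a usable directory\n", new_root);
        return -1;
    }

    /* Make every mount private, so that changes made here do not
     * propagate back to the namespace we were cloned from. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
        perror("mount MS_PRIVATE");
        return -1;
    }

    /* Bind-mount the directory onto itself so it becomes a mount point. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0) {
        perror("mount MS_BIND");
        return -1;
    }
    return 0;
}
```

The propagation change and the bind mount are issued as two separate calls here, because mount ignores propagation flags such as MS_PRIVATE when MS_BIND is present in the same call.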

The User Namespace

The user namespace is used to isolate the user and group IDs seen by the container process. The user_namespace_init function is responsible for setting the user namespace, and it does so by calling the unshare system call with the CLONE_NEWUSER flag to create a new user namespace. user_namespace_init relies on user_namespace_prepare_mappings, a function called in barco.c and run by the parent process (barco): it waits for the child process (the container) to request its uid and gid, and then updates the uid_map and gid_map files for the container. These files, found under /proc/<pid>/, are the Linux kernel mechanism for mapping uids and gids between a user namespace and its parent namespace. The result is that the container process runs with the user and group IDs specified by the user (via the -u flag).
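The mapping files have a simple line format: the first ID inside the namespace, the first ID outside of it, and the length of the range. The sketch below shows how a parent process could write one such line. idmap_format and idmap_write are hypothetical helpers, not functions from barco, and the IDs used in any call are up to the caller.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one mapping line: "<id inside> <id outside> <count>\n".
 * Returns the line length, or -1 if buf is too small. */
int idmap_format(char *buf, size_t len, unsigned inside, unsigned outside,
                 unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n > 0 && (size_t)n < len) ? n : -1;
}

/* Write a mapping into /proc/<pid>/<file>, where file is "uid_map" or
 * "gid_map". Returns 0 on success, -1 on failure. */
int idmap_write(pid_t pid, const char *file, unsigned inside,
                unsigned outside, unsigned count)
{
    char path[64];
    char line[64];

    int n = idmap_format(line, sizeof line, inside, outside, count);
    if (n < 0)
        return -1;

    int p = snprintf(path, sizeof path, "/proc/%d/%s", (int)pid, file);
    if (p < 0 || (size_t)p >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, line, (size_t)n);
    close(fd);
    return written == n ? 0 : -1;
}
```

For example, idmap_write(child_pid, "uid_map", 0, 1000, 1) would make uid 0 inside the container correspond to uid 1000 outside of it; both numbers are purely illustrative.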

Capabilities and Seccomp Filters

The container process is running as root (uid 0), but it's not a full root user: it's a root user with limited capabilities and seccomp filters. The sec_set_caps function uses libcap to restrict the capabilities of the container process, and it does so by updating the relevant capabilities (e.g. CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_SETFCAP, CAP_SYSLOG...) and applying the result with the cap_set_proc function. As a result, the container process can perform only the privileged actions that its remaining capabilities allow.
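To give an idea of what limiting root looks like at the system call level, the sketch below removes capabilities from the bounding set with prctl. This is a swapped-in technique chosen so that the example compiles without linking libcap; it is not the sec_set_caps implementation. It also assumes that capabilities like the ones listed above are being removed, which the text does not state explicitly. cap_bounded and caps_drop are hypothetical names.

```c
#include <linux/capability.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it is
 * not, and -1 on error. */
int cap_bounded(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
}

/* Drop each listed capability from the bounding set, so it can no longer
 * be acquired through execve. Requires CAP_SETPCAP.
 * Returns 0 on success, -1 on the first failure. */
int caps_drop(const int *caps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (prctl(PR_CAPBSET_DROP, (unsigned long)caps[i], 0UL, 0UL, 0UL) != 0)
            return -1;
    }
    return 0;
}
```

A caller running with enough privileges could pass an array such as { CAP_MKNOD, CAP_SYSLOG } to caps_drop and then confirm the result with cap_bounded.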

Seccomp filters, on the other hand, limit the system calls that a process (in my case, the container) can make. The sec_set_seccomp function blocks sensitive system calls based on Docker's default seccomp profile, as well as other obsolete or dangerous system calls. It does so by calling the seccomp_rule_add function with the appropriate parameters to specify how calls to a system call should be handled. Let's use the following as an example:

seccomp_rule_add(ctx, SEC_SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))

This rule tells the Linux kernel to block calls to fchmod whenever the S_ISGID bit is set in the mode parameter. SCMP_A1 refers to the argument at index 1, i.e. the second parameter of fchmod (mode), which holds the permission bits to apply to the file. SCMP_CMP_MASKED_EQ compares the argument after masking it: the rule matches when (mode & S_ISGID) == S_ISGID, regardless of the other permission bits. S_ISGID is the setgid bit, which makes an executable run with the group ID of the file's group. A call such as fchmod(fd, S_ISGID | S_IRWXU) is therefore blocked, while fchmod(fd, S_IRWXU) is still allowed. This is a good example of how seccomp filters can block a dangerous use of a system call without blocking the system call altogether.
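The masked comparison can be modelled in plain C to see which mode values the rule would match. This is only a model of the comparison semantics and does not call libseccomp; masked_eq and fchmod_rule_matches are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Model of SCMP_CMP_MASKED_EQ: a rule matches when (arg & mask) == datum. */
bool masked_eq(uint64_t arg, uint64_t mask, uint64_t datum)
{
    return (arg & mask) == datum;
}

/* Would the fchmod rule shown above match this mode argument? */
bool fchmod_rule_matches(mode_t mode)
{
    return masked_eq((uint64_t)mode, S_ISGID, S_ISGID);
}
```

With this model, fchmod_rule_matches(S_ISGID | 0755) is true, while fchmod_rule_matches(0644) and fchmod_rule_matches(S_ISUID | 0755) are false, since neither of those modes has the setgid bit set.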

Cgroups

As the container is starting, the parent process (barco) sets up cgroups for the child process. The cgroups_init function is responsible for setting up cgroups (version 2), and it does so by creating the /sys/fs/cgroup/barcontainer directory (barcontainer is the hostname for the container) for the new cgroup and then writing the cgroup settings to the appropriate files in the cgroupfs. The cgroup settings are:

memory.max: the maximum amount of memory that the processes in the cgroup can use
cpu.weight: the relative share of CPU time given to the cgroup compared to its sibling cgroups
pids.max: the maximum number of processes that can exist in the cgroup

With these values set, the container process is limited in the amount of memory, CPU time and number of processes that it can use.
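Since cgroups (version 2) are configured through ordinary file operations, the setup can be sketched with mkdir, open and write. cgroup_write and cgroup_apply are hypothetical names rather than barco's functions; the values are the ones shown in the log output earlier in this post.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write value into the existing file <dir>/<file>.
 * Returns 0 on success, -1 on failure. */
int cgroup_write(const char *dir, const char *file, const char *value)
{
    char path[256];
    int n = snprintf(path, sizeof path, "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/* Create the cgroup directory, apply the limits and move pid into it.
 * On cgroupfs the kernel creates the control files when the directory is
 * created, provided the controllers are enabled for the parent cgroup. */
int cgroup_apply(const char *dir, pid_t pid)
{
    char pid_str[32];

    if (mkdir(dir, 0755) != 0)
        return -1;
    if (cgroup_write(dir, "memory.max", "1G") != 0 ||
        cgroup_write(dir, "cpu.weight", "256") != 0 ||
        cgroup_write(dir, "pids.max", "64") != 0)
        return -1;

    snprintf(pid_str, sizeof pid_str, "%d", (int)pid);
    return cgroup_write(dir, "cgroup.procs", pid_str);
}
```

Run as root, cgroup_apply("/sys/fs/cgroup/barcontainer", child_pid) would correspond to the steps described above, with child_pid being the PID returned by clone.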

Waiting for the Container to Exit

The container is set up, and it's now running. The parent process (barco) waits for the container process to exit by calling the waitpid system call with the PID (process ID) of the container. The container I used as an example in the README.md file is running /bin/sh and waiting for user input. The user can type commands, and the container process will execute them. For example, the user can type ls and the container process will execute it, listing the contents of the current directory. The container process will continue to run until the user types exit, which will cause the container process to exit. The parent process (barco) will then exit as well.
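Waiting for a child with waitpid could look roughly like this. container_wait is a hypothetical name, and reporting a signal as 128 plus the signal number is a shell convention borrowed for the example, not something barco is known to do. spawn_exiting_child only exists to make the sketch easy to try.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block until the process pid terminates. Returns its exit status,
 * 128 + signal number if it was killed by a signal, or -1 on error. */
int container_wait(pid_t pid)
{
    int status = 0;
    pid_t r;

    /* Retry if the wait is interrupted by a signal handler. */
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Demo helper (not part of barco): fork a child that exits with code. */
pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    return pid;
}
```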

Cleaning up

Once the container process exits, the parent process (barco) cleans up the resources used by the container before exiting itself. Starting at the cleanup label in barco.c, the parent process is responsible for:

freeing the container stack
closing the sockets used to communicate with the container process
removing the cgroup it created
freeing the data structures used by argtable
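A single cleanup label reached with goto is a common C idiom for this kind of teardown. The sketch below shows the pattern with a socket pair and a stack only. run_with_cleanup and its fail_early parameter are hypothetical, and removing the cgroup and freeing the argtable structures are only indicated by a comment.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* hypothetical size */

/* Acquire resources, optionally simulate a failure, and release
 * everything in one place. Returns 0 on success, 1 on failure. */
int run_with_cleanup(int fail_early)
{
    int exit_code = 1;
    int sockets[2] = {-1, -1};
    char *stack = NULL;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sockets) != 0)
        goto cleanup;

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        goto cleanup;

    if (fail_early) /* simulated error in a later setup step */
        goto cleanup;

    /* ... start the container and wait for it here ... */
    exit_code = 0;

cleanup:
    /* Every exit path ends up here, so each resource is released once. */
    free(stack); /* free(NULL) is a no-op */
    if (sockets[0] != -1)
        close(sockets[0]);
    if (sockets[1] != -1)
        close(sockets[1]);
    /* Removing the cgroup directory and freeing CLI data would go here. */
    return exit_code;
}
```

Initializing every resource to an "empty" value (-1 or NULL) is what makes it safe to jump to the label from any point in the function.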

Summary

Linux containers are made up of a set of Linux kernel features, and barco uses them to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work. I learned a lot while working on this project, and I hope that this post can be useful to others who want to learn more about Linux containers and the Linux kernel. If you have any questions, feedback or improvements you'd like to make, feel free to reach out to me directly or on the barco repository!