# Containers Are Not a Security Boundary

*Luca Cavallin*
Containers changed how we package and ship software, but they did not rewrite the basic security rules. Trust boundaries, privilege, and attack surface are all still there. That's one of the things I learned while digging into container security, partly from Liz Rice's Container Security and partly from spending time with the Linux pieces underneath.
A container is really just a Linux process with some isolation around it. It talks to the kernel through syscalls, runs as some user, sees some filesystem, reaches some network, and gets some resource limits. If those foundations are weak, the container is weak too.
This is more obvious in real systems, because real systems are shared everywhere. Shared clusters, shared nodes, shared build machines, shared registries, shared secrets, shared mistakes. When one layer breaks, another layer has to stop the problem from getting bigger. That is why container security is not one feature, but a stack of (boring) practical controls.
## Container Security Follows Old Rules

```bash
docker run \
  --read-only \
  --cap-drop=ALL \
  --security-opt=no-new-privileges:true \
  --security-opt=seccomp=default.json \
  --pids-limit=100 \
  --memory=256m \
  --cpus=1 \
  --user=10001:10001 \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --network=bridge \
  myapp@sha256:abc123
```
The first mistake is to treat a container like a hard security boundary. It is just a packaging unit with some isolation around it, which is useful and fast, but also easy to trust too much!
The threat model starts with the shared kernel. If two containers run on the same host, they both depend on that same kernel. That means kernel bugs, messy mounts, broad capabilities, weak runtime defaults, or even ol' regular resource abuse can affect other workloads too. This gets much more serious in multitenant systems. A trusted internal service and hostile customer code should not be treated as if they have the same isolation level. The same goes for Kubernetes namespaces: they are helpful for organization and policy, but they are not a hard wall between tenants like a VM is.
None of this is new! Least privilege, defense in depth, reducing attack surface, limiting blast radius, separating duties: ancient ideas that keep showing up because they solve real problems. You do not want one person, or one pipeline, to build, sign, approve, and deploy everything. Ask yourself: if this workload gets compromised, what can it reach next? If the answer is the whole node, the control plane, or other tenants, then the isolation is weak. Sad noises.
## System Calls, Permissions, and Capabilities Are the Front Door Into the Kernel

```c
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>

/* Inside a container this is still an ordinary process: the kernel answers
   getuid() and execve() the same way it does for any host process. */
int main(void) {
    printf("uid=%d gid=%d\n", (int)getuid(), (int)getgid());
    execl("/bin/sh", "/bin/sh", (char *)NULL);
    return 0;
}
```
Containers are not doing anything magical. They are just processes making syscalls, and the kernel is the thing deciding whether the answer is yes or no. That is the gist of it. Breathe that in, and the rest of container security starts making more sense.
That is also why the normal Linux permission model applies inside containers too. File ownership, mode bits, setuid, and setgid do not suddenly stop working because the process is in a container. If you have a setuid helper in there, it can become a privilege escalation path when the runtime leaves dangerous capabilities enabled or mounts something it should not.
Capabilities are part of the same picture. Splitting root into smaller privileges was a good idea, but it also created more ways to get the setup wrong. Some capabilities, like CAP_NET_BIND_SERVICE, are pretty narrow and easy to reason about. Others, like CAP_SYS_ADMIN, are basically a cosmic hammer of power. This is why "not running as root" is not enough. A process can be non-root and still have enough privileges to inspect other processes, affect mounts, or reach parts of the kernel you do not want it to. Think about what syscalls a process can make, what files it can touch, what namespaces it affects, and what kernel features it can reach.
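To make the capability model concrete, here is a small sketch (my own illustration, not from the book) that decodes a `CapEff` mask like the one shown in `/proc/<pid>/status`. The bit numbers are the standard ones from `linux/capability.h`, and the example mask is the well-known Docker default capability set:

```python
# Decode a CapEff bitmask (as shown in /proc/<pid>/status) into capability
# names. Bit numbers come from linux/capability.h; this table only lists
# the capabilities in Docker's default set, anything else prints as cap_N.
CAP_NAMES = {
    0: "cap_chown", 1: "cap_dac_override", 3: "cap_fowner",
    4: "cap_fsetid", 5: "cap_kill", 6: "cap_setgid", 7: "cap_setuid",
    8: "cap_setpcap", 10: "cap_net_bind_service", 13: "cap_net_raw",
    18: "cap_sys_chroot", 27: "cap_mknod", 29: "cap_audit_write",
    31: "cap_setfcap",
}

def decode_caps(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [CAP_NAMES.get(bit, f"cap_{bit}")
            for bit in range(64) if mask & (1 << bit)]

# Docker's default capability set: 14 capabilities, nothing like full root.
print(decode_caps("00000000a80425fb"))
```

Running the same decode on a `--privileged` container shows a much longer list, which is the whole point: the hex mask, not "root or not", is what tells you how much of the kernel a process can reach.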
## Cgroups Stop One Bad Process from Becoming Everyone's Problem

```bash
mkdir -p /sys/fs/cgroup/demo
echo $$ > /sys/fs/cgroup/demo/cgroup.procs      # move this shell into the group
echo "200M" > /sys/fs/cgroup/demo/memory.max    # cap memory at 200 MiB
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max  # 50ms quota per 100ms period: half a CPU
```
Isolation without resource control is not enough either. If one process can eat all the memory, fork forever, or clog the CPU, then giving it its own hostname does not really protect anything. The process may look isolated, but it can still hurt everything else on the machine.
That is what cgroups are for: they are the Linux mechanism for putting resource boundaries around processes. cgroups v1 split things across multiple hierarchies and got messy fast. cgroups v2 cleaned that up by using one unified hierarchy. But the basic model is quite simple: make a group, set limits, and place processes in that group.
It is basic and useful. Even a "safe internal batch job" can cause damage if it has no memory or CPU limits and the platform lets it behave like it owns the host. Docker and Kubernetes both depend on cgroups underneath for memory, CPU, I/O, and PID limits. So this is definitely one of those controls you want to use on purpose!
## Container Isolation with Linux Namespaces

```bash
unshare --uts --pid --mount --net --ipc --fork bash
hostname demo
ps aux
```
When we talk about container isolation, we are usually talking about Linux namespaces plus a bit of filesystem setup. UTS namespaces give a container its own hostname, PID namespaces give it its own process ID space, mount namespaces give it its own view of mounts, and network namespaces split off interfaces, routes, and firewall state. IPC namespaces do the same for shared memory and semaphores, while cgroup namespaces hide the host's cgroup layout. These features change what a process can see and how much of the system it can observe.
If you start a shell inside a fresh PID namespace, it gets its own process tree and can even believe it is PID 1. But from the host, nothing special happened. It is just another host process with a different view of the world. Containers are not above the operating system or running beside it, they are regular processes living inside the host kernel, with some careful isolation wrapped around them.
Beware of chroot! chroot changes the apparent root directory, but on its own it doesn't provide much isolation. It is closer to a filesystem trick than a serious security feature. User namespaces are more interesting: they let you map root inside the container to an unprivileged user on the host, which is one of the cleaner ways to limit the blast radius when something goes wrong. BUT! Namespaces can hide parts of the host, they do not make the host go away.
## Hardening Isolation Means More Than One Line of Defense

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "names": ["read", "write", "exit", "futex"], "action": "SCMP_ACT_ALLOW" }
  ]
}
```
The risky part of containers is the shared kernel, so most hardening work comes down to reducing how much of that kernel a workload can actually touch. Seccomp helps a lot because it filters syscalls, and syscalls are the front door into the kernel. AppArmor and SELinux push things further by adding mandatory access control around files, capabilities, process behavior, and labels. They are not as friendly as "just run the container", but they can stop a bad situation from getting worse.
Once you want stronger isolation, you start looking beyond the default container model. Something like gVisor puts a user-space kernel between the workload and the host, which gives the process a narrower and more controlled path into the real kernel. Kata Containers take a different approach and wrap workloads in lightweight VMs, while Firecracker goes even more in that direction with microVMs designed for fast startup and high density. Unikernels take yet another route and strip things down by building specialized images with very little OS surface in the first place.
These tools exist because sharing one kernel is efficient, but it also creates a real risk boundary you have to think about. For trusted internal services, basic hardening is often enough, but for untrusted code, customer workloads, or aggressive multitenancy, stopping at namespaces does not feel serious.
## Virtual Machines Give You a Harder Wall

```text
Container app -> Host kernel  -> Hardware
VM app        -> Guest kernel -> Hypervisor -> Hardware
```
A VM changes the trust boundary in a very simple way: the guest gets its own kernel, while a container shares the host kernel. That is the main difference when it comes to security.
The way this works is through the hypervisor, which sits between the guest and the hardware. Type 1 hypervisors run close to the metal, while Type 2 hypervisors run on top of a host operating system. KVM is a neat thing that turned Linux into a very capable hypervisor by building on hardware virtualization support and fitting that approach cleanly into the kernel.
The old model was trap-and-emulate: the guest tried to run a sensitive instruction, the hypervisor intercepted it, and then handled it safely. Modern CPUs made that better than it used to be, and the core idea is still the same: with a VM, the guest kernel is not the host kernel. That is important when you are running code you do not trust, because in hostile multitenancy a VM boundary is often a safer choice than one shared kernel with a long list of hardening rules. VMs do come with real costs, since they are heavier, slower to boot, and more expensive to patch and operate, but that trade can be reasonable when the alternative is putting untrusted workloads "one kernel bug away" from each other. Containers and VMs are not competing tools, but rather tools for different trust models: containers fit trusted workloads well, and VMs might often be the better option when trust is weak.
## Container Images and Software Supply Chain Security

```Dockerfile
# Bad
FROM ubuntu:latest
RUN apt-get update && apt-get install -y curl

# Better
FROM cgr.dev/chainguard/static:latest
COPY app /app
USER 65532:65532
ENTRYPOINT ["/app"]
```
A container image is really just a root filesystem plus some configuration, but
look at how many hands touch it before it reaches production! Developers build
it, CI systems tweak it, registries store it, admission controllers inspect it,
runtimes pull it, and clusters finally run it, so every one of those steps
becomes part of the supply chain.
OCI standards helped a lot because they gave the ecosystem common formats for
images and distribution, and that removed a lot of friction. But the harder
question is whether the image you are about to run is actually the one you meant
to run, built from the sources you expected, with the dependencies you thought
you were getting. Image layers are great for efficiency, but they also keep
history around, so if a secret gets copied into one layer and removed later, it
may still be there in the image record. If your `Dockerfile` pulls whatever the
package repository happens to serve during the build, without pinning versions
or tracking provenance, you have to factor that whole external system into your
security model.
There is also the build process itself, and `docker build` has quite some
baggage: broad daemon access, messy build contexts, accidental secret leaks,
symlink edge cases, and more host interaction than you'd think. Daemonless or
rootless builds improve that, and so do smaller contexts, cleaner `Dockerfile`s,
and reproducible builds. Useful mental model: a `Dockerfile` is not just
configuration, but executable supply chain policy, so it is worth treating it
with the same care as application code: use minimal base images, pin digests
instead of tags, split build and runtime stages, keep secrets out of the build
context.
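The "pin digests instead of tags" rule is also easy to lint for. A rough sketch of such a check (my own heuristic, not a full `Dockerfile` parser, and it ignores flags like `--platform`):

```python
import re

# Flag FROM lines that reference an image by tag (or no tag) instead of a
# sha256 digest. References to earlier build stages ("FROM builder") are
# skipped, since those are not registry pulls.
def unpinned_from_lines(dockerfile_text: str) -> list[str]:
    stages, flagged = set(), []
    for line in dockerfile_text.splitlines():
        m = re.match(r"\s*FROM\s+(\S+)(?:\s+AS\s+(\S+))?", line, re.IGNORECASE)
        if not m:
            continue
        image, alias = m.group(1), m.group(2)
        if alias:
            stages.add(alias.lower())
        if image.lower() in stages:   # reference to an earlier stage
            continue
        if "@sha256:" not in image:
            flagged.append(line.strip())
    return flagged

print(unpinned_from_lines("FROM ubuntu:latest\nRUN apt-get update"))
```

Dropping something like this into CI is a cheap way to make the `Dockerfile`-as-policy idea enforceable instead of aspirational.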
### Image Security Is Not Just Scanning
```bash
cosign sign my-registry.example.com/myapp@sha256:abc123
cosign verify my-registry.example.com/myapp@sha256:abc123
```
Image security has two sides: build time and deploy time. Build pipelines feel like the natural place to focus, but the image is only safe if you can trust both how it was produced and what actually ends up running in the cluster.
At build time, the big question is provenance: who wrote the Dockerfile, which base image it started from, what packages or artifacts got pulled in, and which builder produced the final image. If things go wrong, the build machine itself becomes the attack surface, and that is a bad position to be in because an attacker does not need to break production directly. They can compromise the build path, poison tomorrow's release, and then wait for you to deploy it.
Once the application is deployed, the question is whether the cluster is running the exact image you approved. This is where tags become a problem: `latest` is just a moving pointer, and even version tags that look stable can be retagged and reused. Digests are the real identity of an image, which is why admission control matters so much: if a deployment points at the wrong image, an unsigned image, or an image from the wrong registry, the cluster should reject it before anything starts. GitOps helps by making desired state visible and reviewable, but it does not enforce safety on its own, so a bad or malicious manifest can still roll out the wrong image in a very clean way. One of the most common mistakes is signing images correctly and then deploying the wrong thing anyway because the manifest uses a mutable tag.
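The deploy-time side can be sketched as a tiny admission rule: require an allowed registry and a digest-pinned reference. The registry name below is a made-up example, and real admission controllers (Kyverno, OPA Gatekeeper, the cosign policy controller) check much more, including signatures:

```python
# Sketch of an admission-style check: accept an image reference only if it
# comes from an allowed registry and is pinned by digest. The registry
# name is a placeholder for this example.
ALLOWED_REGISTRIES = {"registry.internal.example.com"}

def admit(image_ref: str) -> tuple[bool, str]:
    registry = image_ref.split("/", 1)[0]
    if registry not in ALLOWED_REGISTRIES:
        return False, f"registry {registry!r} is not allowed"
    if "@sha256:" not in image_ref:
        return False, "image is not pinned by digest"
    return True, "ok"

print(admit("registry.internal.example.com/myapp@sha256:abc123"))  # (True, 'ok')
print(admit("docker.io/library/ubuntu:latest"))
```

The point is where the check runs: at admission, before anything starts, rather than in a dashboard after the fact.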
### Vulnerability Scanning Helps, but...

```bash
trivy image --severity HIGH,CRITICAL --ignore-unfixed myapp:latest
grype myapp:latest
```
Scanning images for vulnerabilities is useful, but tools aren't perfect. A scanner looks at installed packages, versions, and metadata, then tries to map that to advisory databases, which gives you a list of known issues. That is a helpful signal, but it is not the absolute truth.
A lot of the confusion comes from how messy the underlying data is. Distros often backport fixes without changing versions in the way a naive scanner expects, some CVEs appear in subpackages you do not even use, and some issues get marked as won't-fix for (sometimes debatable) reasons. Package names differ across distros, advisory feeds go stale, scanners disagree with each other, and every now and then they are just wrong. It could happen that a scanner reports a critical issue that your distro already fixed through a backport the scanner does not understand.
Image scanning also covers only part of the risk. A small static image is not automatically safe, because the application still has dependencies, parsing logic, authentication code, and all the usual ways software can fail. The practical approach is to scan regularly, since new advisories show up after build time, to scan in CI/CD, to re-scan images sitting in registries, and to block obviously bad images from reaching deployment. Even then, a green report from the scanner only tells you the tool did not find a known problem on that day.
## Most Container Escapes Start with Bad Choices

```bash
docker run --privileged \
  -v /:/host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  ubuntu
```
Kernel escapes are cool, but most isolation failures in the real world are more basic. Containers often run as root by default, which is a bad decision, especially when the application has no real reason to need it. Then somebody adds `--privileged` because something failed and they wanted the fast fix, mounts a sensitive host directory because it was convenient, and eventually mounts the Docker socket because it is "just for automation". Another list of bad decisions!
Rootless containers are useful, because they reduce privilege on the runtime side as well, not just inside the container. Running as non-root inside the container is also a good step, but on its own it is not enough. Shared namespaces are another easy way to make mistakes, since a container that shares the host network or PID namespace gets a lot more visibility and reach than people realize. Sidecars can create similar issues, especially when a debug sidecar that was supposed to be temporary becomes a permanent tunnel into production.
In many cases, you do not need a sophisticated container escape because the environment has already given away most of what an attacker would want. If a container needs `--privileged`, stop and ask whether a VM, a different runtime, or a different application design would make more sense.
## Network Security Is About Cutting Off Paths

```bash
iptables -A FORWARD -s 10.42.1.0/24 -d 10.42.2.0/24 -j DROP
```
A lot of clusters are set up like flat office networks, and that's no good. Containers get IP addresses through bridges, overlays, or CNI plugins, but once they can talk to each other, the real question is whether that communication should be allowed in the first place.
In container platforms, network security is mostly about cutting off paths that do not need to exist. Layer 3 and 4 controls handle routing and packet filtering, iptables has been the mechanism for a long time, and IPVS improved load balancing in some paths. Kubernetes network policies gave teams a cleaner way to describe allowed traffic, but writing a policy is only half of it. The policy only helps if the network plugin actually supports it and enforces it the way you think it does.
The practical advice is: default deny, restrict both ingress and egress, separate workloads by trust level, and treat DNS, metadata endpoints, and internal control-plane paths as sensitive. Assume that if one service gets compromised, it will try to move sideways as soon as it can. Service meshes can help with identity, traffic policy, and encryption, but they are also a pain to maintain, so only use them if you really need them.
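As a baseline, a default-deny Kubernetes NetworkPolicy looks roughly like this (the namespace name is a placeholder), and it only does anything if the CNI plugin actually enforces NetworkPolicy:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-namespace   # placeholder
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  # no ingress/egress rules listed: all traffic is denied until
  # further policies explicitly allow specific paths
```

From there you add narrow allow rules per workload, which is far easier to reason about than trying to subtract paths from a flat network.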
## TLS Because Internal Networks Are Not Safe by Default

```bash
openssl req -new -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr
openssl x509 -req -in server.csr \
  -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out server.crt -days 365
```
It is wrong to think that traffic inside the cluster is trusted just because it never leaves the internal networks. Internal networks are full of mistakes, shortcuts, broad access, and workloads that may already be compromised.
TLS helps because it gives you confidentiality and integrity, but the more interesting part is identity. A certificate binds a public key to a subject, a CA signs that binding, and the private key has to remain private or the whole model starts to collapse. A CSR is just the way a service asks for that certificate, and during the TLS handshake the two sides validate the certificate chain, prove they control the right keys, negotiate the crypto they will use, and derive session keys for the connection. Mutual TLS takes this one step further by making both sides prove who they are, which is useful in container platforms where service-to-service traffic is constant and trust assumptions can spread quickly.
Short-lived certificates and reliable rotation usually give you something better than just revocation machinery, because they reduce the window of damage and keep the process in check. Ignoring certificate rotation until the week before expiry is not a great security plan.
## Secrets Management in Containers

```bash
mkdir -p /run/secrets
chmod 0700 /run/secrets
printf '%s' "$DB_PASSWORD" > /run/secrets/db_password
chmod 0400 /run/secrets/db_password
```
Secrets are not just configuration: they need confidentiality, integrity, access control, rotation, and some way to audit who touched them and when, because once you lose control of a secret, you have problems.
The worst place to put a secret is inside the image, because once it gets baked in, it tends to get copied, cached, scanned, mirrored, and retained in more places than you can track. The next bad option is often an environment variable, which feels convenient, but that convenience can turn into leakage through logs, crash dumps, process inspection, or debugging tools. Files are usually a better fit, especially when they live on an in-memory filesystem with tight permissions, and pulling secrets over the network is really the best option when the workload has a strong identity and the connection is properly protected.
Kubernetes Secrets are useful, but they don't fix everything. They are API objects that usually end up as environment variables or files, so you still need to protect etcd, enable encryption at rest, lock down RBAC, and limit who can create pods that mount secrets. Root inside the container can usually read mounted secrets too, so avoiding root is part of keeping secret exposure under control.
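The file-over-environment-variable preference is easy to sketch: read the secret from a file and refuse anything that is group- or world-accessible. Paths and values here are placeholders:

```python
import os
import stat
import tempfile

def read_secret(path: str) -> str:
    """Read a secret file, refusing anything group- or world-accessible."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} is too permissive: {oct(mode)}")
    with open(path) as f:
        return f.read().strip()

# Demo with a temp file standing in for /run/secrets/db_password:
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "db_password")
    with open(p, "w") as f:
        f.write("hunter2\n")
    os.chmod(p, 0o400)
    print(read_secret(p))  # hunter2
```

Unlike an environment variable, the value never sits in `/proc/<pid>/environ`, never gets inherited by child processes, and the permission check fails loudly when someone ships the file world-readable.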
## Container Runtime Protection

```yaml
- rule: Terminal shell in container
  condition: container and proc.name in (bash, sh, zsh)
  output: "Shell spawned in container %container.id"
  priority: WARNING
```
Runtime protection is not about stopping every attack, but about making sure you can tell the difference between a normal deploy and the first few minutes of an incident.
First of all, you need a picture of normal behavior. That usually means knowing which image the container came from, which executables it is expected to launch, which files it normally reads or writes, which user ID it should run as, and which other services it should talk to. Once you have that baseline, odd behavior stands out more clearly. A web service spawning bash is strange, a container writing into system paths is strange, and a workload that never talks to the internet suddenly opening outbound connections is strange too. The same goes for a process that moves away from a stable non-root identity and starts running with different privileges.
eBPF-based detectors, syscall monitors, policy engines, and behavior profilers are all trying to answer the same question: is this container doing what this service is supposed to do, or has it drifted into something else? Drift prevention helps too, because if the image is meant to be immutable, then package installs at runtime or new binaries suddenly appearing should feel suspicious right away.
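Stripped to its core, drift detection is just comparing observed behavior against a baseline. A toy sketch using process names only (real detectors like Falco or eBPF tracers watch syscalls, files, and network too; the names below are invented):

```python
# Toy drift check: compare processes observed in a container against the
# baseline set the image is expected to run.
BASELINE = {"myapp", "myapp-worker"}   # expected executables (example names)

def drift_alerts(observed: set[str]) -> list[str]:
    return [f"unexpected process in container: {p}"
            for p in sorted(observed - BASELINE)]

print(drift_alerts({"myapp", "bash"}))
# A web service spawning "bash" is exactly the kind of event to alert on.
```

The hard part in practice is not the comparison, it is building a baseline that is tight enough to catch drift without paging you for every legitimate deploy.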
## Containers Do Not Save You From The OWASP Top 10

```python
query = f"SELECT * FROM users WHERE name = '{user_input}'"
```
Containerization does not solve application security. The old bugs that have been hurting systems for years are still there.
Injection is still injection, broken authentication still gets accounts taken over, and sensitive data exposure still happens when logs, responses, or storage paths are handled poorly. XXE is still there when hostile XML gets parsed badly, broken access control is still one of the fastest ways to lose real data, and XSS still goes after browsers exactly as it always has. Insecure deserialization can still send an application down dangerous code paths, known vulnerable components still make it into production when teams ignore them, and weak logging and monitoring still means you hear about the incident from users or attackers before you hear about it from your own systems.
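The injection example above has a standard fix: parameterized queries, where the driver treats user input strictly as data. A minimal `sqlite3` sketch showing both sides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"   # classic injection payload

# String interpolation: the payload rewrites the query and matches everything.
unsafe_rows = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'").fetchall()
print(unsafe_rows)  # [('alice',)] -- injection succeeded

# Parameterized query: the payload is just a weird name, matches nothing.
safe_rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(safe_rows)    # []
```

No amount of seccomp profiles or network policy saves you from the first version; it has to be fixed in the application code.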
In container platforms, misconfiguration tends to stack instead of staying in one place, and you can get many things wrong all at the same time. The key takeaway here is that containers change where software runs and how quickly you can replace it, but they do not change whether the software itself is broken.
## What To Fix First if Your Setup Is Weak
If the platform is soft today, the best place to start is with the changes that improve safety without adding a huge amount of complexity. You do not need a major security overhaul on day one.
The practical steps go a long way: run containers as non-root, drop capabilities, set memory and PID limits, stop baking secrets into images, pin image digests instead of tags, add admission checks for trusted images, restrict network paths between workloads, and enable seccomp along with no-new-privileges. These are the kinds of controls that reduce risk without forcing every team to become security specialists overnight.
These steps will usually fix many of the issues that cause real incidents, and they will also give you a better picture of what is actually running in the cluster. Once you have that, you can start looking at more advanced controls like gVisor, microVMs, or service meshes, but the basics are where most of the wins are.
## The End!
The main idea is pretty simple: containers are excellent packaging tools, but they are not a hard security boundary out of the box. That is probably the biggest lesson I took away from looking at container security more seriously.
Too much privilege, too much trust in a shared kernel, too much faith in defaults, and not enough attention to what a workload can actually reach are the cause behind a lot of the problems. Good container security starts as a simple recipe: run as non-root, drop capabilities, tighten image controls, restrict networking, improve isolation, watch runtime behavior, handle secrets properly, and build trust into the supply chain.
Containers are great because they are fast, portable, and very good at turning deployment into something boring in a good way. But they are not magic, and they do not erase trust boundaries or make the shared kernel any less shared. They also do not fix weak privilege models, flat networks, sloppy secret handling, or broken application code just because everything is now wrapped in an image and scheduled by a cluster.
