Understand how Linux containers work with practical examples

Nowadays the vast majority of server workloads run in Linux containers because of their flexibility and light weight, but have you ever wondered how Linux containers actually work? In this tutorial we will demystify them with some practical examples. Linux containers work thanks to two kernel features: namespaces and cgroups.

Linux Namespaces

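Currently the Linux kernel has 8 types of namespaces: mount, UTS, IPC, PID, network, user, cgroup, and time (the time namespace requires kernel 5.6 or later).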

Linux control groups (cgroups)

Container Fundamentals (key technologies)

NOTE: This tutorial was made on a VM with 1 GB of RAM and 1 vCPU running Debian 10 (buster) with kernel 4.19.0-16-amd64

Process namespace fundamentals

List process namespaces

$ lsns -t pid

Get the PID of the current terminal

$ echo $$ # parent PID

Launch a new zsh shell in a new PID namespace

$ unshare --fork --pid --mount-proc zsh
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ top

See the process tree from the parent

$ ps f -g <PPID>
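Each process started inside the new PID namespace has two PIDs: the small one that top shows inside the namespace and the one the host sees. On kernel 4.1 and later, the NSpid line of /proc/<PID>/status lists both, outermost namespace first. A minimal sketch, assuming an illustrative helper named nspid_list (not a standard tool) that reads a status file on standard input:

```shell
# nspid_list: print the PIDs a process has in each nested PID namespace,
# outermost first. Reads the contents of /proc/<PID>/status on stdin.
# (nspid_list is a hypothetical helper name used only for this example.)
nspid_list() {
    awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }'
}

# On the host, for one of the sleep processes (replace <PID> with its host PID):
#   nspid_list < /proc/<PID>/status
```

For a process inside the unshared shell this should print two numbers: the host PID followed by the PID seen from inside the namespace.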

List namespaces

$ lsns -t pid
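lsns gets its information from the symlinks under /proc/<PID>/ns: two processes are in the same namespace when the corresponding links point to the same identifier. The sketch below compares them directly. The helper names ns_id and same_ns are made up for this example, and reading the links of another user's processes needs root.

```shell
# ns_id: print the namespace identifier of a process, e.g. "pid:[4026531836]".
# Usage: ns_id TYPE PID   (TYPE is pid, net, mnt, uts, ipc, user or cgroup)
ns_id() {
    readlink "/proc/$2/ns/$1"
}

# same_ns: succeed if both PIDs live in the same namespace of the given type.
# Returns 2 if one of the links cannot be read.
same_ns() {
    local a b
    a=$(ns_id "$1" "$2") || return 2
    b=$(ns_id "$1" "$3") || return 2
    [ "$a" = "$b" ]
}

# Example, from the host (replace <PID> with the host PID of a sleep process):
#   same_ns pid $$ <PID> && echo "shared" || echo "isolated"
```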

Filesystem — Overlay FS fundamentals

Create directories

$ cd /tmp
$ mkdir {lower1,lower2,upper,work,merged}

Create some files in lower directories

$ echo "Lower 1 - original" > lower1/file1.txt
$ echo "Lower 2 - original" > lower2/file2.txt

Create overlay FS

$ mount -t overlay -o lowerdir=/tmp/lower1:/tmp/lower2,upperdir=/tmp/upper,workdir=/tmp/work none /tmp/merged

Create, modify files

$ cd /tmp/merged
$ echo "file created in merged directory" > file_created.txt
$ echo "file 1 modified" > file1.txt

Unmount overlay FS

$ cd /tmp
$ umount /tmp/merged

Inspect lower and upper dirs

$ find lower1 lower2 upper -name '*.txt' -type f 2>/dev/null | while read -r fn; do echo ">> cat $fn"; cat "$fn"; done
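The output shows why the lower directories are untouched: the merged view resolves every name by looking in the upper directory first and then in the lower directories from left to right, and all writes land in upper. The function below only imitates that lookup order on plain directories. It is a simplified sketch with a made-up name (overlay_lookup), it does not mount anything, and it ignores whiteouts (the markers overlay uses for deleted files).

```shell
# overlay_lookup: print the path that would back NAME in the merged view.
# Usage: overlay_lookup NAME UPPERDIR LOWERDIR...
# Layers are searched in order: upperdir first, then lowerdirs left to right.
# (Hypothetical helper; whiteouts and opaque directories are not handled.)
overlay_lookup() {
    local name=$1 layer
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            echo "$layer/$name"
            return 0
        fi
    done
    return 1
}

# After the steps above:
#   overlay_lookup file1.txt /tmp/upper /tmp/lower1 /tmp/lower2
```

With the files created in this section, file1.txt resolves to the copy in /tmp/upper, while file2.txt still resolves to /tmp/lower2.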

Networking — Linux bridge fundamentals

Create a virtual network bridge

$ ip link add br-net type bridge

List Network Interfaces

$ ip link

Assign an IP Address to bridge interface

$ ip addr add 192.168.55.1/24 brd + dev br-net

Bring UP the bridge interface

$ ip link set br-net up

Create 2 Network Namespaces

$ ip netns add ns1
$ ip netns add ns2

Create a Virtual Ethernet cable pair

$ ip link add veth-ns1 type veth peer name br-ns1
$ ip link add veth-ns2 type veth peer name br-ns2

Assign one end of each veth pair to a namespace and attach the other end to the bridge

$ ip link set veth-ns1 netns ns1
$ ip link set veth-ns2 netns ns2
$ ip link set br-ns1 master br-net
$ ip link set br-ns2 master br-net

Assign IP address to veth within namespaces

$ ip -n ns1 addr add 192.168.55.2/24 dev veth-ns1
$ ip -n ns2 addr add 192.168.55.3/24 dev veth-ns2

Bring UP veth interfaces within Namespaces

$ ip -n ns1 link set veth-ns1 up
$ ip -n ns2 link set veth-ns2 up

Bring UP bridge veth in the local host

$ ip link set dev br-ns1 up
$ ip link set dev br-ns2 up

Configure default route within namespaces

$ ip -n ns1 route add default via 192.168.55.1 dev veth-ns1 
$ ip -n ns2 route add default via 192.168.55.1 dev veth-ns2

Enable IP forward in the host

$ sysctl -w net.ipv4.ip_forward=1

Configure MASQUERADE in the host for 192.168.55.0/24 subnet

$ iptables -t nat -A POSTROUTING -s 192.168.55.0/24 ! -o br-net -j MASQUERADE
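The -s 192.168.55.0/24 match in the MASQUERADE rule selects packets by source address: an address matches when its first 24 bits equal those of the network. The sketch below reproduces that comparison in plain shell arithmetic so you can check which addresses the rule would cover. The function names ip_to_int and in_subnet are invented for this example.

```shell
# ip_to_int: convert a dotted IPv4 address into a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1            # split the address on dots into $1..$4
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_subnet: succeed if ADDRESS falls inside NETWORK/PREFIX.
# Usage: in_subnet 192.168.55.2 192.168.55.0/24
in_subnet() {
    local addr net bits mask
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build the netmask: PREFIX leading one bits, the rest zero.
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

in_subnet 192.168.55.2 192.168.55.0/24 && echo "ns1 traffic is masqueraded"
in_subnet 192.168.55.3 192.168.55.0/24 && echo "ns2 traffic is masqueraded"
```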

Control groups (cgroups) fundamentals

Create cgroups directory

$ mkdir -p /mycg/{memory,cpusets,cpu}

Mount cgroups directory

$ mount -t cgroup -o memory none /mycg/memory
$ mount -t cgroup -o cpu,cpuacct none /mycg/cpu
$ mount -t cgroup -o cpuset none /mycg/cpusets

Create new directories under CPU controller

$ mkdir -p /mycg/cpu/user{1..3}

Assign CPU shares to every user (this example uses 1 vCPU)

# 2048 / (2048 + 512 + 80) = 77%
$ echo 2048 > /mycg/cpu/user1/cpu.shares
# 512 / (2048 + 512 + 80) = 19%
$ echo 512 > /mycg/cpu/user2/cpu.shares
# 80 / (2048 + 512 + 80) = 3%
$ echo 80 > /mycg/cpu/user3/cpu.shares
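The percentages in the comments are simply each group's shares divided by the sum of the shares of all groups competing for the CPU. Shares only matter under contention: a group that is alone on the CPU can still use all of it. The arithmetic can be sketched as follows, with share_pct being a made-up helper name that rounds down to whole percent:

```shell
# share_pct: integer percentage of CPU time a group gets when every group
# is busy. Usage: share_pct MY_SHARES ALL_SHARES...
# (the group's own value must also appear in the ALL_SHARES list)
share_pct() {
    local mine=$1 total=0 s
    shift
    for s in "$@"; do
        total=$((total + s))
    done
    echo $((mine * 100 / total))
}

share_pct 2048 2048 512 80   # user1, prints 77
share_pct 512  2048 512 80   # user2, prints 19
share_pct 80   2048 512 80   # user3, prints 3
```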

Create artificial load

$ cat /dev/urandom &> /dev/null &
$ PID1=$!
$ cat /dev/urandom &> /dev/null &
$ PID2=$!
$ cat /dev/urandom &> /dev/null &
$ PID3=$!

Assign a process to each user

$ echo $PID1 > /mycg/cpu/user1/tasks
$ echo $PID2 > /mycg/cpu/user2/tasks
$ echo $PID3 > /mycg/cpu/user3/tasks
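To confirm that a process really landed in the intended group, read /proc/<PID>/cgroup. On cgroup v1 each line has the form hierarchy-ID:controller-list:path, where the path is relative to the mount point of that hierarchy. A small parser, assuming an illustrative helper name cgroup_path that takes the file on standard input:

```shell
# cgroup_path: print the cgroup path of a process for one controller.
# Usage: cgroup_path CONTROLLER < /proc/<PID>/cgroup
# (Hypothetical helper; expects cgroup v1 lines "ID:controllers:path".)
cgroup_path() {
    awk -F: -v ctrl="$1" '{
        n = split($2, list, ",")
        for (i = 1; i <= n; i++)
            if (list[i] == ctrl) { print $3; exit }
    }'
}

# Example: cgroup_path cpu < /proc/$PID1/cgroup
```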

Monitor the processes

$ top

Create a container from scratch

Download and extract the Debian container filesystem from Docker

$ docker pull debian
$ docker save debian -o debian.tar
$ mkdir debian_layer
$ mkdir -p fs/{lower,upper,work,merged}
$ tar xf debian.tar -C debian_layer
$ find debian_layer -name 'layer.tar' -exec tar xf {} -C fs/lower \;

Create bridge interface

$ ip netns add cnt
$ ip link add br-cnt type bridge
$ ip addr add 192.168.22.1/24 brd + dev br-cnt
$ ip link set br-cnt up
$ sysctl -w net.ipv4.ip_forward=1
$ iptables -t nat -I POSTROUTING 1 -s 192.168.22.0/24 ! -o br-cnt -j MASQUERADE

Create overlay Filesystem from debian container fs

$ mount -vt overlay -o lowerdir=./fs/lower,upperdir=./fs/upper,workdir=./fs/work none ./fs/merged

Mounting Virtual File Systems

$ mount -v --bind /dev ./fs/merged/dev

Launch a shell in a new process namespace, chrooted into fs/merged

$ unshare --fork --pid --net=/var/run/netns/cnt chroot ./fs/merged \
/usr/bin/env -i PATH=/bin:/usr/bin:/sbin:/usr/sbin TERM="$TERM" \
/bin/bash --login +h
# Mount proc within container
$ mount -vt proc proc /proc

Connect the container with br-cnt

$ ip link add veth-cnt type veth peer name br-veth-cnt
$ ip link set veth-cnt netns cnt
$ ip link set br-veth-cnt master br-cnt
$ ip link set br-veth-cnt up
$ ip -n cnt addr add 192.168.22.2/24 dev veth-cnt
$ ip -n cnt link set lo up
$ ip -n cnt link set veth-cnt up
$ ip -n cnt route add default via 192.168.22.1 dev veth-cnt
$ ip netns exec cnt ping -c 3 1.1.1.1

Create a memory cgroup and attach the container to it

$ mkdir /sys/fs/cgroup/memory/cnt
$ echo 10000000 > /sys/fs/cgroup/memory/cnt/memory.limit_in_bytes
$ echo 0 > /sys/fs/cgroup/memory/cnt/memory.swappiness
$ CHILD_PID=$(lsns -t pid | grep "[/]bin/bash --login +h" | awk '{print $4}')
$ echo $CHILD_PID > /sys/fs/cgroup/memory/cnt/tasks
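Once the shell is attached, the kernel accounts its memory in the same directory: on cgroup v1, memory.usage_in_bytes holds the current usage and memory.limit_in_bytes the limit written above. The sketch below turns the two files into a percentage. The helper name mem_usage_pct is an assumption made for this example.

```shell
# mem_usage_pct: percentage of the memory limit currently used by a cgroup.
# Usage: mem_usage_pct /sys/fs/cgroup/memory/cnt   (cgroup v1 memory directory)
# (Hypothetical helper; integer arithmetic, rounds down.)
mem_usage_pct() {
    local usage limit
    usage=$(cat "$1/memory.usage_in_bytes") || return 1
    limit=$(cat "$1/memory.limit_in_bytes") || return 1
    echo $((usage * 100 / limit))
}

# Example, from the host:
#   mem_usage_pct /sys/fs/cgroup/memory/cnt
```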

Run commands within container

$ apt update
$ apt install nginx procps curl -y
$ nginx
$ curl 127.0.0.1:80
$ curl 192.168.22.2:80 # from host
$ cat <( </dev/zero head -c 15m) <(sleep 15) | tail

Clean up

$ umount /proc # within container
$ exit # within container
$ umount -R ./fs/merged
$ ip link del br-veth-cnt
$ ip link del br-cnt
$ ip netns del cnt # grep cnt /proc/mounts

Inspect Namespaces within a docker container

Install docker CE

$ curl -fsSL https://get.docker.com -o install_docker.sh
$ less install_docker.sh # optional
$ sh install_docker.sh
$ usermod -aG docker $USER
$ newgrp docker # Or logout and login

Inspect Docker Network

$ docker network create mynet

Inspect bridge network, see subnet using IP

$ BR_NAME=$(ip link | grep -v '@' | awk '/br-/{gsub(":",""); print $2}')
$ ip addr show ${BR_NAME}

Inspect Docker bridge network, see subnet using docker

$ docker network inspect mynet | grep Subnet

Run an nginx web server

$ docker run --name nginx --net mynet -d --rm -p 8080:80 nginx

Inspect network namespace from nginx container

Create symlink from /proc to /var/run/netns

$ CONTAINER_ID=$(docker container ps | awk '/nginx/{print $1}')
$ CONTAINER_PID=$(docker inspect -f '{{.State.Pid}}' ${CONTAINER_ID})
$ mkdir -p /var/run/netns/
$ ln -sfT /proc/${CONTAINER_PID}/ns/net /var/run/netns/${CONTAINER_ID}

Check network interface within namespace

$ ip netns list
$ ip -n ${CONTAINER_ID} link show eth0

Check IP address of nginx container

$ ip -n ${CONTAINER_ID} addr show eth0
$ docker container inspect nginx | grep IPAddress

Check port forwarding from 8080 to 80

$ iptables -t nat -nvL

Inspect cgroups in a docker container

$ docker run --name test_cg --memory=10m --cpus=.1 -it --rm ubuntu

See cgroup fs hierarchy

$ CONTAINER_ID=$(docker container ps --no-trunc | awk '/test_cg/{print $1}')
$ tree /sys/fs/cgroup/{memory,cpu}/docker/${CONTAINER_ID}

See the tasks attached to the container cgroup

$ docker container top test_cg | tail -n 1 | awk '{print $2}' # container parent PID
$ cat /sys/fs/cgroup/{memory,cpu}/docker/${CONTAINER_ID}/tasks # the same as container parent PID
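The --memory=10m and --cpus=.1 flags end up as plain numbers in files of those same directories: memory.limit_in_bytes for the memory limit, and cpu.cfs_quota_us together with cpu.cfs_period_us for the CPU limit. The sketch below computes the values you should expect to find there, assuming Docker's default CFS period of 100000 microseconds. The helper names mem_to_bytes and cpus_to_quota are made up for this example.

```shell
# mem_to_bytes: convert a Docker-style size such as 10m into bytes.
# Only the b, k, m and g suffixes (powers of 1024) are handled here.
mem_to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *[bB]) echo "${1%?}" ;;
        *)     echo "$1" ;;
    esac
}

# cpus_to_quota: convert a --cpus value into a CFS quota in microseconds,
# assuming the default period of 100000 microseconds.
cpus_to_quota() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f\n", cpus * 100000 }'
}

mem_to_bytes 10m     # compare with memory.limit_in_bytes
cpus_to_quota .1     # compare with cpu.cfs_quota_us
```

A quota of 10000 out of a period of 100000 means the container may run for 10 ms in every 100 ms window, which is the 10% of one CPU requested with --cpus=.1.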

Monitoring the container

$ docker container stats test_cg

Generate CPU load

$ cat /dev/urandom &> /dev/null

Generate Memory load

$ cat <( </dev/zero head -c 50m) <(sleep 30) | tail

Inspect overlay fs in a docker container

$ docker run --name test_overlayfs -it --rm debian

NOTE: The merged layer is the actual container Filesystem

Inspect lower layers with tree and less

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.LowerDir}}' | awk 'BEGIN{FS=":"}{for (i=1; i<= NF; i++) print $i}' | while read low; do tree -L 2 $low; done | less

Inspect upper layer (It’s empty)

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.UpperDir}}' | while read upper; do tree $upper; done | less

Run a command within the container

$ apt update && apt install nmap -y

Inspect (again) upper layer (now it’s not empty)

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.UpperDir}}' | while read upper; do tree $upper; done | less

Inspect docker process namespace

$ docker run --name test_ps -it --rm ubuntu

Launch process within container

$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ top

See the container process tree from the host

$ CONTAINER_PID=$(docker container top test_ps | sed -n '2p' | awk '{print $2}')
$ ps f -g ${CONTAINER_PID}

List PID namespaces

$ lsns -t pid

See the processes using docker

$ docker container top test_ps

Conclusion

Source code
