
Docker Background Concepts

January 26, 2019 • ☕️ 6 min read

DISCLAIMER: This post is written by neither a Docker maintainer nor a kernel hacker.

I think of containers as sitting roughly midway between chroot and a VM. Why?

chroot: A command that changes the root directory for the currently running process and its children (if any). The program now enters a jail where it can’t access files and commands outside that directory tree.

How do we do this? The idea is that you create a directory tree into which you copy or link all the system files a process needs to run. Then you use the chroot system call to change the root directory to the base of this new tree and start the process in that chrooted environment. Now it can’t maliciously read or write locations outside the tree, because it can’t even reference paths outside the modified root 🎉 (a process running as root can still break out, so treat chroot as filesystem isolation rather than a full security boundary). Seen logically, this is a primitive form of OS-level virtualization: it lets us create multiple isolated views of the host OS on one kernel. Do it to believe it: let’s make htop, ls, curl, or any other command that isn’t a bash builtin disappear.

$ mkdir -p shit/bin
$ cp /bin/bash ./shit/bin/    # chroot looks for /bin/bash inside the new root
$ ldd /bin/bash    # lists the libraries bash needs; directories can differ, in my case they were /lib/x86_64-linux-gnu and /lib64
$ mkdir -p ./shit/lib/x86_64-linux-gnu/
$ mkdir -p ./shit/lib64/
$ cp /lib/x86_64-linux-gnu/{libncurses.so.5,libtinfo.so.5,libdl.so.2,libc.so.6} ./shit/lib/x86_64-linux-gnu/    # copy whichever libraries ldd listed for you
$ cp /lib64/ld-linux-x86-64.so.2 ./shit/lib64/
$ sudo chroot ./shit /bin/bash    # you need the chroot command, and it needs root privileges
$ pwd    # you should get / and not /home/something/else if the root has been set correctly
$ ls    # should fail with something like "bash: ls: command not found", because only bash builtins exist inside the jail
$ exit    # get out of shit
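
Copying the libraries by hand gets tedious, so the steps above can be sketched as a small script. This is only a rough sketch: build_jail is a made-up helper name, and it assumes ldd prints library paths in its usual "name => /path (address)" form.

```shell
#!/bin/sh
# Sketch: copy a binary plus the shared libraries reported by ldd into a jail.
# build_jail is an illustrative helper name, not a standard tool.
build_jail() {
    jail=$1
    binary=$2
    mkdir -p "$jail$(dirname "$binary")"
    cp "$binary" "$jail$binary"
    # ldd prints lines such as "libc.so.6 => /lib/.../libc.so.6 (0x...)";
    # keep every absolute path and recreate it under the jail.
    ldd "$binary" | grep -o '/[^ ]*' | while read -r lib; do
        mkdir -p "$jail$(dirname "$lib")"
        cp "$lib" "$jail$lib"
    done
}

jail_dir=$(mktemp -d)
build_jail "$jail_dir" /bin/bash
find "$jail_dir" -type f    # shows what ended up inside the jail
```

After that, sudo chroot "$jail_dir" /bin/bash should drop you into the same kind of jail as the manual steps.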

VM: It’s too big a topic to explain in this post. A picture borrowed from the internet gives you an idea of how it virtualizes and stuff.

Components behind Docker:

Namespaces: Say you are running a NATS server, a Go binary, a Filebeat watcher for logs, and other stuff in a single system environment. What about isolation between the services, in case the Filebeat watcher was built by a notorious person called He8he8!? Namespaces provide the isolation that contains this risk. Another simple example: you submit a solution on TopCoder, and it can do anything as long as it compiles and runs. So would you be able to bring their server down? No.

Linux has these 6 namespaces (newer kernels also add cgroup and time namespaces, which are not covered here):

Mount namespaces:- Processes in different mount namespaces can have different views of the filesystem hierarchy. Ever needed to mount a remote filesystem over SSH onto a local system and used something like sshfs? I haven’t looked at the source code, but it presumably ends up using the mount() and umount() system calls, and those operations affect just the mount namespace associated with the calling process.

UTS namespaces:- With uname -n (or as part of uname -a) we get the network node hostname, like hyfr.local. With UTS namespaces, instead of a single system-wide utsname containing the hostname and other params, a process can request its own copy of the UTS info when it is cloned. The data is copied at that point, but later changes made on either side are not seen on the other; only the processes inside the new namespace (the process and its children) see its changes. You can think of it as a virtual server running with its own UTS namespace. In the context of Docker, each container can have its own hostname and NIS (Network Information Service) domain name.

IPC namespaces:- Used to isolate certain interprocess communication resources. Each namespace has its own System V IPC identifiers and its own POSIX message queue filesystem (message queues provide an efficient, priority-driven IPC mechanism that supports multiple readers and writers).

PID (process) namespaces:- Typically processes run in a parent-child hierarchy, in a single process tree rooted at PID 1. A PID namespace lets you spawn your very own PID 1 from any process that is a descendant of the root PID 1, so the same set of PIDs can be reused in different namespaces. In the case of Docker, this beast can be used to suspend/resume the set of processes in a container and to migrate the container to a new host while the processes inside keep the same PIDs! So how is the /proc stuff handled in this case? A /proc mounted inside the namespace shows only the processes that belong to it. You can inspect a process’s namespaces through the entries under /proc/somepid/ns/ (for example pid and pid_for_children).

Network namespaces:- These give each container its own network devices, IP addresses, IP routing tables, /proc/net, etc. How does this help? We can have multiple containerized servers running on the same host machine, each listening on port 80 in its own network namespace.

User namespaces:- Linux assigns a UID to identify each user and the resources it can access. Groups (GIDs) associate users with common resources, much like groups in AWS IAM. With a user namespace, a process can have one UID outside the namespace (say 1) and a different one inside it (say 100). Remember that these UIDs may have different permissions, which in turn affects what the process can do. UIDs are listed in /etc/passwd and GIDs in /etc/group.
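
To see which namespaces a process actually lives in, you can read the symlinks under its /proc ns directory. The snippet below is a minimal sketch; list_namespaces is a made-up helper name, and it assumes a Linux /proc where each entry is a symlink with a target like pid:[4026531836].

```shell
#!/bin/sh
# Print "<type> <id>" for every namespace entry in a /proc/<pid>/ns directory.
# list_namespaces is an illustrative helper name.
list_namespaces() {
    ns_dir=$1
    for entry in "$ns_dir"/*; do
        # Each entry is a symlink whose target looks like "pid:[4026531836]".
        target=$(readlink "$entry")
        printf '%s %s\n' "${target%%:*}" "${target#*:}"
    done
}

# Namespaces of the current shell; two processes share a namespace
# exactly when the ids printed for that type are equal.
list_namespaces /proc/self/ns
```

If util-linux’s unshare is installed and unprivileged user namespaces are enabled on your kernel, running unshare --user --map-root-user --uts sh -c 'hostname demo; hostname' should print demo while the hostname of the host stays untouched.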

Control Groups: CGs are a kernel feature that accounts for, limits, and isolates the CPU, memory, I/O, and network usage of one or more processes. Controllers manage system resources in a tree structure where each control group can have child groups, and several such hierarchies can exist side by side (in cgroups v1, shown here). Why different hierarchies, can’t it be a single tree like processes? No. Each hierarchy is attached to a set of subsystems, where each subsystem does dedicated handling for a particular resource such as CPU time, PIDs, etc. Do it to believe it:

$ cat /proc/cgroups
#subsys_name  hierarchy num_cgroups enabled
cpuset  2 1 1
cpu 3 1 1
cpuacct 3 1 1
memory  0 1 0
devices 4 1 1
freezer 5 1 1
net_cls 6 1 1
blkio 7 1 1
perf_event  8 1 1
net_prio  6 1 1

$ cd /sys/fs/cgroup && ls
blkio  cpu,cpuacct  freezer   net_prio
cpu  cpuset       net_cls   perf_event
cpuacct  devices      net_cls,net_prio  systemd

To create a CG, make a subdirectory under any of the above subsystems and add the PID of a task to its tasks file. For example, here is what the freezer subsystem contains:

$ cd freezer && ls
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs         notify_on_release     tasks

A tasks file is created automatically in every new control group, and we can add PIDs to it. Another way is to use the libcgroup library.
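
Going the other way, you can ask which control groups a process already belongs to by reading /proc/somepid/cgroup, where every line has the form hierarchy-ID:controller-list:path. A rough sketch follows; show_cgroups is a made-up helper name, and the commented mkdir/echo lines assume the cgroups v1 layout shown above plus root privileges.

```shell
#!/bin/sh
# Parse a /proc/<pid>/cgroup style file into labelled fields.
# show_cgroups is an illustrative helper name.
show_cgroups() {
    # Fields are separated by ":"; the controller list is empty on cgroups v2.
    while IFS=: read -r hierarchy controllers path; do
        printf 'hierarchy=%s controllers=%s path=%s\n' \
            "$hierarchy" "${controllers:-none}" "$path"
    done < "$1"
}

show_cgroups /proc/self/cgroup

# Creating a group by hand (as root, cgroups v1 layout assumed):
#   mkdir /sys/fs/cgroup/freezer/demo
#   echo $$ > /sys/fs/cgroup/freezer/demo/tasks
```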

Union file system: Everything in Linux is a file, even processes (/proc/somepid is an example of that). A union file system represents a file system by grouping directories and files into branches, where the branches are stacked on top of each other like cheese slices in a sandwich 🤤 Reads and writes work accordingly. A read starts at the topmost layer and, if the file isn’t found there, falls through to the next layer below, and so on, but writes happen only at the topmost layer. unionfs makes the contents of all the layers appear as if they shared one directory structure, and when a file is copied up to the top layer, that layer mirrors the directory structure of the layers below it. In the case of Docker, these file system layers are images, and images can be layered on top of each other (as described in a Dockerfile).

FROM ubuntu:15.04
COPY . /shit
RUN make /shit
CMD run /shit/shit.bf

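Each of the above instructions contributes to the image, and the ones that change the file system create a new layer on top of the previous one. FROM starts from an Ubuntu base layer -> COPY adds files from the current dir -> RUN builds the app using make -> CMD specifies the command that runs the app (it only records metadata, so it adds no file system content). Each layer is only a small delta on top of the layer below it.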
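
The read rule of a union file system described earlier can be mimicked with plain directories. This is a toy sketch, not how Docker’s storage drivers are implemented: union_lookup and the base/app/top directory names are made up for illustration, and layers are passed from topmost to lowest.

```shell
#!/bin/sh
# union_lookup NAME LAYER...   (layers listed from topmost to lowest)
# Prints the path of NAME in the first layer that contains it.
union_lookup() {
    name=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            printf '%s\n' "$layer/$name"
            return 0
        fi
    done
    return 1
}

demo=$(mktemp -d)
mkdir "$demo/base" "$demo/app" "$demo/top"
echo "from base" > "$demo/base/os-release"
echo "from base" > "$demo/base/config"
echo "from app"  > "$demo/app/config"

# config exists in two layers, so the upper one (app) hides the copy in base.
cat "$(union_lookup config "$demo/top" "$demo/app" "$demo/base")"       # from app
# os-release exists only in base, so the lookup falls through to it.
cat "$(union_lookup os-release "$demo/top" "$demo/app" "$demo/base")"   # from base
```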

Copy on write: Only when a process wants to modify or write to some data does the operating system make a copy of that data for the process to use. Only the process that needs to write gets the copy; all other processes, which just need to read, continue to use the original data. Docker uses this with both images and containers. When we run a container from an image, the Docker engine never makes a full copy of the stored image. Instead, just before a write is performed in the running container, a copy of the affected file is placed in the writable layer of the container. That’s where copy on write plays its role: the copy operation is deferred until the first write. How does this help? Without it, every container spawn would need a copy of the whole image, with the time and space overhead that implies. CoW saves us both time and storage (moohhh)
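
The deferred copy can be imitated with two directories, one standing in for the read-only image and one for the container’s writable layer. It is only a sketch of the idea, not Docker’s actual mechanism, and cow_append is a made-up helper name.

```shell
#!/bin/sh
# cow_append UPPER LOWER NAME TEXT
# Appends TEXT to NAME in the writable layer, copying the file up from the
# read-only layer first if this is the first write to it.
cow_append() {
    upper=$1 lower=$2 name=$3 text=$4
    # Copy-up happens only on the first write, and only for this one file.
    if [ ! -e "$upper/$name" ] && [ -e "$lower/$name" ]; then
        cp "$lower/$name" "$upper/$name"
    fi
    printf '%s\n' "$text" >> "$upper/$name"
}

image=$(mktemp -d)       # stands in for the read-only image layers
container=$(mktemp -d)   # stands in for the container's writable layer
echo "line from image" > "$image/app.log"

cow_append "$container" "$image" app.log "line from container"
cat "$image/app.log"       # unchanged: line from image
cat "$container/app.log"   # line from image, then line from container
```

Files that are never written stay only in the image directory, which is why many containers can share one image cheaply.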

Fun: The letter G looks like a spinning arrow.

The fault in my articles, they don't close themselves.
