Containers & Virtualization

By: Aayush Pokharel

Kathmandu University - Software Freedom Day 2024

About Me

  • Aayush Pokharel
  • KU BSc.CS (2019)
  • Infrastructure Engr.
  • STARTsmall Pvt. Ltd.

Today's Topic of Discussion

  • Pre-requisite: Linux Processes
  • What are Containers?
  • What are Containers made out of?
  • Container == Docker?
    • Docker
  • How do Containers differ from Virtual Machines?

Pre-requisite: Linux Boot stages

  • The boot process is grouped into 4 stages
  1. System startup
  2. Bootloader stage
  3. Kernel stage
  4. Init process

Pre-requisite: init process

  • It is the first user-space process started by the kernel
  • It has a process ID (PID) of 1
  • Traditional (SysV) init works sequentially
  • May block on I/O
  • Modern systems use systemd
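
To check which init a machine is running, a minimal sketch is to read the command name of PID 1 from /proc. The function name `init_name` is illustrative, and the sketch assumes a Linux system with /proc mounted.

```python
from pathlib import Path


def init_name(proc_root="/proc"):
    """Return the command name of PID 1 as reported by the kernel."""
    return Path(proc_root, "1", "comm").read_text().strip()


if __name__ == "__main__":
    # Often "systemd" on a modern distribution; inside a container it is
    # whatever program the container was started with.
    print(init_name())
```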

Pre-requisite: System Calls

  • fork()
  • exec()

are System Calls used for Process Management.

fork()

  • Duplicates the calling process to create a new process
  • After calling fork(), two processes exist
    • Parent process (original)
    • Child process (newly created)
  • Child process is a near-exact copy of the parent process (same code, data, and open files)
  • Child process has a different Process ID (PID)
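
The points above can be sketched with Python's `os.fork()`, a thin wrapper over the fork() system call. The helper name `fork_once` is illustrative, and the sketch assumes a Unix-like system.

```python
import os


def fork_once():
    """Fork the current process; return (parent_pid, child_pid) as seen by the parent."""
    parent_pid = os.getpid()
    pid = os.fork()      # returns 0 in the child and the child's PID in the parent
    if pid == 0:
        # Child: a copy of the parent with its own PID. Exit right away,
        # skipping the parent's cleanup handlers.
        os._exit(0)
    os.waitpid(pid, 0)   # parent: reap the child so it does not linger as a zombie
    return parent_pid, pid


if __name__ == "__main__":
    parent_pid, child_pid = fork_once()
    print(f"parent PID {parent_pid}, child PID {child_pid}")
```

The single call to fork() returns twice, once in each process, which is why the code branches on the return value.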

exec()

  • Replaces the program running in the current process with a new one
  • Loads and executes the new program
  • Does not create a new process (the PID stays the same)

Example Workflow

  1. Parent calls fork().
  2. Child process is created.
  3. Child process calls exec() to run a new program.
  4. Parent continues its execution separately from the child.
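
The workflow above can be sketched in Python with `os.fork()` and `os.execvp()`. The helper `run_in_child` and the pipe used to capture output are illustrative choices, not part of the talk; the sketch assumes a Unix-like system.

```python
import os
import sys


def run_in_child(argv):
    """fork(), then exec() argv in the child; return (child_pid, child_stdout)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()                    # steps 1-2: parent forks, child is created
    if pid == 0:
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)       # child's stdout now feeds the pipe
            os.execvp(argv[0], argv)   # step 3: replace the child with a new program
        finally:
            os._exit(127)              # only reached if exec failed
    os.close(write_fd)                 # step 4: parent carries on independently
    with os.fdopen(read_fd) as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)                 # collect the child's exit status
    return pid, output.strip()


if __name__ == "__main__":
    # The new program reports its own PID; it matches the PID returned by fork().
    child_pid, reported = run_in_child(
        [sys.executable, "-c", "import os; print(os.getpid())"]
    )
    print(child_pid, reported)
```

Because exec() does not create a new process, the PID printed by the new program is the same PID that fork() handed back to the parent.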

Containers

What are containers?

TL;DR

There is no such thing as a "container"*

Fun Fact

  • The Linux kernel does not explicitly reference "containers" in the same way that modern containerization technologies like Docker and Kubernetes use the term.

Then?

Instead, it provides two features

  • Namespaces
  • Control Groups (cgroups)

So, what are Containers?

Where do they come from?

Containers, in layman's terms, are simply Linux processes.

All the magic happens when we use Namespaces and cgroups.

Demo | List all Processes

Type this command in a Linux terminal

ps aux

Demo | Processes in a namespace

sudo unshare --pid --fork --mount-proc bash

Before fork():

[ Terminal Shell ] --> runs unshare

After fork():

[ Terminal Shell ] (parent)
|
[ Child Process ] --> isolated in new namespace

After exec():

[ Terminal Shell ] (parent)
|
[ Bash Shell ] --> runs inside new namespace

Types of Namespaces

  1. Hostname (UTS) Namespace
  2. Process ID (PID) Namespace
  3. File System (Mount) Namespace
  4. Network Interface Namespace
  5. Inter-Process Communication (IPC) Namespace
  6. User Namespace
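
Every process already belongs to one namespace of each type, and the kernel exposes them as symbolic links under /proc/<pid>/ns. A minimal sketch for listing them follows; the function name `namespaces_of` is illustrative, and the kernel's short names (uts, pid, mnt, net, ipc, user) correspond to the six types above. Newer kernels may list additional entries.

```python
import os


def namespaces_of(pid="self"):
    """Map each namespace name (e.g. 'pid', 'net') to its identifier link."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


if __name__ == "__main__":
    for name, ident in namespaces_of().items():
        print(f"{name:10} {ident}")
```

Two processes that show the same identifier for a given name share that namespace; a process inside a container shows different identifiers from a process on the host.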

1. Hostname Namespace

What it does
  • Isolates the hostname and domain name.
  • Containers can have their own unique hostnames.
  • Allows containers to act as separate "hosts" on a network.

2. Process ID (PID) Namespace

What it does
  • Isolates process IDs.
  • Each container starts its processes from PID 1.
  • Containers cannot see or interact with processes on the host or in other containers.

3. File System (Mount) Namespace

What it does
  • Isolates filesystem mount points.
  • Containers have their own root filesystem (/) view.
  • Prevents containers from accessing the host’s or other containers’ file systems.

4. Network Interface Namespace

What it does
  • Isolates network interfaces, IP addresses, and routing tables.
  • Containers can have their own IP addresses and network stacks.
  • Multiple containers can use the same port without conflict.

5. IPC Namespace

What it does
  • Isolates IPC mechanisms (shared memory, semaphores, message queues).
  • Prevents containers from using IPC to communicate with processes on the host or other containers.

6. User Namespace

What it does
  • Isolates user and group IDs (UIDs/GIDs) inside the container.
  • Allows processes inside the container to have different privileges from their host counterparts.

UID Mapping:

  • Maps the root user (UID 0) inside the container to a non-root user (e.g., UID 1001) on the host.
  • The container sees the process as root (UID 0), but the host sees it as a non-root user (e.g., UID 1001 or 100000).
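
The kernel records this mapping in /proc/<pid>/uid_map, one range per line as "inside-start outside-start length". The sketch below translates a UID seen inside the container to the UID the host sees; the function names and the sample range (0 to 100000, length 65536) are illustrative assumptions.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_host_uid(inside_uid, mappings):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for inside_start, outside_start, length in mappings:
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


# Hypothetical mapping: container UIDs 0..65535 map to host UIDs 100000..165535.
mappings = parse_uid_map("0 100000 65536\n")
print(to_host_uid(0, mappings))    # 100000: root in the container, unprivileged on the host
print(to_host_uid(33, mappings))   # 100033
```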

Docker Compose | Container Orchestration

services:
  nginx:
    image: nginx:latest
    container_name: nginx_container
    hostname: nginx-host    
    userns_mode: "host"     
    ipc: "private"          
    pid: "private"          
    network_mode: "bridge"  
    volumes:                
      - ./nginx/html:/usr/share/nginx/html:ro
    ports:
      - "8080:80"           
    restart: unless-stopped
    security_opt:
      - "no-new-privileges:true"  

What are Control Groups?

Where do Control Groups (cgroups) come into play?

Control Groups

  • A Linux kernel feature for managing and limiting resource usage (CPU, memory, I/O, network) of a group of processes.
  • Processes are grouped into hierarchical control groups (cgroups), managed as a unit.
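
A process can see which cgroups it belongs to by reading /proc/<pid>/cgroup, where each line has the form "hierarchy-ID:controller-list:path". The parser below is a minimal sketch with an illustrative function name; the sample lines are made up.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (int(hierarchy_id), controllers.split(",") if controllers else [], path)
        )
    return entries


# cgroup v2 uses a single line with hierarchy ID 0 and an empty controller list.
print(parse_proc_cgroup("0::/user.slice/session-1.scope\n"))
# cgroup v1 has one line per mounted hierarchy.
print(parse_proc_cgroup("4:cpu,cpuacct:/mygroup\n"))
```

On a live system, pass the contents of /proc/self/cgroup to the function.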

How Control Groups Work

1. Resource Management

  • CPU: Control CPU usage (e.g., CPU time shares, CPU core limits).
  • Memory: Limit memory usage (e.g., prevent processes from using more than a certain amount of RAM).
  • Disk I/O: Throttle or prioritize disk read/write operations.
  • Network: Tag or prioritize a group's traffic (net_cls, net_prio) so that traffic-control tools can limit its bandwidth.

2. Hierarchy and Inheritance

  • Cgroups are hierarchical.
  • Child groups inherit resource limits from parent groups and cannot exceed them.
  • Enables fine-grained resource allocation within a hierarchy of processes.
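
One way to picture the hierarchy: the limit that actually applies to a group is the tightest one found on the path from that group up to the root. The sketch below models this with plain dictionaries; the group names, the limits, and the function name are hypothetical, and it does not read real cgroup files.

```python
def effective_limit(group, limits, parents):
    """Return the tightest limit that applies to `group`, or None if unlimited."""
    tightest = None
    while group is not None:
        limit = limits.get(group)
        if limit is not None and (tightest is None or limit < tightest):
            tightest = limit
        group = parents.get(group)   # walk up towards the root
    return tightest


# Hypothetical hierarchy: memory limits in MiB, None means "no limit set here".
parents = {"/": None, "/web": "/", "/web/nginx": "/web"}
limits = {"/": None, "/web": 512, "/web/nginx": 1024}
print(effective_limit("/web/nginx", limits, parents))  # 512
```

Even though /web/nginx asks for 1024 MiB, its parent caps the whole subtree at 512 MiB.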

Create and Manage cgroups

Cgroups can be created and configured with tools such as

  • cgcreate, cgset, cgexec
  • systemd
  • Docker

Resource Limiting Example

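Limit CPU usage to 50% of one CPU (cgroup v1 layout, run as root)

cgcreate -g cpu:/mygroup
# A quota of 50000 µs per default 100000 µs period = 50% of one CPU
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us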
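
The value 50000 follows from quota = fraction × period, with the default CFS period of 100000 µs. A small sketch of that arithmetic, with illustrative function names; the second helper formats the equivalent "quota period" value used by the cgroup v2 cpu.max file.

```python
DEFAULT_PERIOD_US = 100_000  # default CFS period: 100 ms


def cfs_quota_us(cpu_fraction, period_us=DEFAULT_PERIOD_US):
    """Quota in microseconds that allows `cpu_fraction` of one CPU per period."""
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    return round(cpu_fraction * period_us)


def cpu_max_line(cpu_fraction, period_us=DEFAULT_PERIOD_US):
    """Equivalent '<quota> <period>' value for the cgroup v2 cpu.max file."""
    return f"{cfs_quota_us(cpu_fraction, period_us)} {period_us}"


print(cfs_quota_us(0.5))   # 50000, the value written in the example above
print(cfs_quota_us(2.0))   # 200000, i.e. two full CPUs
print(cpu_max_line(0.5))   # 50000 100000
```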

Does Container mean Docker?

What is Docker?

1. Packaging and Distribution
  • Images: Self-contained packages for applications.
  • Docker Hub: Central registry for sharing images.
2. Simplified Application Environment Setup
  • Dockerfile: Text file defining image build process.
  • Build Automation: Automated image building.
3. Container Orchestration
  • Docker Compose: For multi-container applications.
  • Integration with Kubernetes: For large-scale container management.
4. Networking
  • Virtual networks: Simplifying container communication.
  • Service discovery: Automatic DNS-based discovery.
5. Volumes
  • Persistent storage: For data beyond container lifecycles
6. Security Enhancements
  • Seccomp profiles: Restricting system calls.
  • MAC profiles: AppArmor/SELinux.
  • Capabilities: Limiting process privileges.
7. Ease of Use
  • Docker CLI: Simple commands for managing containers, images, networks, and volumes.
8. Cross-Platform Support
  • Docker Desktop: Running Linux containers on Windows and macOS.

Containers Vs. Virtual Machines

Thank you!

This presentation can be found at:

present.aayushpokharel.com/containers_n_virtualization.md

How does Linux Boot?

  • It's a multi-stage process.
  • Dependent on computer architecture
  • But has similar stages and software components

Neso Academy (YouTube) demonstrates the behaviour of fork() and exec().

Typical Usage

  • fork() is often followed by exec() in the child process.
  • This allows a parent process to create a new child process
  • and then the child can execute a new program.

  • --pid: Creates a new PID namespace.
  • --fork: Forks the bash shell as a new child process.
  • --mount-proc: Mounts a new /proc filesystem for the new PID namespace.
  • --net: Isolates the network interfaces.
  • --uts: Isolates the hostname and domain name.
  • --ipc: Isolates inter-process communication.
  • --mount: Creates a new mount namespace.

/proc is a virtual filesystem representing processes.
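
Since every process appears as a numeric directory under /proc, a rough equivalent of `ps` can be sketched by walking that directory. The function name `list_processes` is illustrative; run inside a new PID namespace with a freshly mounted /proc, the same code would only see the processes of that namespace.

```python
import os


def list_processes(proc_root="/proc"):
    """Return {pid: command name} for every process visible under proc_root."""
    processes = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():      # numeric directory names are PIDs
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                processes[int(entry)] = f.read().strip()
        except OSError:
            continue                 # the process exited while we were scanning
    return processes


if __name__ == "__main__":
    print(len(list_processes()), "processes visible")
```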

Effect on containers: Inside a container, the processes appear to be the only processes on the system. The first process in the container will always be PID 1, making it appear like it's the root process of the system, even though it is running alongside other containers on the same host.

Example in containers: Inside a container, a process like nginx might have PID 1, but on the host, it may have a different PID like 23045. The container is isolated from the host’s process tree.
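
The kernel reports both views of such a process in the NSpid field of /proc/<pid>/status, listing its PID in each nested PID namespace from the outermost to the innermost. A minimal parsing sketch follows; the function name and the sample text are illustrative, reusing the PIDs from the example above.

```python
def pids_across_namespaces(status_text):
    """Return the NSpid values from /proc/<pid>/status, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(pid) for pid in line.split()[1:]]
    return []                        # field missing (older kernels)


# Hypothetical status excerpt for nginx, read from the host's /proc.
sample = "Name:\tnginx\nPid:\t23045\nNSpid:\t23045\t1\n"
print(pids_across_namespaces(sample))  # [23045, 1]
```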

Effect on containers: Containers can have their own isolated file systems, and they can see only the files that are part of their environment. This allows containers to have different root directories (/), with access to only specific files, without exposing the entire host filesystem.

Effect on containers: Containers can have their own isolated network stacks. They can have their own IP addresses, ports, and even virtual interfaces. Containers might communicate with each other over a bridge network but remain isolated from the host's networking. This is a key feature that allows multiple containers to run services on the same port (e.g., port 80 for HTTP) without conflict.

Example in containers: Container A could have an IP like 10.0.0.2, while Container B could have 10.0.0.3, and both could use port 80, but their traffic is handled separately from each other and the host.

Effect on containers: This ensures that processes inside a container cannot use IPC to communicate with processes on the host or in other containers, providing isolation for processes that rely on shared memory or other forms of inter-process communication.

Example in containers: If two containers are running separate databases, they can’t interfere with each other by accessing each other's shared memory segments or semaphores.

Docker Compose settings, in file order:

  • Hostname Namespace: Custom hostname inside the container
  • User Namespace: Use the host's user namespace (or "default" for UID/GID remapping)
  • IPC Namespace: Isolated Inter-Process Communication for the container
  • PID Namespace: Isolated process IDs
  • Network Interface Namespace: Isolated network stack using bridge mode
  • File System Namespace: Mount the host directory to container's /usr/share/nginx/html
  • Expose port 80 on container as 8080 on host
  • Prevent privilege escalation inside the container

Why Cgroups Are Important:

  • Resource Isolation: Ensures each group (e.g., container) has its own resource limits, improving performance isolation.
  • Fair Resource Distribution: Allocates resources fairly between groups to avoid resource starvation.
  • Resource Accounting: Tracks resource usage for monitoring and optimization.
  • Security: Prevents resource exhaustion attacks (e.g., Denial of Service) by limiting usage.

3. Subsystems (Controllers)

  • cpu: Controls CPU scheduling and usage.
  • memory: Limits and tracks memory usage.
  • blkio: Controls access to block devices (disk I/O).
  • net_cls: Manages network packet classification.
  • devices: Controls access to hardware devices.

  • cgcreate, cgset, cgexec: Command-line utilities for creating, configuring, and running processes in cgroups.
  • systemd: Used in modern Linux systems for managing services and their associated cgroups.
  • Docker/Kubernetes: Use cgroups internally for resource limitation and isolation in containers.