Overview: Why Study Linux Internals?
Linux internals are the mechanisms beneath the familiar commands and configuration files: how the kernel represents processes, memory, files, and resources; how it isolates workloads; how it enforces limits and performs scheduling.
This chapter gives you a conceptual map of the major internal subsystems you’ll explore in the following Linux Internals chapters:
- Process lifecycle
- Memory management
- Signals and IPC
- Namespaces
- cgroups
You’ll see how these pieces fit together at a high level, with enough detail to recognize what belongs to which subsystem and how to investigate behavior on a running system. Deep dives into each topic are in the dedicated child chapters.
User space vs Kernel space
Linux is structurally split into two privilege domains:
- User space
- Where regular processes run (your shells, servers, tools).
- Limited privileges; cannot access hardware directly.
- Interacts with the kernel via system calls.
- Kernel space
- Highly privileged; runs the Linux kernel and its modules.
- Direct access to hardware devices and memory.
- Responsible for process management, memory, I/O, networking, and security enforcement.
Transitions between user space and kernel space are tightly controlled:
- A user-space program invokes a system call (e.g. `read`, `write`, `fork`, `execve`, `clone`).
- The CPU switches from user mode to kernel mode.
- The kernel performs the requested operation, enforcing permissions and limits.
- Control returns to user space, usually with a return value in a CPU register.
Being able to think in terms of user vs kernel space is essential when debugging performance or strange behavior: is the issue in your application code, in how it calls the kernel, or in how the kernel handles the request?
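The round trip described above can be observed from user space. A minimal Python sketch (assuming a Linux system; the path used is deliberately nonexistent and purely illustrative): the request crosses into kernel mode, the kernel's permission and path checks fail, and control returns with an error code that libc and Python surface as an exception.

```python
import errno
import os

# Step 1: user space asks the kernel to open a file (the open/openat syscall).
# Step 2: the CPU enters kernel mode; the kernel resolves the path and checks permissions.
# Step 3: control returns to user space with a result: a file descriptor on
# success, or -1 plus an errno value on failure (surfaced in Python as OSError).
try:
    fd = os.open("/this/path/should/not/exist", os.O_RDONLY)
    os.close(fd)
    errno_seen = None
except OSError as e:
    errno_seen = e.errno  # set by the kernel, propagated by libc

# ENOENT: "No such file or directory" -- the kernel rejected the request.
print(errno_seen == errno.ENOENT)
```

Running the same snippet under `strace` shows the underlying `openat` call and its `-1 ENOENT` return directly.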
System Calls and the API Surface
From user space, Linux is mainly experienced through:
- System calls (syscalls) – the narrow privileged interface to the kernel.
- The C library (libc) – wraps syscalls and provides standard APIs (`fopen`, `printf`, etc.).
- Virtual filesystems – like `/proc` and `/sys`, which expose kernel state in a file-like form.
Typical relationships:
- `fopen()` → eventually calls the `openat()` syscall.
- `malloc()` → uses `brk()` and/or `mmap()` syscalls under the hood.
- `pthread_create()` → often leads to `clone()` and related syscalls.
You can inspect which syscalls a program uses:
- With `strace`:
  `strace -f -o trace.log ./your_program`
- By reading the manual pages: `man 2 open`, `man 2 clone` (section 2 covers syscalls).
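The layering can also be seen from a high-level language. A small sketch (assuming CPython on Linux): Python's `os.getpid()` and libc's `getpid()` wrapper, reached here via `ctypes`, are two different API surfaces that end in the same `getpid` system call, so they must agree.

```python
import ctypes
import os

# Load the C library already mapped into this process.
libc = ctypes.CDLL(None)

# Same kernel operation through two layers:
#   os.getpid()   -> CPython's wrapper around libc
#   libc.getpid() -> libc's wrapper around the getpid syscall
raw_pid = libc.getpid()
wrapped_pid = os.getpid()
print(raw_pid == wrapped_pid)
```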
In later internals chapters, you’ll see that almost every key concept (process creation, memory mapping, IPC, namespaces, cgroups) is expressed via specific syscalls.
Core Kernel Abstractions
Linux provides a relatively small set of powerful abstractions:
- Task (process / thread)
- Address space
- File descriptor
- Virtual filesystem and inodes
- Namespaces
- Control groups (cgroups)
Understanding these abstractions helps you reason about behavior across many tools and subsystems.
Tasks: Processes and Threads
Internally, Linux represents everything that executes as a task (often referred to as a `task_struct` in kernel code).
From a user-space perspective:
- A process has:
- Its own virtual address space.
- One or more threads of execution.
- A thread shares the same address space and a lot of state with its sibling threads, but has its own registers and stack.
Linux exposes tasks via:
- `/proc/<pid>/` – per-task views of memory maps, open files, limits, and more.
- Tools like `ps`, `top`, `htop`, `ps -L`, `ps -T` (threads), and `pidstat`.
In the Process lifecycle chapter you’ll see how these tasks are created, scheduled, and destroyed, and how they move through states like running, sleeping, and zombie.
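As a quick taste of the per-task view, here is a minimal sketch (assuming a Linux system with procfs mounted) that parses a few fields from `/proc/self/status`, the kernel's summary of the current task:

```python
import os

# /proc/self/status is a colon-separated key/value view of the current task,
# generated by the kernel on every read.
fields = {}
with open("/proc/self/status") as f:
    for line in f:
        key, _, value = line.partition(":")
        fields[key] = value.strip()

# Name of the executable, kernel-assigned PID, and thread count of this task.
print(fields["Name"], fields["Pid"], fields["Threads"])
```

The `Pid` field matches what `getpid()` returns, and `Threads` counts the sibling tasks sharing this address space.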
Address Spaces and Virtual Memory
Each process sees its own virtual address space:
- The kernel maps this virtual space to actual physical memory pages.
- Some regions are file-backed (`mmap` of shared libraries), others anonymous (heap, stack).
Conceptually:
- Code (text segment)
- Data/BSS (global variables)
- Heap (grows upwards)
- Stack (grows downwards)
- Memory-mapped regions (shared libs, mapped files)
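These regions are visible per process. A small sketch (assuming Linux) that scans `/proc/self/maps`, where each line describes one mapping (address range, permissions, offset, device, inode, and backing path or a label like `[heap]` / `[stack]`):

```python
# Collect the backing path (or label) of every region in this process's
# virtual address space.
regions = []
with open("/proc/self/maps") as f:
    for line in f:
        parts = line.split()
        # Anonymous mappings have no path field.
        path = parts[5] if len(parts) > 5 else "(anonymous)"
        regions.append(path)

has_stack = "[stack]" in regions
has_heap = "[heap]" in regions
print("stack mapped:", has_stack, "| heap mapped:", has_heap)
```

The same data, aggregated with sizes and residency, is what `pmap` and `/proc/<pid>/smaps` present.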
The Memory management chapter explores:
- How pages are allocated and reclaimed.
- Page caches, swapping, overcommit behavior.
- Tools like `/proc/<pid>/smaps`, `pmap`, `vmstat`, and `perf`.
File Descriptors and the Unified I/O Model
Linux uses a unified model for I/O: “everything is a file” is not literally true, but everything looks like a file descriptor:
- Regular files
- Directories (through syscalls like `openat` and `getdents`)
- Sockets
- Pipes and FIFOs
- Character and block devices
- `eventfd`, `signalfd`, `timerfd`, and other kernel interfaces
In user space, these show up as integers: 0, 1, 2 (stdin, stdout, stderr), and higher numbers for additional descriptors.
Key internals points:
- The kernel tracks an open file table, file positions, and permission checks.
- Many “special” kernel interfaces live under `/proc` and `/sys` and are used via normal read/write operations.
- Multiplexing mechanisms like `select`, `poll`, and `epoll` operate over file descriptors.
This abstraction is central when later chapters talk about signals and IPC (pipes, sockets, eventfd) and performance.
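A pipe is the simplest demonstration of the unified model: two small integers, one kernel buffer, and the same `read`/`write` syscalls used for regular files. A minimal sketch:

```python
import os

# os.pipe() asks the kernel for two connected file descriptors:
# r (read end) and w (write end).
r, w = os.pipe()

# Bytes written to w are buffered inside the kernel...
os.write(w, b"hello via kernel buffer")
os.close(w)  # closing the write end signals EOF to the reader

# ...and read back from r with an ordinary read() syscall.
data = os.read(r, 4096)
os.close(r)
print(data)
```

The descriptors returned here are indistinguishable, from the program's point of view, from those naming files or sockets; `select`/`poll`/`epoll` would accept them just the same.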
Virtual Filesystems: `/proc` and `/sys` as Views Into the Kernel
Two special pseudo-filesystems expose internal kernel state:
- `/proc` – process and general kernel information
  - `/proc/<pid>/cmdline`, `/proc/<pid>/status`, `/proc/<pid>/fd/`
  - `/proc/meminfo`, `/proc/cpuinfo`, `/proc/sys/`
- `/sys` – sysfs; hierarchical view of devices, drivers, and subsystems
  - `/sys/block/`, `/sys/class/net/`, `/sys/fs/cgroup/`, `/sys/devices/`
These are not stored on disk. Instead:
- Reading a file like `/proc/meminfo` triggers kernel code that generates the contents dynamically.
- Writing to specific `/proc/sys` or `/sys` paths can change kernel parameters at runtime (subject to permission checks).
Internals-wise, virtual filesystems allow the kernel to expose state and controls uniformly as files and directories, making it easy for tooling and scripting.
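Because these files answer to ordinary `read` calls, any language's file API can query kernel state. A minimal sketch (assuming Linux) that parses `/proc/meminfo`, whose contents are generated at read time rather than stored on disk:

```python
# Each line of /proc/meminfo is "Key:   value kB" (a few keys are unitless).
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, _, rest = line.partition(":")
        meminfo[key] = rest.strip()

# Total usable RAM as the kernel reports it right now.
print("MemTotal =", meminfo["MemTotal"])
```

Tools like `free` and `vmstat` are, at heart, formatters over exactly this kind of pseudo-file.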
Scheduling, Latency, and Preemption
Linux uses preemptive multitasking:
- The scheduler decides which task runs on which CPU at a given time.
- Tasks have priorities and scheduling policies (e.g. `SCHED_NORMAL`, `SCHED_FIFO`, `SCHED_RR`, `SCHED_DEADLINE`).
- On multiprocessor systems, tasks can migrate between CPUs unless pinned.
Important notions for internals:
- Context switch – saving the CPU state of one task and loading another’s.
- Preemption – a running task can be interrupted so another can run.
- Kernel preemption – whether the kernel itself can be preempted in the middle of handling a syscall or interrupt depends on configuration (e.g. `PREEMPT`, or `PREEMPT_RT` for real-time).
The scheduler interacts with almost every internal subsystem:
- CPU-bound tasks compete for CPU time.
- I/O-bound tasks sleep while waiting for I/O completion and then wake up.
- cgroups can control CPU shares and quotas per group of tasks (covered in the cgroups chapter).
Tools often used when exploring these internals include `schedstat`, `perf sched`, and tracing frameworks like `ftrace` or `bpftrace`.
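The scheduling attributes of a task can be read directly from user space. A small sketch (assuming Linux; `0` means "the calling process") querying the policy and CPU affinity mask via the `sched_getscheduler` and `sched_getaffinity` syscalls exposed by Python's `os` module:

```python
import os

# Scheduling policy of the current task; SCHED_OTHER is the kernel's
# default time-sharing policy (what the docs call SCHED_NORMAL).
policy = os.sched_getscheduler(0)

# The set of CPUs this task is allowed to run on. Unless pinned
# (e.g. with taskset or sched_setaffinity), this covers all online CPUs.
affinity = os.sched_getaffinity(0)

print("policy:", policy, "| allowed CPUs:", sorted(affinity))
```

Shrinking the affinity set with `os.sched_setaffinity` is the programmatic equivalent of pinning a task with `taskset`.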
Resource Limits, Capabilities, and Security Hooks
Linux enforces safety and isolation via:
- Resource limits (rlimits) – per-process caps on things like:
- Maximum number of open file descriptors (`RLIMIT_NOFILE`)
- Maximum stack size (`RLIMIT_STACK`)
- Maximum CPU time (`RLIMIT_CPU`)
- Linux capabilities – fine-grained privileges split from the all-powerful root user ID.
- Example capabilities: `CAP_NET_BIND_SERVICE`, `CAP_SYS_ADMIN`, `CAP_NET_ADMIN`.
- Tools: `capsh`, `getcap`, `setcap`.
- Security frameworks – e.g. SELinux, AppArmor, seccomp-bpf.
- These add hooks into the kernel’s decision points: “may this task open that file?”, “may it call this syscall?”, and so on.
While the detailed configuration is covered elsewhere in the course, from an internals perspective:
- The kernel maintains per-task credential structures (UIDs, GIDs, capabilities).
- Access checks run at syscall boundaries and filesystem operations.
- With seccomp, the kernel can allow/deny syscalls based on user-defined BPF filters.
Understanding that these checks are centralized in the kernel helps explain why many operations “fail” with EPERM or EACCES even when they look legal from an application’s perspective.
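Rlimits in particular are easy to inspect and adjust from user space. A minimal sketch (assuming Linux) using Python's `resource` module, which wraps the `getrlimit`/`setrlimit` syscalls:

```python
import resource

# RLIMIT_NOFILE: the cap on open file descriptors for this process.
# The soft limit is what the kernel enforces now; the hard limit is the
# ceiling an unprivileged process may raise its soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft, "hard:", hard)

# An unprivileged process may move its soft limit anywhere up to `hard`;
# exceeding the soft limit later makes open()/socket() fail with EMFILE.
resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

This is the same mechanism `ulimit -n` manipulates in a shell.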
Namespaces: Multiple Views of the Same Kernel
Namespaces let the kernel provide isolated views of global resources to different sets of tasks:
- PID namespaces – separate process ID trees.
- Mount namespaces – separate filesystem mount views.
- UTS namespaces – separate hostnames and domain names.
- Network namespaces – separate network stacks.
- IPC, user, and cgroup namespaces.
Key internal idea:
- The kernel still runs a single instance, but tasks are attached to one or more namespaces.
- Operations like “show me all processes” or “list network interfaces” look at the namespace context of the caller.
Namespaces are crucial for containers. The Namespaces chapter will cover:
- How namespaces are created (via `clone`, `unshare`, `setns`).
- How they interact with other subsystems like cgroups and capabilities.
cgroups: Controlling and Accounting Resources
Control groups (cgroups) are how Linux:
- Accounts resource usage per group of tasks.
- Enforces limits.
- Applies priorities and constraints.
They are arranged hierarchically:
- A cgroup hierarchy attaches a specific controller (e.g. `cpu`, `memory`, `io`).
- Each cgroup in that hierarchy has tasks and configuration files (in `/sys/fs/cgroup/`).
- Resource policies are inherited downwards unless overridden.
From an internals perspective:
- Each controller hooks into relevant kernel subsystems (scheduler, memory manager, I/O scheduler).
- Decisions about “who gets memory” or “who runs on CPU now” are informed by the cgroup configuration attached to each task.
The cgroups chapter will explain:
- The cgroup v2 unified hierarchy.
- How tools like `systemd` and container runtimes manage cgroups.
- How to inspect and debug resource policies.
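Which cgroup a given task belongs to is, again, a procfs read away. A small sketch (assuming a Linux system with cgroups enabled): under the cgroup v2 unified hierarchy `/proc/self/cgroup` is a single line of the form `0::<path>`, while v1 systems show one line per mounted hierarchy.

```python
# Each non-empty line has the form "hierarchy-id:controllers:path".
with open("/proc/self/cgroup") as f:
    memberships = [line.strip() for line in f if line.strip()]

print(memberships)
```

Joining the reported path onto `/sys/fs/cgroup/` leads to the configuration files (e.g. `memory.max`, `cpu.weight` on v2) that govern this task's resource policy.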
Signals and IPC in the Kernel
Linux supports several inter-process communication (IPC) mechanisms:
- Signals – lightweight asynchronous notifications (e.g. `SIGTERM`, `SIGKILL`, `SIGCHLD`).
- Pipes and FIFOs – unidirectional byte streams.
- UNIX domain sockets – local endpoint communication, can pass file descriptors.
- POSIX shared memory and semaphores.
- Message queues, netlink sockets, and more.
At the internals level:
- The kernel maintains per-task signal queues and delivery rules.
- For some IPC (pipes, sockets), data is moved via kernel buffers.
- For shared memory, processes map the same physical pages into their virtual address spaces.
The Signals and IPC chapter will explain:
- How signal delivery interacts with process states (sleeping, running).
- How synchronous vs asynchronous signals behave.
- How to trace IPC and signals with tools like `strace`, `perf`, `bpftrace`, and `/proc`.
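Signal delivery can be demonstrated in a few lines. A minimal sketch (assuming Linux): the process installs a handler, asks the kernel to queue `SIGUSR1` against itself via `kill`, and the handler runs when control returns to user space.

```python
import os
import signal

received = []

def handler(signum, frame):
    # Invoked asynchronously when the kernel delivers the signal.
    received.append(signum)

# Register the handler, then send SIGUSR1 to ourselves via the kill syscall.
signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)

print(received == [signal.SIGUSR1])
```

Watching the same program under `strace` shows the `kill` call followed by the kernel's `--- SIGUSR1 ---` delivery notation.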
Observability: Looking Inside Without Crashing the System
Linux offers rich observability mechanisms to inspect internals without modifying kernel code:
- `/proc` and `/sys` (as discussed earlier).
- Tracing frameworks:
- `ftrace` – in-kernel function and event tracer.
- `perf` – performance and profiling interface.
- eBPF-based tools – dynamic instrumentation of kernel and user space with BPF programs.
- Debugging aids:
- `sysrq` (magic SysRq key) for emergency kernel actions.
- Kernel logs (`dmesg`, `journalctl -k`).
From an internals perspective, these mechanisms use:
- Dedicated circular buffers (e.g. ring buffers for tracing).
- Hook points in the scheduler, IRQ handlers, and key subsystem paths.
- BPF verifier and JIT compilation to run small, safe programs in the kernel.
Being comfortable with these tools is essential when you start exploring process lifecycle, memory behavior, and resource isolation in depth.
How the Internals Topics Connect
The remaining chapters in this Linux Internals section zoom in on specific subsystems:
- Process lifecycle
- How tasks are created, scheduled, and destroyed.
- State transitions and relationships between processes.
- Memory management
- Page tables, caches, swapping.
- How the kernel decides what stays in RAM.
- Signals and IPC
- How asynchronous events and communication primitives are implemented.
- Namespaces
- How isolated views of global resources are created and maintained.
- cgroups
- How resource accounting, limits, and priorities are enforced.
Together, these topics form a coherent picture:
- A task (process/thread) exists in certain namespaces, with specific credentials and capabilities, attached to some cgroups, using memory and I/O, and communicating via signals and IPC.
- The kernel mediates all of this, enforcing limits and providing observability.
As you go through the child chapters, try to always tie the details back to these core abstractions: tasks, address spaces, file descriptors, namespaces, and cgroups, all under the control of a single running kernel.