7.5.1 Process lifecycle

Overview of the Process Lifecycle

In Linux, a process moves through a series of well-defined phases: creation, execution, possible blocking, and termination, plus various intermediate states (zombie, stopped, traced). This chapter focuses on how these states are represented and managed in the kernel, and how user space can observe and influence them.

High-level pipeline:

$$
\text{fork/clone/vfork} \rightarrow \text{execve} \rightarrow \text{running/blocked} \rightarrow \text{exit} \rightarrow \text{reaped}
$$

Each step has specific kernel data structures, syscalls, and observable side effects.

Process Creation

`fork()` and copy-on-write

The classic way to create a new process is fork(), which duplicates the calling process.

The child gets:

A copy of the parent’s virtual address space (implemented via copy-on-write pages, not a full physical copy).
Duplicated file descriptors (same underlying open file objects).
Same current working directory, environment, signal dispositions, etc.

Differences:

Child gets a new PID.
Child’s ppid (parent PID) is set to the parent’s PID.
Return value of fork():

Parent: returns child PID (> 0)
Child: returns 0
Failure: returns -1 in the parent and sets errno; no child is created

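Copy-on-write (CoW) means both processes initially share physical pages marked as read-only. On the first write by either process, the resulting page fault makes the kernel copy the page into a newly allocated one and update the writer’s page table, so the other process never sees the change.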
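The return-value convention and the isolation that copy-on-write provides can be sketched with Python's thin os.fork() wrapper. This is a minimal illustration, not a reference implementation; fork_demo and the value list are hypothetical names chosen for this example.

```python
import os


def fork_demo():
    """Fork once; return (child_pid_seen_by_parent, exit_code, parent_value)."""
    value = [0]                      # present in both address spaces after fork
    pid = os.fork()                  # raises OSError on failure (the C API returns -1)
    if pid == 0:
        # Child: fork() returned 0. This write lands in the child's private
        # copy of the page, so the parent never observes it.
        value[0] = 99
        os._exit(7)                  # leave immediately, skipping Python cleanup
    # Parent: fork() returned the child's PID (> 0).
    _, status = os.waitpid(pid, 0)   # reap the child
    return pid, os.WEXITSTATUS(status), value[0]


if __name__ == "__main__":
    child_pid, code, seen = fork_demo()
    print(f"child {child_pid} exited with {code}; parent still sees value={seen}")
```

The parent keeps seeing 0 even though the child wrote 99, because the write triggered a private page copy in the child only.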

`vfork()`

vfork() is an optimization for the common pattern:

fork()
execve() in the child soon after

Differences:

Parent is suspended until the child calls execve() or _exit().
Child runs in the parent’s address space (no separate copy yet).
Child must not:

Modify variables that the parent will use later.
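Return from the function that called vfork(), or call anything other than _exit() or an exec-family function; either can corrupt the stack that the parent resumes on.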

With copy-on-write, fork() on modern Linux is cheap enough that vfork() is rarely necessary, but it still exists and is used in some performance-critical code (such as shells or posix_spawn() implementations).
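Python exposes no vfork() wrapper, so the closest safe illustration of the "create a child that immediately execs" pattern is os.posix_spawn() (Python 3.8+). Whether the underlying C library implements it with a vfork-style clone is an implementation detail and is only assumed here; spawn_and_wait is a hypothetical helper name.

```python
import os


def spawn_and_wait(argv):
    """Start argv[0] as a new process via posix_spawn() and return its exit code."""
    # posix_spawn combines "create child" and "exec new program" in one call,
    # so no caller-supplied code ever runs in the child before the exec.
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(spawn_and_wait(["/bin/sh", "-c", "exit 3"]))
```

Because nothing runs between creation and exec, this style avoids the restrictions that make raw vfork() easy to misuse.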

`clone()` and process vs thread semantics

clone() is the low-level primitive backing both processes and threads. It allows fine-grained control over what is shared with the parent, via flags like:

CLONE_VM — share the same memory space.
CLONE_FS — share filesystem info (cwd, root).
CLONE_FILES — share file descriptor table.
CLONE_SIGHAND — share signal handlers.
CLONE_PARENT — share the same parent as caller.
CLONE_THREAD — put the new task in the same thread group.

Threads in Linux are just tasks (kernel task_structs) that share certain resources. A “thread” is typically created by:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | ...)

This yields:

Same thread group (same TGID, which is the value getpid() returns in every thread).
Distinct TIDs (thread IDs) visible as /proc/<tgid>/task/<tid>.

User-space thread libraries (e.g. glibc’s pthread_create()) wrap clone() to provide POSIX threads.
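The shared-TGID, distinct-TID relationship is easy to observe from Python, whose threading module sits on top of POSIX threads. The sketch below assumes Python 3.8+ (for threading.get_native_id()); thread_ids is a hypothetical helper.

```python
import os
import threading


def thread_ids(n=3):
    """Run n threads; return a list of (pid, tid) pairs, one per thread."""
    results = []
    lock = threading.Lock()
    barrier = threading.Barrier(n)   # keep all threads alive at the same time

    def worker():
        ids = (os.getpid(), threading.get_native_id())
        with lock:
            results.append(ids)
        barrier.wait()               # no thread exits before all have reported

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for pid, tid in thread_ids():
        print(f"getpid()={pid}  tid={tid}")
```

Every thread reports the same getpid() value (the TGID), while each native thread ID is different.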

Executing a New Program: `execve()`

After fork()/clone(), the child often calls execve() (or a wrapper like execl(), execvp()):

int execve(const char *filename, char *const argv[], char *const envp[]);

What execve() does:

Replaces the calling process’s memory image with a new program:

Old code, data, heap, stack mappings are removed.
New program image (ELF binary, interpreter for scripts) is mapped.

PID does not change: same process identity, new program.
Preserves:

Open file descriptors (unless FD_CLOEXEC is set).
Signal dispositions that are default or ignored (caught signals are reset to their default action, since the handler code is gone).
UID/GID (subject to setuid/setgid bits).
Current working directory and environment (unless modified).

From the lifecycle point of view:

fork() → “new process with same program”
execve() → “same process, new program”

From user space, execve() is all-or-nothing: on success it never returns and the new image is fully in place; on failure it returns -1 with errno set and the old image keeps running. No half-replaced program is ever visible.
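The "same process, new program" property can be checked directly: fork a child, let it exec a shell, and have the shell print its own PID. This is a sketch under the assumption that /bin/sh exists; exec_keeps_pid is a hypothetical name.

```python
import os


def exec_keeps_pid():
    """Fork, exec /bin/sh in the child, and return (child_pid, pid_reported_by_sh)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)     # the new program inherits this as stdout
            # "$$" expands to the shell's own PID, i.e. the PID after the exec.
            os.execve("/bin/sh", ["sh", "-c", "echo $$"], {"PATH": "/bin:/usr/bin"})
        finally:
            os._exit(127)            # only reached if execve() failed
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = int(pipe.read().strip())
    os.waitpid(pid, 0)
    return pid, reported


if __name__ == "__main__":
    forked, reported = exec_keeps_pid()
    print(f"PID from fork(): {forked}, PID seen by the exec'd shell: {reported}")
```

The descriptors returned by os.pipe() are close-on-exec in Python, while the copy made by os.dup2() is inheritable, which mirrors the FD_CLOEXEC rule above.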

Process States

Linux tracks a process via a task_struct that includes a state and related flags. You can observe these from user space:

/proc/<pid>/status — human-readable overview.
/proc/<pid>/stat — numeric fields, including state.
ps, top, htop — higher-level tools.

Common states (as reported by ps in the STAT column):

R — Running (or runnable: on CPU or in run queue).
S — Sleeping (interruptible sleep).
D — Uninterruptible sleep (usually waiting for I/O).
T — Stopped (via signal) or traced (debugger).
Z — Zombie (terminated, not yet reaped).
I — Idle kernel thread (sometimes shown on newer kernels).
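As a small illustration of reading the state letter programmatically, the sketch below parses /proc/<pid>/stat. The helper name proc_state is hypothetical; the only assumption is a mounted Linux /proc.

```python
import os


def proc_state(pid):
    """Return (state_letter, ppid) parsed from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the last ')' instead of on spaces.
    after_comm = data.rpartition(")")[2].split()
    return after_comm[0], int(after_comm[1])


if __name__ == "__main__":
    state, ppid = proc_state(os.getpid())
    print(f"state={state} ppid={ppid}")
```

Splitting on the last closing parenthesis is a deliberate choice: naive whitespace splitting breaks for processes whose names contain spaces.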

Running vs runnable

Internally, there is a distinction:

TASK_RUNNING: either currently executing on a CPU or ready to run in a run queue.
The scheduler picks from runnable tasks based on policy and priority.

From user space, both are shown as R.

Interruptible sleep (`S`)

TASK_INTERRUPTIBLE:

Process is waiting for some condition (I/O, timer, event).
Can be interrupted by signals.

Example:

read() on a pipe with no data.
select()/poll() waiting for events.

Uninterruptible sleep (`D`)

TASK_UNINTERRUPTIBLE:

Typically used for I/O waits that must not be interrupted (e.g., some disk operations, certain kernel subsystems).
Signals, including SIGKILL, are not acted on until the task leaves this state.

If many processes stay in D state for a long time, it often points to slow or failing storage, an unresponsive network filesystem, or a low-level driver problem.

Stopped and traced (`T`)

Stopped:

Caused by signals like SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU.
The process is not scheduled again until it receives SIGCONT.

Traced:

Debugger using ptrace() can stop a process at breakpoints.
ps shows a tracing stop as lowercase t on current kernels; older kernels reported T for both cases.
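The stop and continue transitions can be driven and observed from a parent process. The following is a minimal sketch, assuming a Linux /proc; state_of and stop_and_continue are hypothetical helper names.

```python
import os
import signal
import time


def state_of(pid):
    """Return the one-letter state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rpartition(")")[2].split()[0]


def stop_and_continue():
    """Stop a child with SIGSTOP, resume it with SIGCONT; return the states seen."""
    pid = os.fork()
    if pid == 0:
        try:
            while True:
                time.sleep(0.05)     # idle until the parent kills us
        finally:
            os._exit(0)
    os.kill(pid, signal.SIGSTOP)
    os.waitpid(pid, os.WUNTRACED)    # returns once the child has actually stopped
    stopped = state_of(pid)
    os.kill(pid, signal.SIGCONT)
    os.waitpid(pid, os.WCONTINUED)   # returns once the child has been resumed
    resumed = state_of(pid)
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)               # reap, so no zombie is left behind
    return stopped, resumed


if __name__ == "__main__":
    print(stop_and_continue())
```

Waiting with WUNTRACED and WCONTINUED avoids guessing with sleeps: the parent only inspects the state after the kernel has reported the transition.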

Zombie and Orphan Processes

Zombie processes

A zombie is a process that has:

Terminated (by calling _exit() or through a fatal signal), so the kernel has:

Freed most resources (memory mappings, open files).
Recorded its exit status and resource-usage info.

But:

Its entry in the process table (task_struct or minimal representation) remains.
Parent has not yet called wait()/waitpid()/waitid().

Lifecycle around exit:

Child calls _exit(status) (or returns from main).
Kernel:

Sets the task’s exit_state to EXIT_ZOMBIE (its scheduler state becomes TASK_DEAD).
Stores exit code and stats.
Sends SIGCHLD to the parent.

Process becomes a zombie (Z in ps).
Parent:

Calls wait*() syscall.
Kernel returns collected info and fully removes the child entry.

Zombie disappears; process is “reaped”.

A few short-lived zombies are normal. Persistent zombies usually indicate:

A parent process that never calls wait*() on its children.
Application bugs in process management code.
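The zombie window between exit and reaping can be made visible on purpose. This sketch uses waitid() with WNOWAIT to wait for the exit without reaping; make_zombie and state_of are hypothetical names, and a Linux /proc is assumed.

```python
import os


def state_of(pid):
    """Return the one-letter state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rpartition(")")[2].split()[0]


def make_zombie():
    """Return (state before reaping, exit code, whether /proc entry is gone after)."""
    pid = os.fork()
    if pid == 0:
        os._exit(5)                  # child terminates at once
    # WNOWAIT blocks until the child has exited but leaves it unreaped,
    # so it is still a zombie when we look at it.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = state_of(pid)
    _, status = os.waitpid(pid, 0)   # reap: the kernel drops the process entry
    gone = not os.path.exists(f"/proc/{pid}")
    return before, os.WEXITSTATUS(status), gone


if __name__ == "__main__":
    print(make_zombie())
```

The exit code survives inside the zombie entry until the final waitpid() collects it, which is exactly why the kernel keeps the entry around.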

Orphan processes and `init` (or `systemd`) re-parenting

A process becomes an orphan when its parent exits before it does:

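The kernel re-parents the child to the nearest ancestor marked as a child subreaper (prctl(PR_SET_CHILD_SUBREAPER)) or, if there is none, to PID 1 of its PID namespace (traditionally init, often systemd now).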
The new parent is responsible for reaping the process when it terminates.

This prevents permanent zombies: the adopting process (normally PID 1) calls wait() on the children it inherits.
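Re-parenting can be observed with a double fork: the middle process exits, and the grandchild reports the parent PID it sees afterwards. This is a sketch; orphan_new_parent is a hypothetical name, and the new parent's PID depends on the environment (PID 1 or a subreaper).

```python
import os
import time


def orphan_new_parent():
    """Return (middle_pid, ppid the grandchild sees after the middle process exits)."""
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        try:
            os.close(read_fd)
            original_parent = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the kernel has re-parented us.
                try:
                    while os.getppid() == original_parent:
                        time.sleep(0.01)
                    os.write(write_fd, str(os.getppid()).encode())
                finally:
                    os._exit(0)
        finally:
            os._exit(0)              # middle process exits, orphaning the grandchild
    os.close(write_fd)
    os.waitpid(middle, 0)            # reap the middle process
    with os.fdopen(read_fd) as pipe:
        new_ppid = int(pipe.read())  # EOF arrives once the grandchild has exited
    return middle, new_ppid


if __name__ == "__main__":
    middle, new_ppid = orphan_new_parent()
    print(f"original parent was {middle}, grandchild now reports parent {new_ppid}")
```

The same double-fork shape is what classic daemonization relies on to hand a long-running process over to the init system.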

Scheduling, Preemption, and Context Switch

High-level scheduling flow

At any point, a process is in one of these situations:

On a CPU (running).
In a run queue (runnable, waiting for CPU).
Blocked (various wait states).
Stopped/zombie.

The scheduler:

Chooses which runnable task to run next on each CPU.
Applies each task’s scheduling policy and priority (nice value or real-time priority) to make that choice.

Common scheduling policies (user visible)

SCHED_NORMAL (also SCHED_OTHER):

Default, time-sharing for regular processes.

SCHED_BATCH, SCHED_IDLE:

For background or very low-priority work.

Real-time:

SCHED_FIFO — strict priority, FIFO within priority.
SCHED_RR — round-robin among same-priority tasks.

You can inspect and adjust them with:

chrt, schedtool, nice, renice.
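The same information that chrt and nice report is available through syscall wrappers. The sketch below reads the policy and nice value of a process; POLICY_NAMES and describe_scheduling are hypothetical names introduced for this example.

```python
import os

# Map the numeric policy constants exposed by the os module to their names.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def describe_scheduling(pid=0):
    """Return (policy_name, nice_value) for pid; 0 means the calling process."""
    policy = os.sched_getscheduler(pid)
    nice = os.getpriority(os.PRIO_PROCESS, pid)
    # Policies or flag combinations not listed above fall back to a numeric label.
    return POLICY_NAMES.get(policy, f"unknown({policy})"), nice


if __name__ == "__main__":
    print(describe_scheduling())
```

Raising the nice value (lowering priority) with os.nice() needs no privilege, whereas switching to a real-time policy with os.sched_setscheduler() usually does.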

Context switching

A context switch is the core mechanism that moves the CPU from one process (or thread) to another:

Saves CPU register state, program counter, stack pointer of current task.
Loads register state of next scheduled task.
Updates MMU / page tables (CR3 on x86) if switching between different address spaces.

Context switches happen when:

The running process:

Voluntarily yields (e.g., blocking on I/O, sched_yield()).
Uses up its time slice (time-sharing policies and SCHED_RR; SCHED_FIFO tasks have no time slice).

A higher-priority task becomes runnable (wake-up).

You can see context switch metrics in:

/proc/<pid>/status (voluntary_ctxt_switches, nonvoluntary_ctxt_switches).
System-wide via vmstat, pidstat, perf, etc.
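A short sketch of reading the per-process counters and provoking voluntary switches by sleeping. It assumes the kernel exposes both counter fields in /proc/<pid>/status; ctxt_switches is a hypothetical helper name.

```python
import os
import time


def ctxt_switches(pid="self"):
    """Return (voluntary, nonvoluntary) counters from /proc/<pid>/status."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.endswith("ctxt_switches"):
                counts[key] = int(value)
    return counts["voluntary_ctxt_switches"], counts["nonvoluntary_ctxt_switches"]


if __name__ == "__main__":
    before = ctxt_switches()
    for _ in range(20):
        time.sleep(0.001)            # each sleep blocks, giving up the CPU voluntarily
    after = ctxt_switches()
    print(f"voluntary: {before[0]} -> {after[0]}, involuntary: {before[1]} -> {after[1]}")
```

A high voluntary count usually reflects I/O or sleep-heavy work, while a high involuntary count suggests the task is being preempted while it still wants the CPU.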

Blocking, Waking, and Wait Queues

Blocking

When a process cannot continue (e.g., I/O not ready), it typically:

Moves to a sleeping state (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE).
Is added to a wait queue associated with the resource (file, socket, event).
Scheduler picks another runnable task.

Examples of blocking operations:

Reading from a pipe or socket with no data (in blocking mode).
Waiting for a child process: wait*().
Waiting for a lock (mutex, semaphore).

Wait queues and wake-ups

Internally, many kernel objects have wait queues. Generic flow:

To sleep:

Process adds itself to the wait queue.
Changes state to sleep.
Calls the scheduler.

To wake:

When the event happens (I/O completes, lock released), kernel:

Wakes one or more tasks from the wait queue.
Moves them to runnable state.
They will eventually be scheduled.

From user space:

You see the effect as transitions between R, S, D states.
Tools like strace can show what syscalls are causing blocking.
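The sleep and wake-up cycle can be reproduced with a pipe: the child blocks in read(), the parent observes the S state and then writes a byte to wake it. This is a sketch with hypothetical helper names (state_of, block_then_wake) and an arbitrary polling timeout.

```python
import os
import time


def state_of(pid):
    """Return the one-letter state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rpartition(")")[2].split()[0]


def block_then_wake(timeout=5.0):
    """Return the child's state while blocked in read(), and the byte it received."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(write_fd)
            data = os.read(read_fd, 1)   # blocks: the pipe is empty
            os._exit(data[0])            # report the byte via the exit status
        finally:
            os._exit(255)                # only reached if the read failed
    os.close(read_fd)
    deadline = time.monotonic() + timeout
    state = state_of(pid)
    while state != "S" and time.monotonic() < deadline:
        time.sleep(0.01)                 # poll until the child is asleep
        state = state_of(pid)
    os.write(write_fd, b"\x2a")          # data arrives: the kernel wakes the reader
    os.close(write_fd)
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(block_then_wake())
```

The child consumes no CPU while it sits on the pipe's wait queue; it becomes runnable only when the write makes data available.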

Signals in the Lifecycle

Signals are a key control mechanism for process state transitions.

Common signals affecting the lifecycle:

SIGKILL (9):

Cannot be caught, blocked, or ignored; takes effect as soon as the task leaves any uninterruptible sleep.
Forces process exit.

SIGTERM (15):

Default “polite” request to terminate.
The process can catch it to clean up first, or ignore it.

SIGINT (2):

Typically sent by Ctrl+C in a terminal.

SIGSTOP, SIGTSTP:

Stop (suspend) the process.

SIGCONT:

Resume a stopped process.

SIGCHLD:

Sent to the parent when a child stops, continues, or terminates.

Lifecycle edges driven by signals:

running → stopped:

kill -STOP <pid> or terminal job control.

stopped → running:

kill -CONT <pid> or fg/bg in the shell.

running → exiting:

SIGTERM, SIGINT, SIGKILL, etc.

You can see pending and blocked signals in /proc/<pid>/status (the SigPnd, ShdPnd, SigBlk, SigIgn, and SigCgt bitmask fields).
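The difference between a catchable and an uncatchable termination signal shows up in the wait status the parent receives. In this sketch the child installs a SIGTERM handler that stands in for real cleanup; terminate_child and the exit code 42 are arbitrary choices for illustration.

```python
import os
import signal
import time


def terminate_child(sig):
    """Send sig to a child that handles SIGTERM; return how the child ended."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Handler: a stand-in for real cleanup, followed by a normal exit.
            signal.signal(signal.SIGTERM, lambda signum, frame: os._exit(42))
            os.write(write_fd, b"x")     # tell the parent the handler is installed
            while True:
                time.sleep(0.05)
        finally:
            os._exit(1)
    os.close(write_fd)
    os.read(read_fd, 1)                  # wait until the child is ready
    os.close(read_fd)
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return "killed", os.WTERMSIG(status)
    return "exited", os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(terminate_child(signal.SIGTERM))
    print(terminate_child(signal.SIGKILL))
```

With SIGTERM the child exits on its own terms and the parent sees a normal exit code; with SIGKILL the handler never runs and the parent sees a death by signal.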

Exit, Termination, and Reaping

Process termination paths

A process can terminate via:

Normal return from main():

Equivalent to calling exit(status).

Explicit calls:

exit(status) — user space library function.
_exit(status) / _Exit(status) — direct syscall; no stdio flush or atexit handlers.

Signals:

Fatal signals like SIGKILL, SIGSEGV, SIGABRT (unless caught or ignored, where allowed).

Only the low 8 bits of the exit status (0–255) reach the parent, which extracts them from the wait*() status word with WEXITSTATUS().
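The termination paths above map onto distinct wait statuses, which can be decoded with the Python equivalents of the C W* macros. This is a minimal sketch; run_child and describe are hypothetical helpers.

```python
import os
import signal


def run_child(action):
    """Fork, run action() in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(0)              # reached only if action() returned or raised
    _, status = os.waitpid(pid, 0)
    return status


def describe(status):
    """Decode a wait status the way the W* macros do in C."""
    if os.WIFEXITED(status):
        return f"exited with {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    return "other"


if __name__ == "__main__":
    print(describe(run_child(lambda: os._exit(3))))
    print(describe(run_child(lambda: os._exit(300))))   # truncated to 300 & 0xFF
    print(describe(run_child(lambda: os.kill(os.getpid(), signal.SIGKILL))))
```

The second call shows the 8-bit truncation: a child that passes 300 is reported to the parent as 44.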

Parent waiting for children

The parent typically calls:

pid_t wait(int *wstatus);
pid_t waitpid(pid_t pid, int *wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);

This:

Collects:

Child PID and exit status.
Resource usage (with wait3/wait4).

Removes the zombie from the process table.

If the parent never waits:

Zombies accumulate until:

Parent exits (then PID 1 adopts and reaps), or
The zombies exhaust a limit such as the PID space or the per-user process limit, at which point new fork() calls fail (extreme/broken case).
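A parent that manages several children typically reaps them in a loop until the kernel reports that none are left. The sketch below shows one such loop; reap_all and spawn are hypothetical helpers.

```python
import os


def spawn(code):
    """Fork a child that exits immediately with the given code."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid


def reap_all():
    """Wait for every child of this process; return {pid: exit_code}."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, 0)   # -1: any child of this process
        except ChildProcessError:             # ECHILD: no children are left
            return reaped
        # Children killed by a signal have no exit code; record None for them.
        reaped[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


if __name__ == "__main__":
    pids = [spawn(code) for code in (1, 2, 3)]
    print(pids, reap_all())
```

Inside a SIGCHLD handler or an event loop, the same loop is usually run with os.WNOHANG so that it returns instead of blocking while children are still running.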

Namespaces and the Lifecycle

Although full namespace behavior is covered elsewhere, namespaces affect process identity during the lifecycle:

PID namespaces:

A process can have different PIDs in different namespaces.
PID 1 in a namespace behaves as an “init” for that namespace:

Reaps orphans.
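Is shielded from signals sent from inside the namespace unless it has installed a handler for them; when it exits, the kernel kills every other process in the namespace with SIGKILL.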

clone() with CLONE_NEWPID:

Creates a new PID namespace.
First child becomes PID 1 in that namespace.

This means:

From the host, lifecycle is tracked via host PIDs.
From inside a container / namespace, processes appear with local PIDs and local parent/child relationships.

Observing Lifecycle from User Space

A few practical ways to see lifecycle states and transitions.

`/proc` views

For a given PID:

/proc/<pid>/stat:

Field 3: state (e.g., R, S, D, T, Z).
Many other fields: parent PID, process group ID, CPU time, etc.

/proc/<pid>/status:

Human-readable; includes State:, PPid:, Threads:, voluntary_ctxt_switches, etc.

/proc/<pid>/task/:

Per-thread entries.
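These /proc entries are enough to build a rough process listing. The following sketch walks /proc and collects parent PID, state, and thread count per process; list_processes is a hypothetical name, and processes that exit mid-scan are simply skipped.

```python
import os


def list_processes():
    """Return {pid: (ppid, state, thread_count)} for every process visible in /proc."""
    table = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue                 # skip non-process entries such as /proc/meminfo
        try:
            with open(f"/proc/{name}/stat") as f:
                # Split after the last ')' because the command name may contain spaces.
                fields = f.read().rpartition(")")[2].split()
            threads = len(os.listdir(f"/proc/{name}/task"))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue                 # the process exited (or is hidden) mid-scan
        table[int(name)] = (int(fields[1]), fields[0], threads)
    return table


if __name__ == "__main__":
    for pid, (ppid, state, threads) in sorted(list_processes().items()):
        print(f"{pid:>7} {ppid:>7} {state} {threads}")
```

Any such scan is a snapshot at best: processes can appear, change state, or vanish between two reads.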

`ps` and `top`

Examples:

ps -o pid,ppid,state,cmd — quick parent/child and state overview.
ps axjf — process tree with relationships.
top / htop:

Show dynamic process lists, states, CPU usage, and transitions in real time.

Using `strace` to observe lifecycle events

strace -f -p <pid>:

Attach to a running process and observe syscalls like fork, clone, execve, wait4, exit_group.

strace -f ./program:

Watch process creation (clone) and program replacement (execve).

This is useful for understanding when and how processes:

Spawn children.
Replace themselves via execve.
Wait for exits.

Summary of the Lifecycle Transitions

At a high level:

Creation:

fork(), clone(), vfork():

New task_struct.
Inherits or shares various resources.

Program replacement (optional):

execve():

New program image, same PID.

Running and blocking:

Scheduler moves process between:

running/runnable (R).
sleeping (S/D).
stopped/traced (T).

Termination:

exit() / _exit() or fatal signal.
Process releases resources; becomes zombie.

Reaping:

Parent (or PID 1) calls wait*().
Kernel removes zombie entry; lifecycle ends.

Understanding these steps at the kernel level is the foundation for deeper topics like memory management, signals & IPC, namespaces, and cgroups, which all interact tightly with the process lifecycle.
