Overview of the Process Lifecycle
In Linux, a process moves through a series of well-defined phases: creation, execution, possible blocking, and termination, plus various intermediate states (zombie, stopped, traced). This chapter focuses on how these states are represented and managed in the kernel, and how user space can observe and influence them.
High-level pipeline:
$$
\text{fork/clone/vfork} \rightarrow \text{execve} \rightarrow \text{running/blocked} \rightarrow \text{exit} \rightarrow \text{reaped}
$$
Each step has specific kernel data structures, syscalls, and observable side effects.
Process Creation
`fork()` and copy-on-write
The classic way to create a new process is fork():
- The child gets:
  - A copy of the parent’s virtual address space (implemented via copy-on-write pages, not a full physical copy).
  - Duplicated file descriptors (same underlying open file objects).
  - Same current working directory, environment, signal dispositions, etc.
- Differences:
  - Child gets a new PID.
  - Child’s `ppid` (parent PID) is set to the parent.
- Return value of `fork()`:
  - Parent: returns child PID (`> 0`)
  - Child: returns `0`
  - Failure: `-1` in parent, no child created
Copy-on-write (CoW) means both processes initially share physical pages marked as read-only; when either writes, the kernel allocates a new page and updates that process’s page table.
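The return-value convention and the isolation that CoW provides can be sketched in a few lines of C. This is a minimal illustration, not a reference implementation; the function name `parent_value_after_child_write` is hypothetical and chosen only for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once. The child overwrites its own copy of `value` and exits;
 * the parent reaps the child and returns its own, unchanged copy.
 * Returns -1 if fork() or waitpid() fails. */
int parent_value_after_child_write(void)
{
    int value = 1;                 /* exists in both address spaces after fork() */

    pid_t pid = fork();
    if (pid == -1)
        return -1;                 /* failure: no child was created */

    if (pid == 0) {                /* child: fork() returned 0 */
        value = 99;                /* lands in the child's private copy of the page */
        _exit(0);                  /* skip atexit handlers and stdio flushing */
    }

    /* parent: fork() returned the child's PID (> 0) */
    if (waitpid(pid, NULL, 0) == -1)
        return -1;
    return value;                  /* still 1: the child's write was private */
}
```

Calling `parent_value_after_child_write()` should return 1, because the child's write never reaches the parent's pages.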
`vfork()`
vfork() is an optimization for the common pattern:
- `fork()`
- `execve()` in the child soon after
Differences:
- Parent is suspended until the child calls `execve()` or `_exit()`.
- Child runs in the parent’s address space (no separate copy yet).
- Child must not:
  - Modify variables that the parent will use later.
  - Return from the function that called `vfork()`, or call functions other than `_exit()` or the `exec` family, since the child is borrowing the parent’s stack.
On modern Linux, `fork()` with CoW often makes `vfork()` unnecessary, but `vfork()` still exists and is used in some performance-critical code (such as shells or `posix_spawn()` implementations).
`clone()` and process vs thread semantics
clone() is the low-level primitive backing both processes and threads. It allows fine-grained control over what is shared with the parent, via flags like:
- `CLONE_VM`: share the same memory space.
- `CLONE_FS`: share filesystem info (cwd, root).
- `CLONE_FILES`: share file descriptor table.
- `CLONE_SIGHAND`: share signal handlers.
- `CLONE_PARENT`: share the same parent as caller.
- `CLONE_THREAD`: put the new task in the same thread group.
Threads in Linux are just tasks (kernel task_structs) that share certain resources. A “thread” is typically created by:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | ...)
This yields:
- Same thread group (same `tgid`, the thread group ID that `getpid()` reports).
- Distinct TIDs (thread IDs), visible as `/proc/<tgid>/task/<tid>`.
User-space libraries (e.g. pthread_create()) wrap clone() to provide POSIX threads.
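To make the effect of a single sharing flag concrete, the sketch below calls `clone()` with only `CLONE_VM`. It assumes glibc's `clone()` wrapper and a stack that grows downward; the names `write_shared` and `value_written_by_clone_vm_child` and the stack size are illustrative choices. Unlike a `fork()` child, this child writes into the caller's own memory.

```c
#define _GNU_SOURCE               /* required for clone() in <sched.h> */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (256 * 1024)   /* arbitrary size for this sketch */

/* Entry point of the cloned task: writes through the shared address space. */
static int write_shared(void *arg)
{
    *(int *)arg = 42;
    return 0;                     /* becomes the task's exit status */
}

/* Creates a task that shares memory with the caller (CLONE_VM) but is
 * otherwise a separate process, waits for it, and returns the value it
 * wrote. Returns -1 on failure. */
int value_written_by_clone_vm_child(void)
{
    int shared = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Pass the top of the stack because it grows downward. SIGCHLD in the
     * flags makes the new task waitable with a plain waitpid(). */
    pid_t pid = clone(write_shared, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &shared);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int reaped = waitpid(pid, NULL, 0) == pid;
    free(stack);                  /* safe: the child has terminated */
    return reaped ? shared : -1;
}
```

With `CLONE_VM` the function should return 42. Replacing the `clone()` call with `fork()` would leave `shared` at 0 in the parent. Real thread libraries add many more flags plus thread-local storage setup, which this sketch deliberately omits.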
Executing a New Program: `execve()`
After fork()/clone(), the child often calls execve() (or a wrapper like execl(), execvp()):
`int execve(const char *filename, char *const argv[], char *const envp[]);`
What execve() does:
- Replaces the calling process’s memory image with a new program:
  - Old code, data, heap, stack mappings are removed.
  - New program image (ELF binary, or the interpreter for scripts) is mapped.
- PID does not change: same process identity, new program.
- Preserves:
  - Open file descriptors (unless `FD_CLOEXEC` is set).
  - Ignored signal dispositions; signals with handlers are reset to their default action, since the handler code no longer exists.
  - UID/GID (subject to setuid/setgid bits).
  - Current working directory and environment (unless modified).
From the lifecycle point of view:
- `fork()` → “new process with same program”
- `execve()` → “same process, new program”
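From the caller’s point of view, `execve()` is all-or-nothing: on success it never returns and the new image is fully in place; on failure it returns `-1` with `errno` set and the old image keeps running. No half-replaced image is ever visible to user space.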
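The usual fork-then-exec pattern can be sketched as follows. The helper name `run_and_wait` is hypothetical, and exit code 127 for "exec failed" is a shell convention adopted here, not something `execve()` itself prescribes.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `path` with the given argv in a child process and returns its exit
 * status (0-255), or -1 if the child could not be created or did not exit
 * normally. 127 reports that execve() itself failed in the child. */
int run_and_wait(const char *path, char *const argv[])
{
    extern char **environ;            /* pass the caller's environment through */

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve(path, argv, environ);  /* returns only on failure */
        _exit(127);                   /* the old image is still running here */
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

For example, `run_and_wait("/bin/sh", argv)` with `argv = {"sh", "-c", "exit 3", NULL}` should return 3, assuming `/bin/sh` exists on the system. A path that does not exist yields 127, because `execve()` fails and the child falls through to `_exit(127)`.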
Process States
Linux tracks a process via a task_struct that includes a state and related flags. You can observe these from user space:
- `/proc/<pid>/status`: human-readable overview.
- `/proc/<pid>/stat`: numeric fields, including state.
- `ps`, `top`, `htop`: higher-level tools.
Common states (as reported by ps in the STAT column):
- `R`: Running (or runnable: on CPU or in run queue).
- `S`: Sleeping (interruptible sleep).
- `D`: Uninterruptible sleep (usually waiting for I/O).
- `T`: Stopped (via signal) or traced (debugger).
- `Z`: Zombie (terminated, not yet reaped).
- `I`: Idle kernel thread (sometimes shown on newer kernels).
Running vs runnable
Internally, there is a distinction:
- `TASK_RUNNING`: either currently executing on a CPU or ready to run in a run queue.
- The scheduler picks from runnable tasks based on policy and priority.
From user space, both are shown as R.
Interruptible sleep (`S`)
- `TASK_INTERRUPTIBLE`:
  - Process is waiting for some condition (I/O, timer, event).
  - Can be interrupted by signals.
  - Examples:
    - `read()` on a pipe with no data.
    - `select()` / `poll()` waiting for events.
Uninterruptible sleep (`D`)
- `TASK_UNINTERRUPTIBLE`:
  - Typically used for I/O waits that must not be interrupted (e.g., some disk operations, certain kernel subsystems).
  - Signals are not delivered until the state changes.
  - If many processes are stuck in `D` state, it often indicates hardware or low-level driver issues.
Stopped and traced (`T`)
- Stopped:
  - Caused by signals like `SIGSTOP`, `SIGTSTP`, `SIGTTIN`, `SIGTTOU`.
  - Process is not running and not scheduled until a `SIGCONT`.
- Traced:
  - Debugger using `ptrace()` can stop a process at breakpoints.
  - `ps` may show these as `T` as well.
Zombie and Orphan Processes
Zombie processes
A zombie is a process that has:
- Called `_exit()`, so the kernel has:
  - Freed most resources (memory mappings, open files).
  - Recorded its exit status and resource-usage info.
- But:
  - Its entry in the process table (`task_struct` or minimal representation) remains.
  - Parent has not yet called `wait()` / `waitpid()` / `waitid()`.
Lifecycle around exit:
- Child calls `_exit(status)` (or returns from `main`).
- Kernel:
  - Marks process as `TASK_DEAD` internally.
  - Stores exit code and stats.
  - Sends `SIGCHLD` to the parent.
  - Process becomes a zombie (`Z` in `ps`).
- Parent:
  - Calls a `wait*()` syscall.
  - Kernel returns collected info and fully removes the child entry.
  - Zombie disappears; process is “reaped”.
A few short-lived zombies are normal. Persistent zombies usually indicate:
- A parent process that never calls `wait*()` on its children.
- Application bugs in process management code.
Orphan processes and `init` (or `systemd`) re-parenting
A process becomes an orphan when its parent exits before it does:
- Kernel re-parents the child to PID 1 (traditionally `init`, often `systemd` now), or to the nearest ancestor registered as a child subreaper if one exists.
- The new parent is responsible for reaping the process when it terminates.
This prevents permanent zombies: eventually, PID 1 will wait() on adopted children.
Scheduling, Preemption, and Context Switch
High-level scheduling flow
At any point, a process is either:
- On a CPU (running).
- In a run queue (runnable, waiting for CPU).
- Blocked (various wait states).
- Stopped/zombie.
The scheduler:
- Chooses which runnable task to run next on each CPU.
- Uses policies and priorities.
Common scheduling policies (user visible)
- `SCHED_NORMAL` (also `SCHED_OTHER`):
  - Default, time-sharing for regular processes.
- `SCHED_BATCH`, `SCHED_IDLE`:
  - For background or very low-priority work.
- Real-time:
  - `SCHED_FIFO`: strict priority, FIFO within priority.
  - `SCHED_RR`: round-robin among same-priority tasks.
You can inspect and adjust them with:
- `chrt`, `schedtool`, `nice`, `renice`.
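The same attributes these tools display can be read programmatically. The following is a small sketch using the POSIX calls `sched_getscheduler()`, `sched_getparam()`, and `getpriority()`; the struct `sched_snapshot` and both function names are invented for this example.

```c
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

/* Snapshot of the calling process's scheduling attributes. */
struct sched_snapshot {
    int policy;       /* SCHED_OTHER, SCHED_FIFO, SCHED_RR, ... */
    int rt_priority;  /* static priority; 0 for non-real-time policies */
    int nice_value;   /* -20 (most favorable) to 19 (least favorable) */
};

/* Fills `out` for the calling process. Returns 0 on success, -1 on error. */
int read_sched_snapshot(struct sched_snapshot *out)
{
    struct sched_param param;

    out->policy = sched_getscheduler(0);      /* 0 means "calling process" */
    if (out->policy == -1 || sched_getparam(0, &param) == -1)
        return -1;
    out->rt_priority = param.sched_priority;

    errno = 0;                                /* -1 is a legal nice value */
    out->nice_value = getpriority(PRIO_PROCESS, 0);
    if (out->nice_value == -1 && errno != 0)
        return -1;
    return 0;
}

/* True for the two real-time policies named in the list above. */
int is_realtime_policy(int policy)
{
    return policy == SCHED_FIFO || policy == SCHED_RR;
}
```

For an ordinary process started from a shell, the snapshot typically shows `SCHED_OTHER`, a real-time priority of 0, and a nice value of 0, though the exact values depend on how the process was launched.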
Context switching
A context switch is the core mechanism that moves the CPU from one process (or thread) to another:
- Saves CPU register state, program counter, stack pointer of current task.
- Loads register state of next scheduled task.
- Updates MMU / page tables (`CR3` on x86) if switching between different address spaces.
Context switches happen when:
- The running process:
  - Voluntarily yields (e.g., blocking on I/O, `sched_yield()`).
  - Uses up its time slice (time-sharing policies and `SCHED_RR`).
- A higher-priority task becomes runnable (wake-up).
You can see context switch metrics in:
- `/proc/<pid>/status` (`voluntary_ctxt_switches`, `nonvoluntary_ctxt_switches`).
- System-wide via `vmstat`, `pidstat`, `perf`, etc.
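A process can also read its own counters with `getrusage()`, which reports process-wide totals in `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary). The sketch below, with the hypothetical helpers `voluntary_switches` and `switches_from_sleeping`, shows that each blocking sleep is counted as a voluntary switch.

```c
#include <sys/resource.h>
#include <time.h>

/* Returns the voluntary context switches counted so far for the calling
 * process, or -1 on error. */
long voluntary_switches(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == -1)
        return -1;
    return usage.ru_nvcsw;
}

/* Sleeps for 1 ms `rounds` times and returns how many additional voluntary
 * context switches were counted meanwhile, or -1 on error. */
long switches_from_sleeping(int rounds)
{
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };

    long before = voluntary_switches();
    for (int i = 0; i < rounds; i++)
        nanosleep(&one_ms, NULL);   /* blocks, so the task gives up the CPU */
    long after = voluntary_switches();

    return (before < 0 || after < 0) ? -1 : after - before;
}
```

On a typical Linux kernel, `switches_from_sleeping(5)` reports at least one switch per sleep. The exact number can be higher, since other blocking operations in the process are counted too.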
Blocking, Waking, and Wait Queues
Blocking
When a process cannot continue (e.g., I/O not ready), it typically:
- Moves to a sleeping state (`TASK_INTERRUPTIBLE` or `TASK_UNINTERRUPTIBLE`).
- Is added to a wait queue associated with the resource (file, socket, event).
- Scheduler picks another runnable task.
Examples of blocking operations:
- Reading from a pipe or socket with no data (in blocking mode).
- Waiting for a child process: `wait*()`.
- Waiting for a lock (mutex, semaphore).
Wait queues and wake-ups
Internally, many kernel objects have wait queues. Generic flow:
- To sleep:
  - Process adds itself to the wait queue.
  - Changes state to sleep.
  - Calls the scheduler.
- To wake:
  - When the event happens (I/O completes, lock released), kernel:
    - Wakes one or more tasks from the wait queue.
    - Moves them to runnable state.
  - They will eventually be scheduled.
From user space:
- You see the effect as transitions between `R`, `S`, `D` states.
- Tools like `strace` can show what syscalls are causing blocking.
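The sleep and wake-up flow can be triggered from user space with nothing more than a pipe. In this sketch (the function name `wake_blocked_reader` is hypothetical), the child can only proceed once the parent writes a byte, which is the event that wakes any reader sleeping on the pipe.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a reader that waits on an empty pipe, wakes it by writing `byte`,
 * and returns the byte the child reports back as its exit status.
 * Returns -1 on error. */
int wake_blocked_reader(unsigned char byte)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        unsigned char received;
        close(fds[1]);                       /* child only reads */
        /* If the pipe is still empty, read() puts the child to sleep
         * (interruptible) until data or end-of-file arrives. */
        ssize_t n = read(fds[0], &received, 1);
        _exit(n == 1 ? received : 255);
    }

    close(fds[0]);                           /* parent only writes */

    /* The child cannot finish before data is written, so a non-blocking
     * wait reports "no state change yet" (return value 0). */
    int still_running = waitpid(pid, NULL, WNOHANG) == 0;

    /* Writing makes the pipe readable and wakes a sleeping reader. */
    ssize_t written = write(fds[1], &byte, 1);
    close(fds[1]);

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !still_running || written != 1)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

For example, `wake_blocked_reader(7)` should return 7. Running the program under `strace -f` would show the child's `read()` call staying unfinished until the parent's `write()`.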
Signals in the Lifecycle
Signals are a key control mechanism for process state transitions.
Common signals affecting the lifecycle:
- `SIGKILL` (9):
  - Immediate, cannot be caught or ignored.
  - Forces process exit.
- `SIGTERM` (15):
  - Default “polite” request to terminate.
  - Process can handle it, clean up, or ignore it.
- `SIGINT` (2):
  - Typically sent by Ctrl+C in a terminal.
- `SIGSTOP`, `SIGTSTP`:
  - Stop (suspend) the process.
- `SIGCONT`:
  - Resume a stopped process.
- `SIGCHLD`:
  - Delivered to parent when a child stops or terminates.
Lifecycle edges driven by signals:
- running → stopped: `kill -STOP <pid>` or terminal job control.
- stopped → running: `kill -CONT <pid>` or `fg` / `bg` in the shell.
- running → exiting: `SIGTERM`, `SIGINT`, `SIGKILL`, etc.
You can see pending and blocked signals in /proc/<pid>/status.
Exit, Termination, and Reaping
Process termination paths
A process can terminate via:
- Normal return from `main()`:
  - Equivalent to calling `exit(status)`.
- Explicit calls:
  - `exit(status)`: user-space library function; runs `atexit` handlers and flushes stdio before exiting.
  - `_exit(status)` / `_Exit(status)`: direct syscall; no stdio flush or `atexit` handlers.
- Signals:
  - Fatal signals like `SIGKILL`, `SIGSEGV`, `SIGABRT` (unless caught or ignored, where allowed).
Exit status is an 8-bit value (0–255) visible to the parent via wait*().
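The 8-bit limit is easy to verify: whatever integer the child passes to `_exit()`, the parent only sees its low 8 bits through `WEXITSTATUS()`. A minimal sketch, with the hypothetical helper `observed_exit_status`:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that calls _exit(code) and returns the exit status the
 * parent observes via waitpid(), or -1 on error. */
int observed_exit_status(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);    /* only the low 8 bits of `code` survive */
}
```

So `observed_exit_status(42)` gives 42, `observed_exit_status(256 + 7)` gives 7, and `observed_exit_status(-1)` gives 255.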
Parent waiting for children
The parent typically calls:
- `pid_t wait(int *wstatus);`
- `pid_t waitpid(pid_t pid, int *wstatus, int options);`
- `int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);`
This:
- Collects:
  - Child PID and exit status.
  - Resource usage (with `wait3` / `wait4`).
- Removes the zombie from the process table.
If the parent never waits:
- Zombies accumulate until either:
  - The parent exits (PID 1 then adopts and reaps them), or
  - The PID space or process table fills up and new processes can no longer be created (an extreme, broken case).
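A parent avoids this accumulation by waiting until no children remain. The loop below is one common shape for that, sketched under the assumption that the caller has no other children it wants to keep. The name `spawn_and_reap` is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks `n` children that exit immediately, then reaps every child of the
 * caller. Returns how many children were reaped, or -1 if a fork() failed. */
int spawn_and_reap(int n)
{
    int forked_all = 1;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: stays a zombie until reaped */
        if (pid == -1) {
            forked_all = 0;
            break;
        }
    }

    /* waitpid(-1, ...) waits for any child; it fails with ECHILD once no
     * children (and therefore no zombies) remain. */
    int reaped = 0;
    for (;;) {
        pid_t done = waitpid(-1, NULL, 0);
        if (done > 0)
            reaped++;
        else if (errno != EINTR)    /* retry only if interrupted by a signal */
            break;
    }
    return forked_all ? reaped : -1;
}
```

Long-running servers often run a non-blocking variant of this loop (with `WNOHANG`) from a `SIGCHLD` handler, so that children are reaped as soon as they exit.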
Namespaces and the Lifecycle
Although a full treatment of namespaces belongs elsewhere, they affect process identity during the lifecycle:
- PID namespaces:
  - A process can have different PIDs in different namespaces.
  - PID 1 in a namespace behaves as an “init” for that namespace:
    - Reaps orphans.
    - Receives signals targeted at the container as a whole.
- `clone()` with `CLONE_NEWPID`:
  - Creates a new PID namespace.
  - First child becomes PID 1 in that namespace.
This means:
- From the host, lifecycle is tracked via host PIDs.
- From inside a container / namespace, processes appear with local PIDs and local parent/child relationships.
Observing Lifecycle from User Space
A few practical ways to see lifecycle states and transitions.
`/proc` views
For a given PID:
- `/proc/<pid>/stat`:
  - Field 3: state (e.g., `R`, `S`, `D`, `T`, `Z`).
  - Many other fields: parent PID, group ID, CPU usage, etc.
- `/proc/<pid>/status`:
  - Human-readable; includes `State:`, `PPid:`, `Threads:`, `voluntary_ctxt_switches`, etc.
- `/proc/<pid>/task/`:
  - Per-thread entries.
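Reading these fields from a program takes a little care, because the command name in field 2 of `/proc/<pid>/stat` may contain spaces or parentheses. The sketch below (the helper name `read_state_and_ppid` is hypothetical, and a mounted `/proc` is assumed) sidesteps that by parsing from the last `)` onward.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads the state letter (field 3) and parent PID (field 4) from
 * /proc/<pid>/stat. Returns 0 on success, -1 on failure. */
int read_state_and_ppid(pid_t pid, char *state, pid_t *ppid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Field 2 is "(comm)" and may itself contain spaces or parentheses,
     * so parsing resumes after the last ')'. */
    const char *after_comm = strrchr(buf, ')');
    int parent;
    if (after_comm == NULL ||
        sscanf(after_comm + 1, " %c %d", state, &parent) != 2)
        return -1;
    *ppid = (pid_t)parent;
    return 0;
}
```

Called with `getpid()`, the function reports state `R`, since a process inspecting itself is by definition running, and a parent PID equal to `getppid()`.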
`ps` and `top`
Examples:
- `ps -o pid,ppid,state,cmd`: quick parent/child and state overview.
- `ps axjf`: process tree with relationships.
- `top` / `htop`:
  - Show dynamic process lists, states, CPU usage, and transitions in real time.
Using `strace` to observe lifecycle events
- `strace -f -p <pid>`:
  - Attach to a running process and observe syscalls like `fork`, `clone`, `execve`, `wait4`, `exit_group`.
- `strace -f ./program`:
  - Watch process creation (`clone`) and program replacement (`execve`).
This is useful for understanding when and how processes:
- Spawn children.
- Replace themselves via `execve`.
- Wait for exits.
Summary of the Lifecycle Transitions
At a high level:
- Creation:
  - `fork()`, `clone()`, `vfork()`:
    - New `task_struct`.
    - Inherits or shares various resources.
- Program replacement (optional):
  - `execve()`:
    - New program image, same PID.
- Running and blocking:
  - Scheduler moves process between:
    - running/runnable (`R`).
    - sleeping (`S` / `D`).
    - stopped/traced (`T`).
- Termination:
  - `exit()` / `_exit()` or fatal signal.
  - Process releases resources; becomes zombie.
- Reaping:
  - Parent (or PID 1) calls `wait*()`.
  - Kernel removes zombie entry; lifecycle ends.
Understanding these steps at the kernel level is the foundation for deeper topics like memory management, signals & IPC, namespaces, and cgroups, which all interact tightly with the process lifecycle.