Why Error Handling Matters in Shell Scripts
Shell scripts often glue many commands together. Any one of them can fail due to missing files, permissions, network issues, or user input errors. Without explicit error handling you get:
- Silent failures (script continues doing the wrong thing)
- Partial work (some steps succeeded, others didn’t)
- Corrupted state (half-written files, partially applied changes)
The goal of error handling is not only to detect failures, but to:
- Fail fast and loudly when appropriate
- Clean up temporary resources
- Provide meaningful diagnostics
- Optionally recover or retry where it’s safe
This chapter assumes you already understand basic script structure, variables, and functions.
Exit Status and Failure Conventions
Every command returns an exit status, available in `$?`. By convention:

- `0` means success
- Non-zero means some kind of failure

You rarely read `$?` directly in advanced scripts; instead you let conditionals and the shell's own behavior do most of the work.
Examples:
```bash
cmd
status=$?
if [ "$status" -ne 0 ]; then
    echo "cmd failed with status $status" >&2
fi
```

Most conditionals automatically use exit statuses: `if cmd; then ... fi` runs the `then` branch if `cmd` returns `0`. `&&` and `||` chains also use exit statuses (see below).
Controlling Script Exit on Errors
`set -e` / `set -o errexit`
`set -e` tells the shell: “exit the script if any simple command fails (returns non-zero), with some exceptions.”

```bash
set -e   # or: set -o errexit

cp file1 file2
echo "Copy succeeded"
```

If `cp` fails, the script terminates immediately.
Traps and Pitfalls of `set -e`
`set -e` is useful, but its behavior is nuanced:

- It does not trigger on commands:
  - in `if`, `while`, `until` conditions
  - after `!` (negation)
  - in `cmd1 && cmd2` or `cmd1 || cmd2`, where the shell expects failures
- It may be ignored in subshells and command substitutions depending on how they're invoked (a sketch of one such case appears below)
Example:
```bash
set -e

if grep pattern file; then   # `grep` failure does not exit the script
    echo "found"
fi
```

Because of these subtleties, many experienced scripters:

- Use `set -e` together with other options (see below), and
- Prefer explicit checks and `|| exit` patterns for critical commands.
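One frequently missed case from the list above: a failing command substitution can be masked when it is combined with another command such as `local`. A minimal sketch (function names are illustrative):

```bash
set -e

f() {
    local out
    out=$(false)        # plain assignment: the substitution's status propagates; errexit fires
    echo "not reached"
}

g() {
    local out=$(false)  # 'local' itself returns 0, masking the failure
    echo "still running"
}

g   # prints "still running"
f   # the script exits here
```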
Making `set -e` Safer: `-u`, `-o pipefail`, and `-E`
These options are often combined (in Bash):
```bash
set -euo pipefail
# or more explicitly:
set -o errexit -o nounset -o pipefail
```

- `-u` (`set -o nounset`): treat use of unset variables as an error (see the example below).
- `-o pipefail`: make a pipeline fail when any of its commands fails; its exit status becomes that of the rightmost failing command.
- `-E` (`set -o errtrace`): preserve `ERR` traps across functions and command substitutions (Bash only, explained later).
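For example, with `-u` a typo becomes a hard error instead of a silent empty expansion:

```bash
set -u

name="world"
echo "Hello, $nmae"   # typo: bash aborts with "nmae: unbound variable"
```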
`pipefail` is especially important: without it, only the exit status of the last command in a pipeline is visible.

```bash
set -e

false | true
echo "Still running"   # script continues; pipeline exit is 0 (from 'true')
```
With `set -o pipefail`, the pipeline returns non-zero if any element fails:

```bash
set -eo pipefail

false | true
echo "Never reached"   # script exits due to 'false'
```

Patterns for Checking Errors
Using `&&` and `||`
`&&` runs the next command only if the previous succeeded; `||` runs the next command only if the previous failed.
Common patterns:
- Only run follow-up on success:

```bash
build_app && deploy_app
```

- Exit on failure with a message:

```bash
do_critical_thing || {
    echo "do_critical_thing failed" >&2
    exit 1
}
```

- Provide a default on failure:

```bash
value=$(some_command) || value="default"
```
Be careful with `cmd || exit 1` in scripts that are meant to continue for non-critical errors. Reserve this for truly fatal conditions.
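For example (command names here are hypothetical), a script might warn and continue for optional steps but abort for required ones:

```bash
# Non-fatal: warn and keep going
warm_cache || echo "WARN: cache warm-up failed; continuing" >&2

# Fatal: the script cannot proceed without this
load_config || { echo "ERROR: cannot load config" >&2; exit 1; }
```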
Explicit `if` Checks
For clarity, especially in longer scripts:
```bash
if ! do_step; then
    echo "ERROR: do_step failed" >&2
    # maybe clean up or skip, depending on the situation
    exit 1
fi
```

This is more verbose but easier to read and extend (e.g., logging, retries).
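For instance, the same shape extends to a single retry before giving up (a sketch reusing the placeholder `do_step`):

```bash
if ! do_step; then
    echo "WARN: do_step failed, retrying once..." >&2
    if ! do_step; then
        echo "ERROR: do_step failed twice" >&2
        exit 1
    fi
fi
```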
Handling Errors in Functions
Advanced scripts usually encapsulate tasks into functions. Error handling inside functions determines how failures propagate.
Returning Status vs Exiting
Two main styles:
- Functions return a status and the caller decides:

```bash
deploy() {
    rsync ...
    # rsync's status becomes the function's status
}

if ! deploy; then
    echo "Deploy failed" >&2
    exit 1
fi
```

- Functions exit the whole script on fatal errors:

```bash
fatal() {
    echo "FATAL: $*" >&2
    exit 1
}

deploy() {
    rsync ... || fatal "rsync failed"
}

deploy
```

Style 1 is more flexible and testable; style 2 is simpler for small scripts.
Consistent Return Codes
For predictable error handling:
- Reserve `0` for success.
- Use specific, documented codes for common error types when needed.
Example:
```bash
# 0 = success, 1 = general error, 2 = invalid usage
backup() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: backup DIR" >&2
        return 2
    fi
    # ...
}
```

Callers can make decisions based on specific codes.
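For instance, a caller can branch on the documented codes (a sketch using the `backup` function above; the `|| rc=$?` capture keeps `set -e` from aborting first):

```bash
backup "$@" && rc=0 || rc=$?

case $rc in
    0) echo "Backup complete" ;;
    2) echo "Usage: backup DIR" >&2; exit 2 ;;
    *) echo "ERROR: backup failed (status $rc)" >&2; exit 1 ;;
esac
```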
Trapping and Handling Failures
The `trap` Builtin
`trap` lets you run commands when certain signals or pseudo-signals occur, including:

- `EXIT` — when the script is about to exit (for any reason)
- `ERR` — when a command fails (subject to `errexit` behavior)
- `INT` — on Ctrl+C (SIGINT)
- `TERM` — on termination (SIGTERM)
Cleanup on Exit
A common pattern is to do cleanup (temporary files, locks) in an `EXIT` trap:

```bash
tmpdir=$(mktemp -d)

cleanup() {
    rm -rf "$tmpdir"
}
trap cleanup EXIT

# use $tmpdir for work...
```
This cleanup runs whether the script succeeds, hits an error, or is exited manually (except for catastrophic terminations like `kill -9`).
Using `ERR` for Centralized Error Handling
In Bash, with `set -e` and `set -E`, you can centralize error reporting:

```bash
set -Eeuo pipefail

error_handler() {
    local exit_code=$?
    local line_no=$1
    echo "ERROR on line $line_no (exit code $exit_code)" >&2
}
trap 'error_handler $LINENO' ERR
```

Notes:

- `$?` inside the handler gives the failing command's status.
- Because the trap string is single-quoted, `$LINENO` expands when the trap fires, at the line of the failing command; passing it as an argument, as shown, hands that value to the handler before the handler's own lines would change it.
- `set -E` ensures `ERR` is inherited by functions in Bash.
With this pattern, any unhandled failure triggers a standardized diagnostic.
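The handler can also report what failed, not just where: Bash stores the text of the currently executing command in `BASH_COMMAND` (a small variation on the handler above):

```bash
trap 'echo "ERROR: \"$BASH_COMMAND\" failed with status $? on line $LINENO" >&2' ERR
```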
Trapping Interrupts (Ctrl+C)
To avoid leaving half-done operations:
```bash
cleanup() {
    echo "Cleaning up..." >&2
    # remove temp files, unlock things, etc.
}
trap cleanup INT TERM

long_running_task
```

If the user presses Ctrl+C, `cleanup` runs. Note that once a trap handler returns, Bash resumes the script rather than exiting, so handlers for fatal signals usually exit or re-raise explicitly.
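One common shape for that (a sketch; `on_interrupt` is an illustrative name):

```bash
on_interrupt() {
    cleanup
    trap - INT      # restore default SIGINT handling
    kill -INT "$$"  # re-raise so the script dies by SIGINT, as callers expect
}
trap on_interrupt INT
```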
Pipeline Error Handling
Without `pipefail`, only the last command's exit status is visible:

```bash
grep pattern file | sort | uniq
echo $?   # status of 'uniq', not 'grep'
```

To correctly handle errors anywhere in the pipeline:
- Enable `pipefail`:

```bash
set -o pipefail
```

- Optionally check `PIPESTATUS` (a Bash-specific array):
```bash
set -o pipefail

grep pattern file | sort | uniq
pipe_status=("${PIPESTATUS[@]}")   # copy immediately: the next command overwrites PIPESTATUS

status_grep=${pipe_status[0]}
status_sort=${pipe_status[1]}
status_uniq=${pipe_status[2]}

if [ "$status_grep" -ne 0 ]; then
    echo "grep failed with $status_grep" >&2
fi
```
For most scripts, `set -o pipefail` plus `set -e` (or manual `if` checks) is sufficient.
Validating Inputs and Preconditions
Many errors can be prevented by upfront checks. This is part of error handling too.
Checking Arguments
Use explicit argument validation:
```bash
usage() {
    echo "Usage: $0 INPUT_FILE OUTPUT_DIR" >&2
    exit 2
}

[ "$#" -eq 2 ] || usage
```

Checking Files and Commands
Use test operators before doing critical work:
```bash
input=$1

if [ ! -f "$input" ]; then
    echo "ERROR: $input does not exist or is not a regular file" >&2
    exit 1
fi

if ! command -v jq >/dev/null 2>&1; then
    echo "ERROR: 'jq' is required but not installed" >&2
    exit 1
fi
```

By failing fast here, you avoid confusing errors deeper in the script.
Retrying and Recovery
Sometimes failures are transient (network hiccups, temporary locks). Instead of exiting immediately, you may want to retry.
Simple Retry Loop
```bash
retry() {
    local attempts=$1; shift
    local delay=$1; shift
    local i
    for ((i=1; i<=attempts; i++)); do
        if "$@"; then
            return 0
        fi
        if ((i < attempts)); then
            echo "Attempt $i failed; retrying in $delay seconds..." >&2
            sleep "$delay"
        fi
    done
    echo "ERROR: command failed after $attempts attempts: $*" >&2
    return 1
}

# Usage:
retry 5 2 curl -fsS "https://example.com/api"
```

Rollback on Failure
For operations that change state (e.g., deployment), plan a rollback:
```bash
deploy_new_version() {
    backup_existing

    if ! install_new; then
        echo "Install failed, rolling back..." >&2
        restore_backup
        return 1
    fi
}
```

Rollback logic can get complex; scripts should keep it as straightforward and idempotent as possible.
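Idempotent here means the rollback can run more than once without doing further damage. One way to approximate that (paths and helper name are hypothetical):

```bash
restore_backup() {
    # Safe to re-run: only act while the backup directory still exists
    if [ -d /srv/app.bak ]; then
        rm -rf /srv/app
        mv /srv/app.bak /srv/app
    fi
}
```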
Logging and Diagnostics
Good error handling includes good messages:
- Direct error messages to stderr: `>&2`
- Include context: what failed, which resource, maybe which user or host
- Use consistent prefixes (e.g., `ERROR:`, `WARN:`, `INFO:`)
Example helper:
```bash
log_error() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $*" >&2
}

do_step || log_error "do_step failed with status $?"
```
When combined with `trap 'error_handler $LINENO' ERR`, you can get both context and line numbers.
Defensive Patterns and Best Practices
- Prefer explicit checks over magic behavior
  Don't rely solely on `set -e`; clearly handle critical failures.
- Treat warnings differently from errors
  Not everything needs to abort the script; document which conditions are fatal.
- Fail early
  Validate inputs and environment at the top of the script.
- Keep error paths simple
  Complex cleanup or rollback code can introduce new bugs; keep it straightforward.
- Test error paths
  Intentionally cause failures (fake missing files, failing commands) to see how the script behaves; see the sketch after this list.
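One lightweight way to do that, sketched against a hypothetical `backup.sh` that should fail loudly on a missing input:

```bash
#!/bin/bash
# Drive the script down its missing-input error path and check the result
if ./backup.sh /no/such/file 2>stderr.log; then
    echo "FAIL: expected a non-zero exit" >&2
    exit 1
fi
grep -q '^ERROR:' stderr.log || { echo "FAIL: no ERROR: message on stderr" >&2; exit 1; }
echo "PASS: error path behaves as expected"
```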
By combining exit status conventions, set options, traps, and structured patterns (retries, cleanup, validation), you can write shell scripts that behave predictably even when things go wrong.