Why Error Handling Matters in Shell Scripts
Shell scripts often glue many commands together. Any one of them can fail due to missing files, permissions, network issues, or user input errors. Without explicit error handling you get:
- Silent failures (script continues doing the wrong thing)
- Partial work (some steps succeeded, others didn’t)
- Corrupted state (half-written files, partially applied changes)
The goal of error handling is not only to detect failures, but to:
- Fail fast and loudly when appropriate
- Clean up temporary resources
- Provide meaningful diagnostics
- Optionally recover or retry where it’s safe
This chapter assumes you already understand basic script structure, variables, and functions.
Exit Status and Failure Conventions
Every command returns an exit status, available in `$?`. By convention:

- `0` means success
- Non-zero means some kind of failure

You rarely read `$?` directly in advanced scripts; instead you let conditionals and the shell's own behavior do most of the work.
Examples:
```bash
cmd
status=$?
if [ "$status" -ne 0 ]; then
    echo "cmd failed with status $status" >&2
fi
```

Most conditionals automatically use exit statuses: `if cmd; then ... fi` runs the `then` branch if `cmd` returns `0`. `&&` and `||` chains also use exit statuses (see below).
Controlling Script Exit on Errors
`set -e` / `set -o errexit`
`set -e` tells the shell: “exit the script if any simple command fails (returns non-zero), with some exceptions.”

```bash
set -e   # or: set -o errexit

cp file1 file2
echo "Copy succeeded"
```

If `cp` fails, the script terminates immediately.
Traps and Pitfalls of `set -e`
`set -e` is useful, but its behavior is nuanced:

- It does not trigger on commands:
  - in `if`, `while`, `until` conditions
  - after `!` (negation)
  - in `cmd1 && cmd2` or `cmd1 || cmd2`, where the shell expects failures
- It may be ignored in subshells and command substitutions depending on how they're invoked (a sketch of one such case appears below)
Example:
```bash
set -e

if grep pattern file; then   # `grep` failure does not exit the script
    echo "found"
fi
```

Because of these subtleties, many experienced scripters:

- Use `set -e` together with other options (see below), and
- Prefer explicit checks and `|| exit` patterns for critical commands.
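One frequently missed case from the list above: a failing command substitution can be masked when it is combined with another command such as `local`. A minimal sketch (function names are illustrative):

```bash
set -e

f() {
    local out
    out=$(false)        # plain assignment: the substitution's status propagates; errexit fires
    echo "not reached"
}

g() {
    local out=$(false)  # 'local' itself returns 0, masking the failure
    echo "still running"
}

g   # prints "still running"
f   # the script exits here
```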
Making `set -e` Safer: `-u`, `-o pipefail`, and `-E`
These options are often combined (in Bash):
```bash
set -euo pipefail
# or more explicitly:
set -o errexit -o nounset -o pipefail
```

- `-u` (`set -o nounset`): treat use of unset variables as an error (see the example below).
- `-o pipefail`: make a pipeline fail when any of its commands fails; its exit status becomes that of the rightmost failing command.
- `-E` (`set -o errtrace`): preserve `ERR` traps across functions and command substitutions (Bash only, explained later).
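For example, with `-u` a typo becomes a hard error instead of a silent empty expansion:

```bash
set -u

name="world"
echo "Hello, $nmae"   # typo: bash aborts with "nmae: unbound variable"
```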
`pipefail` is especially important: without it, only the exit status of the last command in a pipeline is visible.

```bash
set -e

false | true
echo "Still running"   # script continues; pipeline exit is 0 (from 'true')
```
With `set -o pipefail`, the pipeline returns non-zero if any element fails:

```bash
set -eo pipefail

false | true
echo "Never reached"   # script exits due to 'false'
```

Patterns for Checking Errors
Using `&&` and `||`
`&&` runs the next command only if the previous succeeded; `||` runs the next command only if the previous failed.
Common patterns:
- Only run follow-up on success:

```bash
build_app && deploy_app
```

- Exit on failure with a message:

```bash
do_critical_thing || {
    echo "do_critical_thing failed" >&2
    exit 1
}
```

- Provide a default on failure:

```bash
value=$(some_command) || value="default"
```
Be careful with `cmd || exit 1` in scripts that are meant to continue for non-critical errors. Reserve this for truly fatal conditions.
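For example (command names here are hypothetical), a script might warn and continue for optional steps but abort for required ones:

```bash
# Non-fatal: warn and keep going
warm_cache || echo "WARN: cache warm-up failed; continuing" >&2

# Fatal: the script cannot proceed without this
load_config || { echo "ERROR: cannot load config" >&2; exit 1; }
```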
Explicit `if` Checks
For clarity, especially in longer scripts:
```bash
if ! do_step; then
    echo "ERROR: do_step failed" >&2
    # maybe clean up or skip, depending on the situation
    exit 1
fi
```

This is more verbose but easier to read and extend (e.g., logging, retries).
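For instance, the same shape extends to a single retry before giving up (a sketch reusing the placeholder `do_step`):

```bash
if ! do_step; then
    echo "WARN: do_step failed, retrying once..." >&2
    if ! do_step; then
        echo "ERROR: do_step failed twice" >&2
        exit 1
    fi
fi
```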
Handling Errors in Functions
Advanced scripts usually encapsulate tasks into functions. Error handling inside functions determines how failures propagate.
Returning Status vs Exiting
Two main styles:
- Functions return a status and the caller decides:

```bash
deploy() {
    rsync ...
    # rsync's status becomes the function's status
}

if ! deploy; then
    echo "Deploy failed" >&2
    exit 1
fi
```

- Functions exit the whole script on fatal errors:

```bash
fatal() {
    echo "FATAL: $*" >&2
    exit 1
}

deploy() {
    rsync ... || fatal "rsync failed"
}

deploy
```

Style 1 is more flexible and testable; style 2 is simpler for small scripts.
Consistent Return Codes
For predictable error handling:
- Reserve `0` for success.
- Use specific, documented codes for common error types when needed.
Example:
```bash
# 0 = success, 1 = general error, 2 = invalid usage
backup() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: backup DIR" >&2
        return 2
    fi
    # ...
}
```

Callers can make decisions based on specific codes.
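For instance, a caller can branch on the documented codes (a sketch using the `backup` function above; the `|| rc=$?` capture keeps `set -e` from aborting first):

```bash
backup "$@" && rc=0 || rc=$?

case $rc in
    0) echo "Backup complete" ;;
    2) echo "Usage: backup DIR" >&2; exit 2 ;;
    *) echo "ERROR: backup failed (status $rc)" >&2; exit 1 ;;
esac
```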
Trapping and Handling Failures
The `trap` Builtin
`trap` lets you run commands when certain signals or pseudo-signals occur, including:

- `EXIT` — when the script is about to exit (for any reason)
- `ERR` — when a command fails (subject to `errexit` behavior)
- `INT` — on Ctrl+C (SIGINT)
- `TERM` — on termination (SIGTERM)
Cleanup on Exit
A common pattern is to do cleanup (temporary files, locks) in an `EXIT` trap:

```bash
tmpdir=$(mktemp -d)

cleanup() {
    rm -rf "$tmpdir"
}
trap cleanup EXIT

# use $tmpdir for work...
```
This cleanup runs whether the script succeeds, hits an error, or is exited manually (except for catastrophic terminations like `kill -9`).
Using `ERR` for Centralized Error Handling
In Bash, with `set -e` and `set -E`, you can centralize error reporting:

```bash
set -Eeuo pipefail

error_handler() {
    local exit_code=$?
    local line_no=$1
    echo "ERROR on line $line_no (exit code $exit_code)" >&2
}
trap 'error_handler $LINENO' ERR
```

Notes:

- `$?` inside the handler gives the failing command's status.
- Because the trap string is single-quoted, `$LINENO` expands when the trap fires, at the line of the failing command; passing it as an argument, as shown, hands that value to the handler before the handler's own lines would change it.
- `set -E` ensures `ERR` is inherited by functions in Bash.
With this pattern, any unhandled failure triggers a standardized diagnostic.
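The handler can also report what failed, not just where: Bash stores the text of the currently executing command in `BASH_COMMAND` (a small variation on the handler above):

```bash
trap 'echo "ERROR: \"$BASH_COMMAND\" failed with status $? on line $LINENO" >&2' ERR
```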
Trapping Interrupts (Ctrl+C)
To avoid leaving half-done operations:
```bash
cleanup() {
    echo "Cleaning up..." >&2
    # remove temp files, unlock things, etc.
}
trap cleanup INT TERM

long_running_task
```

If the user presses Ctrl+C, `cleanup` runs. Note that once a trap handler returns, Bash resumes the script rather than exiting, so handlers for fatal signals usually exit or re-raise explicitly.
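One common shape for that (a sketch; `on_interrupt` is an illustrative name):

```bash
on_interrupt() {
    cleanup
    trap - INT      # restore default SIGINT handling
    kill -INT "$$"  # re-raise so the script dies by SIGINT, as callers expect
}
trap on_interrupt INT
```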
Pipeline Error Handling
Without `pipefail`, only the last command's exit status is visible:

```bash
grep pattern file | sort | uniq
echo $?   # status of 'uniq', not 'grep'
```

To correctly handle errors anywhere in the pipeline:
- Enable `pipefail`:

```bash
set -o pipefail
```

- Optionally check `PIPESTATUS` (a Bash-specific array):
```bash
set -o pipefail

grep pattern file | sort | uniq
pipe_status=("${PIPESTATUS[@]}")   # copy immediately: the next command overwrites PIPESTATUS

status_grep=${pipe_status[0]}
status_sort=${pipe_status[1]}
status_uniq=${pipe_status[2]}

if [ "$status_grep" -ne 0 ]; then
    echo "grep failed with $status_grep" >&2
fi
```
For most scripts, `set -o pipefail` plus `set -e` (or manual `if` checks) is sufficient.
Validating Inputs and Preconditions
Many errors can be prevented by upfront checks. This is part of error handling too.
Checking Arguments
Use explicit argument validation:
```bash
usage() {
    echo "Usage: $0 INPUT_FILE OUTPUT_DIR" >&2
    exit 2
}

[ "$#" -eq 2 ] || usage
```

Checking Files and Commands
Use test operators before doing critical work:
```bash
input=$1

if [ ! -f "$input" ]; then
    echo "ERROR: $input does not exist or is not a regular file" >&2
    exit 1
fi

if ! command -v jq >/dev/null 2>&1; then
    echo "ERROR: 'jq' is required but not installed" >&2
    exit 1
fi
```

By failing fast here, you avoid confusing errors deeper in the script.
Retrying and Recovery
Sometimes failures are transient (network hiccups, temporary locks). Instead of exiting immediately, you may want to retry.
Simple Retry Loop
```bash
retry() {
    local attempts=$1; shift
    local delay=$1; shift
    local i
    for ((i=1; i<=attempts; i++)); do
        if "$@"; then
            return 0
        fi
        if ((i < attempts)); then
            echo "Attempt $i failed; retrying in $delay seconds..." >&2
            sleep "$delay"
        fi
    done
    echo "ERROR: command failed after $attempts attempts: $*" >&2
    return 1
}

# Usage:
retry 5 2 curl -fsS "https://example.com/api"
```

Rollback on Failure
For operations that change state (e.g., deployment), plan a rollback:
```bash
deploy_new_version() {
    backup_existing

    if ! install_new; then
        echo "Install failed, rolling back..." >&2
        restore_backup
        return 1
    fi
}
```

Rollback logic can get complex; scripts should keep it as straightforward and idempotent as possible.
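Idempotent here means the rollback can run more than once without doing further damage. One way to approximate that (paths and helper name are hypothetical):

```bash
restore_backup() {
    # Safe to re-run: only act while the backup directory still exists
    if [ -d /srv/app.bak ]; then
        rm -rf /srv/app
        mv /srv/app.bak /srv/app
    fi
}
```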
Logging and Diagnostics
Good error handling includes good messages:
- Direct error messages to stderr: `>&2`
- Include context: what failed, which resource, maybe which user or host
- Use consistent prefixes (e.g., `ERROR:`, `WARN:`, `INFO:`)
Example helper:
```bash
log_error() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $*" >&2
}

do_step || log_error "do_step failed with status $?"
```
When combined with `trap 'error_handler $LINENO' ERR`, you can get both context and line numbers.
Defensive Patterns and Best Practices
- Prefer explicit checks over magic behavior
  Don't rely solely on `set -e`; clearly handle critical failures.
- Treat warnings differently from errors
  Not everything needs to abort the script; document which conditions are fatal.
- Fail early
  Validate inputs and environment at the top of the script.
- Keep error paths simple
  Complex cleanup or rollback code can introduce new bugs; keep it straightforward.
- Test error paths
  Intentionally cause failures (fake missing files, failing commands) to see how the script behaves; see the sketch after this list.
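One lightweight way to do that, sketched against a hypothetical `backup.sh` that should fail loudly on a missing input:

```bash
#!/bin/bash
# Drive the script down its missing-input error path and check the result
if ./backup.sh /no/such/file 2>stderr.log; then
    echo "FAIL: expected a non-zero exit" >&2
    exit 1
fi
grep -q '^ERROR:' stderr.log || { echo "FAIL: no ERROR: message on stderr" >&2; exit 1; }
echo "PASS: error path behaves as expected"
```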
By combining exit status conventions, set options, traps, and structured patterns (retries, cleanup, validation), you can write shell scripts that behave predictably even when things go wrong.