Why Error Handling Matters in Shell Scripts
Error handling in shell scripts is about detecting when something goes wrong, reacting to it in a controlled way, and making sure your script does not silently produce incorrect results. Even simple scripts benefit from basic error checking, and larger automation scripts can become dangerous if they assume that every command always succeeds.
Without explicit error handling, Bash and other shells normally continue running the next command even if the previous one failed. This default behavior is convenient for interactive use, but risky in automation. The central idea of error handling is to notice failures as they happen and decide what to do next, such as aborting, retrying, or using a fallback.
Always assume that any external command, file operation, or network call can fail, and design your script to detect and handle that failure.
Exit Status and `$?`
Every command in the shell finishes with an exit status. An exit status is a number in the range 0 to 255 that indicates success or failure. By convention:

- Exit code 0 means success.
- Any nonzero exit code means some kind of failure.
The most direct way to check the result of the last command is the special variable $?. Immediately after a command runs, $? holds its exit status. It is overwritten after each command, so you must read it before running another command.
Example:
```bash
cp /source/file /dest/file
status=$?
echo "cp exit code was $status"
```

You can compare this status and react:
```bash
cp /source/file /dest/file

if [ $? -ne 0 ]; then
    echo "Copy failed" >&2
    exit 1
fi
```
This pattern is useful when you want full control over how each failure is handled. However, constantly checking $? becomes verbose, which leads to more concise patterns such as using if or || directly.
The exit status $? is only valid for the immediately preceding command. Any other command or subshell evaluated first will overwrite it.
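To see the overwriting in action, here is a tiny sketch using `false` as a stand-in for a failing command:

```bash
false                     # fails with status 1
echo "logging something"  # this echo succeeds...
echo "$?"                 # ...so this prints 0, not 1
```

The logging line between the failure and the check silently discarded the status you wanted.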
Using `if`, `&&`, and `||` for Error Checks
In shell scripts, control structures are closely tied to exit codes. The if statement runs its body when the command after if returns 0. Operators && and || execute conditional follow-up commands based on command success or failure.
A typical pattern is:
```bash
if some_command; then
    echo "some_command succeeded"
else
    echo "some_command failed" >&2
fi
```
Here the if directly tests the exit status of some_command. You do not need to use $? explicitly.
Operator && runs the following command only if the previous one succeeded:
```bash
mkdir -p /backup && echo "Backup directory ready"
```
Operator || runs the following command only if the previous one failed:
```bash
cp important.txt /backup/ || echo "Backup failed" >&2
```

A common idiom is to abort on failure:

```bash
cp important.txt /backup/ || exit 1
```

You can also combine them:

```bash
some_command && echo "OK" || echo "Failed"
```
You must be careful with such combinations. The second part after || will run if the entire left side has a nonzero exit code. If the echo "OK" command were to fail, the right part would run even though some_command succeeded. For simple logging this is usually acceptable, but for critical decisions you often want a more explicit structure with if.
When you chain commands with cmd && success || failure, remember that failure runs if either cmd fails or success fails. Use if when you need certainty about which command failed.
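The same logic written with `if` (using a placeholder `some_command` here) avoids that ambiguity, because each branch is tied to exactly one command's status:

```bash
# Hypothetical placeholder; substitute any real command.
some_command() { true; }

if some_command; then
    echo "OK"
else
    echo "Failed" >&2
fi
```

Even if `echo "OK"` were to fail, the `else` branch could never run.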
Exiting with Meaningful Codes
When your script finishes, it should return an exit code that indicates success or failure to whatever called it. This is useful when scripts are chained together or run from other tools.
By default, a script exits with the exit status of the last command that ran. You can override this with the exit builtin:
```bash
if ! do_critical_thing; then
    echo "Critical step failed" >&2
    exit 2
fi

echo "All good"
exit 0
```

Using different exit codes for different error conditions makes automation easier. For example, you can define:

- `0` for success.
- `1` for a generic error.
- `2` for misuse of shell builtins or wrong arguments.
- Higher numbers for specific failures, such as missing configuration, network failure, or permission problems.
There is no universal standard for all scripts, but using a small set of consistent codes within a project improves clarity.
`set -e` and `errexit`
The shell can be instructed to stop a script when a command fails. The most common option is set -e, also known as errexit. When errexit is active, the script will immediately exit if any simple command returns a nonzero status, with some important exceptions.
You can enable it like this:

```bash
set -e
```

or equivalently:

```bash
set -o errexit
```
With set -e turned on, this script will exit as soon as cp fails:
```bash
set -e

cp source.txt /restricted/directory
echo "This will not run if cp fails"
```
This seems attractive because you do not have to check every command manually. However, set -e has subtle rules. For example it does not apply in some parts of compound commands and is affected by if, while, until, and logical operators.
Examples of situations where set -e does not cause an exit:

- A failing command used directly in an `if` condition.
- A failing command in `cmd || fallback`, where the failure is expected and handled.
Example:
```bash
set -e

if grep -q "pattern" file.txt; then
    echo "Pattern found"
else
    echo "Pattern not found"
fi
```
Here grep returning nonzero for "not found" does not abort the script, because it is part of an if test. This behavior is intentional, but it means you need to understand how Bash interprets failures in different contexts.
Because of these subtleties, many experienced script authors either avoid set -e in complex scripts or use it with clearly documented patterns.
set -e can cause scripts to exit in unexpected places, especially in complex control structures. If you use it, test your script thoroughly with both success and failure scenarios.
`set -u`, `set -o pipefail`, and `set -E`
Error handling in Bash is often strengthened with other options alongside set -e. A common group of settings is sometimes referred to as a defensive shell style.
set -u or set -o nounset treats the use of an unset variable as an error. Instead of silently expanding to an empty string, Bash prints an error and, if errexit is active, exits.
Example:
```bash
set -u
echo "User is $USERNAME"
```
If USERNAME is not set, the script will detect it. This is useful to catch typos and missing environment variables early.
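Note that `${VAR:-default}` expansions are still allowed under nounset, which gives you a controlled way to supply fallbacks (the `USERNAME`/`guest` pair here is just an example):

```bash
set -u

# The :- expansion supplies a fallback, so nounset raises no error
# even when USERNAME is unset.
echo "User is ${USERNAME:-guest}"
```

This lets you distinguish variables that must be set from those with a sensible default.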
set -o pipefail affects pipelines. Normally, a pipeline such as cmd1 | cmd2 | cmd3 returns the exit status of the last command only. If cmd1 fails but cmd3 succeeds, the pipeline appears successful.
With pipefail enabled, the pipeline’s exit status is the value of the rightmost command that failed with a nonzero status, or 0 if all succeed.
Example:
```bash
set -o pipefail
grep "pattern" file.txt | sort | uniq
```
If grep fails because file.txt is missing, the entire pipeline will report failure, even if sort and uniq run.
set -E modifies how ERR traps behave within functions and subshells, and is relevant when you start using traps for error handling. It makes error traps propagate more predictably into function contexts.
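A minimal sketch of the effect: without set -E, the ERR trap below would not fire for the failure inside the function body; with it, the trap runs there too.

```bash
set -E   # propagate the ERR trap into functions

trap 'echo "trapped a failure" >&2' ERR

fails_inside() {
    false   # nonzero status; triggers the trap thanks to set -E
}

fails_inside
echo "script continues (errexit is not set)"
```

Because errexit is not enabled here, the script keeps running after the trap fires.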
A commonly seen pattern at the start of robust scripts is:

```bash
set -euo pipefail
```

This is a shorthand that combines:

- `-e` for errexit.
- `-u` for nounset.
- `-o pipefail` for strict pipeline error detection.
Some shells other than Bash may not support all three exactly the same way, so this pattern is most reliable in Bash itself.
The combination set -euo pipefail aggressively treats many conditions as fatal errors. Use it only after confirming your script’s logic and testing how it behaves when individual commands fail.
Handling Errors in Pipelines
Pipelines are common in shell scripts, but they complicate error handling. Without pipefail, only the exit code of the last command is visible in $?. This can hide failures earlier in the pipeline.
Consider:
```bash
grep "pattern" file.txt | sort
echo "Status: $?"
```
If file.txt does not exist, grep will fail. However, sort may still succeed with no input and return 0. The status printed will be 0, and your script may assume everything went well.
With pipefail:
```bash
set -o pipefail
grep "pattern" file.txt | sort
echo "Status with pipefail: $?"
```

Now the status is nonzero whenever any command in the pipeline fails, which is more accurate for error detection.
When you need to distinguish which command failed, you can inspect the PIPESTATUS array in Bash, which holds the exit status of each command in the most recent pipeline.
Example:
```bash
grep "pattern" file.txt | sort | uniq
echo "grep: ${PIPESTATUS[0]} sort: ${PIPESTATUS[1]} uniq: ${PIPESTATUS[2]}"
```

You can then handle each case differently, for example treating "no matches found" as a nonfatal condition while considering a missing file fatal.
Using `trap` and `ERR` for Centralized Error Handling
The trap builtin allows you to specify commands that run automatically when certain signals or events occur. For error handling, the special ERR pseudo-signal is important. It triggers when a command in the script returns a nonzero status, subject to some conditions.
You can define a central error handler like this:
```bash
set -E
trap 'echo "Error on line $LINENO"; exit 1' ERR
```

Whenever a command fails, Bash runs the trap handler, prints the message with the line number, and exits. This pattern avoids repeating similar checks after each command.
You can extend this handler with more information, such as the last command’s status:
```bash
set -E
trap 'status=$?; echo "Error $status on line $LINENO"; exit $status' ERR
```

This retains the original exit code, which is important when other tools inspect why a script failed.
There are details to consider:
- Some failures that are part of control flow, such as a failing test in an `if` condition, may not trigger the `ERR` trap in all situations.
- When `errexit` is active, an `ERR` trap can run just before the script exits on a failure, which is a useful place to perform cleanup.
You can also trap other signals such as INT (Ctrl+C) or TERM to perform cleanup or logging before exiting. Although this is not strictly about errors in commands, it is part of robust error handling for real-world scripts.
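A common convention is to exit with 128 plus the signal number after handling a signal (130 for SIGINT, 143 for SIGTERM). The handlers below are a sketch of that idea:

```bash
on_interrupt() {
    echo "Interrupted by user" >&2
    exit 130                 # 128 + 2 (SIGINT), a common convention
}

on_terminate() {
    echo "Terminated" >&2
    exit 143                 # 128 + 15 (SIGTERM)
}

trap on_interrupt INT
trap on_terminate TERM

# Long-running work; "sleep 60 & wait" lets the trap run promptly,
# because bash delays traps until a foreground command finishes.
sleep 60 &
wait
```

The background-plus-wait pattern matters: a plain foreground `sleep 60` would postpone the handler until the sleep completed.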
When using trap ERR, test how it interacts with functions, subshells, and conditional constructs in your script. Use set -E in Bash to ensure the trap runs inside functions as expected.
Designing Fail-Fast vs Fail-Soft Behavior
Error handling is not always about exiting immediately. Sometimes you want a script to keep going, skip bad items, and finish what it can. Other times a single failure should abort the entire process.
Fail-fast behavior is appropriate for:
- Critical setup steps, such as mounting a filesystem or obtaining a lock.
- Operations where partial success is worse than no action at all, such as applying a database migration.
In these cases, the script should immediately exit on error, often with a clear message and nonzero exit code.
Fail-soft behavior is appropriate for:
- Processing large sets of independent items, such as many files or hosts.
- Cleanup tasks where some items might already be absent.
For fail-soft designs, you might:
- Count failures and report them at the end.
- Log each failure but continue processing the rest.
- Retry certain operations a fixed number of times.
An example of fail-soft handling:
```bash
errors=0

for f in *.txt; do
    cp "$f" /backup/ || {
        echo "Failed to back up $f" >&2
        errors=$((errors + 1))
    }
done

if [ "$errors" -gt 0 ]; then
    echo "$errors files failed to back up" >&2
    exit 1
fi
```

Here the script continues and records problems instead of stopping at the first error.
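The retry option can be sketched as a small helper; the attempt count and delay here are arbitrary choices:

```bash
# Try a command up to 3 times with a 1-second pause between attempts.
retry() {
    local attempt
    for attempt in 1 2 3; do
        "$@" && return 0
        echo "Attempt $attempt failed: $*" >&2
        sleep 1
    done
    return 1
}

retry cp important.txt /backup/ || echo "Giving up on important.txt" >&2
```

Because `retry` forwards its exit status, it composes with `||`, `if`, and error counting just like any other command.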
Decide explicitly whether each part of your script should fail fast or fail soft. Apply consistent patterns, such as immediate exit for critical steps and counted errors for batch operations.
Cleaning Up on Errors
Many scripts create temporary files, temporary directories, or lock files. If a script ends on an error, leaving these behind can be harmful. Robust error handling includes cleanup when something goes wrong.
A common pattern is to create resources and register a trap that removes them on exit, whether that exit is normal or due to an error.
Example:
```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Use "$tmpdir" for intermediate work
```
The trap on EXIT runs whether your script succeeds, fails, or ends through an explicit call to exit. Combining this with ERR or set -e ensures that errors do not leave stray files.
If you use signals such as INT and TERM to handle interruption by the user or the system, you can add them to the trap list so that cleanup runs even when the user presses Ctrl+C.
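Putting the cleanup in a named function keeps the trap list readable; whether an EXIT trap alone also covers signal-caused exits varies between shells, so listing the signals explicitly is the cautious choice:

```bash
tmpdir=$(mktemp -d)

cleanup() {
    rm -rf "$tmpdir"    # safe to run more than once
}

# Run cleanup on normal exit and on common termination signals.
trap cleanup EXIT INT TERM

# ... work with "$tmpdir" ...
```

If a signal handler and the EXIT trap both fire, `rm -rf` is idempotent, so running cleanup twice is harmless.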
Always arrange for temporary files, lock files, and similar resources to be removed on both success and failure, usually with a trap on EXIT or on specific signals.
Logging and Error Messages
Error handling is not complete without clear messages. A script that fails silently is hard to debug. Simple practices make problems easier to diagnose:
- Print error messages to standard error using `>&2`.
- Include context, such as the file name, command, or function that failed.
- Optionally include an error code or line number.
Example of a small helper:
```bash
error() {
    echo "ERROR: $*" >&2
}
```

Then use it: