From Login to Results: The End-to-End Flow
Running on a cluster follows a fairly standard pattern, regardless of site:
- Prepare your code and inputs (usually elsewhere: laptop, dev node, or login node).
- Stage data and executables to the cluster filesystem.
- Request resources and submit a job to the scheduler (e.g. SLURM).
- Monitor while it is queued and running.
- Inspect output, handle errors, and iterate.
- Archive or clean up results.
This chapter focuses on how that flow looks in practice on a typical HPC cluster.
Typical roles of the main systems:
- Local machine: editing code, prototyping, visualization, small tests.
- Login node: compilation, job script creation, light interactive tests.
- Compute nodes: actual production runs via the scheduler.
- Storage: project space, scratch, shared data repositories.
Preparing to Run: Executables and Inputs
Before submitting anything:
- Ensure your executable is built for the cluster environment:
- Use the cluster's compilers and libraries (often via module load).
- Build on a login node or dedicated build node, not your laptop, if architectures differ.
- Confirm runtime dependencies:
- MPI version, math libraries, GPU libraries, Python environments, etc.
- Organize input data and configuration files:
- Use clear directory structures:
code/, inputs/, scripts/, results/, logs/, etc.
- Avoid large numbers of tiny input files in a single directory if possible.
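As a concrete illustration of this checklist, here is a minimal sketch of building and sanity-checking the executable on a login node, assuming the same gcc/openmpi module names used in the examples later in this chapter:
module load gcc/13.2.0 openmpi/5.0.0        # use the cluster's toolchain, not your laptop's
mkdir -p build
mpicxx code/main.cpp -O3 -o build/my_app    # build on the login or build node
ldd build/my_app | grep -i mpi              # confirm the expected MPI library is linked
module list                                 # record which modules the build used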
A typical project layout for running on a cluster:
my_project/
code/
main.cpp
...
build/
my_app # compiled executable
inputs/
config.in
initial_state.dat
scripts/
run_weak_scaling.slurm
run_strong_scaling.slurm
results/
test/
production/

Choosing How to Run: Interactive vs Batch
Most clusters support two main execution styles.
Interactive jobs
Use interactive jobs for short tests, debugging, and exploratory work.
- Request an allocation that drops you into a shell on a compute node:
- With SLURM:
salloc -N 1 -n 4 -t 00:30:00 --partition=debug
srun ./my_app input.dat
- Characteristics:
- You get a prompt on a compute node.
- Commands you run there consume the resources of that allocation.
- Good for:
- Tuning thread/MPI counts
- Checking environment and paths
- Debugging with gdb, profilers, etc.
Site-specific examples (conceptually similar even if commands differ):
- salloc (SLURM)
- qsub -I (PBS/Torque)
- bsub -Is (LSF)
Batch jobs
Batch jobs are the normal way to run real workloads.
- You write a job script that:
- Describes the resources you need.
- Describes what commands to run.
- Submit the script to the scheduler and it runs when resources become available.
Skeleton batch flow (SLURM):
sbatch scripts/run_simulation.slurm # submit job
squeue -u $USER # watch queue
cat slurm-123456.out                 # read output after completion

Practical Job Script Structure
The exact syntax depends on the scheduler; here we use SLURM as a concrete example, but the structure is similar in other systems.
A typical job script has four main parts:
- Shebang: which shell to use.
- Scheduler directives: resources, time, partition, account, etc.
- Environment setup: modules, variables, working directory.
- Execution commands:
srun, mpirun, or an application launcher.
Example batch script for a simple MPI job:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --output=logs/mpi_test_%j.out
#SBATCH --error=logs/mpi_test_%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=standard
#SBATCH --account=my_project
# 1. Environment setup
module purge
module load gcc/13.2.0
module load openmpi/5.0.0
cd $SLURM_SUBMIT_DIR # directory where 'sbatch' was called
# 2. Run the application
srun ./build/my_app inputs/config.in

Key practical points:
- Naming and logs:
- Use %j for the job ID in filenames to avoid collisions.
- Separate stdout and stderr (--output vs --error) for easier debugging.
- Reproducibility:
- Print environment info at start:
echo "Job ID: $SLURM_JOB_ID"
echo "Running on nodes:"
scontrol show hostnames "$SLURM_JOB_NODELIST"
module list
git rev-parse HEAD 2>/dev/null || echo "Not a git repo"
- Exit on errors (optional):
set -euo pipefail
This helps catch missing files or environment issues early.
Running Different Types of Applications
Clusters run a wide variety of codes. The launch pattern depends on the parallel model.
Serial (single-core) applications
Even serial programs should usually be run via the scheduler.
Script:
#SBATCH --job-name=serial_example
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
module load gcc/13.2.0
./build/serial_app inputs/config.in

Notes:
- Often you still want --time and --mem (if your site supports it) to reserve enough memory.
- Good for running many independent serial tasks via job arrays (see below).
OpenMP / threaded applications
Threaded codes use multiple cores within a node.
Key practical points:
- Match --cpus-per-task to the number of threads you intend to use.
- Set OMP_NUM_THREADS (or equivalent) appropriately.
- Bind threads to cores if your site recommends it (e.g. OMP_PROC_BIND=true).
Example:
#SBATCH --job-name=openmp_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:30:00
module load gcc/13.2.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./build/openmp_app inputs/config.in

MPI applications
MPI codes use multiple processes, often across nodes.
Practical checklist:
- Use the system-provided MPI that matches the launcher (srun, mpirun, mpiexec).
- Request total tasks as nodes × tasks-per-node.
- Avoid launching MPI processes manually with ssh.
Example:
#SBATCH --job-name=mpi_example
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
module load openmpi/5.0.0
srun ./build/mpi_app inputs/config.in
Some systems prefer mpirun or mpiexec:
mpirun -np $SLURM_NTASKS ./build/mpi_app inputs/config.in
Follow your site's recommendation to avoid conflicts.
Hybrid MPI + OpenMP
Combine both models to exploit nodes with many cores.
Key practical choices:
- MPI ranks per node × threads per rank = total cores per node.
- Match this to the actual hardware (e.g. 2 sockets × 32 cores each).
Example:
#SBATCH --job-name=hybrid_example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 # 4 MPI ranks per node
#SBATCH --cpus-per-task=8 # 8 threads per rank
#SBATCH --time=02:00:00
module load openmpi/5.0.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
srun ./build/hybrid_app inputs/config.in

GPU-accelerated applications
You must explicitly request GPUs and typically load CUDA or other GPU stacks.
Example for a single GPU per task:
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --partition=gpu
module load cuda/12.2
module load gcc/13.2.0
nvidia-smi # sanity check
./build/gpu_app inputs/config.in
Some clusters use --gres=gpu:1 instead of --gpus=1, or special GPU partitions. Always check local documentation.
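For example, on a site that uses GRES syntax the request above might instead read (a sketch; resource and partition names vary by site):
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1        # instead of --gpus=1 on GRES-based sites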
Managing Many Runs: Job Arrays and Sweeps
Real workloads often require many similar runs:
- Parameter sweeps.
- Ensemble simulations.
- Uncertainty quantification, Monte Carlo, or training multiple ML models.
Job arrays
Job arrays let you launch many nearly identical jobs in one command.
Conceptual pattern (SLURM):
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=0-9
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
module load gcc/13.2.0
PARAM_FILE=params.txt
PARAM=$(sed -n "$((SLURM_ARRAY_TASK_ID+1))p" "$PARAM_FILE")
./build/my_app --param "$PARAM"

Notes:
- SLURM_ARRAY_TASK_ID differs for each array element.
- You can use it to select:
- Lines in a parameter file.
- Different input directories.
- Different random seeds.
- Customize outputs:
#SBATCH --output=logs/run_%A_%a.out
where %A is the master array job ID and %a is the task ID.
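A minimal sketch of the second and third options, assuming hypothetical per-case input directories (inputs/case_0, inputs/case_1, ...) and an illustrative --input/--seed interface that is not part of the earlier example:
CASE_DIR="inputs/case_${SLURM_ARRAY_TASK_ID}"   # one input directory per array element
SEED=$((1000 + SLURM_ARRAY_TASK_ID))            # distinct, reproducible seed per element
./build/my_app --input "$CASE_DIR/config.in" --seed "$SEED"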
Manual parameter sweeps within a job
Sometimes you prefer to run multiple cases within a single allocation to reduce queue overhead:
for param in 0.1 0.2 0.4 0.8; do
echo "Running with param=$param"
./build/my_app --param "$param" > "results/param_${param}.out"
done
This is useful for cheap or quick runs, but make sure total run time stays within your wall-clock limit.
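If the allocation has enough cores, the cases can also run concurrently as separate job steps. A sketch, assuming the job requested at least four tasks; the exact flags for packing steps (--exact, --exclusive, per-step memory) vary between SLURM versions, so check your site's documentation:
for param in 0.1 0.2 0.4 0.8; do
    srun --ntasks=1 --exact ./build/my_app --param "$param" \
        > "results/param_${param}.out" 2>&1 &
done
wait    # do not let the job script exit until all steps have finished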
Resource Requests in Practice
How you request resources affects:
- Queue wait time.
- Performance.
- Fair-share accounting.
Key choices:
- Number of nodes: often 1 for development, more for strong/weak scaling tests.
- Cores per node: match hardware and parallelization strategy.
- Memory:
- Per node or per CPU (site-specific flags like --mem or --mem-per-cpu).
- Over-requesting can increase wait times.
- Wall time:
- Reasonable estimates reduce queue delay and limit job pre-emption or termination.
- Use test runs and scaling experiments to refine estimates.
- Partition/queue:
- Small, short jobs: debug partition (fast turnaround, limited time).
- Production runs: standard/long partitions.
- GPU jobs: GPU-specific partition.
Practical pattern: start with small test runs (short time, 1 node) to:
- Validate correctness.
- Roughly measure performance.
- Refine final resource request for larger production jobs.
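A minimal sketch of such a small test request, assuming a debug-style partition exists and that explicit memory requests are supported on your site:
#SBATCH --job-name=quick_test
#SBATCH --partition=debug       # assumption: name of the short-turnaround partition
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G                # only if your site supports explicit memory requests
#SBATCH --time=00:15:00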
Working with the Cluster Filesystem
How you manage data has a big impact on reliability and performance.
Choosing directories
Most clusters have distinct storage areas:
- Home: small, backed up; good for scripts, configs, and small data.
- Project / work: larger, often shared within a project; good for code and medium results.
- Scratch: large, high-performance, not backed up; meant for temporary results and big I/O.
Typical running pattern:
- Put code, scripts, parameter files in home or project space.
- Run jobs from a working directory on scratch:
cd /scratch/$USER/my_project/run1
srun ./my_app ...
- After completion, copy important results back to project/home before scratch is purged.
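A minimal sketch of that staging step, with illustrative paths (adjust to your site's scratch and project locations, and to whatever subdirectories your run actually produces):
RUN_DIR=/scratch/$USER/my_project/run1          # where the job ran
KEEP_DIR=$HOME/my_project/results/run1          # backed-up or project space
mkdir -p "$KEEP_DIR"
cp -r "$RUN_DIR"/results "$RUN_DIR"/logs "$KEEP_DIR"/   # copy only what you need to keep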
Avoiding filesystem pitfalls
- Do not run large jobs from your home directory if it’s not designed for heavy I/O.
- Avoid writing:
- Huge numbers of small files.
- Files in shared top-level directories.
- For multi-process runs:
- Prefer each rank writing to its own file or use parallel I/O libraries, instead of all ranks writing to the same file concurrently.
- Use descriptive names:
results/
case_A/
case_B/
scaling_test_32nodes/

Monitoring, Debugging, and Restarting Jobs
Your jobs will not always behave as expected. Practical handling is crucial.
Monitoring in queue and during runtime
Tools commonly available:
- Queue status:
squeue -u $USER
sacct -j <jobid>
- Node status and load:
- After logging in to a compute node: top, htop, nvidia-smi (for GPUs)
- Job logs:
- Tail logs during or after the run:
tail -f logs/my_job_123456.out
Include periodic progress messages in your application or wrapper scripts:
echo "Starting at $(date)"
./build/my_app ...
echo "Finished at $(date)"Common runtime issues
Symptoms and practical responses:
- Job killed due to time limit:
- Message in output or scheduler logs.
- Increase --time based on observed runtime.
- Add internal checkpoints and restart capability if possible.
- Out-of-memory:
- Errors like "killed" or ENOMEM.
- Request more memory (--mem or --mem-per-cpu) or reduce problem size.
- GPU or MPI errors:
- Often due to mismatched modules or incorrect launch commands.
- Verify module list, the launcher (srun vs mpirun), and the node type (GPU vs CPU partition).
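When a job ends unexpectedly, a quick post-mortem with sacct often shows whether it hit the time limit, ran out of memory, or failed outright. A sketch (field names are standard, but the accounting data actually recorded varies by site):
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS,ExitCode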
Checkpointing and restarts in practice
Many HPC applications support checkpoint/restart:
- Enable checkpoints by:
- Passing specific flags.
- Setting options in input files.
- Integrate with wall time:
- Write a checkpoint a few minutes before the wall-time limit is reached (see the sketch after this list).
- For restart runs:
- Create a separate job script that:
- Reads from the latest checkpoint.
- Writes new outputs to a new directory.
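One common way to tie checkpoints to the wall-time limit is to ask the scheduler for a warning signal and react to it in the job script. A sketch, assuming your site supports --signal and that the application (hypothetically) writes a checkpoint when it sees a sentinel file:
#SBATCH --signal=B:USR1@600      # send SIGUSR1 to the batch shell ~10 minutes before the limit

request_checkpoint() {
    echo "Wall time approaching at $(date); requesting checkpoint"
    touch REQUEST_CHECKPOINT     # hypothetical sentinel file the application polls
}
trap request_checkpoint USR1

srun ./my_app --input init.in --checkpoint checkpoint.dat &   # run in the background so the
APP_PID=$!                                                    # shell can handle the signal
wait "$APP_PID" || wait "$APP_PID"   # re-wait if the first wait was interrupted by the trap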
Example (using a hypothetical app):
# First run
srun ./my_app --input init.in --checkpoint checkpoint.dat
# Restart run
srun ./my_app --restart checkpoint.dat

Scaling Up: From Test to Production
Transitioning from tiny tests to full-scale runs requires a deliberate process.
Typical progression:
- Functional test:
- 1 node, very small problem size, short wall time.
- Performance sanity check:
- 1 node, realistic problem size.
- Confirm no severe bottlenecks (I/O, CPU idle, missing vectorization/GPU usage).
- Scaling experiments:
- Vary nodes or GPUs:
- 1, 2, 4, 8 nodes, measuring runtime and efficiency (a submission sketch follows this list).
- Decide where scaling benefits flatten or reverse.
- Production plan:
- Choose problem size and number of nodes based on scaling results.
- Estimate wall time with margin (e.g. 20–30% buffer).
- Submit final production jobs.
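A minimal sketch of submitting such a scaling sweep, assuming the node count in scripts/run_strong_scaling.slurm can be overridden at submission time (sbatch options given on the command line take precedence over #SBATCH directives):
for nodes in 1 2 4 8; do
    sbatch --nodes="$nodes" \
           --job-name="strong_scaling_${nodes}n" \
           --output="logs/strong_scaling_${nodes}n_%j.out" \
           scripts/run_strong_scaling.slurm
done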
Keep all scripts and logs from test and scaling stages; they are useful for:
- Troubleshooting.
- Future users in your group.
- Methodology sections in publications.
Practical Patterns and Tips
A few small habits significantly improve your experience:
- Use a single entry script per run:
#!/bin/bash
# run_caseA.sh
set -euo pipefail
module purge
module load my_software_stack
CASE=caseA
WORKDIR=/scratch/$USER/my_project/$CASE
mkdir -p "$WORKDIR"
cd "$WORKDIR"
cp ~/my_project/inputs/$CASE/* .
srun ~/my_project/build/my_app config.in
cp -r . ~/my_project/results/$CASE
Then call this entry script from your job script.
- Record configuration:
- Copy job scripts, parameter files, and code version info into the results directory (a minimal sketch follows this list).
- Avoid running heavy commands on login nodes:
- For large preprocessing or postprocessing, use small interactive allocations.
- Read site docs:
- Partitions/queues, limits, site-specific launch wrappers, filesystem rules, and example scripts.
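A minimal sketch of recording that configuration from inside a run, assuming the script is already in its working directory as in the entry script above (the job-script path and metadata filename are illustrative):
cp ~/my_project/scripts/run_caseA.slurm .        # the job script that produced this run
{
    echo "Job ID:   ${SLURM_JOB_ID:-interactive}"
    echo "Date:     $(date)"
    echo "Code rev: $(git -C ~/my_project rev-parse HEAD 2>/dev/null || echo 'not a git repo')"
    module list 2>&1                             # module list prints to stderr
} > run_metadata.txt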
Putting It Together: A Minimal End-to-End Example
Imagine you want to run an MPI simulation on 4 nodes with 32 tasks per node.
- Compile on the login node:
module load gcc/13.2.0 openmpi/5.0.0
mkdir -p build && cd build
mpicxx ../code/main.cpp -O3 -o my_app
- Create a job script scripts/run_production.slurm:
#!/bin/bash
#SBATCH --job-name=prod_sim
#SBATCH --output=logs/prod_sim_%j.out
#SBATCH --error=logs/prod_sim_%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00
#SBATCH --partition=standard
#SBATCH --account=my_project
set -euo pipefail
module purge
module load gcc/13.2.0 openmpi/5.0.0
cd $SLURM_SUBMIT_DIR
# Run from scratch
RUN_DIR=/scratch/$USER/prod_run_${SLURM_JOB_ID}
mkdir -p "$RUN_DIR"
cp inputs/config_prod.in "$RUN_DIR"
cp build/my_app "$RUN_DIR"
cd "$RUN_DIR"
srun ./my_app config_prod.in
# Save results
mkdir -p $SLURM_SUBMIT_DIR/results/prod
cp -r . $SLURM_SUBMIT_DIR/results/prod/run_${SLURM_JOB_ID}
- Submit:
cd ~/my_project
sbatch scripts/run_production.slurm
- Monitor:
squeue -u $USER
tail -f logs/prod_sim_123456.out     # replace with actual job ID
- Inspect results in results/prod/run_<jobid>/ after completion.
This pattern—compile, stage data, request resources with a carefully written job script, monitor, and archive results—is the core of running applications on clusters in practice.