From Login to Results: The End-to-End Flow
Running on a cluster follows a fairly standard pattern, regardless of site:
- Prepare your code and inputs (usually elsewhere: laptop, dev node, or login node).
- Stage data and executables to the cluster filesystem.
- Request resources and submit a job to the scheduler (e.g. SLURM).
- Monitor while it is queued and running.
- Inspect output, handle errors, and iterate.
- Archive or clean up results.
This chapter focuses on how that flow looks in practice on a typical HPC cluster.
Typical roles of the main systems:
- Local machine: editing code, prototyping, visualization, small tests.
- Login node: compilation, job script creation, light interactive tests.
- Compute nodes: actual production runs via the scheduler.
- Storage: project space, scratch, shared data repositories.
Preparing to Run: Executables and Inputs
Before submitting anything:
- Ensure your executable is built for the cluster environment:
- Use the cluster's compilers and libraries (often via module load).
- Build on a login node or dedicated build node, not your laptop, if architectures differ.
- Confirm runtime dependencies:
- MPI version, math libraries, GPU libraries, Python environments, etc.
- Organize input data and configuration files:
- Use clear directory structures:
code/, inputs/, scripts/, results/, logs/, etc.
- Avoid large numbers of tiny input files in a single directory if possible.
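As a concrete illustration of this checklist, here is a minimal sketch of building and sanity-checking the executable on a login node, assuming the same gcc/openmpi module names used in the examples later in this chapter:
module load gcc/13.2.0 openmpi/5.0.0        # use the cluster's toolchain, not your laptop's
mkdir -p build
mpicxx code/main.cpp -O3 -o build/my_app    # build on the login or build node
ldd build/my_app | grep -i mpi              # confirm the expected MPI library is linked
module list                                 # record which modules the build used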
A typical project layout for running on a cluster:
my_project/
code/
main.cpp
...
build/
my_app # compiled executable
inputs/
config.in
initial_state.dat
scripts/
run_weak_scaling.slurm
run_strong_scaling.slurm
results/
test/
production/

Choosing How to Run: Interactive vs Batch
Most clusters support two main execution styles.
Interactive jobs
Use interactive jobs for short tests, debugging, and exploratory work.
- Request an allocation that drops you into a shell on a compute node:
- With SLURM:
salloc -N 1 -n 4 -t 00:30:00 --partition=debug
srun ./my_app input.dat
- Characteristics:
- You get a prompt on a compute node.
- Commands you run there consume the resources of that allocation.
- Good for:
- Tuning thread/MPI counts
- Checking environment and paths
- Debugging with gdb, profilers, etc.
Site-specific examples (conceptually similar even if commands differ):
- salloc (SLURM)
- qsub -I (PBS/Torque)
- bsub -Is (LSF)
Batch jobs
Batch jobs are the normal way to run real workloads.
- You write a job script that:
- Describes the resources you need.
- Describes what commands to run.
- Submit the script to the scheduler and it runs when resources become available.
Skeleton batch flow (SLURM):
sbatch scripts/run_simulation.slurm # submit job
squeue -u $USER # watch queue
cat slurm-123456.out                 # read output after completion

Practical Job Script Structure
The exact syntax depends on the scheduler; here we use SLURM as a concrete example, but the structure is similar in other systems.
A typical job script has four main parts:
- Shebang: which shell to use.
- Scheduler directives: resources, time, partition, account, etc.
- Environment setup: modules, variables, working directory.
- Execution commands:
srun, mpirun, or an application launcher.
Example batch script for a simple MPI job:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --output=logs/mpi_test_%j.out
#SBATCH --error=logs/mpi_test_%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=standard
#SBATCH --account=my_project
# 1. Environment setup
module purge
module load gcc/13.2.0
module load openmpi/5.0.0
cd $SLURM_SUBMIT_DIR # directory where 'sbatch' was called
# 2. Run the application
srun ./build/my_app inputs/config.in

Key practical points:
- Naming and logs:
- Use %j for the job ID in filenames to avoid collisions.
- Separate stdout and stderr (--output vs --error) for easier debugging.
- Reproducibility:
- Print environment info at start:
echo "Job ID: $SLURM_JOB_ID"
echo "Running on nodes:"
scontrol show hostnames "$SLURM_JOB_NODELIST"
module list
git rev-parse HEAD 2>/dev/null || echo "Not a git repo"
- Exit on errors (optional):
set -euo pipefail
This helps catch missing files or environment issues early.
Running Different Types of Applications
Clusters run a wide variety of codes. The launch pattern depends on the parallel model.
Serial (single-core) applications
Even serial programs should usually be run via the scheduler.
Script:
#SBATCH --job-name=serial_example
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
module load gcc/13.2.0
./build/serial_app inputs/config.in

Notes:
- Often you still want --time and --mem (if your site supports it) to reserve enough memory.
- Good for running many independent serial tasks via job arrays (see below).
OpenMP / threaded applications
Threaded codes use multiple cores within a node.
Key practical points:
- Match --cpus-per-task to the number of threads you intend to use.
- Set OMP_NUM_THREADS (or equivalent) appropriately.
- Bind threads to cores if your site recommends it (e.g. OMP_PROC_BIND=true).
Example:
#SBATCH --job-name=openmp_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:30:00
module load gcc/13.2.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./build/openmp_app inputs/config.in

MPI applications
MPI codes use multiple processes, often across nodes.
Practical checklist:
- Use the system-provided MPI that matches the launcher (srun, mpirun, mpiexec).
- Request total tasks as nodes × tasks-per-node.
- Avoid launching MPI processes manually with ssh.
Example:
#SBATCH --job-name=mpi_example
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
module load openmpi/5.0.0
srun ./build/mpi_app inputs/config.in
Some systems prefer mpirun or mpiexec:
mpirun -np $SLURM_NTASKS ./build/mpi_app inputs/config.in
Follow your site's recommendation to avoid conflicts.
Hybrid MPI + OpenMP
Combine both models to exploit nodes with many cores.
Key practical choices:
- MPI ranks per node × threads per rank = total cores per node.
- Match this to the actual hardware (e.g. 2 sockets × 32 cores each).
Example:
#SBATCH --job-name=hybrid_example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 # 4 MPI ranks per node
#SBATCH --cpus-per-task=8 # 8 threads per rank
#SBATCH --time=02:00:00
module load openmpi/5.0.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
srun ./build/hybrid_app inputs/config.in

GPU-accelerated applications
You must explicitly request GPUs and typically load CUDA or other GPU stacks.
Example for a single GPU per task:
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --partition=gpu
module load cuda/12.2
module load gcc/13.2.0
nvidia-smi # sanity check
./build/gpu_app inputs/config.in
Some clusters use --gres=gpu:1 instead of --gpus=1, or special GPU partitions. Always check local documentation.
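For example, on a site that uses GRES syntax the request above might instead read (a sketch; resource and partition names vary by site):
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1        # instead of --gpus=1 on GRES-based sites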
Managing Many Runs: Job Arrays and Sweeps
Real workloads often require many similar runs:
- Parameter sweeps.
- Ensemble simulations.
- Uncertainty quantification, Monte Carlo, or training multiple ML models.
Job arrays
Job arrays let you launch many nearly identical jobs in one command.
Conceptual pattern (SLURM):
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=0-9
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
module load gcc/13.2.0
PARAM_FILE=params.txt
PARAM=$(sed -n "$((SLURM_ARRAY_TASK_ID+1))p" "$PARAM_FILE")
./build/my_app --param "$PARAM"

Notes:
- SLURM_ARRAY_TASK_ID differs for each array element.
- You can use it to select:
- Lines in a parameter file.
- Different input directories.
- Different random seeds.
- Customize outputs:
#SBATCH --output=logs/run_%A_%a.out
where %A is the master array job ID and %a is the task ID.
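A minimal sketch of the second and third options, assuming hypothetical per-case input directories (inputs/case_0, inputs/case_1, ...) and an illustrative --input/--seed interface that is not part of the earlier example:
CASE_DIR="inputs/case_${SLURM_ARRAY_TASK_ID}"   # one input directory per array element
SEED=$((1000 + SLURM_ARRAY_TASK_ID))            # distinct, reproducible seed per element
./build/my_app --input "$CASE_DIR/config.in" --seed "$SEED"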
Manual parameter sweeps within a job
Sometimes you prefer to run multiple cases within a single allocation to reduce queue overhead:
for param in 0.1 0.2 0.4 0.8; do
echo "Running with param=$param"
./build/my_app --param "$param" > "results/param_${param}.out"
done
This is useful for cheap or quick runs, but make sure total run time stays within your wall-clock limit.
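If the allocation has enough cores, the cases can also run concurrently as separate job steps. A sketch, assuming the job requested at least four tasks; the exact flags for packing steps (--exact, --exclusive, per-step memory) vary between SLURM versions, so check your site's documentation:
for param in 0.1 0.2 0.4 0.8; do
    srun --ntasks=1 --exact ./build/my_app --param "$param" \
        > "results/param_${param}.out" 2>&1 &
done
wait    # do not let the job script exit until all steps have finished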
Resource Requests in Practice
How you request resources affects:
- Queue wait time.
- Performance.
- Fair-share accounting.
Key choices:
- Number of nodes: often 1 for development, more for strong/weak scaling tests.
- Cores per node: match hardware and parallelization strategy.
- Memory:
- Per node or per CPU (site-specific flags like --mem or --mem-per-cpu).
- Over-requesting can increase wait times.
- Wall time:
- Reasonable estimates reduce queue delay and limit job pre-emption or termination.
- Use test runs and scaling experiments to refine estimates.
- Partition/queue:
- Small, short jobs: debug partition (fast turnaround, limited time).
- Production runs: standard/long partitions.
- GPU jobs: GPU-specific partition.
Practical pattern: start with small test runs (short time, 1 node) to:
- Validate correctness.
- Roughly measure performance.
- Refine final resource request for larger production jobs.
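A minimal sketch of such a small test request, assuming a debug-style partition exists and that explicit memory requests are supported on your site:
#SBATCH --job-name=quick_test
#SBATCH --partition=debug       # assumption: name of the short-turnaround partition
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G                # only if your site supports explicit memory requests
#SBATCH --time=00:15:00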
Working with the Cluster Filesystem
How you manage data has a big impact on reliability and performance.
Choosing directories
Most clusters have distinct storage areas:
- Home: small, backed up; good for scripts, configs, and small data.
- Project / work: larger, often shared within a project; good for code and medium results.
- Scratch: large, high-performance, not backed up; meant for temporary results and big I/O.
Typical running pattern:
- Put code, scripts, parameter files in home or project space.
- Run jobs from a working directory on scratch:
cd /scratch/$USER/my_project/run1
srun ./my_app ...
- After completion, copy important results back to project/home before scratch is purged.
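A minimal sketch of that staging step, with illustrative paths (adjust to your site's scratch and project locations, and to whatever subdirectories your run actually produces):
RUN_DIR=/scratch/$USER/my_project/run1          # where the job ran
KEEP_DIR=$HOME/my_project/results/run1          # backed-up or project space
mkdir -p "$KEEP_DIR"
cp -r "$RUN_DIR"/results "$RUN_DIR"/logs "$KEEP_DIR"/   # copy only what you need to keep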
Avoiding filesystem pitfalls
- Do not run large jobs from your home directory if it’s not designed for heavy I/O.
- Avoid writing:
- Huge numbers of small files.
- Files in shared top-level directories.
- For multi-process runs:
- Prefer each rank writing to its own file or use parallel I/O libraries, instead of all ranks writing to the same file concurrently.
- Use descriptive names:
results/
case_A/
case_B/
scaling_test_32nodes/

Monitoring, Debugging, and Restarting Jobs
Your jobs will not always behave as expected. Practical handling is crucial.
Monitoring in queue and during runtime
Tools commonly available:
- Queue status:
squeue -u $USER
sacct -j <jobid>
- Node status and load:
- After logging in to a compute node: top, htop, nvidia-smi (for GPUs)
- Job logs:
- Tail logs during or after the run:
tail -f logs/my_job_123456.out
Include periodic progress messages in your application or wrapper scripts:
echo "Starting at $(date)"
./build/my_app ...
echo "Finished at $(date)"Common runtime issues
Symptoms and practical responses:
- Job killed due to time limit:
- Message in output or scheduler logs.
- Increase --time based on observed runtime.
- Add internal checkpoints and restart capability if possible.
- Out-of-memory:
- Errors like "killed" or ENOMEM.
- Request more memory (--mem or --mem-per-cpu) or reduce problem size.
- GPU or MPI errors:
- Often due to mismatched modules or incorrect launch commands.
- Verify module list, the launcher (srun vs mpirun), and the node type (GPU vs CPU partition).
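When a job ends unexpectedly, a quick post-mortem with sacct often shows whether it hit the time limit, ran out of memory, or failed outright. A sketch (field names are standard, but the accounting data actually recorded varies by site):
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS,ExitCode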
Checkpointing and restarts in practice
Many HPC applications support checkpoint/restart:
- Enable checkpoints by:
- Passing specific flags.
- Setting options in input files.
- Integrate with wall time:
- Write a checkpoint a few minutes before the wall-time limit is reached (see the sketch after this list).
- For restart runs:
- Create a separate job script that:
- Reads from the latest checkpoint.
- Writes new outputs to a new directory.
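One common way to tie checkpoints to the wall-time limit is to ask the scheduler for a warning signal and react to it in the job script. A sketch, assuming your site supports --signal and that the application (hypothetically) writes a checkpoint when it sees a sentinel file:
#SBATCH --signal=B:USR1@600      # send SIGUSR1 to the batch shell ~10 minutes before the limit

request_checkpoint() {
    echo "Wall time approaching at $(date); requesting checkpoint"
    touch REQUEST_CHECKPOINT     # hypothetical sentinel file the application polls
}
trap request_checkpoint USR1

srun ./my_app --input init.in --checkpoint checkpoint.dat &   # run in the background so the
APP_PID=$!                                                    # shell can handle the signal
wait "$APP_PID" || wait "$APP_PID"   # re-wait if the first wait was interrupted by the trap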
Example (using a hypothetical app):
# First run
srun ./my_app --input init.in --checkpoint checkpoint.dat
# Restart run
srun ./my_app --restart checkpoint.dat

Scaling Up: From Test to Production
Transitioning from tiny tests to full-scale runs requires a deliberate process.
Typical progression:
- Functional test:
- 1 node, very small problem size, short wall time.
- Performance sanity check:
- 1 node, realistic problem size.
- Confirm no severe bottlenecks (I/O, CPU idle, missing vectorization/GPU usage).
- Scaling experiments:
- Vary nodes or GPUs:
- 1, 2, 4, 8 nodes, measuring runtime and efficiency (a submission sketch follows this list).
- Decide where scaling benefits flatten or reverse.
- Production plan:
- Choose problem size and number of nodes based on scaling results.
- Estimate wall time with margin (e.g. 20–30% buffer).
- Submit final production jobs.
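A minimal sketch of submitting such a scaling sweep, assuming the node count in scripts/run_strong_scaling.slurm can be overridden at submission time (sbatch options given on the command line take precedence over #SBATCH directives):
for nodes in 1 2 4 8; do
    sbatch --nodes="$nodes" \
           --job-name="strong_scaling_${nodes}n" \
           --output="logs/strong_scaling_${nodes}n_%j.out" \
           scripts/run_strong_scaling.slurm
done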
Keep all scripts and logs from test and scaling stages; they are useful for:
- Troubleshooting.
- Future users in your group.
- Methodology sections in publications.
Practical Patterns and Tips
A few small habits significantly improve your experience:
- Use a single entry script per run:
#!/bin/bash
# run_caseA.sh
set -euo pipefail
module purge
module load my_software_stack
CASE=caseA
WORKDIR=/scratch/$USER/my_project/$CASE
mkdir -p "$WORKDIR"
cd "$WORKDIR"
cp ~/my_project/inputs/$CASE/* .
srun ~/my_project/build/my_app config.in
cp -r . ~/my_project/results/$CASE
Then call this entry script from your job script.
- Record configuration:
- Copy job scripts, parameter files, and code version info into the results directory (a minimal sketch follows this list).
- Avoid running heavy commands on login nodes:
- For large preprocessing or postprocessing, use small interactive allocations.
- Read site docs:
- Partitions/queues, limits, site-specific launch wrappers, filesystem rules, and example scripts.
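A minimal sketch of recording that configuration from inside a run, assuming the script is already in its working directory as in the entry script above (the job-script path and metadata filename are illustrative):
cp ~/my_project/scripts/run_caseA.slurm .        # the job script that produced this run
{
    echo "Job ID:   ${SLURM_JOB_ID:-interactive}"
    echo "Date:     $(date)"
    echo "Code rev: $(git -C ~/my_project rev-parse HEAD 2>/dev/null || echo 'not a git repo')"
    module list 2>&1                             # module list prints to stderr
} > run_metadata.txt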
Putting It Together: A Minimal End-to-End Example
Imagine you want to run an MPI simulation on 4 nodes with 32 tasks per node.
- Compile on the login node:
module load gcc/13.2.0 openmpi/5.0.0
mkdir -p build && cd build
mpicxx ../code/main.cpp -O3 -o my_app
- Create a job script scripts/run_production.slurm:
#!/bin/bash
#SBATCH --job-name=prod_sim
#SBATCH --output=logs/prod_sim_%j.out
#SBATCH --error=logs/prod_sim_%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00
#SBATCH --partition=standard
#SBATCH --account=my_project
set -euo pipefail
module purge
module load gcc/13.2.0 openmpi/5.0.0
cd $SLURM_SUBMIT_DIR
# Run from scratch
RUN_DIR=/scratch/$USER/prod_run_${SLURM_JOB_ID}
mkdir -p "$RUN_DIR"
cp inputs/config_prod.in "$RUN_DIR"
cp build/my_app "$RUN_DIR"
cd "$RUN_DIR"
srun ./my_app config_prod.in
# Save results
mkdir -p $SLURM_SUBMIT_DIR/results/prod
cp -r . $SLURM_SUBMIT_DIR/results/prod/run_${SLURM_JOB_ID}
- Submit:
cd ~/my_project
sbatch scripts/run_production.slurm
- Monitor:
squeue -u $USER
tail -f logs/prod_sim_123456.out     # replace with actual job ID
- Inspect results in results/prod/run_<jobid>/ after completion.
This pattern—compile, stage data, request resources with a carefully written job script, monitor, and archive results—is the core of running applications on clusters in practice.