
5.4 Writing job scripts

Overview

On most HPC systems you do not run heavy computations interactively. Instead you describe what you want in a small text file called a job script. The scheduler reads this file, reserves the requested resources, sets up the environment, and runs your commands in batch mode.

This chapter focuses on how to write such scripts, with SLURM as the concrete example. Other batch systems look different, but the basic ideas are very similar.

Structure of a Typical Job Script

A job script is a shell script, usually written for bash. It contains three main parts: the first line selects the shell; the next block gives scheduler options in a special comment format; the remainder contains the commands that actually do the work, such as loading modules, setting variables, and launching your program.

A minimal SLURM script typically looks like this:

#!/bin/bash
#SBATCH --job-name=mytest
#SBATCH --time=00:10:00
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=slurm-%j.out
echo "Running on host $(hostname)"
echo "Job started at $(date)"
./my_program input.dat
echo "Job finished at $(date)"

The lines that begin with #SBATCH are not normal comments. They are interpreted by SLURM when you submit the script with sbatch, and they define the requested resources and some job metadata.

Shebang and Shell Choice

The very first line of the script is the shebang. It tells the system which interpreter should execute the script. On most clusters you will use:

#!/bin/bash

Some clusters recommend #!/usr/bin/env bash instead, which locates bash through the PATH and can be more portable. You should follow your site documentation, but the role of this line is always the same. It is not a scheduler directive. If you omit it, the script may run under an unexpected shell, which can break commands that depend on bash features.

SLURM Directives in Job Scripts

After the shebang, you specify job parameters as special comments. In SLURM, every directive starts with #SBATCH followed by an option. These correspond closely to the command-line options of sbatch, but embedding them in the script makes it self-contained and easier to reuse.

For example, the command:

sbatch --time=01:00:00 --ntasks=4 myscript.sh

is equivalent to placing the two lines:

#SBATCH --time=01:00:00
#SBATCH --ntasks=4

inside myscript.sh and calling simply:

sbatch myscript.sh

If both are used, explicit command-line options usually override those in the script. Site policies can also enforce or limit certain options.
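As a concrete sketch of the override behavior (the limits shown are illustrative, not from any particular site):

```shell
# Inside myscript.sh, the script requests one hour:
#SBATCH --time=01:00:00
#
# Submitting it with an explicit option on the command line:
#
#   sbatch --time=00:30:00 myscript.sh
#
# usually runs the job with the 30-minute limit: the command-line
# value takes precedence over the directive embedded in the script.
```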

Important rule: In SLURM job scripts, scheduler settings must be on lines that start with #SBATCH. A plain # comment or an option after a command will not be interpreted as a directive.

Naming and Describing the Job

Job names and user-visible identifiers help you keep track of many submitted jobs. You configure these with directives near the top of the script.

The most common options are:

--job-name sets a short human-readable name for the job. This appears in queue listings. For example,

#SBATCH --job-name=fft_benchmark

--output and --error control where standard output and standard error go. A common pattern is to use SLURM placeholders that expand when the job runs.

For example:

#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

Here %x expands to the job name and %j to the job ID. The logs directory must exist when the job starts; otherwise the job will fail to open its log files. Create it before submitting, for example with mkdir -p logs.

If you omit --error, SLURM combines standard error with standard output in the --output file by default. Many users explicitly separate them to simplify debugging.

You can also assign your job to an account or project if your site uses accounting. A typical directive is:

#SBATCH --account=project123

The exact account name, and whether one is required, are cluster-specific.

Requesting Time and Basic Resources

Every job script should specify how long the job may run and how many resources it needs; jobs that omit these fall back to site defaults. You express these as directives. If the job exceeds the requested time limit, the scheduler will terminate it.

The wall-clock time limit is set with --time, in days-hours:minutes:seconds or hours:minutes:seconds format. For example:

#SBATCH --time=02:30:00

This requests 2 hours and 30 minutes of wall-clock time. For very short tests, you could write:

#SBATCH --time=5:00

for 5 minutes. You should request a realistic upper bound, not an extreme overestimate.

Memory is usually requested with --mem, which sets the memory per node, or with --mem-per-cpu, which sets it per allocated CPU. For example:

#SBATCH --mem=32G

requests 32 gigabytes per node. Alternatively:

#SBATCH --mem-per-cpu=2G

requests 2 gigabytes for every CPU allocated. Your site documentation will clarify which option is recommended.

Partition or queue selection uses --partition. For example:

#SBATCH --partition=standard

Some clusters use names like short, long, gpu, or debug. You must choose one that matches your job type and your access permissions.

Important rule: If your job exceeds its requested wall time or memory limit, the scheduler will usually stop it without warning. Always set --time and memory directives conservatively but not excessively.

CPUs, Tasks, and Nodes in Scripts

A central part of any job script is describing how much compute capacity your application needs and how it will use that capacity. In SLURM there is a basic distinction between tasks, CPUs per task, and nodes.

--ntasks is the number of independent tasks. For MPI applications, this is typically the number of processes you will launch. For example:

#SBATCH --ntasks=16

--cpus-per-task is the number of CPU cores that each task will use. For hybrid MPI plus OpenMP jobs, this defines the number of threads per rank. For example:

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8

requests 4 tasks and gives each task 8 cores, for a total of 32 cores. Often you also want --nodes to specify how many physical nodes to spread these tasks across, for example:

#SBATCH --nodes=2

Some sites recommend describing the resource layout more explicitly with --ntasks-per-node or --ntasks-per-socket. For example:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

Here the scheduler will allocate 2 nodes with 8 tasks on each. The best combination depends on the hardware and on how your application scales.

You should not confuse tasks with threads in shared-memory models. Threads are configured inside your program, often through environment variables, while tasks are defined by the scheduler. The interaction between the two is covered in hybrid programming material, but you already control the task side of it in the job script.
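A minimal sketch of this split, runnable in the script body, reports both sides: the task count comes from the scheduler, while the thread count is something you set. The fallbacks to 1 are for running the snippet outside a job, where the SLURM variables are unset:

```shell
#!/bin/bash
# Tasks are defined by the scheduler; threads are configured here.
# Outside a SLURM job these variables are unset, so default to 1.
NTASKS="${SLURM_NTASKS:-1}"
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Tasks: $NTASKS, threads per task: $OMP_NUM_THREADS"
```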

GPU and Accelerator Requests in Scripts

If your program uses GPUs or other accelerators, the job script must request them explicitly. Otherwise the scheduler may place your job on nodes without GPUs, or it may deny access to the accelerators.

A typical directive for one GPU per node looks like:

#SBATCH --gres=gpu:1

Here --gres stands for generic resource. The gpu name may be refined on some systems, such as gpu:tesla:2 for two GPUs of a specific type. The exact syntax is cluster-specific.

Many GPU partitions require both a GPU request and a constraint or partition that selects GPU nodes. For example:

#SBATCH --partition=gpu
#SBATCH --gres=gpu:4

You then configure how your code uses those GPUs within the script body, for example by setting environment variables or passing flags to your program. The scheduler only ensures that the requested number of GPUs is available to the job.
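One hedged way to confirm the allocation from inside the script body is to print the GPU-related environment. On many systems SLURM sets CUDA_VISIBLE_DEVICES for jobs that request GPUs; the fallback below covers runs outside a job, where the variable is unset:

```shell
#!/bin/bash
# Report which GPU indices the job can see. SLURM commonly exports
# CUDA_VISIBLE_DEVICES for --gres=gpu allocations; when it is unset
# (e.g. outside a job, or on a non-GPU node) we print "none".
echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
```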

Working Directory and File Paths

When the scheduler starts your script on the compute node, the current working directory may not be what you expect. Some clusters start in your home directory, others in the directory from which you ran sbatch. To avoid surprises, most users move to the desired directory explicitly.

A typical pattern is:

cd "$SLURM_SUBMIT_DIR"

The environment variable SLURM_SUBMIT_DIR holds the directory where you launched sbatch. By changing into it at the start of your script, you ensure that relative paths behave as they did when you tested the commands interactively.

If you want output to go to a particular data or scratch directory, you can change to that instead. For example:

cd /scratch/$USER/myproject

and then call programs with relative names. This approach keeps heavy I/O off your home directory, which many site policies recommend.

You should be consistent about relative versus absolute paths inside job scripts, especially for input data and output locations. A script that assumes the wrong working directory can fail in subtle ways, such as overwriting files in an unintended location.
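A small guard can turn such a subtle failure into a loud one. In this sketch, require_file is a hypothetical helper name; calling it before the run command makes a wrong working directory fail immediately instead of mid-run:

```shell
#!/bin/bash
# Hypothetical helper: verify that a required file exists in the
# current working directory before launching the real computation.
require_file() {
    if [ ! -f "$1" ]; then
        echo "Error: required file $1 not found in $(pwd)" >&2
        return 1
    fi
}

# Typical use near the top of the script body:
#   cd "$SLURM_SUBMIT_DIR"
#   require_file input.dat || exit 1
```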

Loading Software and Setting the Environment

Once the scheduler has allocated resources, the script must set up the software environment before starting your application. Most HPC clusters use a module system. Within the script, you load modules exactly as you would on the command line, but you must remember to put those commands in the script itself. The batch environment does not automatically inherit everything from your interactive session.

A typical pattern is:

module purge
module load gcc/12.2
module load openmpi/4.1
module load fftw/3.3

module purge clears any existing modules, so the script starts with a clean environment. The subsequent module load lines ensure that the correct compiler, MPI library, and numerical libraries are in place.

Environment variables used by your application should also be set in the script. For example:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

or:

export MYAPP_CONFIG=/path/to/config.yaml

Do not rely on interactive shell startup files such as .bashrc to provide needed exports. Non-interactive batch shells may not read them, or the cluster may configure them differently. Self-contained job scripts are easier to reproduce and debug.

Running Parallel Applications in Scripts

The core purpose of a job script is to launch your program on the allocated resources. For parallel applications the way you invoke the executable must match how you requested resources. SLURM provides helper commands that automatically use the allocated tasks and nodes.

For many MPI-based programs, the recommended command is srun. For example:

srun ./my_mpi_program input.dat

Here srun starts one MPI rank per task by default, spread across the nodes assigned to the job. This associates the application cleanly with the job allocation.

Some sites recommend mpirun or mpiexec instead, depending on the MPI implementation. You can still express the number of processes using SLURM variables, but using srun often avoids mismatches.

For shared memory parallel programs that use threads, you usually run the executable directly, but you must ensure the environment is consistent with --cpus-per-task. A simple pattern is:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program

For hybrid MPI plus threads, a common structure is:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_program

The scheduler manages the tasks, cores, and sometimes GPU bindings. The application manages threads inside each rank according to the environment variables.

Using SLURM Environment Variables

When SLURM starts your job, it defines a variety of environment variables that you can use to make your script more robust and informative. These variables are read like any other shell variable.

Some frequently useful ones are:

$SLURM_JOB_ID holds the numerical ID of the job. It is useful in messages and filenames.

$SLURM_JOB_NAME holds the job name from --job-name.

$SLURM_NTASKS gives the total number of tasks allocated.

$SLURM_CPUS_PER_TASK contains the number of CPUs per task.

$SLURM_JOB_NODELIST shows which nodes are assigned.

You might use them as follows:

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on:"
echo "$SLURM_JOB_NODELIST"
echo "Tasks: $SLURM_NTASKS, CPUs per task: $SLURM_CPUS_PER_TASK"

These diagnostics are often the first thing to add when you are debugging a new job script. They confirm that the scheduler understood your requests and that the environment matches your expectations.

Organizing Output and Logging in Scripts

A good job script does not only run the program; it also records enough information to understand later what happened. Structured logging helps you distinguish between runs, reproduce results, and debug problems.

You already saw how to direct standard output and error to files with --output and --error. Inside the script, adding timestamps and configuration summaries often pays off.

For example:

echo "Job started at $(date)"
echo "Running on $SLURM_JOB_NODELIST"
echo "Git commit: $(git rev-parse HEAD 2>/dev/null || echo unknown)"
echo "Command line: $0 $*"

You can also place simulation outputs in job-specific subdirectories. A common pattern is:

RUN_DIR="run_${SLURM_JOB_ID}"
mkdir -p "$RUN_DIR"
cd "$RUN_DIR"
srun ../my_simulation > stdout.log 2>stderr.log

Here, each submitted job writes into its own directory named by the job ID, which reduces the risk of overwriting previous results. If you combine this with log file names that include %j, you get both file-level and directory-level separation.

Simple Serial Job Script Example

To make the structure concrete, here is a small example of a purely serial job, suitable for testing.

#!/bin/bash
#SBATCH --job-name=serial_test
#SBATCH --time=00:05:00
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=logs/%x-%j.out
module purge
module load gcc/12.2
cd "$SLURM_SUBMIT_DIR"
echo "Job $SLURM_JOB_ID started at $(date) on $(hostname)"
./serial_program input.txt > result.txt
echo "Job $SLURM_JOB_ID finished at $(date)"

This requests one task on one core, with a small amount of memory and a short time limit. It loads a compiler module, changes to the submission directory, and runs the program. The output goes into result.txt, while diagnostic messages and echo statements go into the SLURM output file in the logs directory.

Simple MPI Job Script Example

A basic MPI job script extends this by requesting multiple tasks and using srun.

#!/bin/bash
#SBATCH --job-name=mpi_pi
#SBATCH --time=00:10:00
#SBATCH --partition=standard
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=1G
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
module purge
module load gcc/12.2
module load openmpi/4.1
cd "$SLURM_SUBMIT_DIR"
echo "Job $SLURM_JOB_ID started at $(date)"
echo "Nodes: $SLURM_JOB_NODELIST"
echo "Tasks: $SLURM_NTASKS"
srun ./mpi_pi_solver --iterations 100000000
echo "Job $SLURM_JOB_ID finished at $(date)"

This script gives the scheduler enough information to allocate 2 nodes with 8 tasks each, sets up the MPI environment, and launches the application cleanly across all tasks.

Common Pitfalls When Writing Job Scripts

Several mistakes occur frequently when users start creating job scripts. Being aware of them helps you avoid frustrating debugging sessions.

One common issue is forgetting to load necessary modules or libraries inside the script. The interactive environment where you compiled and tested may not be the same as the batch environment. The fix is to add module load lines explicitly.

Another frequent error is mismatching resource requests and the actual run command. For instance, asking for many tasks in the directives but running a serial program without srun, or vice versa, can lead to idle cores or unbalanced runs.

Incorrect working directories also cause silent failures. Scripts that assume they start in a particular path might not find input files. Adding cd $SLURM_SUBMIT_DIR or an explicit cd to your data directory solves this.

Users also sometimes omit time or memory requests, leaving them at site defaults that may be too small. When the job hits those limits, it is killed without completing, which can look like a program crash. Writing realistic --time and memory directives avoids this problem.

Finally, mixing #SBATCH directives with other content on the same line is incorrect. Each directive must be on its own line, and those lines must come before the first non-comment command for predictable behavior.
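As an illustrative (non-runnable) sketch of this rule:

```shell
# Wrong: a directive appended after a command is treated as an
# ordinary comment, not a scheduler option.
#   ./my_program input.dat  #SBATCH --time=01:00:00
#
# Wrong: a directive placed after the first non-comment command
# may be silently ignored.
#   module load gcc/12.2
#   #SBATCH --time=01:00:00
#
# Right: all #SBATCH lines sit together, directly after the shebang
# and before any other command.
```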

From Prototype Script to Reusable Template

As you gain experience, it is useful to turn a working prototype job script into a reusable template. This often means parameterizing a few settings, such as the number of tasks or the input dataset, and keeping the rest fixed.

A simple approach uses shell variables near the top of the script:

NTASKS=16
TIME="01:00:00"
INPUT="case1.dat"

You cannot use shell variables directly in #SBATCH lines, because the scheduler parses those lines before the shell ever runs. The directives themselves must therefore stay literal:

#SBATCH --time=01:00:00
#SBATCH --ntasks=16

You then change the directives manually when you need a different configuration. If you want fully parameterized submission, you can generate job scripts from a higher-level driver script, but that is a separate topic.

Within a single script, you can still use variables to switch between executable names, data files, and output directories. A small set of stable templates for serial, MPI, GPU, and hybrid jobs, adapted to your cluster, can significantly reduce the time you spend writing new scripts.
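A hedged sketch of such a template body, where EXE, INPUT, and the directory layout are illustrative names and the fallback to "local" covers runs outside a SLURM job:

```shell
#!/bin/bash
# Illustrative template body: variables near the top select the
# executable, the input case, and a per-job output directory, so the
# same script structure covers several configurations.
EXE="./my_program"
INPUT="case1.dat"
OUTDIR="results/${SLURM_JOB_ID:-local}"

mkdir -p "$OUTDIR"
# In a real script this line would be the actual run command,
# e.g.: srun "$EXE" "$INPUT" > "$OUTDIR/stdout.log"
echo "Would run: $EXE $INPUT (output in $OUTDIR)"
```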

Summary

Writing job scripts is about codifying, in a reproducible text file, how your program should run on the cluster. You specify job metadata, wall time, memory, CPUs, tasks, nodes, and optional accelerators through scheduler directives, then set up the environment and launch your application with the appropriate run command. A clear structure, explicit module loads, controlled working directories, and informative logging make scripts easier to maintain and debug.

With these elements in place, you can reliably move from interactive experimentation to large scale batch runs that use the HPC system effectively.
