
Submitting jobs

Understanding job submission in practice

Submitting a job is the moment when your prepared work, resources, and script are turned into a task managed by the scheduler. In an HPC environment, you almost never run heavy computations directly on a login node. Instead, you describe what you need in a job script, ask the scheduler to run it, and let the system decide when and where it runs. This chapter focuses on how to perform that submission step, what happens when you do, and how to avoid common mistakes specific to the submission process.

Where job submission happens

Job submission is done from a login node or an interactive front end. You will typically be in a shell session, inside some project directory that contains your code, input files, and a job script. The scheduler command line tool, such as sbatch on SLURM systems, is available there. You invoke it once per batch job, and the scheduler returns a job ID that uniquely identifies your submitted work.

It is important to submit jobs from a location in the filesystem that is accessible to the compute nodes that will execute the job. Usually, this means a shared home or project directory, not a login node's local scratch directory that the rest of the cluster cannot see. If your job starts but cannot find its input files because they sit in a non-shared location, it will fail even though the submission itself succeeded.

Basic submission with SLURM

On a SLURM based system, the standard way to submit a batch job is with the sbatch command. The simplest usage looks like:

sbatch my_job.slurm

Here my_job.slurm is a text file that contains both SLURM directives and the shell commands that run your application. When you run sbatch, SLURM parses the directives, validates your request, registers the job in the queue, and returns a line with the numeric job ID.
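On most systems, a successful submission prints a single confirmation line containing the new job ID, for example (the number shown here is only an illustration):

sbatch my_job.slurm
Submitted batch job 123456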

Key rule for basic submission:
Run sbatch job_script from a login node, make sure the script begins with an interpreter line, and always note the job ID that is printed.
This job ID is required for monitoring, modifying, and cancelling the job.

The script must begin with an interpreter line, for example #!/bin/bash, so that SLURM knows how to run it. The file's executable bit is not required, because sbatch reads the script rather than executing it directly. Many centers recommend always including the interpreter line explicitly and never relying on the executable bit.

Combining script directives and command line options

When you submit a job, you can specify scheduler options in two places. One is inside the job script, as directives that start with #SBATCH. The other is on the sbatch command line itself. SLURM merges these sources to build the final resource request.

For example, your script may contain:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
srun ./my_program

You can still adjust some options at submission time:

sbatch --job-name=test_run my_job.slurm

In this case, the time limit and number of tasks come from within the script, and the job name comes from the command line. If the same option is specified both in the script and on the command line, SLURM uses the value from the command line for that submission.

This combination mechanism allows you to maintain a general job script and override only a few parameters at submission. For example, you might experiment with different wall times or node counts without editing the script each time.
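For example, to reuse the script above with a longer wall time for one particular submission, you could override the time directive from the command line; the directive inside the script stays unchanged for future runs:

sbatch --time=02:00:00 my_job.slurm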

Job environments at submission time

When you submit a job, SLURM creates a job environment that is based on, but not identical to, your current login session. By default, sbatch propagates the environment variables of the submitting shell, such as HOME and USER, although sites can change this behavior and options such as --export let you control it explicitly. Anything set in the job script itself only exists once the job is running on the compute node.

It is common to expect that modules loaded or environment variables set in your interactive session will automatically apply to the job. This is not always the case. Many centers recommend that you explicitly load modules and set environment variables inside the job script rather than relying on what is active while submitting.

For instance, instead of doing:

module load my_software
sbatch my_job.slurm

and assuming that my_software is loaded inside the job, you should place the module load command in my_job.slurm. Submission then becomes independent of your current interactive environment and more reproducible.
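A self-contained version of the earlier script, using the same placeholder names, might look like this:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
# Set up the environment inside the job, not in the submitting shell
module load my_software
srun ./my_program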

Output, error, and working directory

Part of submitting a job is implicitly defining where outputs and logs will go. By default, SLURM uses the directory from which you run sbatch as the job's working directory and writes standard output and standard error to a file named slurm-<jobID>.out in that directory, unless you override this behavior with directives.

For example, if you submit from /home/user/project, then relative paths in the job script such as ./input.dat and ./results.out are interpreted relative to that directory when the job runs. This means that moving into the correct directory before calling sbatch is an important part of the submission process.

It is also common to control the names of the output and error files at submission time rather than editing the script. If the script already contains directives for output file naming, you can still override them with equivalent command line options to sbatch. This is useful when you run multiple parameter variations from the same script and want each run to have uniquely named logs without modifying the script itself.
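As a sketch, the script could name its log files with SLURM's %j placeholder, which expands to the job ID:

#SBATCH --output=result_%j.out
#SBATCH --error=result_%j.err

and an individual run can still override those names at submission time (case_A is just an illustrative label):

sbatch --output=case_A_%j.out --error=case_A_%j.err my_job.slurm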

Interactive job submission

Not all work fits neatly into batch scripts. For debugging, quick tests, or exploratory analysis, interactive jobs are useful. Instead of describing commands in a file, you submit a request to obtain a shell on a compute node, and then you work inside that shell.

On SLURM systems, this is often done with srun or salloc. A common pattern is:

salloc --time=00:30:00 --ntasks=1 --mem=4G

Once the allocation is granted, you get a shell that is part of the job; depending on the site configuration, the prompt may sit on the first allocated compute node or on the login node with the allocation active. Everything you run in that shell belongs to the allocated job until you exit it. Although this feels interactive, it is still a form of job submission, and the same scheduler policies apply, including queueing, accounts, and limits.

Some sites instead document srun invocations that spawn an interactive shell directly, or provide wrappers for interactive and graphical sessions. In all cases, you still submit a job, but instead of running a script, you get a command line environment on the compute nodes.
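Where the site permits it, one common form asks srun for a pseudo-terminal; exiting the shell ends the job and releases the allocation:

srun --time=00:30:00 --ntasks=1 --mem=4G --pty bash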

Submitting array jobs

When you must run the same program many times with small variations, array jobs provide a structured way to submit them as a single logical job. Instead of issuing dozens or hundreds of separate sbatch commands, you submit one job array that contains many subjobs, each with a different index.

On SLURM, this is done with an array specification. You can place it in the script:

#SBATCH --array=0-9

or provide it at submission time:

sbatch --array=0-9 my_array_job.slurm

When you submit such a job, SLURM creates one job ID for the array and a subjob for each array index. Inside the running job, you can query the index using a SLURM environment variable, for example SLURM_ARRAY_TASK_ID. Submission of arrays is particularly useful when you must perform parameter sweeps, process many files, or run Monte Carlo simulations.
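As a brief sketch, a hypothetical array script might use this variable to pick one input file per subjob (the input file naming scheme is only an example):

#!/bin/bash
#SBATCH --array=0-9
# Each subjob processes the file matching its own array index
srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat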

Array submission also affects how you monitor and manage jobs. You receive a job ID with an array notation, and each subtask can be referenced individually if needed. Although this management aspect belongs to job monitoring and control, it is important at submission time to understand that a single command can generate many related job instances.

Handling submission errors and rejections

Not every attempt to submit a job succeeds. The scheduler may reject requests that violate site policies, such as requesting more memory or wall time than allowed, or using an account or partition that you are not authorized to access. Submission failures typically appear as immediate error messages from sbatch, and no job ID is assigned.

When this happens, the first step is to read the error message carefully in the terminal. Common causes include unknown partitions, invalid account names, syntax errors in directives, or malformed options on the command line. Correcting the script or the command and resubmitting is usually sufficient.

A subtler case occurs when the submission technically succeeds, so a job ID is assigned, but the job immediately fails during startup. In that case, the error will not appear at submission time, but in the job’s output or error files. This is not a submission failure but a runtime failure on the compute node. However, it often arises from mistakes made during submission, such as incorrect working directories or missing input files. Repeatedly checking the return message from sbatch and the first lines of the job’s output file after submission helps detect such problems early.

Submission troubleshooting rule:
If there is no job ID printed, the scheduler did not accept your job.
If there is a job ID but the job vanishes or fails immediately, inspect the job’s output and error files to identify problems that arise at launch time.

Submitting jobs under different configurations

A single job script can often be reused under several configurations by altering only submission time parameters. For instance, you may want to run the same program on different partitions, with different numbers of tasks, or with different QoS settings. Instead of maintaining multiple nearly identical scripts, you can use a base script with minimal fixed directives and then adjust the rest via sbatch.

For example, you could maintain a script that defines only the executable and basic environment setup. When you submit, you can select the partition, the job name, and the time limit:

sbatch --partition=short --time=00:30:00 --job-name=short_test base_job.slurm
sbatch --partition=long  --time=12:00:00 --job-name=overnight  base_job.slurm

This technique keeps your job management simpler and encourages clear separation between the computational work described in the script and the resource choices made at submission time. It also helps avoid mistakes where outdated resource requests remain in scripts and cause jobs to be queued or rejected unexpectedly.

Site specific submission policies

Every HPC center overlays its own policies on top of the scheduler. At submission time, these appear as required options or implicit defaults that affect how your job is handled. Examples include mandatory project or account codes, maximum job sizes, required partitions for certain workloads, or job submission limits per user.

Often, you satisfy these rules by adding cluster specific options to your sbatch commands or directives to your scripts. Some centers provide wrapper commands that internally call sbatch with predefined options, or they may require you to specify an account with --account on every submission.

Before you automate job submission or write large numbers of scripts, it is important to consult your site documentation or example job scripts. Adhering to local policies at submission time reduces unexpected rejections, prevents you from blocking shared resources with oversized jobs, and keeps your usage within fair share limits.

Automating repeated submissions

Once you are comfortable with basic submission, it becomes natural to automate sequences of sbatch commands. While the design of such automation and workflow systems is discussed elsewhere, there are a few submission specific points to keep in mind.

If you loop over sbatch calls in a shell script to submit many jobs, each call returns a job ID. Capturing these IDs in variables or log files is valuable, because you may later need them for monitoring or for expressing dependencies between jobs. Some users parse the output of sbatch programmatically to build arrays of job IDs.
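A minimal sketch of such a loop, where the run names are placeholders, uses sbatch's --parsable option so that only the job ID is printed and can be captured directly:

for run in run_a run_b run_c; do
    # --parsable prints just the job ID (plus a cluster name on multi-cluster setups)
    jobid=$(sbatch --parsable --job-name="$run" my_job.slurm)
    echo "$run $jobid" >> submitted_jobs.log
done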

Another common extension is to use sbatch with dependency options. Although dependency logic is a distinct topic, the submission perspective involves chaining jobs so that one job begins only after another completes successfully. This transforms job submission into a directed sequence rather than a set of independent jobs submitted all at once.
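A sketch of that pattern, with step1.slurm and step2.slurm as hypothetical scripts, submits the second job with an afterok dependency so that it starts only if the first job completes successfully:

first=$(sbatch --parsable step1.slurm)
sbatch --dependency=afterok:$first step2.slurm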

Automation does not change the basic mechanics of submission. Every job still enters the scheduler with explicit resource requests, queue placement, and a job ID. However, when you submit many jobs automatically, small mistakes in scripts or sbatch options can propagate widely, so testing your automation with a small number of jobs first is considered good practice.

Summary of the submission workflow

Submitting jobs is the bridge between an interactive login session and the managed execution of work on the cluster. You place your commands and resource requests into a script, invoke sbatch or a similar tool from a suitable directory, and receive a job ID in return. You can refine behavior at submission time using command line options, interactive requests, and job arrays, while being mindful of environment handling, output locations, local policies, and possible errors.

Once a job is successfully submitted, it transitions into the scheduler’s queueing and execution processes, which are addressed in other chapters. The correctness and efficiency of that later execution depend heavily on how carefully you prepare and submit your jobs at this initial stage.
