
Cancelling and modifying jobs

Why Cancelling and Modifying Jobs Matters

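On a shared HPC cluster, you will routinely need to cancel jobs that are misconfigured or no longer needed, and to adjust jobs you have already submitted.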

Doing this correctly avoids wasting resources, improves your own turnaround time, and is considered good cluster citizenship.

This chapter assumes you are using SLURM, as introduced earlier. Other schedulers have similar ideas but different commands.

Cancelling Jobs

Cancelling a Single Job

The basic command is:

scancel <job_id>

Typical workflow:

  1. Find your job ID:
   squeue -u $USER
  2. Cancel it:
   scancel 123456

If the job is pending or running, scancel ends it and SLURM records its final state as CANCELLED. Any allocated resources are released.
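The two-step workflow above can be wrapped in a small shell function. This is a minimal sketch: `cancel_job` is a hypothetical helper name, while `squeue` and `scancel` are the real SLURM commands. It assumes `$USER` is set and that you only want to touch jobs you own.

```bash
# cancel_job: cancel one job after confirming it appears in your own queue.
# Hypothetical helper; adapt the checks to your site's conventions.
cancel_job() {
    job_id="$1"
    # Accept plain numeric IDs only, to catch typos before calling scancel.
    case "$job_id" in
        ''|*[!0-9]*) echo "not a numeric job ID: $job_id" >&2; return 2 ;;
    esac
    # -h drops the header line; empty output means the job is not in your queue.
    if [ -z "$(squeue -h -u "$USER" -j "$job_id" 2>/dev/null)" ]; then
        echo "job $job_id not found in your queue" >&2
        return 1
    fi
    scancel "$job_id"
}
```

Usage: `cancel_job 123456`. The lookup step is optional, but it turns a mistyped ID into a clear message instead of a silent no-op.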

Cancelling Multiple Jobs at Once

Common patterns:

  scancel -u $USER          # all of your jobs
  scancel 12345 12346 12347 # several specific jobs
  scancel 20000             # whole array
  scancel 20000_3           # specific task in the array
  scancel '20000_[1-5]'     # range of tasks (quoted so the shell does not expand the brackets)

Use with care, especially scancel -u $USER, as it can kill long-running, important work.
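One way to reduce the risk of a bulk cancel is to list what would be affected first. The sketch below assumes `$USER` is set; `list_my_jobs` and `cancel_all_mine` are hypothetical helper names, and the `--yes` switch is a convention of this example, not a SLURM option.

```bash
# list_my_jobs: print "JOBID NAME STATE" for every job you own.
list_my_jobs() {
    squeue -h -u "$USER" -o "%i %j %T"
}

# cancel_all_mine: show what would be cancelled; only cancel when the
# caller passes --yes (a flag of this helper, not of scancel).
cancel_all_mine() {
    listing="$(list_my_jobs)"
    if [ -z "$listing" ]; then
        echo "no jobs to cancel"
        return 0
    fi
    echo "$listing"
    if [ "$1" = "--yes" ]; then
        scancel -u "$USER"
    else
        echo "dry run: re-run with --yes to cancel the jobs listed above"
    fi
}
```

Running `cancel_all_mine` prints the list only; `cancel_all_mine --yes` prints it and then cancels.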

Cancelling by Job State, Partition, or Name

You can be more selective:

  scancel -u $USER --state=PENDING   # only your pending jobs
  scancel -u $USER -p debug          # only your jobs in the debug partition
  scancel -u $USER --name=mytest     # only your jobs named mytest

Combine filters to precisely target what you want to cancel.
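As an illustration of combining filters, the sketch below cancels only your pending jobs with a given name in a given partition. The function name `cancel_pending_named` is hypothetical; the `scancel` options are the long forms of the ones shown above.

```bash
# cancel_pending_named PARTITION NAME
# Cancels jobs that match ALL of: owned by you, pending, in PARTITION, named NAME.
cancel_pending_named() {
    partition="$1"
    name="$2"
    scancel -u "$USER" --state=PENDING --partition="$partition" --name="$name"
}
```

For example, `cancel_pending_named debug mytest` leaves running jobs and jobs in other partitions untouched.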

What Happens When a Job Is Cancelled

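Important behaviors: the job's processes are signalled to terminate (SIGTERM, followed by SIGKILL after a site-configured grace period), files already written to disk are kept, and any results held only in memory are lost.

This loss of unsaved state is one reason to implement checkpoint/restart in long jobs when possible.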
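A job script can try to react to a termination signal by saving state before it exits. The following is a sketch under assumptions: `save_checkpoint` is a placeholder for whatever checkpoint mechanism your application provides, `CHECKPOINT_FILE` is a hypothetical variable, and whether the batch shell itself receives SIGTERM (and how much time it gets) depends on the SLURM configuration and on options such as sbatch's `--signal`.

```bash
# Placeholder checkpoint location; replace with your application's own.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.marker}"

# Placeholder hook: a real job would ask the application to write its state.
save_checkpoint() {
    echo "checkpoint requested at $(date +%s)" > "$CHECKPOINT_FILE"
}

# Run the workload in the background and wait for it, so that the shell
# can handle SIGTERM immediately instead of after the workload finishes.
run_with_checkpoint_trap() {
    "$@" &
    child_pid=$!
    trap 'save_checkpoint; kill -TERM "$child_pid" 2>/dev/null' TERM
    wait "$child_pid"
}
```

In a job script this might be used as `run_with_checkpoint_trap ./my_simulation`, where `./my_simulation` stands in for your own program. Verify on your cluster that the signal actually reaches the script before relying on it.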

Cancel vs. Other Forms of Stopping Work

Use scancel for anything under scheduler control.

Modifying Jobs: General Concepts

Once a job is submitted, there are limits to what you can change. Pending jobs are generally more flexible than running ones.

Cluster policy determines what is allowed; commands below may fail with “update is not permitted” depending on configuration.

Modifying Pending Jobs with `scontrol update`

The generic mechanism in SLURM is scontrol:

scontrol show job <job_id>      # inspect details
scontrol update JobId=<job_id> <Field>=<Value> ...
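Before updating a job it often helps to read back a single field from `scontrol show job`, which prints space-separated `Key=Value` pairs. The helper below is a minimal sketch: `job_field` is a hypothetical name, and it only handles values that contain no spaces.

```bash
# job_field JOB_ID FIELD
# Print the value of one Key=Value pair from "scontrol show job".
# Limitation: values containing spaces are truncated at the first space.
job_field() {
    scontrol show job "$1" | tr ' ' '\n' | sed -n "s/^$2=//p" | head -n 1
}
```

For example, `job_field 123456 JobState` or `job_field 123456 TimeLimit`. An empty result means the field was not found.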

Common use cases for pending jobs:

Changing Job Name

scontrol update JobId=123456 Name=new_name

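Why this helps: a descriptive name makes the job easier to spot in squeue output and to target with name-based filters such as scancel --name.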

Adjusting Start Time / Hold Status

  scontrol hold 123456      # keep a pending job from starting
  scontrol release 123456   # make a held job eligible to start again

You can also adjust when a job becomes eligible:

scontrol update JobId=123456 StartTime=2025-12-12T23:00:00

Format requirements may vary; check your site’s documentation.
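To avoid typing timestamps by hand, the start time can be generated with `date`. This sketch assumes GNU `date` (its `-d` option parses strings such as "2025-12-12 23:00" or "tomorrow 02:00"); `defer_job` is a hypothetical helper name.

```bash
# defer_job JOB_ID WHEN
# Set the earliest start time of a pending job.
# Assumes GNU date for parsing WHEN.
defer_job() {
    job_id="$1"
    when="$2"
    start="$(date -d "$when" +%Y-%m-%dT%H:%M:%S)" || return 1
    scontrol update JobId="$job_id" StartTime="$start"
}
```

Usage: `defer_job 123456 "2025-12-12 23:00"`. If `date` cannot parse the string, the function returns without calling scontrol.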

Modifying Time Limit (If Policy Allows)

On some systems, you may change the time limit of a pending job. Reducing it is generally permitted; increasing it usually requires administrator privileges:

scontrol update JobId=123456 TimeLimit=02:00:00

Note: if the command fails, your site may simply not allow changes after submission.
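Because the update may be rejected, a script should check the exit status of scontrol instead of assuming success. A minimal sketch, with `set_time_limit` as a hypothetical helper name:

```bash
# set_time_limit JOB_ID LIMIT
# Try to change a job's time limit and report whether SLURM accepted it.
set_time_limit() {
    job_id="$1"
    new_limit="$2"
    if scontrol update JobId="$job_id" TimeLimit="$new_limit"; then
        echo "job $job_id: TimeLimit now $new_limit"
    else
        echo "job $job_id: update rejected; contact site support or resubmit" >&2
        return 1
    fi
}
```

Usage: `set_time_limit 123456 00:30:00`.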

Modifying Running Jobs

Sites vary widely in what they allow.

Extending Time on a Running Job

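If allowed, you can sometimes request a time extension. On many sites only administrators can raise a limit, so this often means asking the support team to run the command for you: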

scontrol update JobId=123456 TimeLimit=03:00:00

Constraints on extensions are set by site policy. If an extension is denied, your job will end at the original limit; plan checkpointing accordingly.

Job Requeueing

Instead of cancelling, you can requeue a job:

scontrol requeue 123456

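Behavior: the job is stopped if it is running and returned to the pending state under the same job ID; when it starts again, the batch script runs from the beginning.

This is useful when a job hit a transient problem, such as a faulty node, and you want it to run again without resubmitting.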

Some clusters allow automatic requeue on node failure via submission options; consult the SLURM chapter or site docs.
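A requeued job restarts its script from the top, so the script needs a way to tell a first run from a restart. SLURM sets the environment variable `SLURM_RESTART_COUNT` for jobs that have been restarted or requeued. The sketch below uses it; `start_mode` and the `./my_app` commands in the comment are hypothetical, and `#SBATCH --requeue` only has an effect if your site permits requeueing.

```bash
#!/bin/bash
#SBATCH --requeue    # mark the job as eligible for requeueing, if the site allows it

# start_mode: print "resume" when this job has been restarted at least once,
# otherwise "fresh".
start_mode() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "resume"
    else
        echo "fresh"
    fi
}

# Possible use later in the job script (application names are placeholders):
#   if [ "$(start_mode)" = "resume" ]; then ./my_app --restart; else ./my_app; fi
```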

Interactive Sessions

If you started an interactive job (for example with salloc or srun --pty):

  exit                # from the interactive shell
  scancel <job_id>    # or cancel the session's job from another shell

Interactive sessions are easy to forget; always close them to avoid wasting allocation time.
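Inside an allocation, SLURM exports the job ID as `SLURM_JOB_ID`, so a session can release itself without looking the ID up. A minimal sketch, with `release_current_allocation` as a hypothetical helper name:

```bash
# release_current_allocation: cancel the job this shell is running in.
# Does nothing (and returns 1) when not inside a SLURM allocation.
release_current_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not inside a SLURM allocation" >&2
        return 1
    fi
    scancel "$SLURM_JOB_ID"
}
```

To find sessions you may have forgotten, `squeue -u $USER` from a login node lists every job you still hold.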

Typical Workflows for Cancelling and Modifying

Correcting a Misconfigured Job

  1. Notice the problem (wrong script, wrong data path, etc.).
  2. Cancel:
   scancel 123456
  3. Fix the job script.
  4. Resubmit with sbatch.

Trying to “patch” a misconfigured job in place is usually more fragile than cancelling and resubmitting.
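The cancel-and-resubmit steps can be scripted once the job script has been fixed. This sketch uses sbatch's `--parsable` option, which prints just the job ID (optionally followed by `;cluster`). The function name `resubmit` is hypothetical.

```bash
# resubmit OLD_JOB_ID SCRIPT
# Cancel the old job, submit the (already corrected) script, print the new ID.
resubmit() {
    old_id="$1"
    script="$2"
    [ -f "$script" ] || { echo "no such script: $script" >&2; return 1; }
    scancel "$old_id" || return 1
    new_id="$(sbatch --parsable "$script")" || return 1
    # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
    echo "${new_id%%;*}"
}
```

Usage: `resubmit 123456 job.sh`, where `job.sh` stands for your own batch script.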

Reducing Queue Load from Experiments

When exploring parameters:

  1. Submit multiple candidate jobs.
  2. Monitor which ones you actually need.
  3. Cancel unnecessary ones promptly:
   scancel 20001 20002

Or, if you realize an entire batch of tests is pointless:

scancel -u $USER --name=experimentA
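When only a few candidates are worth keeping, it can be easier to name the keepers than the jobs to cancel. The helper below is a sketch; `cancel_all_but` is a hypothetical name, and the job IDs are passed in explicitly, so nothing outside the list is touched.

```bash
# cancel_all_but "KEEP_IDS" JOB_ID...
# Cancel every listed job whose ID is not in the space-separated KEEP_IDS.
cancel_all_but() {
    keep="$1"
    shift
    for id in "$@"; do
        case " $keep " in
            *" $id "*) ;;              # keeper: leave it alone
            *) scancel "$id" ;;
        esac
    done
}
```

Usage: `cancel_all_but "20003" 20001 20002 20003` cancels 20001 and 20002 and keeps 20003.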

Handling Impending Time Limit

If your job is near its time limit:

  scontrol update JobId=123456 TimeLimit=10:00:00   # ask for more time, if policy allows
  scontrol requeue 123456                           # or requeue, if the job can resume from a checkpoint
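To decide whether a job is close to its limit, you can query the remaining time with squeue's `%L` format field and convert it to seconds. This is a sketch: the function names are hypothetical, and the parser assumes the `[days-][hours:]minutes:seconds` layout that squeue normally prints, rejecting values such as UNLIMITED.

```bash
# slurm_time_to_seconds "[D-][HH:]MM:SS"
# Convert a SLURM time string to seconds; returns 1 for anything else.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]+-)?([0-9]+:)?[0-9]+:[0-9]+$' || return 1
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        if (index(rest, "-") > 0) { split(rest, d, "-"); days = d[1]; rest = d[2] }
        n = split(rest, f, ":")
        if (n == 3) secs = f[1] * 3600 + f[2] * 60 + f[3]
        else        secs = f[1] * 60 + f[2]
        print days * 86400 + secs
    }'
}

# time_left_seconds JOB_ID: remaining run time of a job, in seconds.
time_left_seconds() {
    slurm_time_to_seconds "$(squeue -h -j "$1" -o %L)"
}
```

A monitoring script could compare `time_left_seconds 123456` against a threshold and then request an extension or trigger a checkpoint.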

Good Practices and Pitfalls

Good Practices

Inspect a job before you cancel or modify it:

  scontrol show job 123456

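Common Pitfalls

Two pitfalls covered earlier are worth repeating: scancel -u $USER cancels every job you own, and forgotten interactive sessions keep consuming allocation time.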

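Summary of Key Commands

  squeue -u $USER                                  # list your jobs and their IDs
  scancel <job_id>                                 # cancel a job
  scancel -u $USER --state=PENDING                 # cancel a filtered set of jobs
  scontrol show job <job_id>                       # inspect a job
  scontrol update JobId=<job_id> <Field>=<Value>   # modify a job
  scontrol hold <job_id>                           # keep a pending job from starting
  scontrol release <job_id>                        # release a held job
  scontrol requeue <job_id>                        # return a job to the queue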

Use these tools to keep your jobs and the shared cluster running efficiently.
