Why Cancelling and Modifying Jobs Matters
On a shared HPC cluster, you will routinely need to:
- Stop jobs that are misconfigured, stuck, or no longer needed.
- Adjust jobs after submission (change resources, time limits, job name, dependencies, etc.).
- Clean up your queue when experimenting with parameters.
Doing this correctly avoids wasting resources, improves your own turnaround time, and is considered good cluster citizenship.
This chapter assumes you are using SLURM, as introduced earlier. Other schedulers have similar ideas but different commands.
Cancelling Jobs
Cancelling a Single Job
The basic command is:
scancel <job_id>
Typical workflow:
- Find your job ID:
squeue -u $USER
- Cancel it:
scancel 123456
Whether the job is running or pending, scancel moves it to the CANCELLED state, and any allocated resources are released.
Cancelling Multiple Jobs at Once
Common patterns:
- Cancel all your jobs:
scancel -u $USER
- Cancel multiple explicit IDs:
scancel 12345 12346 12347
- Cancel a job array:
scancel 20000 # whole array
scancel 20000_3 # specific task in the array
scancel 20000_[1-5] # range of tasks
Use with care, especially scancel -u $USER, as it can kill long-running, important work.
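Before any bulk cancel, it helps to preview exactly which jobs would be affected. A minimal sketch using standard squeue format options (%i, %j, %T print the job ID, name, and state):
squeue -u $USER -h -o "%i %j %T"   # list your jobs without the header line
scancel -u $USER                   # only once you are sure nothing important is listed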
Cancelling by Job State, Partition, or Name
You can be more selective:
- Cancel only pending jobs:
scancel -u $USER --state=PENDING
- Cancel jobs in a particular partition:
scancel -u $USER -p debug
- Cancel by job name:
scancel -u $USER --name=mytest
Combine filters to precisely target what you want to cancel.
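For example (a sketch; the partition and job name are placeholders), the filters can all go into a single command:
scancel -u $USER -p debug --state=PENDING --name=mytest   # only your pending "mytest" jobs in the debug partition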
What Happens When a Job Is Cancelled
Important behaviors:
- Running jobs are sent a signal (by default SIGTERM, then later SIGKILL).
- If your application handles signals, it can perform cleanup or checkpointing.
- Output files up to the cancellation point remain on disk.
- Pending jobs simply never start; no compute time is consumed.
This is one reason to implement checkpoint/restart in long jobs when possible.
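As a minimal sketch of the signal-handling idea in a bash batch script (the program name and checkpoint step are placeholders, and whether the batch shell itself receives the signal can depend on site configuration), a trap can catch the SIGTERM sent at cancellation and save state before the follow-up SIGKILL:
#!/bin/bash
#SBATCH --job-name=ckpt_demo   # hypothetical job name
#SBATCH --time=01:00:00
# Write a checkpoint marker when SIGTERM arrives, then exit cleanly.
save_and_exit() {
    echo "SIGTERM received, checkpointing" >> checkpoint.log
    # application-specific checkpoint command would go here
    exit 0
}
trap save_and_exit TERM
# Run the work in the background so the trap can fire promptly.
srun ./my_long_program &       # hypothetical executable
wait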
Cancel vs. Other Forms of Stopping Work
scancel:
- Scheduler-aware, preferred way to stop batch and array jobs.
kill:
- Sends signals directly from a node to a process.
- Should generally be avoided for batch jobs; can confuse accounting or leave allocations hanging in some configurations.
Ctrl+C:
- Typically only affects interactive commands in your terminal, not scheduled batch jobs.
Use scancel for anything under scheduler control.
Modifying Jobs: General Concepts
Once a job is submitted, there are limits to what you can change:
- Some fields can be modified in-place:
- Time limit (on some clusters).
- Job name.
- Eligible time or priority-related settings (subject to policy).
- Certain constraints (again, policy-dependent).
- Other fields normally cannot be changed:
- Number of nodes, tasks, CPUs per task, or memory.
- Partition (on many clusters).
- The script contents themselves.
Cluster policy determines what is allowed; commands below may fail with “update is not permitted” depending on configuration.
Modifying Pending Jobs with `scontrol update`
The generic mechanism in SLURM is scontrol:
scontrol show job <job_id> # inspect details
scontrol update JobId=<job_id> <Field>=<Value> ...
Common use cases for pending jobs:
Changing Job Name
scontrol update JobId=123456 Name=new_name
Why this helps:
- Keeps queues readable during parameter sweeps.
- Helps you track what each job is doing without resubmitting.
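A quick sketch of checking the name before and after the change (the job ID and new name are hypothetical; %i and %j print the job ID and name):
squeue -j 123456 -o "%i %j"                        # current name
scontrol update JobId=123456 Name=lr0.01_batch64   # encode the parameters in the name
squeue -j 123456 -o "%i %j"                        # confirm the change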
Adjusting Start Time / Hold Status
- Put a job on hold:
scontrol hold 123456
- Release a held job:
scontrol release 123456
You can also adjust when a job becomes eligible:
scontrol update JobId=123456 StartTime=2025-12-12T23:00:00
Format requirements may vary; check your site’s documentation.
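Holds also combine well with squeue filters. A sketch, assuming GNU xargs is available on the login node, that puts all of your currently pending jobs on hold at once:
squeue -u $USER -h -t PENDING -o "%i" | xargs -r scontrol hold   # %i prints only the job ID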
Modifying Time Limit (If Policy Allows)
On some systems, you may extend or reduce time for a pending job:
scontrol update JobId=123456 TimeLimit=02:00:00
Notes:
- You typically can’t exceed partition or account limits.
- Increasing the time limit may lower the job’s effective priority or push it into a later backfill slot, delaying its start.
If the command fails, your site may simply not allow after-submission changes.
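It is worth checking the current limit before and after the update; a minimal sketch using the job shown above:
scontrol show job 123456 | grep TimeLimit   # inspect the current limit
scontrol update JobId=123456 TimeLimit=02:00:00
scontrol show job 123456 | grep TimeLimit   # confirm the new limit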
Modifying Running Jobs
Sites vary widely in what they allow.
Extending Time on a Running Job
If allowed, you can sometimes request a time extension:
scontrol update JobId=123456 TimeLimit=03:00:00
Typical constraints:
- Extensions may only be possible while some buffer time remains (e.g., more than 5–10 minutes before current end).
- Limits may apply per user, per account, or per partition.
If an extension is denied, your job will end at the original limit; plan checkpointing accordingly.
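If your site supports it, sbatch’s --signal option can ask SLURM to warn the job shortly before the limit so it can checkpoint itself. A sketch (the signal choice, lead time, and program name are assumptions):
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --signal=B:USR1@300   # send USR1 to the batch shell roughly 5 minutes before the limit
# Reuse the trap pattern from the cancellation section, but on USR1,
# to trigger an application-specific checkpoint before time runs out.
trap 'echo "time limit approaching, checkpointing" >> checkpoint.log' USR1
srun ./my_long_program &      # hypothetical executable
wait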
Job Requeueing
Instead of canceling, you can requeue a job:
scontrol requeue 123456
Behavior:
- If the job is running, it is stopped and put back as pending.
- If the job is pending, it is simply placed back in the queue and remains pending.
- The job uses the same script and most of the same options.
This is useful when:
- You’ve fixed a transient problem (e.g., filesystem glitch) and want the job to run again from scratch.
- Your executable can restart from checkpoints automatically.
Some clusters allow automatic requeue on node failure via submission options; consult the SLURM chapter or site docs.
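As a sketch of those options (subject to site policy; job.sh and the job ID are placeholders):
sbatch --requeue job.sh    # allow SLURM to requeue this job automatically, e.g. after a node failure
scontrol requeue 123456    # requeue an existing job by hand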
Interactive Sessions
If you started an interactive job (for example with salloc or srun --pty):
- Terminate the shell or program to free resources:
exit    # from the interactive shell
- Or cancel explicitly from another terminal:
scancel <job_id>
Interactive sessions are easy to forget; always close them to avoid wasting allocation time.
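A sketch for spotting forgotten sessions: list your running jobs with their names and remaining time (%L), then cancel anything you no longer need:
squeue -u $USER -t RUNNING -o "%i %j %L"   # job ID, name, time left
scancel <job_id>                           # cancel the forgotten session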
Typical Workflows for Cancelling and Modifying
Correcting a Misconfigured Job
- Notice the problem (wrong script, wrong data path, etc.).
- Cancel:
scancel 123456
- Fix the job script.
- Resubmit with sbatch.
Trying to “patch” a misconfigured job in place is usually more fragile than cancelling and resubmitting.
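The whole cycle as a sketch (the job ID and script name are placeholders):
scancel 123456    # stop the misconfigured job
nano job.sh       # fix the script in any editor
sbatch job.sh     # resubmit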
Reducing Queue Load from Experiments
When exploring parameters:
- Submit multiple candidate jobs.
- Monitor which ones you actually need.
- Cancel unnecessary ones promptly:
scancel 20001 20002
Or, if you realize an entire batch of tests is pointless:
scancel -u $USER --name=experimentA
Handling Impending Time Limit
If your job is near its time limit:
- If extension is allowed:
scontrol update JobId=123456 TimeLimit=10:00:00
- Otherwise, implement checkpointing in your code and consider:
- Allowing it to finish naturally.
- Requeuing it later, if your workflow supports restarts:
scontrol requeue 123456
Good Practices and Pitfalls
Good Practices
- Cancel unused or clearly faulty jobs quickly.
- Use informative job names; update them if that helps clarity.
- Check your job’s status and limits regularly:
scontrol show job 123456
- Learn your site’s policies:
- Which job fields are modifiable?
- Are time extensions allowed? Under what conditions?
Common Pitfalls
- Forgetting to cancel stuck or hung jobs that are producing no useful output but consuming resources.
- Assuming you can change core counts or memory after submission; this usually requires cancellation and resubmission.
- Using kill on compute nodes instead of scancel.
- Leaving interactive sessions running for hours while idle.
Summary of Key Commands
- Cancel jobs:
scancel <job_id>
scancel -u $USER
scancel <array_id>_[range]
- Hold / release:
scontrol hold <job_id>
scontrol release <job_id>
- Modify attributes:
scontrol show job <job_id>
scontrol update JobId=<job_id> Name=<new_name>
scontrol update JobId=<job_id> TimeLimit=HH:MM:SS
scontrol update JobId=<job_id> StartTime=<timestamp>
- Requeue:
scontrol requeue <job_id>
Use these tools to keep your jobs and the shared cluster running efficiently.