Why Cancelling and Modifying Jobs Matters
On a shared HPC cluster, you will routinely need to:
- Stop jobs that are misconfigured, stuck, or no longer needed.
- Adjust jobs after submission (change resources, time limits, job name, dependencies, etc.).
- Clean up your queue when experimenting with parameters.
Doing this correctly avoids wasting resources, improves your own turnaround time, and is considered good cluster citizenship.
This chapter assumes you are using SLURM, as introduced earlier. Other schedulers have similar ideas but different commands.
Cancelling Jobs
Cancelling a Single Job
The basic command is:
scancel <job_id>
Typical workflow:
- Find your job ID:
squeue -u $USER
- Cancel it:
scancel 123456
Whether the job is running or pending, scancel moves it to the CANCELLED state, and any allocated resources are released.
Cancelling Multiple Jobs at Once
Common patterns:
- Cancel all your jobs:
scancel -u $USER
- Cancel multiple explicit IDs:
scancel 12345 12346 12347
- Cancel a job array:
scancel 20000 # whole array
scancel 20000_3 # specific task in the array
scancel 20000_[1-5] # range of tasks
Use with care, especially scancel -u $USER, as it can kill long-running, important work.
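Before any bulk cancel, it helps to preview exactly which jobs would be affected. A minimal sketch using standard squeue format options (%i, %j, %T print the job ID, name, and state):
squeue -u $USER -h -o "%i %j %T"   # list your jobs without the header line
scancel -u $USER                   # only once you are sure nothing important is listed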
Cancelling by Job State, Partition, or Name
You can be more selective:
- Cancel only pending jobs:
scancel -u $USER --state=PENDING
- Cancel jobs in a particular partition:
scancel -u $USER -p debug
- Cancel by job name:
scancel -u $USER --name=mytest
Combine filters to precisely target what you want to cancel.
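For example (a sketch; the partition and job name are placeholders), the filters can all go into a single command:
scancel -u $USER -p debug --state=PENDING --name=mytest   # only your pending "mytest" jobs in the debug partition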
What Happens When a Job Is Cancelled
Important behaviors:
- Running jobs are sent a signal (by default SIGTERM, then later SIGKILL).
- If your application handles signals, it can perform cleanup or checkpointing.
- Output files up to the cancellation point remain on disk.
- Pending jobs simply never start; no compute time is consumed.
This is one reason to implement checkpoint/restart in long jobs when possible.
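As a minimal sketch of the signal-handling idea in a bash batch script (the program name and checkpoint step are placeholders, and whether the batch shell itself receives the signal can depend on site configuration), a trap can catch the SIGTERM sent at cancellation and save state before the follow-up SIGKILL:
#!/bin/bash
#SBATCH --job-name=ckpt_demo   # hypothetical job name
#SBATCH --time=01:00:00
# Write a checkpoint marker when SIGTERM arrives, then exit cleanly.
save_and_exit() {
    echo "SIGTERM received, checkpointing" >> checkpoint.log
    # application-specific checkpoint command would go here
    exit 0
}
trap save_and_exit TERM
# Run the work in the background so the trap can fire promptly.
srun ./my_long_program &       # hypothetical executable
wait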
Cancel vs. Other Forms of Stopping Work
scancel:
- Scheduler-aware, preferred way to stop batch and array jobs.
kill:
- Sends signals directly from a node to a process.
- Should generally be avoided for batch jobs; can confuse accounting or leave allocations hanging in some configurations.
Ctrl+C:
- Typically only affects interactive commands in your terminal, not scheduled batch jobs.
Use scancel for anything under scheduler control.
Modifying Jobs: General Concepts
Once a job is submitted, there are limits to what you can change:
- Some fields can be modified in-place:
- Time limit (on some clusters).
- Job name.
- Eligible time or priority-related settings (subject to policy).
- Certain constraints (again, policy-dependent).
- Other fields normally cannot be changed:
- Number of nodes, tasks, CPUs per task, or memory.
- Partition (on many clusters).
- The script contents themselves.
Cluster policy determines what is allowed; commands below may fail with “update is not permitted” depending on configuration.
Modifying Pending Jobs with `scontrol update`
The generic mechanism in SLURM is scontrol:
scontrol show job <job_id> # inspect details
scontrol update JobId=<job_id> <Field>=<Value> ...
Common use cases for pending jobs:
Changing Job Name
scontrol update JobId=123456 Name=new_name
Why this helps:
- Keeps queues readable during parameter sweeps.
- Helps you track what each job is doing without resubmitting.
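A quick sketch of checking the name before and after the change (the job ID and new name are hypothetical; %i and %j print the job ID and name):
squeue -j 123456 -o "%i %j"                        # current name
scontrol update JobId=123456 Name=lr0.01_batch64   # encode the parameters in the name
squeue -j 123456 -o "%i %j"                        # confirm the change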
Adjusting Start Time / Hold Status
- Put a job on hold:
scontrol hold 123456
- Release a held job:
scontrol release 123456
You can also adjust when a job becomes eligible:
scontrol update JobId=123456 StartTime=2025-12-12T23:00:00
Format requirements may vary; check your site’s documentation.
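Holds also combine well with squeue filters. A sketch, assuming GNU xargs is available on the login node, that puts all of your currently pending jobs on hold at once:
squeue -u $USER -h -t PENDING -o "%i" | xargs -r scontrol hold   # %i prints only the job ID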
Modifying Time Limit (If Policy Allows)
On some systems, you may extend or reduce time for a pending job:
scontrol update JobId=123456 TimeLimit=02:00:00
Notes:
- You typically can’t exceed partition or account limits.
- Increasing the time limit may lower the job’s effective priority or push it into a later backfill slot, delaying its start.
If the command fails, your site may simply not allow after-submission changes.
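It is worth checking the current limit before and after the update; a minimal sketch using the job shown above:
scontrol show job 123456 | grep TimeLimit   # inspect the current limit
scontrol update JobId=123456 TimeLimit=02:00:00
scontrol show job 123456 | grep TimeLimit   # confirm the new limit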
Modifying Running Jobs
Sites vary widely in what they allow.
Extending Time on a Running Job
If allowed, you can sometimes request a time extension:
scontrol update JobId=123456 TimeLimit=03:00:00
Typical constraints:
- Extensions may only be possible while some buffer time remains (e.g., more than 5–10 minutes before current end).
- Limits may apply per user, per account, or per partition.
If an extension is denied, your job will end at the original limit; plan checkpointing accordingly.
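If your site supports it, sbatch’s --signal option can ask SLURM to warn the job shortly before the limit so it can checkpoint itself. A sketch (the signal choice, lead time, and program name are assumptions):
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --signal=B:USR1@300   # send USR1 to the batch shell roughly 5 minutes before the limit
# Reuse the trap pattern from the cancellation section, but on USR1,
# to trigger an application-specific checkpoint before time runs out.
trap 'echo "time limit approaching, checkpointing" >> checkpoint.log' USR1
srun ./my_long_program &      # hypothetical executable
wait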
Job Requeueing
Instead of canceling, you can requeue a job:
scontrol requeue 123456
Behavior:
- If the job is running, it is stopped and put back as pending.
- If the job is pending, it is simply placed back in the queue and remains pending.
- The job uses the same script and most of the same options.
This is useful when:
- You’ve fixed a transient problem (e.g., filesystem glitch) and want the job to run again from scratch.
- Your executable can restart from checkpoints automatically.
Some clusters allow automatic requeue on node failure via submission options; consult the SLURM chapter or site docs.
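As a sketch of those options (subject to site policy; job.sh and the job ID are placeholders):
sbatch --requeue job.sh    # allow SLURM to requeue this job automatically, e.g. after a node failure
scontrol requeue 123456    # requeue an existing job by hand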
Interactive Sessions
If you started an interactive job (for example with salloc or srun --pty):
- Terminate the shell or program to free resources:
exit    # from the interactive shell
- Or cancel explicitly from another terminal:
scancel <job_id>
Interactive sessions are easy to forget; always close them to avoid wasting allocation time.
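A sketch for spotting forgotten sessions: list your running jobs with their names and remaining time (%L), then cancel anything you no longer need:
squeue -u $USER -t RUNNING -o "%i %j %L"   # job ID, name, time left
scancel <job_id>                           # cancel the forgotten session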
Typical Workflows for Cancelling and Modifying
Correcting a Misconfigured Job
- Notice the problem (wrong script, wrong data path, etc.).
- Cancel:
scancel 123456
- Fix the job script.
- Resubmit with sbatch.
Trying to “patch” a misconfigured job in place is usually more fragile than cancelling and resubmitting.
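The whole cycle as a sketch (the job ID and script name are placeholders):
scancel 123456    # stop the misconfigured job
nano job.sh       # fix the script in any editor
sbatch job.sh     # resubmit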
Reducing Queue Load from Experiments
When exploring parameters:
- Submit multiple candidate jobs.
- Monitor which ones you actually need.
- Cancel unnecessary ones promptly:
scancel 20001 20002
Or, if you realize an entire batch of tests is pointless:
scancel -u $USER --name=experimentA
Handling Impending Time Limit
If your job is near its time limit:
- If extension is allowed:
scontrol update JobId=123456 TimeLimit=10:00:00
- Otherwise, implement checkpointing in your code and consider:
- Allowing it to finish naturally.
- Requeuing it later, if your workflow supports restarts:
scontrol requeue 123456
Good Practices and Pitfalls
Good Practices
- Cancel unused or clearly faulty jobs quickly.
- Use informative job names; update them if that helps clarity.
- Check your job’s status and limits regularly:
scontrol show job 123456
- Learn your site’s policies:
- Which job fields are modifiable?
- Are time extensions allowed? Under what conditions?
Common Pitfalls
- Forgetting to cancel stuck or hung jobs that are producing no useful output but consuming resources.
- Assuming you can change core counts or memory after submission; this usually requires cancellation and resubmission.
- Using kill on compute nodes instead of scancel.
- Leaving interactive sessions running for hours while idle.
Summary of Key Commands
- Cancel jobs:
scancel <job_id>
scancel -u $USER
scancel <array_id>_[range]
- Hold / release:
scontrol hold <job_id>
scontrol release <job_id>
- Modify attributes:
scontrol show job <job_id>
scontrol update JobId=<job_id> Name=<new_name>
scontrol update JobId=<job_id> TimeLimit=HH:MM:SS
scontrol update JobId=<job_id> StartTime=<timestamp>
- Requeue:
scontrol requeue <job_id>
Use these tools to keep your jobs and the shared cluster running efficiently.