Cancelling and modifying jobs

Understanding When and Why to Change Jobs

Job schedulers are designed to manage shared resources fairly and efficiently. Once a job has been submitted, you are not completely locked in. In practice, you often need to stop jobs that behave unexpectedly or adjust their requested resources and runtime as you refine your workflow.

In this chapter, you will focus on what you can safely change, what usually cannot be changed, and how to do it correctly. The specific commands and options vary by scheduler, but the underlying ideas are similar across systems. Since this course uses SLURM as the main reference, examples will use SLURM commands and terminology. Other batch systems usually provide analogous functionality.

The most common reasons to cancel or modify jobs are incorrect resource requests, runaway jobs that consume excessive quota, jobs stuck in a dependency chain, and stepwise tuning of job parameters for performance or debugging.

Key principle: Only modify or cancel your own jobs, unless you have explicit administrative privileges. Never attempt to interfere with other users’ jobs.

Cancelling Jobs Safely

Cancelling a job means telling the scheduler to stop managing that job and to terminate its execution if it is running. This is one of the operations you will use most often while learning to work with an HPC cluster.

You typically cancel a job in three situations. First, you realize after submission that the job was misconfigured, for example with the wrong partition, the wrong script, or the wrong input. Second, the job is running but clearly misbehaving, for example because it is stuck, reading the wrong files, or producing corrupted output. Third, you want to clear the queue of old jobs, such as obsolete test runs or jobs that are no longer needed because a newer version exists.

In SLURM, you cancel a job with a command like scancel <jobid>. This is intended to be quick and decisive. For interactive jobs, cancelling may also be triggered by closing the terminal or interrupting the client program, but the explicit cancel command is more reliable and easier to track.
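
For example, assuming a hypothetical job ID of 123456 and a hypothetical job name, typical cancellation commands look like this:

    # Cancel a single job by its ID.
    scancel 123456

    # Cancel all of your jobs that were submitted with a given job name.
    scancel --name=myjob

    # Cancel all of your jobs that are still pending in the queue.
    scancel --state=PENDING --user=$USER

The exact filters available can differ between SLURM versions, so check man scancel on your system before relying on them.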

Rule: Cancel misconfigured or misbehaving jobs as soon as you are confident they should not continue. Letting them run wastes shared resources and may violate usage policies.

You should always check what effect cancellation has on your files. Some applications write output incrementally, so a cancelled job may leave partially complete but still useful results. Others only finalize files at the end, so cancelling guarantees that the current output is incomplete. Partially written files can create confusion later if they are not clearly marked or cleaned up.

Cancelling Jobs in Different States

The behavior of cancellation depends on the current state of the job. Job states and their basic meanings are introduced in the earlier chapters on job submission and monitoring. Here, the focus is on how cancellation interacts with those states.

For jobs that have not started yet and are pending in the queue, cancellation is straightforward. The scheduler simply removes them from the queue and they never consume compute resources. This is the safest moment to correct mistakes, for example after spotting a typo in the time limit.

For running jobs, cancellation tells the scheduler to send a termination signal to all job processes. In a well behaved application, this signal can trigger cleanup actions, such as closing files or writing final checkpoints. How graceful this termination is depends entirely on the application. Some jobs stop almost immediately, while others take a short time to exit, especially if they handle signals and run cleanup code.

Occasionally, a job may already be in a completing or similar finalization state when you cancel it. In that case, the cancellation may shorten the finalization phase, or it may be ignored if the scheduler judges that the job is already effectively finished. On most systems, repeating the cancellation request does not speed this up, so there is usually no benefit in issuing it more than once.

In rare situations, a job may appear stuck in cancelling state. This usually indicates that some processes have not terminated correctly and the scheduler is waiting for the system to clean them up. If this persists, you may need to contact support, especially if the job continues to hold resources.

Job Termination Signals and Application Behavior

Under the hood, the scheduler cancels a job by sending signals to its processes on behalf of the user. The exact signals and their sequence are scheduler specific, but many systems use a tiered approach. First, they send a request to terminate gracefully, such as SIGTERM. If processes do not exit within a configured timeout, a stronger signal such as SIGKILL follows, which the application cannot intercept or handle.

The important point is that some applications are aware of signals and respond politely. For example, a simulation code may catch SIGTERM, write a final checkpoint file, flush logs, and then exit. Others may not have any signal handling logic and will just terminate abruptly when the operating system forces them to stop.
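
As an illustration, the sketch below shows a batch script that asks SLURM for an early warning signal and traps it in the shell so that a checkpoint can be written before termination. The checkpoint command and application name are placeholders, and whether the batch shell itself receives a signal on cancellation depends on how the job is cancelled and on site configuration.

    #!/bin/bash
    #SBATCH --time=04:00:00
    # Ask SLURM to send SIGTERM to the batch shell 120 seconds before
    # the job is terminated (the B: prefix targets the batch step only).
    #SBATCH --signal=B:TERM@120

    # On SIGTERM, write a checkpoint and exit cleanly.
    # my_checkpoint_command stands in for whatever your application provides.
    trap 'my_checkpoint_command; exit 0' TERM

    # Run the application in the background and wait, so the shell can
    # process the trap while the application is still running.
    srun ./my_simulation &
    wait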

For a user, this means that you cannot automatically rely on cancellation to produce a usable restart point. It is the responsibility of the application developer to implement termination behavior that makes restart possible. When you choose or write code for long running jobs, it is worth understanding how it responds to job cancellation and to time limit expiry.

Best practice: For long runs, do not depend on manual cancellation as a normal way to stop jobs. Instead, use explicit checkpointing and normal application termination whenever possible.

Cancelling Job Arrays and Groups of Jobs

Job arrays and grouped jobs allow you to manage many related tasks together. Cancelling these collections efficiently can save considerable effort when you discover a conceptual problem that affects all tasks.

For job arrays, schedulers usually support cancelling the entire array at once. In SLURM, you can pass the array job identifier to scancel, and the scheduler will cancel all elements. This is appropriate if you realize that a common configuration mistake is present in all tasks or that the entire set is no longer needed.

You can also cancel only part of an array. Many schedulers allow you to specify a subset of task indices, so you can stop just the tasks that are failing or that cover an uninteresting parameter range. This is useful when you have already obtained enough results in one region of parameter space and want to free resources without discarding successful runs elsewhere in the array.
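
For instance, assuming a hypothetical array job with ID 123456, you might cancel the whole array or only selected task indices:

    # Cancel the entire array, including all pending and running tasks.
    scancel 123456

    # Cancel a single array task.
    scancel 123456_17

    # Cancel a contiguous range of array tasks.
    scancel "123456_[40-100]"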

Some systems also support cancelling sets of jobs selected by user, account, partition, or other criteria. Such mass cancellations should be used with care, and on multi user systems they may be restricted to administrators. Before using wide patterns, always verify that you are not inadvertently targeting jobs that should continue to run.
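
If your site permits this kind of selection, a cautious pattern is to preview the affected jobs first. The partition name below is a placeholder:

    # Review which of your jobs would be affected before cancelling anything.
    squeue --user=$USER --partition=debug --states=PENDING

    # Then cancel only your own pending jobs in that partition.
    scancel --user=$USER --partition=debug --state=PENDING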

Modifying Jobs Before They Start

Modifying a job means changing its properties after submission. What you can modify depends strongly on the scheduler configuration and the current job state. In general, it is easier and safer to change jobs that are still pending and have not started execution.

Schedulers often allow adjustments to certain fields of a pending job. Typical examples are the time limit, the partition, the quality of service class, and, sometimes, the number of nodes or cores, within policy limits. In SLURM, such changes are often possible with a command like scontrol update that operates on the job identifier and specific fields.

Modifying a pending job can save you from the full overhead of cancellation and resubmission, especially in systems where queues are long or submission triggers complex workflows. For example, if you realize you have requested too little time, you can increase the time limit if your policies allow it. Similarly, if you mistakenly chose a debug partition instead of a regular one, you may be able to move the job without crafting a new script.
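
A sketch of such adjustments to a pending job, using a placeholder job ID and partition name and assuming your site allows these fields to be changed:

    # Inspect the current settings of the pending job.
    scontrol show job 123456

    # Increase the requested time limit, if policy permits.
    scontrol update JobId=123456 TimeLimit=08:00:00

    # Move the job from a debug partition to a regular one.
    scontrol update JobId=123456 Partition=standard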

There are, however, practical and policy limits. Many clusters restrict which fields can be changed and by how much, to maintain fairness and prevent circumvention of scheduling rules. Sometimes you can decrease resource requests more freely than you can increase them, since reductions are more likely to help the scheduler fit jobs on the cluster.

Guideline: Only modify jobs in ways that remain consistent with site policies and your original intent. If the necessary change is large, it is usually clearer to cancel and resubmit with a corrected script.

Modifying Resources of Running Jobs

Changing the resource allocation of a job that is already running is more complex and often not allowed. The scheduler must maintain a consistent view of which nodes, cores, and memory are reserved for each job. Dynamically increasing or decreasing that allocation can interfere with other jobs and with the internal algorithms of the scheduler.

Some schedulers provide limited support for expanding or shrinking running jobs, often under special reservation or advanced features such as dynamic resources or elastic jobs. These are typically used in specialized environments and require both scheduler support and application level support for resizing. For most everyday users, such functionality is not available or not recommended.

You may, however, be able to adjust some non critical attributes of a running job, such as its priority within your own jobs or optional metadata fields. These adjustments are usually cosmetic from a resource perspective and are meant for organizational convenience rather than for altering how much of the cluster the job consumes.
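
For example, if your site exposes the comment field to users, you can annotate one of your jobs for your own bookkeeping (placeholder job ID and text):

    # Attach a descriptive comment to a job without affecting its resources.
    scontrol update JobId=123456 Comment="baseline run with corrected input"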

Because of these constraints, the practical approach to significant changes in resource needs for a running job is usually to let it finish if it is close to completion, or to cancel it and resubmit a revised job that requests appropriate resources and perhaps starts from a checkpoint.

Extending or Reducing Time Limits

Time limits are a special case of job modification that arise frequently. Initially, you estimate how long your job will need and request that much time at submission. If your estimate was wrong, you might want to change the limit.

On many systems, increasing the time limit of a pending job is possible within certain bounds. The scheduler may allow you to increase it up to a per user or per partition maximum. The reasoning is that modifying your estimate before the job starts has no effect on running jobs and only influences future scheduling decisions.

Increasing the time limit of a running job is more sensitive. Some schedulers and clusters forbid it entirely. Others allow it in limited scenarios, often with the constraint that the job has not yet approached its original time boundary. The scheduler needs to ensure that this extension does not violate fairness or resource planning, since other jobs may have been scheduled under the assumption that your job would end at a particular time.

Reducing the time limit is typically less problematic. You may wish to do this if you realize the job will complete earlier than expected and you want the scheduler to have a more accurate view of resource availability. Many schedulers let you shorten the limit on pending or running jobs, which can help them fit into backfill windows and improve overall throughput.
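
For example, to give the scheduler a more accurate picture when you know a job will finish early, you can shorten its limit. The job ID is a placeholder, and whether this is accepted for running jobs depends on your site:

    # Reduce the time limit of a pending or running job to three hours.
    scontrol update JobId=123456 TimeLimit=03:00:00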

Even when extension is possible, you should not rely on it as a normal pattern. A better practice is to develop more accurate runtime estimates through small test runs, benchmarking, or analysis of previous completed jobs. This also helps with planning budgets and understanding performance.

Changing Job Priorities and Dependencies

While fundamental scheduling policy is under administrative control, many schedulers provide limited ways for users to influence the relative ordering of their own jobs. This can be viewed as a kind of modification, but it does not change resource quantities.

Some systems allow you to slightly adjust the priority of your pending jobs relative to each other. For example, you may lower the importance of a long, exploratory job so that a short, deadline driven job can run earlier. Typically, you cannot elevate your jobs above those of other users through this mechanism, but you can reorder them within your own queue.
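
In SLURM, this reordering is typically expressed through the job's nice value, which unprivileged users can only use to lower a job's priority relative to their other jobs. A sketch with placeholder values:

    # Deprioritize a long exploratory job so that your other jobs are considered first.
    scontrol update JobId=123456 Nice=1000

    # A nice value can also be set at submission time.
    sbatch --nice=1000 exploratory_job.sh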

Dependencies, such as “start job B only after job A finishes successfully,” can sometimes be modified as well. If you realize that a dependency was mis-specified or that a prerequisite job is no longer relevant, adjusting or removing dependencies might help your workflow proceed. In other cases, it may be clearer to cancel the dependent jobs and resubmit them with correct relationships.
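
A sketch of expressing and adjusting a dependency, with all job IDs and script names as placeholders:

    # Submit job B so that it starts only after job A (ID 111) completes successfully.
    sbatch --dependency=afterok:111 job_b.sh

    # If the prerequisite changes, repoint the dependency of the still-pending
    # job B (ID 112) to a different job, provided your site allows this.
    scontrol update JobId=112 Dependency=afterok:113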

When altering dependencies, carefully verify that you are not creating cycles or leaving jobs in a state where they will never start. A job that depends on an earlier job that was cancelled or failed without any alternative path is effectively blocked. It is usually better to decide explicitly whether such jobs should be cancelled or updated with a new dependency.

Interactive Jobs and Graceful Exit

Interactive jobs provide a shell or environment on compute nodes, and their lifecycle often feels more like a local session than a batch job. However, they are still managed by the scheduler and count against resource usage.

You can usually terminate an interactive job by exiting the shell, closing the terminal, or sending an interrupt from the client. In addition, the standard cancellation command that you use for batch jobs applies to the interactive job identifier. If network disruptions or client side problems leave the interactive session unresponsive, cancelling the job explicitly ensures that resources on the compute nodes are released.
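
For example, if a dropped connection leaves an interactive session behind, you can locate and cancel it explicitly (the job ID is a placeholder):

    # List your running jobs to find the orphaned interactive session.
    squeue --user=$USER --states=RUNNING

    # Cancel it by its job ID once you have identified it.
    scancel 123456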

For interactive sessions that launch long running commands or parallel test runs, try to stop the application first in a controlled way. For example, send an interrupt to the program or use its own quit command. Only then terminate the job. This pattern helps ensure that temporary files and data are left in a consistent state.

Because interactive jobs often use valuable shared resources, leaving them idle is discouraged. If you are done with your session, exit cleanly rather than leaving a long running but idle job that simply awaits its time limit.

Handling Jobs That Exceed Time or Resources

Although this chapter focuses on user initiated cancellation and modification, it is important to understand what happens when the scheduler or system terminates a job due to policy limits. A job that exceeds its time limit or allocated memory may be killed automatically.

From the scheduler's point of view, such terminations are similar to user cancellations, but from the application's point of view they can be more abrupt, especially for hard memory limits. Your output logs and job status information will typically indicate that the job was cancelled because it hit its time limit or ran out of memory.
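
One way to confirm the cause, assuming job accounting is enabled on your cluster and using a placeholder job ID:

    # Show the final state (for example TIMEOUT, OUT_OF_MEMORY, or CANCELLED)
    # together with elapsed time and peak memory use.
    sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS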

After such an event, the recommended actions are to inspect logs to confirm the cause, adjust your job script to request more appropriate resources or refine the code to use resources more efficiently, and, if necessary, start a new job from a suitable checkpoint or from the beginning. Attempting to modify a job that has already been terminated is not meaningful, since the scheduler treats each job instance as immutable once it completes or is cancelled.

Understanding these automatic cancellations helps you distinguish between issues you control, such as incorrect resource requests, and issues that stem from application bugs or genuine resource insufficiency.

Policy Awareness and Responsible Job Management

Finally, both cancellation and modification are subject to site specific policies. Some clusters log all such operations for accounting and troubleshooting. Others limit how frequently or how drastically users may modify jobs. Reading the local documentation and paying attention to examples in provided templates is crucial.

From a collaborative perspective, responsible job management means that you cancel unneeded jobs promptly, monitor running jobs so that serious misbehavior is not left unattended, and avoid repeatedly modifying jobs in ways that confuse the scheduler or other users who share the system. Good habits in this area directly contribute to better throughput, shorter wait times, and a smoother experience for the entire user community.

Whenever you are uncertain whether a particular modification is allowed or advisable, it is usually better to cancel and resubmit with a clean configuration, or to ask local support for guidance. Over time, experience will help you predict when a small modification is safe and efficient, and when a fresh job submission is the more robust choice.
