Watching Your Jobs While They Run
Monitoring jobs is about answering a few simple questions while your job is in the queue or running on the cluster. Is it waiting or executing? How much time and memory is it using? Is it behaving as expected, or is it stuck? This chapter focuses on practical tools and habits for checking on jobs after you have submitted them with a batch system such as SLURM, without revisiting submission details or scheduler theory.
Basic Status: Is My Job Queued or Running
Once you submit a job in SLURM using sbatch, you receive a numeric job ID. This ID is your main handle for monitoring.
The most common command is squeue, which lists jobs known to the scheduler. To filter to your own jobs, use
squeue -u $USER
This shows information such as job ID, job name, user, job state, time in the current state, and the nodes allocated. Typical states include pending (PD), running (R), and completing (CG); jobs that have completed or failed may appear briefly with states such as CD or F, and then disappear from squeue once the scheduler has purged them.
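If you only care about jobs in a particular state, squeue can also filter by state with its -t option. A minimal sketch, using the standard SLURM state names:
squeue -u $USER -t PENDING    # only jobs still waiting in the queue
squeue -u $USER -t RUNNING    # only jobs currently executing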
To check a single job, you can filter by job ID:
squeue -j <jobid>
If this shows no output, and you are sure the ID is correct, your job has likely finished or been cancelled. At that point, your main clues will be in job accounting commands and in your job’s output and error files.
Understanding Job States and Reasons
When a job is pending, SLURM usually records a reason. You can request extended information, for example
squeue -u $USER -l
or
squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
where the %R field shows the reason. Typical reasons include insufficient resources, partition limits, job size too large for the selected partition, or user and account constraints.
It is important to distinguish between jobs that are pending because the cluster is busy and jobs that will never run because of configuration or request problems. If the reason suggests that limits are exceeded or the job violates constraints, you may need to adjust your requested resources in your job script rather than just wait.
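One way to focus on this question is to list only your pending jobs together with their reasons, or to ask for the reason of one specific job. The field widths below are only a suggestion:
squeue -u $USER -t PD -o "%.18i %.9P %.8j %.2t %R"   # pending jobs and their reasons
scontrol show job <jobid> | grep -i reason            # the reason recorded for one job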
For running jobs, the ST column is usually R. The TIME column shows how long the job has been in the running state, not wall clock since submission. By comparing this to the time limit you requested, you can estimate how much of the allocated time has already been used.
A job in state PD (pending) or R (running) is not guaranteed to complete successfully. Always check your output and error files after the job leaves the queue to confirm a clean run.
Getting Detailed Job Information with sacct and scontrol
Once jobs have started, and especially after they finish, sacct provides accounting information, including for jobs that squeue no longer lists. To see a summary of recent jobs, try
sacct -u $USER
For a particular job ID, with more detail, you can use
sacct -j <jobid> --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,MaxRSS,ReqMem
The exact fields available depend on your cluster configuration but typically include final state, exit code, elapsed runtime, and peak memory usage as reported by the scheduler.
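If you want to feed accounting data into scripts rather than read it by eye, sacct can emit machine-readable output. A minimal sketch, assuming your site exposes these common fields:
sacct -j <jobid> -o JobID,State,ExitCode,Elapsed,MaxRSS --parsable2   # pipe-delimited output, easy to parse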
scontrol show job <jobid> gives a detailed snapshot of a job’s configuration and status while it is pending or running. This can include requested and allocated resources, the current node list, time limits, and some internal scheduler flags. Although the output is verbose, you can quickly scan it for fields such as JobState=RUNNING, RunTime=, TimeLimit=, and NodeList= to confirm where and how the job is executing.
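Because the full output is long, it is often convenient to filter it down to the handful of fields you care about, for example with grep:
scontrol show job <jobid> | grep -E 'JobState|RunTime|TimeLimit|NodeList'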
Locating and Interpreting Job Output and Error Files
When you submit a batch script, you typically specify output and error files via directives such as
#SBATCH -o myjob.out
#SBATCH -e myjob.err
or by using a combined output option. Monitoring jobs often means following these files as the job progresses.
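If you reuse the same script for many runs, fixed names mean each run overwrites the previous logs. SLURM filename patterns avoid this; %j expands to the job ID, so each run keeps its own files. For example:
#SBATCH -o myjob.%j.out
#SBATCH -e myjob.%j.err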
To watch a file as it grows, use tail with follow mode:
tail -f myjob.out
This shows the last lines of the file and keeps printing new lines as the job writes them. If your application prints progress messages or timestamps, tail -f is one of the simplest ways to check whether the job is advancing and how far through its workflow it currently is.
When you suspect problems, check the error file in the same way. Compiler messages, runtime errors, and abort signals are usually written to the error stream. If you see the same message repeated many times, or nothing is written for a long time when you expect regular updates, this can hint at hangs, infinite loops, or early crashes that happen before your own logging begins.
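You can follow both streams at once by passing several files to tail; GNU tail labels each block of lines with the file it came from:
tail -f myjob.out myjob.err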
After the job finishes, you can use tools like grep, head, and less to quickly inspect these logs. For example, you might
grep -i "error" myjob.err
to search for error messages, or
head myjob.out
tail myjob.out
to view initial setup messages and final completion messages, respectively.
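These commands can be combined for a quick post-run triage; the patterns below are only examples and should be adapted to what your application actually prints:
grep -icE "error|fail|abort" myjob.err   # count suspicious lines in the error log
tail -n 20 myjob.out                     # last lines, often a summary or completion message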
Monitoring Resource Usage While Jobs Run
Effective monitoring in HPC is not only about whether a job is running but also how it uses the resources you requested. A job that is running but using far less CPU, memory, or I/O than expected may be inefficient or misconfigured, and a job that approaches limits is at risk of failure.
On many clusters, the simplest way to see live resource usage on allocated nodes is to log in to one of those nodes and use local monitoring tools. If your job is running on a node listed in NodeList from scontrol show job, and the system permits interactive logins to compute nodes, you might
ssh <node-name>
and then run tools like top or htop. Within top, you can sort by CPU or memory usage, check that the expected number of processes or threads are active, and confirm that CPUs are not idle when they should be busy.
If your application runs as multiple processes under MPI, each process may appear separately. For threaded programs, you may see a single process with multiple threads. You do not need to analyze this in depth here, but you should learn to recognize whether your program is using as many cores as you requested or sitting mostly idle.
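A lightweight way to get this overview without an interactive top session is ps. The sketch below lists your processes sorted by CPU usage; the nlwp column is the number of threads per process:
ps -u $USER -o pid,nlwp,pcpu,pmem,etime,comm --sort=-pcpu | head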
Danger: If you monitor compute nodes directly, avoid running heavy auxiliary applications that compete with your own jobs for CPU or memory. Monitoring itself should be lightweight and short-lived.
Some sites provide wrapper commands such as seff <jobid> that parse SLURM accounting data and show a compact summary of efficiency, including CPU utilization and average or maximum memory usage. When available, these are convenient tools for after-the-fact evaluation and can highlight jobs that requested far more resources than they actually used.
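Where such a wrapper is installed, usage is typically as simple as passing the ID of a finished job; the exact output depends on the site:
seff <jobid>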
Time Limits and Approaching Job Expiry
SLURM enforces job time limits. When you submit a job you normally specify a time limit, and the system may also impose partition-wide maximums. Monitoring jobs includes keeping an eye on how close a running job is to its time limit.
squeue -j <jobid> -o "%.18i %.9P %.8j %.2t %.10M %.10l %.6D %R"
is a useful pattern, where %M is elapsed time and %l is the time limit. Comparing these two columns tells you how much time remains. If your job appears far from completion based on its own progress messages, but very close to the time limit, you may need to plan a restart strategy or adjust future submissions.
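If you want to keep an eye on this without retyping the command, you can wrap it in watch, assuming watch is available on the login node; keep the interval long so the check stays lightweight:
watch -n 60 "squeue -j <jobid> -o '%.18i %.2t %.10M %.10l'"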
Some applications write their own estimates of remaining time, for example by reporting iteration numbers, time step counts, or percentage complete. By comparing those with the remaining wall time, you can judge whether the job is on track to finish in time or almost certain to hit the limit.
When a job exceeds its time limit, the scheduler terminates it with an appropriate signal, and you will usually find a message in the error file or in sacct indicating that the time limit was reached. Monitoring and reacting early can help you avoid wasted compute by adjusting workflows, checkpointing more frequently, or resizing the problem.
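After such a termination, a quick way to confirm that the time limit was the cause is to compare the elapsed time with the limit in the accounting record. This sketch assumes your site's sacct exposes the Timelimit field:
sacct -j <jobid> -o JobID,State,Elapsed,Timelimit   # State is typically TIMEOUT in this case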
Detecting Stalled or Misbehaving Jobs
Not all running jobs are healthy. Some may enter deadlocks, infinite loops, or extremely slow phases. Monitoring helps detect such behavior.
Symptoms of a stalled job include:
No progress messages for an unexpectedly long period when the application normally reports regularly.
CPU usage close to zero for the main processes while wall time continues to increase.
Rapid growth of output or error files with repeated messages, which can indicate a failure that is being retried endlessly or an unintentional verbose loop.
To investigate, you can combine several tools. Use squeue or sacct to confirm that the job is still in running state. Examine the output and error logs for recent activity or repeated patterns. If allowed, log into the node and use top or ps to see which processes belong to your job and what they are doing.
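A simple, indirect check that needs no node access is the modification time of the output file: if the job normally writes regularly but the file has not changed for a long time, that supports the suspicion of a stall.
ls -l myjob.out   # compare the file's timestamp with the current time
date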
If the job is clearly stuck and not making useful progress, it is often better to cancel it, adjust parameters or code, and resubmit, rather than allow it to consume wall time and resources unproductively.
Using Job Cancellation as a Monitoring Response
Monitoring is only useful if you act on what you see. When monitoring reveals that you mis-specified resources, submitted the wrong script, or started a job that is clearly failing, the correct response is often to cancel the job.
Although job cancellation itself is covered in a separate chapter, it is important to see it as one of the possible outcomes of monitoring. You watch the job, gather evidence about its behavior, then decide to let it run, adjust your future submissions, or actively terminate it and free the resources.
A typical cycle is:
Submit a job and record the job ID.
Use squeue, sacct, and log inspection to verify that it started correctly.
During the run, periodically check that progress is being made and that resource usage looks reasonable.
If the job finishes successfully, use accounting data and logs to evaluate efficiency and inform better job sizing in the future.
If the job misbehaves or is clearly misconfigured, cancel it and correct the issue before resubmitting.
This feedback loop between submission, monitoring, and adjustment is central to effective and responsible use of HPC clusters.
Putting It Together in a Routine
In practice, monitoring jobs becomes a simple routine:
Right after submission, use squeue -j <jobid> to confirm that the scheduler has accepted the job and placed it in the queue.
Once the job starts, use squeue and possibly scontrol show job to check which nodes it runs on and how long it has been running.
Follow the output file with tail -f to see application progress.
Optionally, if permitted, inspect the compute node briefly with top to verify CPU and memory usage.
After completion, use sacct or cluster-specific reporting tools to examine final state, runtime, and maximum memory. Combine this with log file inspection to confirm correct results.
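The routine above can be condensed into a short shell sketch. The script name job.sh and the log file name are placeholders for your own files; --parsable makes sbatch print just the job ID so it can be captured in a variable:
jobid=$(sbatch --parsable job.sh)                 # submit and record the job ID
squeue -j "$jobid"                                # confirm it is queued or running
tail -f myjob.out                                 # follow progress (Ctrl-C stops following)
sacct -j "$jobid" -o JobID,State,Elapsed,MaxRSS   # review after completion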
By adopting such a workflow, you turn monitoring jobs from an occasional emergency check into a regular habit that improves reliability, efficiency, and your understanding of how your applications behave on an HPC system.