Why monitoring matters
Once a job is submitted, the scheduler controls when and where it runs, but you are still responsible for:
- Checking whether it actually started
- Tracking resource usage (CPUs, memory, GPUs, time)
- Detecting problems early (stuck, failing, oversubscribing resources)
- Collecting information needed for performance tuning later
Monitoring is mostly about querying the scheduler and inspecting job-generated output.
This chapter focuses on typical SLURM-based systems, since they dominate HPC today. Other schedulers offer similar functionality with different commands and options.
Basic SLURM job status commands
`squeue`: listing running and pending jobs
squeue shows jobs known to the scheduler (pending, running, completing, etc.).
Basic usage to see your own jobs:
squeue -u $USER
Typical columns:
- `JOBID` – unique job identifier
- `PARTITION` – queue/partition name
- `NAME` – job name
- `USER` – job owner
- `ST` – state (`PD`, `R`, `CG`, etc.)
- `TIME` – wall clock time since start
- `NODES` – number of nodes allocated
- `NODELIST(REASON)` – node list, or reason if pending
Common states you will see:
- `PD` – PENDING (waiting to start)
- `R` – RUNNING
- `CG` – COMPLETING (shutting down)
- `CA` – CANCELLED
- `F` – FAILED
- `CD` – COMPLETED (finished successfully)
- `TO` – TIMEOUT (ran out of requested wall time)
Filter for a specific job:
squeue -j 123456
Show more details (wide output):
squeue -u $USER -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"
`sacct`: looking at finished and historical jobs
squeue only shows active or very recently completed jobs. For past jobs (including failed or cancelled ones), use sacct.
Basic usage for today’s jobs:
sacct -u $USER
You can control the time range:
sacct -u $USER --starttime=2025-01-01
Useful columns:
sacct -j 123456 \
--format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,MaxRSS
Common fields:
- `State` – final state (COMPLETED, FAILED, CANCELLED, TIMEOUT, …)
- `ExitCode` – 0:0 usually means success; non-zero means some error
- `Elapsed` – how long the job ran
- `MaxRSS` – maximum resident set size (approx. max memory usage per task)
Note: on many systems, MaxRSS is only available after the job finishes.
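Some sites also install the contributed seff script, which summarizes CPU and memory efficiency for a completed job. If it is available on your cluster, it is a convenient complement to sacct:
seff 123456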
Inspecting job details while running
`scontrol show job`
scontrol gives detailed information about a job’s allocation and configuration.
scontrol show job 123456
You might see:
- Requested and allocated resources (CPUs, memory, nodes)
- Node list
- Current state and reason (for pending jobs)
- Time limits and remaining time
- Paths to standard output/error files
Reason is particularly useful for pending jobs:
- `Priority` – waiting for higher-priority jobs
- `Resources` – waiting for enough nodes/CPUs
- `AssocGrpCPUMinsLimit` or similar – hitting account/usage limits
- `Dependency` – waiting on another job to finish
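Because the full scontrol output is verbose, it can help to filter it for just the fields you care about (the keys below, such as JobState, Reason, and the log paths, are what scontrol prints on typical systems):
scontrol show job 123456 | grep -E 'JobState|Reason|TimeLimit|StdOut|StdErr'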
`sstat`: live resource usage
While a job is running, sstat can show live resource statistics per job step.
Basic example:
sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS,AveCPU
If you get no output, check which steps are present:
sstat -j 123456
Some clusters restrict sstat or only update it infrequently; behavior can be site-specific.
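Where sstat does work, a simple periodic refresh saves retyping the command (the job and step ID here are examples):
# Refresh live usage for the batch step every 30 seconds (Ctrl-C to stop)
watch -n 30 "sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS,AveCPU"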
Tracking output and error logs
Standard output and error files
When you submit a batch script, you typically specify:
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err
%j is replaced with the job ID. Monitoring often means:
- Using `tail -f` on the output file:
tail -f myjob_123456.out
- Checking error messages:
less myjob_123456.err
Common patterns:
- A job appears `R` in `squeue`, but log files remain empty:
  - Maybe your script never reaches the main computation (e.g., early exit).
  - Wrong working directory or missing input files (use `scontrol show job` to check `WorkDir`).
- Errors like `Segmentation fault` or Python tracebacks appear in `*.err` even when the job is still `R`.
Application-level progress indicators
For long-running jobs, it is good practice to have your code print progress messages, timestamps, or iteration numbers. This makes monitoring practical:
- "Step 10/100 completed" every few minutes
- Current simulation time or epoch/iteration
- Periodic flush of buffers (ensuring messages appear in the log promptly)
In C/C++, explicitly flush stdout or write with fprintf(stderr, ...). In Python, use print(..., flush=True) or run with -u for unbuffered output.
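A minimal sketch of what this can look like in a batch script, assuming a placeholder application called my_solver.py:
#!/bin/bash
#SBATCH --output=myjob_%j.out

echo "$(date '+%Y-%m-%d %H:%M:%S') job started on $(hostname)"

# -u keeps Python output unbuffered so progress lines appear in the log promptly
srun python -u my_solver.py

echo "$(date '+%Y-%m-%d %H:%M:%S') job finished"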
Monitoring resource usage on nodes
Using `ssh` or `srun` to inspect a running job
If your site allows node access, you can log into a node allocated to your job:
squeue -j 123456 -o "%.18i %.8T %.15M %.30R"
# Suppose NODELIST(REASON) shows 'node123'
ssh node123
Or use srun inside an interactive allocation.
Once on the node, tools like top, htop, nvidia-smi, ps help monitor:
- CPU utilization per process
- Memory consumption
- GPU utilization and memory (for GPU jobs)
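As a quick first look once you are on the node, a plain ps filtered to your own user is often enough (standard Linux procps options assumed):
# Your processes sorted by CPU usage: PID, CPU %, memory %, elapsed time, command
ps -u $USER -o pid,%cpu,%mem,etime,comm --sort=-%cpu | head -n 15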
Note: some clusters restrict direct ssh to compute nodes; in that case, prefer scheduler-provided tools (sstat, etc.).
GPU-specific monitoring
On GPU nodes, nvidia-smi is often available:
nvidia-smi
You can see:
- GPU utilization (%)
- GPU memory usage (MB)
- Processes using each GPU
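For a compact, periodically refreshing summary, nvidia-smi's query mode works well (the query fields shown are standard nvidia-smi options):
# Print GPU index, utilization, and memory every 5 seconds in CSV form
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5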
Some clusters provide job-filtered wrappers (e.g., srun --pty watch -n 1 nvidia-smi inside your allocation, or custom scripts) so you only see your own job’s usage.
Understanding job states and diagnosing issues
Interpreting pending reasons
Pending (PD) jobs can be normal, but sometimes they indicate a problem:
Use:
squeue -u $USER -t PD -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"
Common reasons and what they might mean:
- `Resources` – cluster is busy; your job is waiting for enough nodes/CPUs
- `Priority` – other jobs have higher priority; wait or discuss with support
- `QOSMaxWallDurationPerJobLimit` / `MaxNodesPerJob` – you requested more than allowed; adjust your script
- `ReqNodeNotAvail` – specific requested node(s) unavailable; maybe down or reserved
- `Dependency` – the job depends on other job(s) not yet satisfied (see your `--dependency` settings)
If a job stays pending for an unusually long time with a reason you don’t understand, note the job ID and contact support.
Recognizing jobs that are “stuck”
A job can be R but effectively doing nothing useful. Signs:
- The log file stops updating, while `TIME` in `squeue` keeps increasing.
- CPU usage on the nodes is low or zero (via `top`/`htop`/`sstat`).
- The application shows no progress indications for a long time.
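One quick check for the first sign is to compare the job's elapsed time with the log file's last modification time (the file name here is an example):
squeue -j 123456 -o "%.18i %.2t %.10M"
stat -c '%y' myjob_123456.out    # when the log was last written (GNU stat)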
Possible causes include:
- Deadlocks or hangs in parallel code
- Waiting for I/O or a network resource
- Waiting for a license (for licensed software)
In such cases, you may decide to cancel the job (covered in the dedicated chapter) and investigate offline, rather than wasting allocation time.
Checking exit codes and failure reasons
After a job finishes, sacct is your main tool:
sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,MaxRSS,AllocTRES%30
Typical patterns:
- `State=COMPLETED` and `ExitCode=0:0` – job exited normally
- `State=FAILED` and non-zero `ExitCode` – application or script error
- `State=TIMEOUT` – job hit the wall-time limit
- `State=CANCELLED` – manually cancelled or cancelled by the system (check the extended reason if available)
Look at logs (.out, .err) around the end of execution; errors often appear just before termination.
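For example, with the log naming scheme used earlier:
tail -n 50 myjob_123456.out
tail -n 50 myjob_123456.err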
Monitoring job arrays
Job arrays group many similar jobs under one array ID. Monitoring follows the same principles with slight syntax changes.
Submit example:
sbatch --array=0-9 myjob.sh
# Suppose the array ID is 789012
Each task has an ID like 789012_0, 789012_1, …
Monitoring a whole array in squeue:
squeue -j 789012
Monitoring specific tasks in sacct:
sacct -j 789012 --format=JobID,State,ExitCode
You can spot patterns:
- Are all tasks failing? Possibly a bug in your script or shared input.
- Only some tasks failing? Maybe specific input data problems or node-specific issues.
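To see such patterns at a glance, you can count array tasks by final state (a small sketch using standard sacct options):
# -X: one line per task (no job steps), -n: no header, -P: parsable output
sacct -j 789012 -X -n -P --format=State | sort | uniq -c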
Using email and notifications
Many clusters allow email notifications via SBATCH directives in your job script:
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=you@example.com
Common --mail-type values:
- `BEGIN` – when the job starts
- `END` – when the job finishes successfully
- `FAIL` – when the job fails
- `TIME_LIMIT` – when the job hits its time limit
- `ALL` – all of the above and a few more
This is useful for long jobs where you do not want to poll the scheduler constantly.
Some sites also provide web portals or dashboards; they often show the same info as squeue and sacct but with a graphical interface.
Simple monitoring workflows for beginners
While learning
- Submit a small test job.
- Immediately run `squeue -u $USER` to see its initial status.
- Tail logs with `tail -f myjob_%j.out` (after substituting the actual job ID).
- When the job finishes, run:
sacct -j JOBID --format=JobID,State,ExitCode,Elapsed,MaxRSS
For production runs
- Use clear `--output`/`--error` naming (include the job name, ID, maybe the date).
- Add progress messages in your code or script.
- Use `squeue` and/or email notifications to know when the job starts.
- Periodically check:
  - `sstat` (if available) for memory and CPU
  - Log updates with `tail`
- After completion, review `sacct` output and logs to confirm:
  - No hidden errors or warnings
  - Resources were used as expected (not vastly under- or over-requested)
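Some users wrap these routine checks in a small shell helper; here is a sketch (the function name myjobs and the selected fields are just suggestions):
# Add to ~/.bashrc: current queue status plus today's accounting records
myjobs() {
    squeue -u "$USER" -o "%.18i %.9P %.20j %.2t %.10M %.6D %R"
    sacct -u "$USER" --starttime="$(date +%Y-%m-%d)" \
          --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
}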
Over time, this monitoring information feeds back into better job sizing and more efficient use of the cluster.