Why monitoring matters
Once a job is submitted, the scheduler controls when and where it runs, but you are still responsible for:
- Checking whether it actually started
- Tracking resource usage (CPUs, memory, GPUs, time)
- Detecting problems early (stuck, failing, oversubscribing resources)
- Collecting information needed for performance tuning later
Monitoring is mostly about querying the scheduler and inspecting job-generated output.
This chapter focuses on typical SLURM-based systems, since they dominate HPC today. Other schedulers offer similar functionality with different commands and options.
Basic SLURM job status commands
`squeue`: listing running and pending jobs
squeue shows jobs known to the scheduler (pending, running, completing, etc.).
Basic usage to see your own jobs:
squeue -u $USER
Typical columns:
- `JOBID` – unique job identifier
- `PARTITION` – queue/partition name
- `NAME` – job name
- `USER` – job owner
- `ST` – state (`PD`, `R`, `CG`, etc.)
- `TIME` – wall clock time since start
- `NODES` – number of nodes allocated
- `NODELIST(REASON)` – node list, or reason if pending
Common states you will see:
- `PD` – PENDING (waiting to start)
- `R` – RUNNING
- `CG` – COMPLETING (shutting down)
- `CA` – CANCELLED
- `F` – FAILED
- `CD` – COMPLETED (finished successfully)
- `TO` – TIMEOUT (ran out of requested wall time)
Filter for a specific job:
squeue -j 123456
Show more details (wide output):
squeue -u $USER -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"
`sacct`: looking at finished and historical jobs
squeue only shows active or very recently completed jobs. For past jobs (including failed or cancelled ones), use sacct.
Basic usage for today’s jobs:
sacct -u $USER
You can control the time range:
sacct -u $USER --starttime=2025-01-01
Useful columns:
sacct -j 123456 \
--format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,MaxRSS
Common fields:
- `State` – final state (COMPLETED, FAILED, CANCELLED, TIMEOUT, …)
- `ExitCode` – 0:0 usually means success; non-zero means some error
- `Elapsed` – how long the job ran
- `MaxRSS` – maximum resident set size (approx. max memory usage per task)
Note: on many systems, MaxRSS is only available after the job finishes.
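Some sites also install the contributed seff script, which summarizes CPU and memory efficiency for a completed job. If it is available on your cluster, it is a convenient complement to sacct:
seff 123456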
Inspecting job details while running
`scontrol show job`
scontrol gives detailed information about a job’s allocation and configuration.
scontrol show job 123456
You might see:
- Requested and allocated resources (CPUs, memory, nodes)
- Node list
- Current state and reason (for pending jobs)
- Time limits and remaining time
- Paths to standard output/error files
Reason is particularly useful for pending jobs:
- `Priority` – waiting for higher-priority jobs
- `Resources` – waiting for enough nodes/CPUs
- `AssocGrpCPUMinsLimit` or similar – hitting account/usage limits
- `Dependency` – waiting on another job to finish
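Because the full scontrol output is verbose, it can help to filter it for just the fields you care about (the keys below, such as JobState, Reason, and the log paths, are what scontrol prints on typical systems):
scontrol show job 123456 | grep -E 'JobState|Reason|TimeLimit|StdOut|StdErr'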
`sstat`: live resource usage
While a job is running, sstat can show live resource statistics per job step.
Basic example:
sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS,AveCPU
If you get no output, check which steps are present:
sstat -j 123456
Some clusters restrict sstat or only update it infrequently; behavior can be site-specific.
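Where sstat does work, a simple periodic refresh saves retyping the command (the job and step ID here are examples):
# Refresh live usage for the batch step every 30 seconds (Ctrl-C to stop)
watch -n 30 "sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS,AveCPU"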
Tracking output and error logs
Standard output and error files
When you submit a batch script, you typically specify:
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err
%j is replaced with the job ID. Monitoring often means:
- Using `tail -f` on the output file:
tail -f myjob_123456.out
- Checking error messages:
less myjob_123456.err
Common patterns:
- A job appears `R` in `squeue`, but log files remain empty:
  - Maybe your script never reaches the main computation (e.g., early exit).
  - Wrong working directory or missing input files (use `scontrol show job` to check `WorkDir`).
- Errors like `Segmentation fault` or Python tracebacks appear in `*.err` even when the job is still `R`.
Application-level progress indicators
For long-running jobs, it is good practice to have your code print progress messages, timestamps, or iteration numbers. This makes monitoring practical:
- "Step 10/100 completed" every few minutes
- Current simulation time or epoch/iteration
- Periodic flush of buffers (ensuring messages appear in the log promptly)
In C/C++, explicitly flush stdout or write with fprintf(stderr, ...). In Python, use print(..., flush=True) or run with -u for unbuffered output.
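A minimal sketch of what this can look like in a batch script, assuming a placeholder application called my_solver.py:
#!/bin/bash
#SBATCH --output=myjob_%j.out

echo "$(date '+%Y-%m-%d %H:%M:%S') job started on $(hostname)"

# -u keeps Python output unbuffered so progress lines appear in the log promptly
srun python -u my_solver.py

echo "$(date '+%Y-%m-%d %H:%M:%S') job finished"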
Monitoring resource usage on nodes
Using `ssh` or `srun` to inspect a running job
If your site allows node access, you can log into a node allocated to your job:
squeue -j 123456 -o "%.18i %.8T %.15M %.30R"
# Suppose NODELIST(REASON) shows 'node123'
ssh node123
Or use srun inside an interactive allocation.
Once on the node, tools like top, htop, nvidia-smi, ps help monitor:
- CPU utilization per process
- Memory consumption
- GPU utilization and memory (for GPU jobs)
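As a quick first look once you are on the node, a plain ps filtered to your own user is often enough (standard Linux procps options assumed):
# Your processes sorted by CPU usage: PID, CPU %, memory %, elapsed time, command
ps -u $USER -o pid,%cpu,%mem,etime,comm --sort=-%cpu | head -n 15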
Note: some clusters restrict direct ssh to compute nodes; in that case, prefer scheduler-provided tools (sstat, etc.).
GPU-specific monitoring
On GPU nodes, nvidia-smi is often available:
nvidia-smi
You can see:
- GPU utilization (%)
- GPU memory usage (MB)
- Processes using each GPU
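For a compact, periodically refreshing summary, nvidia-smi's query mode works well (the query fields shown are standard nvidia-smi options):
# Print GPU index, utilization, and memory every 5 seconds in CSV form
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5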
Some clusters provide job-filtered wrappers (e.g., srun --pty watch -n 1 nvidia-smi inside your allocation, or custom scripts) so you only see your own job’s usage.
Understanding job states and diagnosing issues
Interpreting pending reasons
Pending (PD) jobs can be normal, but sometimes they indicate a problem:
Use:
squeue -u $USER -t PD -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"
Common reasons and what they might mean:
- `Resources` – cluster is busy; your job is waiting for enough nodes/CPUs
- `Priority` – other jobs have higher priority; wait or discuss with support
- `QOSMaxWallDurationPerJobLimit` / `MaxNodesPerJob` – you requested more than allowed; adjust your script
- `ReqNodeNotAvail` – specific requested node(s) unavailable; maybe down or reserved
- `Dependency` – the job depends on other job(s) not yet satisfied (see your `--dependency` settings)
If a job stays pending for an unusually long time with a reason you don’t understand, note the job ID and contact support.
Recognizing jobs that are “stuck”
A job can be R but effectively doing nothing useful. Signs:
- The log file stops updating, while `TIME` in `squeue` keeps increasing.
- CPU usage on the nodes is low or zero (via `top`/`htop`/`sstat`).
- The application shows no progress indications for a long time.
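One quick check for the first sign is to compare the job's elapsed time with the log file's last modification time (the file name here is an example):
squeue -j 123456 -o "%.18i %.2t %.10M"
stat -c '%y' myjob_123456.out    # when the log was last written (GNU stat)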
Possible causes include:
- Deadlocks or hangs in parallel code
- Waiting for I/O or a network resource
- Waiting for a license (for licensed software)
In such cases, you may decide to cancel the job (covered in the dedicated chapter) and investigate offline, rather than wasting allocation time.
Checking exit codes and failure reasons
After a job finishes, sacct is your main tool:
sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,MaxRSS,AllocTRES%30
Typical patterns:
- `State=COMPLETED` and `ExitCode=0:0` – job exited normally
- `State=FAILED` and non-zero `ExitCode` – application or script error
- `State=TIMEOUT` – job hit the wall-time limit
- `State=CANCELLED` – manually cancelled or cancelled by the system (check the extended reason if available)
Look at logs (.out, .err) around the end of execution; errors often appear just before termination.
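For example, with the log naming scheme used earlier:
tail -n 50 myjob_123456.out
tail -n 50 myjob_123456.err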
Monitoring job arrays
Job arrays group many similar jobs under one array ID. Monitoring follows the same principles with slight syntax changes.
Submit example:
sbatch --array=0-9 myjob.sh
# Suppose the array ID is 789012
Each task has an ID like 789012_0, 789012_1, …
Monitoring a whole array in squeue:
squeue -j 789012
Monitoring specific tasks in sacct:
sacct -j 789012 --format=JobID,State,ExitCode
You can spot patterns:
- Are all tasks failing? Possibly a bug in your script or shared input.
- Only some tasks failing? Maybe specific input data problems or node-specific issues.
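To see such patterns at a glance, you can count array tasks by final state (a small sketch using standard sacct options):
# -X: one line per task (no job steps), -n: no header, -P: parsable output
sacct -j 789012 -X -n -P --format=State | sort | uniq -c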
Using email and notifications
Many clusters allow email notifications via SBATCH directives in your job script:
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=you@example.com
Common --mail-type values:
- `BEGIN` – when the job starts
- `END` – when the job finishes successfully
- `FAIL` – when the job fails
- `TIME_LIMIT` – when the job hits its time limit
- `ALL` – all of the above and a few more
This is useful for long jobs where you do not want to poll the scheduler constantly.
Some sites also provide web portals or dashboards; they often show the same info as squeue and sacct but with a graphical interface.
Simple monitoring workflows for beginners
While learning
- Submit a small test job.
- Immediately run `squeue -u $USER` to see its initial status.
- Tail logs with `tail -f myjob_%j.out` (after substituting the actual job ID).
- When the job finishes, run:
sacct -j JOBID --format=JobID,State,ExitCode,Elapsed,MaxRSS
For production runs
- Use clear `--output`/`--error` naming (include the job name, ID, maybe the date).
- Add progress messages in your code or script.
- Use `squeue` and/or email notifications to know when the job starts.
- Periodically check:
  - `sstat` (if available) for memory and CPU
  - Log updates with `tail`
- After completion, review `sacct` output and logs to confirm:
  - No hidden errors or warnings
  - Resources were used as expected (not vastly under- or over-requested)
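Some users wrap these routine checks in a small shell helper; here is a sketch (the function name myjobs and the selected fields are just suggestions):
# Add to ~/.bashrc: current queue status plus today's accounting records
myjobs() {
    squeue -u "$USER" -o "%.18i %.9P %.20j %.2t %.10M %.6D %R"
    sacct -u "$USER" --starttime="$(date +%Y-%m-%d)" \
          --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
}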
Over time, this monitoring information feeds back into better job sizing and more efficient use of the cluster.