Monitoring jobs

Why monitoring matters

Once a job is submitted, the scheduler controls when and where it runs, but you are still responsible for checking that it starts, that it runs correctly, that it uses the requested resources sensibly, and that it finishes successfully.

Monitoring is mostly about querying the scheduler and inspecting job-generated output.

This chapter focuses on typical SLURM-based systems, since they dominate HPC today. Other schedulers offer similar functionality with different commands and options.

Basic SLURM job status commands

`squeue`: listing running and pending jobs

squeue shows jobs known to the scheduler (pending, running, completing, etc.).

Basic usage to see your own jobs:

squeue -u $USER

Typical columns:

  • JOBID: the numeric job ID
  • PARTITION: the partition (queue) the job was submitted to
  • NAME: the job name
  • USER: the owning user
  • ST: the job state (see the list below)
  • TIME: how long the job has been running
  • NODES: the number of allocated nodes
  • NODELIST(REASON): the allocated nodes, or the reason the job is still pending

Common states you will see:

  • PD (pending): waiting for resources or priority
  • R (running)
  • CG (completing): the job is finishing up
  • CD (completed)
  • F (failed)
  • CA (cancelled)
  • TO (timeout): the job hit its time limit
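
If you want to keep an eye on the queue without retyping the command, a simple sketch (assuming the watch utility is available on the login node; the 30-second interval is arbitrary) is:

watch -n 30 squeue -u $USER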

Filter for a specific job:

squeue -j 123456

Show more details (wide output):

squeue -u $USER -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"

`sacct`: looking at finished and historical jobs

squeue only shows active or very recently completed jobs. For past jobs (including failed or cancelled ones), use sacct.

Basic usage for today’s jobs:

sacct -u $USER

You can control the time range:

sacct -u $USER --starttime=2025-01-01
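
You can also add an end time to narrow the window (the dates here are just illustrative):

sacct -u $USER --starttime=2025-01-01 --endtime=2025-01-31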

You can also request a specific set of useful columns for a particular job:

sacct -j 123456 \
  --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,MaxRSS

Common fields:

  • JobID: the job (and step) identifier
  • JobName: the name given at submission
  • State: the final state (COMPLETED, FAILED, CANCELLED, TIMEOUT, ...)
  • ExitCode: the exit code and, after the colon, the signal that terminated the job
  • Elapsed: wall-clock time used
  • MaxRSS: the peak resident memory of a step

Note: on many systems, MaxRSS is only available after the job finishes.
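
MaxRSS is also reported on the step rows (such as 123456.batch) rather than on the parent job line. As a sketch, assuming your sacct version supports the --units option, this shows per-step memory in megabytes:

sacct -j 123456 --units=M --format=JobID,JobName,State,Elapsed,MaxRSS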

Inspecting job details while running

`scontrol show job`

scontrol gives detailed information about a job’s allocation and configuration.

scontrol show job 123456

You might see:

  • JobState and Reason
  • RunTime and TimeLimit
  • NumNodes, NumCPUs and the NodeList
  • SubmitTime and StartTime
  • Partition, Account and QOS
  • WorkDir and the StdOut / StdErr paths

Reason is particularly useful for pending jobs: it tells you why the job has not started yet (for example Priority, Resources, or Dependency).
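
Because the full output is long, it can help to filter for the fields you care about, for example:

scontrol show job 123456 | grep -E 'JobState|Reason|TimeLimit'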

`sstat`: live resource usage

While a job is running, sstat can show live resource statistics per job step.

Basic example:

sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS,AveCPU

If you get no output, check which steps are present:

sstat -j 123456

Some clusters restrict sstat or only update it infrequently; behavior can be site-specific.
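
If you want to poll it periodically, a rough sketch that prints statistics every five minutes while the job is running (the interval and the .batch step name are arbitrary choices) is:

while squeue -h -j 123456 -t R | grep -q .; do
  sstat -j 123456.batch --format=JobID,MaxRSS,AveCPU
  sleep 300
done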

Tracking output and error logs

Standard output and error files

When you submit a batch script, you typically specify:

#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err

%j is replaced with the job ID. Monitoring often means:

  tail -f myjob_123456.out
  less myjob_123456.err

Common patterns:

  • Follow the output live with tail -f while the job runs.
  • Search the error file for keywords such as "error", "killed", or "out of memory".
  • Check whether the output file is still growing; a log that has not changed for hours can indicate a stalled job.
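
For example, to search the error file for suspicious messages (keywords and file name are just illustrative):

grep -iE 'error|killed|out of memory' myjob_123456.err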

Application-level progress indicators

For long-running jobs, it is good practice to have your code print progress messages, timestamps, or iteration numbers. This makes monitoring practical: from the log alone you can tell whether the job is still making progress and roughly how far along it is.

Be aware that output is usually buffered, so messages may not appear in the log immediately. In C/C++, explicitly flush stdout (fflush(stdout)) or write to stderr with fprintf(stderr, ...). In Python, use print(..., flush=True) or run the interpreter with -u for unbuffered output.
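
In a batch script, a minimal sketch (the script and program names are illustrative) looks like this:

# Python: -u disables output buffering so progress lines reach the .out file promptly
srun python -u analysis.py

# Programs you cannot modify: stdbuf (GNU coreutils) can force line-buffered output
srun stdbuf -oL -eL ./my_simulation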

Monitoring resource usage on nodes

Using `ssh` or `srun` to inspect a running job

If your site allows node access, you can log into a node allocated to your job:

squeue -j 123456 -o "%.18i %.8T %.15M %.30R"
# Suppose NODELIST(REASON) shows 'node123'
ssh node123

Or use srun inside an interactive allocation.

Once on the node, tools like top, htop, nvidia-smi, and ps help you monitor CPU load, memory consumption, GPU utilization, and whether your processes are actually doing useful work.
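
As a small sketch, once you are logged in, commands like these show your own processes and their resource usage:

top -u $USER
ps -u $USER -o pid,%cpu,%mem,etime,cmd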

Note: some clusters restrict direct ssh to compute nodes; in that case, prefer scheduler-provided tools (sstat, etc.).

GPU-specific monitoring

On GPU nodes, nvidia-smi is often available:

nvidia-smi

You can see:

  • GPU utilization per device
  • GPU memory used and total
  • Temperature and power draw
  • The processes currently using each GPU

Some clusters provide job-filtered wrappers or convenience commands (e.g., custom scripts, or running watch -n 1 nvidia-smi inside an srun --pty session) so you only see your own job’s usage.
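
For a compact, periodically refreshing view you can also query specific fields (the field list and the 5-second interval below are just one possible choice):

nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5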

Understanding job states and diagnosing issues

Interpreting pending reasons

Pending (PD) jobs can be normal, but sometimes they indicate a problem, such as a request that can never be satisfied or an account limit that has been reached.

To list only your pending jobs together with their reasons, use:

squeue -u $USER -t PD -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"

Common reasons and what they might mean:

  • Priority: other jobs currently have higher priority; usually just a matter of waiting.
  • Resources: the requested resources are not free yet.
  • Dependency: the job is waiting for another job (--dependency) to finish.
  • ReqNodeNotAvail: a requested node is down, reserved, or otherwise unavailable (often around maintenance windows).
  • Limit-related reasons (names vary by site, e.g. QOSMaxCpuPerUserLimit): you have hit a per-user or per-project limit.
  • PartitionTimeLimit: the request exceeds what the partition allows and will never start as submitted.

If a job stays pending for an unusually long time with a reason you don’t understand, note the job ID and contact support.

Recognizing jobs that are “stuck”

A job can be in state R but effectively doing nothing useful. Signs:

  • The output files stop growing even though the job keeps accumulating run time.
  • Progress messages stop appearing for much longer than an iteration normally takes.
  • CPU (or GPU) utilization on the allocated nodes is near zero.

Possible causes include:

  • A deadlock or hang in parallel code (e.g. an MPI rank waiting for a message that never arrives).
  • Waiting on a slow or unavailable file system, license server, or network resource.
  • An infinite loop, or a process stuck waiting for input that never arrives.

In such cases, you may decide to cancel the job (covered in the dedicated chapter) and investigate offline, rather than wasting allocation time.

Checking exit codes and failure reasons

After a job finishes, sacct is your main tool:

sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,MaxRSS,AllocTRES%30

Typical patterns:

  • COMPLETED with ExitCode 0:0: the job finished normally.
  • FAILED with a non-zero exit code (e.g. 1:0): the application itself returned an error; check the logs.
  • A non-zero signal after the colon (e.g. 0:9): the job was killed, often by the scheduler, for example for exceeding its memory request.
  • TIMEOUT: the job hit its wall-time limit.
  • CANCELLED: the job was cancelled by the user or an administrator.
  • OUT_OF_MEMORY (on systems that report it): the job exceeded its memory allocation.

Look at logs (.out, .err) around the end of execution; errors often appear just before termination.
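
For example, to look at the last lines of both log files (file names are illustrative):

tail -n 50 myjob_123456.out
tail -n 50 myjob_123456.err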

Monitoring job arrays

Job arrays group many similar jobs under one array ID. Monitoring follows the same principles with slight syntax changes.

Submit example:

sbatch --array=0-9 myjob.sh
# Suppose the array ID is 789012

Each task has an ID like 789012_0, 789012_1, …

Monitoring a whole array in squeue:

squeue -j 789012

Checking the state of every task with sacct:

sacct -j 789012 --format=JobID,State,ExitCode
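
An individual task can usually be queried with the combined ID, for example:

sacct -j 789012_3 --format=JobID,State,ExitCode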

You can spot patterns:

  • All tasks failed: the problem is probably in the job script or the environment, not the individual inputs.
  • Only some tasks failed: look at what is different about those inputs or parameters.
  • Some tasks hit TIMEOUT: the time limit may be too tight for certain inputs.
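
One rough way to list only the tasks that did not complete is to filter the output, for example:

sacct -j 789012 --format=JobID,State,ExitCode | grep -v COMPLETED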

Using email and notifications

Many clusters allow email notifications via SBATCH directives in your job script:

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=you@example.com

Common --mail-type values:

  • BEGIN: when the job starts
  • END: when the job finishes
  • FAIL: if the job fails
  • TIME_LIMIT (and variants such as TIME_LIMIT_80 on some systems): when the time limit is reached or approached
  • ALL: all of the above

This is useful for long jobs where you do not want to poll the scheduler constantly.

Some sites also provide web portals or dashboards; they often show the same info as squeue and sacct but with a graphical interface.

Simple monitoring workflows for beginners

While learning

  1. Submit a small test job.
  2. Immediately run squeue -u $USER to see its initial status.
  3. Tail logs with tail -f myjob_%j.out (after substituting the actual job ID).
  4. When the job finishes, run:
   sacct -j JOBID --format=JobID,State,ExitCode,Elapsed,MaxRSS

For production runs

  1. Use clear --output / --error naming (include job name, ID, maybe date).
  2. Add progress messages in your code or script.
  3. Use squeue and/or email notifications to know when the job starts.
  4. Periodically check:
    • sstat (if available) for memory and CPU
    • Log updates with tail
  5. After completion, review sacct output and logs to confirm:
    • No hidden errors or warnings
    • Resources were used as expected (not vastly under- or over-requested)

Over time, this monitoring information feeds back into better job sizing and more efficient use of the cluster.
