Framing a Large‑Scale Simulation Run
Running a large scale simulation on an HPC system is very different from running a small test on your laptop. You move from single executable runs to multi step workflows, from interactive testing to queued production jobs, and from “try and see” to carefully planned and monitored campaigns.
In the context of the final project, you assume that the scientific or engineering problem, the code base, and the general workflow are already in place. The focus here is how to plan, configure, launch, and manage simulations that use many nodes, many cores, or GPUs on a shared cluster.
A “large scale” run in this chapter means one or more of the following: using many nodes, using a significant fraction of the machine, running for many hours, or producing large amounts of data. You will treat these runs as production jobs, not exploratory tests.
From Small Tests to Production Runs
Before you scale up, you should already have:
A working code that runs correctly on a single node, ideally also on a small number of nodes.
A basic job script that can submit small runs to the scheduler.
A clear idea of which input parameters define problem size and which define numerical or physical configuration.
Large scale runs add a new layer on top of this. You must answer questions such as:
How does the problem size map to memory usage per process or per GPU?
How does runtime change when you increase the number of processes?
What is the maximum feasible scale before the code stops benefiting from more resources?
You will not resolve all performance details here, because those belong in the performance analysis chapter, but you must know enough about your code’s behavior to make reasonable scaling decisions.
Always perform small scale validation runs before any large scale production run. Never submit a large job that you have not tested on a smaller configuration of the same workflow and code version.
Choosing Appropriate Scale and Resources
Selecting resources for a large scale simulation is both a technical and a practical decision. Technically, you want enough nodes, cores, and memory to run efficiently. Practically, you need to respect queue limits, fair share policies, and project allocations.
Memory and Problem Size
Your first hard constraint is memory. You need to estimate how the total memory requirement grows with problem size and how it is distributed across processes or threads.
For many structured grid problems, if $N$ is the number of grid points, and each point stores $k$ variables of size $s$ bytes, then the raw data memory requirement is approximately
$$
M_{\text{data}} \approx N \cdot k \cdot s.
$$
This does not include overhead from halos, communication buffers, and temporary arrays. In practice you often multiply by a safety factor, for example 2 or 3.
If you use $P$ processes with a domain decomposition, and the domain is evenly partitioned, then a rough per process memory requirement is
$$
M_{\text{proc}} \approx \frac{M_{\text{data}}}{P} \cdot f_{\text{overhead}},
$$
where $f_{\text{overhead}}$ accounts for halos and temporary storage.
Compare $M_{\text{proc}}$ to the available memory per node and per core on your system. This determines a lower bound on how many nodes you must request.
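As a quick illustration with made-up numbers: a 3D grid of $1024^3$ points with $k = 5$ double precision variables per point ($s = 8$ bytes), run on $P = 512$ processes with a safety factor $f_{\text{overhead}} = 2$, needs roughly
$$
M_{\text{data}} \approx 1024^3 \cdot 5 \cdot 8\,\text{B} \approx 43\,\text{GB},
\qquad
M_{\text{proc}} \approx \frac{43\,\text{GB}}{512} \cdot 2 \approx 0.17\,\text{GB}.
$$
Against, say, 4 GB of memory per core, this fits comfortably; the same arithmetic quickly shows when a planned configuration does not.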
Runtime and Scaling Targets
For the final project, you should define a target runtime per simulation. Typical targets for a single production job range from one to several hours. If your projection suggests that a single run will take multiple days at your maximum allowed scale, you may need to reduce the physical problem size or adjust the level of resolution.
You can extrapolate runtime from previous smaller runs using simple models. If $T(N, P)$ is the runtime for problem size $N$ and parallelism $P$, and you have measured $T$ at one or two points, you can build a crude prediction. For example, if you observed near ideal weak scaling (constant time when $N$ and $P$ grow proportionally), then doubling $N$ while doubling $P$ will not change the runtime much. If you only have strong scaling data for a fixed $N$, expect diminishing returns in $T(N, P)$ as $P$ becomes large, because communication costs take over.
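One simple form such a model can take splits the runtime into work that parallelizes and overhead that does not:
$$
T(N, P) \approx \frac{T_{\text{comp}}(N)}{P} + T_{\text{comm}}(N, P),
$$
where $T_{\text{comp}}$ is the total computational work and $T_{\text{comm}}$ collects communication and synchronization costs, which typically grow, or at best stay flat, as $P$ increases. Fitting even this crude form to a handful of measured runs usually gives a usable wall time estimate.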
You do not need a perfect model, but you must have an estimate that lets you choose a job time limit (wall time) that is neither too short nor excessively long.
Queue Limits and Practical Constraints
Every system imposes limits, for example maximum wall time per job, maximum cores per job, or maximum GPUs per user. Large jobs can sit longer in the queue and are more likely to be killed if they exceed limits.
For large scale runs in a teaching or project environment, a good practice is to:
Aim for a job size of roughly 10 to 20 percent of the maximum allowed in the regular queue, rather than pushing against the limit.
Select a wall time limit that is perhaps 20 to 30 percent longer than your estimated runtime.
Split a very long simulation into multiple stages if necessary, so each stage fits within the maximum wall time.
You will see how to split runs in the section on checkpointing, restarts, and chained jobs.
Do not request far more resources or wall time than you reasonably need. On shared clusters, over requesting resources can reduce your priority and delay your own and others’ jobs.
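If your system runs SLURM, you can usually inspect these limits yourself before planning a run; the partition name below is only an example, and other schedulers have equivalent commands.
# Show wall time, node, and size limits for a partition (name is an example)
scontrol show partition standard
# List your own jobs with state, elapsed time, time limit, and node count
squeue -u $USER -o "%.10i %.12P %.20j %.8T %.10M %.10l %.6D"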
Preparing Inputs and Configuration for Large Runs
Large runs often need different input and configuration files than small tests. The scientific setup may be the same, but you might increase resolution, change timestep control, or adjust I/O frequency and checkpoint settings.
Scaling Scientific Parameters
If your simulation uses a spatial or temporal resolution parameter, such as grid spacing $\Delta x$ or timestep $\Delta t$, then a large scale run might refine one or both. This increases total work and memory.
You must connect these choices to your resource plan. Doubling the spatial resolution in each dimension of a 3D problem multiplies the number of grid points by approximately $2^3 = 8$. Memory and floating point work per step scale roughly with this factor, even if parallel efficiency remains similar, and if the timestep must also shrink with the grid spacing, for example through a CFL condition, the total work grows faster still.
Before finalizing large scale inputs, prepare a set of “tiered” configurations:
A small test configuration suitable for a single node and short runtime.
A medium configuration that already uses several nodes and runs for perhaps tens of minutes.
A full production configuration that uses your target scale and has the desired physical resolution and duration.
You should run at least the first two tiers successfully before attempting the third.
Adjusting Output and Checkpoint Intervals
At large scale, writing output too frequently can dominate runtime and overload the filesystem. You need to tune output frequency to balance scientific needs with performance and storage constraints.
If $t_{\text{sim}}$ is the total simulated time you want to cover and you output $N_{\text{out}}$ snapshots, the simulation outputs at intervals of
$$
\Delta t_{\text{out}} = \frac{t_{\text{sim}}}{N_{\text{out}}}.
$$
Given a rough cost of writing one snapshot, $T_{\text{io}}$, the total I/O time is approximately $N_{\text{out}} T_{\text{io}}$. If you know that $T_{\text{io}}$ grows significantly with the number of processes or with file size, you may want to limit $N_{\text{out}}$ or use lightweight diagnostic outputs rather than full state dumps.
Checkpoint intervals are governed by a balance between the risk of failure and the cost of writing checkpoints. If the mean time between failures of the system is $M$, and each checkpoint takes time $C$, then classic models suggest that an optimal checkpoint interval $I$ is on the order of
$$
I \approx \sqrt{2 M C}.
$$
You do not need to calculate this precisely for the course project, but you should understand the idea: too frequent checkpoints waste time, and too infrequent checkpoints increase expected lost work in case of failure.
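To get a feel for the magnitudes, with purely illustrative numbers: if the machine fails on average once per day ($M = 24$ h) and writing one checkpoint takes $C = 5$ min, then
$$
I \approx \sqrt{2 \cdot 86400\,\text{s} \cdot 300\,\text{s}} \approx 7200\,\text{s} \approx 2\,\text{h},
$$
so checkpointing every couple of hours, rather than every few minutes, is the right order of magnitude in that scenario.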
In practice, in a teaching setting, you may simply write a checkpoint at a fixed interval of simulated time or after a fixed number of steps, so that you can safely restart if a job ends due to the wall time limit or another interruption.
Writing Job Scripts for Large Simulations
You will use your job scheduler to launch large simulations in batch mode. The basic structure of a job script is already covered elsewhere in the course. Here the focus is on what changes when scaling up.
Resource Requests and Mapping
In a large scale job script, you must choose:
The number of nodes (--nodes in SLURM or equivalent).
The number of tasks or processes (--ntasks).
The number of tasks per node and perhaps the number of CPUs per task.
The number of GPUs per node if you use accelerators.
Your choices must match the parallel programming model of your code. For example, if you use pure MPI, and each node has 32 cores, you might choose 4 nodes and 32 tasks per node, for a total of 128 tasks. If you use a hybrid MPI plus OpenMP configuration, with 4 MPI processes per node and 8 threads per process on the same 32 core node, you would request 4 tasks per node and 8 CPUs per task.
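As a sketch, the hybrid layout just described could be requested with directives like the following; the input file name is a placeholder, and option names may differ on non SLURM schedulers.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8

# Give each MPI rank as many OpenMP threads as CPUs it was allocated
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_simulation_exe input_hybrid.in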
A typical SLURM job script segment for a moderately large MPI job could look like:
#!/bin/bash
#SBATCH --job-name=proj_large_run
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32
#SBATCH --time=08:00:00
#SBATCH --partition=standard
#SBATCH --output=logs/proj_large_run_%j.out
#SBATCH --error=logs/proj_large_run_%j.err
module load myproject/deps
srun ./my_simulation_exe input_large.in
The exact directives will differ by site, but for large runs you should:
Use informative job names that indicate scale or configuration.
Direct output and error to log files in a dedicated directory.
Load all required modules and set environment variables explicitly, so the environment is reproducible.
File System Layout and Working Directories
Large simulations produce many files. You must choose a good layout of directories to keep things manageable and to minimize interference between runs.
A simple layout for one project might use:
A bin directory for executables.
An inputs directory for configuration files.
A runs directory where each large scale run has its own subdirectory.
A logs directory for scheduler output.
Within runs, each run could have a name that encodes the scale and date, for example strongscale_128nodes_2026-01-05. Your job script should cd into the run directory before launching the executable, and all simulation outputs should be written there.
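A minimal sketch of the corresponding job script fragment, assuming your site provides a scratch area in the SCRATCH environment variable and that executables and inputs live under the directory from which you submit:
# Create a per run directory named after scale and date, then work inside it
RUN_DIR=${SCRATCH}/myproject/runs/strongscale_128nodes_$(date +%F)
mkdir -p "${RUN_DIR}"
cp inputs/input_large.in "${RUN_DIR}/"
cd "${RUN_DIR}"
srun ${SLURM_SUBMIT_DIR}/bin/my_simulation_exe input_large.in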
For large scale runs, never write large outputs to your home directory or any filesystem not intended for high volume data. Always write to a project or scratch space recommended by your HPC site.
Managing Output, Logs, and Intermediate Data
Handling output correctly is crucial for large simulations. Good habits here save time in analysis and help you debug failed runs.
Standard Output and Error
At large scale, writing extensive messages to standard output or error during runtime can hurt performance, because those streams are often collected and written to a single file. You should:
Reduce verbosity of logging in production runs, unless you are debugging.
Ensure that important configuration information and a brief summary of progress are still printed, for example at startup or at coarse intervals.
Ensure that your job script collects stdout and stderr into log files tagged with the job ID (for example using %j in SLURM), so you can match logs to specific jobs.
If your application uses rank dependent printing, you may want only a single process to print general messages, to avoid thousands of nearly identical lines.
Managing Large Output Files
Large simulations often output fields, trajectories, or snapshots with sizes from gigabytes to terabytes. This raises several concerns.
First, you must think about file formats. If your code supports parallel friendly formats, such as HDF5 or NetCDF, they are preferable to many separate text or binary files. Parallel I/O concepts and file formats are covered elsewhere, but practically, you want to limit the number of files and use structured formats where possible.
Second, you must monitor disk usage. Most clusters enforce quotas. You should regularly check how much space your run directories use and delete unnecessary intermediate outputs or temporary files.
Third, consider compressing older data or converting it to reduced products once you complete your analysis. For example, you might extract summary statistics or lower resolution versions and then remove the full resolution raw snapshots that you no longer need.
Checkpointing, Restarts, and Chained Jobs
Long running large scale simulations are rarely completed in a single uninterrupted job. Time limits and occasional failures make restarts essential.
Checkpoints and Restart Strategy
A checkpoint is a saved state of your simulation that is sufficient to resume from that point. For large jobs, you should design your simulation workflow around checkpoints.
A simple strategy is:
Decide a frequency of checkpoints in terms of simulation time or steps, for example every 1000 steps.
Ensure that each checkpoint file is written atomically and that the code can verify its integrity upon restart.
Place checkpoint files in a dedicated subdirectory, such as checkpoints/, with names that encode the step or time.
Once you know the approximate runtime between checkpoints, you can compare it to your wall time. For example, if each 500 steps takes about 1 hour, and you checkpoint every 1000 steps, then the interval between checkpoints is about 2 hours. If your wall time limit is 8 hours, you can safely aim to run for at least three checkpoints per job.
Restarting then uses the latest checkpoint file as input and continues the simulation. Your input configuration should indicate whether the run starts from an initial condition or a checkpoint.
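How this is wired up is application specific, but the job script itself can often make the decision. A minimal sketch, assuming checkpoints are named with zero padded step numbers and that the code accepts a hypothetical --restart option:
# Pick the newest checkpoint if one exists, otherwise start from the initial condition
latest=$(ls checkpoints/state_*.h5 2>/dev/null | sort | tail -n 1)
if [ -n "${latest}" ]; then
    srun ./my_simulation_exe input_large.in --restart "${latest}"
else
    srun ./my_simulation_exe input_large.in
fi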
Always test your restart procedure on a small scale run before relying on it in a large production campaign. A broken or untested restart path can waste many hours of compute time if a large job fails and you cannot resume.
Chained Jobs and Job Dependencies
When one simulation needs more wall time than the maximum allowed, you can chain multiple jobs together, each continuing from the previous checkpoint. Many schedulers support job dependencies.
For example, with SLURM, suppose you submit an initial job:
jid1=$(sbatch run_large_part1.sh | awk '{print $4}')
You can then submit a second job that depends on successful completion of the first:
sbatch --dependency=afterok:${jid1} run_large_part2.sh
In run_large_part2.sh, you configure the input to read the latest checkpoint produced by the first job.
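For longer chains the same idea generalizes to a small submission loop. A sketch, assuming each stage runs the same script and picks up the latest checkpoint as described above; the stage count and script name are placeholders.
# Submit a chain of jobs, each starting only after the previous one completes successfully
N_STAGES=4
prev=$(sbatch --parsable run_large_stage.sh)   # --parsable prints just the job ID
for i in $(seq 2 ${N_STAGES}); do
    prev=$(sbatch --parsable --dependency=afterok:${prev} run_large_stage.sh)
done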
For the final project, you might not need complex chains, but it is valuable to understand that large campaigns often consist of many batches of jobs that depend on each other through checkpoints or parameter sweeps.
Monitoring and Controlling Large Runs
Once a large scale simulation is running, you must track its progress and be ready to intervene if something is wrong. Monitoring has two sides: resource usage and scientific or numerical health.
Scheduler Level Monitoring
Your scheduler offers commands to inspect running jobs. For a large job, you should observe:
Current state, such as running, pending, or completing.
Elapsed runtime and remaining wall time.
Node allocation and any signs that the job is in a draining or requeuing state.
You can check per node CPU and memory usage through system tools, and many sites provide web dashboards that show job performance. Look for signs that the job is using less CPU than expected or is close to memory limits. Low CPU usage can indicate load imbalance, I/O bottlenecks, or synchronization issues.
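With SLURM, for example, a few standard commands cover most of this; the job ID is a placeholder.
# State, elapsed time, time limit, node count, and node list of a job
squeue -j 123456 -o "%.10i %.8T %.10M %.10l %.6D %R"
# CPU and memory usage of a running job (you may need to name a step, e.g. 123456.batch)
sstat -j 123456 --format=JobID,AveCPU,AveRSS,MaxRSS
# Accounting summary once the job has finished
sacct -j 123456 --format=JobID,Elapsed,State,MaxRSS,NNodes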
If you see that the job is approaching its wall time limit but is far from completing the next checkpoint, you might need to shorten checkpoint intervals for future runs or request longer wall time within allowed limits.
Application Level Monitoring
Within your code output or logs, you should include lightweight progress indicators. Typical choices include printing the current step, simulated time, and perhaps a small set of physical quantities at coarse intervals.
A simple format might be:
Step 5000, t = 1.250, dt = 2.5e-04, max(u) = 1.02, CFL = 0.8
With such lines written every few hundred or thousand steps, you can:
Verify that the simulation is advancing.
Check that critical quantities remain within expected ranges.
Monitor how much simulated time you cover per wall clock hour.
In large runs, do not print progress every step, because it will slow the code and bloat your logs. Instead, use a configurable frequency.
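With progress lines like the one above, a quick look at the log usually tells you what you need; the log file name below matches the earlier job script, with a placeholder job ID.
# Show the most recent progress lines of a running job
grep "Step" logs/proj_large_run_123456.out | tail -n 5
# Or follow the log live
tail -f logs/proj_large_run_123456.out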
When to Cancel a Large Run
Sometimes the best decision is to cancel a running job. For example, you might choose to cancel if:
The logs show numerical instabilities or divergence.
The runtime per step is much higher than in test runs, suggesting misconfiguration.
The job uses far fewer resources than allocated, perhaps due to a bug that disables parallelism.
In these cases, canceling quickly avoids wasting allocation and queue time. After canceling, investigate the cause, correct input or code, test on a small run, and then resubmit.
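Cancelling itself is a single scheduler command; with SLURM, for example:
# Cancel one job by ID, or every job of yours with a given name
scancel 123456
scancel --name=proj_large_run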
Running Campaigns of Large Simulations
Often, your final project will not consist of a single large run, but a set of related runs. For example, you might vary one physical parameter, initial condition, or resolution. Running a whole campaign requires more structure.
Parameter Sweeps at Scale
A parameter sweep is a set of simulations that differ only in certain parameter values. At large scale, you must decide whether to:
Run fewer very large jobs that explore parameters sequentially.
Run many smaller jobs in parallel, each using fewer resources.
The choice depends on how your code scales and on queue policies. On some systems, many small jobs are scheduled more flexibly than a few huge jobs. On others, large jobs get priority.
For your project, it is often better to:
Choose one or two scales for strong or weak scaling studies, or for a limited parameter sweep.
Automate job submission with simple scripts that generate input files and submit jobs, so you avoid manual errors (see the sketch after this list).
Keep a clear record of which job ID corresponds to which parameter set.
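A minimal sketch of such a submission script, assuming an input template containing a PARAM_A placeholder, a per parameter job script, and the directory scheme shown in the next subsection:
# Generate an input file and submit one job per parameter value,
# keeping a record of which job ID belongs to which parameter set
for a in 0.1 0.2 0.4; do
    dir=runs/paramA_${a}
    mkdir -p "${dir}"
    sed "s/PARAM_A/${a}/" inputs/input_template.in > "${dir}/input.in"
    jid=$(sbatch --parsable --chdir="${dir}" run_sweep.sh)
    echo "${jid} paramA=${a}" >> campaign_jobs.txt
done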
Organizing Campaign Data
When you run multiple large simulations, the risk of confusion grows. To avoid this, enforce a consistent naming and directory scheme. For example:
runs/paramA_0.1/strongscale_64nodes/
runs/paramA_0.2/strongscale_64nodes/
Each run directory should contain at least:
A copy of the input files used in that run.
A copy of the executable or a note of the exact version and build configuration.
The scheduler output and error logs.
A brief text or markdown file describing the purpose of the run and any notable events, such as restarts or anomalies.
Such organization is invaluable later when you analyze performance and prepare your final report.
Validation and Sanity Checks at Large Scale
Scaling up can reveal numerical or algorithmic issues that were not visible in small tests. For example:
Different decomposition sizes can change the order of floating point reductions, which slightly alters results.
Higher resolution may require smaller timesteps to maintain stability.
Different communication patterns can interact with numerical schemes in subtle ways.
You must plan simple sanity checks for your large simulations.
Consistency Across Scales
Whenever possible, compare results from a large run to smaller runs. For example:
If you run the same physical problem at two resolutions, examine whether key features converge as expected.
If you change the number of processes for a fixed problem size, check that the main diagnostics (integrated quantities, final averages) are within acceptable differences.
Define tolerance levels appropriate for your application, such as relative differences below a few percent, or similar qualitative behavior of key fields.
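In practice this often reduces to checking a few scalar diagnostics $Q$ from the large run against a trusted reference run and requiring
$$
\frac{\lvert Q_{\text{large}} - Q_{\text{ref}} \rvert}{\lvert Q_{\text{ref}} \rvert} \le \varepsilon,
$$
with a tolerance $\varepsilon$ you choose and justify, for example $10^{-2}$ for integrated quantities that are only expected to agree to within a few percent.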
Detecting Instabilities Early
Numerical instabilities can waste hours of compute time if not caught early. To detect them:
Monitor simple norms, such as the maximum or L2 norm of fields, and check for sudden uncontrolled growth.
Monitor conserved quantities when your problem has conservation laws, such as mass or energy, and check that they behave as theory predicts.
Your application may already have diagnostic tools to compute these. For large scale runs, you should enable these diagnostics at low overhead and inspect them periodically during the run and immediately after completion.
If you detect problems, you can adjust parameters such as timestep size, numerical scheme options, or boundary condition implementations. Always re test such changes in smaller runs before attempting another large scale simulation.
Tying It Together for the Final Project
For the purpose of this course, the “Running large scale simulations” component of your final project should demonstrate that you can:
Design a production level job that uses multiple nodes or GPUs in a justified way.
Prepare inputs, outputs, and directory structures suitable for large, reproducible runs.
Use checkpoints and, if needed, chained jobs to navigate wall time limits.
Monitor and, when necessary, control or cancel jobs responsibly.
Carry out basic validation and sanity checks on large scale results.
You do not need to reach the physical limits of your cluster. Instead, you should select a scale that is meaningfully larger than your initial tests, show that you can run reliably at that scale, and document your workflow clearly. The skills you develop here, planning and executing large simulations methodically, are directly transferable to real world HPC environments where jobs may use hundreds or thousands of nodes and run for days.