From Prototype to Large-Scale Run
Running a large-scale simulation is not just “using more cores.” It is a process that links your science/engineering question, your code, the cluster, and the scheduler in a coordinated way. In the final project, the goal is to move from a working small prototype to a robust, efficient run at meaningful scale.
This chapter focuses on:
- When and how to scale up from test runs
- How to design and size your large jobs
- Practical workflow for running and managing big simulations
- Common pitfalls and how to avoid wasting time and allocation
You should already have:
- A working code (or application) that runs correctly on a small problem
- Basic familiarity with your cluster, scheduler (e.g. SLURM), and modules
- Some understanding of MPI/OpenMP/GPU use if your code uses them
Planning a Large-Scale Simulation
Clarifying the simulation goal
Before you submit a “huge” job, define:
- Scientific/engineering objective:
- What question should the simulation answer?
- What output metrics matter (e.g., error norms, convergence rate, statistics, visualizations)?
- Required resolution/size:
- Space: grid size, number of particles, mesh elements, etc.
- Time: number of timesteps, simulated physical time, number of iterations.
- Required accuracy:
- What discretization or model error is acceptable?
- Can you tolerate some approximation for speed?
Write this down. It will guide problem size, runtime, and what is “good enough.”
Choosing problem size and resources
Your code has some measure of “problem size” $N$ (e.g., grid points, particles, unknowns). A large-scale run might increase:
- $N$ (higher resolution, more particles)
- Simulation time (more timesteps)
- Both together
However, you cannot just multiply everything arbitrarily. You must consider:
- Memory requirements:
- Estimate memory per data structure at your target size.
- Compute memory per process/thread and per node.
- Ensure:
`memory_per_node_used < memory_per_node_available`, with some margin.
- Expected runtime:
- Use small or medium test runs to estimate time-per-step:
$$ T_{\text{large}} \approx T_{\text{small}} \times \frac{\text{work}_{\text{large}}}{\text{work}_{\text{small}}} \times \frac{\text{cores}_{\text{small}}}{\text{cores}_{\text{large}}} \times \text{overhead factor} $$
- Add a margin (e.g., 20–50%) for overhead, I/O, and slower-than-ideal scaling; a short worked example follows this list.
- I/O volume:
- Estimate output size:
$$ \text{output size} \approx \text{size per snapshot} \times \text{number of snapshots} \times \text{number of files} $$
- Check quota and filesystem recommendations.
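As a quick worked example with made-up numbers: suppose a medium test run takes 2 s per timestep on 128 cores, the target run has 8× the work and will use 512 cores, and you assume an overhead factor of 1.3. Then

$$ T_{\text{large}} \approx 2\,\text{s} \times 8 \times \frac{128}{512} \times 1.3 = 5.2\,\text{s per step}, $$

so 20,000 timesteps cost roughly $5.2 \times 20{,}000 \approx 1.0 \times 10^{5}\,\text{s} \approx 29$ hours of walltime, which already tells you the run will need to be split into segments (see the walltime discussion below).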
Aim for jobs that are:
- Large enough to be scientifically meaningful
- Small enough to fit within:
- Your allocation (core-hours / GPU-hours)
- Walltime limits
- Storage limits
Pre-Scaling Checks
Verifying numerical and physical correctness at small scale
Do not discover basic bugs at 10,000 cores. Before scaling:
- Verify correctness:
- Compare against analytical solutions (if available) or trusted reference results.
- Run regression tests you designed during the project.
- Check physical stability:
- No unphysical blow-up (e.g., negative densities, NaNs).
- Validate with a short run at the intended resolution (or slightly smaller).
If you cannot justify the model and numerics at small scale, you are not ready to scale.
Establishing baseline performance
Before going “large,” gather:
- Runtime per step (or per iteration) on:
- 1 core / 1 GPU
- 1 node
- A small number of nodes
- Parallel scaling behavior:
- A few strong- or weak-scaling tests with your real physics enabled but a small problem size (a sketch of such a sweep appears at the end of this subsection).
Record:
- Number of nodes / cores / GPUs
- Problem size
- Walltime, memory usage, and I/O behavior
- Any performance anomalies
These baselines help you:
- Predict walltimes and core-hour cost of large runs
- Detect regressions when you modify the code or input
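A minimal sketch of how such a scaling sweep might be driven on a SLURM cluster (the executable `run_sim`, its input file, and the chosen node counts are placeholders; adapt to your code and site):

```bash
#!/bin/bash
# Submit the same small problem on an increasing number of nodes to
# collect strong-scaling baselines. All names below are placeholders.
for NODES in 1 2 4 8; do
  sbatch --job-name="baseline_${NODES}n" \
         --nodes="${NODES}" \
         --ntasks-per-node=32 \
         --time=00:30:00 \
         --output="baseline_${NODES}n_%j.log" \
         --wrap="srun ./run_sim input_small.cfg"
done
# Afterwards, pull time-per-step and peak memory from the logs (or from
# sacct) and record them next to the node count and problem size.
```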
Designing Large-Scale Job Configurations
Choosing parallel decomposition and layout
For MPI/OpenMP-style applications:
- Decide:
`nodes × tasks_per_node × cpus_per_task` (and `gpus_per_task` if applicable)
- Match the code's decomposition to the hardware:
- For domain decomposition codes:
- Ratio of processes in each dimension should match domain geometry where possible (e.g., a 3D grid: $N_x : N_y : N_z$ should roughly match process grid $P_x : P_y : P_z$).
- For particle or ensemble simulations:
- Map each independent simulation or domain to MPI ranks or array jobs.
For GPU applications:
- Typically 1–4 GPUs per node; ensure the code is configured to use all requested GPUs.
- Avoid severe CPU–GPU imbalance: do not request more CPUs than needed for feeding GPUs.
Document the mapping you use; you will need it when interpreting performance.
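As an illustration, a hybrid MPI+OpenMP layout on nodes with two 16-core sockets might be requested like this (a sketch; `run_sim` and the `--process-grid` flag stand in for whatever your code provides):

```bash
#!/bin/bash
#SBATCH --job-name=decomp_test
#SBATCH --nodes=8              # 8 nodes ...
#SBATCH --ntasks-per-node=2    # ... with one MPI rank per socket
#SBATCH --cpus-per-task=16     # and one OpenMP thread per core
#SBATCH --time=02:00:00

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"

# 8 nodes x 2 ranks = 16 MPI ranks; a 4 x 2 x 2 process grid roughly
# matching the domain geometry is passed via a (code-specific) flag.
srun ./run_sim --process-grid 4x2x2 input.cfg
```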
Walltime sizing and job partitioning
A large-scale simulation may not fit comfortably in a single job. Consider:
- Walltime limits from the scheduler
- Checkpointing capability of the code
- Failure risks (hardware faults, preemption, user mistakes)
A common strategy:
- Choose a segment length:
- For example, runs of 4–12 hours each, depending on queue policies.
- Use restart/checkpoint files between segments.
- Submit a chain of jobs (e.g., using scheduler job dependencies) or a job array to advance the simulation in stages; an example segment script follows below.
This makes long projects more robust and easier to monitor.
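A minimal sketch of one such segment job script, which restarts from the newest checkpoint if one exists (the executable, its flags, and the checkpoint naming scheme are hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=segment
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32
#SBATCH --time=08:00:00        # one 8-hour segment

CKPT_DIR=checkpoints
mkdir -p "${CKPT_DIR}"

# Newest checkpoint, if any (naming scheme is code-specific).
LATEST=$(ls -1t "${CKPT_DIR}"/ckpt_*.h5 2>/dev/null | head -n 1)

if [ -n "${LATEST}" ]; then
  srun ./run_sim --restart "${LATEST}" input.cfg   # resume previous segment
else
  srun ./run_sim input.cfg                         # first segment
fi
```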
Practical Submission Strategy
Dry runs and “medium-scale” rehearsals
Do not jump straight from a laptop-sized run to the entire machine. Use staged scaling:
- Functional test:
- Same input, 1 node, short walltime, minimal output.
- Goal: correctness, no crashes, checkpoint/restart works.
- Medium-scale rehearsal:
- A fraction of target node count (e.g., 1/4 or 1/8).
- Realistic problem size or slightly smaller.
- Full physics, write checkpoints, produce representative output.
- Measure:
- Time per timestep
- Memory usage
- I/O costs
- Target large-scale run:
- Only after medium-scale performance and stability are acceptable.
At each step, change only one or two parameters (e.g., add nodes) rather than simultaneously changing resolution, physics, and file output.
Using job arrays and ensembles
Not all large-scale simulations are “one huge run.” Often you want many related simulations:
- Parameter sweeps (e.g., varying boundary conditions or model parameters)
- Uncertainty quantification or Monte Carlo runs
- Different initial conditions
Best practice:
- Use job arrays when the cluster supports them.
- Each array task:
- Reads a different input file or parameter set
- Writes to its own output directory
- Advantages:
- Natural parallelism
- Better scheduler utilization
- Easier failure handling: rerun only failed tasks
This is often more efficient and robust than a single monolithic run.
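A sketch of a SLURM job array for a parameter sweep (the parameter file, the `--output-dir` flag, and `run_sim` are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-50           # 50 independent tasks
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00

# Each array task reads its own parameter set and writes to its own directory.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)  # line N of the parameter list
OUTDIR="runs/sweep/task_${SLURM_ARRAY_TASK_ID}"
mkdir -p "${OUTDIR}"

srun ./run_sim ${PARAMS} --output-dir "${OUTDIR}"      # PARAMS deliberately unquoted
```

If a handful of tasks fail, you can resubmit just those indices, e.g. `sbatch --array=7,23 sweep.sh`.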
Robustness: Checkpointing and Restart
Designing a checkpointing strategy
Large-scale runs must assume failures can occur:
- Node or network failures
- Scheduler preemption or time limits
- User mistakes in job scripts
Your checkpointing strategy should consider:
- Frequency:
- Checkpoint often enough to limit lost work, but not so often that I/O dominates.
- Rule of thumb: checkpoint interval comparable to 1–5% of walltime limit, or based on a fixed number of timesteps.
- Content:
- Save only what is strictly necessary to resume:
- State variables
- Timestepping information
- Random number generator state (or seeds), if the simulation is stochastic
- Format:
- Prefer formats that:
- Are portable between nodes
- Can handle partial writes (e.g., use temporary files renamed on completion)
- Are readable by your restart logic without ambiguity.
Test restart from a checkpoint at small scale before running large jobs.
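To illustrate the temporary-file-then-rename pattern at the shell level (production codes usually implement this inside their I/O layer; the `--write-checkpoint` flag and file names are hypothetical):

```bash
# Write the checkpoint under a temporary name first ...
./run_sim --write-checkpoint checkpoints/ckpt_000500.h5.partial

# ... and rename only after the write has completed. Within one filesystem
# 'mv' is atomic, so restart logic never sees a half-written file.
mv checkpoints/ckpt_000500.h5.partial checkpoints/ckpt_000500.h5

# Optionally keep a stable pointer to the newest valid checkpoint.
ln -sfn ckpt_000500.h5 checkpoints/latest.h5
```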
Using job dependencies
To chain multiple segments:
- Use scheduler job dependencies (e.g., in SLURM: `--dependency=afterok:<jobid>`):
- Job 1: timesteps 0–N
- Job 2: restarts from checkpoint at timestep N, runs N–2N
- Job 3, etc.
Benefits:
- Automatic progression when earlier jobs finish successfully
- Easier recovery: if job 3 fails, you can fix and resubmit from job 2’s checkpoint.
Plan and document your dependency chain as part of the project.
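A minimal sketch of submitting such a chain on SLURM, using `--parsable` to capture each job ID (`segment.sh` is the kind of segment script sketched earlier):

```bash
#!/bin/bash
# Submit N_SEGMENTS jobs; each starts only if the previous one finished OK.
N_SEGMENTS=5

JOBID=$(sbatch --parsable segment.sh)
echo "segment 1: job ${JOBID}"

for i in $(seq 2 "${N_SEGMENTS}"); do
  JOBID=$(sbatch --parsable --dependency=afterok:"${JOBID}" segment.sh)
  echo "segment ${i}: job ${JOBID}"
done
```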
Monitoring and Managing Large Runs
Tracking resource usage
During large runs, monitor:
- CPU / GPU utilization
- Memory usage
- I/O throughput
- Network utilization (if tools allow)
Use:
- Scheduler commands to inspect running jobs
- Cluster-provided web interfaces or monitoring dashboards
- Simple logging inside your code (e.g., periodic timing, memory, iteration counters)
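On a SLURM cluster, for example, the following commands cover the basics (replace `<jobid>`; available fields vary by site, and `seff` is a contributed tool that may not be installed everywhere):

```bash
# Jobs you currently have queued or running
squeue -u "$USER"

# Live usage of a running job's steps (memory high-water mark, CPU time)
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU

# Accounting view during or after the job: elapsed time, memory, exit state
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State,NNodes

# Post-mortem efficiency summary, where installed
seff <jobid>
```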
Look for:
- Persistent underutilization of CPUs or GPUs
- Processes close to memory limits
- Excessive time in I/O
Detecting and handling problems early
Large jobs that misbehave can waste allocations and impact other users. Warning signs:
- Runtime per step increasing over time (e.g., due to load imbalance or I/O congestion)
- Rapidly growing output files
- NaNs or abnormal physical values in logs
- Frequent checkpoint failures or corrupted files
If you see issues:
- Stop the job if it is clearly wrong or wasteful.
- Investigate at smaller scale with more diagnostics enabled.
- Adjust input parameters, decomposition, or I/O strategy.
- Only resubmit large jobs after confirming fixes on smaller tests.
Data Management for Large-Scale Output
Organizing output
Plan a directory structure before running:
- Separate:
- Input
- Checkpoints
- Final outputs
- Logs and diagnostics
- Use informative naming:
- Include parameter values and run IDs in directory/file names
- Example:
- `runs/res_1024_dt_1e-3/run001`
- `runs/res_1024_dt_1e-3/run001/checkpoints`
- `runs/res_1024_dt_1e-3/run001/output`
- `runs/res_1024_dt_1e-3/run001/logs`
Document:
- Which job IDs produced which directories
- The git commit or code version used
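A small sketch of creating such a run directory and recording provenance (paths follow the example above; adapt the layout to your project):

```bash
#!/bin/bash
RUN_DIR="runs/res_1024_dt_1e-3/run001"
mkdir -p "${RUN_DIR}"/{checkpoints,output,logs}

# Record which code version and input produced this run.
git rev-parse HEAD > "${RUN_DIR}/code_version.txt"
cp input.cfg         "${RUN_DIR}/"

# Note the job ID once the run is submitted (empty if run interactively).
echo "SLURM job: ${SLURM_JOB_ID:-unsubmitted}" >> "${RUN_DIR}/provenance.txt"
```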
Reducing and post-processing data
Large-scale simulations can generate more data than you can keep or analyze. To control this:
- Reduce output during the run:
- Write snapshots less frequently.
- Store only necessary variables.
- Use compression if appropriate.
- Perform on-the-fly analysis if possible:
- Compute derived statistics or diagnostics inside the simulation.
- Save reduced data instead of full fields at every step.
- After the run:
- Move raw data off the scratch filesystem if required by policy.
- Consider keeping:
- Reduced “summary” datasets
- A small number of high-value snapshots
- Scripts used to generate plots and tables
For the final project, explicitly state what data you keep and why.
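As an illustration of the last step, moving reduced results and a few selected snapshots from scratch to project storage might look like this (all paths are hypothetical; follow your site's data policy):

```bash
#!/bin/bash
SRC="/scratch/${USER}/runs/res_1024_dt_1e-3/run001"
DST="/project/mygroup/archive/res_1024_dt_1e-3/run001"

mkdir -p "${DST}"/{summaries,logs,snapshots}

# Keep reduced summaries, logs, and a small number of high-value snapshots;
# leave (or delete) the bulk raw output on scratch according to policy.
rsync -av "${SRC}/output/summaries/"       "${DST}/summaries/"
rsync -av "${SRC}/logs/"                   "${DST}/logs/"
rsync -av "${SRC}/output/snapshot_0100.h5" \
          "${SRC}/output/snapshot_0200.h5" "${DST}/snapshots/"
```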
Evaluating the Large-Scale Run
Comparing with small-scale behavior
After the large run completes, compare with your earlier tests:
- Does runtime per step match your predictions?
- Did memory usage and I/O behave as expected?
- How did parallel efficiency change at scale?
Use these comparisons to:
- Explain any mismatch (e.g., communication overhead, I/O bottlenecks).
- Justify your chosen resource configuration in your final report.
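One convenient way to quantify the scaling comparison is the parallel efficiency relative to your baseline measurement,

$$ E(P) = \frac{T_{\text{base}} \times P_{\text{base}}}{T(P) \times P}, $$

where $T(P)$ is the time per step on $P$ cores and the baseline is your smallest reliable run; $E(P) = 1$ means ideal strong scaling, and values substantially below 1 point to communication or I/O overheads worth explaining in the report.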
Assessing scientific results
Finally, check whether the large-scale simulation answered your original question:
- Do outputs converge compared to lower-resolution runs?
- Are the results qualitatively and quantitatively reasonable?
- Are there remaining uncertainties that would require even larger or more numerous runs?
For the final project:
- Summarize how the large-scale aspect contributed:
- Higher resolution
- Longer timescales
- More robust statistics
- Reflect on:
- Trade-offs you made (resolution vs. runtime vs. data volume)
- What you would change if you had 10× more resources
- What you would do differently with the experience you now have
Checklist for Your Final Project Large-Scale Simulation
Use this checklist before and after running:
Before:
- [ ] Problem size and target resolution defined and justified
- [ ] Small-scale correctness validated
- [ ] Baseline performance measured
- [ ] Memory and runtime estimated; job sizing planned
- [ ] Parallel configuration (nodes, tasks, threads, GPUs) chosen and documented
- [ ] Checkpoint/restart tested on small runs
- [ ] Directory structure and naming scheme set
- [ ] Job scripts written for full run (and segments if needed)
During:
- [ ] Medium-scale rehearsal completed successfully
- [ ] Key metrics monitored (utilization, memory, I/O)
- [ ] Logs checked for errors and anomalies
- [ ] Interventions taken if the run misbehaves
After:
- [ ] Outputs verified and backed up per policy
- [ ] Data reduced and organized
- [ ] Performance and scaling analyzed
- [ ] Results interpreted relative to your initial goals
- [ ] All steps, parameters, and scripts recorded for reproducibility
Following this structured approach will allow you to run meaningful large-scale simulations and to document them clearly in your performance analysis and final project report.