From Prototype to Large-Scale Run
Running a large-scale simulation is not just “using more cores.” It is a process that links your science/engineering question, your code, the cluster, and the scheduler in a coordinated way. In the final project, the goal is to move from a working small prototype to a robust, efficient run at meaningful scale.
This chapter focuses on:
- When and how to scale up from test runs
- How to design and size your large jobs
- Practical workflow for running and managing big simulations
- Common pitfalls and how to avoid wasting time and allocation
You should already have:
- A working code (or application) that runs correctly on a small problem
- Basic familiarity with your cluster, scheduler (e.g. SLURM), and modules
- Some understanding of MPI/OpenMP/GPU use if your code uses them
Planning a Large-Scale Simulation
Clarifying the simulation goal
Before you submit a “huge” job, define:
- Scientific/engineering objective:
- What question should the simulation answer?
- What output metrics matter (e.g., error norms, convergence rate, statistics, visualizations)?
- Required resolution/size:
- Space: grid size, number of particles, mesh elements, etc.
- Time: number of timesteps, simulated physical time, number of iterations.
- Required accuracy:
- What discretization or model error is acceptable?
- Can you tolerate some approximation for speed?
Write this down. It will guide problem size, runtime, and what is “good enough.”
Choosing problem size and resources
Your code has some measure of “problem size” $N$ (e.g., grid points, particles, unknowns). A large-scale run might increase:
- $N$ (higher resolution, more particles)
- Simulation time (more timesteps)
- Both together
However, you cannot just multiply everything arbitrarily. You must consider:
- Memory requirements:
- Estimate memory per data structure at your target size.
- Compute memory per process/thread and per node.
- Ensure:
`memory_per_node_used < memory_per_node_available`, with some margin.
- Expected runtime:
- Use small or medium test runs to estimate time-per-step:
$$ T_{\text{large}} \approx T_{\text{small}} \times \frac{\text{work}_{\text{large}}}{\text{work}_{\text{small}}} \times \frac{\text{cores}_{\text{small}}}{\text{cores}_{\text{large}}} \times \text{overhead factor} $$
- Add a margin (e.g., 20–50%) for overhead, I/O, and slower-than-ideal scaling; a short worked example follows this list.
- I/O volume:
- Estimate output size:
$$ \text{output size} \approx \text{size per snapshot} \times \text{number of snapshots} \times \text{number of files} $$
- Check quota and filesystem recommendations.
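As a quick worked example with made-up numbers: suppose a medium test run takes 2 s per timestep on 128 cores, the target run has 8× the work and will use 512 cores, and you assume an overhead factor of 1.3. Then

$$ T_{\text{large}} \approx 2\,\text{s} \times 8 \times \frac{128}{512} \times 1.3 = 5.2\,\text{s per step}, $$

so 20,000 timesteps cost roughly $5.2 \times 20{,}000 \approx 1.0 \times 10^{5}\,\text{s} \approx 29$ hours of walltime, which already tells you the run will need to be split into segments (see the walltime discussion below).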
Aim for jobs that are:
- Large enough to be scientifically meaningful
- Small enough to fit within:
- Your allocation (core-hours / GPU-hours)
- Walltime limits
- Storage limits
Pre-Scaling Checks
Verifying numerical and physical correctness at small scale
Do not discover basic bugs at 10,000 cores. Before scaling:
- Verify correctness:
- Compare against analytical solutions (if available) or trusted reference results.
- Run regression tests you designed during the project.
- Check physical stability:
- No unphysical blow-up (e.g., negative densities, NaNs).
- Validate with a short run at the intended resolution (or slightly smaller).
If you cannot justify the model and numerics at small scale, you are not ready to scale.
Establishing baseline performance
Before going “large,” gather:
- Runtime per step (or per iteration) on:
- 1 core / 1 GPU
- 1 node
- A small number of nodes
- Parallel scaling behavior:
- A few strong- or weak-scaling tests with your real physics enabled but a small problem size (a sketch of such a sweep appears at the end of this subsection).
Record:
- Number of nodes / cores / GPUs
- Problem size
- Walltime, memory usage, and I/O behavior
- Any performance anomalies
These baselines help you:
- Predict walltimes and core-hour cost of large runs
- Detect regressions when you modify the code or input
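A minimal sketch of how such a scaling sweep might be driven on a SLURM cluster (the executable `run_sim`, its input file, and the chosen node counts are placeholders; adapt to your code and site):

```bash
#!/bin/bash
# Submit the same small problem on an increasing number of nodes to
# collect strong-scaling baselines. All names below are placeholders.
for NODES in 1 2 4 8; do
  sbatch --job-name="baseline_${NODES}n" \
         --nodes="${NODES}" \
         --ntasks-per-node=32 \
         --time=00:30:00 \
         --output="baseline_${NODES}n_%j.log" \
         --wrap="srun ./run_sim input_small.cfg"
done
# Afterwards, pull time-per-step and peak memory from the logs (or from
# sacct) and record them next to the node count and problem size.
```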
Designing Large-Scale Job Configurations
Choosing parallel decomposition and layout
For MPI/OpenMP-style applications:
- Decide:
`nodes × tasks_per_node × cpus_per_task` (and `gpus_per_task` if applicable)
- Match the code's decomposition to the hardware:
- For domain decomposition codes:
- Ratio of processes in each dimension should match domain geometry where possible (e.g., a 3D grid: $N_x : N_y : N_z$ should roughly match process grid $P_x : P_y : P_z$).
- For particle or ensemble simulations:
- Map each independent simulation or domain to MPI ranks or array jobs.
For GPU applications:
- Typically 1–4 GPUs per node; ensure the code is configured to use all requested GPUs.
- Avoid severe CPU–GPU imbalance: do not request more CPUs than needed for feeding GPUs.
Document the mapping you use; you will need it when interpreting performance.
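As an illustration, a hybrid MPI+OpenMP layout on nodes with two 16-core sockets might be requested like this (a sketch; `run_sim` and the `--process-grid` flag stand in for whatever your code provides):

```bash
#!/bin/bash
#SBATCH --job-name=decomp_test
#SBATCH --nodes=8              # 8 nodes ...
#SBATCH --ntasks-per-node=2    # ... with one MPI rank per socket
#SBATCH --cpus-per-task=16     # and one OpenMP thread per core
#SBATCH --time=02:00:00

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"

# 8 nodes x 2 ranks = 16 MPI ranks; a 4 x 2 x 2 process grid roughly
# matching the domain geometry is passed via a (code-specific) flag.
srun ./run_sim --process-grid 4x2x2 input.cfg
```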
Walltime sizing and job partitioning
A large-scale simulation may not fit comfortably in a single job. Consider:
- Walltime limits from the scheduler
- Checkpointing capability of the code
- Failure risks (hardware faults, preemption, user mistakes)
A common strategy:
- Choose a segment length:
- For example, runs of 4–12 hours each, depending on queue policies.
- Use restart/checkpoint files between segments.
- Submit a chain of jobs (e.g., using scheduler job dependencies) or a job array to advance the simulation in stages; an example segment script follows below.
This makes long projects more robust and easier to monitor.
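A minimal sketch of one such segment job script, which restarts from the newest checkpoint if one exists (the executable, its flags, and the checkpoint naming scheme are hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=segment
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32
#SBATCH --time=08:00:00        # one 8-hour segment

CKPT_DIR=checkpoints
mkdir -p "${CKPT_DIR}"

# Newest checkpoint, if any (naming scheme is code-specific).
LATEST=$(ls -1t "${CKPT_DIR}"/ckpt_*.h5 2>/dev/null | head -n 1)

if [ -n "${LATEST}" ]; then
  srun ./run_sim --restart "${LATEST}" input.cfg   # resume previous segment
else
  srun ./run_sim input.cfg                         # first segment
fi
```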
Practical Submission Strategy
Dry runs and “medium-scale” rehearsals
Do not jump straight from a laptop-sized run to the entire machine. Use staged scaling:
- Functional test:
- Same input, 1 node, short walltime, minimal output.
- Goal: correctness, no crashes, checkpoint/restart works.
- Medium-scale rehearsal:
- A fraction of target node count (e.g., 1/4 or 1/8).
- Realistic problem size or slightly smaller.
- Full physics, write checkpoints, produce representative output.
- Measure:
- Time per timestep
- Memory usage
- I/O costs
- Target large-scale run:
- Only after medium-scale performance and stability are acceptable.
At each step, change only one or two parameters (e.g., add nodes) rather than simultaneously changing resolution, physics, and file output.
Using job arrays and ensembles
Not all large-scale simulations are “one huge run.” Often you want many related simulations:
- Parameter sweeps (e.g., varying boundary conditions or model parameters)
- Uncertainty quantification or Monte Carlo runs
- Different initial conditions
Best practice:
- Use job arrays when the cluster supports them.
- Each array task:
- Reads a different input file or parameter set
- Writes to its own output directory
- Advantages:
- Natural parallelism
- Better scheduler utilization
- Easier failure handling: rerun only failed tasks
This is often more efficient and robust than a single monolithic run.
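A sketch of a SLURM job array for a parameter sweep (the parameter file, the `--output-dir` flag, and `run_sim` are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-50           # 50 independent tasks
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00

# Each array task reads its own parameter set and writes to its own directory.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)  # line N of the parameter list
OUTDIR="runs/sweep/task_${SLURM_ARRAY_TASK_ID}"
mkdir -p "${OUTDIR}"

srun ./run_sim ${PARAMS} --output-dir "${OUTDIR}"      # PARAMS deliberately unquoted
```

If a handful of tasks fail, you can resubmit just those indices, e.g. `sbatch --array=7,23 sweep.sh`.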
Robustness: Checkpointing and Restart
Designing a checkpointing strategy
Large-scale runs must assume failures can occur:
- Node or network failures
- Scheduler preemption or time limits
- User mistakes in job scripts
Your checkpointing strategy should consider:
- Frequency:
- Checkpoint often enough to limit lost work, but not so often that I/O dominates.
- Rule of thumb: checkpoint interval comparable to 1–5% of walltime limit, or based on a fixed number of timesteps.
- Content:
- Save only what is strictly necessary to resume:
- State variables
- Timestepping information
- Random number generator state (or seeds), if the simulation is stochastic
- Format:
- Prefer formats that:
- Are portable between nodes
- Can handle partial writes (e.g., use temporary files renamed on completion)
- Are readable by your restart logic without ambiguity.
Test restart from a checkpoint at small scale before running large jobs.
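To illustrate the temporary-file-then-rename pattern at the shell level (production codes usually implement this inside their I/O layer; the `--write-checkpoint` flag and file names are hypothetical):

```bash
# Write the checkpoint under a temporary name first ...
./run_sim --write-checkpoint checkpoints/ckpt_000500.h5.partial

# ... and rename only after the write has completed. Within one filesystem
# 'mv' is atomic, so restart logic never sees a half-written file.
mv checkpoints/ckpt_000500.h5.partial checkpoints/ckpt_000500.h5

# Optionally keep a stable pointer to the newest valid checkpoint.
ln -sfn ckpt_000500.h5 checkpoints/latest.h5
```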
Using job dependencies
To chain multiple segments:
- Use scheduler job dependencies (e.g., in SLURM: `--dependency=afterok:<jobid>`):
- Job 1: timesteps 0–N
- Job 2: restarts from checkpoint at timestep N, runs N–2N
- Job 3, etc.
Benefits:
- Automatic progression when earlier jobs finish successfully
- Easier recovery: if job 3 fails, you can fix and resubmit from job 2’s checkpoint.
Plan and document your dependency chain as part of the project.
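A minimal sketch of submitting such a chain on SLURM, using `--parsable` to capture each job ID (`segment.sh` is the kind of segment script sketched earlier):

```bash
#!/bin/bash
# Submit N_SEGMENTS jobs; each starts only if the previous one finished OK.
N_SEGMENTS=5

JOBID=$(sbatch --parsable segment.sh)
echo "segment 1: job ${JOBID}"

for i in $(seq 2 "${N_SEGMENTS}"); do
  JOBID=$(sbatch --parsable --dependency=afterok:"${JOBID}" segment.sh)
  echo "segment ${i}: job ${JOBID}"
done
```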
Monitoring and Managing Large Runs
Tracking resource usage
During large runs, monitor:
- CPU / GPU utilization
- Memory usage
- I/O throughput
- Network utilization (if tools allow)
Use:
- Scheduler commands to inspect running jobs
- Cluster-provided web interfaces or monitoring dashboards
- Simple logging inside your code (e.g., periodic timing, memory, iteration counters)
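On a SLURM cluster, for example, the following commands cover the basics (replace `<jobid>`; available fields vary by site, and `seff` is a contributed tool that may not be installed everywhere):

```bash
# Jobs you currently have queued or running
squeue -u "$USER"

# Live usage of a running job's steps (memory high-water mark, CPU time)
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU

# Accounting view during or after the job: elapsed time, memory, exit state
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State,NNodes

# Post-mortem efficiency summary, where installed
seff <jobid>
```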
Look for:
- Persistent underutilization of CPUs or GPUs
- Processes close to memory limits
- Excessive time in I/O
Detecting and handling problems early
Large jobs that misbehave can waste allocations and impact other users. Warning signs:
- Runtime per step increasing over time (e.g., due to load imbalance or I/O congestion)
- Rapidly growing output files
- NaNs or abnormal physical values in logs
- Frequent checkpoint failures or corrupted files
If you see issues:
- Stop the job if it is clearly wrong or wasteful.
- Investigate at smaller scale with more diagnostics enabled.
- Adjust input parameters, decomposition, or I/O strategy.
- Only resubmit large jobs after confirming fixes on smaller tests.
Data Management for Large-Scale Output
Organizing output
Plan a directory structure before running:
- Separate:
- Input
- Checkpoints
- Final outputs
- Logs and diagnostics
- Use informative naming:
- Include parameter values and run IDs in directory/file names
- Example:
- `runs/res_1024_dt_1e-3/run001`
- `runs/res_1024_dt_1e-3/run001/checkpoints`
- `runs/res_1024_dt_1e-3/run001/output`
- `runs/res_1024_dt_1e-3/run001/logs`
Document:
- Which job IDs produced which directories
- The git commit or code version used
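A small sketch of creating such a run directory and recording provenance (paths follow the example above; adapt the layout to your project):

```bash
#!/bin/bash
RUN_DIR="runs/res_1024_dt_1e-3/run001"
mkdir -p "${RUN_DIR}"/{checkpoints,output,logs}

# Record which code version and input produced this run.
git rev-parse HEAD > "${RUN_DIR}/code_version.txt"
cp input.cfg         "${RUN_DIR}/"

# Note the job ID once the run is submitted (empty if run interactively).
echo "SLURM job: ${SLURM_JOB_ID:-unsubmitted}" >> "${RUN_DIR}/provenance.txt"
```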
Reducing and post-processing data
Large-scale simulations can generate more data than you can keep or analyze. To control this:
- Reduce output during the run:
- Write snapshots less frequently.
- Store only necessary variables.
- Use compression if appropriate.
- Perform on-the-fly analysis if possible:
- Compute derived statistics or diagnostics inside the simulation.
- Save reduced data instead of full fields at every step.
- After the run:
- Move raw data off the scratch filesystem if required by policy.
- Consider keeping:
- Reduced “summary” datasets
- A small number of high-value snapshots
- Scripts used to generate plots and tables
For the final project, explicitly state what data you keep and why.
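As an illustration of the last step, moving reduced results and a few selected snapshots from scratch to project storage might look like this (all paths are hypothetical; follow your site's data policy):

```bash
#!/bin/bash
SRC="/scratch/${USER}/runs/res_1024_dt_1e-3/run001"
DST="/project/mygroup/archive/res_1024_dt_1e-3/run001"

mkdir -p "${DST}"/{summaries,logs,snapshots}

# Keep reduced summaries, logs, and a small number of high-value snapshots;
# leave (or delete) the bulk raw output on scratch according to policy.
rsync -av "${SRC}/output/summaries/"       "${DST}/summaries/"
rsync -av "${SRC}/logs/"                   "${DST}/logs/"
rsync -av "${SRC}/output/snapshot_0100.h5" \
          "${SRC}/output/snapshot_0200.h5" "${DST}/snapshots/"
```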
Evaluating the Large-Scale Run
Comparing with small-scale behavior
After the large run completes, compare with your earlier tests:
- Does runtime per step match your predictions?
- Did memory usage and I/O behave as expected?
- How did parallel efficiency change at scale?
Use these comparisons to:
- Explain any mismatch (e.g., communication overhead, I/O bottlenecks).
- Justify your chosen resource configuration in your final report.
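One convenient way to quantify the scaling comparison is the parallel efficiency relative to your baseline measurement,

$$ E(P) = \frac{T_{\text{base}} \times P_{\text{base}}}{T(P) \times P}, $$

where $T(P)$ is the time per step on $P$ cores and the baseline is your smallest reliable run; $E(P) = 1$ means ideal strong scaling, and values substantially below 1 point to communication or I/O overheads worth explaining in the report.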
Assessing scientific results
Finally, check whether the large-scale simulation answered your original question:
- Do outputs converge compared to lower-resolution runs?
- Are the results qualitatively and quantitatively reasonable?
- Are there remaining uncertainties that would require even larger or more numerous runs?
For the final project:
- Summarize how the large-scale aspect contributed:
- Higher resolution
- Longer timescales
- More robust statistics
- Reflect on:
- Trade-offs you made (resolution vs. runtime vs. data volume)
- What you would change if you had 10× more resources
- What you would do differently with the experience you now have
Checklist for Your Final Project Large-Scale Simulation
Use this checklist before and after running:
Before:
- [ ] Problem size and target resolution defined and justified
- [ ] Small-scale correctness validated
- [ ] Baseline performance measured
- [ ] Memory and runtime estimated; job sizing planned
- [ ] Parallel configuration (nodes, tasks, threads, GPUs) chosen and documented
- [ ] Checkpoint/restart tested on small runs
- [ ] Directory structure and naming scheme set
- [ ] Job scripts written for full run (and segments if needed)
During:
- [ ] Medium-scale rehearsal completed successfully
- [ ] Key metrics monitored (utilization, memory, I/O)
- [ ] Logs checked for errors and anomalies
- [ ] Interventions taken if the run misbehaves
After:
- [ ] Outputs verified and backed up per policy
- [ ] Data reduced and organized
- [ ] Performance and scaling analyzed
- [ ] Results interpreted relative to your initial goals
- [ ] All steps, parameters, and scripts recorded for reproducibility
Following this structured approach will allow you to run meaningful large-scale simulations and to document them clearly in your performance analysis and final project report.