From Idea to Results: The Typical HPC Workflow
Working on an HPC system is less like using a laptop and more like running a small production process: you plan, prepare, submit, monitor, and collect results. While details vary by discipline and cluster, most HPC work follows a few recurring patterns. This chapter focuses on those patterns so you can recognize and adopt a “standard way of working” on HPC systems.
We’ll describe workflows from two complementary viewpoints:
- The user-level workflow: what you do, step by step.
- The system-level pattern: how jobs flow through the cluster.
Throughout, assume that things like compilers, job schedulers, parallel programming, and storage systems are covered elsewhere; here we show how they are typically combined in practice.
1. The Basic HPC User Workflow
Most users on a cluster repeatedly go through a core cycle:
- Develop and test locally (or on a login node at tiny scale)
- Prepare input and job scripts
- Submit and monitor jobs
- Collect and post-process results
- Refine and repeat
In more detail:
1.1 Develop and test on a small scale
A typical pattern:
- Develop code on your local machine or a login node.
- Use small test inputs to verify:
- The code runs without crashing.
- Basic correctness of results (e.g., known solutions, conservation checks, sanity plots).
- Perform very short trial jobs on the cluster to:
- Validate job scripts.
- Check environment modules and libraries.
- Measure rough runtime and memory use for production planning.
Key habits in this phase:
- Keep tests cheap and fast (minutes, not hours).
- Use version control for your code and scripts.
- Document what each test is meant to verify.
1.2 Prepare input data and job scripts
Production work on HPC almost always uses batch jobs, not interactive runs. Typical preparations:
- Organize your project in a clear directory structure:
- src/, input/, scripts/, runs/, results/, logs/, etc.
- Create one or more job scripts that:
- Request appropriate resources (nodes, cores, time, memory, GPUs).
- Load needed modules.
- Set environment variables.
- Launch the application with the correct MPI/OpenMP/GPU options.
- Prepare input files/configurations:
- Parameter files, mesh descriptions, datasets, random seeds, etc.
- Often, multiple parameter variants for parameter sweeps.
Typical practice is to keep one job-script template and generate many variants from it (different parameters, different scales); a minimal template is sketched below.
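As an illustration, such a template might look like the following sketch. Slurm syntax is assumed here purely as an example; the module names, paths, resource values, and executable are placeholders to adapt to your system and application.

```bash
#!/bin/bash
#SBATCH --job-name=myrun             # placeholder job name
#SBATCH --nodes=2                    # resources: adjust to your case
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00              # walltime limit
#SBATCH --output=logs/%x_%j.out      # %x = job name, %j = job ID

# Load the software environment (module names are site-specific placeholders)
module load gcc openmpi

# Optional environment variables, e.g. for hybrid MPI/OpenMP runs
export OMP_NUM_THREADS=1

# Launch the application (executable and input file are placeholders)
srun ./my_solver input/params_base.in
```

For parameter sweeps, the same template is usually copied or instantiated with different input files and resource requests rather than edited by hand each time.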
1.3 Submit, queue, and monitor jobs
The standard cycle:
- Use a scheduler command to submit a job script (e.g., sbatch, qsub); a command-level sketch follows this list.
- The job enters a queue until resources are available.
- While waiting or running, you:
- Check status (squeue, etc.).
- Inspect log and output files to ensure the job is behaving as expected.
- On failures, examine:
- Scheduler error messages.
- Application log files.
- Resource usage reports.
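On a Slurm-based cluster (assumed here only for illustration; other schedulers have equivalent commands), this cycle maps onto a handful of commands. The job ID and file names below are placeholders:

```bash
# Submit the job script; sbatch prints the job ID
sbatch scripts/run_base.sh

# Check queue status for your own jobs
squeue -u $USER

# Follow the application/scheduler output while the job runs
tail -f logs/myrun_1234567.out

# After completion, inspect elapsed time, memory use, and exit state
sacct -j 1234567 --format=JobID,Elapsed,MaxRSS,State
```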
A good workflow includes gradual scaling:
- Start with smaller resource requests and shorter times.
- Increase problem size and resource count stepwise.
- Use each step to refine job configurations.
1.4 Collect, analyze, and archive results
After jobs finish:
- Move or copy result files to a designated location:
- E.g., $PROJECT/results/experiment_X/.
- Run post-processing:
- Extract key metrics.
- Generate plots or summary tables.
- Convert raw formats to analysis-friendly ones.
- Archive:
- Input files, job scripts, output, and logs together.
- Record software version, module list, and job parameters.
This creates a traceable trail from configuration to result, crucial for reproducibility and debugging.
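One possible way to build that trail is to write a small metadata file into each run directory and bundle everything for archiving. This is only a sketch; the paths are placeholders and the metadata fields should be adapted to your project:

```bash
RUN_DIR=runs/run_100_production          # placeholder run directory

# Record provenance next to the results
{
  echo "date: $(date -Iseconds)"
  echo "code version: $(git -C src rev-parse HEAD)"
  echo "job id: ${SLURM_JOB_ID:-interactive}"
  module list 2>&1                       # module list often prints to stderr
} > "$RUN_DIR/metadata.txt"

# Bundle inputs, scripts, outputs, and logs together for long-term storage
tar czf archive/run_100_production.tar.gz "$RUN_DIR"
```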
1.5 Iterate and refine
Based on outcomes:
- Adjust:
- Numerical parameters.
- Physical models.
- Resolution or problem size.
- Parallelization strategy or resource request.
- Run new sets of jobs (often using the same patterns and templates).
- Occasionally:
- Refactor or optimize code.
- Update dependencies or libraries.
- Change workflow tools (e.g., add workflow managers, containers, etc.).
HPC work is typically iterative, not “one-and-done”.
2. Common HPC Workflow Patterns
Beyond the generic loop, certain recurrent patterns appear across domains. Recognizing them helps you structure your own work.
2.1 Single large production run
Pattern: one or a few very large jobs that dominate the project.
Typical use cases:
- Climate or cosmology simulations.
- Large-scale engineering simulations (e.g., crash tests, aerodynamics).
- High-fidelity multiphysics models.
Workflow characteristics:
- Long setup phase: careful benchmark runs, stability checks, and performance tuning.
- Emphasis on:
- Checkpointing and restart strategy.
- Job chaining (multiple sequential jobs extending simulation time).
- Monitoring for numerical instabilities or divergence.
- Results:
- Large volumes of data per run, often requiring immediate post-processing and selective retention.
Best practices:
- Before committing, run:
- Short, reduced-scale dry runs to measure performance and memory.
- End-to-end tests of checkpoint/restart.
- Split very long computations into multiple queued segments chained by restarts, rather than requesting the maximum walltime in a single job (a chaining sketch follows this list).
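One way to implement such chaining, assuming a Slurm scheduler and a placeholder segment.sh job script that writes a checkpoint before its walltime limit and resumes from the latest checkpoint, is to submit all segments up front with dependencies:

```bash
# Submit N sequential segments; each restarts from the previous checkpoint.
prev=$(sbatch --parsable scripts/segment.sh)
for i in $(seq 2 5); do
  # Each segment starts only if the previous one finished successfully
  prev=$(sbatch --parsable --dependency=afterok:${prev} scripts/segment.sh)
done
echo "last segment in the chain: job ${prev}"
```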
2.2 Parameter sweeps and ensembles
Pattern: many similar jobs with different parameters or inputs.
Examples:
- Sensitivity analysis (vary one or more parameters).
- Calibration or inverse problems.
- Monte Carlo simulations with different random seeds.
- Uncertainty quantification.
Workflow characteristics:
- Many independent or loosely coupled jobs, sometimes thousands of them.
- Jobs may be:
- All small and similar.
- A mix of small and medium-sized.
- Emphasis on:
- Automated job generation and submission.
- Systematic naming and directory organization.
- Automated result aggregation.
Typical implementation:
- A driver script that:
- Generates input files for each parameter combination.
- Writes job scripts or sbatch commands.
- Optionally submits them in batches.
- Scheduler features often used:
- Job arrays (one script, many instances with varying indices); a sketch follows this list.
- Job dependencies when post-processing is needed per run.
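A common concrete realization, sketched here for Slurm with placeholder file names and executable, is a job array in which each task reads one line of a parameter list:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100                    # one task per parameter combination
#SBATCH --time=00:30:00
#SBATCH --output=logs/sweep_%A_%a.out    # %A = array job ID, %a = task index

# params.txt is a placeholder file with one parameter set per line
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input/params.txt)

# Give each task its own run directory
mkdir -p runs/sweep_${SLURM_ARRAY_TASK_ID}
cd runs/sweep_${SLURM_ARRAY_TASK_ID}

# Launch the application with this task's parameters (placeholder executable)
srun ../../src/my_solver ${PARAMS}
```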
Best practices:
- Keep per-job resource requests modest to maintain throughput and fair usage.
- Use a consistent naming convention (e.g., run_paramA_1.0_paramB_0.5/).
- Implement automated summary scripts to collate results (e.g., into a CSV, database, or HDF5 file); a minimal collation sketch follows.
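Collating results can often be done in a few lines of shell. The sketch below assumes, purely as an example convention, that each run directory contains a one-line summary.csv:

```bash
# Collect per-run summaries into a single CSV for analysis
echo "run,paramA,paramB,objective" > results/summary_all.csv
for d in runs/run_paramA_*_paramB_*/; do
  # each run directory is assumed to write a one-line summary.csv
  cat "${d}summary.csv" >> results/summary_all.csv
done
```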
2.3 Multi-stage pipelines
Pattern: a sequence of distinct stages, each potentially parallel, with dependencies between them.
Example pipeline types:
- Data-centric pipelines:
- Data acquisition or ingestion.
- Data cleaning / preprocessing.
- Feature extraction or transformation.
- Modeling, training, or simulation.
- Post-processing and visualization.
- Simulation pipelines:
- Mesh generation or geometry processing.
- Initial condition setup.
- Main simulation.
- Specialized post-processing (e.g., turbulence statistics).
- Visualization or export to analysis tools.
Workflow characteristics:
- Different stages may:
- Use different software.
- Have very different resource requirements.
- Produce intermediate data products.
- Job dependencies:
- Later stages depend on outputs of earlier stages.
- Scheduler dependency features are commonly used:
- Submit all jobs but ensure, for example, that Stage 2 starts only when Stage 1 finishes successfully.
Best practices:
- Clearly define stage boundaries and expected inputs/outputs.
- Use job dependencies instead of manual waiting.
- Remove or compress large intermediate files when safe, to save space.
- Consider simple workflow managers or scripts to orchestrate the pipeline; a bare-bones scripted example follows this list.
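A bare-bones scripted orchestration, assuming Slurm and placeholder stage scripts, submits all stages at once and lets dependencies enforce the order; each stage's job script requests only the resources that stage needs:

```bash
# Each stage has its own resource requests inside its job script.
pre=$(sbatch --parsable scripts/stage1_preprocess.sh)                                 # small, serial
sim=$(sbatch --parsable --dependency=afterok:${pre} scripts/stage2_simulate.sh)       # large, parallel
post=$(sbatch --parsable --dependency=afterok:${sim} scripts/stage3_postprocess.sh)   # medium
echo "pipeline submitted: ${pre} -> ${sim} -> ${post}"
```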
2.4 Iterative optimization and training workflows
Pattern: many iterations, each of which computes a metric, adjusts parameters, and repeats.
Examples:
- Training machine learning models on large datasets.
- Inverse design or optimization (e.g., aerodynamic shape optimization).
- Iterative solvers that require multiple outer loops (e.g., nested simulation-based optimization).
Workflow characteristics:
- Tight iteration loop:
- Might run inside one long job (e.g., ML training),
- Or as a series of shorter jobs (e.g., expensive simulations with an external optimizer).
- Performance and convergence diagnostics are essential:
- Loss curves, objective function values.
- Early stopping or dynamic adaptation of resources.
Two common implementation modes:
- Single long-running job:
- Applies when the optimizer and evaluation runs can run on the same allocation.
- Common in distributed ML training.
- Controller + workers:
- A “controller” job (or process on login node, if allowed) submits evaluation jobs.
- Each evaluation job runs a simulation or model training with given parameters.
- The controller uses results to decide next parameters.
Best practices:
- Decide which parts belong in a single job and which in separate scheduled jobs.
- Implement checkpointing of the optimizer state as well as simulations.
- Keep a log of every iteration: parameters, objective value, job ID, and resource usage (see the controller sketch below).
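A minimal controller loop might look like the sketch below. It assumes Slurm and an invented propose_next.sh helper that turns the result history into the next parameter set; in practice that role is usually played by an optimization or ML library:

```bash
# Controller: repeatedly submit an evaluation job and wait for it to finish.
for iter in $(seq 1 20); do
  # invented helper: proposes the next parameters from past results
  params=$(./scripts/propose_next.sh results/history.csv)

  # --wait blocks until the evaluation job completes
  sbatch --wait scripts/evaluate.sh ${params}

  # evaluate.sh is assumed to append "iteration,params,objective" to history.csv
  tail -n 1 results/history.csv
done
```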
2.5 Data-heavy analysis and post-processing workflows
Pattern: the simulation phase may be done once, but analysis is repeated many times.
Examples:
- Large observational datasets (astronomy, genomics, sensor data).
- Repeated statistical analyses by different research teams.
- Visualization and exploratory data analysis of simulation outputs.
Workflow characteristics:
- Typically I/O-bound and memory-sensitive.
- Common operations:
- Scanning large datasets.
- Filtering and aggregation.
- Indexing, binning, and statistical summaries.
- Often scheduled as many small-to-medium jobs that:
- Read subsets or partitions of data.
- Write reduced or aggregated summaries.
Best practices:
- Design a data layout suitable for repeated parallel reads (e.g., chunked, partitioned files).
- Use parallel I/O libraries and compression where appropriate.
- Keep raw data immutable and derive smaller “analysis-ready” datasets.
- Plan for long-term storage and access patterns (what must stay on fast storage, what can be archived).
3. Workflow Organization on the Cluster
Beyond how you logically work, you must organize jobs, data, and scripts so that humans (including future you) can understand what’s going on.
3.1 Project and directory organization
Typical structure:
project_root/
- src/ – code.
- input/ – canonical input templates, base datasets.
- scripts/ – job scripts, helper scripts.
- runs/
  - run_001_short_test/
  - run_002_scaling_test/
  - run_100_production/
- results/ – processed outputs and summaries.
- logs/ – consolidated scheduler and application logs.
- config/ – configuration files, parameter sets.
- env/ or modules/ – information about software environment.
Workflow considerations:
- Keep source separate from run directories.
- Use run directories per job or experiment group:
- Each contains job scripts, input copies, and outputs for that run.
- Store metadata (parameters, seed, job ID, module list) inside each run directory, as in the sketch below.
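A small helper can create each run directory with its inputs and metadata already in place. This is a sketch with placeholder names and parameters:

```bash
RUN=runs/run_101_production              # placeholder run name
mkdir -p "${RUN}"

# Copy the job script and input template so the run is self-contained
cp scripts/job_template.sh input/params_base.in "${RUN}/"

# Record the parameters and environment used for this run
{
  echo "created: $(date -Iseconds)"
  echo "params: resolution=256 seed=42"  # placeholder parameters
  module list 2>&1
} > "${RUN}/metadata.txt"
```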
3.2 Batch vs interactive workflows
Two operational modes:
- Batch-oriented:
- Most production HPC work.
- You submit jobs and disconnect.
- Output is examined later, not in real time.
- Interactive node sessions:
- Reserved nodes where you can run commands interactively for:
- Debugging, profiling, or exploratory runs.
- Trying analysis workflows before batch automation.
Typical pattern:
- Use interactive nodes for:
- Developing and tuning workflows.
- Short exploratory analyses.
- Use batch jobs for:
- Long, resource-intensive production runs.
- Large-scale sweeps and pipelines.
3.3 Using job dependencies to build workflows
Rather than manually waiting and submitting each stage:
- Submit all jobs at once with specified dependencies:
- E.g., “Job B starts only if Job A finishes successfully.”
- Common dependency patterns:
- Linear chains: A → B → C.
- Fan-out: A → (many jobs in parallel).
- Fan-in: (many jobs) → aggregation job (see the sketch after this list).
- Benefits:
- Less manual intervention.
- Reduced risk of mistakes in run order.
- More robust, especially when you’re offline.
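As an illustration (Slurm assumed; script names are placeholders), a fan-out/fan-in pattern can be expressed by making an aggregation job depend on an entire job array:

```bash
# Fan-out: an array of independent worker jobs
array=$(sbatch --parsable --array=1-50 scripts/worker.sh)

# Fan-in: the aggregation job starts only after every array task succeeds
sbatch --dependency=afterok:${array} scripts/aggregate.sh
```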
4. Scalability and Incremental Workflow Design
In practice, users rarely jump directly from a laptop-scale run to thousands of cores. A typical scaling workflow looks like:
- Correctness on tiny cases:
- Run cheaply and quickly; verify correctness.
- Performance sanity check:
- Measure runtime on small but non-trivial cases.
- Inspect memory usage, parallel efficiency at low core counts.
- Scaling tests:
- Increase problem size and resources stepwise.
- Identify limits where performance stops improving (e.g., strong-scaling limits); a simple scaling-loop sketch follows this list.
- Production planning:
- Use scaling data to:
- Estimate time-to-solution at target scale.
- Choose a practical node count and job length.
- Production runs:
- Launch planned large jobs or ensembles.
- Post-production optimization (if time allows):
- Analyze performance logs.
- Improve scaling or reduce I/O bottlenecks for future projects.
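The scaling tests themselves are often just a small loop over node counts. The sketch below assumes Slurm and a placeholder scaling_case.sh job script:

```bash
# Submit the same case at increasing node counts for a strong-scaling test
for nodes in 1 2 4 8 16; do
  sbatch --nodes=${nodes} --job-name=scale_${nodes} scripts/scaling_case.sh
done

# Afterwards, compare elapsed times, e.g. with:
#   sacct --name=scale_1,scale_2,scale_4,scale_8,scale_16 --format=JobName,Elapsed,NNodes
```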
Designing your workflow in this incremental way avoids wasting large allocations and makes your work more predictable and reproducible.
5. Workflow Reliability and Reproducibility in Practice
Typical HPC workflows are fragile if not managed carefully. A few concrete practices make them robust:
- Automate recurring tasks:
- Job generation, submission, cleanup, and aggregation.
- Log everything:
- Parameters, input files, code version, environment.
- Use consistent naming and structure:
- So you can find things months later.
- Plan for failure:
- Checkpointing, restart scripts, and partial result recovery.
- Separate experiments:
- Different runs for different hypotheses or parameter sets, not “everything in one directory”.
These habits turn ad hoc sequences of jobs into coherent workflows that are easier to debug, share, and repeat.
6. Putting It All Together: A Minimal Example Workflow
As a concrete (but generic) outline, a typical HPC workflow for a new project might look like:
- Set up project structure on the cluster.
- Port or compile your code, run unit tests.
- Run tiny test jobs via the scheduler to:
- Check scripts, modules, paths.
- Run a small verification case:
- Confirm numerical correctness and output format.
- Perform scaling tests:
- A few jobs with different core counts and input sizes.
- Design production experiments:
- Single large run, parameter sweep, or pipeline stages.
- Use job arrays and dependencies to launch the experiments.
- As jobs complete:
- Monitor logs.
- Re-run failed jobs after fixing issues.
- Once all runs are done:
- Collect and reduce data.
- Generate plots and summary tables.
- Archive:
- Code version, scripts, inputs, outputs, and key metadata.
While tools and specific commands differ by system, this overall pattern of planning, staging, running, monitoring, and consolidating results is shared by most practical HPC workflows.