From Concepts to Daily Practice
This chapter connects the ideas from earlier in the course to what you actually do when using an HPC system. Instead of focusing on theory, we’ll walk through what typical work looks like, which skills matter most in practice, and how the different parts of the stack come together.
You will see a few concrete “patterns” of use that appear across science, engineering, and industry. Later chapters will zoom in on specific workflows and case studies.
Typical Roles and Use Cases
Different people come to HPC with different goals. The practical workflows depend strongly on your role:
- Domain scientists/engineers
- Primary focus: physics, chemistry, biology, climate, finance, etc.
- Typical activities: run existing codes, adjust input parameters, analyze results.
- HPC skills needed: basic Linux, job submission, managing data and results, simple performance awareness.
- Research software engineers (RSEs) / developers
- Primary focus: build and maintain scientific software.
- Typical activities: write and optimize code, integrate libraries, debug performance, support users.
- HPC skills needed: compilers, build systems, parallel programming models, profiling, continuous integration (CI).
- Data scientists / ML engineers on HPC
- Primary focus: training and deploying large models, large-scale inference.
- Typical activities: manage large datasets, orchestrate GPU-heavy jobs, tune batch sizes and parallelism strategies.
- HPC skills needed: job schedulers, GPU usage, containers, workflow tools.
- System administrators / support staff
- Primary focus: operate the cluster, support users, maintain software stacks.
- Typical activities: configure schedulers, install software, monitor utilization, assist with performance issues.
- HPC skills needed: deep OS, networking, storage, and scheduler knowledge (beyond the scope of this course, but important context).
Most beginners start as “domain users” but often grow skills in scripting, performance, and software engineering as their projects scale up.
The Life Cycle of an HPC Project
Most real-world HPC work follows a repeating cycle, regardless of discipline:
- Define the problem
- What are you trying to compute or simulate?
- What resolution, time span, or dataset size do you need?
- What accuracy or uncertainty is acceptable?
- Map the problem to tools and methods
- Choose: existing application, library, or custom code.
- Decide: CPU vs GPU, shared-memory vs distributed-memory, or hybrid.
- Identify constraints: memory requirements, time-to-solution, deadlines, budget/allocations.
- Prototype and test at small scale
- Run small test cases interactively or with small batch jobs.
- Ensure correctness, basic performance, and stable behavior.
- Explore parameter ranges and estimate resource needs.
- Plan production runs
- Decide how many nodes/cores/GPUs to request.
- Choose walltime limits and queue/partition.
- Determine input/output sizes and where data will live.
- Run at scale
- Submit batch jobs.
- Monitor for failures, stalls, or performance drops.
- Capture logs and metadata (code version, input parameters, environment).
- Analyze and post-process
- Reduce large outputs, extract key metrics or visualizations.
- Possibly run separate analysis jobs on the cluster.
- Save only what you truly need long-term.
- Refine and iterate
- Adjust parameters, model, or code.
- Optimize performance if runtime or cost is too high.
- Repeat until you reach scientific or business conclusions.
- Document and preserve
- Record how you ran the computations (scripts, job files, configs).
- Save versions of software environments.
- Prepare for publication, reproducibility, or handover to collaborators.
Understanding where you are in this cycle helps you decide which HPC tools and skills matter most at any given moment.
Practical Working Patterns on a Cluster
Although every site is slightly different, many daily patterns are shared.
Interactive vs Batch Work
In practice, you will switch between:
- Interactive work
- On login nodes or in short-lived interactive jobs.
- Used for:
- Editing code and scripts.
- Testing small examples.
- Exploring file systems and modules.
- Running quick checks and debugging small problems.
- Batch work
- Submitted to the scheduler with job scripts.
- Used for:
- Long or resource-intensive runs.
- Large parameter sweeps.
- Production simulations and large model training.
A typical day might involve: logging in, making small changes interactively, then launching multiple batch jobs that run for hours or days.
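For concreteness, here is roughly what the two modes look like on a Slurm-based cluster (a sketch only; the scheduler, partition name, and script names are site-specific assumptions):

    # Interactive: a short session on a compute node for testing and debugging.
    srun --partition=debug --time=00:30:00 --ntasks=1 --pty bash

    # Batch: submit a job script to the queue and check on it later.
    sbatch run_simulation.sh
    squeue -u $USER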
The “Experiment Folder” Pattern
For reproducible work and sanity, many practitioners organize their projects like this:
- A project root directory with:
- src/ – source code (if you develop your own).
- scripts/ – job scripts, helper scripts.
- inputs/ – canonical input decks/templates.
- configs/ – YAML/JSON/INI parameter files.
- results/ – organized output data.
- logs/ – scheduler logs, application logs, error messages.
- env/ or env_files/ – environment module lists, container recipes, etc.
Each “experiment” or run might live under results/experiment_001/, results/experiment_002/, and so on, each containing (as sketched in the example below):
- A copy of the job script.
- The exact input file.
- Output data and logs.
This structure directly supports later questions like:
- “Which code version produced this dataset?”
- “What resources did I request for this run?”
- “Can I re-run this with slightly different parameters?”
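A minimal shell sketch of this pattern, assuming the directory layout above (all file and directory names are illustrative):

    #!/bin/bash
    # Usage: ./new_experiment.sh 3   -> creates results/experiment_003/
    set -euo pipefail

    exp="results/experiment_$(printf '%03d' "$1")"
    mkdir -p "$exp"

    # Snapshot exactly what produced this run.
    cp scripts/job.sh  "$exp/"
    cp inputs/case.inp "$exp/"
    git rev-parse HEAD > "$exp/code_version.txt" 2>/dev/null || true
    module list > "$exp/modules.txt" 2>&1 || true

    echo "Created $exp"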
Parameter Sweeps and Ensembles
A very common pattern in HPC practice:
- You have a model or simulation.
- You want to run it for many parameter combinations, seeds, or initial conditions.
- Instead of changing the code each time, you:
- Write a script that:
- Generates parameter files or command-line arguments.
- Submits many jobs or a job array.
- Store each run’s results in a structured directory hierarchy.
This can quickly multiply to hundreds or thousands of jobs. In practice, that means:
- Thinking about scheduler load and limits.
- Using job arrays rather than individual submissions where possible (see the sketch after this list).
- Automating naming, logging, and cleanup.
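A minimal sketch of this pattern as a Slurm job array, assuming one parameter combination per line of a params.txt file and a hypothetical run_model executable (directory and file names are illustrative, and logs/ is assumed to exist):

    #!/bin/bash
    #SBATCH --job-name=sweep
    #SBATCH --array=1-100                      # one task per parameter set
    #SBATCH --time=02:00:00
    #SBATCH --output=logs/sweep_%A_%a.out      # %A = array job ID, %a = task index
    set -euo pipefail

    # Each array task picks up its own line of the parameter file.
    params=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

    outdir="results/sweep/run_${SLURM_ARRAY_TASK_ID}"
    mkdir -p "$outdir"

    ./run_model $params --output "$outdir"

One submission creates all the tasks, which is gentler on the scheduler than a hundred separate sbatch calls and keeps naming and logging uniform.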
Practical Skills That Matter Most
From an “HPC in practice” perspective, some skills are disproportionately valuable for beginners:
Working Efficiently with the Scheduler
Beyond basic submission, you will often:
- Use job arrays to manage many similar tasks.
- Chain jobs with dependencies (e.g., run analysis only after all simulations finish).
- Plan around queue policies:
- Maximum walltime.
- Priority/limits per user or project.
- Special queues for debug, GPU, big-memory, etc.
Learning to express your computational plan in terms of scheduler features is central to effective HPC use.
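For example, on a Slurm system a simple dependency chain might look like this (a sketch; the script names are placeholders):

    # Submit the simulations (here as a job array), capture the job ID,
    # then queue an analysis job that starts only if all tasks succeed.
    sim_id=$(sbatch --parsable simulate.sh)
    sbatch --dependency=afterok:${sim_id} analyze.sh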
Lightweight Automation and Scripting
You don’t need to be a professional software engineer, but:
- Basic shell scripting (e.g., bash) is extremely helpful.
- Simple Python scripts are common for:
- Generating input files.
- Submitting multiple jobs.
- Parsing logs and aggregating results.
- Treating recurring tasks as scripts rather than manual steps (see the log-aggregation sketch below):
- Reduces errors.
- Improves reproducibility.
- Makes it easier to scale from 1 run to 1,000 runs.
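As a small example of turning a recurring task into a script, the following shell sketch pulls a timing line out of each run's log and aggregates the values into one CSV (the log layout, field position, and file names are assumptions about a hypothetical application):

    #!/bin/bash
    # Collect "Total wall time ... <seconds>" lines from every run's log.
    echo "run,seconds" > timings.csv
    for log in results/*/app.log; do
        run=$(basename "$(dirname "$log")")
        secs=$(awk '/Total wall time/ {print $4}' "$log")
        echo "${run},${secs}" >> timings.csv
    done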
Handling Large Outputs
In practice, storage is always limited. Typical strategies:
- Reduce early and often
- Rather than saving full 3D fields at every time step, save:
- Coarsened resolutions.
- Derived metrics (e.g., averages, fluxes).
- Checkpoint files at less frequent intervals.
- Separate “raw” and “processed” data
- Raw simulation output may be:
- Very large.
- Expensive to store long-term.
- Processed data (figures, statistics) is smaller and more portable.
- Prefer standard formats
- Use machine-readable, self-describing formats where your tools allow it.
- This pays off when sharing with collaborators or using new analysis software.
Typical HPC Workflows (High-Level)
Later subsections will detail specific workflows; here is how they generally look.
Workflow Type 1: Running Established Simulation Codes
Many users never write large parallel codes themselves. Instead, they:
- Choose an established simulation package (e.g., for CFD, structural mechanics, climate, molecular dynamics).
- Learn its specifics:
- Input file structure.
- Compilation or module loading.
- Recommended job sizes.
- Build a template job script (like the sketch below) that:
- Requests appropriate resources.
- Loads the right modules or container.
- Launches the application with correct MPI/OpenMP options.
- Copy and adapt that template for new simulations.
In practice, this is highly productive: the main skill is translating your scientific question into inputs for the existing code and understanding how to run it efficiently.
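A template job script for such a code might look roughly like this on a Slurm system (module names, the executable, resource sizes, and the input file are all placeholders to adapt to your package and site):

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=32
    #SBATCH --time=12:00:00
    #SBATCH --partition=standard          # site-specific queue
    #SBATCH --output=logs/%x_%j.out       # job name and job ID in the log name

    # Reproducible environment: start clean, then load what the code needs
    # (or start a container instead).
    module purge
    module load my_simulation_code

    # Launch the MPI application across all allocated tasks.
    srun ./simulation.exe input.deck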
Workflow Type 2: Data-Intensive or ML Workloads
For data- or ML-heavy work on an HPC cluster, typical steps are:
- Stage data to a location with good I/O performance (often a parallel filesystem).
- Prepare environment:
- Modules or containers with ML frameworks.
- Correct CUDA/driver versions for GPUs.
- Prototype on a single GPU or a small node allocation.
- Scale up:
- Distributed training across multiple GPUs/nodes.
- Possibly using built-in distributed features of ML frameworks.
- Use the scheduler to:
- Reserve many GPUs for fixed times (within policy limits).
- Chain pre-processing, training, and evaluation jobs.
The practical emphasis is on resource availability, queue policies, and managing data locality.
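As a rough sketch, a single-GPU prototype job under Slurm might look like this (the partition, module, paths, and training script are assumptions; scaling up then adds more GPUs/nodes plus the ML framework's own distributed-launch tooling):

    #!/bin/bash
    #SBATCH --partition=gpu               # site-specific GPU queue
    #SBATCH --gres=gpu:1                  # one GPU is enough for prototyping
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00

    module load pytorch                   # or a container with the ML framework
    python train.py --data /scratch/$USER/dataset --epochs 5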
Workflow Type 3: Custom Code Development and Optimization
When you develop your own HPC code:
- Develop locally or interactively
- Test core logic on small data.
- Ensure correctness first.
- Integrate parallelism and libraries
- Add MPI/OpenMP/CUDA as needed.
- Link to numerical libraries rather than writing everything yourself.
- Use the cluster for scaling tests
- Submit small jobs with increasing core/GPU counts.
- Collect timing information and scaling behavior.
- Profile and optimize in cycles
- Use profiling tools to identify bottlenecks.
- Apply optimizations.
- Re-test and compare performance metrics.
The practical challenge is balancing time spent on correctness, performance, and new features under real project deadlines.
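A scaling study often reduces to a small submission loop like the sketch below, assuming a hypothetical scaling_job.sh that runs the same test case with however many tasks the scheduler grants it:

    #!/bin/bash
    # Submit the same case at increasing MPI rank counts; command-line
    # options override the #SBATCH directives inside scaling_job.sh.
    for ntasks in 16 32 64 128 256; do
        sbatch --ntasks=${ntasks} \
               --job-name=scale_${ntasks} \
               --output=logs/scale_${ntasks}_%j.out \
               scaling_job.sh
    done

Comparing the resulting timings against the smallest run gives a first picture of where scaling flattens out.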
Collaboration and Shared Use in Practice
HPC systems are multi-user environments. This shapes how you work:
- Allocations and quotas
- You often share compute time and storage with a research group or project.
- Planning is required to avoid exhausting quotas or monopolizing queues.
- Shared scripts and templates
- Groups often maintain:
- Standard job scripts.
- Shared pipelines or workflow definitions.
- Common analysis tools.
- Version control and code sharing
- Git repositories (local or hosted) are widely used to:
- Track changes in code and scripts.
- Share between collaborators.
- Tag versions used in specific publications or reports.
- Support channels
- Practical help often comes from:
- Local documentation and “best practice” guides.
- User support tickets and consultation hours.
- Mailing lists, chat, or forums.
Efficient HPC practice includes knowing when to seek help and how to provide reproducible examples for support staff.
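For instance, tagging the exact code version used for a set of production runs takes only a couple of commands (the tag name is illustrative):

    # Mark and share the exact version behind a publication or report.
    git tag -a paper-runs-v1 -m "Code version used for the production runs"
    git push origin paper-runs-v1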
Risk Management and Robustness in Real Use
When runs take hours or days, and use hundreds of cores or GPUs, robustness matters:
- Checkpoints
- Use application-level checkpointing if available.
- Align checkpoint frequency with walltime limits and failure risks.
- Defensive job scripts
- Check for required input files and directories before launching.
- Avoid unintentionally overwriting existing results.
- Log environment information (e.g., module list, Git commit hash).
- Incremental scaling
- Do not jump directly from a tiny test to the largest possible job.
- Step through increasing sizes to catch problems early:
- Memory usage.
- I/O contention.
- Load imbalance or scaling limits.
- Monitoring and alerts
- Use tools provided by your site (web portals, CLI tools).
- For long projects, simple notification scripts (e.g., an email or Slack message on job completion or failure) can be valuable.
These habits save substantial time and compute resources over the lifetime of a project.
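The sketch below combines several of these habits into one defensive job script preamble; the paths, input names, and the executable's --output flag are illustrative assumptions:

    #!/bin/bash
    #SBATCH --time=24:00:00
    #SBATCH --output=logs/%x_%j.out
    set -euo pipefail

    rundir="results/experiment_007"

    # Fail fast on missing inputs; refuse to clobber existing results.
    if [[ ! -f inputs/case.inp ]]; then
        echo "Missing inputs/case.inp" >&2; exit 1
    fi
    if [[ -e "$rundir" ]]; then
        echo "$rundir already exists; refusing to overwrite" >&2; exit 1
    fi
    mkdir -p "$rundir"

    # Record the environment for later reproducibility.
    git rev-parse HEAD > "$rundir/commit.txt" 2>/dev/null || true
    module list > "$rundir/modules.txt" 2>&1 || true
    env > "$rundir/environment.txt"

    srun ./simulation.exe inputs/case.inp --output "$rundir"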
Integrating HPC into the Broader Research/Engineering Process
In actual practice, HPC is not an isolated activity; it is one part of a larger process:
- Upstream
- Experimental design or study planning.
- Selection of models and input parameters.
- Data collection or preprocessing.
- Downstream
- Statistical analysis, visualization, and interpretation.
- Comparison with experiments or benchmarks.
- Writing reports, papers, or decision memos.
Effective HPC users:
- Align cluster work with project milestones and deadlines.
- Communicate resource needs and timelines to collaborators.
- Keep enough documentation that others (or their future selves) can understand and reproduce what was done.
Putting It All Together
“HPC in practice” is less about mastering every low-level detail and more about:
- Translating domain problems into computational workflows.
- Using clusters respectfully and efficiently, within shared environments.
- Building simple, robust, and repeatable processes.
- Iterating between small tests and large-scale runs.
- Capturing enough context that results are interpretable and repeatable.
The subsequent subsections in this chapter will walk through concrete examples of typical HPC workflows, coding practices, and real-world case studies that illustrate these principles in different domains.