Learning goals for this chapter
By the end of this chapter you should be able to:
- Explain why parallel computing is used in HPC and what limits its benefits.
- Distinguish between key types of parallelism used in practice.
- Interpret scaling plots and basic performance metrics.
- Use simple formulas (Amdahl’s and Gustafson’s laws) to reason about speedup.
- Recognize common bottlenecks like load imbalance and communication overhead.
What is parallel computing (in the HPC context)?
Parallel computing means using multiple computing resources at the same time to solve one problem faster, handle larger problems, or both.
These resources can be:
- Multiple cores inside a single CPU.
- Multiple CPUs in a node.
- Multiple nodes in a cluster.
- GPUs and other accelerators.
Conceptually, parallel computing answers two questions:
- Decomposition: How can I break my problem into pieces that could be done at the same time?
- Coordination: How do these pieces communicate, synchronize, and share results?
The rest of this chapter gives you vocabulary and mental models to think about those two questions.
Performance basics: speedup, efficiency, and overhead
Parallel computing is almost always motivated by performance. To talk about performance quantitatively, three simple concepts are used frequently in HPC: speedup, parallel efficiency, and overhead.
Speedup
Speedup measures how much faster a parallel version is compared to a reference serial (or less-parallel) version.
If $T_1$ is the execution time using one processing element (often 1 core) and $T_p$ is the time using $p$ processing elements, then
$$
S_p = \frac{T_1}{T_p}
$$
Interpretation:
- $S_p \approx p$: linear speedup (ideal, rarely achieved at large $p$).
- $S_p < p$: sublinear speedup (typical).
- $S_p > p$: superlinear speedup (unusual; can occur due to, e.g., cache effects).
Parallel efficiency
Efficiency normalizes speedup by the number of resources:
$$
E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p}
$$
- $E_p = 1$: perfect efficiency (every worker is 100% useful).
- $E_p < 1$: some fraction of time is “wasted” on overhead, idle time, etc.
Efficiency is especially important in HPC because computer time and energy are expensive. You may be able to use many nodes, but if efficiency is very low, you are wasting resources.
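As a minimal sketch (the timings below are invented for illustration, not measurements from any real system), the snippet computes speedup and efficiency directly from the definitions above:

```python
# Speedup and parallel efficiency from the definitions above.
# The timings are invented illustrative numbers, not real measurements.

def speedup(t_serial, t_parallel):
    """S_p = T_1 / T_p."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E_p = S_p / p = T_1 / (p * T_p)."""
    return speedup(t_serial, t_parallel) / p

t1 = 100.0  # hypothetical runtime on 1 core, in seconds
for p, tp in [(2, 52.0), (4, 28.0), (8, 16.0), (16, 11.0)]:
    s = speedup(t1, tp)
    e = efficiency(t1, tp, p)
    print(f"p={p:2d}  T_p={tp:5.1f}s  S_p={s:5.2f}  E_p={e:.2f}")
```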
Sources of overhead
Anything that does not contribute directly to useful computation is overhead. Common overheads in parallel programs:
- Communication: exchanging data between threads or processes.
- Synchronization: waiting at barriers, locks, or for messages to arrive.
- Load imbalance: some workers finish early and sit idle.
- Startup costs: launching many processes/threads, initializing large data.
Parallel programming is mostly about reducing or hiding these overheads.
Types of parallelism: how work is divided
Many later chapters focus on specific programming models (OpenMP, MPI, CUDA, etc.). Here we focus on conceptual types of parallelism so you can recognize patterns in algorithms.
Task parallelism
In task parallelism (also called functional parallelism), different workers perform different operations or tasks, possibly on different data.
Examples:
- A simulation pipeline where one stage generates input, another runs a solver, and another post-processes results, all in a streaming fashion.
- Independent parameter sweeps: each worker runs the same code but with different input parameters; the work units (tasks) are logically independent.
- A multithreaded program where one thread handles I/O, another handles computation, another handles visualization.
Characteristics:
- Tasks may be heterogeneous and vary in cost.
- Tasks may communicate infrequently or not at all.
- Good for workflows with many independent or loosely coupled steps.
Key design questions:
- How do you identify independent tasks?
- How do you schedule them to keep all workers busy?
- What happens when tasks take very different amounts of time?
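The sketch below shows the parameter-sweep flavor of task parallelism; `run_case` and its parameters are hypothetical placeholders for whatever your own tasks do, and the pool size is arbitrary:

```python
# Task parallelism sketch: a parameter sweep over independent cases.
# `run_case` is a hypothetical stand-in for your own simulation or solver.
from concurrent.futures import ProcessPoolExecutor

def run_case(params):
    """Pretend "simulation": each task depends only on its own parameters."""
    resolution, viscosity = params
    # ... real work would go here ...
    return (params, resolution * viscosity)  # dummy result

if __name__ == "__main__":
    cases = [(n, v) for n in (64, 128, 256) for v in (0.1, 0.01)]
    # Each case is an independent task; the pool schedules them onto workers.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for params, result in pool.map(run_case, cases):
            print(params, "->", result)
```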
Data parallelism
In data parallelism, many workers perform the same operation on different pieces of data, often in lockstep.
Examples:
- Applying the same formula to each element of a large array.
- Updating each grid cell in a computational fluid dynamics simulation independently (subject to local neighbor data).
- Performing matrix–vector or matrix–matrix multiplication.
Characteristics:
- The main structure is: “for each element (or block of elements), do this.”
- Communication is often local (e.g., neighbor exchanges) or batched.
- Maps naturally to SIMD/vectorization and to GPU architectures.
Key design questions:
- How do you partition the data (blocks, cyclic patterns, etc.)?
- How much communication is needed between partitions?
- Are the operations on each partition truly independent or only partially so?
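A minimal data-parallel sketch, assuming NumPy is available (the array size and formula are arbitrary): the same formula is applied to every element, and a simple block decomposition shows how the data could be split among workers.

```python
# Data parallelism sketch: the same operation applied to every element.
# NumPy applies the update element-wise, much like SIMD lanes or GPU threads.
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_000)  # a large array of input data
y = 3.0 * x**2 + 1.0                   # same formula for each element
print("mean of y:", y.mean())

def block_range(n, p, i):
    """Return the (lo, hi) index range of block i out of p near-equal blocks."""
    base, rem = divmod(n, p)
    lo = i * base + min(i, rem)
    hi = lo + base + (1 if i < rem else 0)
    return lo, hi

# Worker i of 4 would own the contiguous slice x[lo:hi].
for i in range(4):
    lo, hi = block_range(len(x), 4, i)
    print(f"worker {i}: elements [{lo}, {hi})")
```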
Combined patterns
Real applications often combine task and data parallelism, for example:
- Running many independent data-parallel simulations (a parameter sweep).
- Decomposing a grid domain (data parallel) and also offloading different solver stages as separate tasks.
Recognizing whether a piece of your application is task- or data-parallel (or both) helps you choose appropriate models (e.g., MPI processes, OpenMP tasks, GPU kernels) later.
Scaling: strong vs weak
Scaling describes how performance changes as more resources are used. Two common scenarios are strong scaling and weak scaling.
Strong scaling
Strong scaling answers:
If I keep the total problem size fixed and increase the number of processing elements, how does runtime change?
Ideal strong scaling:
- Doubling the number of cores halves the runtime.
- So the speedup $S_p$ is proportional to $p$.
Reality:
- Communication, synchronization, and serial regions limit achievable speedup.
- At some point, adding more cores gives little or no benefit (or can even slow things down).
Typical use case:
- You have a fixed-size problem (e.g., a specific simulation resolution) and want to finish it faster.
Weak scaling
Weak scaling answers:
If I increase the problem size proportionally to the number of processing elements, can I keep the runtime about the same?
Ideal weak scaling:
- As you scale from $p$ to $2p$ workers, you double the problem size.
- Each worker still has about the same amount of work.
- Total runtime stays constant.
Reality:
- Communication and global operations often grow with problem size.
- Effective weak scaling is often easier to achieve than strong scaling but still limited.
Typical use case:
- You care more about solving bigger problems (higher resolution, more particles, more scenarios) in about the same time, rather than just finishing a fixed problem faster.
Scaling plots and practical interpretation
You will commonly see:
- Speedup vs. cores for strong scaling.
- Runtime vs. number of cores for weak scaling, with the problem size grown in proportion to the core count.
Rules of thumb:
- Look for where curves flatten: that is where overheads dominate.
- For strong scaling, consider whether running with fewer cores but higher efficiency might be more cost-effective.
- Scaling behavior is often problem-size dependent: small problems may not scale well even if large ones do.
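The sketch below shows how such numbers are typically derived from measured runtimes; the timings are invented purely to illustrate a strong-scaling curve that flattens as overheads grow.

```python
# Turning strong-scaling timings into speedup and efficiency numbers.
# The runtimes are invented to illustrate a typical flattening curve.
times = {1: 512.0, 2: 262.0, 4: 138.0, 8: 76.0, 16: 47.0, 32: 35.0}

t1 = times[1]
for p, tp in sorted(times.items()):
    s = t1 / tp   # speedup S_p
    e = s / p     # efficiency E_p
    print(f"{p:3d} cores: time {tp:6.1f}s  speedup {s:5.2f}  efficiency {e:4.2f}")
# Where efficiency drops sharply, overheads are starting to dominate and
# adding more cores to this fixed-size problem buys little.
```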
Amdahl’s Law: limits of parallelism in fixed problems
Amdahl’s Law provides a simple model of the maximum possible speedup for a fixed-size problem when only part of the code can be parallelized.
Let:
- $f$ be the fraction of the execution time that is serial (cannot be parallelized).
- $(1 - f)$ be the fraction that can be parallelized.
- $p$ be the number of processing elements.
Assuming perfect parallelization of the parallel part and no overhead:
$$
S_p = \frac{1}{f + \frac{1 - f}{p}}
$$
Important implications:
- The serial fraction dominates at large $p$.
If $p \to \infty$, then:
$$
S_{\max} = \frac{1}{f}
$$
Even a small serial fraction caps speedup.
- Example: if 5% of the code is serial ($f = 0.05$), the maximum possible speedup is $S_{\max} = 1 / 0.05 = 20$, no matter how many cores you use.
- Even at moderate $p$, communication, synchronization, and other overheads effectively increase $f$, reducing speedup further.
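A short worked sketch of Amdahl's formula, with serial fractions chosen only for illustration:

```python
# Amdahl's law: S_p = 1 / (f + (1 - f)/p) for a fixed-size problem.

def amdahl_speedup(f, p):
    """Maximum speedup on p workers when a fraction f of the time is serial."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.05, 0.20):   # illustrative serial fractions
    caps = [amdahl_speedup(f, p) for p in (8, 64, 1024)]
    limit = 1.0 / f            # S_max as p -> infinity
    print(f"f={f:.2f}:  p=8 -> {caps[0]:5.2f},  p=64 -> {caps[1]:6.2f},  "
          f"p=1024 -> {caps[2]:7.2f},  limit {limit:6.1f}")
```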
How Amdahl’s Law is used in practice:
- To reason about diminishing returns: there comes a point where adding more cores is not worth the cost.
- To prioritize optimization: reducing the serial portion or moving more work into parallelizable regions can have a big impact.
Limitations:
- It models a fixed-size problem (strong scaling).
- It assumes perfect parallelization of the parallel part and ignores many real-world effects.
Gustafson’s Law: scaling problem size with resources
Gustafson’s Law addresses a different situation: in HPC, we often want to solve larger problems as more resources become available, not just the same problem faster.
Assume:
- $f$ is still the fraction of time spent in the serial part when running on $p$ processors for the scaled-up problem.
- $(1 - f)$ is the fraction of time spent in the parallel part.
Gustafson’s Law gives the scaled speedup:
$$
S_p = f + (1 - f)p
$$
Interpretation:
- Even if part of the program is serial, you can use more processors effectively by increasing the parallel workload (e.g., higher resolution, more time steps, more samples).
- For modest $f$, $S_p$ can grow nearly linearly with $p$ if you keep growing the problem.
This matches how many large HPC systems are used:
- New machines often run problems that were impossible or impractical on older systems (bigger data, finer meshes), not just the same problems faster.
Contrasting with Amdahl:
- Amdahl: For a fixed problem, serial parts limit speedup as $p$ grows.
- Gustafson: For growing problems, we can still gain from larger $p$ by scaling workload.
In practice, both perspectives are useful:
- Amdahl helps you see where your code will stop scaling for a given problem size.
- Gustafson encourages you to think about using extra resources to solve more ambitious problems.
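To make the contrast concrete, the sketch below evaluates both formulas for the same illustrative serial fraction: Amdahl's fixed-size view versus Gustafson's scaled-workload view.

```python
# Amdahl (fixed problem) vs. Gustafson (problem grows with p), same f.

def amdahl(f, p):
    return 1.0 / (f + (1.0 - f) / p)

def gustafson(f, p):
    return f + (1.0 - f) * p

f = 0.05  # illustrative serial fraction
for p in (8, 64, 512):
    print(f"p={p:4d}: Amdahl {amdahl(f, p):6.2f}   Gustafson {gustafson(f, p):7.2f}")
# The fixed-size speedup saturates near 1/f = 20, while the scaled speedup
# keeps growing because the parallel workload grows with p.
```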
Load balancing: keeping workers busy
Load balancing is about distributing work so that all workers are kept usefully busy as much of the time as possible.
If some workers are idle while others are overloaded, you lose efficiency even if the total amount of work is well matched to the total resources.
Sources of imbalance:
- Inherently irregular problems:
  - Some regions of a simulation domain are more expensive to compute (e.g., turbulence, phase transitions).
  - Some tasks naturally take longer (e.g., some parameter combinations converge slowly).
- Static work partitioning that doesn’t match actual cost:
  - Simple block decompositions where some blocks contain “harder” work.
- Work that changes over time:
  - Adaptive mesh refinement, dynamic load in multi-phase simulations, etc.
Strategies to improve balancing:
- Static partitioning with better cost models:
  - Assign work based on estimated cost, not just size.
- Dynamic scheduling:
  - Use work queues, task stealing, or runtime schedulers to reassign tasks to idle workers.
- Domain decomposition techniques:
  - Divide computational domains carefully to equalize work per partition while minimizing communication.
Trade-offs:
- More dynamic balancing often means more overhead (e.g., frequent task reassignment or more metadata).
- Good designs try to minimize imbalance while also limiting coordination cost.
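As a toy illustration of the static-vs-dynamic trade-off (the task costs are random and purely synthetic), the sketch below compares the finish time of a naive static block split against a simple dynamic "next idle worker" schedule:

```python
# Toy load-balancing comparison with synthetic, uneven task costs.
# Static block split vs. greedy dynamic assignment to the least-loaded worker.
import heapq
import random

random.seed(0)
costs = [random.expovariate(1.0) for _ in range(64)]  # uneven task costs
p = 8                                                 # number of workers

# Static: give each worker a contiguous block of tasks (64 splits evenly into 8).
block = len(costs) // p
static_finish = max(sum(costs[i * block:(i + 1) * block]) for i in range(p))

# Dynamic: hand each task to whichever worker becomes idle first (a work queue).
workers = [0.0] * p
heapq.heapify(workers)
for c in costs:
    t = heapq.heappop(workers)       # worker that finishes earliest
    heapq.heappush(workers, t + c)   # it picks up the next task
dynamic_finish = max(workers)

print(f"static makespan : {static_finish:.2f}")
print(f"dynamic makespan: {dynamic_finish:.2f}  (lower = better balanced)")
```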
Common bottlenecks in parallel programs
Beyond the formal laws and definitions, there are some recurring patterns that limit real-world parallel performance.
Serialization points
Even if most of your code is parallel, certain operations can force serialization:
- Global reductions (sums, minima, norms) across all workers.
- Centralized I/O (one process writes all data).
- Single-threaded pre- and post-processing steps.
These effectively contribute to the serial fraction $f$ in Amdahl’s model.
Communication and synchronization overhead
In distributed-memory and hybrid systems:
- Time spent sending/receiving messages or waiting at barriers grows with:
  - The number of processes.
  - The problem size.
  - The algorithm’s communication pattern.
Bandwidth and latency of the interconnect become important. Algorithms that require frequent all-to-all communication typically scale poorly.
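A common back-of-the-envelope model for a single message is $T \approx \text{latency} + \text{size}/\text{bandwidth}$; the sketch below uses placeholder latency and bandwidth values (not the parameters of any specific interconnect) to show why many small messages are expensive.

```python
# Back-of-the-envelope message cost: T ~= latency + bytes / bandwidth.
# The latency and bandwidth values are illustrative placeholders only.
LATENCY = 2e-6       # seconds per message (assumed)
BANDWIDTH = 12.5e9   # bytes per second (assumed)

def message_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

for nbytes in (8, 8_192, 8_388_608):  # tiny, medium, large messages
    t = message_time(nbytes)
    print(f"{nbytes:>9d} bytes: {t * 1e6:8.2f} microseconds")
# Small messages are dominated by latency; large ones by bandwidth.
# Frequent, fine-grained communication therefore pays the latency cost many times.
```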
Memory and data locality
As you increase parallelism:
- Memory bandwidth and cache capacity are shared, so the effective bandwidth and cache available per core shrink.
- Poor data locality (e.g., random access patterns) can dominate runtime.
- On NUMA systems, accessing “remote” memory may be slower than local access.
These issues often appear as performance drops when you go from a few cores to many cores on the same node or across nodes.
How to think about parallelizing an application
When approaching a new problem with parallel computing in mind, a systematic thought process helps:
- Identify the core computational kernel(s):
  - Where is most of the time spent?
  - Are these loops, matrix operations, iterations over elements, etc.?
- Classify potential parallelism:
  - Is the main opportunity data parallel (many similar operations on large data sets)?
  - Is it task parallel (many independent jobs or workflow stages)?
- Estimate serial and parallel fractions:
  - Roughly, what parts must execute in sequence?
  - How will that constrain speedup (Amdahl) and what larger problems might you want to solve (Gustafson)?
- Consider scaling goals:
  - Do you care about strong scaling (finish the same problem faster) or weak scaling (solve a bigger problem in the same time)?
- Anticipate load imbalance and communication:
  - Will some tasks or regions be more expensive than others?
  - Where will data need to move, and how often?
- Choose appropriate models and tools:
  - Shared-memory vs distributed-memory vs hybrid.
  - Higher-level libraries that already exploit parallelism (e.g., numerical libraries).
Later chapters on OpenMP, MPI, hybrid programming, GPUs, and performance analysis will give you concrete mechanisms and tools. This chapter’s concepts provide the foundation for understanding why those mechanisms work and how to reason about their effectiveness.