
Parallel Computing Concepts

Learning goals for this chapter

By the end of this chapter you should be able to:

What is parallel computing (in the HPC context)?

Parallel computing means using multiple computing resources at the same time to solve one problem faster, handle larger problems, or both.

These resources can be:

Conceptually, parallel computing answers two questions:

  1. Decomposition: How can I break my problem into pieces that could be done at the same time?
  2. Coordination: How do these pieces communicate, synchronize, and share results?
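
As a small illustration of both questions, consider summing a large array: decomposition splits the array into chunks whose partial sums can be computed independently, and coordination combines the partial results. The sketch below uses Python's standard multiprocessing module purely for illustration; the chunk layout and worker count are arbitrary choices, not recommendations.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Decomposition: each worker sums its own chunk independently,
    # with no communication needed during the computation.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4  # arbitrary choice for this sketch

    # Split the data into one chunk per worker; the last chunk
    # absorbs any remainder.
    chunk_size = len(data) // n_workers
    chunks = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_workers)]
    chunks[-1].extend(data[n_workers * chunk_size:])

    # Coordination: collect the partial results and combine them.
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)
    total = sum(partials)

    assert total == sum(data)
```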

The rest of this chapter gives you vocabulary and mental models to think about those two questions.

Performance basics: speedup, efficiency, and overhead

Parallel computing is almost always motivated by performance. To talk about performance quantitatively, HPC practitioners rely on three simple concepts: speedup, parallel efficiency, and overhead.

Speedup

Speedup measures how much faster a parallel version is compared to a reference serial (or less-parallel) version.

If $T_1$ is the execution time using one processing element (often 1 core) and $T_p$ is the time using $p$ processing elements, then

$$
S_p = \frac{T_1}{T_p}
$$

Interpretation:

Parallel efficiency

Efficiency normalizes speedup by the number of resources:

$$
E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p}
$$

Efficiency is especially important in HPC because compute time and energy are expensive. You may be able to use many nodes, but if efficiency is very low, you are wasting resources.
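
A minimal sketch of how these two metrics are computed from measured runtimes; the timings below are invented purely to show the arithmetic.

```python
def speedup(t1, tp):
    # S_p = T_1 / T_p
    return t1 / tp

def efficiency(t1, tp, p):
    # E_p = S_p / p = T_1 / (p * T_p)
    return speedup(t1, tp) / p

# Hypothetical measured runtimes in seconds (illustrative values only).
t1 = 120.0                                       # 1 processing element
timings = {2: 62.0, 4: 33.0, 8: 19.0, 16: 12.5}  # p processing elements

for p, tp in timings.items():
    print(f"p={p:2d}  S_p={speedup(t1, tp):5.2f}  E_p={efficiency(t1, tp, p):4.2f}")
```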

Sources of overhead

Anything that does not contribute directly to useful computation is overhead. Common overheads in parallel programs:

Parallel programming is mostly about reducing or hiding these overheads.

Types of parallelism: how work is divided

Many later chapters focus on specific programming models (OpenMP, MPI, CUDA, etc.). Here we focus on conceptual types of parallelism so you can recognize patterns in algorithms.

Task parallelism

In task parallelism (also called functional parallelism), different workers perform different operations or tasks, possibly on different data.
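
A minimal sketch of this pattern, using a thread pool from Python's standard library: different workers run different (placeholder) functions at the same time. The function names are invented for illustration; in a real workflow they would be distinct pipeline stages, and CPU-bound stages would typically use processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder tasks standing in for different workflow stages.
def load_data():
    return list(range(10))

def compute_checksum():
    return sum(range(100))

def render_report():
    return "report ready"

with ThreadPoolExecutor(max_workers=3) as pool:
    # Different workers perform *different* operations concurrently.
    futures = [pool.submit(load_data),
               pool.submit(compute_checksum),
               pool.submit(render_report)]
    results = [f.result() for f in futures]

print(results)
```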

Examples:

Characteristics:

Key design questions:

Data parallelism

In data parallelism, many workers perform the same operation on different pieces of data, often in lockstep.
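
A minimal sketch of this pattern with a process pool applying one function to many elements; the function and data are placeholders, and the chunk size is an arbitrary choice.

```python
from multiprocessing import Pool

def scale(x):
    # The *same* operation is applied to every data element.
    return 2.0 * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(4) as pool:
        # Each worker applies the identical function to different
        # elements; chunksize controls the grain of the decomposition.
        result = pool.map(scale, data, chunksize=10_000)
    print(result[:5])
```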

Examples:

Characteristics:

Key design questions:

Combined patterns

Real applications often combine task and data parallelism, for example:

Recognizing whether a piece of your application is task- or data-parallel (or both) helps you choose appropriate models (e.g., MPI processes, OpenMP tasks, GPU kernels) later.

Scaling: strong vs weak

Scaling describes how performance changes as more resources are used. Two common scenarios are strong scaling and weak scaling.

Strong scaling

Strong scaling answers:

If I keep the total problem size fixed and increase the number of processing elements, how does runtime change?

Ideal strong scaling:

Reality:

Typical use case:

Weak scaling

Weak scaling answers:

If I increase the problem size proportionally to the number of processing elements, can I keep the runtime about the same?

Ideal weak scaling:

Reality:

Typical use case:

Scaling plots and practical interpretation

You will commonly see:

Rules of thumb:
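
A small sketch of how the numbers behind such plots are typically computed from measured runtimes (the timings below are invented for illustration): for strong scaling, efficiency is $T_1 / (p \, T_p)$ at fixed problem size; for weak scaling, it is $T_1 / T_p$ with the problem size grown in proportion to $p$.

```python
# Hypothetical runtimes in seconds (illustrative only).
strong = {1: 400.0, 2: 210.0, 4: 115.0, 8: 70.0}   # fixed total problem size
weak   = {1: 100.0, 2: 104.0, 4: 112.0, 8: 125.0}  # problem size grows with p

print("strong scaling efficiency, E_p = T_1 / (p * T_p):")
for p, tp in strong.items():
    print(f"  p={p}: {strong[1] / (p * tp):.2f}")

print("weak scaling efficiency, E_p = T_1 / T_p:")
for p, tp in weak.items():
    print(f"  p={p}: {weak[1] / tp:.2f}")
```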

Amdahl’s Law: limits of parallelism in fixed problems

Amdahl’s Law provides a simple model of the maximum possible speedup for a fixed-size problem when only part of the code can be parallelized.

Let:

  • $f$ be the fraction of the execution time that is inherently serial (cannot be parallelized), with $0 \le f \le 1$,
  • $1 - f$ be the fraction that can be parallelized, and
  • $p$ be the number of processing elements.

Assuming perfect parallelization of the parallel part and no overhead:

$$
S_p = \frac{1}{f + \frac{1 - f}{p}}
$$

Important implications:

If $p \to \infty$, then:
$$
S_{\max} = \frac{1}{f}
$$
Even a small serial fraction caps speedup.
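
A quick numerical illustration of this cap, assuming a serial fraction of 5%:

```python
def amdahl_speedup(f, p):
    # Amdahl's Law: S_p = 1 / (f + (1 - f) / p)
    return 1.0 / (f + (1.0 - f) / p)

f = 0.05  # assumed serial fraction (5%)
for p in (1, 8, 64, 512, 4096):
    print(f"p={p:5d}  S_p={amdahl_speedup(f, p):6.2f}")
# As p grows, S_p approaches 1/f = 20, no matter how many
# processing elements are added.
```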

How Amdahl’s Law is used in practice:

Limitations:

Gustafson’s Law: scaling problem size with resources

Gustafson’s Law addresses a different situation: in HPC, we often want to solve larger problems as more resources become available, not just the same problem faster.

Assume:

  • on $p$ processing elements, a fraction $f$ of the runtime is spent in serial work and the remaining fraction $1 - f$ in perfectly parallel work, and
  • the problem size is grown with $p$ so that the total runtime stays roughly constant.

Gustafson’s Law gives the scaled speedup:

$$
S_p = f + (1 - f)p
$$
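
A short numerical sketch, again assuming a serial fraction of 5% of the runtime on the parallel machine:

```python
def gustafson_speedup(f, p):
    # Gustafson's Law: scaled speedup S_p = f + (1 - f) * p
    return f + (1.0 - f) * p

f = 0.05  # assumed serial fraction of runtime on p processing elements
for p in (1, 8, 64, 512, 4096):
    print(f"p={p:5d}  scaled S_p={gustafson_speedup(f, p):8.1f}")
# The scaled speedup grows roughly linearly with p, because the
# parallel part of the (larger) problem grows with p.
```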

Interpretation:

This matches how many large HPC systems are used:

Contrasting with Amdahl:

In practice, both perspectives are useful:

Load balancing: keeping workers busy

Load balancing is about distributing work so that all workers are kept usefully busy as much of the time as possible.

If some workers are idle while others are overloaded, you lose efficiency even if the total amount of work is well matched to the total resources.
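
One way to see the effect is to contrast a static split of unevenly sized tasks with dynamic scheduling from a shared work queue. The sketch below simulates this with sleep-based tasks whose costs are invented for illustration; the expected outcome is that the static split leaves one straggler worker holding all the expensive tasks.

```python
from concurrent.futures import ProcessPoolExecutor
import time

def work(cost):
    # Simulated task whose runtime differs strongly between tasks.
    time.sleep(cost * 0.01)
    return cost

if __name__ == "__main__":
    # Deliberately imbalanced workload: four expensive tasks, many cheap ones.
    tasks = [50, 50, 50, 50] + [1] * 12

    with ProcessPoolExecutor(max_workers=4) as pool:
        # Static decomposition: large contiguous chunks assigned up front,
        # so one worker receives all four expensive tasks and becomes
        # the straggler everyone else waits for.
        t0 = time.perf_counter()
        list(pool.map(work, tasks, chunksize=4))
        t_static = time.perf_counter() - t0

        # Dynamic scheduling: one task at a time from a shared queue,
        # so idle workers keep picking up whatever work remains.
        t0 = time.perf_counter()
        list(pool.map(work, tasks, chunksize=1))
        t_dynamic = time.perf_counter() - t0

    print(f"static chunks : {t_static:.2f} s")
    print(f"dynamic queue : {t_dynamic:.2f} s")
```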

Sources of imbalance:

Strategies to improve balancing:

Trade-offs:

Common bottlenecks in parallel programs

Beyond the formal laws and definitions, there are some recurring patterns that limit real-world parallel performance.

Serialization points

Even if most of your code is parallel, certain operations can force serialization:

These effectively contribute to the serial fraction $f$ in Amdahl’s model.
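
A tiny sketch of how a single critical section serializes otherwise parallel work; the threaded counter below is chosen only to make the point visible, not as a realistic workload.

```python
import threading

counter = 0
lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        # Every thread must acquire the same lock for every update,
        # so this critical section executes one thread at a time and
        # behaves like a serial portion of the program.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment_many, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # correct (400000), but the updates ran serially
```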

Communication and synchronization overhead

In distributed-memory and hybrid systems:

Bandwidth and latency of the interconnect become important. Algorithms that require frequent all-to-all communication typically scale poorly.
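
A common first-order way to reason about communication cost is a latency-bandwidth model for a single message; the parameter values below are placeholders, not measurements of any particular interconnect.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    # First-order model: T(message) = latency + size / bandwidth
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Placeholder parameters (illustrative only).
latency = 2e-6      # 2 microseconds per message
bandwidth = 10e9    # 10 GB/s

for size in (8, 1_024, 1_048_576):
    t = message_time(size, latency, bandwidth)
    print(f"{size:>9d} bytes -> {t * 1e6:8.2f} microseconds")
# Small messages are dominated by latency; large ones by bandwidth.
```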

Memory and data locality

As you increase parallelism:

These issues often appear as performance drops when you go from a few cores to many cores on the same node or across nodes.

How to think about parallelizing an application

When approaching a new problem with parallel computing in mind, a systematic thought process helps:

  1. Identify the core computational kernel(s):
    • Where is most of the time spent?
    • Are these loops, matrix operations, iterations over elements, etc.?
  2. Classify potential parallelism:
    • Is the main opportunity data parallel (many similar operations on large data sets)?
    • Is it task parallel (many independent jobs or workflow stages)?
  3. Estimate serial and parallel fractions:
    • Roughly, what parts must execute in sequence?
    • How will that constrain speedup (Amdahl) and what larger problems might you want to solve (Gustafson)?
  4. Consider scaling goals:
    • Do you care about strong scaling (finish same problem faster) or weak scaling (solve bigger problem in same time)?
  5. Anticipate load imbalance and communication:
    • Will some tasks or regions be more expensive than others?
    • Where will data need to move, and how often?
  6. Choose appropriate models and tools:
    • Shared-memory vs distributed-memory vs hybrid.
    • Higher-level libraries that already exploit parallelism (e.g., numerical libraries).

Later chapters on OpenMP, MPI, hybrid programming, GPUs, and performance analysis will give you concrete mechanisms and tools. This chapter’s concepts provide the foundation for understanding why those mechanisms work and how to reason about their effectiveness.
