Overview
Parallel computing is the use of many computing elements at the same time to solve a problem. Instead of running one instruction after another on a single core, a parallel program coordinates multiple activities so that parts of the work happen concurrently.
In high performance computing this is the central idea. Clusters, accelerators, and parallel programming models all exist to support some form of parallel execution. This chapter gives you a conceptual map of what “parallel” actually means for algorithms, performance, and practical programming, without going into the details of specific APIs, which are covered later.
Thinking in Parallel
A useful way to think about a parallel program is to separate three aspects: what work needs to be done, how that work can be divided into pieces that could happen at the same time, and how those pieces must communicate or synchronize.
The first step is to identify independent or mostly independent operations. If two operations do not depend on each other’s results, they are candidates to run in parallel. For example, computing the sum of each row of a large matrix can be split by rows, because the sum of row 1 does not depend on the sum of row 2.
The second step is to define a mapping from these independent work units to parallel “workers.” Depending on the programming model, these workers might be threads on a shared memory node, processes with separate memory spaces, or GPU threads on an accelerator. The details of how threads or processes are created and scheduled belong to later chapters, but the abstract idea is that you have many workers executing instructions in overlapping time intervals.
The third step is to handle the parts of the problem that are not independent. Wherever one piece of work needs data from another, you need communication or synchronization. This might be explicit message passing, shared variables with locks, or higher level constructs. The cost of this coordination is crucial for performance and will show up in the scaling behavior of your code.
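To make these three steps concrete, the sketch below uses Python's standard library purely for illustration; the specific API is not the point, and the function names and worker count are made up for this example. It splits the row sums of a matrix into blocks of rows, maps the blocks onto a small pool of worker processes, and gathers the partial results at the end.

```python
# A minimal sketch of decompose / map / combine, using Python's standard
# library only for illustration; real HPC codes would use MPI, OpenMP, etc.
from concurrent.futures import ProcessPoolExecutor

def row_sums(rows):
    """Work unit: sum each row in a contiguous block of rows."""
    return [sum(row) for row in rows]

def parallel_row_sums(matrix, num_workers=4):
    # Step 1: the rows are independent, so each block of rows is a work unit.
    block = (len(matrix) + num_workers - 1) // num_workers
    blocks = [matrix[i:i + block] for i in range(0, len(matrix), block)]

    # Step 2: map the work units onto parallel workers.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        partial = pool.map(row_sums, blocks)

    # Step 3: the only coordination needed is gathering the partial results.
    return [s for part in partial for s in part]

if __name__ == "__main__":
    matrix = [[i + j for j in range(1000)] for i in range(1000)]
    print(parallel_row_sums(matrix)[:5])
```

The only coordination needed here is the final gather, because the row sums are completely independent; most real problems require more communication than this.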
Parallel Decomposition
Parallel decomposition is the systematic process of breaking a problem into pieces suitable for parallel execution. There are many possible decompositions of the same problem, and the choice affects both performance and complexity.
A typical pattern is to start from the data structures of your problem and ask how to divide them. For example, if you work with a 2D grid in a simulation, you might divide the grid into blocks or strips and assign each block to a parallel worker. In contrast, if your computation is a pipeline of distinct stages such as preprocessing, analysis, and output, you might decompose by tasks rather than data.
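As a small sketch of the grid example, assuming a row-wise strip decomposition, the helper below computes which rows each worker owns; the grid size, worker count, and function name are arbitrary choices for illustration.

```python
# Sketch of a data decomposition: divide a 2D grid into row strips,
# one strip per worker. Only the partitioning logic is shown; the
# per-strip computation would go elsewhere.
def strip_decomposition(nrows, num_workers):
    """Return (start, end) row ranges, one per worker, as evenly as possible."""
    base, extra = divmod(nrows, num_workers)
    strips, start = [], 0
    for w in range(num_workers):
        rows = base + (1 if w < extra else 0)   # spread the remainder rows
        strips.append((start, start + rows))
        start += rows
    return strips

print(strip_decomposition(nrows=10, num_workers=4))
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Block or column decompositions follow the same pattern with different index arithmetic, and a task decomposition would instead assign whole pipeline stages to workers.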
Good decompositions balance several goals at once. You want a large amount of work that is inherently parallel. You want each worker to have a similar amount of work, so no worker is idle while others are overloaded. You want to minimize the volume and frequency of data exchange among workers, because communication is slow compared to local computation. You also want to minimize synchronization points where many workers are forced to wait for the slowest one.
Decomposition is often an iterative design activity. You start with a simple partitioning, measure performance, and then adjust. In early stages you might deliberately choose a simpler decomposition that is easier to reason about, even if it is not yet optimal.
Parallel Overheads
Parallel execution introduces costs that do not exist, or are negligible, in purely serial code. These are often referred to collectively as parallel overheads.
One form of overhead is the cost of creating and managing parallel workers. Starting processes, launching threads, and handling context switches all take time. For short computations this overhead can dominate and make parallel execution slower than serial.
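The toy measurement below, a sketch using Python processes only as an illustration, shows this effect: the per-task work is so small that worker startup and data movement typically outweigh any gain from running in parallel. The exact numbers depend entirely on the machine.

```python
# Rough sketch showing how worker-management overhead can dominate tiny tasks.
import time
from concurrent.futures import ProcessPoolExecutor

def tiny_task(x):
    return x * x          # far too little work to amortize process startup

if __name__ == "__main__":
    data = list(range(1000))

    t0 = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    t1 = time.perf_counter()

    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(tiny_task, data))
    t2 = time.perf_counter()

    print(f"serial:   {t1 - t0:.4f} s")
    print(f"parallel: {t2 - t1:.4f} s  (often slower: startup and data-transfer overhead)")
```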
Another major overhead is communication, which costs both latency and bandwidth. Moving data between cores, sockets, or nodes consumes cycles and may require the data to traverse several levels of the memory hierarchy or the network. The further the data has to travel, the more expensive the communication.
Synchronization is a related cost. Whenever workers must coordinate, such as waiting at a barrier or acquiring and releasing locks, some of them may sit idle. Even if the synchronization primitive itself is fast, imbalance in the work causes the faster workers to wait for the slowest one.
Finally, there is overhead associated with increased complexity. Parallel algorithms usually require more code to set up data distribution, manage boundaries between partitions, and handle corner cases. This complexity can lead to inefficiencies and bugs that are not present in the serial version.
Understanding and minimizing these overheads is essential for effective parallel computing. The ideal of “N workers give N times speedup” is rarely reached, and overheads are a key reason.
Concurrency versus Parallelism
Concurrency and parallelism are related but distinct concepts that often get mixed together.
Concurrency describes a structure in a program where multiple tasks are logically in progress at the same time. These tasks may interleave on a single processor or run simultaneously on multiple processors. Concurrency is about dealing with many things at once conceptually, such as handling multiple I/O requests or serving many network clients, regardless of whether you have hardware support to execute them in true parallel.
Parallelism is about actually using multiple hardware resources at the same time to perform computation. In parallel computing, you aim for different operations to execute physically at the same time on different cores or nodes in order to reduce total execution time.
In HPC you usually care primarily about parallelism for performance, but the program structure will often be concurrent in the broader sense. For example, an application might overlap communication and computation. While one part of the code is waiting for data from another node, another part is performing useful calculations. This is a concurrent design, and if the hardware has enough resources, it becomes parallel execution.
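A minimal sketch of this overlap might look like the following, with the remote transfer simulated by a sleep; in a real application the background activity would be a network or MPI receive, and the function names here are invented for the example.

```python
# Sketch of overlapping "communication" with computation using a background
# thread. The remote fetch is simulated with a sleep.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote_data():
    time.sleep(1.0)            # stands in for waiting on data from another node
    return [1, 2, 3]

def local_computation():
    return sum(i * i for i in range(10_000_000))

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fetch_remote_data)   # start the "communication"
    partial = local_computation()             # do useful work while waiting
    remote = future.result()                  # synchronize only when the data is needed

print(partial, remote)
```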
It is helpful to keep the distinction in mind. Some programming models and tools focus on concurrency semantics and safety, while others focus on maximizing parallel throughput.
Performance and Speedup
The most common way to quantify the benefit of parallelization is speedup. Speedup compares the time a computation takes in serial to the time it takes in parallel.
If $T_1$ is the execution time of the best serial implementation using one processing element, and $T_p$ is the execution time using $p$ processing elements, then the speedup $S_p$ is defined as
$$
S_p = \frac{T_1}{T_p}.
$$
If you double the number of processing elements and keep the problem size fixed, you would like the runtime to halve and the speedup to double. This ideal case $S_p = p$ is called linear speedup.
In practice, speedup is usually sublinear. Communication and synchronization, along with the inherently serial parts of the code, prevent perfect scaling. Sometimes speedup can even be superlinear, where $S_p > p$. This can happen when the larger aggregate cache or memory capacity of multiple processors allows more efficient execution than is possible on a single processor.
Another way to look at performance is efficiency. Parallel efficiency $E_p$ measures how effectively the processing elements are used and is defined as
$$
E_p = \frac{S_p}{p} = \frac{T_1}{p T_p}.
$$
An efficiency of 1 would mean perfect use of all processors. Real applications typically achieve significantly less, especially at large processor counts.
Key performance definitions:
Speedup:
$$S_p = \frac{T_1}{T_p}.$$
Efficiency:
$$E_p = \frac{S_p}{p} = \frac{T_1}{p T_p}.$$
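Applied directly, these definitions turn measured timings into speedup and efficiency numbers; the timings in the sketch below are hypothetical.

```python
# Direct application of the definitions above: speedup S_p = T1 / Tp
# and efficiency E_p = S_p / p, for measured timings.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

# Hypothetical measurements: 100 s serial, 30 s on 4 processing elements.
print(speedup(100.0, 30.0))        # ~3.33
print(efficiency(100.0, 30.0, 4))  # ~0.83
```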
These quantities will reappear when you analyze strong and weak scaling, and when you apply Amdahl’s Law and Gustafson’s Law.
Scalability
Scalability describes how well an application maintains or improves performance as more resources are added. It is not enough that an application runs faster on some number of processors; you also want to know how this behavior changes as you move to larger and larger systems.
Two important perspectives on scalability, strong scaling and weak scaling, are discussed in their own sections. At a conceptual level, strong scaling asks what happens to execution time when you increase the number of processors but keep the total problem fixed. Weak scaling asks what happens when you increase the problem size proportionally with the number of processors.
A scalable parallel program has a speedup that grows reasonably with the number of processors over the range of interest. As you increase $p$, some limiting factors become more prominent. The serial fraction of the code starts to dominate, communication paths become crowded, and load imbalance grows. Eventually, adding more processors yields little or no benefit, and may even hurt performance.
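A simple scaling experiment, sketched below with an arbitrary workload and worker counts, is often the first step in assessing this behavior: run the same fixed-size problem with an increasing number of workers and watch how the speedup develops.

```python
# Sketch of a fixed-size scaling experiment. The workload and worker
# counts are arbitrary choices for illustration.
import time
from concurrent.futures import ProcessPoolExecutor

def work(task_id):
    # task_id is ignored; every task does the same fixed amount of work
    return sum(i * i for i in range(200_000))

if __name__ == "__main__":
    tasks = list(range(64))
    baseline = None
    for p in (1, 2, 4, 8):
        t0 = time.perf_counter()
        with ProcessPoolExecutor(max_workers=p) as pool:
            list(pool.map(work, tasks))
        elapsed = time.perf_counter() - t0
        baseline = baseline or elapsed
        print(f"p={p}: {elapsed:.2f} s, speedup {baseline / elapsed:.2f}")
```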
Scalability is a property not just of the algorithm, but of the combination of algorithm, implementation, and hardware. The same algorithm might scale well on one architecture and poorly on another due to differences in memory bandwidth, network topology, or cache sizes.
Dependencies and Parallel Limits
The structure of data and control dependencies in a program determines how much of it can be parallelized in principle. A dependency is a relationship that requires one operation to happen before another. Dependencies can be data dependencies, where one operation needs the result of another, or control dependencies, where the path of execution depends on a prior condition.
You can visualize dependencies with a directed acyclic graph, where nodes represent computations and edges represent dependencies. In such a graph, any set of nodes that have no unsatisfied incoming edges at a given time can, in principle, be executed in parallel. The longer the critical path, that is, the longest chain of dependencies from start to finish in this graph, the more limited the maximum speedup.
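The sketch below illustrates this with a tiny, made-up dependency graph: at each step, every task whose dependencies are satisfied could run in parallel, and the number of steps is set by the critical path.

```python
# Sketch: group the nodes of a dependency DAG into "levels" that could run
# in parallel. The graph is a made-up example; each entry maps a task to
# the set of tasks it depends on.
from graphlib import TopologicalSorter

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())   # all tasks with no unsatisfied dependencies
    print("can run in parallel:", ready)
    ts.done(*ready)
# can run in parallel: ['a']
# can run in parallel: ['b', 'c']
# can run in parallel: ['d']
```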
A key conceptual insight is that the serial part of the program places an upper bound on possible speedup. Even if you have infinitely many processors, the serial fraction of the work must still be done sequentially and becomes the bottleneck. This idea is formalized in Amdahl’s Law, covered in a later chapter, which relates speedup to the fraction of the program that can be parallelized.
In practice, dependencies also shape communication patterns. If data must flow from one part of the domain to another at each step, then those parts must coordinate at regular intervals. This constrains how far you can reduce communication frequency and how much computation you can perform independently between synchronizations.
Communication, Synchronization, and Contention
Parallel workers must sometimes communicate intermediate results, share access to common data, or coordinate their progress. These interactions are usually called communication and synchronization.
Communication involves sending data from one worker to another or to a group. In distributed memory systems, this takes the form of explicit messages over a network. In shared memory systems, it may happen implicitly through loads and stores to shared variables. Regardless of the mechanism, communication incurs latency and consumes bandwidth.
Synchronization establishes order when access to data or phases of computation must be controlled. Examples include barriers that force all workers to reach a certain point before any can proceed, and mutual exclusion constructs that prevent two workers from updating the same data at the same time.
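The sketch below shows a barrier in action with a handful of threads and artificial, unequal amounts of work; the worker count and sleep times are made up, and the point is that the fast workers reach the barrier early and then simply wait.

```python
# Sketch of a synchronization point: a barrier forces every worker to finish
# phase 1 before any worker starts phase 2.
import threading
import time

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)

def worker(rank):
    time.sleep(0.1 * rank)               # unequal "phase 1" work
    print(f"worker {rank} reached the barrier")
    barrier.wait()                       # idle here until all workers arrive
    print(f"worker {rank} starts phase 2")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```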
Contention arises when several workers attempt to use a shared resource concurrently. This can be a shared variable, a communication link, a memory bank, or an I/O channel. Contention typically results in waiting and degraded performance. In extreme cases it can cause pathological slowdowns that negate most of the benefit of parallelism.
The art of effective parallel programming involves arranging communication and synchronization to be as infrequent and as local as possible, while still respecting correctness. It also involves structuring data and computation to reduce contention, for example by partitioning data so that each worker primarily accesses its own portion.
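As a sketch of that last point, the example below builds a histogram using per-worker private counters that are merged once at the end, instead of having every worker update one shared structure under a lock; the data, worker count, and function name are arbitrary.

```python
# Sketch of reducing contention: each worker fills a private histogram and
# the results are combined in a single merge step.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_histogram(chunk):
    return Counter(chunk)          # private to this worker: no lock needed

data = [i % 10 for i in range(1_000_000)]
num_workers = 4
chunk = len(data) // num_workers
chunks = [data[i * chunk:(i + 1) * chunk] for i in range(num_workers)]

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    partials = pool.map(local_histogram, chunks)

total = Counter()
for part in partials:
    total.update(part)             # one combine step instead of constant sharing
print(total[0])
```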
Granularity of Parallelism
Granularity refers to the amount of work done by each parallel unit before it needs to communicate or synchronize. Coarse-grained parallelism uses relatively large chunks of work per worker, while fine-grained parallelism uses very small chunks.
If the granularity is too fine, overheads from creating tasks, scheduling, and communication dominate the useful work. For instance, if each task performs only a few arithmetic operations before synchronizing with others, most of the execution time will be spent in overhead.
On the other hand, very coarse granularity can lead to poor load balancing. If some workers receive significantly more work than others, the faster ones finish early and sit idle while the slowest worker determines the total runtime. This is a common issue in irregular problems where the amount of work per data element is not uniform.
Finding an appropriate granularity is therefore a trade off. You want tasks large enough to amortize overhead, but small enough to distribute work evenly and to adapt to dynamic behavior. Different levels of the system can have different notions of granularity, from vector instructions on a single core to tasks assigned to nodes in a cluster.
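One concrete knob for granularity, sketched below with arbitrary parameters, is the chunk size used when handing work to a pool: tiny chunks maximize scheduling and communication overhead per unit of useful work, while very large chunks risk leaving some workers idle near the end.

```python
# Sketch of granularity as a tunable knob: the same work is submitted with
# different chunk sizes, and the timings show the trade-off on a given machine.
import time
from concurrent.futures import ProcessPoolExecutor

def work(x):
    return sum(i * i for i in range(x % 1000))

if __name__ == "__main__":
    data = list(range(100_000))
    for chunksize in (1, 100, 10_000):
        t0 = time.perf_counter()
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(work, data, chunksize=chunksize))
        print(f"chunksize {chunksize:>6}: {time.perf_counter() - t0:.3f} s")
```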
Nonideal Behaviors in Parallel Programs
Parallel programs often exhibit nonideal behaviors that reduce performance or even cause incorrect results. Some of these are performance pathologies, others are logical errors.
One important nonideal behavior is load imbalance, which has its own chapter. When some workers have more work than others, overall performance is limited by the slowest worker. This effect becomes more pronounced as you increase the number of workers.
Another behavior is unnecessary serialization, where design or implementation choices introduce serial bottlenecks. Examples include central coordination points, global locks protecting large data structures, or serial I/O phases that force all workers to wait.
Contention for shared resources can also manifest as nonideal scaling. Multiple workers trying to access the same memory region, file, or network link can slow each other down. Even if the algorithm is theoretically parallel, the way it uses hardware can limit performance.
On the correctness side, race conditions and deadlocks are common problems in parallel programs, and they are covered elsewhere. Even when the program does not crash, subtle races can corrupt data and lead to results that appear plausible but are wrong.
Recognizing these patterns helps diagnose why a scaling curve flattens or why performance suddenly degrades beyond a certain number of processors. Profiling tools and performance models can guide you in identifying which parts of the code are responsible.
Algorithmic Reformulation for Parallelism
Many serial algorithms rely heavily on sequential dependencies, which limits their parallel potential. To achieve good parallel performance, it is often necessary to reformulate the algorithm itself, not just parallelize an existing serial implementation.
Algorithmic reformulation might change data structures to reduce dependencies, replace inherently sequential methods with more parallel friendly ones, or restructure computation so that more of it can proceed concurrently. For example, you might replace a recursive algorithm that processes elements one by one with an iterative algorithm that operates on batches of elements in parallel.
Sometimes you trade more total work for better parallelism. An algorithm that does slightly more arithmetic operations overall, but allows most of them to be done independently, can be faster on parallel hardware than an algorithm that is work optimal in serial but has a long sequential dependency chain.
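A classic illustration is the prefix sum, sketched below. The serial loop has a dependency chain as long as the input, while the step-doubling formulation performs more additions in total but needs only about $\log_2 n$ dependent passes, and within each pass every update is independent and could be executed in parallel.

```python
# Trading extra work for parallelism: two formulations of the prefix sum.
def prefix_sum_serial(values):
    out, running = [], 0
    for v in values:               # each iteration depends on the previous one
        running += v
        out.append(running)
    return out

def prefix_sum_step_doubling(values):
    out = list(values)
    step = 1
    while step < len(out):
        # All updates in this pass read only the previous pass, so they are
        # mutually independent: this is where a parallel runtime could help.
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out

data = [1, 2, 3, 4, 5, 6, 7, 8]
assert prefix_sum_serial(data) == prefix_sum_step_doubling(data)
print(prefix_sum_step_doubling(data))   # [1, 3, 6, 10, 15, 21, 28, 36]
```

Whether the extra work pays off depends on the hardware and the problem size; the point is the change in dependency structure, not this particular implementation.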
This perspective emphasizes that parallel computing is not only about using parallel constructs in code. It is also about choosing algorithms and formulations that expose parallel structure and minimize the impact of dependencies, communication, and synchronization.
Summary
Parallel computing concepts provide the foundation for everything else in this course. You have seen how to think in terms of parallel decomposition, what parallel overheads look like, and how speedup and efficiency quantify performance. You have also seen that dependencies and communication patterns impose hard limits on what parallelism can achieve, and that algorithm design is central to overcoming these limitations.
The following chapters will explore specific forms of parallelism such as task and data parallelism, examine scaling behavior through strong and weak scaling, and introduce formal models like Amdahl’s and Gustafson’s Laws. Later sections on programming models and performance analysis will turn these conceptual ideas into concrete skills you can apply to real parallel applications.