
Strong scaling

Intuition for Strong Scaling

Strong scaling describes how the runtime of a fixed total problem changes when you increase the number of processing elements, such as cores, GPUs, or nodes. In strong scaling experiments you do not change the size of the input or the level of resolution. You only change how many resources are used to solve exactly the same problem.

Imagine solving a 3D simulation on a fixed grid, processing a fixed number of images, or factoring a matrix of a fixed size. If you go from 1 core to 2, 4, 8, and so on while the overall problem specification does not change, then you are studying strong scaling.

Strong scaling answers questions like: “If I double the number of cores, how much faster does my program run for this exact case?” or “What is the largest number of cores I can use before adding more no longer helps for this problem size?”

Strong scaling is different from weak scaling, where the focus is on increasing problem size together with resources. That topic is covered elsewhere, so here the emphasis stays on the fixed problem case.

Key Quantities and Basic Formulas

To talk about strong scaling in a precise way, you need a small set of standard quantities.

Let:

  • $P$ be the number of processing elements used,
  • $T(P)$ be the wall clock runtime of the fixed problem on $P$ processing elements,
  • $T(1)$ be the runtime of the same problem on a single processing element.

From these, several measures follow.

The speedup $S(P)$ is defined as the factor by which the runtime is reduced:

$$
S(P) = \frac{T(1)}{T(P)}.
$$

This compares multi-processor performance with the single-processor baseline. In an ideal world, doubling the number of processors halves the runtime. In that case the speedup would grow linearly, $S(P) = P$.

The parallel efficiency $E(P)$ describes how close the program is to this ideal behavior:

$$
E(P) = \frac{S(P)}{P} = \frac{T(1)}{P \, T(P)}.
$$

Parallel efficiency typically lies between 0 and 1. Higher values mean that each extra processing element contributes a large fraction of its theoretical potential. Strong scaling studies pay particular attention to how $E(P)$ behaves as $P$ grows for a fixed problem, because this tells you how quickly the code runs into overheads and serial limits.

Strong scaling definitions

  • Speedup for a fixed problem:
    $$
    S(P) = \frac{T(1)}{T(P)}.
    $$
  • Parallel efficiency for a fixed problem:
    $$
    E(P) = \frac{S(P)}{P} = \frac{T(1)}{P \, T(P)}.
    $$

When you perform strong scaling experiments on a real system, you usually measure $T(P)$ for several values of $P$, compute $S(P)$ and $E(P)$, and then interpret how these change as you add more resources.
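
As a minimal illustration, the following Python sketch computes speedup and efficiency from a table of measured runtimes; the timing values are placeholders, not real measurements.

```python
# Measured wall clock times T(P) in seconds for a fixed problem
# (placeholder values for illustration).
timings = {1: 1000.0, 2: 510.0, 4: 265.0, 8: 145.0, 16: 90.0}

t1 = timings[1]  # single-processor baseline T(1)

for p in sorted(timings):
    speedup = t1 / timings[p]   # S(P) = T(1) / T(P)
    efficiency = speedup / p    # E(P) = S(P) / P
    print(f"P={p:3d}  T={timings[p]:7.1f} s  S={speedup:5.2f}  E={efficiency:4.2f}")
```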

Ideal Strong Scaling and Practical Limits

In the ideal case, every part of the computation can be perfectly divided across processing elements, and there is no communication, synchronization, or idle time. For a fixed problem in such a perfect scenario, the runtime is

$$
T_{\text{ideal}}(P) = \frac{T(1)}{P},
$$

giving $S(P) = P$ and $E(P) = 1$ exactly. This behavior is called linear strong scaling.

Real programs do not behave like this. There are at least three fundamental reasons.

First, some portion of the computation is inherently serial. For example, reading the input, performing a global reduction, or updating a shared data structure might be impossible to fully parallelize. The existence of this serial fraction means that the runtime cannot keep decreasing in proportion to $1/P$ indefinitely.

Second, parallel execution itself introduces overheads. Processes or threads must exchange data, wait for each other at synchronization points, and sometimes run into contention on shared resources such as memory or communication links. These effects typically increase with $P$, and at some point the cost of managing the parallel work dominates any additional benefit.

Third, the mapping between the problem and the hardware may become unfriendly as $P$ grows. For instance, a fixed mesh may not divide evenly among many ranks, or the work per process may become so small that overheads which were negligible at larger per-process workloads start to dominate.

Strong scaling experiments reveal the point where adding more resources leads to diminishing returns. Often the speedup curve starts close to linear, then bends downward. Eventually, extra cores have little effect, and in some cases the runtime may even increase because overheads grow faster than useful computation.

Amdahl’s Law in the Strong Scaling Context

Amdahl’s Law provides a simple model that connects the serial fraction of a program with the best strong scaling behavior you can possibly achieve.

Suppose a fraction $f$ of the program is strictly serial, and the remaining fraction $1 - f$ is perfectly parallelizable. For a fixed problem size, the runtime on $P$ processing elements can be written as

$$
T(P) = T(1) \left( f + \frac{1 - f}{P} \right).
$$

From this, the speedup is

$$
S(P) = \frac{T(1)}{T(P)} = \frac{1}{f + \frac{1 - f}{P}}.
$$

If you imagine $P$ growing without bound, the parallel part becomes arbitrarily fast, and Amdahl’s Law gives a limiting speedup

$$
\lim_{P \to \infty} S(P) = \frac{1}{f}.
$$

This means that even a small serial fraction can severely limit strong scaling. For example, if $f = 0.05$, then the maximum possible speedup is $1 / 0.05 = 20$, no matter how many resources are used.

Amdahl’s Law for strong scaling
For a fixed problem with serial fraction $f$:

  • Runtime:
    $$
    T(P) = T(1) \left( f + \frac{1 - f}{P} \right).
    $$
  • Speedup:
    $$
    S(P) = \frac{1}{f + \frac{1 - f}{P}}.
    $$
  • Maximum possible speedup:
    $$
    S_{\max} = \frac{1}{f}.
    $$

In practice, the effective serial fraction includes not only program regions that are logically serial, but also parallel overheads that do not shrink as you add more processing elements. Thus, Amdahl’s Law is not just about the algorithm, but also how it is implemented and how it uses the hardware.
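
To make the numbers concrete, here is a small Python sketch that evaluates Amdahl's model for the serial fraction $f = 0.05$ used in the example above; the predicted speedup visibly approaches the ceiling of $1/f = 20$.

```python
def amdahl_speedup(p, f):
    """Speedup predicted by Amdahl's Law for serial fraction f on p processing elements."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.05  # serial fraction from the example above
for p in [1, 2, 4, 8, 16, 64, 256, 1024]:
    print(f"P={p:5d}  S={amdahl_speedup(p, f):6.2f}  (limit 1/f = {1 / f:.0f})")
```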

How to Perform a Strong Scaling Experiment

Strong scaling is an experimental concept as much as a theoretical one. To study it in practice, you design and run a series of controlled tests.

First, you choose a specific problem configuration that remains fixed for all runs. This might be a particular input dataset, a defined grid size, a specific matrix dimension, or a set number of time steps. It is important that the total work done is exactly the same so that variations in runtime can be attributed to changing $P$ alone.

Second, you select a range of processor counts $P$ to test. These should typically span at least an order of magnitude when possible, such as 1, 2, 4, 8, 16, 32, and so on. You do not need to test every possible value, but you should cover both small and large values of $P$ relative to your expected usage.

Third, for each chosen $P$ you run the program, measure the total wall clock runtime $T(P)$ for the entire computation, and record it. For reliability, you may repeat each run several times, then use the median or average runtime to smooth out noise from system variability.

Fourth, you compute $S(P)$ and $E(P)$ for each data point using the definitions given earlier. This can be done in a script or spreadsheet. You then create plots of speedup versus $P$ and efficiency versus $P$. Speedup is typically plotted on a linear or logarithmic scale alongside the ideal line $S(P) = P$, while efficiency is plotted directly as a function of $P$.

Finally, you interpret the results. You identify the region where scaling is close to ideal, where it starts to degrade, and where further increases in $P$ are no longer beneficial. You can also attempt to estimate an effective serial fraction by fitting Amdahl’s Law to the measured speedup curve.
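
One way to carry out that last step is a least-squares fit of Amdahl's model to the measured speedups. The sketch below uses SciPy's curve_fit; the measured values are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

# Processor counts and measured speedups S(P) (placeholder values).
procs = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
speedups = np.array([1.00, 1.96, 3.77, 6.96, 11.9, 18.3, 24.6])

def amdahl(p, f):
    # Amdahl's Law: S(P) = 1 / (f + (1 - f) / P)
    return 1.0 / (f + (1.0 - f) / p)

# Fit the effective serial fraction f, constrained to [0, 1].
(f_fit,), _ = curve_fit(amdahl, procs, speedups, p0=[0.05], bounds=(0.0, 1.0))
print(f"effective serial fraction f = {f_fit:.3f}, "
      f"implied maximum speedup = {1.0 / f_fit:.1f}")
```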

Strong scaling experiments are most meaningful when the baseline single-processor run is feasible. If $T(1)$ is too long to measure in practice, you can use a different reference such as $T(P_{0})$ for some baseline processor count $P_{0}$, and redefine speedup relative to that. The interpretation is slightly different, but the basic shape of the strong scaling behavior remains useful.
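
One common convention for such a relative baseline is

$$
S_{P_0}(P) = \frac{T(P_0)}{T(P)},
\qquad
E_{P_0}(P) = \frac{P_0 \, T(P_0)}{P \, T(P)},
$$

so that $S_{P_0}(P_0) = 1$ and the ideal curve becomes $S_{P_0}(P) = P / P_0$ rather than $S(P) = P$.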

Recognizing Strong Scaling Limits

As you increase $P$ for a fixed problem, at some point you encounter strong scaling limits. These appear in the data and in your experience using the system in several characteristic ways.

The first sign is that the reduction in runtime per additional processing element becomes smaller. For instance, the runtime may halve when going from 1 to 2 processors, and nearly halve again when going to 4, but give only a marginal gain when going from 32 to 64. In terms of speedup, the curve begins to bend downward away from the ideal line.

Parallel efficiency provides a quantitative view of this behavior. If $E(P)$ remains close to 1 over a range of $P$, then the code strong scales well there. As $P$ grows further, $E(P)$ usually declines. When it falls below some threshold, such as 0.5 or 0.25, additional resources are often not worth the cost.
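
Applied to measured data, such a threshold rule might look like the following sketch, where both the threshold and the timings are illustrative choices.

```python
def largest_efficient_p(timings, threshold=0.5):
    """Largest P whose parallel efficiency E(P) stays at or above the threshold.

    timings maps processor count P to measured runtime T(P) and must
    include P = 1 as the baseline.
    """
    t1 = timings[1]
    return max(p for p, t in timings.items() if t1 / (p * t) >= threshold)

timings = {1: 1000.0, 2: 510.0, 4: 265.0, 8: 145.0, 16: 90.0, 32: 66.0}
print(largest_efficient_p(timings))  # -> 16 for these placeholder values
```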

Another sign arises from the internal behavior of the code. If profiling shows that more time is spent in communication, synchronization, or I/O than in actual computation as $P$ increases, then these overheads are limiting strong scaling. Contention for memory bandwidth or shared caches can also appear as a plateau or even increase in runtime despite using more cores per node.

A further limit is imposed by granularity. If the amount of work per process or thread becomes too small, then fixed costs such as thread creation, message setup, or entering and leaving parallel regions dominate. In that regime, increasing parallelism creates more overhead than useful work.

When you encounter these limits, strong scaling results guide choices about job sizes. You can select a processor count that lies in the region of good efficiency, rather than simply requesting as many cores or nodes as the system allows.

Strong Scaling, Cost, and Resource Usage

In a shared HPC environment, strong scaling is closely linked to efficient resource usage. Faster is not always better if it requires a disproportionate increase in resource consumption.

The basic idea is to compare the total cost in processing time, usually approximated as the product $P \, T(P)$, for different choices of $P$. From the definition of parallel efficiency,

$$
E(P) = \frac{T(1)}{P \, T(P)},
$$

you can rearrange to obtain

$$
P \, T(P) = \frac{T(1)}{E(P)}.
$$

If $E(P)$ is close to 1, then the total processing time is close to $T(1)$. As efficiency drops, the total cost in CPU-hours or core-hours rises. For example, an efficiency of 0.5 implies that you are consuming twice as much total processing time as the single processor baseline, even if your job completes faster in wall clock time.
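
The same comparison in a short sketch, again with placeholder timings:

```python
timings = {1: 1000.0, 16: 90.0, 64: 35.0}  # illustrative T(P) in seconds

t1 = timings[1]
for p, t in sorted(timings.items()):
    cost = p * t  # total processing time P * T(P) in core-seconds
    print(f"P={p:3d}  wall={t:7.1f} s  cost={cost:7.1f}  "
          f"cost relative to T(1): {cost / t1:.2f}x")
```

At $P = 64$ in this made-up data, the job runs almost 29 times faster but consumes about 2.2 times the core-seconds of the baseline.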

This tradeoff matters for queue policies and project budgets. Strong scaling experiments allow you to identify a point where the reduction in wall clock time justifies the increased total cost. For some applications, finishing as quickly as possible is critical, so sacrificing efficiency is acceptable. For others, especially large production runs in limited allocations, you may choose a $P$ where efficiency remains high.

Strong scaling characteristics also influence fair usage. If a job gains only a small reduction in runtime when moving from 128 to 512 cores, while consuming four times as many resources, it may be fairer to keep the job at 128 cores and leave the remaining cores for other users.

Algorithmic and Implementation Factors That Influence Strong Scaling

Strong scaling performance is not only a property of the hardware. It also depends strongly on how an algorithm is structured and how it is implemented.

Algorithms with a high ratio of computation to communication, and that can be decomposed into roughly equal independent tasks, tend to strong scale well. For example, certain stencil computations on structured grids or embarrassingly parallel workloads often show near linear speedup up to quite large $P$ for a fixed problem.

In contrast, algorithms that require frequent global communication, have irregular data dependencies, or involve serial bottlenecks such as sequential pivoting in some solvers, may exhibit poor strong scaling beyond modest processor counts. For these codes, the effective serial fraction in the sense of Amdahl’s Law is large, even if much of the code is conceptually parallel.

Implementation choices can either reduce or aggravate these effects. For example, minimizing unnecessary synchronization points, aggregating small messages into larger ones, using nonblocking communication, and reducing the frequency of global reductions can all help extend the useful strong scaling range.

Load balance is another crucial factor. Even if an algorithm is parallel in principle, if some processing elements receive more work than others, the faster ones will sit idle waiting for the slowest to finish. For a fixed problem, increasing $P$ makes such imbalance more visible, because each unit of extra work has a larger relative impact on runtime. Careful domain decomposition and dynamic load balancing strategies can improve strong scaling in such cases.

These algorithmic and implementation aspects are addressed in more detail in other chapters, but from the perspective of strong scaling, they can be summarized as ways of reducing the effective serial fraction and overhead, thereby improving $S(P)$ and $E(P)$ over a wider range of $P$.

Practical Guidelines for Using Strong Scaling Results

Once you understand and measure strong scaling, you can use it to make better decisions about how to run codes on an HPC system.

One practical use is to select an appropriate processor count for a given problem and time requirement. If you know that your application’s efficiency is high up to 64 cores but declines sharply beyond that, you might choose 64 cores for routine runs, and only use more when you truly need faster turnaround.

Strong scaling results are also valuable when planning production campaigns. If you must run many independent jobs on identical inputs or on similarly sized problems, you can estimate total resource consumption and completion time more accurately using measured $T(P)$ and $E(P)$. This helps in scheduling and in satisfying allocation limits.
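
A rough estimate of such a campaign could be scripted as below; the job count, queue concurrency, and per-job runtime are all assumed placeholder numbers.

```python
n_jobs = 200        # independent jobs in the campaign (assumed)
p = 64              # cores per job
t_p = 2.5           # measured wall clock hours per job at this P (placeholder)
concurrent = 10     # jobs the scheduler lets us run at once (assumed)

core_hours = n_jobs * p * t_p                # total allocation consumed
campaign_wall = (n_jobs / concurrent) * t_p  # elapsed time for the whole campaign
print(f"total cost: {core_hours:.0f} core-hours, "
      f"campaign wall time: {campaign_wall:.0f} hours")
```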

Another use is in performance tuning. If you observe that speedup flattens very early, this indicates significant serial or overhead components. This can guide profiling efforts: you know that the strong scaling limit is not just a hardware issue, but may be improved by focusing on particular routines, communication patterns, or synchronization behavior.

Finally, strong scaling behavior can inform discussions with system administrators and support staff. Presenting measured speedup and efficiency curves can make it easier to justify the need for a certain number of cores or nodes, or to explain why particular jobs are configured in a certain way.

By treating strong scaling as a concrete, measurable property of your code and problem, instead of an abstract idea, you gain a practical tool for matching your workload to the HPC resources available, and for recognizing when further optimization or redesign is necessary.
