6.4 Weak scaling

Intuition of Weak Scaling

Weak scaling asks a simple question: if you increase both the problem size and the available computing resources in proportion, can you keep the time to solution roughly constant?

In strong scaling, the total problem size is fixed and you add more processors to make it run faster. In weak scaling, the work per processor is fixed, and when you add more processors you increase the total problem size.

Imagine you have a simulation where each processor handles a fixed number of grid cells. With 1 processor you simulate 1 million cells. With 2 processors you simulate 2 million cells. With 1024 processors you simulate 1024 million cells. If the time per simulation step is about the same for 1 and 1024 processors, your code has good weak scaling.

Weak scaling is especially important for applications that want to solve ever larger problems, rather than just finish a given problem faster. Many large-scale simulations in climate, astrophysics, or materials science care about this behaviour.

Formal Definition and Metrics

The central idea is to keep local work per processing element constant and increase the global work with the number of processing elements.

Let $P$ be the number of processors (or processes or cores, depending on context), and let $n(P)$ be the problem size. For weak scaling, you choose $n(P)$ so that the work per processor is constant. If $W(n)$ is the total amount of work for problem size $n$, then for ideal weak scaling you want
$$
\frac{W(n(P))}{P} = \text{constant}.
$$

A common practical choice is $n(P) = P \cdot n_0$ for some base size $n_0$. For example, you might simulate $n_0$ particles per MPI process, or hold a subdomain of fixed volume per GPU.
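
As a trivial illustration of this rule, the snippet below (names and the base size are placeholders, not from the text) derives the global problem size from a fixed per-rank base size $n_0$.

```python
# Minimal sketch of the scaling rule n(P) = P * n_0 with a fixed per-rank
# workload n0. All names and sizes are illustrative choices.
def global_problem_size(num_ranks: int, n0: int = 1_000_000) -> int:
    """Total problem size for a weak scaling run with num_ranks ranks."""
    return num_ranks * n0

for p in (1, 2, 4, 1024):
    print(p, global_problem_size(p))   # work per rank stays n0 by construction
```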

The key performance quantity that you measure in weak scaling is the time to solution, or the time per iteration or time step, as a function of $P$. Ideally this time is independent of $P$.

You can define weak scaling efficiency, $E_w(P)$, as
$$
E_w(P) = \frac{T(1)}{T(P)},
$$
where $T(P)$ is the time to solution when using $P$ processors and a problem size scaled with $P$ so that work per processor is constant.

Ideal weak scaling yields $T(P) \approx T(1)$, so $E_w(P) \approx 1$. If $T(P)$ increases with $P$, then $E_w(P)$ drops below 1. In contrast to strong scaling, you do not expect a speedup in weak scaling; you expect constant time and therefore constant efficiency.
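
As a concrete illustration, the short sketch below (not part of the original text) computes $E_w(P)$ from a table of measured times; the timing values are invented purely to show the calculation.

```python
# Weak scaling efficiency E_w(P) = T(1) / T(P), computed from measured
# timings. The numbers below are invented purely for illustration.
timings = {1: 1.00, 16: 1.03, 256: 1.10, 1024: 1.20}   # P -> time per step (s)

def weak_efficiency(timings: dict) -> dict:
    t1 = timings[1]
    return {p: t1 / t for p, t in sorted(timings.items())}

for p, e in weak_efficiency(timings).items():
    print(f"P={p:5d}  E_w={e:.2f}")
```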

Comparing Weak and Strong Scaling Behaviour

Strong and weak scaling describe different performance questions, and they respond differently to algorithmic and system features.

In strong scaling, the total problem size is fixed, and you ask how much faster the job runs as you add processors. Strong scaling efficiency is usually defined as
$$
E_s(P) = \frac{T(1)}{P \cdot T(P)},
$$
and ideal strong scaling yields $T(P) = T(1)/P$ so $E_s(P) = 1$.

In weak scaling you keep the work per processor constant. If you double $P$, you double the problem size. Ideal weak scaling keeps $T(P)$ roughly equal to $T(1)$ so $E_w(P) \approx 1$.
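
The following sketch puts the two metrics side by side on invented timings, so the different expectations (falling time for strong scaling, flat time for weak scaling) are visible in the numbers.

```python
# Strong scaling efficiency E_s(P) = T(1) / (P * T(P)) versus weak scaling
# efficiency E_w(P) = T(1) / T(P), evaluated on made-up timings.
strong_T = {1: 100.0, 16: 7.5, 256: 0.9}    # fixed problem: time drops with P
weak_T   = {1: 1.00, 16: 1.05, 256: 1.15}   # scaled problem: time ~ constant

for p in (1, 16, 256):
    E_s = strong_T[1] / (p * strong_T[p])
    E_w = weak_T[1] / weak_T[p]
    print(f"P={p:4d}  E_s={E_s:.2f}  E_w={E_w:.2f}")
```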

Some codes may show poor strong scaling because adding processors increases the relative communication and synchronization overhead for a fixed-sized problem. However, the same codes might show good weak scaling if the per-processor workload remains large enough that communication costs stay small relative to useful computation.

This distinction matters, because claims like “my code scales to 10,000 cores” are incomplete unless you know whether they refer to strong or weak scaling. A code can have excellent weak scaling to huge core counts, even if its strong scaling saturates early.

Practical Setup of Weak Scaling Experiments

When you design a weak scaling test, you must define precisely what “fixed work per processor” means for your application. The exact choice depends on the algorithm and data structures.

For a regular grid simulation, each MPI rank might own a subdomain of size $N_x \times N_y \times N_z$. In a weak scaling test you keep $N_x$, $N_y$, and $N_z$ constant per rank and increase the number of ranks, which increases the total grid size.

For particle-based codes, you might assign a fixed number of particles per rank. As you increase the number of ranks, you generate more particles overall, while preserving the local count.

For linear algebra computations that use domain decomposition, each process might store a fixed number of rows or columns of a matrix. Adding more processes increases the global matrix dimension.
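
As an illustration of the grid case, the sketch below uses mpi4py to derive a global grid from a fixed per-rank subdomain; the $64^3$ subdomain size and the variable names are arbitrary choices, not taken from the text.

```python
# Sketch of "fixed work per rank" for a regular grid code, using mpi4py.
# Each rank owns a subdomain of fixed size; the global grid grows with the
# number of ranks. The local size is an illustrative choice.
from mpi4py import MPI

comm = MPI.COMM_WORLD
nranks = comm.Get_size()

nx, ny, nz = 64, 64, 64                      # fixed local subdomain per rank
px, py, pz = MPI.Compute_dims(nranks, 3)     # factor the ranks into a 3D grid

global_shape = (px * nx, py * ny, pz * nz)   # grows with the number of ranks
if comm.Get_rank() == 0:
    print(f"{nranks} ranks -> global grid {global_shape}, "
          f"{nx * ny * nz} cells per rank")
```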

In practice, a weak scaling experiment usually involves the following conceptual steps. You choose a base problem size that fits easily in memory and runs comfortably on a single processor. You define a rule for increasing the global problem size with $P$ such that the local work or local data volume remains approximately constant. You then run the code for several values of $P$ and measure the time per iteration or per time step, or the overall time to solution, for each run.

To reduce noise, you may run each configuration several times and average the timings. You should ensure that each run does a meaningful amount of work so that startup and I/O overheads do not dominate the measurements.
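
A minimal timing harness along these lines could look like the sketch below, again using mpi4py; the per-rank workload and the stand-in computation are placeholders for the real application.

```python
# Sketch of measuring one weak scaling data point. The slowest rank
# determines the observed time, so a max-reduction is used.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
u = np.random.rand(1_000_000)               # fixed per-rank workload

def do_time_step():
    np.sin(u).sum()                          # stand-in for the real computation

def time_steps(n_steps):
    comm.Barrier()                           # start all ranks together
    t0 = MPI.Wtime()
    for _ in range(n_steps):
        do_time_step()
    local = (MPI.Wtime() - t0) / n_steps
    return comm.allreduce(local, op=MPI.MAX)   # time of the slowest rank

# Repeat the measurement a few times and average to reduce noise.
samples = [time_steps(100) for _ in range(3)]
if comm.Get_rank() == 0:
    print("average time per step:", sum(samples) / len(samples))
```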

Sources of Deviation from Ideal Weak Scaling

Even if you keep the work per processor constant, weak scaling usually degrades as you increase $P$. There are several reasons why the time $T(P)$ can grow with $P$.

Communication overhead often increases. Many parallel algorithms require exchange of data between neighbouring subdomains or all-to-all communication patterns. As the number of processes grows, the number of messages and the distance that messages must travel across the interconnect can increase. Latency and contention can both contribute to extra time.

Synchronization points can become more expensive. Global operations like reductions or barriers, which wait for all processes, can take longer as more processes must participate. In addition, variability in per-process performance can create idle time.

The communication-to-computation ratio can change with global problem size. For example, in a 3D domain decomposition, each rank may hold a fixed number of grid points, but the surface area between subdomains relative to their volume can change due to how the decomposition is mapped. The geometry of the domain and partitioning strategy can therefore affect weak scaling.
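
To make the surface-to-volume point concrete, the sketch below compares the halo surface per rank for a cubic block decomposition and a 1D slab decomposition of a cube, with the number of cells per rank held fixed; all sizes are made up for illustration.

```python
# Halo surface per rank for two decompositions of a cubic domain with a
# fixed number of cells per rank. With cubic subdomains the surface stays
# constant; with 1D slabs it grows with P. Sizes are illustrative only.
cells_per_rank = 64**3

for P in (8, 64, 512, 4096):
    edge_global = round((P * cells_per_rank) ** (1 / 3))   # global cube edge
    block_halo = 6 * 64**2          # cubic block: six 64 x 64 faces, constant
    slab_halo = 2 * edge_global**2  # interior slab: two full cross-sections
    print(f"P={P:5d}  block halo={block_halo:8d}  slab halo={slab_halo:8d}")
```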

Memory system effects can become more pronounced. As more nodes are involved, the pattern of memory access and network traffic can lead to contention that was not visible on a small system. File system interactions, such as loading input or writing checkpoints, can also show more severe bottlenecks when many processes act concurrently.

Finally, system-level issues such as operating system noise, background jobs, or differences between nodes can create imbalances that become noticeable only at large scale. These do not change the algorithmic work per processor, but they affect the observed time.

Interpreting Weak Scaling Results

Weak scaling curves are typically plotted with time per iteration, time per step, or total runtime on the vertical axis, and number of processors (or nodes, or GPUs) on the horizontal axis. For ideal weak scaling, the curve is flat. In practice, you often see a gradual upward slope.

If time grows slowly with $P$, your code exhibits good weak scaling for the tested range. If the curve bends upward sharply beyond some scale, this indicates that some cost, often communication or synchronization, starts to dominate.

It is useful to report not only raw times but also weak scaling efficiency. For example, you might say that the time per step increased from 1 second to 1.2 seconds when going from 1 to 1024 processors, which corresponds to $E_w(1024) \approx 1/1.2 \approx 0.83$ or 83 percent.

You should also relate weak scaling behaviour to the intended use of the code. If your application never needs to run on more than a few hundred cores, then a gentle efficiency drop at 10,000 cores may be acceptable. Conversely, if you aim for leadership-class systems, even modest deviations from flat time at large $P$ can be important.

Interpreting weak scaling results always requires clarity about the problem definition and configuration. Differences in input physics, boundary conditions, or numerical tolerances can hide or exaggerate scaling behaviour. To make results comparable, you should specify exactly how the problem size grows with $P$ and what metric you used for timing.

Algorithmic Considerations for Weak Scaling

Weak scaling behaviour depends not only on the underlying hardware but also on the algorithm and how it is parallelized. Some algorithms are naturally local. Others involve inherently global operations that can limit weak scaling.

Domain decomposition methods, where each processing element is responsible for a local portion of the domain and interacts mainly with neighbours, are often well suited for good weak scaling. Their communication pattern can remain mostly local, and each processor continues to perform a similar amount of computation as the problem grows.

In contrast, algorithms that rely on frequent global reductions, FFTs with global data exchanges, or all-to-all communication can see weak scaling degrade sooner. The cost of these global interactions often grows faster than linearly with the number of processes.

Data structures and decomposition strategies can be adapted to improve weak scaling. Examples include partitioning the domain to minimize communication volume and surface area, aggregating messages to reduce per-message overhead, and organizing computations so that communication overlaps with useful work.
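
As one possible realization of the last idea, the sketch below overlaps a non-blocking halo exchange with interior work using mpi4py; the 1D neighbour layout, array sizes, and variable names are illustrative assumptions, not a prescription.

```python
# Sketch: overlap a halo exchange with interior work using non-blocking MPI.
# Ranks form a 1D line; each exchanges one ghost cell with its neighbours.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

n_local = 1_000_000                      # fixed work per rank (weak scaling)
u = np.random.rand(n_local + 2)          # local data plus two ghost cells
recv_l, recv_r = np.empty(1), np.empty(1)

# Start the halo exchange without waiting for it to finish.
reqs = [comm.Irecv(recv_l, source=left, tag=0),
        comm.Irecv(recv_r, source=right, tag=1),
        comm.Isend(u[1:2], dest=left, tag=1),
        comm.Isend(u[-2:-1], dest=right, tag=0)]

interior = u[2:-2].sum()                 # interior work overlaps communication

MPI.Request.Waitall(reqs)                # ghosts needed only for boundary work
u[0], u[-1] = recv_l[0], recv_r[0]
boundary = u[0] + u[-1]
```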

As node-level parallelism becomes richer, with many cores per node and accelerators, it is common to consider hybrid parallel approaches. Good weak scaling may require a balance between intra-node and inter-node parallelism, with careful design of communication patterns that suit the system topology.

Relation to Problem Size and Scientific Ambitions

Weak scaling is directly connected to the question of “how big” a problem you can solve in a given amount of time. If an application exhibits near-ideal weak scaling, then doubling the number of processors approximately doubles the problem size you can handle within the same runtime.

For many scientific and engineering domains, increasing the resolution, number of samples, or physical realism of a model is of central importance. For example, in climate modelling, finer grids capture more detailed atmospheric dynamics. In materials science, larger systems of atoms allow the study of more realistic structures. In uncertainty quantification, more samples lead to better statistical estimates.

If weak scaling is good, the transition from a modest cluster to a leadership-class machine can translate directly into more ambitious simulations. If weak scaling is poor, larger machines may not deliver proportionally larger problems within acceptable time limits.

For this reason, evaluating weak scaling, in addition to strong scaling, helps researchers and engineers understand whether an application can make effective use of future, larger systems, and whether efforts to improve algorithms or parallelization strategies are necessary to reach desired problem sizes.

Practical Caveats and Common Pitfalls

In practice, several aspects can distort weak scaling measurements if you are not careful.

Initialization and I/O overheads often scale differently from the main computation. If the simulation itself has good weak scaling but initialization becomes much more expensive with many processes, the measured $T(P)$ will rise even if the core algorithm scales well. To avoid confusion, many practitioners time only the steady-state region of the computation or separately report initialization and I/O costs.
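
One simple way to follow this advice is sketched below: initialization is timed separately, a few warm-up steps are discarded, and only the steady-state steps enter the reported time per step. The stand-in workload is invented so the snippet runs on its own.

```python
# Separate initialization from the timed steady-state region.
import time
import numpy as np

t0 = time.perf_counter()
u = np.random.rand(5_000_000)          # stands in for setup and input reading
t_init = time.perf_counter() - t0

def step():
    np.cos(u, out=u)                   # stands in for one time step

for _ in range(5):                     # warm-up steps, excluded from timing
    step()

t0 = time.perf_counter()
n_steps = 50
for _ in range(n_steps):               # steady-state steps that are timed
    step()
t_step = (time.perf_counter() - t0) / n_steps

print(f"initialization: {t_init:.3f} s, steady-state time per step: {t_step:.4f} s")
```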

Using different solvers or iteration counts at larger problem sizes can also affect results. For some numerical methods, the number of iterations required to reach a given accuracy depends on the global problem size. If you hold error tolerances fixed, a larger problem may inherently require more iterations, and thus more time, even with perfect parallel efficiency. When analysing weak scaling, you should be aware of such algorithmic effects.

Load imbalance can grow as the system size increases. Even if you assign the same nominal work to each process, inhomogeneous physics, adaptive meshes, or irregular data structures can cause some processes to have more work than others. The overall time is then limited by the slowest process. Good weak scaling requires not only a balanced initial distribution of work, but also dynamic strategies if the workload changes during the computation.

Finally, system-level policies and resource sharing can influence measurements. On shared clusters, other users’ jobs may generate background load on the network or file system. Running experiments at different times or under different queue policies can lead to variability. For serious weak scaling studies, it is often necessary to coordinate with system administrators or run on dedicated allocations to reduce external interference.

By recognizing these caveats and designing experiments carefully, you can obtain weak scaling measurements that reflect the true behaviour of your application and guide further optimization efforts.
