Understanding Weak Scaling
Weak scaling describes how the performance of a parallel program changes when you increase the problem size in proportion to the number of processing elements (cores, nodes, GPUs, etc.), while attempting to keep the amount of work per processing element constant.
In other words: if each core always does roughly the same amount of work, can we keep the time-to-solution about the same as we add more cores and make the overall problem bigger?
This is fundamentally different from strong scaling, where the total problem size is fixed and the number of processing elements increases.
The Core Idea
Assume:
- $P$ = number of processing elements (e.g., cores or ranks)
- $N$ = global problem size
- $n = N / P$ = problem size handled by each processing element
For weak scaling, you:
- Keep $n$ approximately constant.
- Increase $P$.
- Thus, $N$ grows as $N = n \cdot P$.
The ideal weak scaling behavior is:
- Runtime $T(P)$ stays constant as $P$ increases, because each processing element always does the same amount of work.
So, in an ideal world you would have:
$$
T(P) \approx T(1) \quad \text{for all } P
$$
Of course, in practice, communication, synchronization, and other overheads make $T(P)$ increase with $P$. Weak scaling analysis helps you understand how fast it increases and why.
Weak Scaling Efficiency
To quantify weak scaling behavior, we typically define a weak scaling efficiency.
Let:
- $T(1)$ be the runtime on 1 processing element for problem size $N_1$.
- $T(P)$ be the runtime on $P$ processing elements for problem size $N_P$, chosen so that $N_P / P \approx N_1$ (same work per processing element).
A common definition is:
$$
E_\text{weak}(P) = \frac{T(1)}{T(P)}
$$
If $T(P) = T(1)$ for all $P$, then:
$$
E_\text{weak}(P) = 1 \quad \text{(or } 100\% \text{)}
$$
In practice:
- $E_\text{weak}(P) < 1$ (efficiency drops as $P$ increases).
- The rate of decrease in $E_\text{weak}(P)$ tells you how well your code scales to larger machines at a fixed workload per processing element.
Some practitioners rescale or define efficiency slightly differently, but the basic idea is: compare runtime at 1 PE with runtime at $P$ PEs, holding work per PE constant.
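As a small illustration, the Python sketch below computes this efficiency from a dictionary of runtimes; the timing values are invented for the example.

```python
# Minimal sketch: weak scaling efficiency from measured runtimes.
# The runtimes below are illustrative placeholders, not real measurements.

def weak_scaling_efficiency(t1, tp):
    """E_weak(P) = T(1) / T(P), with work per processing element held constant."""
    return t1 / tp

# Hypothetical measurements: P -> runtime in seconds (same work per PE in every run).
runtimes = {1: 100.0, 2: 102.0, 4: 105.0, 8: 111.0, 16: 120.0}

t1 = runtimes[1]
for p, tp in sorted(runtimes.items()):
    eff = weak_scaling_efficiency(t1, tp)
    print(f"P = {p:>3}  T(P) = {tp:7.1f} s  E_weak = {eff:6.1%}")
```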
Why Weak Scaling Matters in HPC
Many HPC applications are not about solving a fixed-size problem faster. Instead, they aim to:
- Simulate larger physical domains (e.g., higher-resolution climate models, bigger astrophysical volumes).
- Increase mesh or grid resolution to capture more detail.
- Handle larger datasets (e.g., more particles, more cells, more images).
- Run with more complex physics or models that increase per-domain work.
In these cases, when larger machines become available, users increase the problem size instead of just reducing the time-to-solution. Weak scaling is the metric that answers:
If I use a bigger machine to solve a bigger problem (with the same work per core), does the time-to-solution stay roughly the same?
Good weak scaling means you can grow your scientific or engineering ambition (bigger, finer, more detailed simulations) without paying a large time penalty.
Typical Weak Scaling Scenarios
Some common patterns where weak scaling is the natural measure:
- Grid-based simulations (PDE solvers, CFD, climate, seismology):
- The global grid size is increased proportional to the number of subdomains (and thus the number of ranks/threads).
- Each rank holds a subdomain of approximately fixed size.
- Particle-based methods (molecular dynamics, N-body simulations, SPH):
- The number of particles per process is kept roughly constant.
- As you add more processes, you simulate more particles overall.
- Data-parallel processing (large-scale image processing, ML training on larger datasets):
- Each worker gets a fixed-size data shard.
- Adding workers increases the total amount of data processed in the same time.
In all of these, the main question is: how much extra overhead appears when the system grows?
How to Design a Weak Scaling Experiment
To evaluate weak scaling for your own application, you usually:
- Choose a unit workload per process
- Example: a grid of $128 \times 128 \times 128$ cells per rank.
- Or: 1 million particles per rank.
- Or: a fixed number of data items per process.
- Set up runs with increasing $P$
- For $P = 1, 2, 4, 8, 16, \dots$
- For each $P$, build a global problem such that each rank still gets the same local size (or as close as possible).
Example (3D grid):
- At $P = 1$: global grid $128^3$.
- At $P = 8$: global grid $256 \times 256 \times 256$ distributed over 8 ranks, so each rank still holds $128^3$ cells.
- At $P = 64$: global grid $512^3$, etc.
- Measure the same metric for each run
- Total wall-clock time.
- Or time for the main compute loop (excluding initialization and I/O if needed, but be consistent).
- Compute weak scaling efficiency
- Using $E_\text{weak}(P) = T(1) / T(P)$ or your preferred variant.
- Plot $T(P)$ and/or $E_\text{weak}(P)$ versus $P$ on a graph.
- Analyze deviations
- If $T(P)$ stays flat, weak scaling is excellent.
- If $T(P)$ grows slowly (e.g., logarithmically), weak scaling may still be acceptable.
- If $T(P)$ grows steeply, there are scaling bottlenecks to investigate.
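The Python sketch below follows these steps for the 3D grid example above: it builds the run plan (global grid per $P$ for a fixed $128^3$ local grid) and computes the efficiency from runtimes that you would replace with your own measurements; the numbers shown are placeholders.

```python
# Sketch of a weak scaling experiment plan for a 3D grid with a fixed
# 128^3 local domain per rank. The global size is chosen so that N / P
# stays constant; the runtimes are placeholders for your own measurements.
local = 128  # cells per rank along each dimension

plan = []
for p in [1, 8, 64, 512]:               # perfect cubes keep the local domain cubic
    ranks_per_dim = round(p ** (1 / 3))
    global_edge = local * ranks_per_dim
    plan.append((p, global_edge))
    print(f"P = {p:>4}  global grid = {global_edge}^3  "
          f"cells per rank = {global_edge**3 // p}")

# After running, fill in measured wall-clock times and compute efficiency:
measured = {1: 250.0, 8: 260.0, 64: 280.0, 512: 330.0}  # hypothetical seconds
t1 = measured[1]
for p, _ in plan:
    print(f"P = {p:>4}  E_weak = {t1 / measured[p]:.1%}")
```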
Interpreting Weak Scaling Behavior
In an idealized case with only perfectly parallel work and no overhead:
- Each processing element does exactly the same amount of work.
- No one needs to communicate or synchronize.
- Then $T(P) = T(1)$ and $E_\text{weak}(P) = 1$ for all $P$.
Real-world programs deviate because of:
- Communication costs
- Exchange of halo/boundary data in grid-based methods.
- Global reductions (e.g., sums, norms, dot products).
- All-to-all communications in some algorithms.
- Synchronization overheads
- Global barriers.
- Phases where processes must wait for others.
- Load imbalance
- Some processes get more work or slower tasks.
- Dynamic behavior (e.g., adaptive mesh refinement, dynamic particle distributions).
- Memory and network contention
- Shared interconnects.
- Shared file systems or memory bandwidth.
In weak scaling plots:
- A slow, nearly linear increase of runtime with $P$ is often acceptable up to some scale.
- A sharp increase at particular scales (e.g., when moving to multiple nodes, or across racks) often reveals architectural boundaries (e.g., inter-node vs intra-node, network topology changes).
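A minimal plotting sketch in this spirit is shown below; it uses matplotlib, and the $(P, T(P))$ pairs are invented for illustration. A jump in $T(P)$ at a node or rack boundary would be immediately visible in such a plot.

```python
# Sketch: plot runtime and weak scaling efficiency versus P.
# The (P, runtime) pairs are illustrative, not real measurements.
import matplotlib.pyplot as plt

data = {1: 100.0, 2: 101.0, 4: 103.0, 8: 108.0, 16: 118.0, 32: 125.0}
ps = sorted(data)
times = [data[p] for p in ps]
eff = [data[1] / data[p] for p in ps]

fig, (ax_t, ax_e) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_t.plot(ps, times, marker="o")
ax_t.set_xscale("log", base=2)
ax_t.set_xlabel("P (processing elements)")
ax_t.set_ylabel("runtime T(P) [s]")

ax_e.plot(ps, eff, marker="o")
ax_e.axhline(1.0, linestyle="--", color="gray")   # ideal weak scaling
ax_e.set_xscale("log", base=2)
ax_e.set_xlabel("P (processing elements)")
ax_e.set_ylabel("E_weak(P)")

fig.tight_layout()
plt.show()
```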
Simple Cost Models for Weak Scaling
A very basic conceptual model for runtime under weak scaling:
$$
T(P) = T_\text{comp} + T_\text{comm}(P) + T_\text{sync}(P) + T_\text{other}(P)
$$
Where:
- $T_\text{comp}$: computation time per process (ideally constant with $P$ under weak scaling).
- $T_\text{comm}(P)$: communication time per process (often grows with $P$).
- $T_\text{sync}(P)$: time waiting for synchronization (also often grows with $P$).
- $T_\text{other}(P)$: miscellaneous overheads (I/O, OS noise, etc.).
For many stencil-like PDE solvers with simple nearest-neighbor communication:
- Per-process communication volume is set by the surface area of the local domain, which stays roughly fixed under weak scaling; the growth of $T_\text{comm}(P)$ with $P$ comes mainly from network latency/bandwidth effects, contention, and any global operations.
- As you scale to massive core counts, $T_\text{comm}$ and $T_\text{sync}$ tend to grow and can eventually dominate the runtime.
The goal of weak scaling optimization is to keep $T_\text{comm}(P)$ and $T_\text{sync}(P)$ as small as possible so that total runtime $T(P)$ grows slowly.
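The following toy model makes this concrete: it assumes a constant compute term, a constant nearest-neighbor halo term, and a global reduction whose cost grows like $\log_2 P$; all coefficients are invented for illustration.

```python
# Toy cost model for weak scaling (all coefficients are invented):
#   T(P) = T_comp + T_halo + T_reduce(P)
# with constant per-process compute and halo-exchange costs and a
# global reduction whose cost grows roughly like log2(P).
import math

T_COMP = 100.0          # s, per-process computation (constant under weak scaling)
T_HALO = 2.0            # s, nearest-neighbor halo exchange (roughly constant)
T_RED_PER_LEVEL = 0.5   # s, per reduction "level" (grows with log2 P)

def model_runtime(p):
    t_reduce = T_RED_PER_LEVEL * math.log2(p) if p > 1 else 0.0
    return T_COMP + T_HALO + t_reduce

for p in [1, 8, 64, 512, 4096]:
    t = model_runtime(p)
    print(f"P = {p:>5}  T(P) = {t:7.2f} s  E_weak = {model_runtime(1) / t:6.1%}")
```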
Weak vs Strong Scaling: When to Use Which
Even though the full comparison is covered in the broader parallel concepts, it’s important to understand when weak scaling is the right tool:
Use weak scaling when:
- You expect to increase problem size as more resources become available.
- Your main scientific or engineering goal is to solve larger or higher-resolution problems, not just faster ones.
- The natural workload definition is “work per core” (e.g., fixed-size domain or fixed number of particles per rank).
Use strong scaling when:
- You want to reduce time-to-solution for a fixed problem size.
- Response time is critical, but the problem size is constrained (e.g., real-time or near real-time scenarios).
Both analyses are complementary. A code might have:
- Good weak scaling (can handle larger problems well).
- Poor strong scaling (can’t speed up small/medium problems much beyond a certain number of cores).
Understanding this helps you set realistic expectations and choose appropriate job sizes on HPC systems.
Practical Considerations for Weak Scaling on Clusters
When performing weak scaling studies on real clusters, several practical aspects matter:
Domain Decomposition Strategy
How you split the problem across processes affects weak scaling:
- Aim to keep local domain shapes as regular as possible (e.g., close to cubes in 3D).
- Irregular decompositions or “strips” can increase surface-to-volume ratio and thus communication.
Example: For a 3D grid, prefer:
- $P = P_x \cdot P_y \cdot P_z$ with $P_x \approx P_y \approx P_z$ over long skinny decompositions.
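The short sketch below compares the surface-to-volume ratio of a local subdomain (a proxy for halo data exchanged per cell updated) for a near-cubic $4 \times 4 \times 4$ split versus a $64 \times 1 \times 1$ strip split of a hypothetical $512^3$ grid on 64 ranks.

```python
# Sketch: compare communication surface per unit of work for two ways of
# splitting a 512^3 global grid over 64 ranks. Sizes are illustrative.

def local_surface_to_volume(global_n, dims):
    """Surface-to-volume ratio of one local subdomain for a (px, py, pz) split."""
    nx, ny, nz = (global_n // d for d in dims)
    surface = 2 * (nx * ny + ny * nz + nx * nz)   # faces of one subdomain
    volume = nx * ny * nz
    return surface / volume

global_n = 512
for label, dims in [("cubic 4x4x4", (4, 4, 4)), ("strips 64x1x1", (64, 1, 1))]:
    ratio = local_surface_to_volume(global_n, dims)
    print(f"{label:>14}: surface/volume = {ratio:.4f}")
# The near-cubic split has a much smaller ratio, i.e. less halo data
# exchanged per cell updated.
```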
Network Topology and Placement
Weak scaling curves can change behavior:
- When moving from within a single node (shared memory) to across multiple nodes (network).
- When moving across racks or network tiers.
On many systems:
- Performance degrades more rapidly when you span more network hops.
- You may need scheduler options (e.g., node layout, process mapping) to get good placement.
I/O Effects
For large-scale weak scaling runs:
- Total data volume (checkpoints, output files) often grows with $P$.
- Even if computation scales well, parallel file system bandwidth or metadata operations can become a limiting factor.
Often:
- For pure weak scaling studies, you may want to minimize or standardize I/O (e.g., same I/O per process, or measure compute-only phase separately).
- In realistic production runs, scaling I/O is itself a major challenge.
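One way to keep I/O out of the scaling measurement is to time the compute and output phases separately, as in the hedged mpi4py sketch below; `compute_phase()` and `write_checkpoint()` are placeholders for your own code.

```python
# Sketch: time the compute phase and the I/O phase separately so that
# file-system effects do not contaminate the weak scaling measurement.
# Assumes mpi4py; the two phase functions are placeholders.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def compute_phase():
    pass  # placeholder: your solver's timestep loop

def write_checkpoint():
    pass  # placeholder: your application's output routine

comm.Barrier()
t0 = MPI.Wtime()
compute_phase()
comm.Barrier()
t1 = MPI.Wtime()
write_checkpoint()
comm.Barrier()
t2 = MPI.Wtime()

# Report the slowest process, which determines the observed runtime.
t_comp = comm.reduce(t1 - t0, op=MPI.MAX, root=0)
t_io = comm.reduce(t2 - t1, op=MPI.MAX, root=0)
if comm.Get_rank() == 0:
    print(f"compute: {t_comp:.3f} s   I/O: {t_io:.3f} s")
```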
Common Pitfalls in Weak Scaling Experiments
When designing or interpreting weak scaling results, avoid:
- Changing more than one factor at a time
- Keep software version, compiler flags, algorithmic parameters, etc. constant.
- Otherwise, you cannot attribute changes in performance solely to scaling.
- Comparing different workloads per process
- If the local problem size is not actually constant, you are no longer measuring pure weak scaling.
- Ensure the workload is as consistent as possible across different $P$.
- Including startup and initialization artifacts
- For small $P$ or short runs, fixed overheads (startup, reading input) might dominate the measurement.
- For large $P$, launch and input-reading costs can instead grow (e.g., due to file-system or launcher contention) and distort the comparison.
- Decide up front whether to include these in your metric, and be consistent.
- Ignoring load imbalance
- Weak scaling assumes constant work per process.
- If your decomposition yields uneven work, measured scaling reflects both algorithmic scalability and load imbalance.
- Misinterpreting “flat” runtime
- Flat runtime is ideal in theory, but slight increases are expected.
- You need to understand which contributions (communication, synchronization, I/O) are responsible, particularly at large scales.
Simple Example Scenario (Conceptual)
Consider a 2D heat equation solver on a structured grid, parallelized with domain decomposition:
- Each process owns an $n \times n$ sub-grid.
- At every timestep, each process:
- Performs computations on its local grid (bulk work).
- Exchanges halo values with its four neighbors (communication).
- Weak scaling experiment:
- Keep $n$ fixed for each process (e.g., $n = 512$).
- Increase $P$ and grow global grid accordingly.
Expected behavior:
- Computation time per step is roughly constant (same $n^2$ operations per process).
- Communication per step may increase:
- Per process, it may be constant or change slightly depending on decomposition.
- Overall system communication volume grows with $P$.
- Runtime per step $T(P)$:
- Should ideally grow very slowly with $P$.
- In practice, it increases due to network contention, accumulated message latency, and synchronization.
Such an example is often used in practice to characterize the scalability limits of both the application and the underlying HPC system.
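A stripped-down harness in this spirit is sketched below using mpi4py and NumPy: each rank owns a fixed $n \times n$ interior grid, performs a Jacobi-style 5-point update, exchanges halos with its neighbors on a Cartesian process grid, and reports the per-step time of the slowest rank. It is a minimal illustration, not a production solver.

```python
# Minimal weak scaling harness for a 2D heat-equation-style stencil.
# Each rank owns a fixed n x n interior grid with a one-cell halo, so the
# global problem grows with the number of ranks. Assumes mpi4py and NumPy.
from mpi4py import MPI
import numpy as np

n = 512            # fixed local grid size per rank (weak scaling)
steps = 100        # number of timesteps to time

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)          # near-square process grid
cart = comm.Create_cart(dims, periods=[False, False])
south, north = cart.Shift(0, 1)    # neighbors in the first grid dimension
west, east = cart.Shift(1, 1)      # neighbors in the second grid dimension

u = np.zeros((n + 2, n + 2))       # local field including halo layer
u_new = np.zeros_like(u)
recv_w = np.empty(n + 2)           # contiguous buffers for column halos
recv_e = np.empty(n + 2)

cart.Barrier()
t0 = MPI.Wtime()
for _ in range(steps):
    # Halo exchange with up to four neighbors (absent ones are MPI.PROC_NULL).
    cart.Sendrecv(np.ascontiguousarray(u[n, :]), dest=north,
                  recvbuf=u[0, :], source=south)
    cart.Sendrecv(np.ascontiguousarray(u[1, :]), dest=south,
                  recvbuf=u[n + 1, :], source=north)
    cart.Sendrecv(np.ascontiguousarray(u[:, n]), dest=east,
                  recvbuf=recv_w, source=west)
    cart.Sendrecv(np.ascontiguousarray(u[:, 1]), dest=west,
                  recvbuf=recv_e, source=east)
    if west != MPI.PROC_NULL:
        u[:, 0] = recv_w
    if east != MPI.PROC_NULL:
        u[:, n + 1] = recv_e
    # Jacobi-style 5-point stencil update of the interior.
    u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                + u[1:-1, :-2] + u[1:-1, 2:])
    u, u_new = u_new, u
cart.Barrier()
t_step = cart.reduce((MPI.Wtime() - t0) / steps, op=MPI.MAX, root=0)
if cart.Get_rank() == 0:
    print(f"P = {cart.Get_size():>5}  time per step = {t_step * 1e3:.2f} ms")
```

Running this for increasing rank counts while keeping `n` fixed yields exactly the $T(P)$ data points a weak scaling plot needs.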
When Weak Scaling Results Guide Application Design
Weak scaling studies often expose:
- Which algorithmic components scale poorly (e.g., global reductions, all-to-all).
- Which data structures cause excessive communication (e.g., irregular patterns requiring many small messages).
- Which parts are sensitive to network topology or message size.
Developers then:
- Replace global operations with more scalable alternatives (e.g., hierarchical reductions).
- Change decomposition strategies to reduce surface-to-volume ratio.
- Introduce overlap of communication and computation to hide latency.
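As one concrete instance of the last point (overlapping communication and computation), the sketch below starts a non-blocking global reduction with mpi4py's `Iallreduce` (which requires an MPI-3 library) and only waits for it after doing work that does not depend on the result; `do_independent_work()` is a placeholder.

```python
# Sketch: hide the latency of a global reduction by overlapping it with
# computation that does not depend on the reduced value. Assumes mpi4py
# (with an MPI-3 library) and NumPy; the work function is a placeholder.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

local_norm_sq = np.array([1.0])    # e.g. squared norm of the local residual
global_norm_sq = np.empty(1)

# Start the reduction, but do not wait for it yet.
req = comm.Iallreduce(local_norm_sq, global_norm_sq, op=MPI.SUM)

def do_independent_work():
    pass  # placeholder: work that does not need the global norm

do_independent_work()

# Only block once the reduced value is actually needed.
req.Wait()
if comm.Get_rank() == 0:
    print("global norm:", np.sqrt(global_norm_sq[0]))
```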
For users (non-developers), understanding weak scaling:
- Helps decide how far to scale out a given application.
- Informs job size selection (e.g., not using 10,000 cores if performance gains saturate at 1,000 cores for your target resolution).
- Sets expectations about what “good performance” looks like on a given system.
Summary
- Weak scaling measures how runtime changes when both problem size and number of processing elements grow together, keeping work per element roughly constant.
- Ideal weak scaling: runtime stays constant as you add more processing elements and increase the problem size.
- Weak scaling efficiency compares runtime at larger $P$ against runtime at $P=1$ for constant local workload.
- It is especially relevant when the goal is to solve larger, higher-resolution, or more detailed problems, not just to reduce the time for a fixed-sized problem.
- Real-world weak scaling behavior is affected by communication, synchronization, load imbalance, network topology, and I/O.
- Carefully designed weak scaling experiments are essential for understanding how your applications and systems behave at large scale, and for making informed decisions about algorithm design and job sizing on HPC clusters.