Concept of Strong Scaling
Strong scaling describes how the time to solve a fixed-size problem changes as you increase the number of processing elements (cores, nodes, GPUs, etc.). The total amount of work (problem size) stays constant; only the parallel resources change.
Formally, strong scaling efficiency $E_s$ for $p$ processing elements is often defined as
$$
E_s(p) = \frac{T(1)}{p \, T(p)}
$$
where:
- $T(1)$ is the time to solve the problem on 1 processing element
- $T(p)$ is the time to solve the same problem on $p$ processing elements
If doubling the number of cores halves the runtime (with fixed problem size), the program is said to strong-scale perfectly over that range.
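A quick worked example with illustrative numbers (the same values reappear in the example table later in this section): suppose a fixed problem takes 100 s on 1 core and 28 s on 4 cores. Then
$$
E_s(4) = \frac{T(1)}{4\,T(4)} = \frac{100}{4 \times 28} \approx 0.89,
$$
i.e., roughly 89% of ideal strong scaling at 4 cores.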
Speedup and Efficiency in Strong Scaling
The two core quantities used in strong scaling are:
- Speedup:
$$
S(p) = \frac{T(1)}{T(p)}
$$
- Parallel efficiency (as above):
$$
E_s(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}
$$
Interpretation:
- $S(p) = p$ and $E_s(p) = 1$ (or 100%) means ideal strong scaling.
- $S(p) < p$ and $0 < E_s(p) < 1$ is the usual real-world case.
- If $S(p) > p$ (superlinear speedup), it typically reflects effects such as better cache usage (at higher core counts, each core's share of the data may fit in cache), not a violation of any theoretical limit.
In practice, $T(1)$ may be replaced by $T(p_{\text{min}})$, the runtime at the smallest core count for which the problem fits in memory.
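A minimal sketch in Python of how these quantities are typically computed from measured runtimes (the function and variable names are illustrative, not taken from any particular tool):

```python
def speedup(t_ref, t_p):
    """Speedup S(p) = T(ref) / T(p); T(ref) is usually T(1),
    or T(p_min) if the problem does not fit on one processing element."""
    return t_ref / t_p

def efficiency(t_ref, t_p, p, p_ref=1):
    """Parallel efficiency E_s(p) = S(p) / (p / p_ref).
    With p_ref = 1 this is T(1) / (p * T(p))."""
    return speedup(t_ref, t_p) / (p / p_ref)

# Example: T(1) = 100 s, T(8) = 17 s  (illustrative numbers)
print(speedup(100.0, 17.0))        # ~5.88
print(efficiency(100.0, 17.0, 8))  # ~0.74
```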
How to Perform a Strong Scaling Study
A strong scaling study is an experiment:
- Choose a fixed problem size
  - Same input size, same grid size, same number of particles, same dataset, etc.
  - Ensure that the problem fits into memory for all tested core counts.
- Select a range of core counts
  - E.g., $p = 1, 2, 4, 8, 16, 32, \dots$
  - Stay within practical limits (e.g., avoid using the entire machine for a tiny problem).
- Measure runtime for each $p$
  - Use consistent measurement methods (e.g., application timer, job runtime).
  - Run each configuration multiple times to average out noise.
- Compute speedup and efficiency
  - Use the formulas above.
  - Often $T(1)$ is replaced with $T(p_{\text{min}})$ if the single-core run is too slow or infeasible.
- Plot results (a small analysis sketch follows this list)
  - Speedup vs. cores: compare to the ideal line $S_{\text{ideal}}(p) = p$.
  - Efficiency vs. cores: see how quickly it drops as you add resources.
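A small analysis and plotting sketch, assuming the runtimes have already been extracted from the logs (the numbers are the illustrative ones used in the example table later in this section):

```python
import matplotlib.pyplot as plt

# Measured runtimes for a fixed problem size (illustrative values).
cores = [1, 2, 4, 8]
times = [100.0, 52.0, 28.0, 17.0]

t_ref = times[0]
speedups = [t_ref / t for t in times]
efficiencies = [s / p for s, p in zip(speedups, cores)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Speedup vs. cores, compared to the ideal line S(p) = p.
ax1.plot(cores, speedups, "o-", label="measured")
ax1.plot(cores, cores, "--", label="ideal")
ax1.set_xlabel("cores $p$")
ax1.set_ylabel("speedup $S(p)$")
ax1.legend()

# Efficiency vs. cores: watch how quickly it drops.
ax2.plot(cores, efficiencies, "o-")
ax2.set_xlabel("cores $p$")
ax2.set_ylabel("efficiency $E_s(p)$")

plt.tight_layout()
plt.savefig("strong_scaling.png")
```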
Typical Strong Scaling Behavior
As you increase $p$ for a fixed-size problem:
- Initially, speedup often grows nearly linearly.
- After some point, communication and other overheads become significant:
- Speedup curve bends away from ideal linear.
- Efficiency decreases more rapidly.
- Beyond a certain $p$, adding more cores can increase runtime:
- Communication dominates.
- Parallel overheads outweigh benefits of further splitting the work.
This defines a “useful strong scaling range”, which matters when choosing how many cores to request on an HPC cluster: using more cores than this range wastes resources without a significant speedup.
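One simple way to see why the curve bends and eventually turns around is a toy cost model (an illustration only, not taken from any specific application):
$$
T(p) \approx t_s + \frac{t_w}{p} + t_c \log_2 p ,
$$
where $t_s$ is the serial time, $t_w$ the parallelizable work, and the last term stands in for communication and synchronization overhead that grows with the core count. The middle term shrinks as $p$ grows while the last term increases, so $T(p)$ has a minimum at some finite $p$ and rises again beyond it, which is exactly the behavior described above.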
Factors Limiting Strong Scaling
For a fixed problem size, the following effects become more pronounced as you add cores:
- Serial (non-parallelizable) portions of code
- Even short serial segments limit maximum speedup.
- Directly related to Amdahl’s Law (covered elsewhere; a brief reminder follows this list).
- Communication overhead
- More processes or threads mean more messages and synchronization.
- Latency and bandwidth limitations of the interconnect become visible.
- Load imbalance
- If work is unevenly distributed, some cores wait idle while others are busy.
- The more cores you have, the more visible any imbalance becomes.
- Synchronization costs
- Barriers, reductions, and locks get more expensive with higher core counts.
- Memory system contention
- Multiple cores competing for access to the same memory or data structures.
- Cache coherence and memory bandwidth limits can dominate runtime.
- Granularity of work
- Each core may eventually have too little work compared to the fixed overhead of starting and managing parallel tasks.
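As a brief reminder of the first point, Amdahl’s Law bounds the achievable speedup for a fixed problem size: if a fraction $s$ of the runtime is inherently serial, then
$$
S(p) \le \frac{1}{s + \frac{1-s}{p}} \le \frac{1}{s}.
$$
For example, with only $s = 5\%$ serial work, the speedup can never exceed $20\times$, no matter how many cores are added.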
Strong Scaling vs. Practical Resource Usage
Strong scaling analysis helps answer questions like:
- “How many cores should I request for this job so I don’t waste allocation time?”
- “Is it worth paying for more nodes or GPUs for this problem size?”
- “Is my application limited by serial sections or by communication?”
Practical observations:
- Diminishing returns: After some point, doubling cores yields only a small percentage improvement in runtime.
- Cost vs. speed trade-off: For a fixed problem, there is often an optimal core count where:
- Runtime is “fast enough”
- Efficiency is still acceptable (e.g., > 50%)
On shared systems with fair-share policies, using more cores than needed for modest speed gains can be considered poor resource usage.
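A minimal sketch of such a decision rule in Python, assuming measured runtimes are available; the 50% threshold is just the example value mentioned above:

```python
def pick_core_count(cores, times, min_efficiency=0.5):
    """Largest tested core count whose parallel efficiency (relative to the
    smallest tested core count) stays at or above the threshold.
    Assumes `cores` is sorted in increasing order."""
    p_ref, t_ref = cores[0], times[0]
    best = p_ref
    for p, t in zip(cores, times):
        eff = (t_ref / t) / (p / p_ref)
        if eff >= min_efficiency:
            best = p
    return best

# Illustrative data: efficiency drops to ~0.40 at 16 cores.
cores = [1, 2, 4, 8, 16]
times = [100.0, 52.0, 28.0, 17.0, 15.6]
print(pick_core_count(cores, times))  # -> 8
```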
Designing Applications for Strong Scaling
To improve strong scaling for a fixed problem size, common strategies include:
- Reduce serial regions
- Refactor or parallelize previously sequential sections.
- Reduce communication volume and frequency
- Communicate in fewer, larger messages (see the sketch after this list).
- Use communication-avoiding algorithms where possible.
- Improve data locality
- Arrange data so each core primarily operates on its local data.
- Minimize remote memory accesses or cross-node dependencies.
- Enhance load balancing
- Use dynamic scheduling or better partitioning of the domain.
- Avoid hotspots where a subset of ranks do most of the work.
- Optimize synchronization
- Remove unnecessary barriers.
- Replace global collectives with smaller, local ones when possible.
- Increase computation per communication
- Sometimes slightly changing the algorithm can allow more work to be done per communication step, improving strong scaling.
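As an illustration of the “fewer, larger messages” strategy, here is a minimal sketch using mpi4py (the package choice, the ring-exchange pattern, and all names are assumptions for the example, not part of any particular application):

```python
# Run with e.g.: mpirun -n 4 python batch_messages.py  (filename is arbitrary)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical per-rank items that must be exchanged with a neighbour rank.
items = [rank * 1000 + i for i in range(1000)]

if size > 1:
    dest = (rank + 1) % size
    source = (rank - 1) % size

    # Anti-pattern: one tiny message per item pays the network latency
    # 1000 times per exchange step.
    # for x in items:
    #     comm.send(x, dest=dest)

    # Better for strong scaling: batch everything into one message per
    # neighbour, paying the latency once and using bandwidth effectively.
    received = comm.sendrecv(items, dest=dest, source=source)
```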
Example Strong Scaling Experiment Template
Below is a simple template to organize a strong scaling experiment (independent of specific parallel programming models):
- Setup
  - Fix problem size (e.g., grid $1000 \times 1000$, or $10^7$ particles).
- Run commands (conceptually):

# 1 core
srun -n 1 ./my_app --size 1000 > log_p1.txt
# 2 cores
srun -n 2 ./my_app --size 1000 > log_p2.txt
# 4 cores
srun -n 4 ./my_app --size 1000 > log_p4.txt
# 8 cores
srun -n 8 ./my_app --size 1000 > log_p8.txt

- Extract runtimes (from logs or job info) into a table:
| Cores $p$ | Time $T(p)$ [s] | Speedup $S(p)$ | Efficiency $E_s(p)$ |
|-----------|-----------------|----------------|----------------------|
| 1 | 100 | 1.00 | 1.00 |
| 2 | 52 | 1.92 | 0.96 |
| 4 | 28 | 3.57 | 0.89 |
| 8 | 17 | 5.88 | 0.74 |
- Interpret
- Up to 8 cores, efficiency is still reasonably high.
- If at 16 cores efficiency dropped to 0.4, you might choose 8 cores for production runs of this specific problem size.
When Strong Scaling Is (and Is Not) Appropriate
Strong scaling is particularly useful when:
- The problem size is fixed by physical or experimental constraints (e.g., exact resolution needed).
- You want faster turnaround time for the same problem.
- You need to decide how many resources to request for a given job.
It is less appropriate when:
- Problem sizes naturally grow with available resources (then weak scaling is more relevant).
- You are more interested in throughput (jobs per day) than in minimizing the time for a single run.
Understanding strong scaling gives you a key tool for making efficient, informed decisions about how to run your codes on HPC systems.