Key Performance Principles for Accelerators
Accelerators can deliver large speedups, but only if their strengths are used well and their bottlenecks are avoided. This chapter focuses on the main performance considerations when using GPUs and other accelerators in HPC codes.
We will assume you already know basic GPU architecture concepts and the idea of host (CPU) vs device (GPU) from earlier sections.
1. Data Movement and PCIe/Interconnect Bottlenecks
In most systems, the CPU and GPU have separate memory spaces. Moving data between them is often the dominant cost.
Minimize Host–Device Transfers
- Move data to the device once, use it as much as possible, and only copy results back when needed.
- Avoid patterns like:
- copy input → launch small kernel → copy output → repeat
- Prefer:
- copy large input → many kernel launches using resident data → copy final result
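As a sketch of the preferred pattern (hypothetical kernel name `process`; assumes a single CUDA device and host buffers `h_data`/`h_result` of `n` floats): the data is copied once, stays resident across many launches, and only the final result comes back.

```cuda
// Sketch: one H2D copy, many kernel launches on resident data, one D2H copy.
float *d_data;
cudaMalloc(&d_data, n * sizeof(float));
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

for (int step = 0; step < nsteps; ++step) {
    process<<<blocks, threads>>>(d_data, n);   // data never leaves the device
}

cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
```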
Transfer Larger, Fewer Chunks
Transferring many small buffers incurs overhead. Instead:
- Pack small arrays into a single larger buffer when possible.
- Group related operations so they can reuse the same data on the device.
Overlap Communication and Computation
Use asynchronous operations to hide transfer latency:
- Asynchronous memory copies (`cudaMemcpyAsync`, `hipMemcpyAsync`, etc.) can run concurrently with kernels if they use different streams/queues.
- Typical pattern:
- While kernel $k$ is running on batch $i$, start transferring data for batch $i+1$.
This is often called double buffering or pipelining.
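A double-buffering sketch in CUDA (hypothetical kernel `process`; assumes two pre-allocated device buffers `d_buf[2]` and pinned host buffers, since fully asynchronous copies require pinned memory, e.g. via `cudaHostAlloc`):

```cuda
// While the kernel works on batch i in one stream, the copies for the
// next batch proceed in the other stream.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);

for (int i = 0; i < nbatches; ++i) {
    int b = i % 2;  // alternate between the two buffers/streams
    cudaMemcpyAsync(d_buf[b], h_batch[i], bytes, cudaMemcpyHostToDevice, s[b]);
    process<<<blocks, threads, 0, s[b]>>>(d_buf[b], n);
    cudaMemcpyAsync(h_out[i], d_buf[b], bytes, cudaMemcpyDeviceToHost, s[b]);
}
cudaStreamSynchronize(s[0]);
cudaStreamSynchronize(s[1]);
```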
Consider Interconnect Limits
- PCIe has lower bandwidth and higher latency than GPU on-board memory.
- Faster interconnects like NVLink or GPU-direct networking reduce penalties but do not remove them.
- Algorithm design should:
- Maximize arithmetic intensity (more computation per byte transferred).
- Keep intermediate results on the device whenever practical.
2. Compute vs Memory Bound: Arithmetic Intensity
An accelerator’s theoretical peak performance is rarely achieved in practice, because most applications are limited either by memory bandwidth or by compute throughput.
Arithmetic Intensity
Arithmetic intensity (AI) is:
$$ \text{AI} = \frac{\text{Number of floating-point operations}}{\text{Bytes of data moved}} $$
- High AI → more likely to be compute-bound.
- Low AI → more likely to be memory-bound (performance limited by memory bandwidth).
On accelerators:
- Global memory bandwidth is high, but compute throughput is even higher.
- Many HPC kernels are memory-bound on GPUs.
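A worked example: the SAXPY operation `y[i] = a*x[i] + y[i]` performs 2 flops per element (one multiply, one add) while moving 12 bytes in single precision (read `x[i]`, read `y[i]`, write `y[i]`), so its AI is 2/12 ≈ 0.17 flop/byte — far too low to saturate compute units, which makes it memory-bound on essentially all GPUs:

```cuda
// SAXPY: 2 flops per 12 bytes moved (8 read + 4 written)
// → arithmetic intensity ≈ 0.17 flop/byte, firmly memory-bound.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```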
Strategies to Increase Effective Arithmetic Intensity
- Reuse data in faster memories (registers, shared memory, caches) rather than reloading from global memory.
- Fuse kernels: combine multiple simple passes (e.g., separate kernels for scaling, adding, and activation) into one to reduce memory traffic.
- Block/tile computations: operate on tiles that fit in shared memory or cache.
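As an illustration of kernel fusion (hypothetical example), the scale, add, and activation passes mentioned above can be combined so each element is read from and written to global memory once, instead of three times each:

```cuda
// Fused scale + add + activation: one global read and one global write per
// element; the intermediate values live only in registers.
__global__ void fused(float *out, const float *x, const float *b,
                      float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a * x[i] + b[i];   // scale and add, kept in a register
        out[i] = fmaxf(v, 0.0f);     // activation (ReLU) before the write-back
    }
}
```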
3. Memory Access Patterns and Coalescing
Efficient use of GPU global memory requires careful access patterns.
Coalesced Accesses
- GPUs group threads into warps (or wavefronts); when these threads access consecutive memory addresses, the hardware can combine them into fewer memory transactions.
- Coalesced pattern example: thread `i` reads `A[i]` (contiguous elements).
- Non-coalesced pattern example: thread `i` reads `A[stride * i]` with a large stride.
Consequences of poor coalescing:
- Increased number of memory transactions.
- Lower effective bandwidth and higher memory latency.
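The two access patterns look like this in kernel form (illustrative copy kernels):

```cuda
// Coalesced: adjacent threads read adjacent addresses, so the hardware can
// combine a warp's loads into a few wide transactions.
__global__ void coalesced(float *out, const float *A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = A[i];
}

// Strided: adjacent threads touch addresses `stride` elements apart, so each
// load may require its own transaction.
__global__ void strided(float *out, const float *A, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((long long)i * stride < n) out[i] = A[(long long)i * stride];
}
```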
Layout and Indexing
To improve coalescing:
- Align arrays and structures so that data used together is contiguous.
- Prefer structures of arrays (SoA) over arrays of structures (AoS) when each thread needs the same field from many items.
- Choose loop and thread indexing so that adjacent threads access adjacent elements.
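The SoA-vs-AoS distinction in code (illustrative particle layout): with AoS, thread `i` reading only the `x` field touches addresses 16 bytes apart; with SoA, `x[i]` accesses are fully contiguous and coalesce.

```cuda
// AoS: fields of one particle are adjacent; the same field across particles
// is strided, so per-field access is not coalesced.
struct ParticleAoS { float x, y, z, m; };

// SoA: each field is its own contiguous array, so thread i reading x[i]
// is fully coalesced.
struct ParticlesSoA {
    float *x, *y, *z, *m;
};
```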
Shared Memory and Locality
Shared memory (or similar on-chip scratchpad memory) allows explicit control of locality:
- Load a tile from global memory into shared memory.
- Perform many operations on this tile.
- Write results back once.
This:
- Reduces global memory traffic.
- Improves effective bandwidth.
- Often gives substantial speedups for stencil computations, matrix operations, and other structured kernels.
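A minimal shared-memory tiling example: a 1D 3-point stencil where each block stages its tile (plus halo elements) in shared memory, so each input element is read from global memory once per block rather than up to three times. The sketch assumes the block size equals `TILE`; the weights are illustrative.

```cuda
#define TILE 256

// 3-point stencil with a shared-memory tile; assumes blockDim.x == TILE.
__global__ void stencil3(float *out, const float *in, int n) {
    __shared__ float tile[TILE + 2];              // +2 for left/right halo
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + 1;

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)                          // load halos at the edges
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                               // tile fully loaded

    if (g < n)
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}
```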
4. Parallelism, Occupancy, and Utilization
Accelerators rely on massive parallelism. Underutilization wastes performance.
Exposing Enough Parallelism
- Each kernel launch should have enough work to keep the GPU busy:
- Many blocks/workgroups.
- Many threads per block/workgroup (subject to hardware limits and resource usage).
- Tiny kernels that use only a fraction of the GPU often underperform: fixed overheads dominate the runtime and the hardware’s parallelism goes unexploited.
Occupancy
Occupancy is the ratio of active warps (or wavefronts) to the maximum possible:
- High occupancy helps hide memory latency by allowing the scheduler to switch among warps.
- Too low occupancy → poor latency hiding.
- However, maximal occupancy is not always necessary for best performance. There is a trade-off with resource usage (registers, shared memory).
Key factors limiting occupancy:
- Registers per thread.
- Shared memory per block.
- Threads per block.
Tools from vendors (e.g., occupancy calculators, profilers) help choose launch parameters.
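In CUDA, the runtime can suggest occupancy-friendly launch parameters directly (hypothetical kernel `mykernel` taking a pointer and a count):

```cuda
// Ask the runtime for a block size that maximizes occupancy for this kernel,
// given its register and shared-memory usage.
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mykernel, 0, 0);

int gridSize = (n + blockSize - 1) / blockSize;   // cover all n elements
mykernel<<<gridSize, blockSize>>>(d_data, n);
```

Treat the suggestion as a starting point: as noted above, maximal occupancy is not always optimal, so profile around it.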
Load Balancing Across Threads
- Threads should have similar amounts of work to avoid idling.
- Avoid severe divergence where some threads do much more work than others.
- Partition the problem so work is spread uniformly across blocks and threads, as much as possible.
5. Control Flow and Divergence
GPUs execute warps of threads in lockstep using SIMD-like execution.
Branch Divergence
- If threads in a warp take different branches of an `if` statement, execution becomes serialized:
- First, one path executes while the other threads in the warp are inactive.
- Then the other path executes, again with some threads inactive.
- Divergence reduces effective parallelism and throughput.
Mitigation strategies:
- Restructure algorithms so that threads in the same warp mostly follow the same path.
- Use data layout or preprocessing to group similar work together.
- Sometimes, replace branches with arithmetic or predication, if it simplifies control flow.
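A small example of replacing a branch with arithmetic (illustrative "leaky" scaling): instead of two divergent paths, every thread executes the same instructions, and the conditional typically compiles to a predicated select rather than a branch.

```cuda
// Branching form (may diverge within a warp):
//   if (x[i] > 0) out[i] = x[i]; else out[i] = 0.1f * x[i];
// Predicated form: identical instruction stream for all threads.
__global__ void leaky(float *out, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        float s = (v > 0.0f) ? 1.0f : 0.1f;  // select, not a branch
        out[i] = s * v;
    }
}
```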
6. Kernel Launch Overheads and Granularity
Each kernel launch has a fixed overhead:
- Launching very small kernels many times can severely limit performance.
- Launch fewer, larger kernels when possible.
Practical tips:
- Fuse consecutive kernels that operate on the same data.
- Process larger batches in one launch instead of many tiny batches.
- If an algorithm is iterative with small per-iteration work, consider:
- Moving the iteration loop inside the kernel.
- Using device-side loops where appropriate.
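Sketch of moving an iteration loop into the kernel (hypothetical per-element update; valid only when iterations have no cross-thread dependencies, since there is no grid-wide synchronization inside a kernel):

```cuda
// Instead of launching a small kernel nsteps times:
//   for (int t = 0; t < nsteps; ++t) step<<<blocks, threads>>>(x, n);
// run the loop device-side and pay the launch overhead once.
__global__ void step_many(float *x, int n, int nsteps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];                 // keep the value in a register
    for (int t = 0; t < nsteps; ++t)
        v = 0.5f * v + 1.0f;        // illustrative per-element update
    x[i] = v;                       // one global write instead of nsteps
}
```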
7. Precision, Mixed Precision, and Specialized Units
Modern accelerators often have specialized hardware for low-precision arithmetic (e.g., tensor cores).
Precision vs Performance vs Accuracy
- Lower precision (e.g., FP16) usually:
- Runs faster.
- Uses less memory and bandwidth.
- Trade-offs:
- Numerical accuracy.
- Algorithmic stability.
Mixed-precision approaches:
- Use low precision for bulk operations.
- Accumulate or refine in higher precision (e.g., FP32 for accumulation, FP64 for critical steps).
These can deliver large speedups if the underlying algorithm tolerates lower-precision computations.
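A minimal mixed-precision sketch: a dot product with FP16 storage (halving memory traffic) but an FP32 running sum. The block-level reduction is deliberately crude (a single `atomicAdd` per thread's partial sum) to keep the example short.

```cuda
#include <cuda_fp16.h>

// FP16 inputs, FP32 accumulation: lower bandwidth cost, higher-precision sum.
__global__ void dot_mixed(float *result, const __half *x, const __half *y,
                          int n) {
    float sum = 0.0f;  // accumulate in FP32
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x) {          // grid-stride loop
        sum += __half2float(x[i]) * __half2float(y[i]);
    }
    atomicAdd(result, sum);  // crude reduction, for brevity only
}
```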
8. Multi-GPU and Accelerator-Aware Parallelism
When using multiple accelerators, performance depends not just on each device, but also on how they are coordinated.
Key considerations:
- Work distribution: Each GPU should have enough work, and similar amounts of work, to avoid load imbalance.
- Inter-GPU communication:
- Use fast interconnects (NVLink, PCIe peer-to-peer) when possible.
- Minimize cross-GPU data transfers; prefer domain decompositions that require minimal boundary exchange.
- Overlap communication with computation:
- While one part of the domain is being updated, exchange halo/boundary data for another part in the background.
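A rough two-GPU overlap sketch (hypothetical kernels `update_interior` and `update_boundary`; assumes peer access has been enabled with `cudaDeviceEnablePeerAccess` and that separate compute and copy streams exist): the interior update, which does not need the halo, runs while the boundary data is copied to the neighboring GPU.

```cuda
cudaSetDevice(0);
// Bulk of the work: interior points, independent of the halo exchange.
update_interior<<<blocks, threads, 0, compute_stream>>>(d0_field, n);

// Meanwhile, push this GPU's boundary slab to device 1 over the interconnect.
cudaMemcpyPeerAsync(d1_halo, 1, d0_boundary, 0, halo_bytes, copy_stream);

// Boundary points need the exchanged halo, so wait for the copy first.
cudaStreamSynchronize(copy_stream);
update_boundary<<<blocks, threads, 0, compute_stream>>>(d0_field, n);
```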
9. Using Libraries and Vendor Tools for Performance
Hand-written kernels are not always necessary or optimal.
Optimized Libraries
Use accelerator-optimized libraries whenever possible:
- Dense linear algebra (e.g., cuBLAS, hipBLAS, oneMKL).
- Sparse linear algebra (e.g., cuSPARSE, rocSPARSE).
- FFTs (e.g., cuFFT, rocFFT, vendor FFT libraries).
They are:
- Tuned by experts.
- Usually faster and more portable across hardware generations.
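For example, a single-precision matrix multiply via cuBLAS rather than a hand-written kernel (assumes column-major matrices with `d_A` m×k, `d_B` k×n, `d_C` m×n already allocated and filled on the device):

```cuda
#include <cublas_v2.h>

// C = alpha * A * B + beta * C, computed by the vendor-tuned cuBLAS SGEMM.
cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha, d_A, m,   // leading dimensions for column-major storage
                    d_B, k,
            &beta,  d_C, m);

cublasDestroy(handle);
```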
Profiling and Measurement
Performance tuning must be guided by measurement:
- Use vendor profilers to:
- Identify kernel hotspots.
- Determine if kernels are compute- or memory-bound.
- Inspect occupancy, memory throughput, and branch divergence.
- Iterate:
- Measure → identify bottleneck → apply specific optimization → re-measure.
10. Portability and Performance Portability
Different vendors and architectures have different performance characteristics:
- Some favor wider vector units.
- Some have larger or smaller on-chip memories.
- Some have different optimal block sizes or memory layouts.
Strategies:
- Use portability frameworks (e.g., Kokkos, RAJA, SYCL) where appropriate.
- Separate algorithmic structure from low-level tuning details.
- Provide multiple implementations or tunable parameters that can be adapted per architecture.
This helps maintain performance across evolving accelerator hardware without fully rewriting code.
11. Energy and Efficiency on Accelerators
Accelerators can provide more performance per watt, but inefficient usage still wastes both energy and compute-time allocations.
Efficiency considerations:
- Avoid running underutilized GPUs (tiny kernels, poor occupancy, excessive idling).
- Reduce unnecessary data movement, which consumes both time and power.
- Choose the right precision to avoid overcomputing.
- For production runs, use tuned parameters discovered via profiling rather than default or “safe” but slow settings.
In summary, performance on accelerators depends on more than simply offloading computation to a GPU. Carefully managing data movement, memory access patterns, parallelism granularity, and hardware-specific features is essential for achieving the speedups that make accelerators attractive in HPC.