Understanding Performance on Accelerators
Accelerators such as GPUs can provide large speedups, but only if code is structured to use their strengths and avoid their bottlenecks. This chapter focuses on how to think about performance on accelerators, which questions to ask, and which metrics to look at. The goal is not to teach a specific programming model, but to build an intuition that you can apply with CUDA, OpenACC, or other frameworks.
Performance Model: Computation vs Data Movement
On accelerators, performance is often limited less by how many floating point operations you can do, and more by how fast you can move data. A simple way to think about this is the roofline model, which relates arithmetic intensity to maximum performance.
Arithmetic intensity is the ratio of useful floating point operations to bytes moved from memory. If your kernel has arithmetic intensity $I$ (in FLOP/byte) and the accelerator has peak compute performance $P_\text{peak}$ (FLOP/s) and peak memory bandwidth $B_\text{mem}$ (byte/s), then the maximum achievable performance is
$$
P_\text{max} = \min\left(P_\text{peak},\; I \times B_\text{mem}\right).
$$
For a given accelerator, if $I \times B_\text{mem} < P_\text{peak}$, your kernel is memory bound. If $I \times B_\text{mem} \geq P_\text{peak}$, it is compute bound.
This simple rule guides many optimization decisions. Compute bound kernels need better utilization of functional units and instruction-level efficiency. Memory bound kernels need improvements in memory access patterns, data layout, and reuse.
The key performance consideration is to increase arithmetic intensity when possible, for example by doing more useful work per loaded datum, and to ensure that when you do move data, you use the full width and bandwidth of the accelerator memory system.
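As a concrete illustration with round, hypothetical hardware numbers, consider a double precision update of the form $y_i \leftarrow a\,x_i + y_i$. Each element performs 2 FLOP and moves 24 bytes (two 8-byte reads and one 8-byte write), so
$$
I = \frac{2\ \text{FLOP}}{24\ \text{byte}} \approx 0.083\ \text{FLOP/byte}.
$$
On a device with, say, $B_\text{mem} = 1\ \text{TB/s}$ and $P_\text{peak} = 10\ \text{TFLOP/s}$, the roofline gives $P_\text{max} = \min(10\ \text{TFLOP/s},\ 0.083 \times 1\ \text{TB/s}) \approx 83\ \text{GFLOP/s}$. The kernel is firmly memory bound: only higher arithmetic intensity, not more compute throughput, can raise this ceiling.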
Host–Device Transfers and Overheads
Accelerators typically have their own memory, separate from the host RAM. Moving data between host and accelerator introduces latency and consumes limited bandwidth on the interconnect, for example PCIe or NVLink. Even if the GPU kernel itself is very fast, large or frequent transfers can dominate runtime.
The time for a single host–device transfer of size $S$ bytes can be estimated as
$$
T_\text{transfer} \approx T_\text{latency} + \frac{S}{B_\text{link}},
$$
where $T_\text{latency}$ is a fixed overhead and $B_\text{link}$ is the effective bandwidth of the interconnect.
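For example, with illustrative values of $T_\text{latency} = 10\ \mu\text{s}$ and $B_\text{link} = 25\ \text{GB/s}$ (roughly the order of a PCIe 4.0 x16 link), the model gives
$$
T_\text{transfer}(4\ \text{KB}) \approx 10\ \mu\text{s} + 0.16\ \mu\text{s} \approx 10\ \mu\text{s},
\qquad
T_\text{transfer}(256\ \text{MB}) \approx 10\ \mu\text{s} + 10.7\ \text{ms} \approx 10.7\ \text{ms}.
$$
The small transfer is almost pure latency, while the large one runs close to full link bandwidth; sending 256 MB as 65536 separate 4 KB copies would cost roughly $65536 \times 10\ \mu\text{s} \approx 0.66\ \text{s}$ in latency alone.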
Several practical considerations follow.
First, throughput is best when transfers are large and infrequent. Many small transfers are inefficient because latency dominates. It is usually better to pack data into larger contiguous buffers and transfer them in fewer operations.
Second, data movement across the host–device boundary should be minimized. Whenever possible, keep data resident on the accelerator across multiple kernels. If an entire pipeline can run on the GPU without bringing intermediate results back to the CPU, overall performance often improves dramatically.
Third, overlapping communication with computation is essential for good performance. Instead of waiting for a transfer to complete before launching a kernel, many accelerator programming models allow asynchronous transfers and overlapping execution. The idea is to copy data for the next computation while the current kernel is running, so that transfer time is hidden behind computation; a code sketch follows after the next point.
Finally, the choice of memory used for transfers matters. Some systems provide pinned or page-locked host memory that enables higher transfer bandwidth and reduces overhead, at the cost of reducing available pageable memory. For large, performance critical transfers, using such memory is often beneficial.
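A minimal CUDA sketch of these last two points combines pinned host memory with an asynchronous copy in one stream, so that an independent kernel in another stream can overlap it. The kernel, buffer sizes, and launch configuration below are illustrative placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void independent_kernel(float *d_work, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_work[i] *= 2.0f;   // placeholder work, unrelated to the transfer
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *h_buf, *d_buf, *d_work;
    cudaMallocHost(&h_buf, bytes);        // pinned (page-locked) host memory
    cudaMalloc(&d_buf, bytes);
    cudaMalloc(&d_work, bytes);           // contents uninitialized; placeholder work only

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Asynchronous copy: returns immediately, the copy proceeds in copy_stream.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);

    // Independent work in another stream can overlap with the transfer.
    independent_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);

    // Wait for both streams before using d_buf together with the kernel results.
    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(compute_stream);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaFree(d_work);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    return 0;
}
```

Note that asynchronous copies only overlap reliably when the host buffer is pinned; with pageable memory the runtime may fall back to a staged, effectively synchronous copy.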
Memory Access Patterns on Accelerators
Once data is on the accelerator, global memory access patterns become critical. Accelerators are designed to service memory requests from many threads in parallel, but they are most efficient when those threads access memory in simple, regular patterns.
Coalescing is the central idea. A group of threads (for example a warp or wavefront) can have their memory requests combined into a smaller number of large memory transactions when their accesses are aligned and contiguous. If each thread in a group accesses consecutive elements in memory, the hardware can read or write a full cache line or memory segment in one transaction. If accesses are scattered, the hardware must issue many smaller transactions, which reduces effective bandwidth.
Two principles follow from this.
First, design data layouts that match the way threads traverse data; a code sketch follows after the second principle. For example, if each thread processes element $i$, storing data as a simple array and indexing by $i$ often yields coalesced accesses. In contrast, pointer chasing through complex data structures, or storing data in an order that does not match thread indices, often leads to very poor memory efficiency.
Second, structure algorithms so that threads within a block or group access nearby data at the same time. This may involve reorganizing loops so that the innermost iteration matches the accelerator execution model, or restructuring multi-dimensional arrays to be accessed in row-major or column-major order consistent with thread indexing.
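As an illustration of the first principle, the two kernels below perform the same logical update, but only the structure-of-arrays version lets consecutive threads touch consecutive addresses. The Particle struct and kernel names are hypothetical:

```cuda
struct Particle { float x, y, z, mass; };   // array-of-structures layout

// Poorly coalesced: consecutive threads read x fields 16 bytes apart,
// so each warp touches several memory segments per load.
__global__ void scale_x_aos(Particle *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// Well coalesced: the x values are contiguous, so a warp's 32 loads map to
// a small number of full-width memory transactions.
__global__ void scale_x_soa(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```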
Besides global memory, accelerators provide faster on-chip memories such as shared memory or similar scratchpad storage. These are most effective when used to increase data reuse and reduce repeated global memory accesses. For example, in a stencil or matrix operation, threads can cooperatively load a tile of data into shared memory, then reuse it multiple times. This increases arithmetic intensity and can move a kernel from memory bound toward compute bound.
However, shared memory has its own access rules. Bank conflicts occur when multiple threads access different addresses that map to the same memory bank in a single cycle. This can serialize access and hurt performance. Avoiding bank conflicts often requires careful alignment, padding, or adjusted indexing so that different threads read from different banks when possible.
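The classic shared memory matrix transpose is a compact sketch of both ideas: a tile is staged on chip so that the global read and the global write are both coalesced, and one extra column of padding keeps the threads of a warp on different banks. The tile size and the assumption that $n$ is a multiple of the tile size are illustrative:

```cuda
#define TILE 32

// Transpose a square n-by-n matrix; launch with dim3 grid(n/TILE, n/TILE)
// and dim3 block(TILE, TILE), assuming n is a multiple of TILE.
__global__ void transpose_tiled(const float *in, float *out, int n) {
    // +1 padding: without it, the column accesses below would hit the same bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read

    __syncthreads();

    // Swap the block indices so the write is also coalesced in the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
```

Each element is read and written exactly once in global memory, while the reshuffling happens entirely in fast on-chip storage.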
Caching behavior also influences performance. Some accelerators have hardware caches for global memory, but relying on them without understanding the access pattern can be risky. Streaming large arrays that are used only once can evict more valuable cached data. Many models provide flags or memory spaces for read-only data that is accessed by many threads, such as constant memory or texture-like caches, which can be used when access patterns are suitable.
Occupancy and Parallelism on Accelerators
Accelerators rely on massive parallelism to hide latency. Whenever a thread issues a long latency operation, for example a global memory read, the hardware can switch to another ready thread group and keep the compute units busy. This is why occupancy is a key performance concept.
Occupancy is usually defined as the ratio of the number of warps or thread groups resident on a streaming multiprocessor or similar compute unit to the maximum number that unit can support. It depends on several resource limits per compute unit.
First, the total number of threads and blocks requested by the kernel. If you launch too few, you cannot reach high occupancy regardless of other factors.
Second, the registers used per thread. Each compute unit has a fixed number of registers. If each thread uses many registers, fewer threads can be resident, which lowers occupancy. Some compilers allow you to limit register usage, but this may cause register spilling to local memory, which can hurt performance. The tradeoff between register usage and occupancy is often subtle and must be measured.
Third, shared memory usage per block. If each block allocates a large shared memory buffer, fewer blocks can reside on the compute unit. Reducing shared memory or reusing buffers may allow more concurrent blocks, increasing occupancy and potentially improving throughput.
Fourth, hardware limits on the number of blocks or warps per compute unit. Even if registers and shared memory suffice, architecture specific limits may cap occupancy.
Higher occupancy does not always mean higher performance. The goal is to have enough active threads to hide latencies, while still giving each thread enough resources. Always measure the impact of occupancy changes.
Choosing block sizes, grid sizes, and other kernel launch parameters is part of performance tuning. Reasonable defaults often work, but for performance critical kernels it is common to experiment with different configurations to balance occupancy, memory access efficiency, and the resource needs of the algorithm.
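In CUDA specifically, the runtime can suggest a block size that maximizes theoretical occupancy for a given kernel, which is a reasonable starting point before measuring alternatives. The kernel and launch wrapper below are placeholders:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void launch_with_suggested_block_size(float *d_data, int n) {
    int min_grid_size = 0, block_size = 0;

    // Ask the runtime for a block size that maximizes theoretical occupancy
    // for this kernel, given its register and shared memory usage.
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);

    int grid_size = (n + block_size - 1) / block_size;
    printf("suggested block size: %d\n", block_size);
    my_kernel<<<grid_size, block_size>>>(d_data, n);
}
```

The suggestion only maximizes theoretical occupancy; measured performance should still decide the final configuration.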
Control Flow, Divergence, and SIMT Execution
Most accelerators use a Single Instruction, Multiple Threads (SIMT) execution model. Threads are organized into groups that share a program counter. When all threads in a group execute the same instruction stream, the hardware can be highly efficient. When threads within a group diverge, for example due to conditional branches, loops with different iteration counts, or early exits, performance suffers.
Divergence arises whenever threads in the same group follow different control paths. To handle this, the hardware serializes the different paths and masks off threads that are not active on a given path. As a result, even if only one thread takes a rarely used branch, the entire group may have to execute that branch, wasting cycles for the other threads.
To improve performance, kernels should minimize divergence within thread groups. Common techniques include reorganizing computations so that threads in the same group process data with similar control flow, separating special cases into separate kernels or code paths that are launched only on the relevant subset of data, and replacing some branches with arithmetic expressions when appropriate.
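As a small example of the last technique, a data-dependent branch can sometimes be replaced by a select-style expression so that all threads in a group execute the same instruction stream. The clamping operation below is purely illustrative, and a good compiler may already predicate such short branches:

```cuda
// Divergent version: threads in the same warp may take different paths.
__global__ void clamp_branch(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;
        else
            x[i] = x[i] * 2.0f;
    }
}

// Branchless version: every active thread executes the same instructions.
__global__ void clamp_select(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        x[i] = (v < 0.0f) ? 0.0f : v * 2.0f;   // typically compiled to a select or predicated op
    }
}
```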
Loops can also cause divergence if different threads have different iteration counts. Whenever possible, structure loops so that all threads in a group perform the same number of iterations, or group data by similar iteration counts.
Some amount of divergence is inevitable in many real applications. The key is to identify hotspots where divergence strongly affects performance, for example inner loops in performance critical kernels, and to restructure them so that the common path is as uniform as possible.
Precision, Specialized Units, and Throughput
Accelerators often support multiple floating point precisions with very different performance. For example, single precision may be much faster than double precision, and special formats such as half precision or bfloat16 may be supported by dedicated units with yet higher throughput, especially on hardware designed with machine learning workloads in mind.
If an algorithm tolerates lower precision without harming scientific validity or stability, using a faster precision can yield very large speedups. Mixed precision techniques use high precision where it is needed for stability, for example in accumulators or residual computations, and low precision elsewhere.
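A common mixed precision pattern is to store and load data in a low precision format while accumulating in a higher one. The sketch below uses half precision inputs with single precision accumulators; the kernel and array names are placeholders, and the final reduction of the partial sums is omitted:

```cuda
#include <cuda_fp16.h>

// Dot-product style reduction: half precision storage, single precision
// accumulation to limit the growth of rounding error.
// partial must have one entry per launched thread.
__global__ void dot_mixed(const __half *a, const __half *b, float *partial, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc = 0.0f;                                     // high precision accumulator
    for (int k = i; k < n; k += stride) {
        acc += __half2float(a[k]) * __half2float(b[k]);   // convert on load
    }
    partial[i] = acc;   // a separate reduction would combine the partial sums
}
```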
Specialized units such as tensor cores or matrix multiply-accumulate units can perform small matrix operations at extremely high throughput. To benefit from them, computations must be reformulated in the small matrix shapes they support, and operands often must be stored in specific layouts. This sometimes requires nontrivial refactoring of algorithms but can provide significant performance gains for kernels dominated by dense linear algebra.
The tradeoff between precision, performance, and numerical behavior needs careful consideration. From a performance perspective, the rule is simple: select the lowest precision and the most specialized units that still produce correct and acceptable results for the problem at hand.
Kernel Granularity and Launch Overheads
Each kernel launch has an overhead on the host and device side. If kernels are very small and do little work, this overhead can become a large fraction of runtime. This is particularly problematic in applications that originally evolved as many small CPU functions and are naively ported to accelerators as many tiny kernels.
A practical guideline is to increase kernel granularity. Instead of launching one kernel per element or per tiny task, combine work so that each kernel processes a larger batch of data. This can be achieved by fusing adjacent operations into a single kernel, or by processing multiple time steps or multiple problem instances within one launch, when dependencies allow.
Kernel fusion has additional benefits beyond reducing launch overhead. Fused kernels can keep intermediate data on-chip, for example in registers or shared memory, instead of writing to global memory and then reading back in a subsequent kernel. This reduces global memory traffic and increases arithmetic intensity.
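As a sketch, two element-wise operations that would otherwise be separate launches, with an intermediate array written to and read back from global memory, can be fused so that the intermediate value never leaves a register. The operations shown are arbitrary examples:

```cuda
// Unfused: tmp travels through global memory between the two launches.
__global__ void scale(const float *x, float *tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}
__global__ void add(const float *tmp, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];
}

// Fused: one launch, and the intermediate value stays in a register.
__global__ void scale_add(const float *x, const float *y, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
```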
However, very large or complex kernels can become harder to optimize, maintain, and debug. They may also increase register pressure and lower occupancy. Practical tuning often involves experimenting with how much fusion is beneficial for a given hardware and problem.
Overlapping and Asynchronous Execution
Accelerators and hosts can often execute operations concurrently. Asynchronous kernel launches and data transfers allow the runtime to overlap different activities, which is crucial to reaching high utilization.
The central idea is to organize work in multiple independent streams or queues. While the accelerator executes a kernel from one stream, it can perform data transfers associated with another stream, provided that hardware resources exist for both. Likewise, the host can continue preparing data or launching new work without waiting for earlier operations to complete, unless results are actually needed.
Effective use of asynchronous operations imposes some constraints. First, dependencies between computations must be clear and respected. If a kernel uses data that is being transferred, proper synchronization must ensure that the transfer completes first. Most APIs provide explicit mechanisms for this, such as events or waits.
Second, data must be placed and aligned so that the hardware copy engines can operate efficiently and independently of computation units. Systems may have multiple copy engines that can overlap host-to-device and device-to-host transfers with kernels, but only if streams and memory allocations are configured correctly.
Third, from a performance perspective, you should aim to construct pipelines where different stages are continuously kept busy, for example reading input on the host, transferring to the device, running kernels, and transferring results back, all in parallel across different batches of data. The benefits are most visible in applications that repeatedly process many chunks of data, such as streaming or batched workloads.
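A minimal sketch of such a pipeline processes a large array in chunks across two CUDA streams, so that the transfers of one chunk overlap the kernel of another. The chunk layout, the kernel, and the assumption that the host buffers are pinned are all illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {                // placeholder per-chunk work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

// h_in and h_out are assumed to be pinned and hold n_chunks * chunk floats;
// d_buf holds one device buffer per stream, each large enough for one chunk.
void pipeline(const float *h_in, float *h_out, float *d_buf[2], int n_chunks, int chunk) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int c = 0; c < n_chunks; ++c) {
        int s = c % 2;                                    // alternate between the two streams
        size_t offset = (size_t)c * chunk;
        size_t bytes  = (size_t)chunk * sizeof(float);

        // Operations in one stream run in order for this chunk, but independently
        // of the other stream, so copies and kernels of different chunks overlap.
        cudaMemcpyAsync(d_buf[s], h_in + offset, bytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h_out + offset, d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```

Because each device buffer is only reused within its own stream, the in-order execution of that stream guarantees that a chunk's copy back to the host completes before the buffer is overwritten by a later chunk.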
Power, Thermal Limits, and Sustained Performance
Accelerators operate within power and thermal envelopes. Even if theoretical peak performance is high, sustained performance may be lower if kernels trigger throttling. Very intensive kernels in double precision, or heavy use of specialized units, may cause the device to reach power limits and reduce frequency.
In practice, this means that kernels that saturate all units at maximum frequency may not see proportional speedups compared to slightly less intensive ones. Sometimes, a more balanced kernel that avoids pathological hot spots and uses resources more evenly can achieve better sustained throughput.
Monitoring tools provided by vendors can report power draw, temperature, clock frequencies, and throttling reasons. When optimizing performance, it is useful to watch these metrics. If a kernel is consistently power limited, algorithmic changes that reduce peak power consumption per operation, such as using lower precision, improving data locality, or smoothing computational intensity over time, may lead to higher sustained performance.
On shared systems, power and thermal management may be controlled by the cluster administrators, and per job power caps may be enforced. Within those constraints, efficient use of accelerators means doing as much useful work as possible per joule consumed.
Practical Performance Tuning Strategy
Given all these factors, performance optimization on accelerators can seem complex. A structured approach helps.
First, profile before optimizing. Use performance tools that show where time is spent, how busy compute units are, and whether kernels are memory bound or compute bound. Identify a small number of hot kernels that dominate execution.
Second, for each hot kernel, classify it using the roofline perspective. Estimate arithmetic intensity and identify whether it is memory or compute bound. Then focus optimizations accordingly. Memory bound kernels usually need better data layouts, more coalesced accesses, and increased reuse through shared memory or caches. Compute bound kernels benefit from better occupancy, reduced divergence, and improved instruction efficiency.
Third, check for avoidable host–device transfers. Aim to keep data on the accelerator as long as possible, and to overlap remaining transfers with computation. Consolidate small transfers into larger ones when possible.
Fourth, adjust kernel launch configurations and resource usage. Experiment with block sizes and shared memory to balance occupancy and resource availability. Measure the impact of changes, since there is no universal best configuration.
Fifth, consider kernel fusion and restructuring of control flow to reduce divergence and launch overhead. Always balance performance gains with code complexity and maintainability.
Finally, validate numerical correctness and stability when changing precision or algorithm structure. Performance improvements are only useful if the scientific or engineering results remain trustworthy.
With these principles and a willingness to iterate based on measurements, accelerators can be used much more effectively, and large fractions of their theoretical performance can be reached in real applications.