
10.5 Performance considerations for accelerators

Key Performance Principles for Accelerators

Accelerators can deliver large speedups, but only if their strengths are used well and their bottlenecks are avoided. This chapter focuses on the main performance considerations when using GPUs and other accelerators in HPC codes.

We will assume you already know basic GPU architecture concepts and the idea of host (CPU) vs device (GPU) from earlier sections.


1. Data Movement and PCIe/Interconnect Bottlenecks

In most systems, the CPU and GPU have separate memory spaces. Moving data between them is often the dominant cost.

Minimize Host–Device Transfers

Keep data resident on the device for as long as possible: allocate working sets on the GPU, run consecutive kernels on device data, and copy back only the results the host actually needs.

Transfer Larger, Fewer Chunks

Transferring many small buffers incurs per-transfer overhead (API calls, latency). Instead:

- Pack small arrays into one contiguous buffer and issue a single large copy, as sketched below.
- Use pinned (page-locked) host memory for faster, asynchronous-capable transfers.
- Arrange data structures so that values which travel together are stored together.
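A minimal sketch of the batching idea in CUDA, assuming hypothetical names (`transfer_batched`, a set of equally sized small host arrays) and a pinned staging buffer:

```cuda
// Sketch: one large transfer from a pinned staging buffer instead of many
// small copies. All names and sizes here are illustrative.
#include <cuda_runtime.h>
#include <cstring>

void transfer_batched(const float* const* small_arrays, int n_arrays,
                      int elems_each, float* d_dst) {
    size_t total = (size_t)n_arrays * elems_each * sizeof(float);

    // Pinned (page-locked) host memory transfers faster and can be used
    // with asynchronous copies.
    float* h_staging = nullptr;
    cudaMallocHost(&h_staging, total);

    // Pack the small arrays into one contiguous buffer on the host...
    for (int i = 0; i < n_arrays; ++i)
        std::memcpy(h_staging + (size_t)i * elems_each, small_arrays[i],
                    (size_t)elems_each * sizeof(float));

    // ...then pay the per-transfer overhead once instead of n_arrays times.
    cudaMemcpy(d_dst, h_staging, total, cudaMemcpyHostToDevice);

    cudaFreeHost(h_staging);
}
```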

Overlap Communication and Computation

Use asynchronous operations to hide transfer latency:

- Split the data into chunks and use separate streams (or queues) for copying and computing.
- Issue asynchronous copies from pinned host memory so that chunk k+1 transfers while chunk k is processed.
- Overlap the copy-back of finished results with the next chunk's computation.

This is often called double buffering or pipelining.
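A CUDA sketch of the pattern, assuming a hypothetical `process` kernel and a pinned host buffer `h_src` (asynchronous copies only overlap with compute when the host memory is page-locked):

```cuda
// Sketch: two streams alternate so chunk k+1 is copied in while chunk k
// is being processed. Copy-back of results is omitted for brevity.
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

void pipeline(const float* h_src, int n_chunks, int chunk) {
    cudaStream_t s[2];
    float* d_buf[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
    }

    for (int k = 0; k < n_chunks; ++k) {
        int b = k % 2;   // alternate buffers/streams: classic double buffering
        cudaMemcpyAsync(d_buf[b], h_src + (size_t)k * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(s[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(s[i]);
    }
}
```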

Consider Interconnect Limits

The host–device link is far slower than device memory: a PCIe connection delivers tens of GB/s, while modern GPU memory delivers on the order of a TB/s (faster links such as NVLink narrow, but do not close, this gap). An algorithm that streams its data across the interconnect for every use is limited by that link, not by the GPU.

2. Compute vs Memory Bound: Arithmetic Intensity

An accelerator’s theoretical peak performance is rarely achieved in practice: most kernels are limited either by memory bandwidth (memory-bound) or by compute throughput (compute-bound), and knowing which applies determines which optimizations will actually help.

Arithmetic Intensity

Arithmetic intensity (AI) is:
$$ \text{AI} = \frac{\text{Number of floating-point operations}}{\text{Bytes of data moved}} $$
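As a worked example (with illustrative hardware numbers), consider double-precision AXPY, $y \leftarrow a\,x + y$: each element costs 2 flops but moves 24 bytes (load $x_i$, load $y_i$, store $y_i$), so

$$ \text{AI} = \frac{2\ \text{flops}}{24\ \text{bytes}} \approx 0.08\ \text{flops/byte}. $$

A device that needs on the order of 10 flops per byte to reach peak will run this kernel entirely at the speed of its memory system: AXPY is firmly memory-bound.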

On accelerators:

- Peak compute throughput is enormous relative to memory bandwidth, so a kernel needs high AI (often on the order of ten or more flops per byte, depending on the device) before it becomes compute-bound.
- Most real-world kernels fall below that threshold and are memory-bound.
- The roofline model makes this explicit: attainable performance is bounded by min(peak flops, AI × bandwidth).

Strategies to Increase Effective Arithmetic Intensity

- Fuse adjacent kernels so intermediate results stay in registers instead of round-tripping through global memory (sketched below).
- Tile and block computations so that data loaded once is reused many times.
- Stage reused data in shared memory or exploit caches.
- Recompute cheap quantities rather than storing and reloading them.
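As a sketch of the first strategy, kernel fusion, consider computing $y = a\,x + b\,x^2$ (the kernels are illustrative, not from the original text). Fusing eliminates the intermediate array and its round trip through global memory:

```cuda
// Unfused: two launches, with `tmp` written to and re-read from global memory.
__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}
__global__ void add_sq(const float* x, const float* tmp, float* y,
                       float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + b * x[i] * x[i];
}

// Fused: the same flops, but far fewer bytes moved, so higher AI.
__global__ void fused(const float* x, float* y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];              // load x once, reuse it in a register
        y[i] = a * xi + b * xi * xi;  // intermediate value never leaves chip
    }
}
```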

3. Memory Access Patterns and Coalescing

Efficient use of GPU global memory requires careful access patterns.

Coalesced Accesses

When consecutive threads in a warp access consecutive memory addresses, the hardware coalesces those accesses into a small number of wide memory transactions.

Consequences of poor coalescing:

- Each warp-level load or store splinters into many separate transactions.
- Bandwidth is spent on bytes no thread actually uses.
- Effective memory throughput can drop by an order of magnitude.

Layout and Indexing

To improve coalescing:

- Map the fastest-varying thread index to the fastest-varying (stride-1) array dimension.
- Prefer structure-of-arrays (SoA) over array-of-structures (AoS) layouts for data accessed field by field.
- Align and pad arrays so that warps start on transaction boundaries.
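The contrast between the two patterns, sketched for a row-major 2-D array (kernel names are illustrative):

```cuda
// Coalesced: thread `col` walks down a column, so within one iteration the
// warp touches a contiguous, stride-1 segment of a row.
__global__ void row_walk_coalesced(const float* A, float* out,
                                   int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float s = 0.0f;
    for (int row = 0; row < height; ++row)
        s += A[(size_t)row * width + col];   // neighbors differ by 1 element
    out[col] = s;
}

// Strided: consecutive threads read addresses `width` elements apart, so each
// warp access splinters into many memory transactions.
__global__ void row_walk_strided(const float* A, float* out,
                                 int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col)
        s += A[(size_t)row * width + col];   // neighbors differ by `width`
    out[row] = s;
}
```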

Shared Memory and Locality

Shared memory (or similar on-chip scratchpad memory) allows explicit control of locality:

- A thread block cooperatively loads a tile of data from global memory.
- Threads reuse that tile many times from fast on-chip storage.
- Results are written back to global memory once, in a coalesced pattern.

This:

- reduces global memory traffic,
- turns strided or repeated accesses into cheap on-chip ones,
- and raises the kernel's effective arithmetic intensity.
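A sketch of the classic tiled matrix transpose for a square $n \times n$ array, launched with `dim3 block(TILE, TILE)`:

```cuda
// Both the global load and the global store are coalesced; the strided
// access pattern happens only inside fast on-chip shared memory.
#define TILE 32

__global__ void transpose_tiled(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[(size_t)y * n + x]; // coalesced load
    __syncthreads();

    // Swap the block coordinates for the write; threads still move stride-1.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[(size_t)y * n + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}
```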

4. Parallelism, Occupancy, and Utilization

Accelerators rely on massive parallelism. Underutilization wastes performance.

Exposing Enough Parallelism

A large GPU needs tens of thousands of resident threads to hide memory latency and keep its execution units busy. A problem that exposes only a few thousand independent work items leaves most of the device idle, so look for extra parallelism (for example, by batching independent problems) before fine-tuning anything else.

Occupancy

Occupancy is the ratio of active warps (or wavefronts) to the maximum the hardware can hold:

$$ \text{Occupancy} = \frac{\text{active warps per multiprocessor}}{\text{maximum warps per multiprocessor}} $$

Higher occupancy gives the scheduler more warps to switch to while others wait on memory, though a kernel that already hides its latency gains nothing from pushing occupancy further.

Key factors limiting occupancy:

- registers used per thread,
- shared memory used per block,
- threads per block (block size),
- hardware limits on resident blocks per multiprocessor.

Tools from vendors (e.g., occupancy calculators, profilers) help choose launch parameters.
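For example, the CUDA runtime can suggest an occupancy-maximizing block size for a given kernel (a minimal sketch with a hypothetical `my_kernel`):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void launch_with_suggested_block(float* d_data, int n) {
    int min_grid = 0, block = 0;
    // Asks the runtime for a block size that maximizes theoretical occupancy,
    // given my_kernel's register and shared-memory usage.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, my_kernel, 0, 0);
    std::printf("suggested block size: %d\n", block);
    my_kernel<<<(n + block - 1) / block, block>>>(d_data, n);
}
```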

Load Balancing Across Threads

If threads or blocks receive very different amounts of work, the fast ones idle while the slowest finish. Aim for uniform work per thread; for irregular problems, consider sorting or binning work items by cost, or using work-queue schemes that let threads grab new items as they finish.

5. Control Flow and Divergence

GPUs execute warps of threads in lockstep using SIMD-like execution.

Branch Divergence

When threads within a warp take different sides of a branch, the hardware executes each path in turn with the non-participating lanes masked off, so a fully divergent warp can run at a small fraction of its normal rate.

Mitigation strategies:

- Restructure data or work assignment so threads in the same warp follow the same path.
- Replace short branches with branchless arithmetic (selects, min/max, predication), as sketched below.
- Hoist data-dependent decisions out of inner loops where possible.
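A small sketch of the branchless rewrite (for a branch this trivial the compiler often predicates it automatically; the technique matters more as branch bodies grow):

```cuda
// Divergent form: threads in the same warp may take different paths.
__global__ void clamp_divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;
    }
}

// Branchless form: every lane executes the same instruction stream.
__global__ void clamp_branchless(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = fmaxf(x[i], 0.0f);   // a select, not a branch
}
```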

6. Kernel Launch Overheads and Granularity

Each kernel launch has a fixed overhead:

- the host-side cost of the launch call itself (typically microseconds),
- driver and scheduling work before the kernel starts,
- any synchronization the host performs between launches.

For kernels that run for milliseconds this is negligible; for very short kernels it can dominate the total runtime.

Practical tips:

- Make kernels coarse-grained enough that useful work dwarfs the launch cost.
- Fuse sequences of tiny kernels where the data flow allows it.
- Use grid-stride loops so a single launch covers an arbitrary problem size (sketched below).
- For long, repeated sequences of small launches, consider batching mechanisms such as CUDA Graphs.
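A sketch of the grid-stride idiom, which decouples launch configuration from problem size:

```cuda
// One launch of modest size covers any n; each thread strides through the
// array instead of handling exactly one element.
__global__ void saxpy_grid_stride(float a, const float* x, float* y, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Usage: launch count (and thus launch overhead) stays constant as n grows.
// saxpy_grid_stride<<<1024, 256>>>(2.0f, d_x, d_y, n);
```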

7. Precision, Mixed Precision, and Specialized Units

Modern accelerators often have specialized hardware for low-precision arithmetic (e.g., tensor cores).

Precision vs Performance vs Accuracy

Lower precision pays off twice: specialized units deliver many more low-precision operations per cycle, and smaller elements mean less memory traffic. The cost is reduced range and fewer mantissa bits, which can hurt accuracy or numerical stability.

Mixed-precision approaches:

- Store data in low precision but accumulate in higher precision.
- Do the bulk of the work in low precision and correct the result in high precision (e.g., iterative refinement in linear solvers).
- Keep sensitive quantities such as norms, residuals, and global reductions in higher precision.

These can deliver large speedups if the underlying algorithm tolerates lower-precision computations.
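A sketch of one such approach, FP16 storage with FP32 accumulation (a hand-rolled dot product for illustration; production code would reach tensor cores through libraries such as cuBLAS):

```cuda
#include <cuda_fp16.h>

// Inputs are stored in half precision, halving memory traffic; the
// accumulator stays in float to limit rounding-error growth.
// `partial` must be zeroed before launch.
__global__ void dot_fp16_storage_fp32_acc(const __half* x, const __half* y,
                                          float* partial, int n) {
    float acc = 0.0f;   // higher-precision accumulator
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += __half2float(x[i]) * __half2float(y[i]);
    atomicAdd(partial, acc);   // crude reduction, adequate for a sketch
}
```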


8. Multi-GPU and Accelerator-Aware Parallelism

When using multiple accelerators, performance depends not just on each device, but also on how they are coordinated.

Key considerations:

- Decompose the problem so each device owns a roughly equal share of data and work.
- Minimize and batch inter-device communication, and use fast device-to-device paths (e.g., NVLink, GPU-aware MPI) where available.
- Overlap inter-device communication with computation, just as with host–device transfers.
- Avoid serializing everything through a single coordinating host thread.
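A minimal single-node sketch in CUDA (one slice per device; real codes usually pair one MPI rank with one GPU and add halo exchange):

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void work(float* slice, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slice[i] *= 2.0f;   // placeholder computation
}

void run_on_all_gpus(size_t n_total) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    size_t per_dev = n_total / ndev;        // assumes an even split

    std::vector<float*> slices(ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                   // subsequent calls target device d
        cudaMalloc(&slices[d], per_dev * sizeof(float));
        work<<<(per_dev + 255) / 256, 256>>>(slices[d], per_dev);
        // Launches are asynchronous, so all devices compute concurrently.
    }
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();            // wait for device d to finish
        cudaFree(slices[d]);
    }
}
```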

9. Using Libraries and Vendor Tools for Performance

Hand-written kernels are not always necessary or optimal.

Optimized Libraries

Use accelerator-optimized libraries whenever possible:

- dense and sparse linear algebra (e.g., cuBLAS/rocBLAS, cuSPARSE),
- FFTs (e.g., cuFFT, rocFFT),
- deep-learning primitives (e.g., cuDNN, MIOpen),
- parallel algorithm building blocks (e.g., Thrust, CUB).

They are:

- tuned by vendors for each hardware generation,
- often close to the attainable peak for their operation,
- and far cheaper to adopt than writing and maintaining custom kernels.
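For example, a single-precision matrix multiply reduces to one cuBLAS call (link with `-lcublas`; cuBLAS assumes column-major storage, and the handle is created once elsewhere with `cublasCreate`):

```cuda
#include <cublas_v2.h>

// C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n),
// all already resident in device memory, column-major.
void gemm(cublasHandle_t handle, const float* d_A, const float* d_B,
          float* d_C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,    // leading dimension of A
                        d_B, k,    // leading dimension of B
                &beta,  d_C, m);   // leading dimension of C
}
```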

Profiling and Measurement

Performance tuning must be guided by measurement:

- Profile first, and optimize only the kernels and transfers that actually dominate.
- Use vendor profilers (e.g., Nsight Systems and Nsight Compute, rocprof) to inspect timelines, occupancy, and memory behavior.
- Time device work with device-side timers rather than host wall clocks, as sketched below.
- Re-measure after every change rather than trusting intuition.
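Device-side events time GPU work correctly even though kernel launches return immediately on the host (a minimal CUDA sketch):

```cuda
#include <cuda_runtime.h>

// Returns the GPU time in milliseconds spent between the two events.
// `launch` is any host function that enqueues the work to be measured.
float time_kernel_ms(void (*launch)()) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                      // enqueue the kernel(s) under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // block until the GPU passes `stop`

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```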

10. Portability and Performance Portability

Different vendors and architectures have different performance characteristics:

- warp/wavefront width, cache sizes, and shared-memory capacity all differ,
- the balance between bandwidth and compute differs,
- block sizes and tiling factors tuned for one device are rarely optimal on another.

Strategies:

- Use portability layers (e.g., OpenMP target offload, SYCL, Kokkos, RAJA) or standard-language parallelism where they meet your performance needs.
- Isolate device-specific kernels behind narrow interfaces so backends can be swapped, as sketched below.
- Parameterize block sizes and tiling factors instead of hard-coding them.
- Lean on vendor libraries, which absorb much of the per-architecture tuning.

This helps maintain performance across evolving accelerator hardware without fully rewriting code.
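As a sketch of the interface-isolation strategy (file layout and names are illustrative): host code sees only a plain C++ declaration, and each backend supplies its own definition.

```cuda
// axpy.h -- the portable interface; host code includes only this.
//   void device_axpy(std::size_t n, float a, const float* x, float* y);

// axpy_cuda.cu -- the CUDA backend. A HIP, SYCL, or OpenMP-target backend
// would implement the same signature in its own source file.
#include <cstddef>

__global__ void axpy_kernel(std::size_t n, float a,
                            const float* x, float* y) {
    std::size_t i = (std::size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void device_axpy(std::size_t n, float a, const float* x, float* y) {
    axpy_kernel<<<(unsigned)((n + 255) / 256), 256>>>(n, a, x, y);
}
```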


11. Energy and Efficiency on Accelerators

Accelerators can provide more performance per watt than CPUs, but inefficient usage still wastes both energy and compute-time allocations.

Efficiency considerations:

- An idle or underutilized GPU still draws substantial power, so release or share devices you are not using.
- Energy is roughly average power times runtime, so the optimizations in this section usually save energy as well as time.
- Prefer blocking synchronization over busy-waiting on the host where the runtime offers it.
- When efficiency matters, measure power as well as time, as sketched below.
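As one way to measure, NVML exposes instantaneous board power (link with `-lnvidia-ml`; sampling it around a run gives a rough energy estimate as power times time):

```cuda
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);     // first GPU on the node

    unsigned int milliwatts = 0;
    nvmlDeviceGetPowerUsage(dev, &milliwatts);
    std::printf("current draw: %.1f W\n", milliwatts / 1000.0);

    nvmlShutdown();
    return 0;
}
```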

In summary, performance on accelerators depends on more than simply offloading computation to a GPU. Carefully managing data movement, memory access patterns, parallelism granularity, and hardware-specific features is essential for achieving the speedups that make accelerators attractive in HPC.
