What FFT Libraries Provide in HPC
Fast Fourier Transform (FFT) libraries implement discrete Fourier transforms (DFTs) and related transforms efficiently and portably. In HPC, you generally do not write your own FFT; you rely on optimized libraries that:
- Exploit CPU vector units and memory hierarchy
- Use multi-threading and/or MPI
- Support multidimensional and batched transforms
- Handle real vs complex data, in-place vs out-of-place transforms
- Provide highly tuned kernels for particular architectures
Most major numerical stacks ship with or depend on one or more FFT libraries.
Key criteria when choosing an FFT library in HPC:
- Performance on your target architecture (CPU vs GPU, vendor)
- Parallel model support (OpenMP, MPI, GPU backends)
- Precision needs (single, double, sometimes half or quad)
- Licensing and portability
- API complexity and ecosystem support (Fortran, C, C++, Python bindings, etc.)
This chapter focuses on representative FFT libraries commonly seen in HPC rather than on FFT theory.
Widely Used FFT Libraries
FFTW (Fastest Fourier Transform in the West)
FFTW is one of the most widely used general-purpose FFT libraries on CPUs.
Key characteristics
- Open source, portable, widely available on clusters
- Supports 1D, 2D, and 3D transforms
- Supports complex-to-complex, real-to-complex, and complex-to-real transforms
- Single and double precision; some support for long double
- Supports in-place and out-of-place transforms
- Supports multi-threading (via POSIX threads or OpenMP, depending on build)
Planning mechanism
A central concept in FFTW is the plan. You describe the transform you want, and FFTW builds an optimized plan for executing it:
- Planning can use different strategies, selected via flags:
- FFTW_ESTIMATE – fast planning, less optimization
- FFTW_MEASURE or FFTW_PATIENT – spend time benchmarking different methods for your transform size and cache, then choose the best
- Plans can be saved and reused, reducing startup overhead for repeated runs
Typical C usage pattern (schematic):
#include <fftw3.h>

int main(void) {
    int N = 1024;
    fftw_complex *in, *out;
    fftw_plan plan;

    /* fftw_malloc returns memory aligned for SIMD loads/stores */
    in  = fftw_malloc(sizeof(fftw_complex) * N);
    out = fftw_malloc(sizeof(fftw_complex) * N);

    /* plan first: FFTW_MEASURE benchmarks candidates and overwrites in/out */
    plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* initialize in[...] (after planning, since FFTW_MEASURE clobbers it) */
    fftw_execute(plan); /* perform the FFT */
    /* use out[...] */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
HPC considerations:
- Create plans once and reuse them in loops
- Consider using FFTW wisdom (saved plans) to amortize planning cost
- Use multi-threaded FFTW when appropriate (fftw_init_threads, fftw_plan_with_nthreads); see the sketch below
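A minimal sketch combining these points, assuming an OpenMP-threaded FFTW build; the wisdom file name fftw_wisdom.dat and the transform size are only illustrative:

#include <fftw3.h>
#include <omp.h>

/* Sketch: multi-threaded FFTW with wisdom reuse.
   "fftw_wisdom.dat" is an arbitrary example file name; link against
   -lfftw3_omp (or -lfftw3_threads) in addition to -lfftw3. */
int main(void) {
    const int N = 1 << 20;

    fftw_init_threads();                            /* once, before creating plans */
    fftw_plan_with_nthreads(omp_get_max_threads()); /* threads used by each plan   */

    /* Reuse previously measured plans if available (fails harmlessly on first run) */
    fftw_import_wisdom_from_filename("fftw_wisdom.dat");

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_plan plan = fftw_plan_dft_1d(N, data, data, FFTW_FORWARD, FFTW_MEASURE);

    for (int i = 0; i < N; i++) { data[i][0] = 1.0; data[i][1] = 0.0; }
    fftw_execute(plan);                             /* reuse this plan in loops */

    fftw_export_wisdom_to_filename("fftw_wisdom.dat");
    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}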
FFTW is often available as a module on clusters, e.g. module load fftw.
Vendor-Optimized FFT Libraries (MKL, cuFFT, etc.)
Hardware vendors ship their own math libraries, usually including FFT routines that are heavily tuned for their processors or accelerators.
Intel oneMKL FFTs
Intel’s Math Kernel Library (oneMKL) includes highly optimized FFT routines for Intel CPUs (and some other platforms via oneAPI).
- Drop-in FFT implementation with C, C++, and Fortran interfaces
- Supports 1D/2D/3D, real/complex, in-place/out-of-place
- Optimized for Intel vector units (SSE, AVX, AVX-512)
- Multi-threaded via Intel’s threading runtime
Usage pattern (conceptual):
- Create a descriptor describing your transform (size, domain, precision).
- Commit the descriptor (prepares internal structures).
- Call a compute function to perform the transform.
- Free the descriptor.
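As a rough C sketch of this descriptor flow using oneMKL's DFTI interface (a 1D in-place double-precision transform; checking of the returned status values is omitted):

#include "mkl.h"

/* Sketch: 1D in-place complex-to-complex FFT via oneMKL's DFTI descriptors.
   Production code should check the MKL_LONG status returned by each call. */
int main(void) {
    const MKL_LONG N = 1024;
    MKL_Complex16 *x = (MKL_Complex16 *)mkl_malloc(N * sizeof(MKL_Complex16), 64);

    DFTI_DESCRIPTOR_HANDLE desc = NULL;
    DftiCreateDescriptor(&desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, N); /* 1. describe */
    DftiCommitDescriptor(desc);                                   /* 2. commit   */

    /* ... fill x[...] with input data ... */
    DftiComputeForward(desc, x);                                  /* 3. compute  */

    DftiFreeDescriptor(&desc);                                    /* 4. free     */
    mkl_free(x);
    return 0;
}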
On many clusters, loading the Intel compiler or oneAPI module gives you access to MKL’s FFT routines as part of the larger numerical library stack.
NVIDIA cuFFT
For GPU-based FFTs on NVIDIA hardware, the standard choice is cuFFT.
Key points:
- Library for performing FFTs on CUDA-capable GPUs
- Supports batched transforms (many transforms at once), which is often essential for high performance
- Interfaces from C/C++ and Fortran (via wrappers)
- Integrates with CUDA streams for overlapping computation and data transfer
Basic flow:
- Allocate arrays on the GPU (cudaMalloc)
- Create a cuFFT plan with cufftPlan* functions
- Execute forward or inverse FFT with cufftExec*
- Destroy the plan, free device memory
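A minimal host-side sketch of this flow for a single-precision 1D transform (the array size is illustrative; checking of cudaError_t / cufftResult return values is omitted):

#include <cuda_runtime.h>
#include <cufft.h>

/* Sketch: 1D complex-to-complex FFT on the GPU with cuFFT. */
int main(void) {
    const int N = 1 << 20;
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N);

    /* ... copy or generate input on the device (e.g. cudaMemcpy) ... */

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);               /* batch = 1 */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* in-place forward FFT */
    cudaDeviceSynchronize();                           /* execution is asynchronous */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}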
Performance considerations:
- Minimize host–device data transfers; keep data on the GPU if possible
- Use batched transforms for many small FFTs (see the cufftPlanMany sketch below)
- Match GPU precision support and performance to your needs (float vs double)
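For the batched case, one plan can describe many equally sized transforms via cufftPlanMany; a rough sketch (the helper name make_batched_plan is only illustrative):

#include <cufft.h>

/* Sketch: one plan for 'batch' contiguous 1D complex transforms of length n.
   NULL embed pointers plus stride 1 and distance n describe a simple
   packed, contiguous batch layout. */
cufftHandle make_batched_plan(int n, int batch) {
    cufftHandle plan;
    int sizes[1] = { n };
    cufftPlanMany(&plan, 1, sizes,
                  NULL, 1, n,   /* input layout:  stride 1, distance n between FFTs */
                  NULL, 1, n,   /* output layout: stride 1, distance n between FFTs */
                  CUFFT_C2C, batch);
    return plan;                /* execute with cufftExecC2C as usual */
}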
Other Vendor Libraries
- AMD: ROCm ecosystem includes rocFFT for AMD GPUs.
- IBM and other CPU vendors often provide FFTs as part of their own BLAS/LAPACK-like offerings or math libraries (e.g., ESSL).
On large systems, vendor FFT libraries are often the fastest option for that specific hardware and are integrated into system-wide software stacks.
FFT Libraries in Scientific Software Stacks
Many higher-level frameworks and languages expose FFTs through their own APIs but rely under the hood on optimized libraries:
- Python/NumPy/SciPy:
- numpy.fft / scipy.fft are frontends; they may use FFTW, MKL, pocketfft, or other backends depending on how NumPy was built.
- With Intel’s distributions, FFT operations are often backed by MKL FFT.
- MATLAB:
- fft uses highly tuned libraries (often MKL or vendor libraries on HPC systems).
- FFTW wrappers:
- Many languages (Fortran, C++, Rust, etc.) provide bindings to FFTW for convenience.
- Domain-specific codes:
- Plane-wave DFT codes, spectral CFD codes, and some particle-in-cell codes use FFT libraries as core building blocks.
In practice, you often use FFTs through such frameworks, but understanding the underlying libraries helps interpret performance and scaling behavior.
Parallel and Distributed FFTs
For large-scale simulations, FFTs must work across multiple cores and multiple nodes.
Shared-Memory Parallel FFTs
Many libraries support multi-threaded FFTs on a single node:
- FFTW with threads or OpenMP build
- MKL FFT with internal threading
- GPU libraries (cuFFT, rocFFT) within a single GPU or multi-GPU setup (depending on features)
Performance factors:
- Thread scaling depends strongly on transform size and memory bandwidth
- Cache behavior is critical; too many threads can hurt performance for small FFT sizes
- Some libraries allow users to control thread counts via environment variables or API calls
Distributed (MPI) FFTs
Distributed FFTs decompose large multidimensional arrays across nodes using MPI.
Main approaches:
- Slab decomposition:
- Split the domain along one dimension.
- Simpler, but the number of usable MPI ranks is limited by the array extent along the split dimension, so it may not scale when you need many ranks.
- Pencil decomposition:
- Split along two dimensions.
- Better scalability to thousands of ranks, at the cost of more complex communication.
Libraries and frameworks:
- FFTW MPI interface:
- Provides MPI-enabled transforms using its own parallel routines (see the sketch after this list).
- P3DFFT, 2DECOMP&FFT, and similar packages:
- High-level libraries built on top of MPI and sometimes FFTW/MKL.
- Specialize in scalable 3D FFTs with pencil decomposition, widely used in spectral CFD and turbulence simulations.
- Vendor and system-specific:
- Some large supercomputers ship with highly optimized distributed FFTs in their math libraries or scientific software stacks.
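As a rough illustration of the FFTW MPI interface mentioned above, a slab-decomposed 3D complex transform (the grid size is illustrative; include fftw3-mpi.h and link the MPI-enabled FFTW library):

#include <fftw3-mpi.h>
#include <mpi.h>

/* Sketch: distributed 3D complex-to-complex FFT with FFTW's MPI interface.
   FFTW splits the first dimension (slab decomposition) across ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    const ptrdiff_t N0 = 256, N1 = 256, N2 = 256;
    ptrdiff_t local_n0, local_0_start;

    /* How much of the array lives on this rank, and where its slab starts */
    ptrdiff_t alloc_local = fftw_mpi_local_size_3d(N0, N1, N2, MPI_COMM_WORLD,
                                                   &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc_local);

    fftw_plan plan = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data,
                                          MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill the local slab data[0 .. local_n0*N1*N2 - 1] ... */
    fftw_execute(plan);  /* involves global all-to-all communication */

    fftw_destroy_plan(plan);
    fftw_free(data);
    MPI_Finalize();
    return 0;
}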
Key HPC issues with distributed FFTs:
- Global all-to-all communication patterns, which stress interconnects
- Strong scaling limits from communication overhead
- Sensitivity to process placement and network topology
In practice, parallel FFT performance can dominate the runtime of entire applications, so choice of library and decomposition strategy is critical.
Accuracy, Precision, and Transform Variants
Precision Choices
Most FFT libraries support:
- Single precision (float, cufftComplex, etc.)
- Double precision (double, cufftDoubleComplex, etc.)
Some also support:
- Extended precision (long double) on CPUs
- Lower precision (e.g., half) primarily on GPUs for specialized applications
Trade-offs:
- Single precision: faster and uses less memory, but lower accuracy
- Double precision: more accurate but slower; on some GPUs, much slower
Select precision based on numerical requirements of your application and the performance characteristics of your hardware.
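In FFTW, for example, precision is selected through the function and type prefix; a sketch of the single-precision variant of the earlier 1D example (link with -lfftw3f):

#include <fftw3.h>

/* Sketch: the single-precision FFTW API uses the fftwf_ prefix and float data;
   the long-double variant uses fftwl_ and -lfftw3l. */
int main(void) {
    const int N = 1024;
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * N);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * N);

    fftwf_plan plan = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) { in[i][0] = 1.0f; in[i][1] = 0.0f; }
    fftwf_execute(plan);

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}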
Normalization and Conventions
Different libraries may apply different scaling factors:
- Some scale the forward transform, some the inverse, some neither
- The discrete Fourier transform can be defined with or without $1/N$ factors; libraries often choose a convention and document it
When mixing libraries or comparing results across codes, you must:
- Check how the library defines FORWARD and INVERSE
- Check where (if at all) normalization is applied
- Adjust scaling factors if necessary
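FFTW, for instance, leaves both directions unnormalized, so a forward transform followed by a backward transform returns the data scaled by N; a sketch of the manual rescaling (the helper name is only illustrative):

#include <fftw3.h>

/* Sketch: FFTW's forward + backward transforms return N * original,
   so rescale by 1/N to round-trip the data. */
void forward_then_inverse(int N, fftw_complex *data) {
    fftw_plan fwd = fftw_plan_dft_1d(N, data, data, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(N, data, data, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);
    fftw_execute(bwd);

    for (int i = 0; i < N; i++) {      /* undo the factor of N */
        data[i][0] /= N;
        data[i][1] /= N;
    }

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}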
Many libraries also provide related transforms:
- Real-input FFTs to exploit symmetry (see the sketch after this list)
- Sine and cosine transforms (DST/DCT)
- Multi-dimensional transforms with arbitrary dimensions
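As an example of exploiting real-input symmetry, FFTW's real-to-complex interface stores only the N/2 + 1 non-redundant complex outputs; a minimal sketch:

#include <fftw3.h>

/* Sketch: real-to-complex 1D FFT. Hermitian symmetry of a real signal's
   spectrum means only N/2 + 1 complex outputs need to be stored. */
int main(void) {
    const int N = 1024;
    double       *in  = fftw_malloc(sizeof(double) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (N / 2 + 1));

    fftw_plan plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) in[i] = (double)i;
    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}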
Practical Considerations on HPC Systems
Accessing FFT Libraries via Modules
On clusters, FFT libraries are typically exposed via environment modules. Common patterns:
- module avail fftw
- module load fftw/3.3.10-gcc
- module load intel-oneapi-mkl
- module load cuda (for cuFFT)
- module load rocm (for rocFFT)
Loading these modules:
- Sets compiler and linker flags (e.g., -lfftw3, -lmkl_rt, CUDA libraries)
- Adjusts include paths and library paths
You typically query documentation or module help to discover:
- Necessary compile/link flags
- Threading and MPI options
- Example build commands
Linking and Integration with Your Code
Key aspects when integrating an FFT library into your own code:
- Language bindings: choose the API (C, Fortran, C++) supported by your language and compiler setup.
- Threading compatibility:
- Avoid oversubscribing cores by matching library thread settings with OpenMP/MPI settings.
- Use environment variables (e.g., OMP_NUM_THREADS, MKL_NUM_THREADS) and library calls to control thread counts.
- MPI integration:
- Decide whether you will use an MPI-enabled FFT library (e.g., FFTW MPI) or manage the distribution and communication yourself.
- GPU usage:
- Ensure proper data layout on the GPU.
- Use batched APIs for performance.
When and How to Choose an FFT Library
In practice, your choice often follows these patterns:
- CPU-only, general-purpose:
- Use FFTW if portability and open source are priorities.
- Use vendor libraries (MKL, ESSL, etc.) when they are available and tuned for your hardware.
- GPU-heavy workloads:
- Use cuFFT on NVIDIA GPUs, rocFFT on AMD GPUs.
- Consider higher-level frameworks (e.g., Kokkos, heFFTe, vendor-provided distributed FFTs) if you need portability across devices.
- Large-scale distributed simulations:
- Consider specialized distributed FFT frameworks (P3DFFT, 2DECOMP&FFT, FFTW MPI interface) that already implement scalable decompositions.
- High-level languages or frameworks:
- Prefer to use the FFT interface they provide and ensure that the backend is configured to use a high-performance library on your system.
Benchmarking is essential: the “fastest” library is often problem- and system-dependent. Many HPC centers provide benchmark results, example build scripts, or recommendations for which FFT libraries to use on their systems.