Role of Linear Algebra in HPC
Linear algebra operations—vector updates, matrix–vector products, matrix factorizations—sit at the core of most numerical HPC workloads:
- Solving systems of linear equations $Ax = b$
- Eigenvalue problems and singular value decompositions
- Least-squares and regression
- Optimization and machine learning kernels
- Time-stepping PDE solvers (implicit methods, multigrid smoothers, etc.)
Hand-optimizing these operations for every application, architecture, and programming language would be impractical. Linear algebra libraries provide:
- Well-tested implementations of standard operations and algorithms
- Performance-optimized kernels tuned for specific CPUs/GPUs and memory hierarchies
- Standardized APIs that many higher-level frameworks and applications depend on
In practice, you almost never write your own inner linear algebra loops in HPC: you call into these libraries.
Levels of Linear Algebra Libraries
Linear algebra libraries in HPC are commonly organized into levels of abstraction:
- Low-level building blocks (e.g., BLAS)
  - Dense vector and matrix operations
  - Simple, composable kernels (like $y = \alpha x + y$, matrix–vector, matrix–matrix)
- Mid-/high-level solvers and factorization libraries (e.g., LAPACK, ScaLAPACK, PETSc, Trilinos)
  - Provide algorithms for solving linear systems and eigenproblems
  - Build on top of low-level BLAS-like kernels
This chapter focuses on the category of linear algebra libraries as a whole; later subsections will highlight specific families (BLAS, LAPACK, ScaLAPACK) in more detail.
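As a small illustration of how the levels fit together, the sketch below solves a tiny dense system with LAPACK's `dgesv` routine through the LAPACKE C interface; internally, routines like this are built on BLAS kernels. It assumes a LAPACKE/LAPACK and BLAS implementation is available to link against (exact link flags vary by system).

```c
/* Sketch: solve a 3x3 system A x = b with LAPACK's dgesv via the LAPACKE
   C interface. Assumes a LAPACKE/LAPACK + BLAS implementation is linked in
   (e.g. -llapacke -llapack -lblas, or a vendor equivalent). */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double A[3 * 3] = { 4, 1, 0,
                        1, 3, 1,
                        0, 1, 2 };   /* row-major 3x3 matrix */
    double b[3]     = { 1, 2, 3 };   /* right-hand side; overwritten with x */
    lapack_int ipiv[3];              /* pivot indices from the LU factorization */

    /* dgesv: LU-factorize A with partial pivoting, then solve A x = b. */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = [%g, %g, %g]\n", b[0], b[1], b[2]);
    return 0;
}
```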
Key Characteristics of HPC Linear Algebra Libraries
Standardized interfaces
HPC linear algebra libraries tend to share:
- Conventional naming schemes
  - For example, in BLAS/LAPACK-style libraries: `dgemm` → double-precision (`d`) general (`ge`) matrix–matrix multiply (`mm`); `sgemv` → single-precision (`s`) general (`ge`) matrix–vector multiply (`mv`)
- Language bindings
  - Historically Fortran; now also C, C++, Python, and others
  - Many higher-level libraries (NumPy, SciPy, MATLAB, R, Julia, etc.) wrap these routines
Once you know the basic interface ideas, you can move between implementations without changing your code structure much.
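To make the naming concrete, here is a minimal sketch of a `dgemm` call through the CBLAS interface. The wrapper function name `multiply` is purely illustrative; the link flags depend on which BLAS implementation provides CBLAS on your system.

```c
/* Sketch: C = alpha*A*B + beta*C using dgemm through the CBLAS interface.
   Link against any BLAS that provides CBLAS (e.g. OpenBLAS, BLIS, MKL). */
#include <cblas.h>

void multiply(int m, int n, int k,
              const double *A,   /* m x k, row-major */
              const double *B,   /* k x n, row-major */
              double *C)         /* m x n, row-major, overwritten */
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,    /* alpha = 1, lda = k */
                     B, n,    /*            ldb = n */
                0.0, C, n);   /* beta  = 0, ldc = n */
}
```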
Dense vs. sparse
Linear algebra libraries typically specialize in dense or sparse operations:
- Dense linear algebra
  - Matrices are “full” (most entries are nonzero)
  - Operations like full matrix–matrix products, LU/QR factorization
  - BLAS/LAPACK-style libraries
- Sparse linear algebra
  - Matrices are mostly zeros, stored in compressed formats
  - Typical in PDEs, graph problems, large optimization problems
  - Libraries: PETSc, Trilinos, hypre, SuiteSparse, etc. (usually layered above BLAS)
Dense libraries are often the first step in learning HPC linear algebra; sparse libraries add data structure complexity and more algorithmic choices.
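To give a flavor of that data-structure complexity, here is a minimal sketch of the widely used compressed sparse row (CSR) format and a matrix–vector product over it. The struct and function names are illustrative, not tied to any particular library.

```c
/* Sketch of a compressed sparse row (CSR) matrix: only nonzero entries
   are stored, together with their column indices and per-row offsets. */
typedef struct {
    int     nrows;    /* number of rows */
    int     nnz;      /* number of stored (nonzero) entries */
    int    *row_ptr;  /* length nrows+1: start of each row in col_idx/values */
    int    *col_idx;  /* length nnz: column index of each stored entry */
    double *values;   /* length nnz: the stored entries themselves */
} csr_matrix;

/* y = A*x for a CSR matrix: only the stored nonzeros are touched. */
void csr_matvec(const csr_matrix *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += A->values[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}
```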
Precision and data types
Most linear algebra libraries support multiple numeric types:
- Floating-point precisions
  - Single precision: `float`, 32-bit (BLAS prefix `s`)
  - Double precision: `double`, 64-bit (BLAS prefix `d`)
  - Sometimes half or mixed precision (common on GPUs and ML-focused libraries)
- Real vs. complex
  - Real types: `s` and `d` prefixes
  - Complex types: `c` (complex single), `z` (complex double)
Choice of precision affects:
- Speed (single/half often faster)
- Memory and bandwidth usage
- Numerical accuracy and stability
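The prefix convention is easy to see in code. A minimal sketch using the CBLAS interface (assuming a CBLAS-providing BLAS is linked in):

```c
/* Sketch: the same dot product in single and double precision,
   following the BLAS s/d prefix convention. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    float  xs[3] = {1.0f, 2.0f, 3.0f}, ys[3] = {4.0f, 5.0f, 6.0f};
    double xd[3] = {1.0,  2.0,  3.0 }, yd[3] = {4.0,  5.0,  6.0 };

    float  s = cblas_sdot(3, xs, 1, ys, 1);  /* s prefix: 32-bit floats  */
    double d = cblas_ddot(3, xd, 1, yd, 1);  /* d prefix: 64-bit doubles */

    printf("sdot = %f, ddot = %f\n", s, d);  /* both 32, up to rounding */
    return 0;
}
```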
Performance Considerations
Why libraries can outperform hand-written code
Highly optimized linear algebra libraries exploit:
- Vectorization and instruction-level parallelism
- Cache-aware and cache-blocking algorithms
- Multi-threading and NUMA-aware placement
- Architecture-specific instructions (e.g., AVX-512, SVE)
- GPU offloading or accelerator-specific tuning
Vendor implementations (e.g., Intel, AMD, NVIDIA) are often deeply optimized by experts with access to microarchitectural details; open-source implementations (e.g., OpenBLAS, BLIS) provide competitive performance across many platforms.
Dense linear algebra cost models
A rough understanding of operation counts helps you reason about costs:
- Vector operations (Level 1): $\mathcal{O}(n)$ flops
- Matrix–vector (Level 2): $\mathcal{O}(n^2)$ flops
- Matrix–matrix (Level 3): $\mathcal{O}(n^3)$ flops
For large $n$, Level 3 operations dominate and achieve the best performance because they have high computational intensity (flops per byte moved). Many algorithms are reorganized to use Level 3 kernels wherever possible.
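To put rough numbers on that claim (a back-of-the-envelope estimate that counts only compulsory data movement and ignores caching details):

$$
\text{AXPY (Level 1): } \frac{2n \text{ flops}}{\approx 3n \text{ elements moved}} \approx \tfrac{2}{3},
\qquad
\text{GEMM (Level 3): } \frac{2n^3 \text{ flops}}{\approx 3n^2 \text{ elements moved}} \approx \tfrac{2n}{3}.
$$

The Level 3 ratio grows with $n$, so blocked matrix–matrix kernels can keep operands in cache and approach peak floating-point rates, while Level 1 and Level 2 kernels remain limited by memory bandwidth.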
Threading and parallelism
Modern linear algebra libraries may:
- Spawn multiple threads per process to use all cores on a node
- Use SIMD within each thread
- Coordinate with MPI-based distributed memory libraries (like ScaLAPACK; covered separately)
Important practical issues:
- Thread oversubscription: if your MPI job also uses a threaded BLAS, you may end up with many more threads than cores. This is usually controlled via environment variables (e.g., `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`).
- NUMA behavior: large allocations and thread placement can affect performance.
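A minimal sketch of checking what a threaded BLAS will see at run time; the variable names shown are the common implementation-specific ones, and an unset variable means the library default applies.

```c
/* Sketch: print the environment variables that commonly control the
   thread count of a threaded BLAS (OpenMP, MKL, OpenBLAS). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *vars[] = {"OMP_NUM_THREADS", "MKL_NUM_THREADS",
                          "OPENBLAS_NUM_THREADS"};
    for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); ++i) {
        const char *val = getenv(vars[i]);
        printf("%s = %s\n", vars[i], val ? val : "(unset)");
    }
    /* Many implementations also expose a runtime call, e.g. OpenBLAS's
       openblas_set_num_threads(n) or MKL's mkl_set_num_threads(n);
       check your library's documentation before relying on these. */
    return 0;
}
```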
Common Implementation Families
Within the general category “linear algebra libraries,” there are several major families (covered more specifically in later subsections). At a high level:
- Reference vs. optimized implementations
- Reference versions prioritize correctness and portability
- Optimized versions prioritize performance on specific hardware
- Vendor vs. community
- Vendor libraries: tuned by CPU/GPU vendors (Intel, AMD, NVIDIA, etc.)
- Community/open-source: portable, widely used across clusters
Examples include:
- CPU-focused dense libraries
- GPU-focused dense libraries
- Mixed CPU/GPU libraries
- Libraries that expose the same BLAS/LAPACK API but with different backends
Choosing among them often depends on cluster hardware and system-provided software stacks.
Integration in HPC Software Stacks
On HPC systems, linear algebra libraries are typically not used in isolation; they are part of a broader ecosystem.
Dependency of higher-level software
Many scientific packages depend on linear algebra libraries:
- PDE solvers and multi-physics frameworks
- Quantum chemistry and materials codes
- Climate and weather models
- Data analytics and machine learning libraries
These applications:
- Expect particular APIs and versions (e.g., a standard BLAS interface, LAPACK 3.x)
- May require specific precision support
- Often express configuration as: "Use this BLAS/LAPACK implementation."
Understanding which library is actually being used (and with which options) is crucial for performance and reproducibility.
Library selection on clusters
On typical HPC systems you might encounter:
- Multiple implementations installed side-by-side
- Different default libraries for different compiler toolchains
- Environment modules to load a particular combination
Common questions when selecting:
- Does this implementation support multi-threading? How is it controlled?
- Is it optimized for this CPU architecture?
- Does it have any licensing implications?
- Does it work with my compiler and MPI versions?
Static vs. shared linking
Linear algebra libraries can be:
- Shared libraries (`.so`, `.dylib`) loaded at runtime
- Static libraries (`.a`) linked into the executable
Implications:
- Static linking can simplify deployment on some clusters or containers.
- Shared libraries ease updates and reduce binary size.
- Version mismatches (ABI or symbol differences) are a common source of runtime problems.
Practical Usage Patterns
Replace hand-written loops by library calls
Instead of:
```c
/* Hand-written AXPY-style update: y = alpha*x + y */
for (int i = 0; i < n; ++i)
    y[i] = alpha * x[i] + y[i];
```

you would typically call a BLAS-like routine that performs the same operation, as shown below. Benefits include:
- Less custom code to maintain
- Access to optimized implementations on each platform
- Predictable behavior and numerical properties
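A minimal sketch of that replacement, using the CBLAS interface; the wrapper name `scale_add` is illustrative.

```c
/* Sketch: replacing the hand-written loop with BLAS's daxpy
   (y := alpha*x + y) through the CBLAS interface. */
#include <cblas.h>

void scale_add(int n, double alpha, const double *x, double *y)
{
    cblas_daxpy(n, alpha, x, 1, y, 1);  /* strides of 1: contiguous vectors */
}
```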
Composing algorithms out of building blocks
Many higher-level algorithms can be designed as:
- Sequences of Level 3 matrix–matrix operations for performance
- Occasional Level 2 or Level 1 operations for vector updates
When you design algorithms, you often:
- Try to reformulate them to favor operations that libraries perform well
- Minimize data movement between kernels
- Organize data layouts (row-major, column-major, blocking) to match library expectations
Testing and validation
Because linear algebra libraries are so central:
- Bugs or misconfigurations can corrupt many downstream results.
- Small changes in library version or chosen implementation can change performance or (slightly) numeric results.
Common practices:
- Use unit tests for small matrices with known solutions.
- Check residuals like $\|Ax - b\|$ or orthogonality for eigenvectors.
- Compare against a reference or high-precision computation where possible.
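As a sketch of such a residual check in code, using CBLAS routines; the function name `residual_norm`, the row-major layout, and the caller-provided workspace are assumptions of this example.

```c
/* Sketch: compute the residual norm ||A x - b||_2 with CBLAS calls,
   a cheap sanity check after solving A x = b (row-major n x n A). */
#include <string.h>
#include <cblas.h>

double residual_norm(int n, const double *A, const double *x,
                     const double *b, double *work /* length n */)
{
    memcpy(work, b, n * sizeof(double));      /* work = b                  */
    cblas_dgemv(CblasRowMajor, CblasNoTrans,  /* work = 1*A*x + (-1)*work  */
                n, n, 1.0, A, n, x, 1,
                -1.0, work, 1);               /* i.e. work = A*x - b       */
    return cblas_dnrm2(n, work, 1);           /* ||A*x - b||_2             */
}
```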
Portability and Reproducibility Concerns
Using linear algebra libraries in HPC touches on broader topics of software environments and reproducibility:
- Different clusters use different default implementations and versions.
- Performance may vary dramatically between them even for the same API call.
- Certain libraries may use auto-tuning on first use, leading to subtle run-to-run performance differences.
Common strategies:
- Explicitly document which library implementation and version you used.
- Use environment modules or containers to pin library versions.
- Where possible, design your build system to allow easy swapping of BLAS/LAPACK backends.
Summary
Linear algebra libraries are foundational to HPC:
- They provide standard, high-performance building blocks for dense and sparse computations.
- They encapsulate complex optimization and parallelization details behind stable APIs.
- They integrate deeply into HPC software stacks and affect both performance and reproducibility.
Effective HPC practice involves recognizing when and how to rely on these libraries, choosing appropriate implementations on given hardware, and structuring your own code to exploit them efficiently.