Overview
High performance computing depends heavily on high quality numerical libraries and carefully assembled software stacks. For most practical workloads, performance comes not from hand written loops, but from calling well optimized, often vendor tuned, building blocks. Understanding what these libraries provide, how they fit together, and how they are delivered on HPC systems is essential for productive work on clusters.
This chapter focuses on the role of numerical libraries and software stacks in HPC, how they are organized on clusters, and what you need to know to choose and use them effectively without becoming a library developer yourself.
Why Numerical Libraries Matter
Many scientific and engineering problems reduce to a small number of core numerical kernels. Examples include solving linear systems, computing eigenvalues, performing Fourier transforms, and optimizing nonlinear functions. These kernels are difficult to implement efficiently, especially on modern heterogeneous systems with deep memory hierarchies, vector units, and complex interconnects.
Instead of coding these algorithms from scratch, HPC applications typically rely on numerical libraries that offer:
- Carefully designed algorithms that are numerically stable and robust.
- Implementations optimized for specific CPUs, GPUs, and interconnects.
- Standardized interfaces so that code written decades ago still works with modern implementations.
The central idea is to separate "what" you want to compute from "how" it is executed. Your code calls a function, for example dgemm for matrix multiplication, and a tuned implementation performs the operation efficiently on the hardware you are using.
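This separation of "what" from "how" is visible even from a high level language. In the sketch below, the code states only that a matrix product is wanted; for float64 arrays, NumPy dispatches the `@` operator to the dgemm routine of whatever BLAS implementation it was built against.

```python
import numpy as np

# The code states *what* to compute (a matrix product); *how* it runs
# is decided by the BLAS library NumPy was linked against. For float64
# arrays, the @ operator is effectively a single dgemm call.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
C = A @ B
print(C)
# [[19. 22.]
#  [43. 50.]]
```

The same two lines of user code run unchanged whether the underlying BLAS is the reference implementation or a vendor tuned one.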
In HPC, you should almost never implement basic numerical kernels yourself. Always search for an appropriate library first.
Building Blocks of an HPC Software Stack
On an HPC cluster, numerical libraries are not isolated pieces that you download ad hoc. They are integrated into a software stack that typically includes:
- One or more compiler suites.
- Math and communication libraries that match each compiler and hardware platform.
- High level scientific libraries and frameworks that build on these math libraries.
- Environment management tools, often module systems, that ensure compatible combinations.
A "stack" is essentially a compatible set of compilers, libraries, and tools tested to work together. When you load a specific module, for example a compiler or MPI module, you are often implicitly selecting a whole chain of dependent libraries.
From a user perspective, the most important aspects are:
- You rarely link directly against all low level libraries yourself. Instead, you pick a compiler, an MPI implementation if you use distributed memory, and often a math library module. Higher level packages and your application then build on that.
- Different stacks may coexist to support different hardware generations or different programming models. For example, one stack may be optimized for CPU only nodes, and another for GPU accelerated nodes with different math libraries and compilers.
Categories of Numerical Libraries
Although later sections in this part of the course discuss specific families such as BLAS, LAPACK, and FFT libraries, it is useful here to see how they fit conceptually into broader categories. Typical categories include:
- Low level dense linear algebra libraries, which provide basic operations on vectors and matrices.
- Sparse linear algebra libraries, which handle matrices with many zeros and special storage formats.
- Fast Fourier Transform libraries, specialized for transforms of various dimensions and data types.
- Spectral and eigenvalue solvers, often built on top of basic linear algebra routines.
- Optimization libraries, for linear, nonlinear, and mixed integer problems.
- Specialized packages for domains such as computational fluid dynamics, quantum chemistry, or finite element methods, which themselves wrap and orchestrate lower level numerical routines.
What links these categories in HPC is composability. Basic BLAS routines may be used inside LAPACK, which may be used inside PETSc or Trilinos, which may then be used inside a larger application code.
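This layering is easy to observe from a high level language. In the sketch below, a single `numpy.linalg.solve` call goes to a LAPACK driver routine (gesv for float64 input), which in turn performs its LU factorization using BLAS kernels underneath.

```python
import numpy as np

# One high level call instead of hand written Gaussian elimination.
# For float64 input, NumPy forwards this to LAPACK's dgesv, which
# itself builds on BLAS routines for its inner computations.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))  # True
```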
When evaluating a numerical library, always ask two questions:
- Does it already build on standard, optimized building blocks such as BLAS or FFT libraries?
- Does it integrate cleanly with the rest of your software stack?
Vendor and Architecture Specific Implementations
A key feature of HPC numerical libraries is that there are both reference implementations and vendor tuned implementations. The reference implementation defines the interface and correct behavior. Vendor libraries take the same interface and provide hardware specific optimizations.
For example, a reference BLAS might define the function daxpy with the mathematical operation
$$
y \leftarrow \alpha x + y
$$
for vectors $x$ and $y$. A vendor tuned BLAS will implement the same operation but use vector instructions, cache blocking, and other techniques to achieve high performance on a specific CPU.
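A reference version of this operation is only a few lines. The sketch below implements the same $y \leftarrow \alpha x + y$ semantics in plain NumPy; a vendor daxpy computes exactly this, just faster.

```python
import numpy as np

def daxpy_ref(alpha, x, y):
    """Reference daxpy: y <- alpha * x + y, updating y in place.

    A tuned BLAS implements the same interface and semantics, but uses
    vector instructions and careful memory access to run faster.
    """
    y += alpha * x
    return y

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 10.0, 10.0])
daxpy_ref(2.0, x, y)
print(y)  # [12. 14. 16.]
```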
At a higher level, the same pattern appears for distributed memory libraries. A reference interface such as ScaLAPACK might have an implementation that works on any MPI layer, while vendors produce versions tuned for their interconnects and node architectures.
As a user, the important points are:
- Interfaces are stable. You usually do not change your source code when you move from a reference library to a vendor tuned one.
- Performance is not portable by default. The same source code may run much faster when linked against a vendor specific library tuned for your particular HPC system.
On many clusters, these vendor libraries are wrapped behind module names such as intel-mkl, cray-libsci, or amd-blis. Loading the right module ensures that your code uses the best available implementation.
Linking and ABI Considerations in HPC
In order to rely on numerical libraries in compiled languages, your application must link against them. On clusters, there are a few recurring patterns to be aware of.
First, numerical libraries often come in multiple variants that match different ABIs, for example different integer sizes, different data models, or different thread models. You may see separate libraries for 32 bit and 64 bit indexing (often called LP64 and ILP64 variants), for sequential and multithreaded execution, or for different MPI implementations.
Second, libraries may be provided as static archives or as shared objects. Some HPC centers prefer static linking for performance reproducibility and to avoid runtime dependency issues, while others prefer shared libraries for flexibility and reduced disk space usage. This choice affects how you specify libraries to the compiler and how easy it is to move binaries between systems.
Third, many libraries provide helper tools that hide complex link lines. For example, a library may ship pkg-config files or its own *-config helper script that emits the necessary compiler and linker flags. Using these helpers is safer than trying to guess which low level libraries to add manually.
In modern stacks, cross language compatibility is also important. C, C++, and Fortran codes often need to interoperate through common numerical libraries. This requires consistent calling conventions and careful attention to name mangling and data layout. In practice, many libraries define a C interface and provide Fortran bindings that call into it.
On HPC systems, never assume that a library can be linked arbitrarily with any compiler or MPI. Always use the library and compiler combinations provided and documented by the site, and use official helper tools or module supplied wrappers to construct link lines.
Threading, MPI, and Hybrid Awareness in Libraries
Numerical libraries are no longer simple serial routines. Many offer internal parallelism using threads, vectorization, and sometimes distributed memory. When you combine them with your own parallel code, you have to be aware of interaction effects.
Some math libraries provide internal multithreading through OpenMP or proprietary threading layers. The number of threads is usually controlled by environment variables or specific API calls. If you are writing a parallel application that also uses threads, you must coordinate the thread counts to avoid oversubscription of CPU cores.
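Concretely, a common pattern is to set the relevant environment variables before the math library is loaded. The sketch below assumes the widely used variable names `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `OPENBLAS_NUM_THREADS`; which of them your library actually honors depends on the backend, so check its documentation.

```python
import os

# Thread counts are typically read from the environment when the library
# is loaded, so set them *before* importing NumPy or SciPy.
# Backend specific variable names (check which ones your stack honors):
#   OMP_NUM_THREADS      - OpenMP based threading layers
#   MKL_NUM_THREADS      - Intel MKL
#   OPENBLAS_NUM_THREADS - OpenBLAS's own threading layer
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "4"

import numpy as np  # now initializes its BLAS with the settings above
print(os.environ["OMP_NUM_THREADS"])  # 4
```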
Other libraries are "MPI aware" and can distribute data across processes using MPI, for example scalable linear algebra packages that operate on block cyclic distributed matrices. In that case, you have to match the MPI implementation and the process layout used by your job to the expectations of the library.
Hybrid usage is common. For example, you might use MPI across nodes, OpenMP threads within each node, and call a math library that uses vectorization inside each thread. This stack only works efficiently if each layer is configured to use the resources that are actually available.
From a user perspective, the main responsibilities are:
- Read the documentation for how each library uses threads or MPI.
- Configure thread counts and process layouts through environment variables or library routines in a consistent way.
- Coordinate this configuration with your job scheduler resource requests so that the total number of threads and processes matches the allocated cores.
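The bookkeeping behind the last point can be sketched in a few lines: given a node's core count and a chosen number of MPI ranks per node, derive a thread count per rank that avoids oversubscription. The numbers below are purely illustrative.

```python
# A minimal sketch of hybrid layout bookkeeping, not tied to any
# particular scheduler: threads per rank is chosen so that
# ranks_per_node * threads_per_rank does not exceed the core count.
def threads_per_rank(cores_per_node, ranks_per_node):
    if ranks_per_node <= 0 or cores_per_node < ranks_per_node:
        raise ValueError("invalid layout")
    return cores_per_node // ranks_per_node

# e.g. a 128 core node with 8 MPI ranks per node -> 16 threads per rank
print(threads_per_rank(128, 8))  # 16
```

In a job script, the resulting value would typically be exported as the thread count environment variable for the math library and OpenMP runtime.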
Software Distribution Models for HPC Libraries
On personal machines, numerical libraries are often installed through system package managers or language specific tools such as pip or conda. On HPC systems, software distribution is more complex, because:
- Multiple compiler versions coexist.
- CPU and GPU architectures differ between nodes or clusters.
- MPI implementations and their ABIs differ.
- Licensing terms may restrict redistribution.
As a result, HPC centers tend to maintain centrally managed stacks, exposed through module systems. Inside these stacks, numerical libraries are usually:
- Built separately for each supported compiler family and MPI implementation.
- Configured for different CPU microarchitectures, sometimes with automatic dispatch at runtime.
- Optionally built in both single and double precision variants, and sometimes with additional mixed precision support.
Some sites also use meta build and deployment frameworks such as Spack or EasyBuild to manage these variants. From a user standpoint, these tools are typically hidden behind uniform module names.
For user managed environments, such as container based workflows or personal conda environments on shared clusters, numerical libraries may be pulled in as dependencies of higher level packages. In those cases, it is important to ensure that the library stack inside the container or environment matches the system libraries with which it has to interact, or that the container is self contained.
In HPC, do not try to "fix" missing or outdated numerical libraries by installing arbitrary prebuilt binaries in your home directory. Prefer the site provided stack, or use a systematic build tool or container that gives you a consistent environment.
Language Level Access to Numerical Libraries
Different programming languages interact with numerical libraries in different ways. In HPC contexts, a few common patterns appear.
Fortran applications often call numerical libraries directly, since many interfaces originated in Fortran. Many libraries still document their canonical API in Fortran, and compilers provide straightforward linking mechanisms.
C and C++ codes typically use C bindings provided by libraries. These bindings expose functions that are easier to use from C, with types that map naturally to C arrays and pointers. C++ can then wrap these functions in higher level abstractions, such as matrix classes or template based linear algebra frameworks.
Higher level languages used in HPC workflows, such as Python, Julia, or R, usually do not call BLAS or LAPACK directly from user code. Instead, their core numerical arrays and linear algebra modules are already backed by these libraries. For example, a matrix multiplication in NumPy is effectively a call to a BLAS gemm routine. When these languages are run on an HPC cluster, they often benefit transparently from optimized system libraries.
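You can usually inspect which low level implementation such a language runtime was built against. For NumPy, `show_config` reports the BLAS and LAPACK libraries used by the build; the exact output format varies between NumPy versions.

```python
import numpy as np

# Report which BLAS/LAPACK implementation backs this NumPy build
# (e.g. OpenBLAS, MKL, or a reference implementation). The output
# format depends on the NumPy version.
np.show_config()
```

On a cluster, comparing this output across modules is a quick way to confirm that a Python environment is actually using the optimized system libraries.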
When you design an application, it is often best to rely on language native abstractions that already use optimized libraries. Only when performance profiling indicates that a particular library call is critical do you need to think about explicitly controlling which library is used and how it is linked.
Stable APIs and Long Term Portability
One important advantage of established numerical libraries is the stability of their APIs. Many interfaces have remained unchanged for decades. For example, core BLAS and LAPACK function signatures have stayed essentially the same, even while their implementations have been rewritten multiple times to exploit new hardware features.
This stability is valuable in several ways.
First, it supports long term maintenance of scientific codes. Applications written years ago that depend heavily on BLAS routine calls can continue to be compiled and run on modern machines.
Second, it permits independent vendor optimization. Hardware vendors can compete to produce the fastest implementation without forcing application developers to change their code.
Third, it fosters composability across libraries. New higher level packages can build on top of these stable primitives, confident that they will remain available in the future.
However, stability does not mean immutability. Many libraries introduce extended APIs, support for new data types such as mixed precision or complex arithmetic, or new execution models such as GPU acceleration. These extensions are often added alongside the classic interface, not as replacements.
For long lived HPC codes, design your internal abstractions so that they are built on top of stable library APIs, rather than directly on low level hardware features that may change with every new CPU or GPU generation.
Performance Portability Through Library Choices
Performance portability refers to the ability of a single source code to perform reasonably well on different architectures without modification. Numerical libraries are central to achieving this in HPC.
By delegating performance critical operations to libraries, your code adopts whatever optimizations the library provides on each platform. For example, the same sequence of library calls might use AVX vector instructions on one CPU, SVE on another, or offload to a GPU when a GPU enabled variant of the library is available.
Choosing library abstractions that express the mathematical operation, rather than low level loops, increases the chance that such optimizations can be applied automatically. For instance, writing your code in terms of dense matrix multiply operations gives library authors the opportunity to apply sophisticated blocking and scheduling strategies.
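The contrast can be made concrete: the two computations below are mathematically identical, but only the first exposes the whole operation to the library, where blocking, vectorization, and threading can be applied.

```python
import numpy as np

# Same mathematical result, very different optimization opportunities:
# the single matrix multiply hands the whole operation to a tuned gemm,
# while the explicit loops fix the execution order at the source level.
n = 32
A = np.random.rand(n, n)
B = np.random.rand(n, n)

C_lib = A @ B                      # one high level operation -> tuned gemm

C_loop = np.zeros((n, n))          # semantically identical explicit loops
for i in range(n):
    for j in range(n):
        for k in range(n):
            C_loop[i, j] += A[i, k] * B[k, j]

print(np.allclose(C_lib, C_loop))  # True
```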
At the software stack level, performance portability often involves using vendor neutral high level libraries that can select between multiple backends. For example, on one platform they may call a vendor BLAS, while on another they may use an open source implementation.
Practical performance portability also requires that you pay attention to data layout, memory access patterns, and communication patterns, because libraries can only optimize within the scope of the data and operations you expose to them.
Interplay Between Numerical Libraries and Software Environments
On a typical HPC cluster, numerical libraries are not used in isolation but as part of carefully curated software environments. For example, an environment might be defined by:
- A particular compiler version.
- A matching MPI stack.
- A set of numerical libraries that depend on these compilers and MPI.
- A selection of higher level frameworks and domain specific codes compiled against this stack.
These environments are often exposed as modules that bundle a consistent set of versions. Loading one module may automatically load dependencies such as compilers, MPI, and math libraries. This is important because a mismatch in versions can lead to subtle runtime errors, ABI incompatibilities, or degraded performance.
In practice, you may have to choose between different environments based on:
- The architecture of the nodes you target.
- The need for GPU support versus CPU only operation.
- The requirement to match the environment in which third party codes or libraries were built.
When you develop your own applications, it is a good habit to document which environment and which numerical libraries were used for compilation and production runs. This information is key for reproducibility and for future performance investigations.
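A minimal sketch of such a record, assuming a Python based workflow, is shown below. On a module based cluster you would also log the output of `module list` and the compiler and MPI versions used for the build.

```python
import platform
import sys
import numpy as np

# Record the key software versions alongside production results so that
# runs can later be reproduced and performance regressions diagnosed.
env_record = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "platform": platform.platform(),
}
print(env_record)
```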
Whenever you change a major component of your environment, such as the compiler or MPI, you should assume that you need to rebuild your application and its dependent libraries, and you should re validate both numerical correctness and performance.
Summary
Numerical libraries and software stacks are the foundation upon which most HPC applications are built. They encapsulate decades of algorithmic research and architecture specific optimization, and they provide stable interfaces that support both long term code maintenance and performance portability.
For the practical HPC user, the main tasks are to understand which libraries are available on a given system, how they fit into the broader software stack, how to link and configure them correctly, and how to structure code so that it can benefit from ongoing improvements in these libraries without constant rewrites.