What BLAS Is (In Practice)
BLAS (Basic Linear Algebra Subprograms) is a standardized collection of low-level routines for common linear algebra operations, especially on vectors and matrices.
At the HPC user level, BLAS is:
- A standard API: many libraries implement the same function names and behavior.
- The performance-critical kernel underneath higher-level packages (LAPACK, ScaLAPACK, many scientific codes).
- Something you almost never write yourself, but often link against or call from C, C++, Fortran, Python, etc.
BLAS is all about well-optimized building blocks (e.g., dot products, matrix–vector multiplies, matrix–matrix multiplies) that can be highly tuned for each architecture.
BLAS Levels: Level-1, Level-2, Level-3
The BLAS interface is traditionally divided into three “levels,” mostly by the shape and computational intensity of operations.
Level-1 BLAS: Vector–Vector
Level-1 routines operate on vectors and do $O(n)$ work.
Typical operations:
- Vector scaling: `scal`, $x \leftarrow \alpha x$
- Vector addition: `axpy`, $y \leftarrow \alpha x + y$
- Dot product: `dot` or `dotc`, $\alpha \leftarrow x^T y$ (or conjugate transpose for complex)
- Norms: `nrm2` (2-norm), `asum` (1-norm)
- Index of max-absolute-value element: `amax`
Examples of routine names:
- `daxpy` – double-precision `axpy`
- `sscal` – single-precision `scal`
- `cdotc` – complex single-precision dot product (conjugated)
- `idamax` – index of max-abs element in a double vector
Key properties:
- Memory-bound: performance limited by memory bandwidth more than flops.
- Useful as building blocks and for simple vector operations, but less scalable in terms of floating-point throughput.
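As a concrete illustration, here is a minimal Level-1 sketch in C, assuming a CBLAS interface (`cblas.h`, as provided by OpenBLAS or the Netlib reference CBLAS); the variable names are illustrative only:

```c
#include <stdio.h>
#include <cblas.h>   /* CBLAS header, e.g. provided by OpenBLAS */

int main(void) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {1.0, 1.0, 1.0, 1.0};
    double alpha = 2.0;

    /* daxpy: y <- alpha*x + y (double precision, unit stride) */
    cblas_daxpy(4, alpha, x, 1, y, 1);

    /* ddot: dot product x^T y */
    double d = cblas_ddot(4, x, 1, y, 1);

    /* dnrm2: Euclidean (2-)norm of x */
    double nx = cblas_dnrm2(4, x, 1);

    printf("y = [%g %g %g %g], x.y = %g, ||x|| = %g\n",
           y[0], y[1], y[2], y[3], d, nx);
    return 0;
}
```

With OpenBLAS installed, something like `gcc level1.c -lopenblas` typically builds this (the file name is hypothetical); the exact link line is system-dependent, as discussed in the linking section below.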
Level-2 BLAS: Matrix–Vector
Level-2 routines operate on matrix–vector combinations and do $O(n^2)$ work.
Typical operations:
- General matrix–vector multiply: `gemv`, $y \leftarrow \alpha A x + \beta y$
- Symmetric/Hermitian matrix–vector multiply: `symv`, `hemv`
- Triangular solve with a vector right-hand side: `trsv`, $x \leftarrow A^{-1} x$ for triangular $A$
- Rank-1 update: `ger`, $A \leftarrow A + \alpha x y^T$
Examples of routine names:
- `sgemv`, `dgemv` – single/double-precision general matrix–vector multiply
- `dtrsv` – double-precision triangular solve with a vector
- `zhemv` – complex double-precision Hermitian matrix–vector multiply
Key properties:
- Higher arithmetic intensity than Level-1, but still usually memory-bandwidth sensitive.
- BLAS implementations may still apply some cache blocking and other optimizations, but Level-2 routines benefit less dramatically from cache reuse than Level-3, since each matrix element is read only once per call.
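A minimal `dgemv` sketch through the same CBLAS interface (column-major storage assumed, so the 2×3 matrix below is laid out column by column):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* 2x3 matrix A in column-major order: columns are (1,4), (2,5), (3,6) */
    double A[6] = {1.0, 4.0,   2.0, 5.0,   3.0, 6.0};
    double x[3] = {1.0, 1.0, 1.0};
    double y[2] = {0.0, 0.0};

    /* dgemv: y <- alpha*A*x + beta*y with alpha = 1, beta = 0 */
    cblas_dgemv(CblasColMajor, CblasNoTrans,
                2, 3,          /* m, n          */
                1.0, A, 2,     /* alpha, A, lda */
                x, 1,          /* x, incx       */
                0.0, y, 1);    /* beta, y, incy */

    printf("y = [%g, %g]\n", y[0], y[1]);  /* expect [6, 15] */
    return 0;
}
```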
Level-3 BLAS: Matrix–Matrix
Level-3 routines work with matrices and do $O(n^3)$ work; these are the main performance workhorses.
Typical operations:
- General matrix–matrix multiply: `gemm`, $C \leftarrow \alpha A B + \beta C$
- Symmetric/Hermitian matrix–matrix operations: `symm`, `hemm`
- Triangular matrix–matrix multiply: `trmm`
- Triangular solve with multiple right-hand sides: `trsm`, $B \leftarrow A^{-1} B$ for triangular $A$
- Rank-k updates: `syrk`, `herk`, $C \leftarrow \alpha A A^T + \beta C$ (or the Hermitian transpose)
Examples of routine names:
- `dgemm` – double-precision general matrix–matrix multiply
- `ztrsm` – complex double-precision triangular solve with a matrix right-hand side
- `ssyrk` – single-precision symmetric rank-k update
Key properties:
- Compute-bound: much higher arithmetic intensity, easier to approach peak flops.
- Critical for HPC: many high-level algorithms are rearranged to use `gemm` and friends due to their performance potential.
- Heavily optimized with tiling, blocking, vectorization, and often multi-threading and GPU offload.
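As a sketch, the corresponding `dgemm` call via CBLAS (small 2×2 matrices, column-major; the expected result is noted in the comments):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C <- alpha*A*B + beta*C with 2x2 A, B, C, all column-major */
    double A[4] = {1.0, 3.0,  2.0, 4.0};   /* [[1 2],[3 4]] */
    double B[4] = {5.0, 7.0,  6.0, 8.0};   /* [[5 6],[7 8]] */
    double C[4] = {0.0, 0.0,  0.0, 0.0};

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* m, n, k       */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb        */
                0.0, C, 2);     /* beta, C, ldc  */

    /* expect C = [[19 22],[43 50]], stored column-major as 19 43 22 50 */
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```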
Naming Conventions and Data Types
BLAS routine names encode both:
- Type/precision: 1st letter
- Operation: remaining letters
Common type prefixes:
- `s` – single-precision real (float)
- `d` – double-precision real (double)
- `c` – single-precision complex
- `z` – double-precision complex
- `i` – integer (for index-returning routines like `idamax`)
Common suffixes:
- `gemm` – general matrix–matrix multiply
- `gemv` – general matrix–vector multiply
- `axpy` – $y \leftarrow \alpha x + y$
- `scal` – scale a vector
- `dot` / `dotc` / `dotu` – dot product variants
- `trsv`, `trsm`, `trmm` – triangular solve/multiply
- `syrk` / `herk` – symmetric/Hermitian rank-k update
- `ger` – general rank-1 update
So for example:
- `dgemm` = double-precision general matrix–matrix multiply
- `caxpy` = complex single-precision $y \leftarrow \alpha x + y$
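In the CBLAS interface, the precision prefix maps directly onto the C types used in the call. A small illustrative sketch (not tied to any particular implementation), showing the same `axpy` operation in single and double precision:

```c
#include <cblas.h>

int main(void) {
    float  xs[3] = {1.0f, 2.0f, 3.0f}, ys[3] = {0.0f, 0.0f, 0.0f};
    double xd[3] = {1.0,  2.0,  3.0 }, yd[3] = {0.0,  0.0,  0.0 };

    cblas_saxpy(3, 2.0f, xs, 1, ys, 1);  /* s prefix: float arguments  */
    cblas_daxpy(3, 2.0,  xd, 1, yd, 1);  /* d prefix: double arguments */
    return 0;
}
```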
Storage Conventions: Column-Major and Leading Dimension
BLAS was historically defined for Fortran, which uses column-major order. Most BLAS implementations preserve this convention even for C interfaces.
Key concepts:
- Matrices are assumed to be stored column by column.
- A matrix $A$ with size $m \times n$ is often described by:
  - `A` – pointer to the first element
  - `LDA` – leading dimension of the array (stride between columns in memory); `LDA` is typically at least `max(1, m)`, often equals `m`, but can be larger to handle padding or submatrices.
In C/C++ code, you can:
- Arrange your data in column-major to match BLAS directly, or
- Use transposed arguments or wrapper layers that hide the layout differences.
Most BLAS routines have arguments like:
- `transA` – `'N'` (no transpose), `'T'` (transpose), or `'C'` (conjugate transpose)
- `m`, `n`, `k` – dimensions
- `lda`, `ldb`, `ldc` – leading dimensions
- `alpha`, `beta` – scalar coefficients
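To make the leading-dimension idea concrete, here is a sketch (CBLAS assumed, names illustrative) that applies `dgemv` to a 3×2 submatrix of a larger 5×4 column-major array by passing a pointer to the submatrix's first element and keeping `lda` equal to the full array's row count:

```c
#include <stdio.h>
#include <cblas.h>

#define LDA 5   /* rows allocated for the full array (leading dimension) */

int main(void) {
    /* Full 5x4 array, column-major: element (i,j) lives at A[i + j*LDA] */
    double A[LDA * 4];
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 5; ++i)
            A[i + j * LDA] = 10.0 * i + j;   /* A(i,j) = 10i + j */

    /* View the 3x2 submatrix starting at (i=1, j=1) without copying:
       pass a pointer to its first element and keep lda = LDA = 5. */
    double *Asub = &A[1 + 1 * LDA];

    double x[2] = {1.0, 1.0};
    double y[3] = {0.0, 0.0, 0.0};

    /* y <- Asub * x, where Asub is 3x2 with leading dimension 5 */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 2,
                1.0, Asub, LDA, x, 1, 0.0, y, 1);

    /* Row i of Asub is (A(1+i,1), A(1+i,2)), so expect y = [23 43 63] */
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);
    return 0;
}
```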
BLAS Implementations in HPC
BLAS is a specification, not a single library. Many optimized implementations exist:
Common vendor and open-source libraries:
- OpenBLAS – open-source, CPU-optimized (x86, ARM, etc.).
- BLIS – modular framework for BLAS-like operations.
- ATLAS (older, largely superseded).
- Intel oneMKL – tuned for Intel CPUs/GPUs.
- AMD AOCL – tuned for AMD CPUs.
- NVIDIA cuBLAS – BLAS interface for NVIDIA GPUs.
- Cray LibSci, IBM ESSL, etc. – vendor-tuned libraries for specific HPC systems.
On many HPC clusters:
- BLAS is provided through modules (e.g., `module load intel-oneapi-mkl` or `module load openblas`).
- Higher-level libraries like LAPACK/ScaLAPACK are compiled against one of these BLAS implementations.
- Switching modules often changes the underlying BLAS implementation used by your application (sometimes without recompiling).
Key points:
- The API (function names/signatures) is mostly identical; performance and threading behavior differ.
- For maximum performance, use the vendor-tuned BLAS of the system you run on, not a generic reference implementation.
Linking and Using BLAS in HPC Applications
In HPC, even if you don’t call BLAS directly, you often link to it via other packages. For basic usage, you should understand:
Linking from C/C++ or Fortran
Typical link flags (cluster-specific):
- `-lblas` – generic BLAS (may be reference or optimized)
- `-lopenblas` – OpenBLAS
- `-lmkl_rt` or MKL link advisor flags – Intel oneMKL
- Vendor-provided wrapper scripts or module variables, such as:
  - `$(MKL_LIBS)`, `$(BLAS_LIBS)` set by environment modules
  - `pkg-config` or CMake `FindBLAS` to discover correct link lines
Example (generic, not cluster-specific):
```
gcc mycode.c -lblas -llapack -o myprog
```

On real HPC systems, check the documentation or module help; link lines can be more complex for MKL or other vendor libraries.
Threading Considerations
Most modern BLAS implementations are multi-threaded:
- They internally use OpenMP or similar to parallelize large operations (especially Level-3).
- You control the number of threads with environment variables, e.g.:
  `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, etc.
In HPC:
- Be careful when combining multi-threaded BLAS with MPI and/or OpenMP in your own code. You may oversubscribe cores if each MPI process spawns many BLAS threads.
- Common practice is to:
- Set BLAS to single-threaded in hybrid MPI+OpenMP codes, and use OpenMP yourself, or
- Keep your own code MPI-only and let BLAS handle threading within each rank (often one rank per node or socket).
This choice is application-dependent and often determined by profiling.
BLAS in the Linear Algebra Stack
Within the numerical libraries ecosystem, BLAS is the foundation:
- LAPACK is largely built on BLAS, especially Level-3 operations like `gemm`, `trsm`, `syrk`, etc.
- Parallel libraries like ScaLAPACK or distributed solver frameworks use local BLAS calls for node-level performance.
- Many domain-specific libraries (e.g., for machine learning, quantum chemistry, CFD) are organized to express computations in terms of BLAS operations wherever possible, in particular `gemm`.
Why this matters:
- If your algorithm can be rephrased to use Level-3 BLAS (especially `gemm`), it will typically run much faster than an explicit triple loop in plain C/Fortran.
- Reusing BLAS gives you performance portability: tuned code for new CPUs/GPUs becomes available simply by linking against an updated library.
When and How to Use BLAS Directly
As an HPC beginner you don’t need to memorize all routines, but you should recognize common use-cases:
- You implement a linear algebra routine yourself (e.g., custom regression, basic iterative solver) and have loops like:
  - Vector updates `y = a*x + y` → use `axpy`.
  - Dot products → use `dot`.
  - Dense matrix–matrix multiplies → use `gemm`.
- You migrate from a naïve implementation to a BLAS-based one to gain performance.
- You are debugging or profiling and see most time spent in routines like `dgemm`, `ztrsm`, etc. That's BLAS doing its work.
Typical practical steps:
- Identify linear algebra kernels in your code.
- Replace hand-written loops with appropriate BLAS calls (see the sketch after this list).
- Link against the optimized BLAS library available on your cluster.
- Profile to ensure that BLAS calls dominate runtime (a good sign in many dense linear algebra algorithms).
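As a sketch of step 2, here is a naïve column-major triple loop next to the single `dgemm` call that replaces it (the helper names `matmul_naive` and `matmul_blas` are illustrative, not from any library):

```c
#include <stdio.h>
#include <cblas.h>

/* Naive version: C <- A*B for column-major m x k A, k x n B, m x n C */
static void matmul_naive(int m, int n, int k,
                         const double *A, const double *B, double *C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * m] * B[p + j * k];
            C[i + j * m] = sum;
        }
}

/* BLAS version: one dgemm call replaces the triple loop */
static void matmul_blas(int m, int n, int k,
                        const double *A, const double *B, double *C) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, m, B, k, 0.0, C, m);
}

int main(void) {
    double A[4] = {1, 3, 2, 4}, B[4] = {5, 7, 6, 8};
    double C1[4], C2[4];
    matmul_naive(2, 2, 2, A, B, C1);
    matmul_blas (2, 2, 2, A, B, C2);
    printf("naive C[0] = %g, blas C[0] = %g\n", C1[0], C2[0]);  /* both 19 */
    return 0;
}
```

The two functions compute the same result; the BLAS version is the one that benefits from the tuned, blocked, vectorized (and possibly threaded) implementation provided by the library you link against.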
BLAS in High-Level Languages (User Perspective)
Many high-level environments already use BLAS internally:
- Python/NumPy/SciPy:
  - `numpy.dot`, the `@` operator, and many linear algebra operations often map to BLAS/LAPACK.
  - Performance depends on which BLAS NumPy was compiled against (MKL, OpenBLAS, etc.).
- R, MATLAB, Julia:
- Core linear algebra routines typically call optimized BLAS under the hood.
- In HPC clusters, modules or container images often provide versions of these tools already linked to optimized BLAS.
Even if you never call `dgemm` yourself, understanding BLAS helps you:
- Interpret performance results (e.g., seeing `dgemm` in a profiler).
- Choose the right modules or containers.
- Reason about threading and performance tuning for linear algebra workloads.