What BLAS Is (In Practice)
BLAS (Basic Linear Algebra Subprograms) is a standardized collection of low-level routines for common linear algebra operations, especially on vectors and matrices.
At the HPC user level, BLAS is:
- A standard API: many libraries implement the same function names and behavior.
- The performance-critical kernel underneath higher-level packages (LAPACK, ScaLAPACK, many scientific codes).
- Something you almost never write yourself, but often link against or call from C, C++, Fortran, Python, etc.
BLAS is all about well-optimized building blocks (e.g., dot products, matrix–vector multiplies, matrix–matrix multiplies) that can be highly tuned for each architecture.
BLAS Levels: Level-1, Level-2, Level-3
The BLAS interface is traditionally divided into three “levels,” mostly by the shape and computational intensity of operations.
Level-1 BLAS: Vector–Vector
Level-1 routines operate on vectors and do $O(n)$ work.
Typical operations:
- Vector scaling: `scal`, $x \leftarrow \alpha x$
- Vector addition: `axpy`, $y \leftarrow \alpha x + y$
- Dot product: `dot` or `dotc`, $\alpha \leftarrow x^T y$ (or conjugate transpose for complex)
- Norms: `nrm2` (2-norm), `asum` (1-norm)
- Index of max-absolute-value element: `amax`
Examples of routine names:
- `daxpy` – double-precision `axpy`
- `sscal` – single-precision `scal`
- `cdotc` – complex single-precision dot product (conjugated)
- `idamax` – index of max-abs element in a double vector
Key properties:
- Memory-bound: performance limited by memory bandwidth more than flops.
- Useful as building blocks and for simple vector operations, but less scalable in terms of floating-point throughput.
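As a concrete illustration, here is a minimal Level-1 sketch in C, assuming a CBLAS interface (`cblas.h`, as provided by OpenBLAS or the Netlib reference CBLAS); the variable names are illustrative only:

```c
#include <stdio.h>
#include <cblas.h>   /* CBLAS header, e.g. provided by OpenBLAS */

int main(void) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {1.0, 1.0, 1.0, 1.0};
    double alpha = 2.0;

    /* daxpy: y <- alpha*x + y (double precision, unit stride) */
    cblas_daxpy(4, alpha, x, 1, y, 1);

    /* ddot: dot product x^T y */
    double d = cblas_ddot(4, x, 1, y, 1);

    /* dnrm2: Euclidean (2-)norm of x */
    double nx = cblas_dnrm2(4, x, 1);

    printf("y = [%g %g %g %g], x.y = %g, ||x|| = %g\n",
           y[0], y[1], y[2], y[3], d, nx);
    return 0;
}
```

With OpenBLAS installed, something like `gcc level1.c -lopenblas` typically builds this (the file name is hypothetical); the exact link line is system-dependent, as discussed in the linking section below.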
Level-2 BLAS: Matrix–Vector
Level-2 routines operate on matrix–vector combinations and do $O(n^2)$ work.
Typical operations:
- General matrix–vector multiply: `gemv`, $y \leftarrow \alpha A x + \beta y$
- Symmetric/Hermitian matrix–vector multiply: `symv`, `hemv`
- Triangular solve with a vector right-hand side: `trsv`, $x \leftarrow A^{-1} x$ for triangular $A$
- Rank-1 update: `ger`, $A \leftarrow A + \alpha x y^T$
Examples of routine names:
- `sgemv`, `dgemv` – single/double-precision general matrix–vector multiply
- `dtrsv` – double-precision triangular solve with a vector
- `zhemv` – complex double-precision Hermitian matrix–vector multiply
Key properties:
- Higher arithmetic intensity than Level-1, but still usually memory-bandwidth sensitive.
- BLAS implementations may still apply some cache blocking and other optimizations, but Level-2 routines benefit less dramatically from cache reuse than Level-3, since each matrix element is read only once per call.
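A minimal `dgemv` sketch through the same CBLAS interface (column-major storage assumed, so the 2×3 matrix below is laid out column by column):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* 2x3 matrix A in column-major order: columns are (1,4), (2,5), (3,6) */
    double A[6] = {1.0, 4.0,   2.0, 5.0,   3.0, 6.0};
    double x[3] = {1.0, 1.0, 1.0};
    double y[2] = {0.0, 0.0};

    /* dgemv: y <- alpha*A*x + beta*y with alpha = 1, beta = 0 */
    cblas_dgemv(CblasColMajor, CblasNoTrans,
                2, 3,          /* m, n          */
                1.0, A, 2,     /* alpha, A, lda */
                x, 1,          /* x, incx       */
                0.0, y, 1);    /* beta, y, incy */

    printf("y = [%g, %g]\n", y[0], y[1]);  /* expect [6, 15] */
    return 0;
}
```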
Level-3 BLAS: Matrix–Matrix
Level-3 routines work with matrices and do $O(n^3)$ work; these are the main performance workhorses.
Typical operations:
- General matrix–matrix multiply: `gemm`, $C \leftarrow \alpha A B + \beta C$
- Symmetric/Hermitian matrix–matrix operations: `symm`, `hemm`
- Triangular matrix–matrix multiply: `trmm`
- Triangular solve with multiple right-hand sides: `trsm`, $B \leftarrow A^{-1} B$ for triangular $A$
- Rank-k updates: `syrk`, `herk`, $C \leftarrow \alpha A A^T + \beta C$ (or the Hermitian transpose)
Examples of routine names:
- `dgemm` – double-precision general matrix–matrix multiply
- `ztrsm` – complex double-precision triangular solve with a matrix right-hand side
- `ssyrk` – single-precision symmetric rank-k update
Key properties:
- Compute-bound: much higher arithmetic intensity, easier to approach peak flops.
- Critical for HPC: many high-level algorithms are rearranged to use `gemm` and friends due to their performance potential.
- Heavily optimized with tiling, blocking, vectorization, and often multi-threading and GPU offload.
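As a sketch, the corresponding `dgemm` call via CBLAS (small 2×2 matrices, column-major; the expected result is noted in the comments):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C <- alpha*A*B + beta*C with 2x2 A, B, C, all column-major */
    double A[4] = {1.0, 3.0,  2.0, 4.0};   /* [[1 2],[3 4]] */
    double B[4] = {5.0, 7.0,  6.0, 8.0};   /* [[5 6],[7 8]] */
    double C[4] = {0.0, 0.0,  0.0, 0.0};

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* m, n, k       */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb        */
                0.0, C, 2);     /* beta, C, ldc  */

    /* expect C = [[19 22],[43 50]], stored column-major as 19 43 22 50 */
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```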
Naming Conventions and Data Types
BLAS routine names encode both:
- Type/precision: 1st letter
- Operation: remaining letters
Common type prefixes:
- `s` – single-precision real (float)
- `d` – double-precision real (double)
- `c` – single-precision complex
- `z` – double-precision complex
- `i` – integer (for index-returning routines like `idamax`)
Common suffixes:
- `gemm` – general matrix–matrix multiply
- `gemv` – general matrix–vector multiply
- `axpy` – $y \leftarrow \alpha x + y$
- `scal` – scale a vector
- `dot` / `dotc` / `dotu` – dot product variants
- `trsv`, `trsm`, `trmm` – triangular solve/multiply
- `syrk` / `herk` – symmetric/Hermitian rank-k update
- `ger` – general rank-1 update
So for example:
- `dgemm` = double-precision general matrix–matrix multiply
- `caxpy` = complex single-precision $y \leftarrow \alpha x + y$
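In the CBLAS interface, the precision prefix maps directly onto the C types used in the call. A small illustrative sketch (not tied to any particular implementation), showing the same `axpy` operation in single and double precision:

```c
#include <cblas.h>

int main(void) {
    float  xs[3] = {1.0f, 2.0f, 3.0f}, ys[3] = {0.0f, 0.0f, 0.0f};
    double xd[3] = {1.0,  2.0,  3.0 }, yd[3] = {0.0,  0.0,  0.0 };

    cblas_saxpy(3, 2.0f, xs, 1, ys, 1);  /* s prefix: float arguments  */
    cblas_daxpy(3, 2.0,  xd, 1, yd, 1);  /* d prefix: double arguments */
    return 0;
}
```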
Storage Conventions: Column-Major and Leading Dimension
BLAS was historically defined for Fortran, which uses column-major order. Most BLAS implementations preserve this convention even for C interfaces.
Key concepts:
- Matrices are assumed to be stored column by column.
- A matrix $A$ with size $m \times n$ is often described by:
  - `A` – pointer to the first element
  - `LDA` – leading dimension of the array (stride between columns in memory); `LDA` is typically at least `max(1, m)`, often equals `m`, but can be larger to handle padding or submatrices.
In C/C++ code, you can:
- Arrange your data in column-major to match BLAS directly, or
- Use transposed arguments or wrapper layers that hide the layout differences.
Most BLAS routines have arguments like:
- `transA` – `'N'` (no transpose), `'T'` (transpose), or `'C'` (conjugate transpose)
- `m`, `n`, `k` – dimensions
- `lda`, `ldb`, `ldc` – leading dimensions
- `alpha`, `beta` – scalar coefficients
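To make the leading-dimension idea concrete, here is a sketch (CBLAS assumed, names illustrative) that applies `dgemv` to a 3×2 submatrix of a larger 5×4 column-major array by passing a pointer to the submatrix's first element and keeping `lda` equal to the full array's row count:

```c
#include <stdio.h>
#include <cblas.h>

#define LDA 5   /* rows allocated for the full array (leading dimension) */

int main(void) {
    /* Full 5x4 array, column-major: element (i,j) lives at A[i + j*LDA] */
    double A[LDA * 4];
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 5; ++i)
            A[i + j * LDA] = 10.0 * i + j;   /* A(i,j) = 10i + j */

    /* View the 3x2 submatrix starting at (i=1, j=1) without copying:
       pass a pointer to its first element and keep lda = LDA = 5. */
    double *Asub = &A[1 + 1 * LDA];

    double x[2] = {1.0, 1.0};
    double y[3] = {0.0, 0.0, 0.0};

    /* y <- Asub * x, where Asub is 3x2 with leading dimension 5 */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 2,
                1.0, Asub, LDA, x, 1, 0.0, y, 1);

    /* Row i of Asub is (A(1+i,1), A(1+i,2)), so expect y = [23 43 63] */
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);
    return 0;
}
```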
BLAS Implementations in HPC
BLAS is a specification, not a single library. Many optimized implementations exist:
Common vendor and open-source libraries:
- OpenBLAS – open-source, CPU-optimized (x86, ARM, etc.).
- BLIS – modular framework for BLAS-like operations.
- ATLAS (older, largely superseded).
- Intel oneMKL – tuned for Intel CPUs/GPUs.
- AMD AOCL – tuned for AMD CPUs.
- NVIDIA cuBLAS – BLAS interface for NVIDIA GPUs.
- Cray LibSci, IBM ESSL, etc. – vendor-tuned libraries for specific HPC systems.
On many HPC clusters:
- BLAS is provided through modules (e.g., `module load intel-oneapi-mkl` or `module load openblas`).
- Higher-level libraries like LAPACK/ScaLAPACK are compiled against one of these BLAS implementations.
- Switching modules often changes the underlying BLAS implementation used by your application (sometimes without recompiling).
Key points:
- The API (function names/signatures) is mostly identical; performance and threading behavior differ.
- For maximum performance, use the vendor-tuned BLAS of the system you run on, not a generic reference implementation.
Linking and Using BLAS in HPC Applications
In HPC, even if you don’t call BLAS directly, you often link to it via other packages. For basic usage, you should understand:
Linking from C/C++ or Fortran
Typical link flags (cluster-specific):
- `-lblas` – generic BLAS (may be reference or optimized)
- `-lopenblas` – OpenBLAS
- `-lmkl_rt` or MKL link advisor flags – Intel oneMKL
- Vendor-provided wrapper scripts or module variables, such as:
  - `$(MKL_LIBS)`, `$(BLAS_LIBS)` set by environment modules
  - `pkg-config` or CMake `FindBLAS` to discover correct link lines
Example (generic, not cluster-specific):
```
gcc mycode.c -lblas -llapack -o myprog
```

On real HPC systems, check the documentation or module help; link lines can be more complex for MKL or other vendor libraries.
Threading Considerations
Most modern BLAS implementations are multi-threaded:
- They internally use OpenMP or similar to parallelize large operations (especially Level-3).
- You control the number of threads with environment variables, e.g.:
  `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, etc.
In HPC:
- Be careful when combining multi-threaded BLAS with MPI and/or OpenMP in your own code. You may oversubscribe cores if each MPI process spawns many BLAS threads.
- Common practice is to:
- Set BLAS to single-threaded in hybrid MPI+OpenMP codes, and use OpenMP yourself, or
- Keep your own code MPI-only and let BLAS handle threading within each rank (often one rank per node or socket).
This choice is application-dependent and often determined by profiling.
BLAS in the Linear Algebra Stack
Within the numerical libraries ecosystem, BLAS is the foundation:
- LAPACK is largely built on BLAS, especially Level-3 operations like `gemm`, `trsm`, `syrk`, etc.
- Parallel libraries like ScaLAPACK or distributed solver frameworks use local BLAS calls for node-level performance.
- Many domain-specific libraries (e.g., for machine learning, quantum chemistry, CFD) are organized to express computations in terms of BLAS operations wherever possible, in particular `gemm`.
Why this matters:
- If your algorithm can be rephrased to use Level-3 BLAS (especially `gemm`), it will typically run much faster than an explicit triple loop in plain C/Fortran.
- Reusing BLAS gives you performance portability: tuned code for new CPUs/GPUs becomes available simply by linking against an updated library.
When and How to Use BLAS Directly
As an HPC beginner you don’t need to memorize all routines, but you should recognize common use-cases:
- You implement a linear algebra routine yourself (e.g., custom regression, basic iterative solver) and have loops like:
  - Vector updates `y = a*x + y` → use `axpy`.
  - Dot products → use `dot`.
  - Dense matrix–matrix multiplies → use `gemm`.
- You migrate from a naïve implementation to a BLAS-based one to gain performance.
- You are debugging or profiling and see most time spent in routines like `dgemm`, `ztrsm`, etc. That's BLAS doing its work.
Typical practical steps:
- Identify linear algebra kernels in your code.
- Replace hand-written loops with appropriate BLAS calls (see the sketch after this list).
- Link against the optimized BLAS library available on your cluster.
- Profile to ensure that BLAS calls dominate runtime (a good sign in many dense linear algebra algorithms).
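As a sketch of step 2, here is a naïve column-major triple loop next to the single `dgemm` call that replaces it (the helper names `matmul_naive` and `matmul_blas` are illustrative, not from any library):

```c
#include <stdio.h>
#include <cblas.h>

/* Naive version: C <- A*B for column-major m x k A, k x n B, m x n C */
static void matmul_naive(int m, int n, int k,
                         const double *A, const double *B, double *C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * m] * B[p + j * k];
            C[i + j * m] = sum;
        }
}

/* BLAS version: one dgemm call replaces the triple loop */
static void matmul_blas(int m, int n, int k,
                        const double *A, const double *B, double *C) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, m, B, k, 0.0, C, m);
}

int main(void) {
    double A[4] = {1, 3, 2, 4}, B[4] = {5, 7, 6, 8};
    double C1[4], C2[4];
    matmul_naive(2, 2, 2, A, B, C1);
    matmul_blas (2, 2, 2, A, B, C2);
    printf("naive C[0] = %g, blas C[0] = %g\n", C1[0], C2[0]);  /* both 19 */
    return 0;
}
```

The two functions compute the same result; the BLAS version is the one that benefits from the tuned, blocked, vectorized (and possibly threaded) implementation provided by the library you link against.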
BLAS in High-Level Languages (User Perspective)
Many high-level environments already use BLAS internally:
- Python/NumPy/SciPy:
  - `numpy.dot`, the `@` operator, and many linear algebra operations often map to BLAS/LAPACK.
  - Performance depends on which BLAS NumPy was compiled against (MKL, OpenBLAS, etc.).
- R, MATLAB, Julia:
- Core linear algebra routines typically call optimized BLAS under the hood.
- In HPC clusters, modules or container images often provide versions of these tools already linked to optimized BLAS.
Even if you never call `dgemm` yourself, understanding BLAS helps you:
- Interpret performance results (e.g., seeing `dgemm` in a profiler).
- Choose the right modules or containers.
- Reason about threading and performance tuning for linear algebra workloads.