BLAS

Historical context and role of BLAS

The Basic Linear Algebra Subprograms, usually called BLAS, are a standardized collection of low level routines for common linear algebra operations on vectors and matrices. BLAS is not a single implementation. It is a specification that many high performance libraries implement in a compatible way. In HPC, BLAS is the fundamental building block on which higher level libraries such as LAPACK and ScaLAPACK are constructed.

BLAS was introduced in stages. The original Level 1 BLAS defined vector operations. Later, Level 2 and Level 3 BLAS added matrix vector and matrix matrix operations and introduced more performance oriented routines. For HPC users and developers, understanding BLAS is valuable even if you never write your own BLAS. The interface is stable, the performance properties are well understood, and almost every serious numerical code makes use of BLAS either directly or indirectly.

In practice, when you link to BLAS in an HPC setting, you usually do not use a generic reference implementation. You usually use a highly optimized vendor or open source implementation that is tuned for the specific architecture of the cluster.

BLAS is a standard API, not a single library. Different implementations (OpenBLAS, Intel MKL, BLIS, cuBLAS, vendor BLAS) provide the same functions and naming conventions but with different performance characteristics.

Levels of BLAS and typical operations

The BLAS specification is organized into three levels, each corresponding to operations of increasing computational complexity and typical performance.

Level 1 BLAS: vector operations

Level 1 BLAS routines operate on vectors and perform $O(n)$ work. These include operations such as scaling, copying, and dot products. Although simple, they form the basis for many iterative methods and low level algorithmic steps.

Common examples include:

axpy: Compute a scaled vector addition. In mathematical form, given a scalar $a$ and vectors $x$ and $y$, the routine performs
$$ y \leftarrow a x + y. $$

dot: Compute an inner product. For vectors $x$ and $y$,
$$ \alpha \leftarrow x^T y. $$

scal: Scale a vector by a scalar,
$$ x \leftarrow a x. $$

copy and swap: Copy one vector to another or swap the contents of two vectors.

These operations are memory bound on most modern processors. That means that their performance is often limited by the rate at which data can be moved through the memory system, not by the floating point units. They still benefit from vectorization and careful implementation, but their potential speedup on a single core is limited by memory bandwidth.
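
A minimal sketch of calling these Level 1 routines from C, using the CBLAS bindings discussed later on this page (assuming the cblas.h header shipped by OpenBLAS and the Netlib reference implementation; vendor libraries may name the header differently):

/* Level 1 BLAS via CBLAS: y <- a*x + y, then alpha <- x^T y.
   Link against a BLAS library, for example with -lopenblas. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};

    cblas_daxpy(4, 2.0, x, 1, y, 1);            /* daxpy with unit strides */
    double alpha = cblas_ddot(4, x, 1, y, 1);   /* dot product */

    printf("dot = %f\n", alpha);
    return 0;
}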

Level 2 BLAS: matrix vector operations

Level 2 BLAS routines work with a matrix and vectors and perform $O(n^2)$ work on $O(n^2)$ data. A typical operation is matrix vector multiplication. These routines are still often bandwidth limited, because each data element is usually used only a small number of times.

A canonical Level 2 routine is gemv, the general matrix vector multiply. Given a matrix $A$ and vectors $x$ and $y$, the core operation is
$$ y \leftarrow \alpha A x + \beta y. $$

There are also specialized Level 2 routines for symmetric, Hermitian, triangular, and banded matrices. These exploit structure to reduce storage and computation.
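
A sketch of a Level 2 call through the same CBLAS bindings, here a row major dgemv (the layout conventions are explained later on this page):

/* y <- alpha*A*x + beta*y for a 3 x 2 row major matrix A. */
#include <cblas.h>

void gemv_example(void) {
    double A[3*2] = {1.0, 2.0,
                     3.0, 4.0,
                     5.0, 6.0};   /* 3 rows, 2 columns, row major */
    double x[2] = {1.0, 1.0};
    double y[3] = {0.0, 0.0, 0.0};

    /* M = 3, N = 2; in row major layout the leading dimension is the
       number of columns, so lda = 2. */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 2,
                1.0, A, 2, x, 1, 0.0, y, 1);
}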

Level 3 BLAS: matrix matrix operations

Level 3 BLAS routines operate on matrices and perform $O(n^3)$ work on $O(n^2)$ data. They are compute bound on well designed architectures, which means their performance can approach a large fraction of the theoretical peak floating point throughput if implemented carefully.

The most important Level 3 routine is gemm, the general matrix matrix multiplication. Given matrices $A$, $B$, and $C$, it computes
$$ C \leftarrow \alpha A B + \beta C. $$

Variants of Level 3 routines exist for symmetric, Hermitian, and triangular matrices and for solving linear systems with triangular matrices. Linear algebra libraries and high level algorithms are usually expressed in terms of Level 3 BLAS whenever possible, because that level provides the best opportunity to optimize cache reuse, exploit SIMD instructions, and efficiently parallelize computations.

For performance critical dense linear algebra in HPC, always prefer Level 3 BLAS (such as gemm) formulations when possible, because they allow much higher arithmetic intensity and better use of modern CPUs and GPUs.

BLAS routine naming conventions

BLAS routines follow an old but very stable naming pattern. Understanding this pattern is necessary to read documentation and sample codes. Although exact names can vary slightly between languages and bindings, the classic Fortran style names are still widely seen.

A typical BLAS routine name has the form:

<precision><matrix type><operation>

The first letter encodes the data type. Common codes are:

S for single precision real.

D for double precision real.

C for single precision complex.

Z for double precision complex.

The remaining letters identify the matrix type and the operation. In Level 2 and Level 3 names, a two letter code gives the matrix type, such as GE for general, SY for symmetric, HE for Hermitian, TR for triangular, and GB or SB for banded forms, followed by letters for the operation, such as MV for matrix vector multiply, MM for matrix matrix multiply, or SV and SM for triangular solves. Level 1 names consist of the precision prefix and the operation alone. Examples:

AXPY is the Level 1 scaled vector addition.

GEMV is the Level 2 general matrix vector multiply.

GEMM is the Level 3 general matrix matrix multiply.

Putting this together, dgemm is the double precision general matrix matrix multiply, and saxpy is the single precision $y \leftarrow a x + y$ operation. For historical reasons, BLAS names are case insensitive in Fortran, but in C and many modern languages the lowercase versions are usually exposed as function names.

Combining the precision prefix with one of these matrix type codes gives the structured routine names. For example:

DSYMV operates on a double precision symmetric matrix.

ZTRSM solves a triangular system with multiple right hand sides for complex double precision data.

The full naming scheme is covered by BLAS documentation and is consistent across implementations.
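
To make the naming concrete, the Fortran style symbol can also be declared and called directly from C. This is only a sketch: the trailing underscore and 32 bit integer arguments follow a common but compiler dependent convention, which is one reason the CBLAS wrappers are usually preferred.

/* Classic Fortran DGEMM symbol as commonly exported on LP64 builds.
   All arguments are passed by reference, and matrices are column major. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);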

BLAS implementations in HPC environments

Since BLAS is a specification, multiple implementations exist, each with different optimization strategies. On an HPC system you normally do not compile your own BLAS. Instead, you load a module that exposes an optimized BLAS library for the specific hardware.

Common BLAS implementations include:

OpenBLAS, an open source package based on the older GotoBLAS, with architecture specific kernels for many CPUs.

BLIS, a modern framework that provides a BLAS compatible interface and focuses on clean design and high performance.

Netlib BLAS, the original reference implementation, meant primarily for correctness and portability, not performance.

Vendor optimized BLAS, such as Intel MKL on Intel CPUs and AMD optimized libraries on AMD processors. These are typically the fastest choices on the corresponding hardware.

cuBLAS, which provides a BLAS like interface on NVIDIA GPUs and is accessed through CUDA.

In a typical cluster environment, you select one by loading a relevant module. For example, you might run a command like module load mkl or module load openblas. The system documentation will specify which modules correspond to BLAS and how they interact with other numerical libraries.

On HPC clusters, always use the vendor tuned or site recommended BLAS implementation instead of the reference Netlib BLAS. The performance difference can easily be one or two orders of magnitude on large problems.

Linking to BLAS in applications

From an application developer perspective, using BLAS involves two main steps. First, you include the appropriate interface in your code. Second, you link your program against a BLAS library at compile time or through a build system such as Make or CMake.

The details depend on the language. In Fortran, BLAS routines follow the language's native calling conventions: you declare them as external procedures and call them like any other subroutine. In C or C++, BLAS is usually reached through the CBLAS interface, whose header declares prototypes such as the following one for dgemm:

/* C-style prototype for DGEMM from CBLAS */
void cblas_dgemm(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const enum CBLAS_TRANSPOSE TransB,
                 const int M, const int N, const int K,
                 const double alpha,
                 const double *A, const int lda,
                 const double *B, const int ldb,
                 const double beta,
                 double *C, const int ldc);

Most BLAS libraries ship this CBLAS interface, which is more natural for C programmers: you specify row major or column major layout and transposition through enums rather than Fortran character arguments.
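
A minimal, self contained call might look like the following sketch, again assuming the cblas.h header; with OpenBLAS it could be built with something like cc example.c -lopenblas.

/* C <- alpha*A*B + beta*C with cblas_dgemm in row major layout.
   A is M x K, B is K x N, C is M x N. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { M = 2, N = 2, K = 3 };
    double A[M*K] = {1, 2, 3,
                     4, 5, 6};      /* 2 x 3 */
    double B[K*N] = {1, 0,
                     0, 1,
                     1, 1};         /* 3 x 2 */
    double C[M*N] = {0};            /* 2 x 2 result */

    /* Row major leading dimensions: lda = K, ldb = N, ldc = N. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}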

When compiling, you typically add a flag such as -lblas to link with a system BLAS library, or you link to a more specific library such as -lopenblas or -lmkl_intel_lp64 -lmkl_core -lmkl_sequential depending on the environment. On many HPC systems, the module system configures these link flags for you, and you only need to use the compiler wrappers that the site provides.

Data layout and calling conventions

BLAS originated in Fortran, so the standard layout is column major, meaning the elements of each column of a matrix are contiguous in memory. Concretely, if $A$ is an $m \times n$ matrix stored in a packed array, the element $A_{ij}$ (using zero based indices) is stored at offset $i + j \cdot m$, so the row index $i$ varies fastest in memory.

The BLAS interface uses the concept of a leading dimension, written as lda, ldb, and ldc for matrices A, B, and C. For an $m \times n$ matrix, the leading dimension must be at least $m$; it equals $m$ for a packed column major array with no padding, and is larger when the matrix is a submatrix of a bigger array or when padding is used.

In C code, which typically uses row major layout, one must be careful. You can either treat the row major data as the transpose of a column major matrix or use the CBLAS variants that accept an order flag. Many errors and performance issues arise from incorrect leading dimensions and mismatches between transposition flags and actual data layout.

There are similar considerations for vectors. BLAS routines accept a stride parameter, often called incx or incy, which specifies the memory distance between consecutive elements of the vector. A stride of 1 corresponds to contiguous storage and usually gives the best performance; non unit strides are supported but can reduce throughput.

For efficient BLAS use, store matrices contiguously and choose unit strides whenever possible. Misaligned data, non unit strides, and incorrect leading dimensions can significantly reduce performance or even cause incorrect results.
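
The following sketch illustrates column major indexing and the role of the leading dimension, using the same CBLAS conventions as above. It multiplies only the top left block of a larger matrix without copying, simply by passing the full leading dimension together with smaller dimensions.

/* Column major indexing: element (i, j) of a matrix with leading
   dimension lda lives at A[i + (size_t)j * lda]. */
#include <cblas.h>

void top_left_block_gemv(const double *A, int lda,  /* full matrix, column major */
                         const double *x, double *y, int k) {
    /* y <- A(0:k-1, 0:k-1) * x: same pointer, same lda, smaller M and N. */
    cblas_dgemv(CblasColMajor, CblasNoTrans, k, k,
                1.0, A, lda, x, 1, 0.0, y, 1);
}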

Precision, complex numbers, and numerical aspects

BLAS supports both real and complex data types at single and double precision. The choice of type affects not only accuracy but also performance and memory usage.

Double precision (D and Z routines) is typically the default for many scientific applications on CPUs, because it offers about 15 to 16 decimal digits of precision. Single precision uses half the memory and can be faster, especially on GPUs, but with reduced accuracy. Mixed precision strategies, where some steps are performed in single precision and others in double precision, are increasingly common, but those strategies are usually built at a higher level than BLAS itself.

Complex routines such as ZGEMM operate on complex double precision matrices. They are particularly important in areas like quantum mechanics and signal processing. Internally, complex operations require more floating point instructions, so they run slower than their real counterparts, but they are still highly optimized.
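
In the CBLAS interface, complex scalars such as alpha and beta are passed by address as void pointers. A short sketch using C99 complex types, under the same header and linking assumptions as the earlier examples:

/* C <- alpha*A*B + beta*C for n x n complex double matrices, column major. */
#include <complex.h>
#include <cblas.h>

void zgemm_example(const double complex *A, const double complex *B,
                   double complex *C, int n) {
    double complex alpha = 1.0 + 0.0*I;
    double complex beta  = 0.0 + 0.0*I;
    cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
}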

From a numerical perspective, BLAS routines are designed to be stable building blocks, but they do not guarantee perfect conditioning of the problem itself. Using BLAS correctly means understanding that they compute the algebraic operations that you request within the limits of floating point arithmetic. Issues such as ill conditioned matrices or accumulation of rounding errors must be handled through algorithm design at a higher level.

Performance characteristics and tuning considerations

One of the main reasons BLAS is central to HPC is its predictable performance behavior. For each routine, the number of floating point operations and the data movement requirements are well known. This allows performance modeling and tuning.

For example, a dgemm operation on two $n \times n$ matrices performs about $2 n^3$ floating point operations. The data volume is on the order of $3 n^2$ elements. This leads to an arithmetic intensity of order $O(n)$. For sufficiently large $n$, dgemm becomes compute bound and can use the floating point units very effectively. In contrast, a daxpy on $n$ elements performs about $2 n$ floating point operations on $3 n$ elements, which is an arithmetic intensity less than 1 and therefore constrained by memory bandwidth.
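
As a rough illustrative calculation, take $n = 4096$ in double precision. A dgemm then performs about $2 n^3 \approx 1.4 \times 10^{11}$ floating point operations while touching roughly $3 n^2 \cdot 8 \approx 0.4$ GB of matrix data, an arithmetic intensity of about $n/12 \approx 340$ flops per byte. The corresponding figure for daxpy is about $1/12$ flop per byte, independent of $n$, which is why it remains memory bound.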

Optimized BLAS implementations exploit blocking to fit submatrices in cache, use SIMD vector instructions through intrinsics or compiler auto vectorization, and parallelize across multiple cores through threads. When you call a high quality dgemm, you benefit from all this low level tuning automatically.

At the application level, you can influence BLAS performance by choosing problem sizes that fit well in memory hierarchies, avoiding extremely small matrices in performance critical loops, and configuring the number of threads used by the BLAS library. For threaded BLAS, environment variables such as OMP_NUM_THREADS, MKL_NUM_THREADS, or library specific equivalents control thread counts. On MPI based codes, it is important to coordinate these settings so that you do not oversubscribe cores.

In hybrid MPI plus threaded BLAS codes, ensure the total number of threads per node matches the available cores. Otherwise, oversubscription can severely degrade BLAS performance.
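
The exact mechanism for setting the thread count is library specific. As a sketch, OpenBLAS exposes a runtime call for this purpose (openblas_set_num_threads; MKL offers the analogous mkl_set_num_threads), which a hybrid code might use per MPI rank:

/* Sketch: cap BLAS threading inside a hybrid MPI + threads code.
   openblas_set_num_threads is an OpenBLAS specific extension; other
   libraries provide similar calls or rely on environment variables. */
extern void openblas_set_num_threads(int num_threads);

void configure_blas_threads(int cores_per_node, int ranks_per_node) {
    int threads = cores_per_node / ranks_per_node;
    if (threads < 1) threads = 1;
    openblas_set_num_threads(threads);
}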

BLAS on GPUs and accelerators

On GPUs, the same BLAS interface idea appears through libraries such as cuBLAS for NVIDIA GPUs and similar libraries for other accelerators. These provide GPU optimized implementations of BLAS routines, especially Level 3 operations. The calling sequence includes extra arguments to specify device pointers and CUDA streams or similar constructs.

Although the execution model on a GPU differs significantly from CPUs, the conceptual role of BLAS remains the same. Higher level linear algebra libraries build upon GPU BLAS just as CPU bound libraries build upon CPU BLAS. For performance portability, many frameworks offer backends that choose an appropriate BLAS implementation depending on whether the computation executes on CPU or GPU.

Using GPU BLAS efficiently amounts to the same high level principles. Transfer data in sufficiently large chunks to the device, keep it there for multiple BLAS calls if possible, and favor Level 3 routines with high arithmetic intensity. Data transfers between host and device can dominate run time if not managed carefully.
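
A sketch of this pattern with the cuBLAS C API (host code only, error checking omitted; cuBLAS expects column major data):

/* C <- A*B on the GPU with cublasDgemm: allocate on the device,
   copy the inputs once, compute, copy the result back. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_dgemm(const double *hA, const double *hB, double *hC, int n) {
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}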

Practical usage patterns in HPC codes

In real HPC applications, BLAS often appears in one of two ways. Either it is called directly to perform dense linear algebra kernels, or it is hidden inside larger libraries. Many popular packages in areas such as finite element methods, optimization, and machine learning internally invoke BLAS for dense subproblems.

Direct use typically follows patterns such as factorizing a matrix with higher level routines that rely on BLAS, then performing repeated TRSM or GEMM calls to apply the factorization. Even if you do not design your own algorithms, recognizing that these steps are implemented through BLAS is useful when you interpret performance traces or profile output.

At scale, codes that spend most of their time in BLAS kernels are sometimes called BLAS bound. They can benefit greatly from small changes in how matrices are assembled and batched, or from tuning the BLAS library version and configuration. Profiling tools in HPC environments often annotate BLAS calls explicitly, which helps identify hot spots.

Finally, because BLAS has a stable interface, it plays an important role in longevity and portability of HPC codes. A program written decades ago that relies on dgemm and daxpy can still be compiled today and linked against a modern optimized BLAS library. This separation between numerical algorithms and low level performance tuning is one of the reasons BLAS remains a cornerstone of scientific computing.
