LAPACK

Historical context and design goals

LAPACK, the Linear Algebra PACKage, is a widely used numerical library for dense linear algebra on modern architectures. It was designed as the successor to LINPACK and EISPACK, which targeted much older vector and scalar machines. While LINPACK and EISPACK focused on routines that worked well with Level 1 and Level 2 BLAS operations, LAPACK was built to exploit Level 3 BLAS, which provides matrix-matrix operations with much better data reuse.

The key design idea is that most of the floating point work in LAPACK routines is expressed in terms of BLAS calls. LAPACK itself focuses on algorithms and data layout decisions and delegates low level, performance critical kernels to optimized BLAS implementations. This separation makes LAPACK portable, since it is written in standard Fortran, and allows vendors and system builders to provide tuned BLAS libraries that automatically accelerate LAPACK based applications.

LAPACK is primarily targeted at dense matrices. Sparse linear algebra, iterative methods, and domain specific solvers are handled by other libraries and are treated elsewhere in the course.

Data types, naming conventions, and interfaces

LAPACK originally provided a Fortran 77 interface and is still largely organized according to Fortran conventions. This is important for HPC users because many higher level libraries, including distributed memory linear algebra and scientific frameworks, use LAPACK routines underneath. Even if you call it from C, C++, Python, or other languages, you will encounter the LAPACK naming scheme and calling conventions.

Each LAPACK routine name is a short code that encodes four pieces of information: the data type, the matrix type, the operation class, and sometimes a refinement or driver level. For example, the routine DGESV solves a general double precision linear system.

The first character encodes the arithmetic type:
S means single precision real.
D means double precision real.
C means single precision complex.
Z means double precision complex.

The next one or two characters describe the matrix class, for example:
GE means general dense matrix.
PO means symmetric or Hermitian positive definite matrix, stored in full.
SY means symmetric matrix and HE means Hermitian matrix (HE exists only for the complex types).
GB means general band matrix.
TR means triangular matrix.
There are many more such codes, but these are the ones most frequently encountered in basic applications.

The last two or three characters indicate the operation. For instance:
SV means solve a system of linear equations, using a driver routine that handles factorization and solve.
TRF means compute a triangular factorization, such as LU.
TRS means solve a system given a previously computed factorization.
EV or EVD relate to eigenvalue and eigenvector computations.
SVD is used for singular value decompositions.

So DGETRF computes an LU factorization of a general double precision dense matrix, and DGETRS solves a linear system using that factorization. This naming pattern is consistent across LAPACK, and learning to decode these names is an efficient way to navigate the library.

In Fortran, LAPACK routines follow a fixed pattern of arguments. Typically, option flags and problem dimensions come first, each array is followed immediately by its leading dimension, and the list ends with workspace arrays and an integer INFO output that reports success or failure. C and C++ interfaces usually preserve the order and types but adjust indexing conventions. High level languages such as Python hide much of this detail, but understanding the underlying signatures is valuable for HPC users who need fine control or want to diagnose performance issues.

LAPACK routine names encode type, matrix structure, and operation. For example, DGESV: D double precision, GE general matrix, SV system solve. Correctly matching the routine to the problem type and matrix structure is essential for both correctness and performance.

Basic usage patterns for linear systems

From an application point of view, LAPACK offers two broad categories of routines: drivers and computational routines.

Driver routines perform an entire high level task in a single call. For dense linear systems, xGESV is the standard driver, where x is S, D, C, or Z. If you have a general dense matrix $A$ and a right hand side vector or matrix $B$, you can call DGESV to compute $X$ such that $AX = B$. Internally DGESV calls routines like DGETRF and DGETRS, but you do not need to manage these steps explicitly.
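
As a concrete illustration, here is a minimal Fortran sketch of the DGESV driver for a small system. The 3 by 3 matrix, the right hand side, and the program name are arbitrary example choices, and the code assumes you compile and link against a BLAS / LAPACK stack provided by your system (for example with -llapack -lblas or the link options of an environment module).

```fortran
program solve_with_dgesv
  implicit none
  integer, parameter :: n = 3, nrhs = 1
  integer, parameter :: lda = n, ldb = n       ! leading dimensions (rows in memory)
  double precision   :: a(lda, n), b(ldb, nrhs)
  integer            :: ipiv(n), info

  ! Example data, stored column by column (Fortran column major order).
  a = reshape([ 4.0d0, 1.0d0, 2.0d0, &
                1.0d0, 3.0d0, 0.0d0, &
                2.0d0, 0.0d0, 5.0d0 ], shape(a))
  b(:, 1) = [ 1.0d0, 2.0d0, 3.0d0 ]

  ! Driver call: factors A in place and overwrites B with the solution X.
  call dgesv(n, nrhs, a, lda, ipiv, b, ldb, info)

  if (info /= 0) then
    print *, 'DGESV failed, INFO = ', info
  else
    print *, 'solution x = ', b(:, 1)
  end if
end program solve_with_dgesv
```

The argument list follows the pattern described earlier: dimensions, each array followed by its leading dimension, the pivot array, and INFO at the end.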

Computational routines expose individual phases of the algorithm. xGETRF computes an LU factorization with partial pivoting of a general matrix. xGETRS solves one or more systems using that factorization. This split is important when you need to solve many systems with the same coefficient matrix but different right hand sides. In that case you compute the factorization once, then reuse it for each new right hand side. This can significantly reduce the total cost compared to calling the driver repeatedly.
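
A sketch of this factor once, solve many pattern is shown below; the matrix and the loop over artificial right hand sides are illustrative assumptions only.

```fortran
program factor_once_solve_many
  implicit none
  integer, parameter :: n = 3, lda = n, ldb = n
  double precision   :: a(lda, n), b(ldb, 1)
  integer            :: ipiv(n), info, k

  a = reshape([ 4.0d0, 1.0d0, 2.0d0, &
                1.0d0, 3.0d0, 0.0d0, &
                2.0d0, 0.0d0, 5.0d0 ], shape(a))

  ! Factor A = P * L * U once; the factors overwrite a, pivots go into ipiv.
  call dgetrf(n, n, a, lda, ipiv, info)
  if (info /= 0) stop 'DGETRF failed'

  ! Reuse the factorization for several right hand sides.
  do k = 1, 5
    b(:, 1) = dble(k)                          ! some new right hand side
    call dgetrs('N', n, 1, a, lda, ipiv, b, ldb, info)
    if (info /= 0) stop 'DGETRS failed'
    print *, 'solution for right hand side ', k, ': ', b(:, 1)
  end do
end program factor_once_solve_many
```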

The typical dense system workflow is as follows. First, store your matrix in column major order in a two dimensional array A, and define a leading dimension that reflects the number of rows in memory. Second, allocate arrays for the right hand side B and the pivot indices IPIV. Third, call xGESV or the factorization and solve routines. Finally, check the INFO parameter. If INFO is zero the operation succeeded. A positive INFO value indicates a singularity or numerical problem. A negative value indicates that you supplied an invalid argument.

Although LAPACK supports many specialized matrix types, the usage pattern remains similar. For example, if the matrix is symmetric positive definite, xPOSV is the corresponding dense system driver, which internally uses a Cholesky factorization. Choosing xPOSV instead of xGESV both reduces computation and improves numerical stability, but requires that your problem satisfies the positive definiteness assumptions.
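
For comparison, a minimal sketch of the positive definite driver follows; the small example matrix is assumed to be symmetric positive definite, and only its upper triangle is referenced here.

```fortran
program solve_spd_with_dposv
  implicit none
  integer, parameter :: n = 3, lda = n, ldb = n
  double precision   :: a(lda, n), b(ldb, 1)
  integer            :: info

  ! Symmetric positive definite example; only the upper triangle ('U') is read.
  a = reshape([ 4.0d0, 1.0d0, 1.0d0, &
                1.0d0, 3.0d0, 0.0d0, &
                1.0d0, 0.0d0, 2.0d0 ], shape(a))
  b(:, 1) = [ 1.0d0, 2.0d0, 3.0d0 ]

  ! Cholesky based driver: factors A in place and overwrites b with the solution.
  call dposv('U', n, 1, a, lda, b, ldb, info)

  if (info > 0) then
    print *, 'matrix is not positive definite, INFO = ', info
  else if (info == 0) then
    print *, 'solution x = ', b(:, 1)
  end if
end program solve_spd_with_dposv
```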

Whenever possible use a driver routine that matches the matrix structure, such as xPOSV for positive definite systems, instead of a general purpose routine like xGESV. This can substantially improve performance and numerical robustness.

Factorization routines and reuse of decompositions

A major strength of LAPACK is the range of factorization routines that match specific algebraic properties. The most common are LU, Cholesky, and QR factorization routines.

LU factorization is provided for general matrices via xGETRF. It writes the factors into the original matrix storage and produces a pivot vector that records row interchanges. This factorization enables solution of nonsingular linear systems and computation of determinants. Once xGETRF has been called, xGETRS solves linear systems with the same coefficient matrix and different right hand sides without repeating the expensive factorization step.

Cholesky factorization uses routines like xPOTRF for symmetric or Hermitian positive definite matrices. The output factor is typically a triangular matrix $L$ so that $A = LL^T$ or $A = LL^H$, depending on whether the matrix is real or complex. The associated solve routine, xPOTRS, uses this factorization. Cholesky is roughly twice as fast as LU for the same matrix size and is generally more stable, which is why recognizing positive definiteness is so valuable.
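
A brief sketch of the split Cholesky workflow, the computational counterpart of the xPOSV driver shown earlier; the example matrix is again an arbitrary symmetric positive definite choice.

```fortran
program cholesky_factor_and_solve
  implicit none
  integer, parameter :: n = 3, lda = n, ldb = n
  double precision   :: a(lda, n), b(ldb, 1)
  integer            :: info

  a = reshape([ 4.0d0, 1.0d0, 1.0d0, &
                1.0d0, 3.0d0, 0.0d0, &
                1.0d0, 0.0d0, 2.0d0 ], shape(a))

  ! Compute A = L * L**T; the factor L overwrites the lower triangle of a.
  call dpotrf('L', n, a, lda, info)
  if (info /= 0) stop 'DPOTRF failed (matrix not positive definite?)'

  ! Solve using the stored factor; this call can be repeated for new right hand sides.
  b(:, 1) = [ 1.0d0, 2.0d0, 3.0d0 ]
  call dpotrs('L', n, 1, a, lda, b, ldb, info)
  print *, 'solution x = ', b(:, 1)
end program cholesky_factor_and_solve
```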

QR factorization routines are provided via xGEQRF for general matrices, with xORGQR or xUNGQR used to form explicit orthogonal or unitary matrices if needed. In many HPC applications QR factorizations support least squares problems and rank revealing decompositions, which are particularly important when the matrix is rectangular or potentially rank deficient.
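
The following sketch, using assumed example data for a rectangular 4 by 3 matrix, shows a typical DGEQRF plus DORGQR sequence, including a workspace query with LWORK = -1.

```fortran
program qr_factorization
  implicit none
  integer, parameter :: m = 4, n = 3, lda = m
  double precision   :: a(lda, n), tau(n), wquery(1)
  double precision, allocatable :: work(:)
  integer            :: lwork, info

  ! Rectangular 4 x 3 example matrix, stored column by column.
  a = reshape([ 1.0d0, 2.0d0, 3.0d0, 4.0d0, &
                0.0d0, 1.0d0, 1.0d0, 0.0d0, &
                2.0d0, 0.0d0, 1.0d0, 1.0d0 ], shape(a))

  ! Workspace query: LWORK = -1 returns the optimal size in wquery(1).
  call dgeqrf(m, n, a, lda, tau, wquery, -1, info)
  lwork = int(wquery(1))
  allocate(work(lwork))

  ! Compute A = Q * R; R ends up in the upper triangle of a, and Q is stored
  ! implicitly as Householder reflectors below the diagonal plus tau.
  call dgeqrf(m, n, a, lda, tau, work, lwork, info)
  if (info /= 0) stop 'DGEQRF failed'

  ! Optionally form the first n columns of Q explicitly (this overwrites the
  ! stored R, so copy R out first if it is still needed).
  call dorgqr(m, n, n, a, lda, tau, work, lwork, info)
  if (info /= 0) stop 'DORGQR failed'
  print *, 'first column of Q = ', a(:, 1)
end program qr_factorization
```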

Many higher level tasks, such as computing condition numbers, error estimates, or inverse matrices, are built on top of these factorizations. LAPACK therefore provides routines like xTRCON and xTRTRI that operate on the factors. A common pattern in HPC code is to factor a matrix once, reuse the factorization for solves, and occasionally call auxiliary routines to estimate errors or condition numbers without repeating the main factorization.
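
As an example of this pattern for a general matrix, the related routine DGECON estimates the reciprocal condition number from an existing LU factorization. The matrix data below is an arbitrary example; DLANGE supplies the 1-norm of the original matrix that DGECON requires.

```fortran
program estimate_condition_number
  implicit none
  integer, parameter :: n = 3, lda = n
  double precision   :: a(lda, n), work(4 * n), dummy(1), anorm, rcond
  integer            :: ipiv(n), iwork(n), info
  double precision, external :: dlange

  a = reshape([ 4.0d0, 1.0d0, 2.0d0, &
                1.0d0, 3.0d0, 0.0d0, &
                2.0d0, 0.0d0, 5.0d0 ], shape(a))

  ! 1-norm of the original matrix, required by DGECON (the work argument of
  ! DLANGE is not referenced for the 1-norm).
  anorm = dlange('1', n, n, a, lda, dummy)

  ! LU factorization; DGECON operates on the factored matrix.
  call dgetrf(n, n, a, lda, ipiv, info)
  if (info /= 0) stop 'DGETRF failed'

  ! Estimate the reciprocal condition number in the 1-norm.
  call dgecon('1', n, a, lda, anorm, rcond, work, iwork, info)
  if (info == 0 .and. rcond > 0.0d0) then
    print *, 'estimated 1-norm condition number ~ ', 1.0d0 / rcond
  else
    print *, 'matrix is singular to working precision'
  end if
end program estimate_condition_number
```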

Factor matrices returned by LAPACK overwrite the original input matrix storage. Preserve a copy of the original matrix if you need it later, or plan to reconstruct it from the factors. Do not assume that the input matrix remains unchanged after a factorization call.

Eigenvalue and singular value computations

Beyond linear systems, LAPACK offers comprehensive support for eigenvalue problems and singular value decomposition. These operations are central to many HPC applications, including stability analysis, data reduction, and spectral methods.

For eigenvalues of dense matrices there are several drivers, depending on matrix type and desired output. xGEEV computes eigenvalues and optionally left and right eigenvectors of a general, possibly nonsymmetric matrix. For symmetric or Hermitian matrices, xSYEV and related routines exploit symmetry to improve both performance and numerical accuracy. There are also routines that return only a subset of the spectrum or that use divide and conquer algorithms for better parallel performance.
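
A minimal sketch of the symmetric driver DSYEV, using an arbitrary 3 by 3 example matrix and the documented minimum workspace size:

```fortran
program symmetric_eigenvalues
  implicit none
  integer, parameter :: n = 3, lda = n
  integer, parameter :: lwork = 3 * n - 1          ! documented minimum for DSYEV
  double precision   :: a(lda, n), w(n), work(lwork)
  integer            :: info

  ! Symmetric example matrix; only the upper triangle ('U') is referenced.
  a = reshape([ 2.0d0, 1.0d0, 0.0d0, &
                1.0d0, 2.0d0, 1.0d0, &
                0.0d0, 1.0d0, 2.0d0 ], shape(a))

  ! 'V' requests eigenvectors as well; on success they overwrite a.
  call dsyev('V', 'U', n, a, lda, w, work, lwork, info)

  if (info /= 0) then
    print *, 'DSYEV did not converge, INFO = ', info
  else
    print *, 'eigenvalues in ascending order: ', w
  end if
end program symmetric_eigenvalues
```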

The singular value decomposition is computed by routines such as xGESVD and xGESDD. The SVD factors a matrix $A$ into $U \Sigma V^T$ or $U \Sigma V^H$ and is used extensively in numerical linear algebra and data analysis. In LAPACK, you can choose whether you want all singular vectors, a reduced set, or only the singular values. This control is important when dealing with large matrices where storage and computation for all singular vectors may be excessive.
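
A sketch of DGESVD requesting singular values only, again with arbitrary example data; the workspace size uses the documented minimum formula rather than a query.

```fortran
program singular_values_only
  implicit none
  integer, parameter :: m = 4, n = 3, lda = m
  integer, parameter :: lwork = max(3 * min(m, n) + max(m, n), 5 * min(m, n))
  double precision   :: a(lda, n), s(min(m, n)), work(lwork)
  double precision   :: u(1, 1), vt(1, 1)          ! not referenced for 'N', 'N'
  integer            :: info

  a = reshape([ 1.0d0, 2.0d0, 3.0d0, 4.0d0, &
                0.0d0, 1.0d0, 1.0d0, 0.0d0, &
                2.0d0, 0.0d0, 1.0d0, 1.0d0 ], shape(a))

  ! 'N', 'N' requests singular values only; the contents of a are destroyed.
  call dgesvd('N', 'N', m, n, a, lda, s, u, 1, vt, 1, work, lwork, info)

  if (info /= 0) then
    print *, 'DGESVD did not converge, INFO = ', info
  else
    print *, 'singular values in descending order: ', s
  end if
end program singular_values_only
```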

These eigen and singular value drivers typically follow a similar argument pattern to the linear system drivers: input dimensions, leading dimensions, matrix data, output arrays for eigenvalues or singular values, and workspace arrays. Many of these routines support a workspace query: calling them with LWORK set to -1 performs no computation and instead returns the optimal workspace size in the first element of the work array, which you then allocate before the real call. This mechanism allows you to determine the workspace size that gives good performance on a given installation.
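
The query pattern looks like this in Fortran; the random symmetric matrix and its size are arbitrary example choices.

```fortran
program workspace_query
  implicit none
  integer, parameter :: n = 100, lda = n
  double precision   :: a(lda, n), w(n), wquery(1)
  double precision, allocatable :: work(:)
  integer            :: lwork, info

  ! Build a random symmetric example matrix.
  call random_number(a)
  a = a + transpose(a)

  ! First call with LWORK = -1: no computation, the optimal workspace size
  ! is returned in wquery(1).
  call dsyev('V', 'U', n, a, lda, w, wquery, -1, info)
  lwork = int(wquery(1))
  allocate(work(lwork))

  ! Second call performs the actual computation with the optimal workspace.
  call dsyev('V', 'U', n, a, lda, w, work, lwork, info)
  print *, 'INFO = ', info, ', optimal LWORK was ', lwork
end program workspace_query
```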

In practice, high performance eigenvalue and SVD computations often rely heavily on Level 3 BLAS usage within LAPACK. The benefits of using vendor optimized BLAS are even more pronounced here, because these algorithms tend to be more compute intensive and place strong demands on memory bandwidth and cache behavior.

LAPACK, BLAS, and vendor implementations

LAPACK is typically used together with a highly optimized BLAS library. BLAS routines implement the core vector and matrix kernels that dominate the flop count of most LAPACK algorithms. In an HPC environment you rarely use a “reference” BLAS implementation, because it is much slower and not tuned for the specific CPUs and memory systems of a given cluster.

Instead, vendors and open source communities provide tuned BLAS libraries such as OpenBLAS, BLIS, vendor specific math libraries, or BLAS from compiler suites. These libraries often include a copy or fork of LAPACK, so what you link against may be a combined BLAS / LAPACK package. For example, on many platforms you link a single library that provides both components, and the performance of LAPACK then depends directly on the underlying BLAS kernels.

The key point for HPC users is that LAPACK itself is relatively portable and stable. The major performance differences across systems come from the BLAS layer. Consequently, when you move code between clusters, you pay close attention to which linear algebra stack is available and which environment modules or link options are required to access the optimized implementation.

Because LAPACK uses column major storage and Fortran style array layouts, C and C++ codes must either adopt compatible layouts or use adapters when calling the routines. Many modern C and C++ linear algebra libraries provide thin wrappers around LAPACK that take care of argument order, leading dimensions, and type conversions. Similarly, popular high level languages in HPC, such as Python via NumPy and SciPy, call into LAPACK through an intermediary C or C++ layer.

LAPACK performance depends strongly on the BLAS implementation. Always link against an optimized BLAS / LAPACK stack provided by your HPC system, typically through environment modules, rather than using a slow reference implementation.

Using LAPACK in typical HPC software stacks

In an HPC environment you rarely call LAPACK directly in every part of an application. Instead, many higher level libraries and frameworks incorporate LAPACK as their dense linear algebra backend. For example, distributed memory libraries for matrices often rely on ScaLAPACK or similar packages, which in turn call LAPACK and BLAS locally on each process. Scientific codes that solve partial differential equations, perform optimization, or run statistical models may also use LAPACK internally for subproblems.

From a software stack perspective, LAPACK usually sits just above BLAS and just below more domain specific packages. When you choose a math library stack through environment modules, you are implicitly selecting a combination of BLAS, LAPACK, and sometimes FFT and special function libraries. Documentation for these stacks typically lists which LAPACK version is included and how to link against it from C, C++, or Fortran codes.

HPC clusters often provide multiple math library stacks for compatibility and performance reasons. For example, there may be a fully open source stack based on OpenBLAS and LAPACK, and another stack delivered by a compiler vendor with their own tuned implementations. For a given application, testing performance with several stacks can be worthwhile, especially for workloads dominated by dense linear algebra.

In high level languages, such as Python with NumPy, you may not need to manage LAPACK linkage manually, but it still matters which underlying LAPACK is used. Some Python distributions ship with their own math libraries, while others are built against system libraries. When performance matters or when running at scale on HPC systems, you often configure or choose Python environments explicitly so that they use the same optimized BLAS / LAPACK stack as compiled codes on the cluster.

Numerical stability, error reporting, and robustness

LAPACK routines are designed with numerical stability and error reporting in mind. Most drivers and computational routines use well studied algorithms that, under reasonable conditions, deliver results with a small backward error, meaning the computed result is the exact solution of a nearby, slightly perturbed problem. However, LAPACK does not hide numerical issues. Instead, it exposes them via the INFO parameter and auxiliary routines that estimate condition numbers and error bounds.

The INFO parameter conveys several important situations. If INFO is zero the routine completed successfully. If it is negative then the caller supplied an invalid argument, such as a negative dimension or an invalid flag. If it is positive, the routine encountered a numerical or structural problem, such as a singular matrix during factorization or a failure to converge in an eigenvalue iteration. For robust HPC applications, always check INFO and take appropriate action, which may include retrying the computation with different settings, reporting a failure upstream, or switching to a more robust algorithm.
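
A small self-contained sketch of this INFO handling, using a deliberately singular example matrix (its third column is identically zero) so that DGETRF reports a zero pivot:

```fortran
program info_handling
  implicit none
  integer, parameter :: n = 3, lda = n
  double precision   :: a(lda, n)
  integer            :: ipiv(n), info

  ! Deliberately singular example: the third column is identically zero.
  a = reshape([ 1.0d0, 2.0d0, 3.0d0, &
                4.0d0, 5.0d0, 6.0d0, &
                0.0d0, 0.0d0, 0.0d0 ], shape(a))

  call dgetrf(n, n, a, lda, ipiv, info)

  if (info < 0) then
    print *, 'invalid argument number ', -info, ' passed to DGETRF'
  else if (info > 0) then
    ! For DGETRF, INFO = i > 0 means U(i,i) is exactly zero: the matrix is
    ! singular and a subsequent DGETRS call would divide by zero.
    print *, 'matrix is singular, zero pivot at position ', info
  else
    print *, 'factorization succeeded, safe to call DGETRS'
  end if
end program info_handling
```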

Many LAPACK routines also provide optional refinement steps and error estimates. For example, drivers that solve linear systems often support iterative refinement internally, which can improve the accuracy of the solution using residual computations. Some routines return condition number estimates that indicate how sensitive the problem is to perturbations in the data. In an HPC simulation code, this information can be useful to detect ill conditioned problems before they lead to misleading results.

When running at scale, it is also important to consider the reproducibility of LAPACK based computations. LAPACK itself is deterministic for a fixed input and hardware, but its dependence on BLAS means that low level parallelism in BLAS kernels may introduce small variations in floating point rounding, especially when multi threaded BLAS is used. For applications where bitwise reproducibility is important, you may need to control BLAS threading or use specific library options that favor reproducibility over peak speed.

Always check the INFO parameter from LAPACK calls. A nonzero value indicates either invalid input or a numerical problem such as singularity or nonconvergence. Ignoring INFO can silently propagate invalid or inaccurate results through an HPC simulation.

Practical considerations for HPC users

From the point of view of someone learning HPC, the key lessons about LAPACK are practical. First, LAPACK is the standard dense linear algebra library in many scientific codes, and understanding its role helps you interpret performance profiles and debugging messages. Second, even if you mostly work in higher level languages, you benefit from knowing which LAPACK routines are called for basic operations such as solving linear systems or computing eigenvalues.

At the build and deployment level, you will interact with LAPACK primarily through the choice of environment modules and compiler flags. When preparing job scripts and build configurations for HPC clusters, ensure that your application links to the intended BLAS / LAPACK stack. This includes setting appropriate module loads, using the linker wrappers provided by vendor libraries, and checking that multi threading in BLAS does not conflict with other levels of parallelism in your code.

Finally, LAPACK is a mature and stable library, and its interfaces are widely documented. When performance bottlenecks in your applications trace back to dense linear algebra, optimizing around LAPACK often means changing problem formulations or data layouts, rather than modifying LAPACK itself. For example, rearranging computations to reuse factorizations, choosing matrix structures that allow more specialized routines, or changing algorithms so that more time is spent in Level 3 BLAS calls are common strategies for better performance with LAPACK based codes in HPC environments.
