13.1.3 ScaLAPACK

Overview and Purpose of ScaLAPACK

ScaLAPACK is a numerical library for performing dense linear algebra on distributed memory parallel computers. It extends the ideas of LAPACK, which works on a single shared-memory machine, to a world where your matrix is too large to fit in the memory of one node and computation must be distributed across many MPI processes.

Where LAPACK assumes a single process operating on a matrix stored in standard column-major order, ScaLAPACK assumes multiple MPI processes operating collectively on a matrix that is split into blocks and scattered across a grid of processes. ScaLAPACK is therefore a key building block when you need to solve very large dense problems on an HPC cluster.

The design of ScaLAPACK is based on three pillars: MPI for communication, a process grid abstraction for organizing work, and a block-cyclic data distribution to balance load and improve locality. Understanding these ideas is essential for using the library effectively.

ScaLAPACK targets dense matrices on distributed memory systems and uses MPI + block-cyclic distribution as its core model.

Relationship to BLAS, LAPACK, and MPI

ScaLAPACK does not replace BLAS or LAPACK. Instead, it builds on them. Each ScaLAPACK routine calls local LAPACK-like routines, which in turn call optimized BLAS on each process. The global matrix is distributed, but each process holds a local submatrix that can still be processed with ordinary dense BLAS.

In practice, a ScaLAPACK installation that performs and scales well requires:

  1. An MPI implementation, which provides all interprocess communication.
  2. Optimized BLAS and LAPACK libraries for the local node, typically vendor-tuned or high-performance open-source implementations.
  3. The ScaLAPACK library itself, which offers distributed equivalents of key LAPACK routines.

From a programming perspective, you still write an MPI program, but instead of manually managing distributed matrix operations, you call ScaLAPACK routines that encapsulate the global algorithms and communication patterns.

Process Grids and Block-Cyclic Distribution

A central abstraction in ScaLAPACK is the two-dimensional process grid. Suppose your MPI program uses $P$ processes. ScaLAPACK arranges them logically into a grid of $P_r$ rows and $P_c$ columns, with $P_r P_c = P$. Each process is labeled by its coordinates $(p, q)$ in this grid, where $0 \le p < P_r$ and $0 \le q < P_c$.

Matrices are distributed across this process grid in a 2D block-cyclic fashion. To define the distribution, you choose block sizes $b_r$ and $b_c$, which are the row and column dimensions of each tile. Conceptually, the global matrix $A$ is partitioned into blocks of size $b_r \times b_c$, and these blocks are assigned to processes in a round robin pattern over the grid.

If $A$ is an $M \times N$ matrix, then block $A_{ij}$, which covers rows $i b_r$ to $(i+1) b_r - 1$ and columns $j b_c$ to $(j+1) b_c - 1$, is stored on process

$$
p = (i + p_0) \bmod P_r,
\quad
q = (j + q_0) \bmod P_c,
$$

where $(p_0, q_0)$ define the position of the first row and first column of the global matrix on the process grid. In most simple cases $p_0 = 0$ and $q_0 = 0$.
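
As a concrete illustration of this ownership rule, the following small C helper computes the coordinates of the process that stores a given block. The names GridCoord and owner_of_block are illustrative only and are not part of the ScaLAPACK API.

    typedef struct { int p; int q; } GridCoord;

    /* Return the grid coordinates of the process that stores block (i, j)
     * of a matrix distributed block-cyclically over a Pr x Pc process grid,
     * with the first block placed on process (p0, q0). */
    GridCoord owner_of_block(int i, int j, int Pr, int Pc, int p0, int q0) {
        GridCoord c;
        c.p = (i + p0) % Pr;   /* process row that owns block row i       */
        c.q = (j + q0) % Pc;   /* process column that owns block column j */
        return c;
    }

For instance, on a $2 \times 3$ grid with $p_0 = q_0 = 0$, block $(3, 4)$ is stored on process $(1, 1)$.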

Block-cyclic distribution spreads both computation and memory fairly evenly while preserving some locality for contiguous blocks. Large blocks improve locality, since each process owns larger contiguous chunks, but can harm load balance for irregular operations. Smaller blocks improve load balance at the cost of more communication and overhead. Choosing $b_r$ and $b_c$ is therefore a performance tuning decision.

ScaLAPACK matrices are stored in a 2D block-cyclic distribution across a 2D process grid of MPI processes. Correct usage requires that the application and ScaLAPACK agree on this distribution.

The BLACS Layer and ScaLAPACK Descriptors

To manage process grids and communication, ScaLAPACK uses the BLACS, the Basic Linear Algebra Communication Subprograms. BLACS provide routines to create and manipulate process grids on top of MPI communicators and to perform collective and point-to-point operations that are aware of these grids.

Before you can call a ScaLAPACK routine, you must:

  1. Initialize MPI.
  2. Initialize BLACS and create a BLACS context that defines the process grid.
  3. Define the global matrix size, block sizes, and process grid parameters.
  4. Set up a ScaLAPACK array descriptor that encodes how the global matrix is mapped to local memory.
  5. Allocate the local matrix array on each process, with the correct local dimensions.

The ScaLAPACK descriptor is a small integer array, of length 9 for dense matrices, that tells the library everything it needs to know about the global matrix and its distribution. You typically call DESCINIT to fill this descriptor. The descriptor fields include the global dimensions, the block sizes, the process grid information, and the leading dimension of the local array.
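
The following C sketch shows how the grid, the local sizes, and the descriptor fit together for a double precision matrix. It assumes the common conventions that the BLACS C interface (Cblacs_get and related calls) is available and that the Fortran-style ScaLAPACK symbols carry a trailing underscore; exact prototypes and headers vary between installations, so the declarations below are assumptions rather than a definitive interface, and the sizes are placeholders.

    #include <mpi.h>
    #include <stdlib.h>

    /* Assumed prototypes; many installations provide a header for these. */
    extern void Cblacs_get(int icontxt, int what, int *val);
    extern void Cblacs_gridinit(int *icontxt, const char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
    extern void Cblacs_gridexit(int icontxt);
    extern int  numroc_(const int *n, const int *nb, const int *iproc,
                        const int *isrcproc, const int *nprocs);
    extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                          const int *nb, const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Placeholder parameters; a real code derives the grid shape from the
         * MPI process count and tunes the block sizes for the machine.
         * Run this sketch with exactly nprow * npcol = 4 MPI processes. */
        int M = 4000, N = 4000;      /* global matrix dimensions           */
        int mb = 64, nb = 64;        /* block sizes b_r and b_c            */
        int nprow = 2, npcol = 2;    /* process grid P_r x P_c             */

        /* Create a BLACS context and the process grid on top of MPI. */
        int ictxt, myrow, mycol;
        Cblacs_get(-1, 0, &ictxt);                    /* default system context */
        Cblacs_gridinit(&ictxt, "Row", nprow, npcol); /* row-major grid order   */
        Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

        /* Local dimensions owned by this process and the local leading dimension. */
        int izero = 0;
        int locr = numroc_(&M, &mb, &myrow, &izero, &nprow);
        int locc = numroc_(&N, &nb, &mycol, &izero, &npcol);
        int lld  = (locr > 1) ? locr : 1;

        /* Fill the 9-element descriptor and allocate the local piece of A. */
        int descA[9], info;
        descinit_(descA, &M, &N, &mb, &nb, &izero, &izero, &ictxt, &lld, &info);
        double *A_local = malloc((size_t)lld * locc * sizeof(double));

        /* ... fill A_local and call ScaLAPACK routines here ... */

        free(A_local);
        Cblacs_gridexit(ictxt);
        MPI_Finalize();
        return 0;
    }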

Conceptually, if $DESC_A$ describes matrix $A$, then any ScaLAPACK routine that works on $A$ will use $DESC_A$ to interpret the local array A_local correctly. Matching the descriptor to the actual layout of A_local is critical. If the descriptor does not match the local storage, the routine will produce incorrect results or crash.

Every distributed matrix in ScaLAPACK must have a correct descriptor that matches its global size, block sizes, process grid, and local leading dimension.

Calling ScaLAPACK Routines

From the user's point of view, ScaLAPACK routines have interfaces that resemble their LAPACK counterparts, but with additional arguments for the distributed setting. A ScaLAPACK routine name is the corresponding LAPACK name prefixed with P: names starting with PD are parallel double precision real routines, names starting with PZ are parallel double precision complex routines, and so on.

For example, PDGESV solves a distributed dense linear system $A x = b$ using LU factorization with partial pivoting, in double precision real arithmetic, on a distributed matrix $A$ and right-hand side $b$. Its core arguments include:

  1. Global problem sizes (such as $N$, the matrix order).
  2. The local arrays that hold the pieces of the global matrices on each process.
  3. The ScaLAPACK descriptors for each matrix.
  4. Pivot and workspace arrays, whose sizes may depend on the problem and the process grid.

Although the calling sequences are long, the conceptual pattern for many ScaLAPACK routines resembles LAPACK: you provide the global dimensions, the distributed matrices, and you call a routine that performs a standard dense operation such as factorization, solve, eigenvalue computation, or singular value decomposition.

Error reporting follows the LAPACK style, with an integer INFO argument that is negative if an input argument is invalid and positive if the factorization encounters a singularity or fails to converge. Since the computation is global, the same INFO value is usually returned on all processes, but it is common to check INFO only on a designated root rank in your MPI program.
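
As a hedged sketch of such a call in C, the fragment below wraps PDGESV, assuming the trailing-underscore Fortran symbol convention and a setup like the one sketched earlier. The wrapper name solve_distributed_system and its argument names are illustrative; locrA is the local row count of A on this process, as returned by NUMROC.

    #include <stdlib.h>

    /* Assumed prototype for the Fortran-style PDGESV symbol. */
    extern void pdgesv_(const int *n, const int *nrhs, double *a,
                        const int *ia, const int *ja, const int *desca, int *ipiv,
                        double *b, const int *ib, const int *jb, const int *descb,
                        int *info);

    /* Solve A x = b for a distributed matrix A (descriptor descA, local part
     * A_local) and right-hand sides stored in B_local (descriptor descB). */
    int solve_distributed_system(int n, int nrhs,
                                 double *A_local, const int *descA, int locrA,
                                 double *B_local, const int *descB)
    {
        int ione = 1;   /* operate on the whole matrices, starting at global (1, 1) */
        int info = 0;

        /* PDGESV needs a local pivot array of LOCr(A) + MB entries per process;
         * descA[4] holds the row block size MB in the 9-element descriptor. */
        int mb = descA[4];
        int *ipiv = malloc((size_t)(locrA + mb) * sizeof(int));

        pdgesv_(&n, &nrhs, A_local, &ione, &ione, descA, ipiv,
                B_local, &ione, &ione, descB, &info);

        free(ipiv);
        return info;   /* 0 on success, < 0 bad argument, > 0 singular pivot */
    }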

Core Functionality and Problem Classes

ScaLAPACK covers many of the dense linear algebra problem classes that LAPACK supports, but in a distributed memory context. The main categories include:

  1. Linear systems with dense coefficient matrices, including both direct factorization and iterative refinement using routines analogous to GESV and POSV.
  2. Least squares problems and related factorizations, such as QR and RQ factorizations for overdetermined and underdetermined systems.
  3. Symmetric and Hermitian eigenvalue problems, including routines that compute all eigenvalues or a subset and optionally eigenvectors.
  4. Singular value decomposition for dense matrices.

The routine names often mirror the LAPACK names with a P prefix. For example:

  1. DGESV becomes PDGESV for general dense linear systems.
  2. DPOSV becomes PDPOSV for symmetric positive definite systems.
  3. DGELS becomes PDGELS for linear least squares problems.
  4. DSYEV becomes PDSYEV for symmetric eigenvalue problems.
  5. DGESVD becomes PDGESVD for the singular value decomposition.

When designing a distributed algorithm that relies on dense matrix operations, it is useful to first identify the LAPACK routines that would solve the problem on one node and then look up the corresponding ScaLAPACK routines for distributed execution.

Use ScaLAPACK for dense global problems where the data are distributed explicitly across many processes and LAPACK for single node or shared memory problems.

Practical Usage Pattern in an MPI Program

Using ScaLAPACK in an application typically follows a standard sequence of steps. The details of syntax and language bindings depend on whether you use Fortran or C, but the logic is similar.

First, you initialize MPI and determine the total number of processes $P$ and the rank of each process. Then you decide on a process grid layout, such as $P_r \times P_c$, which is often chosen to be as square as possible to balance both load and communication. You then use BLACS routines to create a BLACS context that defines this grid.
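
One convenient way to obtain a nearly square grid is MPI_Dims_create, which factors the process count into balanced dimensions. The sketch below only shows this grid-shape choice; the BLACS grid creation that follows is as in the earlier setup sketch.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Let MPI factor nprocs into two dimensions that are as balanced as
         * possible, e.g. 12 processes -> a 4 x 3 grid. */
        int dims[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);
        int nprow = dims[0], npcol = dims[1];

        if (rank == 0)
            printf("Using a %d x %d process grid for %d processes\n",
                   nprow, npcol, nprocs);

        /* ... create the BLACS grid with nprow and npcol as shown earlier ... */

        MPI_Finalize();
        return 0;
    }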

Next, you define the global matrix sizes $M$ and $N$, choose block sizes $b_r$ and $b_c$, and compute the local matrix sizes on each process. Each process then allocates its local arrays accordingly. You then call the ScaLAPACK descriptor initialization routine to describe the distribution of each matrix.

After setup, the application must fill the distributed matrices with data. This can happen in several ways. One approach is to have one process generate or read the full matrix and then distribute it according to the ScaLAPACK layout. Another approach is to have every process generate only its local part of the matrix, for example in a simulation where the matrix entries are defined by a formula that can be evaluated independently.
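
The second approach relies on translating local indices back to global ones. The sketch below shows this mapping for the zero-based block-cyclic distribution described earlier, with the source process at coordinate 0. The helper names local_to_global and fill_local_matrix, and the entry formula $1 / (i + j + 1)$, are illustrative choices, not part of ScaLAPACK.

    #include <stddef.h>

    /* Map a local 0-based index l to the corresponding global index for a
     * block-cyclic distribution with block size nb over nprocs processes in
     * one grid dimension, where this process has coordinate myproc. */
    int local_to_global(int l, int nb, int myproc, int nprocs) {
        return ((l / nb) * nprocs + myproc) * nb + (l % nb);
    }

    /* Fill the locr x locc local array (column-major, leading dimension lld)
     * with the entries of the global matrix that this process owns. */
    void fill_local_matrix(double *A_local, int lld, int locr, int locc,
                           int mb, int nb, int myrow, int mycol,
                           int nprow, int npcol) {
        for (int lj = 0; lj < locc; ++lj) {
            int gj = local_to_global(lj, nb, mycol, npcol);     /* global column */
            for (int li = 0; li < locr; ++li) {
                int gi = local_to_global(li, mb, myrow, nprow); /* global row    */
                A_local[li + (size_t)lj * lld] = 1.0 / (gi + gj + 1.0);
            }
        }
    }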

Once the matrices are ready, you call the appropriate ScaLAPACK routines for factorization, solve, eigenvalue computation, or SVD. After the computation, if you need global results on one process or on disk, you either gather the distributed matrix back to a single process, or you perform further distributed operations and then use parallel I/O.

Finally, you free local arrays, shut down the BLACS grid, and finalize MPI. The correctness and performance of the entire process depend on respecting the ScaLAPACK distribution model.

Performance Considerations and Scaling

ScaLAPACK aims to achieve good scalability for dense linear algebra workloads on distributed memory systems. However, its performance depends heavily on several factors that the user controls.

The first factor is the process grid shape. For many dense operations, a nearly square process grid, where $P_r$ and $P_c$ are close, reduces communication volume and balances communication among processes. Highly rectangular grids, such as $1 \times P$ or $P \times 1$, often lead to poor scalability for 2D algorithms.

The second factor is the block size. Very small blocks increase the overhead due to metadata, communication, and synchronization. Very large blocks can lead to imbalanced work or poor overlap between computation and communication. In practice, a moderate block size, often tens or a few hundreds of rows and columns, is chosen based on the machine and the problem. Tuning block sizes is a simple way to explore performance improvements.

The third factor is the quality of the underlying BLAS. Since a large fraction of the runtime in dense linear algebra is spent in local BLAS operations, high performance BLAS libraries are essential. Poorly optimized BLAS will undermine the parallel performance no matter how well the data are distributed.

Network performance also matters. ScaLAPACK routines rely on collective communications among processes. On networks with high latency or low bandwidth, these collectives can become bottlenecks, especially when scaling to large process counts. High performance interconnects and MPI implementations with efficient collective algorithms are therefore beneficial.

Finally, algorithmic properties of the problems themselves affect scalability. Some operations, such as matrix multiply, have a high ratio of arithmetic work to communication and tend to scale well. Others, such as certain eigenvalue and SVD computations, have more complex communication patterns and are more sensitive to latency and synchronization.

To obtain good performance with ScaLAPACK, choose a balanced process grid, sensible block sizes, and high quality BLAS, and be aware that some algorithms scale better than others.

Integration into HPC Software Stacks

On many HPC systems, ScaLAPACK is provided as part of the standard software stack, coordinated with vendor MPI and BLAS libraries. It is often accessible through environment modules or as part of bundled math libraries.

On some platforms, vendor libraries provide ScaLAPACK compatible interfaces along with highly tuned BLAS and LAPACK, sometimes using the same routine names or providing wrappers. When using such stacks, your build process must link against the correct library sequence, typically MPI, BLACS, ScaLAPACK, LAPACK, and BLAS, while respecting the particular conventions of the system.
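
As an illustration only, a C program built against the Netlib reference ScaLAPACK 2.x, which bundles the BLACS into libscalapack, is often linked along the following lines; actual library names, ordering, and compiler wrappers differ between vendors and systems.

    mpicc myprog.c -o myprog -lscalapack -llapack -lblas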

Numerous higher level scientific codes and libraries build on ScaLAPACK or its ideas. In some environments, alternative libraries such as Elemental or distributed dense linear algebra layers built over runtime systems replace or complement ScaLAPACK. However, ScaLAPACK remains a widely used reference implementation for distributed dense linear algebra and is a common target when porting algorithms to large HPC systems.

Understanding ScaLAPACK, its distribution model, and its constraints gives you a foundation for using existing distributed solvers efficiently and for reasoning about how dense linear algebra behaves at scale on modern HPC clusters.
