What ScaLAPACK Is About
ScaLAPACK (Scalable Linear Algebra PACKage) is a library for performing dense linear algebra on distributed-memory parallel computers, typically using MPI. Conceptually, it extends LAPACK’s functionality from a single node (or shared-memory system) to a whole cluster.
Where LAPACK assumes matrices live in the memory of a single process, ScaLAPACK assumes:
- Matrices are too large for one node.
- Data must be distributed across many processes.
- Communication happens via MPI.
ScaLAPACK is therefore central when you want to solve large dense problems—too big for a single node—using many nodes in an HPC cluster.
Key Ideas: Distributed Dense Linear Algebra
Block-Cyclic Data Distribution
ScaLAPACK is built on the idea of a block-cyclic distribution of matrices over a 2D process grid.
- MPI processes are arranged in a logical $P \times Q$ process grid.
- A global matrix $A$ of size $M \times N$ is divided into blocks of size $M_b \times N_b$.
- These blocks are then distributed in a round-robin (cyclic) fashion across the process grid.
This helps:
- Balance the workload (each process gets roughly the same amount of data).
- Avoid bottlenecks where a single process has all “hard” rows/columns.
You don’t work directly with the global matrix. Each process stores its local pieces (blocks) and ScaLAPACK routines know how to combine them to implement algorithms.
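To make the mapping concrete, here is a small sketch (a toy standalone program with assumed sizes and variable names, not taken from the ScaLAPACK documentation). It computes which process owns a given global entry using the block-cyclic formula, and uses ScaLAPACK's serial helper `numroc` to count how many rows/columns one process stores; it only needs to be linked against a ScaLAPACK library.

```fortran
! Sketch: block-cyclic ownership of a global matrix entry, plus local extents.
program block_cyclic_map
  implicit none
  integer, external :: numroc                 ! ScaLAPACK tool routine
  integer, parameter :: m = 1000, n = 1000    ! global matrix size (assumed)
  integer, parameter :: mb = 64, nb = 64      ! block sizes (assumed)
  integer, parameter :: nprow = 2, npcol = 3  ! process grid shape (assumed)
  integer :: i, j, prow, pcol, myrow, mycol, locr, locc

  ! Which process owns global entry (i, j)?  (first block on process (0,0))
  i = 500; j = 700
  prow = mod((i - 1) / mb, nprow)
  pcol = mod((j - 1) / nb, npcol)
  print *, 'entry (', i, ',', j, ') lives on process (', prow, ',', pcol, ')'

  ! How many rows/columns does process (myrow, mycol) store locally?
  myrow = 0; mycol = 1
  locr = numroc(m, mb, myrow, 0, nprow)
  locc = numroc(n, nb, mycol, 0, npcol)
  print *, 'process (', myrow, ',', mycol, ') owns a ', locr, ' x ', locc, ' local array'
end program block_cyclic_map
```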
BLACS and PBLAS
ScaLAPACK builds on two important layers:
- BLACS (Basic Linear Algebra Communication Subprograms): a communication layer providing MPI-based operations tailored for linear algebra on a 2D process grid (broadcasts, reductions, point-to-point, etc., expressed in matrix-oriented terms).
- PBLAS (Parallel BLAS): parallel analogs of BLAS operations (e.g., matrix-matrix multiply, matrix-vector multiply) that work on distributed matrices using the block-cyclic layout and BLACS for communication.
ScaLAPACK’s higher-level routines (eigenvalue solvers, factorizations, etc.) call PBLAS and BLACS internally.
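To give a flavor of the PBLAS layer, the fragment below (a sketch, not a complete program; the variable names are assumptions) calls `pdgemm`, the distributed analog of BLAS `dgemm`. Compared with `dgemm`, the leading dimensions are replaced by global index offsets plus one descriptor per matrix, while each process passes only its local blocks.

```fortran
! PBLAS sketch (fragment): C := A*B on block-cyclically distributed matrices.
! a, b, c hold only this process's local blocks of the global M x K, K x N,
! and M x N matrices; desca/descb/descc are their ScaLAPACK descriptors
! (assumed to be set up as described later in this section).
call pdgemm('N', 'N', m, n, k, 1.0d0, a, 1, 1, desca, &
            b, 1, 1, descb, 0.0d0, c, 1, 1, descc)
```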
Problem Types ScaLAPACK Targets
ScaLAPACK covers many of the same dense operations as LAPACK, but for distributed matrices:
- Linear systems: $A x = b$ (LU factorization and solves)
- Least-squares problems: via QR or SVD
- Eigenvalue problems:
- Symmetric/Hermitian eigenproblems
- General (non-symmetric) eigenproblems
- Singular value decompositions (SVD)
- Auxiliary routines: matrix balancing, scaling, etc.
If you know the LAPACK routine name, the ScaLAPACK counterpart often has a similar name, but with:
- A `p` prefix (for "parallel"), and
- Additional arguments for the distributed layout and BLACS context.
Naming Conventions and Routine Structure
ScaLAPACK routine names are close to LAPACK, but with:
- A type prefix:
  - `s` = single precision real
  - `d` = double precision real
  - `c` = single precision complex
  - `z` = double precision complex
- A leading `p` for "parallel" in many routines.
- A mnemonic similar to LAPACK for the operation.
Examples:
- `pdgesv` – double precision (`d`), parallel (`p`), GEneral system of linear equations (`sv`).
- `pdgetrf` – LU factorization of a distributed dense matrix.
- `pdgetrs` – solve a system using a distributed LU factorization.
- `pdsyev` – eigenvalues/eigenvectors of a distributed symmetric matrix.
- `pdgeqrf` – QR factorization of a distributed matrix.
These functions typically:
- Take a BLACS context and process grid parameters.
- Expect matrix arguments given in ScaLAPACK format, including local array dimensions and block sizes.
- Follow similar argument ordering to LAPACK, with extra parameters for the distribution.
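To see those extra parameters concretely, here is a call-shape comparison between LAPACK's LU routines and their ScaLAPACK counterparts (a fragment with assumed variable names, not a complete program):

```fortran
! LAPACK: a and b are ordinary local arrays with leading dimensions lda, ldb.
call dgetrf(m, n, a, lda, ipiv, info)
call dgetrs('N', n, nrhs, a, lda, ipiv, b, ldb, info)

! ScaLAPACK: a and b hold only local blocks; the leading dimensions are
! replaced by global offsets (IA, JA) = (1, 1) plus the descriptors, and
! ipiv is distributed along with the matrix rows.
call pdgetrf(m, n, a, 1, 1, desca, ipiv, info)
call pdgetrs('N', n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
```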
Working with ScaLAPACK in Practice
Typical Workflow
At a high level, using ScaLAPACK involves the following steps (a minimal end-to-end sketch follows the list):
- Initialize MPI:
  - Call `MPI_Init`, obtain `MPI_COMM_WORLD`, rank, and size.
- Set up a BLACS process grid:
  - Convert `MPI_COMM_WORLD` into a BLACS context.
  - Define a 2D process grid of size `nprow × npcol`.
- Describe the matrix distribution:
  - Choose block sizes `MB`, `NB` (tuning parameters).
  - Use ScaLAPACK descriptor arrays (e.g., `DESC_A`) that encode:
    - Global matrix size
    - Block sizes
    - Process grid mapping
    - Leading dimension of local storage
- Allocate and initialize local data:
  - Each process allocates only its local blocks.
  - Initialize from a file, an analytic formula, or by reading partitions of a dataset.
- Call ScaLAPACK routines:
  - For example `pdgesv`, `pdpotrf`, `pdsyev`, etc.
  - Routines use MPI communication internally via BLACS.
- Gather results (if needed):
  - Often you need global results on one process or in a shared file.
  - Use PBLAS / ScaLAPACK helper routines or separate MPI I/O or gathers.
- Finalize:
  - Free BLACS contexts.
  - Call `MPI_Finalize`.
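As a concrete illustration of this sequence, here is a minimal sketch that solves a small dense system with `pdgesv`. The problem size, block sizes, hard-coded 2 × 2 process grid (four MPI ranks), and matrix entries are arbitrary choices made for the example, and error checking is kept to a minimum; treat it as a template of the call sequence rather than production code.

```fortran
program scalapack_pdgesv_demo
  use mpi
  implicit none
  integer, external :: numroc, indxl2g
  integer, parameter :: n = 8, nrhs = 1        ! tiny global problem (assumed)
  integer, parameter :: mb = 2, nb = 2         ! block sizes (assumed)
  integer :: ierr, iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer :: locr, locc, locb, info, i, j, ig, jg
  integer :: desca(9), descb(9)
  integer, allocatable :: ipiv(:)
  double precision, allocatable :: a(:,:), b(:,:)

  ! 1) Initialize MPI.
  call MPI_Init(ierr)

  ! 2) Set up a 2 x 2 BLACS process grid (assumes the job runs on 4 ranks).
  call blacs_pinfo(iam, nprocs)
  call blacs_get(-1, 0, ictxt)
  nprow = 2; npcol = 2
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  ! 3) Describe the distribution: local extents and descriptors.
  locr = numroc(n, mb, myrow, 0, nprow)
  locc = numroc(n, nb, mycol, 0, npcol)
  locb = numroc(nrhs, nb, mycol, 0, npcol)
  call descinit(desca, n, n, mb, nb, 0, 0, ictxt, max(1, locr), info)
  call descinit(descb, n, nrhs, mb, nb, 0, 0, ictxt, max(1, locr), info)

  ! 4) Allocate and fill only the local blocks (diagonally dominant matrix).
  allocate(a(max(1, locr), max(1, locc)), b(max(1, locr), max(1, locb)))
  allocate(ipiv(locr + mb))
  do j = 1, locc
     jg = indxl2g(j, nb, mycol, 0, npcol)      ! local -> global column index
     do i = 1, locr
        ig = indxl2g(i, mb, myrow, 0, nprow)   ! local -> global row index
        a(i, j) = merge(2.0d0 * n, 1.0d0, ig == jg)
     end do
  end do
  b = 1.0d0

  ! 5) Solve A x = b; the solution overwrites the local part of b.
  call pdgesv(n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
  if (iam == 0) print *, 'pdgesv finished with info =', info

  ! 6) Release the grid, shut down BLACS (keep MPI alive), finalize MPI.
  call blacs_gridexit(ictxt)
  call blacs_exit(1)
  call MPI_Finalize(ierr)
end program scalapack_pdgesv_demo
```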
Data Descriptors
Every distributed matrix is described by a descriptor (an integer array, typically of size 9). This descriptor encodes:
- The BLACS context handle.
- Global size `M`, `N`.
- Block sizes `MB`, `NB`.
- Process row/column owning the first block.
- Leading dimension of the local array.
ScaLAPACK routines use these descriptors to understand how data is distributed without you passing MPI communicators directly.
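In practice the descriptor is usually filled by the helper routine `descinit` rather than by hand. The fragment below is a sketch with assumed variable names (`ictxt`, `myrow`, `nprow`, etc. as in the workflow above); the comments list what ends up in each of the nine entries.

```fortran
! Sketch: build the descriptor of an M x N matrix distributed in MB x NB
! blocks over the process grid attached to the BLACS context ictxt.
integer :: desca(9), locr, lld, info
integer, external :: numroc

locr = numroc(m, mb, myrow, 0, nprow)   ! rows stored by this process
lld  = max(1, locr)                     ! leading dimension of the local array

call descinit(desca, m, n, mb, nb, 0, 0, ictxt, lld, info)
! desca(1) = 1        descriptor type (dense matrix)
! desca(2) = ictxt    BLACS context
! desca(3), desca(4) = m, n    global size
! desca(5), desca(6) = mb, nb  block sizes
! desca(7), desca(8) = 0, 0    process row/column owning the first block
! desca(9) = lld      local leading dimension
! A nonzero info flags an illegal argument.
```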
Integration with BLAS/LAPACK/MPI
Under the hood:
- Each process calls BLAS and LAPACK locally on its blocks.
- Inter-process communication is handled by BLACS (built on MPI).
- PBLAS routines combine local BLAS operations with communication to implement parallel linear algebra operations.
This layered design means:
- The numerical kernels are still tuned BLAS on each node.
- Parallelism and communication patterns are handled separately.
Performance and Scalability Considerations
ScaLAPACK is designed for strong and weak scaling of dense problems across many processes. Performance depends on several key choices:
Process Grid Shape
- A 2D process grid (e.g., `nprow × npcol`) is usually preferable to a 1D grid.
- For square or nearly square matrices, a square process grid (`nprow ≈ npcol`) often gives better load balance and communication.
- Choose `nprow × npcol` equal to the number of MPI processes (see the sketch after this list).
- Avoid extremely skinny grids (e.g., `1 × P`) unless the matrix is heavily rectangular in a compatible way.
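A simple way to pick a near-square grid is to factor the MPI process count, as in this sketch (`choose_grid` is a hypothetical helper, not a ScaLAPACK routine):

```fortran
! Pick the most nearly square nprow x npcol factorization of nprocs,
! with nprow <= npcol and nprow * npcol = nprocs.
subroutine choose_grid(nprocs, nprow, npcol)
  implicit none
  integer, intent(in)  :: nprocs
  integer, intent(out) :: nprow, npcol
  integer :: p

  ! Start from floor(sqrt(nprocs)) and walk down to the nearest divisor.
  nprow = int(sqrt(real(nprocs)))
  do p = nprow, 1, -1
     if (mod(nprocs, p) == 0) then
        nprow = p
        exit
     end if
  end do
  npcol = nprocs / nprow
end subroutine choose_grid
```

For example, 12 processes yield a 3 × 4 grid and 16 processes a 4 × 4 grid.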
Block Size Selection
Block sizes MB and NB affect:
- Load balance.
- Amount of parallelism.
- Local BLAS efficiency (cache usage).
- Communication frequency.
Heuristics:
- Avoid very small block sizes: 64–256 is often a reasonable starting range.
- Use block sizes in multiples of cache-friendly BLAS tile sizes.
- Some vendor libraries and documentation provide recommended values.
Too small blocks → too much communication and overhead.
Too large blocks → less ability to overlap work and poorer load balance.
Communication vs Computation
Dense linear algebra can be communication-heavy at large scale. ScaLAPACK routines aim to:
- Minimize communication volume where possible.
- Use efficient collective operations via BLACS.
However:
- At very high process counts or on slow interconnects, communication can dominate.
- Parallel efficiency can decrease if the problem size per process becomes too small.
In practice:
- Aim for enough work per process: with $P$ processes on a near-square grid, each process holds a local block of dimension roughly $N/\sqrt{P}$, so the global dimension $N$ should be large compared to $\sqrt{P}$.
- Use performant interconnects (InfiniBand, high-speed Ethernet with MPI tuned accordingly).
Node-Level Parallelism
Within each process, you can still use:
- Multithreaded BLAS (e.g., OpenMP threads inside BLAS).
- MPI-only with one process per core (or mix of the two).
This becomes a hybrid parallel setup (MPI + threads). Balancing MPI ranks vs. threads per rank can significantly affect overall performance, but the principle for ScaLAPACK is: each process simply sees its local blocks, and local computations can use threads.
When to Use ScaLAPACK (and When Not To)
Use ScaLAPACK when:
- Matrices are too large to fit in memory on a single node.
- You want to scale dense linear algebra across many nodes.
- You need standard operations (LU/QR/Cholesky/SVD/eigenvalues) on dense, globally coupled problems.
ScaLAPACK is not designed for:
- Sparse matrices (where most entries are zero and special formats are needed).
- Extremely irregular data structures.
- Problems where dense factorization cost is prohibitive and iterative or sparse methods are more appropriate.
In such cases, specialized distributed sparse solvers or iterative methods are usually better.
Using ScaLAPACK on HPC Systems
On many clusters, ScaLAPACK is provided as part of the system math libraries:
- Intel oneAPI MKL includes ScaLAPACK.
- OpenBLAS-based builds may come with ScaLAPACK.
- Vendor-optimized libraries (Cray LibSci, IBM ESSL, etc.) provide compatible interfaces.
Typical steps:
- Load appropriate modules (e.g., `module load mkl` or `module load cray-libsci`).
- Link your code with MPI, BLAS, LAPACK, and ScaLAPACK:
- Often via compiler wrappers and specific link flags (documented by your system).
Example (conceptual, details are system-specific):
```
mpif90 mycode.f90 -L$MKLROOT/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
    -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm
```
Always check your cluster’s documentation; some provide wrapper commands (like mkl_link_tool, craype, or pkg-config-style scripts) that simplify linking.
Practical Tips and Common Pitfalls
- Initialization order matters:
- MPI init → BLACS context setup → descriptor creation → ScaLAPACK calls.
- Consistent grids:
- All descriptors used together in a routine must share compatible BLACS contexts and process grids.
- Error handling:
- ScaLAPACK routines usually return `INFO`: `INFO = 0` means success, `INFO < 0` an illegal argument, and `INFO > 0` an algorithm-specific issue (e.g., a singular matrix).
- Testing small first:
- Start with small matrix sizes and a small process grid.
- Verify correctness before scaling up.
- I/O strategy:
- Global matrices don’t “exist” in one process.
- Plan I/O: write distributed data in parallel, or gather to one process only when sizes allow.
- Reproducibility:
- Parallel factorizations/eigenvalue routines may show small differences in results vs. serial LAPACK because of different operation order and rounding.
- For most practical purposes these are acceptable, but bitwise reproducibility may require careful control of process counts and libraries.
Beyond Classical ScaLAPACK
ScaLAPACK is a mature and widely used standard, but newer projects extend or replace parts of it to better exploit:
- GPUs and heterogeneous nodes.
- More advanced communication-avoiding algorithms.
- Modern runtime systems.
Examples include libraries like ELPA, Elemental, SLATE, and vendor-specific distributed solvers. Despite that, ScaLAPACK remains a fundamental reference point and is still heavily used in production codes, especially for distributed dense eigenproblems and factorizations on CPU-based systems.