
ScaLAPACK

What ScaLAPACK Is About

ScaLAPACK (Scalable Linear Algebra PACKage) is a library for performing dense linear algebra on distributed-memory parallel computers, typically using MPI. Conceptually, it extends LAPACK’s functionality from a single node (or shared-memory system) to a whole cluster.

Where LAPACK assumes matrices live in the memory of a single process, ScaLAPACK assumes:

  • the global matrix is distributed across the local memories of many processes,
  • each process stores and works on only its own portion of the matrix,
  • processes cooperate through explicit communication (BLACS on top of MPI).

ScaLAPACK is therefore central when you want to solve large dense problems—too big for a single node—using many nodes in an HPC cluster.

Key Ideas: Distributed Dense Linear Algebra

Block-Cyclic Data Distribution

ScaLAPACK is built on the idea of a block-cyclic distribution of matrices over a 2D process grid: the global matrix is cut into MB × NB blocks, and these blocks are dealt out round-robin along the rows and columns of an nprow × npcol grid of processes.

This helps:

  • balance the load, since every process owns blocks from all regions of the matrix,
  • keep processes busy even as factorizations work on a shrinking trailing submatrix,
  • let local computations run as efficient level-3 BLAS on contiguous blocks.

You don’t work directly with the global matrix. Each process stores its local pieces (blocks) and ScaLAPACK routines know how to combine them to implement algorithms.
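
To make the mapping concrete, here is a small self-contained sketch; the matrix size, block sizes, and grid shape are made-up illustration values. It uses numroc (from ScaLAPACK's TOOLS routines) to compute how large each process's local piece is, and the standard block-cyclic formula to locate the owner of a global entry.

! Sketch: how a global matrix maps onto a 2D process grid under the
! block-cyclic distribution. All sizes below are illustrative.
program block_cyclic_sketch
   implicit none
   integer, parameter :: m = 10, n = 10        ! global matrix size
   integer, parameter :: mb = 2, nb = 2        ! block sizes
   integer, parameter :: nprow = 2, npcol = 3  ! process grid shape
   integer :: prow, pcol, locr, locc, i, j
   integer, external :: numroc                 ! ScaLAPACK TOOLS function

   ! How much local storage each process coordinate (prow, pcol) owns.
   do prow = 0, nprow - 1
      do pcol = 0, npcol - 1
         locr = numroc(m, mb, prow, 0, nprow)  ! local rows on process row prow
         locc = numroc(n, nb, pcol, 0, npcol)  ! local cols on process col pcol
         print *, 'process (', prow, ',', pcol, ') stores a ', locr, ' x ', locc, ' local block'
      end do
   end do

   ! Which process owns global entry (i, j)?  Standard block-cyclic mapping,
   ! assuming the distribution starts on process row/column 0.
   i = 7
   j = 5
   print *, 'entry (7,5) lives on process (', mod((i - 1) / mb, nprow), ',', mod((j - 1) / nb, npcol), ')'
end program block_cyclic_sketch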

BLACS and PBLAS

ScaLAPACK builds on two important layers:

  • BLACS (Basic Linear Algebra Communication Subprograms): manages process grids and the communication between processes, typically implemented on top of MPI.
  • PBLAS (Parallel BLAS): distributed-memory versions of the BLAS operations (e.g., matrix–vector and matrix–matrix products) that work on block-cyclically distributed matrices.

ScaLAPACK’s higher-level routines (eigenvalue solvers, factorizations, etc.) call PBLAS and BLACS internally.
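
As an illustration of the BLACS layer on its own, the following sketch creates a 2D process grid from MPI_COMM_WORLD, reports each process's grid coordinates, and tears the grid down again; the near-square grid choice is just one reasonable default.

! Sketch: set up and tear down a BLACS process grid.
program blacs_grid_sketch
   use mpi
   implicit none
   integer :: ierr, nprocs, myrank
   integer :: ictxt, nprow, npcol, myrow, mycol

   call mpi_init(ierr)
   call mpi_comm_size(mpi_comm_world, nprocs, ierr)
   call mpi_comm_rank(mpi_comm_world, myrank, ierr)

   ! Pick a roughly square grid; nprow*npcol may be smaller than nprocs,
   ! in which case the leftover ranks are not part of the grid.
   nprow = int(sqrt(dble(nprocs)))
   npcol = nprocs / nprow

   call blacs_get(-1, 0, ictxt)                     ! default system context
   call blacs_gridinit(ictxt, 'Row', nprow, npcol)  ! row-major process grid
   call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

   if (myrow >= 0) then                             ! ranks inside the grid
      print *, 'rank', myrank, 'is at grid position (', myrow, ',', mycol, ')'
      call blacs_gridexit(ictxt)
   end if
   call mpi_finalize(ierr)
end program blacs_grid_sketch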

Problem Types ScaLAPACK Targets

ScaLAPACK covers many of the same dense operations as LAPACK, but for distributed matrices:

  • linear systems via LU and Cholesky factorizations (e.g., pdgesv, pdpotrf/pdpotrs),
  • least-squares problems and QR factorizations (e.g., pdgels, pdgeqrf),
  • symmetric/Hermitian eigenvalue problems (e.g., pdsyev, pdsyevd),
  • singular value decomposition (e.g., pdgesvd).

If you know the LAPACK routine name, the ScaLAPACK counterpart usually has a similar name with a leading p and extra arguments that describe the distribution; the next section spells this out.

Naming Conventions and Routine Structure

ScaLAPACK routine names are close to LAPACK, but with:

  • a p prefix ("parallel") in front of the usual precision and matrix-type letters,
  • additional arguments: global row/column starting indices (IA, JA) and descriptor arrays instead of plain leading dimensions.

Examples:

  • dgesv → pdgesv (general linear system via LU),
  • dpotrf → pdpotrf (Cholesky factorization),
  • dsyev → pdsyev (symmetric eigenvalue problem).

These functions typically do the following (a call sketch follows the list):

  1. Take a BLACS context and process grid parameters.
  2. Expect matrix arguments given in ScaLAPACK format, including local array dimensions and block sizes.
  3. Follow similar argument ordering to LAPACK, with extra parameters for the distribution.
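
The following fragment sketches that argument pattern for pdgesv; it assumes the matrix A and right-hand sides B are already distributed and described by desca/descb (descriptors are covered below), so only the call itself is shown.

! Sketch: LAPACK vs. ScaLAPACK call pattern for a linear solve.
subroutine solve_distributed(n, nrhs, a, desca, b, descb, ipiv, info)
   implicit none
   integer, intent(in)             :: n, nrhs, desca(9), descb(9)
   double precision, intent(inout) :: a(*), b(*)        ! local pieces only
   integer, intent(out)            :: ipiv(*), info

   ! LAPACK:    call dgesv ( n, nrhs, a, lda, ipiv, b, ldb, info )
   ! ScaLAPACK: a p prefix, global starting indices (ia, ja) = (1, 1), and
   !            descriptors in place of plain leading dimensions.
   call pdgesv(n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
end subroutine solve_distributed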

Working with ScaLAPACK in Practice

Typical Workflow

At a high level, using ScaLAPACK involves the following steps (a minimal code sketch follows the list):

  1. Initialize MPI:
    MPI_Init, obtain MPI_COMM_WORLD, rank, and size.
  2. Set up a BLACS process grid:
    • Convert MPI_COMM_WORLD into a BLACS context.
    • Define a 2D process grid: size nprow × npcol.
  3. Describe the matrix distribution:
    • Choose block sizes MB, NB (tuning parameters).
    • Use ScaLAPACK descriptor arrays (e.g., DESC_A) that encode:
      • Global matrix size
      • Block sizes
      • Process grid mapping
      • Leading dimensions of local storage
  4. Allocate and initialize local data:
    • Each process allocates only its local blocks.
    • Initialize from file, analytic formula, or by reading partitions of a dataset.
  5. Call ScaLAPACK routines:
    • For example pdgesv, pdpotrf, pdsyev, etc.
    • Routines use MPI communication internally via BLACS.
  6. Gather results (if needed):
    • Often you may need global results on one process or in a shared file.
    • Use PBLAS / ScaLAPACK helper routines or separate MPI I/O/gathers.
  7. Finalize:
    • Free BLACS contexts.
    • Call MPI_Finalize.
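
Putting the steps together, here is a minimal sketch that solves A x = b with pdgesv. The problem size, block sizes, near-square grid choice, and the trick of strengthening the diagonal to keep A comfortably nonsingular are all illustrative choices, not requirements.

! Sketch: end-to-end ScaLAPACK workflow (MPI + BLACS + descriptors + pdgesv).
program scalapack_workflow_sketch
   use mpi
   implicit none
   integer, parameter :: n = 1000, nrhs = 1      ! illustrative problem size
   integer, parameter :: mb = 64, nb = 64        ! illustrative block sizes
   integer :: ierr, nprocs, myrank
   integer :: ictxt, nprow, npcol, myrow, mycol
   integer :: locrows, loccols, locrhs, lld, info, i
   integer :: desca(9), descb(9)
   integer, allocatable :: ipiv(:)
   double precision, allocatable :: a(:,:), b(:,:)
   integer, external :: numroc

   ! 1. Initialize MPI.
   call mpi_init(ierr)
   call mpi_comm_size(mpi_comm_world, nprocs, ierr)
   call mpi_comm_rank(mpi_comm_world, myrank, ierr)

   ! 2. Set up a roughly square BLACS process grid.
   nprow = int(sqrt(dble(nprocs)))
   npcol = nprocs / nprow
   call blacs_get(-1, 0, ictxt)
   call blacs_gridinit(ictxt, 'Row', nprow, npcol)
   call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
   if (myrow < 0) then                           ! ranks outside the grid
      call mpi_finalize(ierr)
      stop
   end if

   ! 3. Describe the distribution: local sizes and descriptors.
   locrows = numroc(n, mb, myrow, 0, nprow)
   loccols = numroc(n, nb, mycol, 0, npcol)
   locrhs  = numroc(nrhs, nb, mycol, 0, npcol)
   lld     = max(1, locrows)
   call descinit(desca, n, n,    mb, nb, 0, 0, ictxt, lld, info)
   call descinit(descb, n, nrhs, mb, nb, 0, 0, ictxt, lld, info)

   ! 4. Allocate and fill only the local blocks.
   allocate(a(lld, max(1, loccols)), b(lld, max(1, locrhs)))
   allocate(ipiv(locrows + mb))
   call random_number(a)
   call random_number(b)
   do i = 1, n                                   ! strengthen the diagonal so A
      call pdelset(a, i, i, desca, dble(n))      ! is comfortably nonsingular
   end do

   ! 5. Solve the distributed system; the solution overwrites b.
   call pdgesv(n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
   if (myrank == 0) print *, 'pdgesv returned info =', info

   ! 6./7. Release the grid and finalize MPI.
   call blacs_gridexit(ictxt)
   call mpi_finalize(ierr)
end program scalapack_workflow_sketch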

Data Descriptors

Every distributed matrix is described by a descriptor (an integer array, typically of size 9). This descriptor encodes:

  • the descriptor type (1 for dense matrices),
  • the BLACS context the matrix lives in,
  • the global dimensions M and N,
  • the block sizes MB and NB,
  • the process row and column holding the first block (RSRC, CSRC),
  • the leading dimension of the local array (LLD).

ScaLAPACK routines use these descriptors to understand how data is distributed without you passing MPI communicators directly.
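
As a sketch of what a descriptor holds, the helper below (hypothetical, assuming ictxt is an existing BLACS context and lld the local leading dimension) fills one with descinit and lists the conventional meaning of its nine entries.

! Sketch: building and reading a ScaLAPACK matrix descriptor.
subroutine make_descriptor(ictxt, m, n, mb, nb, lld, desc)
   implicit none
   integer, intent(in)  :: ictxt, m, n, mb, nb, lld
   integer, intent(out) :: desc(9)
   integer :: info

   ! Conventional layout of the 9-element descriptor (type 1 = dense matrix):
   !   desc(1) = 1      descriptor type
   !   desc(2) = ictxt  BLACS context
   !   desc(3) = m      global rows          desc(4) = n    global columns
   !   desc(5) = mb     row block size       desc(6) = nb   column block size
   !   desc(7) = 0      process row owning the first block
   !   desc(8) = 0      process column owning the first block
   !   desc(9) = lld    leading dimension of the local array
   call descinit(desc, m, n, mb, nb, 0, 0, ictxt, lld, info)
   if (info /= 0) print *, 'descinit reported info =', info
end subroutine make_descriptor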

Integration with BLAS/LAPACK/MPI

Under the hood:

  • ScaLAPACK routines are built from PBLAS calls plus some direct BLACS communication,
  • PBLAS combines BLACS communication with sequential BLAS/LAPACK for the local computation on each process,
  • BLACS is almost always implemented on top of MPI.

This layered design means:

  • node-level performance comes mostly from the local BLAS/LAPACK implementation (MKL, Cray LibSci, OpenBLAS, …),
  • scalability comes from how the communication layers are organized,
  • you can usually switch to a vendor-optimized stack at link time without changing your source code.

Performance and Scalability Considerations

ScaLAPACK is designed for strong and weak scaling of dense problems across many processes. Performance depends on several key choices:

Process Grid Shape

Choosing the shape of the nprow × npcol grid matters:

  • grids that are close to square usually balance row-wise and column-wise communication best,
  • very flat or very tall grids can turn one direction of communication into a bottleneck,
  • nprow × npcol should normally use all (or nearly all) of your MPI processes, since ranks outside the grid sit idle.
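
One simple way to pick such a grid is sketched below: take the largest divisor of the process count that does not exceed its square root, so every MPI rank ends up inside the grid (the helper name is just for illustration).

! Sketch: choose the most nearly square grid whose size equals nprocs.
subroutine choose_grid(nprocs, nprow, npcol)
   implicit none
   integer, intent(in)  :: nprocs
   integer, intent(out) :: nprow, npcol
   integer :: p

   nprow = 1
   do p = 1, int(sqrt(dble(nprocs)))
      if (mod(nprocs, p) == 0) nprow = p   ! largest divisor <= sqrt(nprocs)
   end do
   npcol = nprocs / nprow
end subroutine choose_grid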

Block Size Selection

Block sizes MB and NB affect:

  • load balance across the process grid,
  • the efficiency of the local level-3 BLAS calls (bigger blocks mean more work per call),
  • the granularity and volume of communication.

Heuristics:

  • Too-small blocks → too much communication and bookkeeping overhead.
  • Too-large blocks → less ability to overlap work and poorer load balance.
  • Values of a few tens up to a couple of hundred (e.g., MB = NB = 64) are a common starting point; tune for your system.

Communication vs Computation

Dense linear algebra can be communication-heavy at large scale. ScaLAPACK routines aim to:

  • organize the work into blocked algorithms so that most arithmetic happens in local level-3 BLAS calls,
  • aggregate communication into fewer, larger messages along process rows and columns.

However, as the number of processes grows, communication and synchronization take an increasing share of the runtime, so each process needs enough local work to stay efficient.

In practice, it pays to benchmark a few process-grid shapes and block sizes for your problem sizes, and to grow the problem with the process count (weak scaling) rather than expecting unlimited strong scaling.

Node-Level Parallelism

Within each process, you can still use:

  • a multithreaded BLAS/LAPACK (e.g., threaded MKL or OpenBLAS) for the local block operations,
  • OpenMP threading in your own surrounding code.

This becomes a hybrid parallel setup (MPI + threads). Balancing MPI ranks vs. threads per rank can significantly affect overall performance, but the principle for ScaLAPACK is: each process simply sees its local blocks, and local computations can use threads.
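
As a tiny illustration of the hybrid idea (compile with OpenMP enabled and run under MPI), each rank below reports how many threads its OpenMP regions may use; a threaded BLAS typically honors a similar setting, and the product of ranks and threads per rank should normally match the cores you have reserved.

! Sketch: each MPI rank reports its available OpenMP thread count.
program hybrid_sketch
   use mpi
   use omp_lib
   implicit none
   integer :: ierr, myrank

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, myrank, ierr)
   print *, 'rank', myrank, 'may use up to', omp_get_max_threads(), 'threads'
   call mpi_finalize(ierr)
end program hybrid_sketch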

When to Use ScaLAPACK (and When Not To)

Use ScaLAPACK when:

  • your matrices are dense and too large for the memory of a single node,
  • you need standard factorizations, eigenvalue problems, or SVDs of such matrices,
  • your application is already (or can reasonably become) MPI-based.

ScaLAPACK is not designed for:

  • large sparse matrices, where storing and factorizing the matrix densely would be wasteful or impossible,
  • problems small enough for a single node, where LAPACK plus a threaded BLAS is simpler and usually faster,
  • workloads that mainly need iterative solvers on sparse or matrix-free operators.

In such cases, specialized distributed sparse solvers or iterative methods are usually better.

Using ScaLAPACK on HPC Systems

On many clusters, ScaLAPACK is provided as part of the system math libraries:

  • Intel MKL ships ScaLAPACK and BLACS interfaces for several MPI implementations,
  • Cray systems provide it inside cray-libsci,
  • otherwise, a netlib ScaLAPACK build is often available as a separate module.

Typical steps:

  1. Load appropriate modules (e.g., module load mkl or module load cray-libsci).
  2. Link your code with MPI, BLAS, LAPACK, and ScaLAPACK:
    • Often via compiler wrappers and specific link flags (documented by your system).

Example (conceptual, details are system-specific):

mpif90 mycode.f90 -L$MKLROOT/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
                  -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm

Always check your cluster’s documentation; some systems provide tools that simplify linking, such as Intel’s mkl_link_tool, the Cray compiler wrappers (ftn, cc, CC), or pkg-config-style scripts.

Practical Tips and Common Pitfalls

  • Always check the INFO argument returned by ScaLAPACK routines; a nonzero value signals an input error or a numerical failure.
  • Keep descriptors, block sizes, and the BLACS context consistent between all matrices passed to the same routine.
  • Remember that IA, JA, IB, JB are global, 1-based indices into the distributed matrix, not local indices.
  • Make sure the local leading dimension LLD is at least the number of local rows (max(1, LOCr)).
  • Allocate workspace and IPIV arrays with the sizes the documentation asks for; undersized local arrays lead to crashes that are hard to debug.

Beyond Classical ScaLAPACK

ScaLAPACK is a mature and widely used standard, but newer projects extend or replace parts of it to better exploit:

  • GPUs and other accelerators,
  • many-core nodes and deep memory hierarchies,
  • asynchronous, task-based execution instead of bulk-synchronous phases.

Examples include libraries like ELPA, Elemental, SLATE, and vendor-specific distributed solvers. Despite that, ScaLAPACK remains a fundamental reference point and is still heavily used in production codes, especially for distributed dense eigenproblems and factorizations on CPU-based systems.
