Historical background: how Linux took over HPC
In early supercomputing, most systems ran proprietary UNIX variants or vendor-specific operating systems. These had three major drawbacks for long‑term scientific use:
- Tight coupling to specific hardware vendors
- High licensing costs
- Limited flexibility for customization
Linux changed this landscape by being:
- Free and open-source
- Portable to many architectures (x86, ARM, Power, etc.)
- UNIX-like, so familiar to existing HPC users
As commodity x86 hardware became powerful and cheap, clusters built from standard servers plus Linux quickly outcompeted custom supercomputers on price/performance. Over time:
- Hardware vendors added strong Linux support (drivers, tools)
- Major schedulers, MPI implementations, and scientific codes standardized on Linux
- National labs and universities adopted Linux-based clusters for most new systems
Today, every system on the TOP500 list runs some Linux distribution; the list has been entirely Linux-based since 2017.
Key technical reasons Linux fits HPC so well
1. Open-source and modifiable
For HPC centers and vendors, being able to see and modify the OS code is crucial:
- Kernel customization
Administrators can:
- Strip out unnecessary features to reduce overhead
- Patch for performance or scalability
- Tune scheduling and memory behavior for large parallel jobs
- Fast support for new hardware
When new CPUs, interconnects, or accelerators appear, vendors and the community can:
- Add drivers directly to the kernel
- Optimize subsystems (e.g., networking stack, I/O stack) for specific workloads
- Long-term maintainability
Scientific applications often need to run for decades. With open source:
- There’s no dependence on a single company’s OS roadmap
- Bugs can be fixed even if the original vendor loses interest
For HPC users, this openness means:
- Greater transparency when debugging strange behavior
- Ability to collaborate with system staff on deep, system-level performance tuning
- A large ecosystem of open-source tools (profilers, debuggers, libraries)
2. Stability and robustness for long jobs
HPC workloads often involve:
- Jobs running for hours, days, or even weeks
- Thousands or tens of thousands of processes working together
Linux is favored because it offers:
- Proven stability at large scale
It has been stress-tested in production clusters worldwide and on many generations of hardware.
- Mature process and memory management
Essential for:
- Running many user jobs on shared login nodes
- Managing large memory jobs on compute nodes
- Avoiding OS-level crashes under heavy load
- Predictable updates
Enterprise/HPC-focused distributions (e.g., RHEL, Rocky, SUSE, Ubuntu LTS) provide:
- Long-term support (LTS) releases
- Security and bug fixes without disruptive changes
- Controlled upgrade paths that preserve compatibility with existing applications
3. Performance and scalability features
Linux includes or supports many features that matter specifically to HPC performance:
- Efficient process and thread scheduling
While details of scheduling belong elsewhere, Linux provides:
- Configurable CPU affinity (pinning processes/threads to cores)
- NUMA-aware scheduling on multi-socket nodes
- Hook points for HPC runtimes to manage core placement
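On a Linux node, affinity can be inspected and controlled directly from the shell. A minimal sketch (`taskset` ships with util-linux, `numactl` with the numactl package; `./my_solver` is a placeholder binary, and the core/node numbers are illustrative):

```shell
# Show the CPU affinity list of the current shell
taskset -cp $$    # e.g. "pid 1234's current affinity list: 0-63"

# Pinning a run to specific cores or a NUMA node looks like this
# (shown as comments because the binary and node layout are site-specific):
#   taskset -c 0-3 ./my_solver                       # pin to cores 0-3
#   numactl --cpunodebind=0 --membind=0 ./my_solver  # CPUs and memory of NUMA node 0
```

MPI launchers and batch schedulers usually expose the same controls through their own binding options, built on these kernel interfaces.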
- Advanced networking and interconnect support
HPC heavily depends on low-latency, high-bandwidth networks:
- Native support for InfiniBand, high-speed Ethernet, and vendor-specific interconnects
- Kernel-bypass technologies (e.g., RDMA) to reduce communication overhead
- Large memory and huge page support
Important for:
- Memory-intensive simulations
- Reducing TLB pressure and improving memory throughput
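Huge page state is visible through standard kernel interfaces, so you can check it on any node (the paths below are standard locations, though exact availability depends on the kernel build):

```shell
# Huge page pool sizes and the default huge page size
grep Huge /proc/meminfo

# Transparent huge page policy (typically: always, madvise, or never)
cat /sys/kernel/mm/transparent_hugepage/enabled
```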
- Filesystem and I/O flexibility
Linux supports:
- Parallel filesystems (Lustre, GPFS, BeeGFS, etc.)
- Custom I/O schedulers and tuning for streaming or random access workloads
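As one example of this flexibility, Lustre exposes its data layout directly to users through client-side tools. A sketch (assumes the Lustre client tools are installed and `/scratch` is a Lustre mount; the paths are hypothetical):

```shell
# Show how an existing file is striped across Lustre storage targets (OSTs)
lfs getstripe /scratch/demo/output.dat

# Stripe a directory across 8 OSTs so large parallel writes are spread out
lfs setstripe -c 8 /scratch/demo/big_run
```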
These features combine to allow:
- Efficient scaling from single-node jobs to multi-thousand-node parallel runs
- Fine-grained tuning of performance-critical paths (compute, network, storage)
4. Hardware and vendor ecosystem support
HPC systems combine components from many vendors: CPUs, GPUs, interconnects, storage, etc. Linux dominates in this environment because:
- All major hardware vendors support Linux first
They provide:
- Production-grade drivers
- Performance libraries (e.g., BLAS, communication libraries)
- Diagnostic and management tools
- Rapid support for new architectures
As new architectures (e.g., ARM, RISC-V, specialized accelerators) emerge:
- Linux is usually the first OS ported and supported at scale
- This gives researchers access to cutting-edge hardware without waiting for proprietary OS support
- Management and monitoring integration
Cluster management tools (for provisioning, monitoring, configuration) are mostly designed around Linux, simplifying:
- Automated node installation
- Health and performance monitoring
- Centralized configuration management
5. Software ecosystem for scientific computing
Most HPC software is developed and tested primarily on Linux. This creates a strong network effect:
- Compilers and toolchains
The main HPC compilers (GCC, Clang/LLVM, Intel, vendor-specific compilers) have:
- First-class Linux support
- Integration with Linux-specific features like perf events, debug info formats, and system libraries
- Numerical libraries and parallel runtimes
Widely used libraries and frameworks (BLAS, LAPACK, ScaLAPACK, FFT libraries, MPI implementations, OpenMP runtimes, CUDA, ROCm, etc.) are:
- Officially supported on Linux
- Often optimized specifically for Linux environments
- Development and debugging tools
HPC workflows rely on:
- Command-line tools
- Linux debuggers, profilers, performance counters
- Scripting languages (Python, Bash, etc.) that integrate naturally into Linux shells
- Package and environment management
Linux-native tools (e.g., environment modules, Spack, conda) are widely used to:
- Provide multiple compiler and library versions side-by-side
- Keep complex software stacks manageable and reproducible
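A typical session on a cluster might combine these tools as follows (the module and package names here are illustrative; every site defines its own):

```shell
# Environment modules: select a specific compiler/MPI combination
module load gcc/12.2 openmpi/4.1

# Spack: build a library against that toolchain, then bring it into the environment
spack install hdf5 ^openmpi
spack load hdf5
```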
For you as a user, this means:
- Most examples, tutorials, and documentation assume Linux
- You can move your code between different clusters with fewer surprises
- Help and community support are easier to find
6. Licensing, cost, and accessibility
From a center’s perspective, Linux has major practical advantages:
- No per-node OS licensing costs
Especially important when a cluster has hundreds or thousands of nodes.
- Flexible support models
Centers can:
- Use community-supported distributions (e.g., Debian, Ubuntu, Fedora)
- Purchase enterprise support if needed (e.g., RHEL, SUSE)
- Combine in-house expertise with vendor support
- Lower barriers for education and research
Because Linux is freely available:
- Students and researchers can install similar environments on personal machines
- Training clusters and teaching environments can be built cheaply
- Course materials and tutorials can assume a common, accessible platform
7. Customization for specialized workflows
Many HPC workloads are unconventional from a general-purpose desktop OS perspective:
- Massive MPI jobs
- Real-time or near real-time data processing
- Highly specialized workflows (e.g., weather forecasting, quantum chemistry, Lattice QCD)
Linux accommodates these through:
- Configurable kernel options and patches
For instance, system integrators can:
- Enable or disable specific kernel features
- Apply HPC-focused patches for improved scaling
- Adjust kernel parameters to favor throughput over interactivity
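Kernel tunables are exposed under `/proc/sys` and adjusted with `sysctl`. A minimal sketch (the values are illustrative only; real tuning depends on workload and kernel version):

```shell
# Read a tunable: how aggressively the kernel swaps out application pages
cat /proc/sys/vm/swappiness

# As root, a compute-node profile might set, for example:
#   sysctl -w vm.swappiness=1         # keep application pages in RAM
#   sysctl -w vm.zone_reclaim_mode=0  # avoid NUMA-local reclaim stalls
```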
- Flexible init and service management
Services not needed on compute nodes can be disabled to:
- Reduce noise (background activity that interferes with tightly coupled computations)
- Minimize overhead and potential failure points
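With systemd, trimming a compute node's service set is one command per service (the service names are examples; what is safe to disable varies by site, and these commands require root):

```shell
# Disable and immediately stop services that compute jobs never need
systemctl disable --now bluetooth.service cups.service

# Verify what is still running afterwards
systemctl list-units --type=service --state=running
```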
- Deep integration with schedulers and resource managers
Linux provides interfaces that schedulers use to:
- Enforce resource limits (CPU, memory, I/O)
- Track and account usage per job
- Control cgroups and namespaces, which are also foundational for containers
8. HPC users and Linux skills
Because Linux dominates HPC:
- Most cluster access is via Linux command line
Even if you use Windows or macOS locally, you’ll typically connect to a Linux login node.
- Most examples and documentation are Linux-centric
Commands, scripts, and job examples usually assume a Bash shell and Linux filesystem conventions.
- Linux familiarity becomes a core HPC skill
For scientific and engineering computing, being comfortable with:
- Basic shell usage
- Files and permissions
- Environment configuration
is often as important as knowing the programming language you use.
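Those basics are small, concrete skills. A self-contained warm-up you can run on any Linux machine, touching all three points above:

```shell
# Create a tiny script, make it executable, and run it
printf '#!/bin/sh\necho hello\n' > run.sh   # write a two-line shell script
chmod u+x run.sh                            # grant the owner execute permission
ls -l run.sh                                # the mode column now starts with -rwx
./run.sh                                    # prints: hello
```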
This course’s Linux-related chapters are designed with that reality in mind: understanding Linux is not an optional extra in HPC; it’s part of the foundation.
Summary: why other OSes are rare in HPC
Putting it all together, Linux dominates HPC because it uniquely combines:
- Open-source flexibility and transparency
- Proven stability at very large scale
- High performance and support for advanced interconnects and filesystems
- Broad hardware and vendor ecosystem support
- A rich scientific and parallel computing software stack
- Low cost and accessible licensing
- Strong customization capabilities for specialized workloads
Other operating systems exist in HPC-related contexts (e.g., for data pre-/post-processing, or on user desktops), but when it comes to the actual compute nodes of large clusters and supercomputers, Linux is overwhelmingly the standard choice.