Role of the Operating System in HPC
Every high performance computing system runs on top of an operating system. In almost all cases this system is a variant of Unix, most often Linux. For the end user, the operating system provides the basic contract for interacting with the machine: it decides how your program is started, how memory is allocated to it, how it accesses files, and how it communicates with other programs. On an HPC cluster, the operating system also cooperates with the resource manager and scheduler to control who can run where and when.
On a laptop you might mainly think of the operating system as the graphical interface and the set of applications you see on screen. On an HPC system you usually interact with the operating system through a text based shell. Learning to use this environment is essential, because it is the main way to submit jobs, inspect results, manage data, and control software versions. The same underlying concepts also influence performance. For example, process creation, memory allocation, and file system semantics are all shaped by operating system design and can either help or hinder your code's performance.
Although HPC systems vary in scale and hardware, their operating system environments have many common features. Understanding these will let you move between different clusters more easily and will make documentation clearer when it speaks about processes, files, permissions, or environment variables.
Why Linux Dominates HPC
Linux is the dominant operating system in HPC, both on top supercomputers and on modest research clusters. This is not an accident. Several technical and practical reasons, taken together, make Linux the default choice.
A key property of Linux is that it is open source. The kernel and most system components can be inspected and modified. For HPC system administrators, this means they can tailor the operating system to their specific hardware, integrate low level drivers for high speed interconnects, adjust scheduling and memory behavior, and remove unnecessary components that add overhead. This flexibility is difficult to achieve with proprietary systems.
Linux also runs on a wide range of processor architectures, including x86-64, Arm, and various accelerator platforms. This portability is important in HPC, where new architectures appear frequently. Vendors of CPUs, GPUs, and high speed network hardware almost always release their first and most complete drivers for Linux. This shortens the time between hardware release and real scientific use.
From the user perspective, Linux offers a rich ecosystem of development tools. Compilers, debuggers, profilers, and mathematical libraries have strong Linux support. Popular parallel programming models like MPI and OpenMP are tested and optimized primarily on Linux environments. Scripts and workflows that rely on standard Unix tools tend to run with minimal change across clusters at different institutions.
Linux also fits well with the multi user, batch oriented nature of HPC. Its permission model and process control mechanisms allow many users to share a system securely while running long jobs under a scheduler. PAM, SSH, and filesystem permissions help enforce access control, while mechanisms like cgroups and namespaces, available in modern kernels, support resource isolation and containers.
Finally, there is a cultural factor. Many scientific software packages were born on Unix like systems. Shell scripts, Makefiles, and build systems assume a Linux or Unix environment. Once such an ecosystem is established, it reinforces itself. New tools tend to target the environment where their users already are.
The Linux Command Line in HPC Context
On HPC systems users normally connect remotely through the Secure Shell protocol, using an ssh client. After logging in, you are placed in a shell, commonly bash or zsh. Everything you do, from compiling code to launching parallel jobs, is initiated from this command line environment.
The command line is not only a way to run single programs. It is also a scripting environment. You can capture sequences of commands in shell scripts, automate pre and post processing, or glue together components of a workflow. These scripts are particularly important in HPC because jobs are usually non interactive. Once submitted to the scheduler, they run unattended and rely entirely on scripted actions.
In this environment, basic concepts like the current working directory, standard input, standard output, and standard error become crucial. A typical HPC job script changes to a project directory, runs one or more commands, and writes text output to files. Understanding how to redirect output, append to files, or pipe data between commands is central for effective use.
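As a minimal sketch, assuming a bash shell and purely hypothetical program and file names, the following lines show the redirection and piping patterns that recur in almost every job script:

    cd /path/to/project                        # run everything relative to a known directory
    ./preprocess input.dat > prep.log          # redirect standard output to a file
    ./simulate params.txt >> run.log           # append output to an existing log file
    grep Energy run.log | tail -n 5            # pipe the output of one command into another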
Although the next chapter will go into basic Linux usage, it is useful here to place that usage in the HPC setting. Many commands are used repeatedly in cluster workflows. You inspect directory contents with simple tools, view log files produced by jobs, check disk usage quotas, and manipulate large numbers of files. Because HPC clusters are shared systems, you also need to be aware of policies around login node usage and avoid running long or intensive computations directly in interactive shells on those nodes.
The Linux command line is also the primary way to control your environment. You can view and manipulate environment variables, modify your shell configuration to load default modules, or set paths to custom software installs in your home or project directories. All of this is mediated by shell commands that interact with the underlying operating system.
Filesystems, Paths, and Permissions
In Linux, every file and directory is part of a single tree that starts at the root /. On an HPC cluster this tree often spans multiple physical storage systems, some local to nodes and some shared across the cluster. This unified view is convenient because the same paths that you use on a login node are usually valid on compute nodes where your jobs run.
HPC environments typically offer several distinct filesystem areas, each with its own purpose and performance characteristics. A common pattern is a relatively small home directory for configuration files and small datasets, one or more project or group directories for larger shared data, and a high performance scratch area for temporary I/O intensive work. These areas usually differ in backup policy and expected lifetime of data.
The Linux notion of the working directory is directly relevant. When your job script starts it inherits a working directory. Commands that use relative paths interpret them with respect to that directory. Many job failures are caused by missing data because a script assumed one directory but executed in another. For robust HPC workflows, scripts often start with an explicit cd to a known location.
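A hedged sketch of this pattern, assuming the Slurm scheduler (which exports SLURM_SUBMIT_DIR) and hypothetical program and file names; other schedulers provide similar variables under different names:

    #!/bin/bash
    # Do not rely on an inherited working directory; move to a known location first.
    cd "${SLURM_SUBMIT_DIR:-$HOME/project}" || exit 1   # fallback path is purely illustrative
    ./analyse data/input.dat > results/output.txt       # relative paths are now well defined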
Permissions govern who can read, write, or execute files. Linux uses user, group, and other permissions at the filesystem level. On a shared cluster, respecting these permissions is important both for security and for collaboration. Data in a group project directory might be readable and writable by members of a research group, while home directories might be more restrictive. Some centers enforce additional controls using access control lists, but the same conceptual model of controlled sharing applies.
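For illustration, assuming a hypothetical project group called myproject and hypothetical file names, the usual commands for inspecting and adjusting permissions look like this:

    ls -l results.dat               # show owner, group, and permission bits
    chgrp myproject results.dat     # assign the file to the project group
    chmod 640 results.dat           # owner read/write, group read, no access for others
    chmod g+rwx shared_dir          # let group members work inside a shared directory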
Symbolic links are used often in HPC data organization. For example, a path like latest might be a link to the most recent simulation output directory, while the actual data resides in versioned folders. Understanding that a symbolic link is just a reference to another path helps avoid confusion when moving or cleaning data.
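A short illustration, with hypothetical directory names:

    ln -s run_2024_06_01 latest     # create a symbolic link named "latest"
    ls -l latest                    # shows: latest -> run_2024_06_01
    readlink -f latest              # print the fully resolved target path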
From the operating system perspective, files opened by your code are represented as file descriptors. Every process starts with three standard descriptors: standard input (0), standard output (1), and standard error (2). Redirection syntax in the shell manipulates these descriptors. Your parallel codes and I/O libraries, such as those described in later chapters, all build on top of these basic operating system abstractions.
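A brief sketch of descriptor level redirection in bash, with a hypothetical solver and hypothetical file names:

    ./solver < mesh.in 1> solver.out 2> solver.err   # descriptors 0, 1, and 2 redirected explicitly
    ./solver > all.log 2>&1                          # merge standard error into standard output
    ./solver 2> /dev/null                            # discard error output entirely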
Processes, Users, and the Multi User Environment
Linux runs programs as processes that are associated with users. On an HPC cluster, every login account corresponds to a user identity in the operating system. Each process you start, whether an interactive shell or a large parallel run, executes with your user id and belongs to one or more groups. This mapping enables the operating system and the scheduler to account for resource usage and enforce policies.
A process has a state, an address space, and open resources like files or network sockets. Processes can create child processes. In a job context, a single job script may launch many processes, for example mpirun starting an MPI program with dozens or thousands of ranks. The operating system tracks each process with a process identifier, or PID. Monitoring tools show these PIDs and the resources they consume.
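A few standard commands, shown here as a sketch, let you inspect your own processes and their PIDs on a login or compute node:

    ps -u "$USER" -o pid,ppid,etime,pcpu,rss,comm   # your processes, with parent PIDs and resource use
    pstree -p "$USER"                               # the parent/child process tree, PIDs included
    top -u "$USER"                                  # interactive view of CPU and memory consumption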
Because many users run jobs concurrently on a cluster, Linux mechanisms for process isolation and scheduling are vital. Each process sees its own memory and cannot directly interfere with others. The kernel uses preemptive scheduling to share CPU time between processes according to a policy. On compute nodes, the resource manager usually configures which processes can be created and how much CPU or memory they may use. These limits are enforced through kernel features rather than trust alone.
User accounts are integrated with the rest of the HPC environment. Often, cluster logins are tied to institutional authentication systems so that a single identity suffices across services. Files created on shared filesystems record the user and group ownership so that access is consistent even when data is moved between nodes. For debugging permissions issues, it is important to understand which user and group your processes run as.
The parent child relationship between processes also plays a role in job control. When you submit a batch job, the scheduler will create a new process to run your job script. That script then launches other programs. If it terminates unexpectedly, its children may be signaled or reparented by the system. Understanding this hierarchy helps with interpreting log messages and cleaning up stray processes if something goes wrong.
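As a hedged sketch of how a job script can react to being signaled, assuming bash and a hypothetical solver executable: the trap built-in registers a cleanup function, and running the child in the background and waiting for it lets the trap fire promptly:

    #!/bin/bash
    cleanup() {
        echo "job script caught a signal, cleaning up" >&2
        rm -f "/tmp/partial_output.$$"       # hypothetical temporary file
    }
    trap cleanup TERM INT                    # run cleanup when the script receives these signals
    ./long_running_solver config.in &        # start the solver as a child process
    wait $!                                  # returns early if a signal arrives, so the trap can run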
Shells, Environments, and Environment Variables
The shell, such as bash, is more than a command interpreter. It maintains an environment, which is a collection of name value pairs called environment variables. Every process in Linux has such an environment. When you start a program from the shell, the program by default inherits the shell environment.
Environment variables are critical in HPC for controlling library paths, compiler behavior, MPI settings, and application configurations. Variables like PATH tell the operating system where to look for executables when you type a command. LD_LIBRARY_PATH influences how shared libraries are located at runtime. Many scientific programs read configuration from environment variables rather than from complex configuration files.
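A quick way to see this lookup in action, assuming bash and that tools such as mpicc and python3 are installed somewhere on the search path:

    echo "$PATH"        # colon separated list of directories searched for commands
    which mpicc         # the first match found along PATH
    type -a python3     # all matches, in search order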
When you log into a cluster, your shell reads start up files such as .bashrc or .profile. These scripts can set environment variables, modify your prompt, load default modules, or customize other aspects of your session. At many centers, the system wide versions of these files initialize the module system and basic paths. You can extend them to load commonly used tools. However, it is important to keep them reasonably fast and robust, since every new shell session will run them.
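A minimal sketch of such additions to a .bashrc file; the module name is hypothetical and site specific:

    # Load a commonly used toolchain at login (module name is site specific).
    module load gcc/12.2 2> /dev/null
    # Put a personal installation directory ahead of the system defaults.
    export PATH="$HOME/software/bin:$PATH"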
The environment model is especially important in batch jobs. A job script is executed in a non interactive shell, which may or may not read your usual shell configuration files depending on how the scheduler invokes it. The environment at job submission time can be passed to the job by the scheduler. If you rely heavily on interactive shell state that is not reproducible in scripts, jobs may run differently from manual tests. Good practice is to define all job relevant environment settings explicitly in the job script so that the behavior is predictable.
Special environment variables also communicate system level information. Many schedulers export variables that identify the job id, the list of allocated nodes, or the number of tasks. Some MPI implementations consult environment variables to choose network transports or buffer sizes. The operating system does not interpret the meaning of most of these variables but carries them through to processes that need them.
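The sketch below combines both ideas, defining the environment explicitly and reading the variables the scheduler exports, in a Slurm style job script; the module names and the application are hypothetical, and other schedulers use different directives and variable names:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    # Define the environment explicitly instead of relying on interactive shell state.
    module purge
    module load gcc/12.2 openmpi/4.1      # hypothetical, site specific module names
    export OMP_NUM_THREADS=4
    # Variables exported by the scheduler describe the allocation.
    echo "Job ${SLURM_JOB_ID} running on ${SLURM_JOB_NUM_NODES} nodes: ${SLURM_JOB_NODELIST}"
    srun ./parallel_app input.nml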
Software Management and the Linux Stack
In an HPC environment, the operating system provides a foundation on which a large software stack is built. Compilers, MPI implementations, numerical libraries, and domain specific applications must all coexist on the same shared system. Linux supports this through its native packaging mechanisms and through conventions on installation directories and library handling.
System administrators usually install a baseline set of tools through the operating system package manager, such as apt, dnf, or zypper, depending on the Linux distribution. This base covers the kernel, core utilities, login services, and sometimes default development tools. On top of that, HPC specific software is often installed in separate directories such as /opt or specialized tree structures. The goal is to allow multiple versions of compilers and libraries to coexist without conflict.
Your programs link to libraries using the standard ELF and shared library mechanisms on Linux. At build time, compilers and linkers look up header files and libraries using search paths that depend on environment variables and flags. At runtime, the dynamic linker locates shared libraries based on embedded paths, standard directories, and environment variables like LD_LIBRARY_PATH. This behavior is the same on a personal Linux machine, but in HPC the number of versions and combinations is larger.
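A brief sketch, with a hypothetical binary name, of how to inspect this resolution from the shell:

    ldd ./my_solver                                               # shared libraries the dynamic linker will load
    echo "$LD_LIBRARY_PATH"                                       # extra directories searched at run time
    LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH ldd ./my_solver    # see how the resolution changes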
The operating system dictates default path search behavior. System library directories have priority, but module systems, covered in a separate chapter, modify the environment so that alternate directories are found first. This avoids conflicts between, for example, a vendor optimized MPI library and a generic system one. As a user, you should be aware that loading a module changes the environment and thus what the operating system will do when you run a compiler or a binary.
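As an illustration, assuming an environment modules installation and a hypothetical module name:

    which mpicc                # may resolve to a system compiler wrapper, or to nothing at all
    module load openmpi/4.1    # hypothetical module; prepends directories to PATH and LD_LIBRARY_PATH
    which mpicc                # now resolves inside the loaded module's installation tree
    module list                # show what is currently loaded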
Containers, which are discussed in more detail elsewhere, rely heavily on Linux kernel features such as namespaces and cgroups. From the perspective of the operating system, a container is just another process tree with additional isolation. In HPC, container tools like Singularity or Apptainer provide a way to package user space software stacks while still sharing the cluster kernel. The kernel features that make this possible are part of the modern Linux operating system and are configured by administrators to match site policies.
Resource Limits, Accounting, and Fair Usage
Because HPC systems are shared, the Linux operating system is configured with resource limits and accounting mechanisms. These features protect the stability of the system and help schedulers enforce fair usage policies. While the scheduler itself is covered elsewhere, many of the controls it uses come from the underlying operating system.
Linux offers both soft and hard limits on resources for each process. Typical limits include the maximum size of memory segments, the number of open files, stack size, and core file size. Commands like ulimit can show or adjust these values within allowed ranges. On a cluster, administrators and schedulers often set them so that runaway jobs cannot exhaust critical resources. For example, limiting the number of open file descriptors prevents a single job from exhausting kernel and filesystem resources that other users' jobs depend on.
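A few examples of inspecting and adjusting these limits from a bash shell:

    ulimit -a              # show all current soft limits
    ulimit -n              # maximum number of open file descriptors
    ulimit -s unlimited    # raise the stack size soft limit, if the hard limit allows it
    ulimit -c 0            # disable core dumps for this shell and its children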
Resource control for CPU and memory is often implemented using cgroups. A cgroup is a kernel level construct that groups processes and enforces quotas on their resource usage. When your job starts, the scheduler may place it into a dedicated cgroup that matches the allocation specified in your job submission. From the perspective of your processes, they see only the resources within that allocation. Attempts to exceed memory or CPU quotas can lead to throttling or termination.
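A hedged sketch of how to inspect these limits from inside a job; the paths below assume cgroup v2 and are purely illustrative, since the exact layout is site specific:

    cat /proc/self/cgroup                         # which control group the current shell belongs to
    # With cgroup v2, per group limits appear as files in that group's directory
    # (the placeholder path is illustrative and site specific):
    cat /sys/fs/cgroup/<job-cgroup>/memory.max    # memory allowed for the job
    cat /sys/fs/cgroup/<job-cgroup>/cpu.max       # CPU quota and period for the job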
Accounting is another side of the same coin. The Linux kernel records resource usage such as CPU time and memory in process structures. System tools and scheduler components read this data to generate reports. Centers may use this information to implement fair share policies, to bill projects according to node hours, or to identify misbehaving jobs. As a user you may encounter job summaries that report maximum resident set size, average CPU utilization, and I/O counts. All of these statistics trace back to what the kernel observed while your processes ran.
File quota systems add an additional layer that interacts with the filesystem. Even if the operating system allows you to create files freely from a process perspective, the filesystem may enforce per user or per project storage limits. When quotas are exceeded, file operations fail. The error codes are surfaced through the usual Linux system call mechanisms and propagate to applications and libraries. For large simulations, it is important to remember that the underlying operating system and storage configuration may start rejecting writes in such cases.
Cluster policies around login node usage also rely on operating system controls. Administrators can identify interactive sessions that use too much CPU or memory and may terminate them. Some centers use tools that automatically monitor processes on login nodes and enforce limits. This environment encourages users to move heavy computations into scheduled jobs on compute nodes, where resource usage is managed systematically.
Security, Access, and Remote Use
Security on an HPC cluster is largely coordinated through the operating system. User authentication typically happens when you establish an SSH connection. The SSH daemon on the cluster interacts with Linux authentication modules that verify your credentials, often against a central directory service. Once authenticated, the login shell you receive runs with your user id, and your access to files and directories is governed by the kernel.
SSH also handles encrypted communication, port forwarding, and sometimes file transfer through scp or sftp. These operations use standard Linux networking and file APIs behind the scenes. Many users rely on SSH keys stored in their home directories with permissions carefully restricted by the operating system. If the permissions of a private key file are too loose, SSH may refuse to use it, which is an example of security policy enforced jointly by an application and the filesystem.
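As a sketch, assuming an ed25519 key pair with the default file names, the conventional permissions are:

    chmod 700 ~/.ssh                    # only the owner may enter the directory
    chmod 600 ~/.ssh/id_ed25519         # private key readable and writable by the owner only
    chmod 644 ~/.ssh/id_ed25519.pub     # public key may be world readable
    ls -l ~/.ssh                        # verify the resulting permission bits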
Within the cluster, the operating system maintains separation between users primarily through filesystem permissions and process ownership. Users normally cannot send signals to processes owned by other users, nor can they read other users' private data without explicit sharing. At the same time, collaboration within groups is supported by shared directories where group permissions enable wider access.
Some HPC centers employ additional restrictions. For example, they may disable direct network access from compute nodes to the outside world, so processes there cannot freely open outbound connections. This is often implemented through firewall rules and network namespace settings managed by the operating system. Understanding such constraints is important when designing workflows that involve external services or data transfers.
Remote visualization and interactive tools also pass through the Linux environment. When you run graphical programs using X11 forwarding or similar methods, the programs execute on the cluster but display on your local machine. The operating system on the cluster manages these processes like any others, with resource usage, permissions, and networking all handled by the kernel.
How Linux Shapes HPC Workflow Design
The characteristics of Linux and its operating environment have a direct effect on how HPC workflows are designed. The assumption of a text based shell interface encourages scripting and automation. Jobs are usually represented as executable scripts that access environment variables, run commands, and rely on the kernel for file I/O and process control.
The separation between login nodes and compute nodes arises from how operating systems handle interactive and batch workloads. Login nodes run long lived user shell sessions that are relatively lightweight. Compute nodes are dedicated to running tightly controlled processes under batch scheduling, with minimal interactive use. This division influences where and how you compile, test, and run code.
The way Linux handles files and directories encourages a particular organization of data. Large outputs are placed on shared or scratch filesystems, while code and configuration live in home or project directories. Shell utilities and environment variables help encode these assumptions into scripts, so that workflows can be reproduced across clusters with similar directory structures.
Finally, the consistency of Linux across different clusters makes portability practical. Once you are comfortable with essential Linux concepts, you can log into many HPC systems and recognize familiar directories, commands, and behaviors. Differences certainly exist in details of the software stack and policies, but the underlying operating system model is similar enough that your knowledge transfers.
In summary, the operating system in HPC is not only background infrastructure. It is the medium through which you interact with the cluster, organize data, control software, and run jobs. Understanding its Linux foundations will make the remainder of this course more concrete and will enable you to use HPC resources more effectively.