
Responsible Computing Practices

Understanding Responsibility in HPC Use

Responsible computing in high-performance environments is about more than just getting results. It is about using shared, powerful, and energy-hungry systems in a way that respects other users and the organization that runs the cluster, and that takes broader social and environmental impacts into account.

You already saw why efficiency and sustainability matter at a system level. Here the focus is on your day-to-day behavior as an HPC user, the choices you make in code and job submission, and how those choices affect fairness, cost, and trust in shared facilities.

Responsible HPC use means: do not waste resources, do not block others unnecessarily, do not misrepresent what your jobs are doing, and do not put data, systems, or people at risk.

Fair Use of Shared Resources

An HPC cluster is a shared resource. Your jobs compete with thousands of others for CPUs, GPUs, memory, and I/O bandwidth. Responsible use requires awareness of this shared context.

You should size your jobs honestly. Request only the cores, GPUs, and memory that you actually need, with a small safety margin rather than a large one. Over-requesting resources reduces effective cluster capacity and can slow everyone down, including you, because the scheduler has a harder time packing jobs efficiently.

Similarly, you should choose realistic time limits. If your application usually finishes in 20 minutes, asking for 48 hours because it feels safer is not responsible. Long wall times make backfilling harder for the scheduler, reduce cluster utilization, and can worsen average queue times.

On the other hand, chronic underestimation that leads to frequent job timeouts is also irresponsible. It wastes CPU hours already spent and forces resubmissions, which further congest the queue. Good practice is to use short exploratory runs to calibrate performance, then set production wall times accordingly.
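To make this concrete, here is a minimal sketch of a Slurm batch script with honestly sized requests. The partition name, resource numbers, and executable are illustrative and should come from your own calibration runs, not from guesswork:

```bash
#!/bin/bash
#SBATCH --job-name=prod_run
#SBATCH --partition=standard      # illustrative partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64      # what the application actually uses
#SBATCH --mem=120G                # measured peak per node plus a small margin
#SBATCH --time=00:40:00           # ~20 min typical runtime, with 2x margin

srun ./my_application input.dat   # illustrative executable and input
```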

Avoiding Resource Waste in Practice

Resource waste appears in many small choices. Running debug builds instead of optimized builds on thousands of cores, leaving postprocessing scripts to run indefinitely on a login node, or launching many small jobs that each perform trivial work all contribute to unnecessary consumption.

Responsible practice includes running small tests locally or on a development partition before scaling out, verifying that your code does something meaningful at scale, and cleaning up old or abandoned workflows. If your job generates huge intermediate files that are never read, you should change the workflow to produce only what you need.

A common pattern of waste is the idle process, for example an MPI rank that waits for many hours for others to finish work it cannot use. This can occur when load balancing is poor or when communication patterns force long waits. While performance tuning is its own topic, you should at least recognize that long idle times are not just a performance problem. They are an ethical use problem, because you are occupying resources without doing useful work.
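One way to make such waste visible is to time how long each rank spends waiting at synchronization points. A minimal mpi4py sketch, with a sleep standing in for uneven per-rank work:

```python
# Measure per-rank work time versus idle time at a barrier (mpi4py sketch).
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

t0 = MPI.Wtime()
time.sleep(0.1 * rank)      # stand-in for uneven per-rank work
t_work = MPI.Wtime() - t0

t0 = MPI.Wtime()
comm.Barrier()              # ranks that finished early sit idle here
t_idle = MPI.Wtime() - t0

# Collect timings on rank 0 to spot ranks that mostly wait.
timings = comm.gather((rank, t_work, t_idle), root=0)
if rank == 0:
    for r, work, idle in sorted(timings, key=lambda x: -x[2]):
        print(f"rank {r}: {work:.2f}s working, {idle:.2f}s idle")
```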

If a job is clearly misconfigured, stalled, or producing nonsense, you should cancel it instead of letting it run to completion just because it was hard to submit.

Respecting Scheduler Policies and Limits

Schedulers implement institutional policies about priority, usage, and fairness. Responsible computing includes following both the letter and the spirit of these rules.

You should not attempt to bypass fair-share or quota mechanisms by using multiple accounts, submitting through colleagues, or fragmenting a large job into many small ones that saturate the cluster. These behaviors can give a short-term advantage, but they harm the overall community and can lead to sanctions.

If the system defines partitions for different purposes, such as debug, short, long, or GPU queues, use them as intended instead of abusing a lightly used partition for unrelated production runs. For example, using an interactive or development queue for multi-day runs may disrupt many users who rely on fast turnaround.
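For example, assuming a Slurm site whose partitions are named debug and long (names and limits vary by site):

```bash
# Quick functional test where fast turnaround is the point.
sbatch --partition=debug --time=00:15:00 --ntasks=8 test_job.sh

# A multi-day production run belongs on the partition built for it.
sbatch --partition=long --time=3-00:00:00 --ntasks=256 production_job.sh
```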

It is also important to respect maintenance windows and instructions from system administrators. Trying to hold on to nodes by submitting jobs that overlap with planned downtime, or ignoring requests to stop using an unstable feature, shifts risk and cost onto the operations team.

Data Responsibility and Privacy

Many HPC jobs process sensitive or proprietary data. Responsible computing means protecting that data and respecting legal and ethical boundaries.

You should know which datasets are allowed on your cluster and which are not. Some facilities forbid personally identifiable information, health records, or confidential business data. Even when processing allowed data, you should store it only where the site recommends, for example in protected project directories, not in world-readable scratch space.

You should also minimize unnecessary copies of sensitive data. Making multiple untracked duplicates in temporary directories increases the risk of leaks and complicates deletion. When sharing data with collaborators, use controlled mechanisms approved by the site, not ad hoc transfers that bypass logging and access controls.
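As a small illustration, assuming POSIX permissions and illustrative paths (your site's recommended locations, groups, and transfer tools will differ):

```bash
# Remove world access from a project data directory, recursively.
chmod -R o-rwx /projects/myproject/sensitive_data

# Create a scratch working area readable only by you.
# $SCRATCH is an illustrative variable; many sites define something similar.
mkdir -p "$SCRATCH/work" && chmod 700 "$SCRATCH/work"
```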

When working with data about people or organizations, consider whether your analysis could cause harm if misused or misinterpreted. This includes the risk of reidentification in aggregate data and the ethical implications of combining datasets in new ways. Technical capability does not automatically justify every possible computation.

Reproducibility as an Ethical Obligation

Reproducibility is deeply connected to responsibility. HPC results often inform major decisions in science, engineering, and policy. If your workflow cannot be reproduced, it is difficult to verify, audit, or extend.

Responsible practice includes keeping enough metadata and configuration information to rerun your jobs later. This usually means recording the exact software versions, input parameters, and environment settings used for each run. Environment modules, container images, and job scripts are central tools, but your role is to use them in a disciplined way.
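A minimal sketch of capturing such metadata alongside a run, assuming a Python workflow in a Git checkout on a Slurm system; the parameter names and output file are illustrative:

```python
# Record enough provenance to rerun this job later (illustrative sketch).
import json
import os
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_provenance(params: dict, outfile: str = "run_metadata.json") -> None:
    """Write software versions, parameters, and environment info to JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "hostname": platform.node(),
        "parameters": params,
        # Commit of the code that produced this run (assumes a Git checkout).
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip(),
        # Scheduler job ID if running under Slurm, else None.
        "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
    }
    with open(outfile, "w") as f:
        json.dump(record, f, indent=2)

capture_provenance({"solver": "cg", "tolerance": 1e-8, "grid": [512, 512]})
```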

You should avoid ad hoc manual changes that are not recorded anywhere. For example, running a job after editing a configuration file directly on the cluster, without committing or documenting the change, makes your results fragile. Similarly, postprocessing scripts should live under version control rather than as scattered copies with unknown differences.

If you discover a bug that affects past results, responsibility requires that you assess the impact, document the problem, and, when appropriate, correct or retract affected conclusions. Simply ignoring known flaws because rerunning the computation is inconvenient is not acceptable, especially in safety-critical domains.

Transparency in Reporting Performance and Results

High performance results can be persuasive. They can influence funding decisions, technology choices, and scientific narratives. Responsible computing includes honest reporting of both performance metrics and scientific outputs.

You should avoid cherry-picking best-case numbers without disclosing the conditions under which they were obtained. For example, reporting peak speedup from a single carefully tuned input, without noting that typical cases perform much worse, misleads readers. If you use special compiler flags or nonstandard configurations, describe them clearly.

In scientific contexts, you should not hide failed runs or outliers that might indicate methodological problems. Selective reporting is easier to rationalize when runs are expensive, because repeating them feels costly, but the ethical obligation to report accurately does not change with job scale.

In industrial settings, be cautious about extrapolating limited benchmarks to large deployments, especially when safety or financial risk is involved. Overconfident performance claims based on insufficient HPC testing can lead to system overloads or operational failures downstream.

Security Mindset for HPC Users

Security is not only the job of administrators. As an HPC user you control code, data, and credentials that can be misused by others. Responsible practice includes basic security hygiene adapted to HPC contexts.

You should protect your authentication credentials, avoid sharing accounts, and never hard-code passwords or API keys into scripts or job files. If you use automation to submit jobs, ensure that stored credentials follow your site’s security guidelines and that scripts do not expose them in logs or error messages.
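A minimal sketch of the pattern: MYAPP_API_TOKEN is an illustrative variable name, and how the secret gets into the environment should follow your site's guidelines:

```python
# Read a credential from the environment instead of hard-coding it.
import os
import sys

token = os.environ.get("MYAPP_API_TOKEN")  # illustrative variable name
if token is None:
    sys.exit("MYAPP_API_TOKEN is not set; refusing to continue.")

# Use the token, but never echo it into logs or error messages.
print("credential loaded (value withheld from output)")
```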

You should treat untrusted code cautiously. Downloading and running arbitrary binaries on a shared cluster can introduce malware or compromise the system. Prefer building software from known source repositories, and follow your site’s policies on external software installation.
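One basic habit that supports this is verifying published checksums before building downloaded sources; a sketch with an illustrative archive name and placeholder digest:

```python
# Verify a downloaded source archive against its published SHA-256 digest
# before building it. File name and digest are illustrative placeholders.
import hashlib

EXPECTED = "copy-the-digest-from-the-project-release-page"
ARCHIVE = "package-1.2.3.tar.gz"

h = hashlib.sha256()
with open(ARCHIVE, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

if h.hexdigest() != EXPECTED:
    raise SystemExit(f"{ARCHIVE}: checksum mismatch, do not build")
print(f"{ARCHIVE}: checksum OK")
```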

If you suspect a security incident, such as unexpected processes under your account, data in the wrong place, or unusual network activity, you should report it promptly instead of trying to hide it or fix it alone. Quick reporting can limit damage for everyone.

Ethical Use of Computational Power

Access to large scale computation confers power. You can model complex systems, process massive datasets, and explore parameter spaces that are inaccessible on personal machines. Responsible computing asks you to reflect on how you use that power.

Some computations may have dual-use implications. For example, simulations that improve materials, cryptanalysis, or biological models can be beneficial or harmful depending on context. You should follow institutional review processes when they exist, such as ethics committees, export control checks, or data use agreements, and you should be willing to question projects whose goals or consequences are unclear.

You should also consider the opportunity cost of your runs. A job that occupies a significant fraction of a system for days can delay many other users. Before launching extremely large runs, ask whether they are necessary, whether a smaller or more targeted study would suffice, and whether the potential benefits justify the environmental and social cost of the computation.

The fact that a computation is possible does not make it ethically justified. Reflect on intent, potential misuse, and the broader impact of your work before using large HPC resources.

Collaboration, Attribution, and Community Norms

HPC work is rarely solitary. You rely on system administrators, library developers, tool authors, and colleagues. Responsible practice includes acknowledging this ecosystem.

You should respect licenses and citation requests for software and datasets. Libraries, numerical packages, and frameworks often specify how they wish to be cited. Ignoring these requests presents your results as more self-contained than they really are and can harm future funding for the tools you depend on.

In collaborations, you should be clear about who does what, especially when using shared allocations or project accounts. Submitting large batches of jobs under a group allocation without coordination can create internal conflicts and external fairness issues.

You should also engage constructively with the HPC community. Reporting bugs with sufficient detail, sharing improvements when licenses allow, and contributing to documentation or tutorials all help maintain the infrastructure you rely on. Treat operations staff and support teams as partners rather than obstacles.

Responsible Experiment Design on Large Systems

Finally, responsibility extends to how you design experiments that use substantial HPC resources. Good design minimizes wasted computation and maximizes insight per CPU hour or GPU hour.

You should avoid brute-force parameter sweeps when more efficient search strategies exist. For example, instead of blindly sampling thousands of combinations, you might use design of experiments or adaptive methods to explore parameter spaces more intelligently. This is not only a methodological improvement but also an ethical one, because it reduces unnecessary load.
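A minimal sketch contrasting an exhaustive grid with a budgeted random search; the objective function and parameter ranges are illustrative stand-ins for "run one simulation and score the result":

```python
# Budgeted random search over a parameter space, instead of an
# exhaustive grid sweep. The objective is an illustrative toy function.
import random

def objective(alpha: float, beta: float) -> float:
    return -((alpha - 0.3) ** 2 + (beta - 0.7) ** 2)  # toy score

# An exhaustive 100 x 100 grid would cost 10,000 runs. A fixed random
# budget often finds a comparable optimum at a fraction of the compute.
BUDGET = 200
best_score, best_params = float("-inf"), None
for _ in range(BUDGET):
    a, b = random.uniform(0, 1), random.uniform(0, 1)
    score = objective(a, b)
    if score > best_score:
        best_score, best_params = score, (a, b)

print(f"best score {best_score:.4f} at alpha={best_params[0]:.3f}, "
      f"beta={best_params[1]:.3f}")
```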

You should plan for early stopping where appropriate. If intermediate results show that a simulation is diverging, producing physically meaningless outputs, or clearly failing your research goals, you should terminate or adapt the run rather than letting it continue to completion by habit.
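A minimal sketch of such a guard inside a time-stepping loop, assuming a residual-style health metric; the growth factor, threshold, and step count are illustrative:

```python
# Stop a simulation early when it is clearly diverging, rather than
# burning allocation to completion. Numbers are illustrative.
import math

MAX_STEPS = 100_000
DIVERGENCE_LIMIT = 1e6

residual = 1.0
for step in range(MAX_STEPS):
    residual *= 1.001       # stand-in for one solver step's residual update
    if math.isnan(residual) or residual > DIVERGENCE_LIMIT:
        print(f"aborting at step {step}: residual {residual:.3e} diverged")
        break
else:
    print("run completed normally")
```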

When training machine learning models on HPC systems, you should consider practices like monitoring validation performance and using checkpoints to stop training when additional epochs yield minimal gains. Prolonged training beyond the point of meaningful improvement simply consumes power and queue time.
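A framework-agnostic sketch of patience-based early stopping; train_one_epoch and evaluate are illustrative placeholders for real training and validation code:

```python
# Patience-based early stopping: keep training only while validation
# loss keeps improving, and remember the best state seen so far.
import random

def train_one_epoch(state):            # illustrative placeholder
    state["weights"] += 1
    return state

def evaluate(state):                   # illustrative placeholder: val loss
    return 1.0 / state["weights"] + random.uniform(0, 0.01)

PATIENCE = 5
MAX_EPOCHS = 200

state = {"weights": 1}
best_loss, best_state, stale = float("inf"), None, 0
for epoch in range(MAX_EPOCHS):
    state = train_one_epoch(state)
    val_loss = evaluate(state)
    if val_loss < best_loss - 1e-4:    # count only meaningful improvement
        best_loss, best_state, stale = val_loss, dict(state), 0
        # In real code: write a checkpoint of best_state to disk here.
    else:
        stale += 1
    if stale >= PATIENCE:
        print(f"stopping at epoch {epoch}: no gain for {PATIENCE} epochs")
        break

print(f"best validation loss {best_loss:.4f}")
```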

Responsible computing in HPC is an ongoing practice, not a fixed checklist. As systems, tools, and societal expectations evolve, you will need to revisit these questions. The core principles, however, remain stable: think about others, think about consequences, and treat computational resources as a valuable, shared, and powerful instrument that deserves respect.
