Typical Pitfalls When Starting with OpenShift
Misunderstanding Projects, Namespaces, and Multi-Tenancy
Many teams treat an OpenShift cluster as if it were a single-tenant environment and:
- Put everything into a single project (namespace).
- Mix dev, test, and prod workloads in one project.
- Share service accounts and credentials across teams.
Consequences:
- Hard to apply different security and resource policies.
- Risk of accidental cross-team impact (e.g., deleting shared resources).
- Difficult cost and capacity attribution per team or environment.
Better approach:
- Use separate projects per team and per environment (e.g., `team-a-dev`, `team-a-test`, `team-a-prod`).
- Use RBAC to scope who can access which projects.
- Use resource quotas and limit ranges per project to avoid noisy neighbors.
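As a rough sketch of what this can look like, the following manifests pair a per-project quota with a role binding that scopes a team to its own project; the project name `team-a-dev`, the group `team-a`, and all values are assumptions for illustration.

```yaml
# Sketch only: quota and RBAC for an assumed project "team-a-dev" owned by group "team-a".
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-dev-quota
  namespace: team-a-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "30"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a-dev
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit          # built-in ClusterRole granting read/write within this project only
  apiGroup: rbac.authorization.k8s.io
```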
Ignoring Resource Requests, Limits, and Quotas
A common mistake is leaving CPU and memory unspecified or using arbitrary values:
- No `resources.requests` set:
  - The scheduler has poor information, resulting in inefficient placement.
- No `resources.limits` set:
  - A runaway process can consume all memory on a node and trigger OOM kills for other pods.
- Over-committing or under-committing:
  - Over-commit: too many pods per node, leading to contention and throttling.
  - Under-commit: cluster appears “full” even with unused capacity.
Better approach:
- Always specify `resources.requests` and `resources.limits` for CPU and memory in pod templates.
- Define sensible `LimitRange` and `ResourceQuota` objects in each project.
- Establish sizing guidelines for teams (e.g., small/medium/large pod flavors).
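For example, a pod template with explicit requests and limits, together with a project-level LimitRange that supplies defaults, might look like this sketch (all names and values are illustrative, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # assumed image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
---
# Defaults applied to containers that do not declare their own values.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 256Mi
```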
Treating OpenShift Like “Just VMs”
Applications often arrive “lift-and-shift” from VMs:
- Stateful applications using local disk or in-container storage.
- Manual configuration via SSH or interactive shells inside containers.
- Long-running background tasks managed manually.
Consequences:
- Loss of data when pods are rescheduled.
- Configuration drift and unrepeatable deployments.
- Fragile operational practices that do not fit container orchestration.
Better approach:
- Externalize state to persistent volumes or external services.
- Use ConfigMaps, Secrets, and environment variables for configuration.
- Use DeploymentConfigs/Deployments, Jobs, and CronJobs instead of manual process management.
- Define everything as YAML manifests in version control.
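One minimal sketch of this style: configuration lives in a ConfigMap and a recurring task runs as a CronJob rather than a hand-managed background process (the names, schedule, value, and image below are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: report-config
data:
  REPORT_TARGET: "s3://example-bucket/reports"   # assumed configuration value
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report-runner:1.0.0   # assumed image
              envFrom:
                - configMapRef:
                    name: report-config
```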
Misusing Routes, Services, and Network Policies
Common networking missteps include:
- Exposing everything with public Routes by default.
- Confusing ClusterIP, NodePort, and Routes and using NodePorts for general access.
- Not defining network policies, leaving all pods able to talk to each other.
- Hard-coding pod IPs or node IPs in applications.
Consequences:
- Unnecessary attack surface and exposure to the internet.
- Complex, brittle network paths that are hard to debug.
- Lateral movement risk if an application is compromised.
Better approach:
- Use Services for stable in-cluster communication; never rely on pod IPs.
- Expose external endpoints via Routes or Ingress, not NodePort, unless there is a specific need.
- Start with a default-deny network policy model and explicitly allow necessary flows.
- Use DNS-based discovery (`<service>.<namespace>.svc.cluster.local`) instead of IPs.
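A default-deny posture with one explicit allow rule can be expressed roughly as follows; the `frontend`/`backend` labels and port are assumptions:

```yaml
# Deny all ingress to pods in this project by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Explicitly allow traffic from frontend pods to backend pods on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```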
Mismanaging Storage and Data
Typical storage pitfalls:
- Using ephemeral storage for persistent workloads (databases, queues).
- Assuming local node storage is durable and stable.
- Creating `PersistentVolume` objects manually for everything.
- Ignoring performance and access modes (e.g., using RWO volumes for multiple replicas).
Consequences:
- Data loss when pods are rescheduled or nodes fail.
- Performance problems (e.g., latency-sensitive workloads on slow backing storage).
- Stuck pods waiting for unattainable volumes or access modes.
Better approach:
- Use PersistentVolumeClaims with an appropriate `StorageClass` for persistent data.
- Understand the storage type (file/block/object) and access modes (RWO, RWX, ROX) and choose accordingly.
- Use dynamic provisioning where possible; reserve static PVs only for special cases.
- Design stateful workloads with failure scenarios in mind (backup, restore, resync).
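For dynamically provisioned storage, a claim is usually as small as the sketch below; the `fast-block` StorageClass name is an assumption and depends on what the cluster actually offers:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce              # single-node access; use RWX only if the backend supports it
  storageClassName: fast-block   # assumed name; list real classes with `oc get storageclass`
  resources:
    requests:
      storage: 20Gi
```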
Overlooking Security and Compliance Controls
Frequent issues:
- Running containers as root or with overly permissive SCCs.
- Pulling images from untrusted registries or without scanning them.
- Storing secrets in ConfigMaps or environment variables in plain text.
- Sharing service accounts between applications.
Consequences:
- Elevated blast radius if an application is compromised.
- Non-compliance with organizational or regulatory requirements.
- Increased difficulty of security auditing and incident response.
Better approach:
- Use appropriate Security Context Constraints and run as non-root whenever possible.
- Store credentials in Secrets and limit their distribution.
- Pin images to trusted registries; integrate image scanning into build or admission workflows.
- Use separate service accounts per application with least-privilege RBAC.
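A minimal sketch of a dedicated service account and a non-root security context is shown below; under the restricted SCC, OpenShift typically assigns the UID itself, so the explicit settings are illustrative:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-api            # one service account per application
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      serviceAccountName: example-api
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # assumed image
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```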
Ignoring Cluster and Application Observability
New teams often run with default settings and:
- Do not define application metrics or expose them in a standard format.
- Ignore out-of-the-box dashboards and alerts until something critical fails.
- Rely only on logs inside containers rather than central logging.
- Have no clear SLOs or thresholds.
Consequences:
- Slow incident detection and resolution.
- Difficulty correlating cluster events (node failures, OOMs) with application symptoms.
- Little data available for capacity planning.
Better approach:
- Integrate application metrics with the built-in monitoring stack (e.g., Prometheus format).
- Define key alerts on latency, errors, and resource saturation.
- Centralize logging using the OpenShift logging architecture; avoid local log scraping.
- Define basic SLOs per service and use dashboards to track them.
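If user workload monitoring is enabled, exposing application metrics to the built-in Prometheus stack can look roughly like this ServiceMonitor sketch; the `metrics` port name and `app` label are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api        # must match the labels on the application's Service
  endpoints:
    - port: metrics           # named Service port that serves /metrics in Prometheus format
      interval: 30s
```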
Misconfigured Autoscaling and High Availability
Typical mistakes:
- Enabling Horizontal Pod Autoscaling without setting resource requests correctly.
- Assuming more replicas always equals high availability.
- Ignoring pod disruption budgets (PDBs) and health probes.
- Forgetting about the cluster autoscaler, or lacking spare capacity for scale-out.
Consequences:
- Unpredictable scaling behavior and thrashing.
- Cascading failures when nodes drain and all replicas are removed at once.
- Downtime during planned or unplanned maintenance.
Better approach:
- Set accurate resource requests and use them as the basis for HPA.
- Use readiness and liveness probes to control rollout behavior and restarts.
- Define PodDisruptionBudgets for critical workloads.
- Align application autoscaling with cluster capacity and node scaling policies.
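A sketch combining a CPU-based HPA with a PodDisruptionBudget might look like the following; the utilization target and replica counts are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the CPU *request*, hence requests must be accurate
---
# Keep at least one replica running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-api
```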
Poor Image and Build Practices
Common build and image pitfalls:
- Using large base images with many unused tools.
- Building images directly on the cluster as root or with unsafe Dockerfiles.
- Baking environment-specific configuration into images.
- Not tagging images immutably (e.g., only using `latest`).
Consequences:
- Large image sizes, slow deployments, and increased attack surface.
- Inconsistent behavior between environments.
- Difficulty rolling back or auditing exactly what ran where.
Better approach:
- Use minimal, well-maintained base images.
- Separate build and runtime stages (multi-stage builds or S2I) to keep runtime images small.
- Keep images environment-agnostic; inject configuration at deploy time.
- Use immutable tags (e.g., git SHA, build ID) and treat `latest` only as a moving pointer, if at all.
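As one possible sketch using OpenShift builds, an S2I BuildConfig can keep the runtime image small and push it to an immutable tag; the repository, builder image, and tag below are assumptions, and the `example-api` ImageStream is assumed to already exist:

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: example-api
spec:
  source:
    git:
      uri: https://git.example.com/team-a/example-api.git          # assumed repository
  strategy:
    sourceStrategy:
      from:
        kind: DockerImage
        name: registry.access.redhat.com/ubi9/openjdk-17:latest    # assumed S2I builder image
  output:
    to:
      kind: ImageStreamTag
      name: example-api:1.4.0-3f2a9c1   # immutable tag: version plus short git SHA
```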
Neglecting Operational Runbooks and Ownership
Another recurring issue is organizational, not technical:
- No clear ownership of applications or namespaces.
- No documented procedures for common operations (restart, rollback, scale, outage).
- Overreliance on a central “cluster admin” for all tasks.
Consequences:
- Slow response during incidents.
- High friction to onboard new services or teams.
- Misuse of admin privileges when simple, delegated operations would suffice.
Better approach:
- Define ownership for each project, application, and critical component.
- Create and maintain runbooks for standard scenarios:
- Deployment failure
- Rollback
- Capacity exhaustion
- Storage incidents
- Delegate day-to-day operations using RBAC; reserve cluster-admin for platform-level tasks.
Best Practices for Day-to-Day Work with OpenShift
Design Applications for the Platform
To work well with OpenShift, applications should:
- Be stateless where possible; where state is necessary, isolate it and use proper storage.
- Use configuration via environment variables, ConfigMaps, and Secrets.
- Gracefully handle restarts and rescheduling (e.g., handle SIGTERM, support quick startup).
- Avoid assumptions about local filesystem persistence or network topology.
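For example, a pod spec that tolerates restarts and rescheduling typically declares probes and a termination grace period, roughly as in this sketch (endpoints, port, and timings are illustrative; in practice this would sit inside a Deployment's pod template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-api
  labels:
    app: example-api
spec:
  terminationGracePeriodSeconds: 30    # time to finish in-flight work after SIGTERM
  containers:
    - name: api
      image: registry.example.com/example-api:1.0.0   # assumed image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz/ready         # assumed endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz/live          # assumed endpoint
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
```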
Embrace Declarative, Git-Centric Workflows
Instead of manual changes:
- Store all manifests (Deployments, Services, Routes, policies) in version control.
- Use declarative tools and GitOps workflows for environment management.
- Treat OpenShift as an execution target; treat Git as the source of truth.
Benefits:
- Consistency across environments.
- Auditability and easy rollback.
- Collaboration and review via pull requests.
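With OpenShift GitOps (Argo CD), for instance, an environment can be described as an Application that continuously syncs a Git path into a project; the repository URL, path, and project names below are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-api-prod
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/team-a/deploy-manifests.git   # assumed repository
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-prod
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual drift back to the Git state
```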
Standardize Project and Resource Conventions
Cluster-wide conventions reduce complexity and entropy:
- Naming patterns for projects and resources (e.g., `team-env-app`).
- Standard labels and annotations (e.g., `app`, `team`, `environment`, `tier`).
- Shared base templates for common workloads (web service, batch job, cron job).
These conventions make it easier to:
- Filter and group resources.
- Apply policies and quotas.
- Onboard new users to a predictable environment.
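In practice this often comes down to a small, consistent metadata block on every resource, for example (the label keys and values shown are one possible convention, not a platform requirement):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api
  labels:
    app: example-api
    team: team-a
    environment: prod
    tier: backend
spec:
  selector:
    app: example-api
  ports:
    - port: 8080
      targetPort: 8080
```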
Use Platform Features Instead of Custom Plumbing
Avoid re-inventing mechanisms that OpenShift already provides, such as:
- Blue/green or rolling deployments: use DeploymentConfigs/Deployments, not custom scripts.
- Access control: use RBAC and SCCs, not ad-hoc access lists in applications.
- SSL/TLS termination: use Routes and built-in certificates where appropriate.
Leveraging these platform features fully:
- Simplifies operational complexity.
- Ensures you benefit from upstream improvements and bug fixes.
- Keeps architecture aligned with the platform’s strengths.
Iterate Safely Across Environments
Adopt an environment promotion strategy:
- Separate dev, test, and prod clusters or projects.
- Promote the same artifact (image) through environments; avoid rebuilding for each stage.
- Use the same manifests with environment-specific overlays (e.g., configuration values).
Guidelines:
- Practice rollbacks and failover in non-prod regularly.
- Keep production as close as possible to test, aside from sizing and secrets.
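One common way to promote the same manifests with per-environment differences is Kustomize overlays; a sketch of a prod overlay might look like this (the directory layout, patch file, and image tag are assumptions):

```yaml
# overlays/prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: team-a-prod
resources:
  - ../../base                  # shared Deployments, Services, Routes
patches:
  - path: replica-count.yaml    # e.g., raise replicas and resource limits for prod
images:
  - name: registry.example.com/example-api
    newTag: 1.4.0-3f2a9c1       # same immutable image that was tested in lower environments
```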
Integrate Security, Compliance, and Observability Early
Instead of adding them late:
- Make security scanning, policy checks, and tests part of your CI/CD pipeline.
- Standardize logging, metrics, and tracing from the first service onward.
- Align with organization-wide policies (e.g., approved registries, encryption at rest/in transit).
This reduces:
- Surprises during security reviews.
- Expensive rework to retrofit observability.
- The risk of non-compliant deployments.
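As a rough illustration of wiring these checks into a pipeline, OpenShift Pipelines (Tekton) lets the ordering be encoded declaratively; the task names below are hypothetical placeholders, not real catalog tasks:

```yaml
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-scan-deploy
spec:
  tasks:
    - name: unit-tests
      taskRef:
        name: run-unit-tests               # hypothetical Task
    - name: build-image
      runAfter: ["unit-tests"]
      taskRef:
        name: build-and-push               # hypothetical Task
    - name: scan-image
      runAfter: ["build-image"]
      taskRef:
        name: image-vulnerability-scan     # hypothetical Task
    - name: deploy-dev
      runAfter: ["scan-image"]
      taskRef:
        name: deploy-to-dev                # hypothetical Task
```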
Collaborate Between Platform and Application Teams
OpenShift success usually depends on healthy collaboration:
- Platform team:
- Defines cluster-level policies, quotas, and shared services.
- Offers templates, documentation, and guardrails.
- Application teams:
- Own application lifecycles and ensure apps adhere to platform standards.
- Provide feedback on friction and missing capabilities.
Best practices:
- Regularly review resource usage, incidents, and changes together.
- Maintain shared documentation and internal “cookbooks” of patterns that work well.
- Use internal communities of practice to spread knowledge.
Lightweight Checklists
Before Deploying a New Application
- [ ] Resource requests and limits defined.
- [ ] ConfigMaps and Secrets used; no secrets in plain text files in images.
- [ ] Health probes (liveness/readiness) configured.
- [ ] Storage needs assessed; appropriate PVCs and StorageClasses configured.
- [ ] Network exposure justified; only necessary Routes and Services created.
- [ ] RBAC and SCC requirements identified; least privilege applied.
- [ ] Logging and metrics integrated with platform tools.
Before Promoting to Production
- [ ] Same container image tested in lower environments.
- [ ] Automated deployment (no manual `oc` patching in prod).
- [ ] Rollback procedure documented and tested.
- [ ] Capacity and autoscaling behavior validated.
- [ ] Alerts tuned and tested for key failure modes.
- [ ] Ownership and on-call responsibility defined.
Using these patterns and avoiding the common pitfalls transforms OpenShift from “just another cluster” into a reliable, secure, and efficient platform for your applications.