Typical Pitfalls When Starting with OpenShift
Misunderstanding Projects, Namespaces, and Multi-Tenancy
Many teams treat an OpenShift cluster as if it were a single-tenant environment and:
- Put everything into a single project (namespace).
- Mix dev, test, and prod workloads in one project.
- Share service accounts and credentials across teams.
Consequences:
- Hard to apply different security and resource policies.
- Risk of accidental cross-team impact (e.g., deleting shared resources).
- Difficult cost and capacity attribution per team or environment.
Better approach:
- Use separate projects per team and per environment (e.g., `team-a-dev`, `team-a-test`, `team-a-prod`).
- Use RBAC to scope who can access which projects.
- Use resource quotas and limit ranges per project to avoid noisy neighbors.
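As a rough sketch of what this can look like, the following manifests pair a per-project quota with a role binding that scopes a team to its own project; the project name `team-a-dev`, the group `team-a`, and all values are assumptions for illustration.

```yaml
# Sketch only: quota and RBAC for an assumed project "team-a-dev" owned by group "team-a".
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-dev-quota
  namespace: team-a-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "30"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a-dev
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit          # built-in ClusterRole granting read/write within this project only
  apiGroup: rbac.authorization.k8s.io
```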
Ignoring Resource Requests, Limits, and Quotas
A common mistake is leaving CPU and memory unspecified or using arbitrary values:
- No `resources.requests` set:
  - The scheduler has poor information, resulting in inefficient placement.
- No `resources.limits` set:
  - A runaway process can consume all memory on a node and trigger OOM kills for other pods.
- Over-committing or under-committing:
  - Over-commit: too many pods per node, leading to contention and throttling.
  - Under-commit: cluster appears “full” even with unused capacity.
Better approach:
- Always specify `resources.requests` and `resources.limits` for CPU and memory in pod templates.
- Define sensible `LimitRange` and `ResourceQuota` objects in each project.
- Establish sizing guidelines for teams (e.g., small/medium/large pod flavors).
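For example, a pod template with explicit requests and limits, together with a project-level LimitRange that supplies defaults, might look like this sketch (all names and values are illustrative, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # assumed image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
---
# Defaults applied to containers that do not declare their own values.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 256Mi
```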
Treating OpenShift Like “Just VMs”
Applications often arrive “lift-and-shift” from VMs:
- Stateful applications using local disk or in-container storage.
- Manual configuration via SSH or interactive shells inside containers.
- Long-running background tasks managed manually.
Consequences:
- Loss of data when pods are rescheduled.
- Configuration drift and unrepeatable deployments.
- Fragile operational practices that do not fit container orchestration.
Better approach:
- Externalize state to persistent volumes or external services.
- Use ConfigMaps, Secrets, and environment variables for configuration.
- Use DeploymentConfigs/Deployments, Jobs, and CronJobs instead of manual process management.
- Define everything as YAML manifests in version control.
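One minimal sketch of this style: configuration lives in a ConfigMap and a recurring task runs as a CronJob rather than a hand-managed background process (the names, schedule, value, and image below are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: report-config
data:
  REPORT_TARGET: "s3://example-bucket/reports"   # assumed configuration value
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report-runner:1.0.0   # assumed image
              envFrom:
                - configMapRef:
                    name: report-config
```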
Misusing Routes, Services, and Network Policies
Common networking missteps include:
- Exposing everything with public Routes by default.
- Confusing ClusterIP, NodePort, and Routes and using NodePorts for general access.
- Not defining network policies, leaving all pods able to talk to each other.
- Hard-coding pod IPs or node IPs in applications.
Consequences:
- Unnecessary attack surface and exposure to the internet.
- Complex, brittle network paths that are hard to debug.
- Lateral movement risk if an application is compromised.
Better approach:
- Use Services for stable in-cluster communication; never rely on pod IPs.
- Expose external endpoints via Routes or Ingress, not NodePort, unless there is a specific need.
- Start with a default-deny network policy model and explicitly allow necessary flows.
- Use DNS-based discovery (`<service>.<namespace>.svc.cluster.local`) instead of IPs.
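A default-deny posture with one explicit allow rule can be expressed roughly as follows; the `frontend`/`backend` labels and port are assumptions:

```yaml
# Deny all ingress to pods in this project by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Explicitly allow traffic from frontend pods to backend pods on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```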
Mismanaging Storage and Data
Typical storage pitfalls:
- Using ephemeral storage for persistent workloads (databases, queues).
- Assuming local node storage is durable and stable.
- Creating `PersistentVolume` objects manually for everything.
- Ignoring performance and access modes (e.g., using RWO volumes for multiple replicas).
Consequences:
- Data loss when pods are rescheduled or nodes fail.
- Performance problems (e.g., latency-sensitive workloads on slow backing storage).
- Stuck pods waiting for unattainable volumes or access modes.
Better approach:
- Use PersistentVolumeClaims with an appropriate `StorageClass` for persistent data.
- Understand the storage type (file/block/object) and access modes (RWO, RWX, ROX) and choose accordingly.
- Use dynamic provisioning where possible; reserve static PVs only for special cases.
- Design stateful workloads with failure scenarios in mind (backup, restore, resync).
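For dynamically provisioned storage, a claim is usually as small as the sketch below; the `fast-block` StorageClass name is an assumption and depends on what the cluster actually offers:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce              # single-node access; use RWX only if the backend supports it
  storageClassName: fast-block   # assumed name; list real classes with `oc get storageclass`
  resources:
    requests:
      storage: 20Gi
```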
Overlooking Security and Compliance Controls
Frequent issues:
- Running containers as root or with overly permissive SCCs.
- Pulling images from untrusted registries or without scanning them.
- Storing secrets in ConfigMaps or environment variables in plain text.
- Sharing service accounts between applications.
Consequences:
- Elevated blast radius if an application is compromised.
- Non-compliance with organizational or regulatory requirements.
- Increased difficulty of security auditing and incident response.
Better approach:
- Use appropriate Security Context Constraints and run as non-root whenever possible.
- Store credentials in Secrets and limit their distribution.
- Pin images to trusted registries; integrate image scanning into build or admission workflows.
- Use separate service accounts per application with least-privilege RBAC.
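A minimal sketch of a dedicated service account and a non-root security context is shown below; under the restricted SCC, OpenShift typically assigns the UID itself, so the explicit settings are illustrative:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-api            # one service account per application
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      serviceAccountName: example-api
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # assumed image
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```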
Ignoring Cluster and Application Observability
New teams often run with default settings and:
- Do not define application metrics or expose them in a standard format.
- Ignore out-of-the-box dashboards and alerts until something critical fails.
- Rely only on logs inside containers rather than central logging.
- Have no clear SLOs or thresholds.
Consequences:
- Slow incident detection and resolution.
- Difficulty correlating cluster events (node failures, OOMs) with application symptoms.
- Little data available for capacity planning.
Better approach:
- Integrate application metrics with the built-in monitoring stack (e.g., Prometheus format).
- Define key alerts on latency, errors, and resource saturation.
- Centralize logging using the OpenShift logging architecture; avoid local log scraping.
- Define basic SLOs per service and use dashboards to track them.
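If user workload monitoring is enabled, exposing application metrics to the built-in Prometheus stack can look roughly like this ServiceMonitor sketch; the `metrics` port name and `app` label are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api        # must match the labels on the application's Service
  endpoints:
    - port: metrics           # named Service port that serves /metrics in Prometheus format
      interval: 30s
```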
Misconfigured Autoscaling and High Availability
Typical mistakes:
- Enabling Horizontal Pod Autoscaling without setting resource requests correctly.
- Assuming more replicas always equals high availability.
- Ignoring pod disruption budgets (PDBs) and health probes.
- Forgetting about the cluster autoscaler, or lacking spare capacity for scale-out.
Consequences:
- Unpredictable scaling behavior and thrashing.
- Cascading failures when nodes drain and all replicas are removed at once.
- Downtime during planned or unplanned maintenance.
Better approach:
- Set accurate resource requests and use them as the basis for HPA.
- Use readiness and liveness probes to control rollout behavior and restarts.
- Define PodDisruptionBudgets for critical workloads.
- Align application autoscaling with cluster capacity and node scaling policies.
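A sketch combining a CPU-based HPA with a PodDisruptionBudget might look like the following; the utilization target and replica counts are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the CPU *request*, hence requests must be accurate
---
# Keep at least one replica running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-api
```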
Poor Image and Build Practices
Common build and image pitfalls:
- Using large base images with many unused tools.
- Building images directly on the cluster as root or with unsafe Dockerfiles.
- Baking environment-specific configuration into images.
- Not tagging images immutably (e.g., only using `latest`).
Consequences:
- Large image sizes, slow deployments, and increased attack surface.
- Inconsistent behavior between environments.
- Difficulty rolling back or auditing exactly what ran where.
Better approach:
- Use minimal, well-maintained base images.
- Separate build and runtime stages (multi-stage builds or S2I) to keep runtime images small.
- Keep images environment-agnostic; inject configuration at deploy time.
- Use immutable tags (e.g., git SHA, build ID) and treat `latest` only as a moving pointer, if at all.
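As one possible sketch using OpenShift builds, an S2I BuildConfig can keep the runtime image small and push it to an immutable tag; the repository, builder image, and tag below are assumptions, and the `example-api` ImageStream is assumed to already exist:

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: example-api
spec:
  source:
    git:
      uri: https://git.example.com/team-a/example-api.git          # assumed repository
  strategy:
    sourceStrategy:
      from:
        kind: DockerImage
        name: registry.access.redhat.com/ubi9/openjdk-17:latest    # assumed S2I builder image
  output:
    to:
      kind: ImageStreamTag
      name: example-api:1.4.0-3f2a9c1   # immutable tag: version plus short git SHA
```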
Neglecting Operational Runbooks and Ownership
Another recurring issue is organizational, not technical:
- No clear ownership of applications or namespaces.
- No documented procedures for common operations (restart, rollback, scale, outage).
- Overreliance on a central “cluster admin” for all tasks.
Consequences:
- Slow response during incidents.
- High friction to onboard new services or teams.
- Misuse of admin privileges when simple, delegated operations would suffice.
Better approach:
- Define ownership for each project, application, and critical component.
- Create and maintain runbooks for standard scenarios:
- Deployment failure
- Rollback
- Capacity exhaustion
- Storage incidents
- Delegate day-to-day operations using RBAC; reserve cluster-admin for platform-level tasks.
Best Practices for Day-to-Day Work with OpenShift
Design Applications for the Platform
To work well with OpenShift, applications should:
- Be stateless where possible; where state is necessary, isolate it and use proper storage.
- Use configuration via environment variables, ConfigMaps, and Secrets.
- Gracefully handle restarts and rescheduling (e.g., handle SIGTERM, support quick startup).
- Avoid assumptions about local filesystem persistence or network topology.
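For example, a pod spec that tolerates restarts and rescheduling typically declares probes and a termination grace period, roughly as in this sketch (endpoints, port, and timings are illustrative; in practice this would sit inside a Deployment's pod template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-api
  labels:
    app: example-api
spec:
  terminationGracePeriodSeconds: 30    # time to finish in-flight work after SIGTERM
  containers:
    - name: api
      image: registry.example.com/example-api:1.0.0   # assumed image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz/ready         # assumed endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz/live          # assumed endpoint
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
```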
Embrace Declarative, Git-Centric Workflows
Instead of manual changes:
- Store all manifests (Deployments, Services, Routes, policies) in version control.
- Use declarative tools and GitOps workflows for environment management.
- Treat OpenShift as an execution target; treat Git as the source of truth.
Benefits:
- Consistency across environments.
- Auditability and easy rollback.
- Collaboration and review via pull requests.
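With OpenShift GitOps (Argo CD), for instance, an environment can be described as an Application that continuously syncs a Git path into a project; the repository URL, path, and project names below are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-api-prod
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/team-a/deploy-manifests.git   # assumed repository
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-prod
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual drift back to the Git state
```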
Standardize Project and Resource Conventions
Cluster-wide conventions reduce complexity and entropy:
- Naming patterns for projects and resources (e.g., `team-env-app`).
- Standard labels and annotations (e.g., `app`, `team`, `environment`, `tier`).
- Shared base templates for common workloads (web service, batch job, cron job).
These conventions make it easier to:
- Filter and group resources.
- Apply policies and quotas.
- Onboard new users to a predictable environment.
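In practice this often comes down to a small, consistent metadata block on every resource, for example (the label keys and values shown are one possible convention, not a platform requirement):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api
  labels:
    app: example-api
    team: team-a
    environment: prod
    tier: backend
spec:
  selector:
    app: example-api
  ports:
    - port: 8080
      targetPort: 8080
```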
Use Platform Features Instead of Custom Plumbing
Avoid re-inventing mechanisms that OpenShift already provides, such as:
- Blue/green or rolling deployments: use DeploymentConfigs/Deployments, not custom scripts.
- Access control: use RBAC and SCCs, not ad-hoc access lists in applications.
- SSL/TLS termination: use Routes and built-in certificates where appropriate.
Leveraging these platform features fully:
- Simplifies operational complexity.
- Ensures you benefit from upstream improvements and bug fixes.
- Keeps architecture aligned with the platform’s strengths.
Iterate Safely Across Environments
Adopt an environment promotion strategy:
- Separate dev, test, and prod clusters or projects.
- Promote the same artifact (image) through environments; avoid rebuilding for each stage.
- Use the same manifests with environment-specific overlays (e.g., configuration values).
Guidelines:
- Practice rollbacks and failover in non-prod regularly.
- Keep production as close as possible to test, aside from sizing and secrets.
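One common way to promote the same manifests with per-environment differences is Kustomize overlays; a sketch of a prod overlay might look like this (the directory layout, patch file, and image tag are assumptions):

```yaml
# overlays/prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: team-a-prod
resources:
  - ../../base                  # shared Deployments, Services, Routes
patches:
  - path: replica-count.yaml    # e.g., raise replicas and resource limits for prod
images:
  - name: registry.example.com/example-api
    newTag: 1.4.0-3f2a9c1       # same immutable image that was tested in lower environments
```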
Integrate Security, Compliance, and Observability Early
Instead of adding them late:
- Make security scanning, policy checks, and tests part of your CI/CD pipeline.
- Standardize logging, metrics, and tracing from the first service onward.
- Align with organization-wide policies (e.g., approved registries, encryption at rest/in transit).
This reduces:
- Surprises during security reviews.
- Expensive rework to retrofit observability.
- The risk of non-compliant deployments.
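As a rough illustration of wiring these checks into a pipeline, OpenShift Pipelines (Tekton) lets the ordering be encoded declaratively; the task names below are hypothetical placeholders, not real catalog tasks:

```yaml
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-scan-deploy
spec:
  tasks:
    - name: unit-tests
      taskRef:
        name: run-unit-tests               # hypothetical Task
    - name: build-image
      runAfter: ["unit-tests"]
      taskRef:
        name: build-and-push               # hypothetical Task
    - name: scan-image
      runAfter: ["build-image"]
      taskRef:
        name: image-vulnerability-scan     # hypothetical Task
    - name: deploy-dev
      runAfter: ["scan-image"]
      taskRef:
        name: deploy-to-dev                # hypothetical Task
```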
Collaborate Between Platform and Application Teams
OpenShift success usually depends on healthy collaboration:
- Platform team:
- Defines cluster-level policies, quotas, and shared services.
- Offers templates, documentation, and guardrails.
- Application teams:
- Own application lifecycles and ensure apps adhere to platform standards.
- Provide feedback on friction and missing capabilities.
Best practices:
- Regularly review resource usage, incidents, and changes together.
- Maintain shared documentation and internal “cookbooks” of patterns that work well.
- Use internal communities of practice to spread knowledge.
Lightweight Checklists
Before Deploying a New Application
- [ ] Resource requests and limits defined.
- [ ] ConfigMaps and Secrets used; no secrets in plain text files in images.
- [ ] Health probes (liveness/readiness) configured.
- [ ] Storage needs assessed; appropriate PVCs and StorageClasses configured.
- [ ] Network exposure justified; only necessary Routes and Services created.
- [ ] RBAC and SCC requirements identified; least privilege applied.
- [ ] Logging and metrics integrated with platform tools.
Before Promoting to Production
- [ ] Same container image tested in lower environments.
- [ ] Automated deployment (no manual `oc` patching in prod).
- [ ] Rollback procedure documented and tested.
- [ ] Capacity and autoscaling behavior validated.
- [ ] Alerts tuned and tested for key failure modes.
- [ ] Ownership and on-call responsibility defined.
Using these patterns and avoiding the common pitfalls transforms OpenShift from “just another cluster” into a reliable, secure, and efficient platform for your applications.