Why Build Custom Operators?
Custom Operators extend OpenShift and Kubernetes with domain-specific automation. While catalog Operators (from Red Hat or vendors) cover common platforms and services, custom Operators let you:
- Encode your organization’s operational knowledge (runbooks, SOPs) as code.
- Provide a self-service API for internal platforms (databases, ML platforms, specialized HPC services).
- Standardize deployment patterns so development teams consume higher-level abstractions instead of raw Kubernetes primitives.
In short, a custom Operator turns your “how we run X in production” into a native OpenShift resource that anyone can create and manage declaratively.
Typical Use Cases for Custom Operators
Some common patterns where custom Operators are valuable:
- Internal platforms and PaaS features
- “Internal DB-as-a-Service” (PostgreSQL, MySQL, Redis, etc.).
- Managed message brokers, caches, or search (Kafka, RabbitMQ, Elasticsearch).
- Company-standard application stacks (e.g., Java microservice with sidecars, logging, and monitoring baked in).
- Complex, stateful, or lifecycle-heavy applications
- Distributed databases and storage systems.
- Clustered middleware (e.g., identity providers, API gateways).
- Systems requiring careful rolling updates, backup/restore, or scale operations.
- Environment and compliance enforcement
- Automatic injection of organization-required security settings, sidecars, labels, or network policies.
- Enforcement of naming conventions or deployment policies via admission-style logic.
- HPC and data/AI workloads
- Provisioning of GPU-enabled job environments.
- MPI cluster launchers or job orchestration flows.
- Data pipeline orchestration with specialized resource coordination.
Core Building Blocks of a Custom Operator
A custom Operator is built around a few key concepts:
Custom Resource Definitions (CRDs)
A Custom Resource Definition (CRD) adds a new API type to the cluster, such as:
- Database
- CacheCluster
- MpiJob
- ModelServingEndpoint
Each CRD defines:
- API group and version (e.g., database.example.com/v1alpha1).
- Kind (e.g., PostgresInstance).
- Schema (fields like spec.size, spec.storage, spec.backupPolicy).
Once the CRD is installed, users can create Custom Resources (CRs) with oc or YAML, just like native Kubernetes/OpenShift objects.
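To make this concrete, here is a minimal sketch of what such a CRD could look like for a PostgresInstance API; the group, kind, and schema fields simply reuse the illustrative examples from the list above and are not a definitive definition.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # CRD names follow the pattern <plural>.<group>.
  name: postgresinstances.database.example.com
spec:
  group: database.example.com
  names:
    kind: PostgresInstance
    plural: postgresinstances
    singular: postgresinstance
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                storage:
                  type: object
                  properties:
                    capacity:
                      type: string
                backupPolicy:
                  type: string
```

Once this CRD is established, oc get postgresinstances behaves like any built-in resource type.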
Controllers and Reconciliation
A custom Operator typically includes one or more controllers that:
- Watch for changes to:
- Custom Resources of a given kind.
- Related native resources (Pods, Deployments, PVCs, etc.).
- Reconcile the actual cluster state towards the desired state expressed in the CR's spec.
The reconciliation pattern usually follows:
- Trigger: A CR is created/updated/deleted or related resources change.
- Observe: The controller reads current state (CR + dependent objects).
- Decide: Determine what needs to change.
- Act: Create/update/delete Kubernetes/OpenShift objects.
- Record status: Update the CR status to reflect progress or failures.
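The shape of this loop depends on the framework; the following is a minimal Go sketch using controller-runtime, where the PostgresInstanceReconciler type, the databasev1alpha1 API package, and the buildStatefulSet helper are illustrative assumptions rather than generated code.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical API package generated for the PostgresInstance CRD.
	databasev1alpha1 "example.com/postgres-operator/api/v1alpha1"
)

// PostgresInstanceReconciler reconciles PostgresInstance custom resources.
type PostgresInstanceReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *PostgresInstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: fetch the CR that triggered this reconciliation.
	var pg databasev1alpha1.PostgresInstance
	if err := r.Get(ctx, req.NamespacedName, &pg); err != nil {
		// The CR was deleted; owned objects are garbage-collected via owner references.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Decide: derive the desired dependent objects from pg.Spec
	// (buildStatefulSet is a hypothetical helper).
	desired := buildStatefulSet(&pg)
	if err := ctrl.SetControllerReference(&pg, desired, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// Act: server-side apply keeps repeated reconciles idempotent.
	if err := r.Patch(ctx, desired, client.Apply, client.FieldOwner("postgres-operator")); err != nil {
		return ctrl.Result{}, err
	}

	// Record status: reflect observed state back on the CR for users.
	pg.Status.Phase = "Ready"
	if err := r.Status().Update(ctx, &pg); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The apply-based Act step matters: running the same reconciliation repeatedly converges to the same result instead of piling up changes.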
Operator Packaging and Distribution
Custom Operators are typically packaged for easy installation:
- As an Operator bundle that can be managed by the Operator Lifecycle Manager (OLM).
- With associated ClusterServiceVersion (CSV) metadata describing:
- Provided APIs (CRDs).
- Required permissions (RBAC).
- Dependencies on other Operators.
- Upgrade strategy and install modes.
This packaging enables:
- Cluster-wide or namespace-scoped installation.
- Versioned upgrades through the OpenShift web console or oc.
- Integration with OperatorHub (internal or external).
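For orientation, an abbreviated ClusterServiceVersion might look like the following; the Operator name, version, and CRD references are illustrative.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: postgres-operator.v0.1.0
spec:
  displayName: Postgres Operator
  version: 0.1.0
  installModes:
    - { type: OwnNamespace, supported: true }
    - { type: SingleNamespace, supported: true }
    - { type: MultiNamespace, supported: false }
    - { type: AllNamespaces, supported: true }
  customresourcedefinitions:
    owned:
      - name: postgresinstances.database.example.com
        kind: PostgresInstance
        version: v1alpha1
        displayName: PostgresInstance
  install:
    strategy: deployment
    spec:
      deployments: []        # the Operator's own Deployment
      permissions: []        # namespaced RBAC granted to its ServiceAccount
      clusterPermissions: [] # cluster-scoped RBAC, if any
```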
Design Considerations for Custom Operators
When designing a custom Operator, there are architectural and API-level decisions that determine how usable and maintainable it will be.
API Design for Custom Resources
Think of your CRD as a product interface:
- Audience: Who will use this CRD?
- Platform engineers? Application developers? Data scientists?
- Abstraction level:
- High-level “I want a ProductionPostgres with 3 replicas” vs. exposing low-level tuning knobs.
- Schema clarity:
- Use clear field names and types.
- Separate required vs. optional fields.
- Provide sane defaults whenever possible.
Patterns that help:
- Use spec for desired configuration and status for observed state.
- Include status fields such as:
- status.phase (e.g., Pending, Ready, Degraded).
- status.conditions (with type, status, reason, message).
- Important computed fields such as connection endpoints or credential references.
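Put together, a CR that follows these conventions might report state like this; the connection block and its field names are illustrative, not a fixed schema.

```yaml
apiVersion: database.example.com/v1alpha1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  storage:
    capacity: 100Gi
status:
  phase: Ready
  conditions:
    - type: Available
      status: "True"
      reason: AllReplicasReady
      message: 3/3 replicas are ready
      lastTransitionTime: "2024-05-01T12:00:00Z"
  connection:
    serviceName: app-db-rw          # computed endpoint
    secretRef: app-db-credentials   # reference to credentials, never the values
```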
Scoping and Multi-Tenancy
Decide whether your Operator:
- Runs cluster-wide (manages CRs in many namespaces).
- Runs namespace-scoped (limited to a specific project).
Consider:
- Security: Which ServiceAccounts and RBAC rules are needed?
- Multi-tenancy: Should different teams own different instances? Should quotas/limits apply per namespace?
- Resource ownership: Ensure that resources created by the Operator follow clear naming and labels for traceability.
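For resource ownership in particular, a common convention is to stamp every object the Operator creates with the well-known app.kubernetes.io labels (the values below are illustrative) and to set owner references programmatically so deletion cascades correctly.

```yaml
metadata:
  labels:
    app.kubernetes.io/name: postgresql
    app.kubernetes.io/instance: app-db              # which CR owns this object
    app.kubernetes.io/managed-by: postgres-operator # which controller created it
    app.kubernetes.io/part-of: team-payments        # tenant/team grouping
```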
Declarative vs. Imperative Behavior
Operators should:
- Favor declarative interfaces: “what” users want, not “how.”
- Encode the imperative steps (ordering, retries, error recovery) within the Operator logic, not in user-facing YAML.
For example, a CR:
```yaml
apiVersion: database.example.com/v1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  storage:
    capacity: 100Gi
  highAvailability: true
```

The Operator handles:
- Creating StatefulSets, PVCs, Services.
- Coordinated rolling upgrades.
- Automatic failover procedures.
- Backups and restore hooks.
Users only set high-level goals; the Operator decides the action plan.
Implementation Approaches and Tooling
There are multiple ways to implement custom Operators; the choice affects language, tooling, and complexity.
Operator SDK
The Operator SDK (from the Operator Framework) is the standard toolkit for building Operators running on OpenShift. It supports different programming models:
- Go-based Operators
- Full control and flexibility.
- Use controller-runtime libraries and patterns.
- Best suited for complex, performance-sensitive Operators.
- Ansible-based Operators
- Use Ansible playbooks and roles as reconciliation logic.
- Good for teams with strong Ansible expertise and existing automation.
- Lower barrier to entry for straightforward provisioning patterns.
- Helm-based Operators
- Wrap an existing Helm chart in Operator reconciliation logic.
- Useful when your application is already managed via Helm.
- Less flexible for sophisticated, stateful behavior.
Operator SDK provides:
- Project scaffolding.
- CRD/CSV generation and validation.
- Local testing tools.
- Integration with OLM packaging.
Other Frameworks and Patterns
While Operator SDK is most common on OpenShift, you may also encounter:
- Kubebuilder (Go-based; underlying tech for Operator SDK).
- Java, Python, or .NET Operators
- Using language-specific libraries (e.g., Java Operator SDK, Kopf in Python).
- Appropriate if your team prefers a particular language ecosystem.
When using non-standard frameworks, ensure:
- You can containerize and deploy your controller to OpenShift.
- You still follow Kubernetes reconciliation and CRD best practices.
- You can integrate with OLM packaging if you want OperatorHub distribution.
Lifecycle Management of Custom Operators
Building an Operator is only part of the story; you must manage its full lifecycle in production.
Versioning and Upgrades
Key considerations:
- Version both CRDs and Operator logic:
- Use API versioning (v1alpha1, v1beta1, v1) as your interface stabilizes.
- Backward compatibility:
- Support older CR versions during migrations when possible.
- Provide conversion webhooks if you need in-place upgrades between API versions.
- Upgrade strategy:
- Use OLM’s channels (e.g., stable, fast) to roll out new versions.
- Carefully test upgrade paths, especially for stateful services.
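As an illustration, a CRD that serves both an old and a new API version while a webhook handles conversion could be declared roughly like this; the webhook service name, namespace, and path are assumptions.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresinstances.database.example.com
spec:
  group: database.example.com
  names:
    kind: PostgresInstance
    plural: postgresinstances
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true    # still accepted so existing CRs keep working
      storage: false
    - name: v1
      served: true
      storage: true   # the version persisted in etcd going forward
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: postgres-operator-webhook
          namespace: postgres-operator
          path: /convert
```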
Reliability and Error Handling
A robust custom Operator should:
- Handle transient errors with retries and backoff.
- Avoid tight reconciliation loops that overload the API server.
- Clearly represent error states in CR status fields.
- Provide actionable error messages for users (not just “failed”).
Consider:
- Idempotent operations: reconciliation should be safe to run repeatedly.
- Protective checks: guard against destructive operations (e.g., unintended data deletion on spec change).
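To make this concrete, a fragment in the same controller-runtime style as the earlier sketch might distinguish transient from persistent failures like this; ensureStatefulSet and setCondition are hypothetical helpers, and apierrors refers to k8s.io/apimachinery/pkg/api/errors.

```go
// Continues the PostgresInstanceReconciler sketch; assumes imports for
// time, apierrors "k8s.io/apimachinery/pkg/api/errors", and
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
func (r *PostgresInstanceReconciler) reconcileWorkload(ctx context.Context, pg *databasev1alpha1.PostgresInstance) (ctrl.Result, error) {
	if err := r.ensureStatefulSet(ctx, pg); err != nil {
		if apierrors.IsConflict(err) || apierrors.IsTooManyRequests(err) {
			// Transient: returning the error lets controller-runtime requeue
			// with rate-limited exponential backoff instead of hot-looping.
			return ctrl.Result{}, err
		}
		// Persistent: surface an actionable condition for users, then retry later.
		setCondition(pg, "Available", metav1.ConditionFalse, "ProvisioningFailed", err.Error())
		if statusErr := r.Status().Update(ctx, pg); statusErr != nil {
			return ctrl.Result{}, statusErr
		}
		return ctrl.Result{RequeueAfter: 2 * time.Minute}, nil
	}
	return ctrl.Result{}, nil
}
```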
Observability for Operators
To operate and debug custom Operators:
- Emit structured logs with correlation identifiers (e.g., CR name/namespace).
- Expose metrics:
- Reconciliation durations and error counts.
- Number of managed resources.
- Integrate with:
- Cluster logging and monitoring.
- Alerting (for persistent errors, reconciliation failures, or resource anomalies).
This allows SRE/platform teams to treat the Operator itself as a first-class, observable service.
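As a sketch of the metrics side, custom counters and gauges can be registered with controller-runtime's Prometheus registry so they appear on the manager's /metrics endpoint; the metric names and labels below are illustrative.

```go
package controllers

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Operator-specific metrics; controller-runtime already exposes generic
// reconcile counters and durations out of the box.
var (
	managedInstances = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "postgres_operator_managed_instances",
			Help: "Number of PostgresInstance resources currently managed.",
		},
		[]string{"namespace"},
	)
	reconcileFailures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "postgres_operator_reconcile_failures_total",
			Help: "Reconcile failures per PostgresInstance.",
		},
		[]string{"namespace", "name", "reason"},
	)
)

func init() {
	// Register with the controller-runtime registry so cluster monitoring
	// can scrape these alongside the built-in controller metrics.
	metrics.Registry.MustRegister(managedInstances, reconcileFailures)
}
```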
Security and Permissions
Custom Operators require RBAC permissions to manage resources on users’ behalf. Poorly scoped permissions can introduce risk.
RBAC Scoping
Design permission sets carefully:
- Use the principle of least privilege:
- Only grant verbs and resources the Operator actually needs.
- Differentiate:
- Namespaced Operators: scoped to specific namespaces.
- Cluster-scoped Operators: even with ClusterRoles, restrict permissions to the specific CRDs and core resources the Operator needs to function.
Review:
- ServiceAccount used by the Operator.
- ClusterRole and RoleBindings granted to that ServiceAccount.
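A namespaced permission set for such an Operator might look roughly like this; the names, namespaces, and resource list are illustrative and should be trimmed to what your controller actually touches.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: postgres-operator
  namespace: team-payments
rules:
  # Only the resources and verbs the Operator actually needs.
  - apiGroups: ["database.example.com"]
    resources: ["postgresinstances", "postgresinstances/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: postgres-operator
  namespace: team-payments
subjects:
  - kind: ServiceAccount
    name: postgres-operator
    namespace: postgres-operator
roleRef:
  kind: Role
  name: postgres-operator
  apiGroup: rbac.authorization.k8s.io
```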
Managing Sensitive Data
If your Operator handles secrets (e.g., DB credentials, TLS keys):
- Use Kubernetes Secrets for storage, not CR spec fields.
- Avoid logging sensitive values.
- Ensure status fields do not expose credentials.
- Support integration with external secret management solutions where appropriate.
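A sketch of that pattern: the CR carries only a reference to a Secret, and the Secret itself is created separately or synced from an external manager. The credentials field name is an assumption.

```yaml
apiVersion: database.example.com/v1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  credentials:
    existingSecretRef: app-db-credentials   # values live in the Secret, not the CR
---
apiVersion: v1
kind: Secret
metadata:
  name: app-db-credentials
type: Opaque
stringData:
  username: app_user
  password: REPLACE_ME   # or injected by an external secret management tool
```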
Example Design Pattern: Application Stack Operator
To illustrate how components fit together, consider a simplified pattern for a “WebApp” Operator that standardizes an application stack for development teams.
- CRD: WebApp with fields:
- spec.image
- spec.replicas
- spec.databaseRef
- spec.ingressHost
- Operator behavior:
- On WebApp creation:
- Create a Deployment for the app.
- Create or reference a DB CR (e.g., PostgresInstance).
- Create a Service and Route.
- On updates:
- Perform rolling updates of the Deployment.
- Adjust replicas.
- Update Route if host changes.
- On deletion:
- Clean up application Deployment, Service, and Route.
- Optionally delete or retain the DB instance based on a policy in spec.
Developers then submit a single WebApp YAML instead of manually wiring multiple resources, while platform teams encode best practices (resource limits, labels, security profiles, logging sidecars) inside the Operator.
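Such a WebApp manifest might look like the following; the API group and the values shown are illustrative.

```yaml
apiVersion: apps.example.com/v1alpha1
kind: WebApp
metadata:
  name: storefront
spec:
  image: registry.example.com/storefront:1.4.2
  replicas: 3
  databaseRef: storefront-db        # name of a PostgresInstance CR
  ingressHost: storefront.apps.example.com
```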
Best Practices for Custom Operators
When building custom Operators for OpenShift, follow these guidelines:
- Keep CRDs focused and stable
- Avoid exposing every tunable as a field; start with essential ones.
- Evolve API versions as usage grows; don’t break existing users without migration paths.
- Leverage OpenShift-native features
- Use Routes, SCCs, and OpenShift-specific objects where appropriate.
- Align with cluster-wide observability and logging.
- Integrate with OpenShift authentication/authorization patterns where relevant.
- Automate testing
- Unit tests for reconciliation logic.
- Integration tests on an ephemeral cluster (e.g., using kind, CRC, or test OpenShift clusters).
- Upgrade tests across Operator versions, particularly for stateful workflows.
- Document and support your APIs
- Provide examples and clear field descriptions.
- Explain typical failure modes and remediation steps.
- Treat custom CRDs as public contracts with your internal users.
- Plan for deprecation
- Mark fields or CR versions as deprecated when needed.
- Provide migration recommendations and tooling when APIs change.
By carefully designing, implementing, and operating custom Operators with these practices, you can turn OpenShift into a powerful, self-service platform tailored to your organization’s applications and operational standards.