Why Build Custom Operators?
Custom Operators extend OpenShift and Kubernetes with domain-specific automation. While catalog Operators (from Red Hat or vendors) cover common platforms and services, custom Operators let you:
- Encode your organization’s operational knowledge (runbooks, SOPs) as code.
- Provide a self-service API for internal platforms (databases, ML platforms, specialized HPC services).
- Standardize deployment patterns so development teams consume higher-level abstractions instead of raw Kubernetes primitives.
In short, a custom Operator turns your “how we run X in production” into a native OpenShift resource that anyone can create and manage declaratively.
Typical Use Cases for Custom Operators
Some common patterns where custom Operators are valuable:
- Internal platforms and PaaS features
- “Internal DB-as-a-Service” (PostgreSQL, MySQL, Redis, etc.).
- Managed message brokers, caches, or search (Kafka, RabbitMQ, Elasticsearch).
- Company-standard application stacks (e.g., Java microservice with sidecars, logging, and monitoring baked in).
- Complex, stateful, or lifecycle-heavy applications
- Distributed databases and storage systems.
- Clustered middleware (e.g., identity providers, API gateways).
- Systems requiring careful rolling updates, backup/restore, or scale operations.
- Environment and compliance enforcement
- Automatic injection of organization-required security settings, sidecars, labels, or network policies.
- Enforcement of naming conventions or deployment policies via admission-style logic.
- HPC and data/AI workloads
- Provisioning of GPU-enabled job environments.
- MPI cluster launchers or job orchestration flows.
- Data pipeline orchestration with specialized resource coordination.
Core Building Blocks of a Custom Operator
A custom Operator is built around a few key concepts:
Custom Resource Definitions (CRDs)
A Custom Resource Definition (CRD) adds a new API type to the cluster, such as:
- Database
- CacheCluster
- MpiJob
- ModelServingEndpoint
Each CRD defines:
- API group and version (e.g., database.example.com/v1alpha1).
- Kind (e.g., PostgresInstance).
- Schema (fields like spec.size, spec.storage, spec.backupPolicy).
Once the CRD is installed, users can create Custom Resources (CRs) with oc or YAML, just like native Kubernetes/OpenShift objects.
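To make this concrete, here is a minimal sketch of what such a CRD could look like for a PostgresInstance API; the group, kind, and schema fields simply reuse the illustrative examples from the list above and are not a definitive definition.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # CRD names follow the pattern <plural>.<group>.
  name: postgresinstances.database.example.com
spec:
  group: database.example.com
  names:
    kind: PostgresInstance
    plural: postgresinstances
    singular: postgresinstance
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                storage:
                  type: object
                  properties:
                    capacity:
                      type: string
                backupPolicy:
                  type: string
```

Once this CRD is established, oc get postgresinstances behaves like any built-in resource type.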
Controllers and Reconciliation
A custom Operator typically includes one or more controllers that:
- Watch for changes to:
- Custom Resources of a given kind.
- Related native resources (Pods, Deployments, PVCs, etc.).
- Reconcile the actual cluster state towards the desired state expressed in the CR's spec.
The reconciliation pattern usually follows:
- Trigger: A CR is created/updated/deleted or related resources change.
- Observe: The controller reads current state (CR + dependent objects).
- Decide: Determine what needs to change.
- Act: Create/update/delete Kubernetes/OpenShift objects.
- Record status: Update the CR status to reflect progress or failures.
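The shape of this loop depends on the framework; the following is a minimal Go sketch using controller-runtime, where the PostgresInstanceReconciler type, the databasev1alpha1 API package, and the buildStatefulSet helper are illustrative assumptions rather than generated code.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical API package generated for the PostgresInstance CRD.
	databasev1alpha1 "example.com/postgres-operator/api/v1alpha1"
)

// PostgresInstanceReconciler reconciles PostgresInstance custom resources.
type PostgresInstanceReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *PostgresInstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: fetch the CR that triggered this reconciliation.
	var pg databasev1alpha1.PostgresInstance
	if err := r.Get(ctx, req.NamespacedName, &pg); err != nil {
		// The CR was deleted; owned objects are garbage-collected via owner references.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Decide: derive the desired dependent objects from pg.Spec
	// (buildStatefulSet is a hypothetical helper).
	desired := buildStatefulSet(&pg)
	if err := ctrl.SetControllerReference(&pg, desired, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// Act: server-side apply keeps repeated reconciles idempotent.
	if err := r.Patch(ctx, desired, client.Apply, client.FieldOwner("postgres-operator")); err != nil {
		return ctrl.Result{}, err
	}

	// Record status: reflect observed state back on the CR for users.
	pg.Status.Phase = "Ready"
	if err := r.Status().Update(ctx, &pg); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The apply-based Act step matters: running the same reconciliation repeatedly converges to the same result instead of piling up changes.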
Operator Packaging and Distribution
Custom Operators are typically packaged for easy installation:
- As an Operator bundle that can be managed by the Operator Lifecycle Manager (OLM).
- With associated ClusterServiceVersion (CSV) metadata describing:
- Provided APIs (CRDs).
- Required permissions (RBAC).
- Dependencies on other Operators.
- Upgrade strategy and install modes.
This packaging enables:
- Cluster-wide or namespace-scoped installation.
- Versioned upgrades through the OpenShift web console or oc.
- Integration with OperatorHub (internal or external).
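For orientation, an abbreviated ClusterServiceVersion might look like the following; the Operator name, version, and CRD references are illustrative.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: postgres-operator.v0.1.0
spec:
  displayName: Postgres Operator
  version: 0.1.0
  installModes:
    - { type: OwnNamespace, supported: true }
    - { type: SingleNamespace, supported: true }
    - { type: MultiNamespace, supported: false }
    - { type: AllNamespaces, supported: true }
  customresourcedefinitions:
    owned:
      - name: postgresinstances.database.example.com
        kind: PostgresInstance
        version: v1alpha1
        displayName: PostgresInstance
  install:
    strategy: deployment
    spec:
      deployments: []        # the Operator's own Deployment
      permissions: []        # namespaced RBAC granted to its ServiceAccount
      clusterPermissions: [] # cluster-scoped RBAC, if any
```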
Design Considerations for Custom Operators
When designing a custom Operator, there are architectural and API-level decisions that determine how usable and maintainable it will be.
API Design for Custom Resources
Think of your CRD as a product interface:
- Audience: Who will use this CRD?
- Platform engineers? Application developers? Data scientists?
- Abstraction level:
- High-level “I want a ProductionPostgres with 3 replicas” vs. exposing low-level tuning knobs.
- Schema clarity:
- Use clear field names and types.
- Separate required vs. optional fields.
- Provide sane defaults whenever possible.
Patterns that help:
- Use spec for desired configuration and status for observed state.
- Include status fields such as:
- status.phase (e.g., Pending, Ready, Degraded).
- status.conditions (with type, status, reason, message).
- Important computed fields such as connection endpoints or credential references.
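Put together, a CR that follows these conventions might report state like this; the connection block and its field names are illustrative, not a fixed schema.

```yaml
apiVersion: database.example.com/v1alpha1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  storage:
    capacity: 100Gi
status:
  phase: Ready
  conditions:
    - type: Available
      status: "True"
      reason: AllReplicasReady
      message: 3/3 replicas are ready
      lastTransitionTime: "2024-05-01T12:00:00Z"
  connection:
    serviceName: app-db-rw          # computed endpoint
    secretRef: app-db-credentials   # reference to credentials, never the values
```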
Scoping and Multi-Tenancy
Decide whether your Operator:
- Runs cluster-wide (manages CRs in many namespaces).
- Runs namespace-scoped (limited to a specific project).
Consider:
- Security: Which ServiceAccounts and RBAC rules are needed?
- Multi-tenancy: Should different teams own different instances? Should quotas/limits apply per namespace?
- Resource ownership: Ensure that resources created by the Operator follow clear naming and labels for traceability.
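For resource ownership in particular, a common convention is to stamp every object the Operator creates with the well-known app.kubernetes.io labels (the values below are illustrative) and to set owner references programmatically so deletion cascades correctly.

```yaml
metadata:
  labels:
    app.kubernetes.io/name: postgresql
    app.kubernetes.io/instance: app-db              # which CR owns this object
    app.kubernetes.io/managed-by: postgres-operator # which controller created it
    app.kubernetes.io/part-of: team-payments        # tenant/team grouping
```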
Declarative vs. Imperative Behavior
Operators should:
- Favor declarative interfaces: “what” users want, not “how.”
- Encode the imperative steps (ordering, retries, error recovery) within the Operator logic, not in user-facing YAML.
For example, a CR:
```yaml
apiVersion: database.example.com/v1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  storage:
    capacity: 100Gi
  highAvailability: true
```

The Operator handles:
- Creating StatefulSets, PVCs, Services.
- Coordinated rolling upgrades.
- Automatic failover procedures.
- Backups and restore hooks.
Users only set high-level goals; the Operator decides the action plan.
Implementation Approaches and Tooling
There are multiple ways to implement custom Operators; the choice affects language, tooling, and complexity.
Operator SDK
The Operator SDK (from the Operator Framework) is the standard toolkit for building Operators running on OpenShift. It supports different programming models:
- Go-based Operators
- Full control and flexibility.
- Use controller-runtime libraries and patterns.
- Best suited for complex, performance-sensitive Operators.
- Ansible-based Operators
- Use Ansible playbooks and roles as reconciliation logic.
- Good for teams with strong Ansible expertise and existing automation.
- Lower barrier to entry for straightforward provisioning patterns.
- Helm-based Operators
- Wrap an existing Helm chart in Operator reconciliation logic.
- Useful when your application is already managed via Helm.
- Less flexible for sophisticated, stateful behavior.
Operator SDK provides:
- Project scaffolding.
- CRD/CSV generation and validation.
- Local testing tools.
- Integration with OLM packaging.
Other Frameworks and Patterns
While Operator SDK is most common on OpenShift, you may also encounter:
- Kubebuilder (Go-based; underlying tech for Operator SDK).
- Java, Python, or .NET Operators
- Using language-specific libraries (e.g., Java Operator SDK, Kopf in Python).
- Appropriate if your team prefers a particular language ecosystem.
When using non-standard frameworks, ensure:
- You can containerize and deploy your controller to OpenShift.
- You still follow Kubernetes reconciliation and CRD best practices.
- You can integrate with OLM packaging if you want OperatorHub distribution.
Lifecycle Management of Custom Operators
Building an Operator is only part of the story; you must manage its full lifecycle in production.
Versioning and Upgrades
Key considerations:
- Version both CRDs and Operator logic:
- Use API versioning (v1alpha1, v1beta1, v1) as your interface stabilizes.
- Backward compatibility:
- Support older CR versions during migrations when possible.
- Provide conversion webhooks if you need in-place upgrades between API versions.
- Upgrade strategy:
- Use OLM’s channels (e.g., stable, fast) to roll out new versions.
- Carefully test upgrade paths, especially for stateful services.
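As an illustration, a CRD that serves both an old and a new API version while a webhook handles conversion could be declared roughly like this; the webhook service name, namespace, and path are assumptions.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresinstances.database.example.com
spec:
  group: database.example.com
  names:
    kind: PostgresInstance
    plural: postgresinstances
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true    # still accepted so existing CRs keep working
      storage: false
    - name: v1
      served: true
      storage: true   # the version persisted in etcd going forward
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: postgres-operator-webhook
          namespace: postgres-operator
          path: /convert
```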
Reliability and Error Handling
A robust custom Operator should:
- Handle transient errors with retries and backoff.
- Avoid tight reconciliation loops that overload the API server.
- Clearly represent error states in CR status fields.
- Provide actionable error messages for users (not just “failed”).
Consider:
- Idempotent operations: reconciliation should be safe to run repeatedly.
- Protective checks: guard against destructive operations (e.g., unintended data deletion on spec change).
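To make this concrete, a fragment in the same controller-runtime style as the earlier sketch might distinguish transient from persistent failures like this; ensureStatefulSet and setCondition are hypothetical helpers, and apierrors refers to k8s.io/apimachinery/pkg/api/errors.

```go
// Continues the PostgresInstanceReconciler sketch; assumes imports for
// time, apierrors "k8s.io/apimachinery/pkg/api/errors", and
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
func (r *PostgresInstanceReconciler) reconcileWorkload(ctx context.Context, pg *databasev1alpha1.PostgresInstance) (ctrl.Result, error) {
	if err := r.ensureStatefulSet(ctx, pg); err != nil {
		if apierrors.IsConflict(err) || apierrors.IsTooManyRequests(err) {
			// Transient: returning the error lets controller-runtime requeue
			// with rate-limited exponential backoff instead of hot-looping.
			return ctrl.Result{}, err
		}
		// Persistent: surface an actionable condition for users, then retry later.
		setCondition(pg, "Available", metav1.ConditionFalse, "ProvisioningFailed", err.Error())
		if statusErr := r.Status().Update(ctx, pg); statusErr != nil {
			return ctrl.Result{}, statusErr
		}
		return ctrl.Result{RequeueAfter: 2 * time.Minute}, nil
	}
	return ctrl.Result{}, nil
}
```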
Observability for Operators
To operate and debug custom Operators:
- Emit structured logs with correlation identifiers (e.g., CR name/namespace).
- Expose metrics:
- Reconciliation durations and error counts.
- Number of managed resources.
- Integrate with:
- Cluster logging and monitoring.
- Alerting (for persistent errors, reconciliation failures, or resource anomalies).
This allows SRE/platform teams to treat the Operator itself as a first-class, observable service.
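As a sketch of the metrics side, custom counters and gauges can be registered with controller-runtime's Prometheus registry so they appear on the manager's /metrics endpoint; the metric names and labels below are illustrative.

```go
package controllers

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Operator-specific metrics; controller-runtime already exposes generic
// reconcile counters and durations out of the box.
var (
	managedInstances = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "postgres_operator_managed_instances",
			Help: "Number of PostgresInstance resources currently managed.",
		},
		[]string{"namespace"},
	)
	reconcileFailures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "postgres_operator_reconcile_failures_total",
			Help: "Reconcile failures per PostgresInstance.",
		},
		[]string{"namespace", "name", "reason"},
	)
)

func init() {
	// Register with the controller-runtime registry so cluster monitoring
	// can scrape these alongside the built-in controller metrics.
	metrics.Registry.MustRegister(managedInstances, reconcileFailures)
}
```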
Security and Permissions
Custom Operators require RBAC permissions to manage resources on users’ behalf. Poorly scoped permissions can introduce risk.
RBAC Scoping
Design permission sets carefully:
- Use the principle of least privilege:
- Only grant verbs and resources the Operator actually needs.
- Differentiate:
- Namespaced Operators: scoped to specific namespaces.
- Cluster-scoped Operators: even with ClusterRoles, restrict permissions to the specific CRDs and core resources the Operator needs to function.
Review:
- ServiceAccount used by the Operator.
- ClusterRole and RoleBindings granted to that ServiceAccount.
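A namespaced permission set for such an Operator might look roughly like this; the names, namespaces, and resource list are illustrative and should be trimmed to what your controller actually touches.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: postgres-operator
  namespace: team-payments
rules:
  # Only the resources and verbs the Operator actually needs.
  - apiGroups: ["database.example.com"]
    resources: ["postgresinstances", "postgresinstances/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: postgres-operator
  namespace: team-payments
subjects:
  - kind: ServiceAccount
    name: postgres-operator
    namespace: postgres-operator
roleRef:
  kind: Role
  name: postgres-operator
  apiGroup: rbac.authorization.k8s.io
```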
Managing Sensitive Data
If your Operator handles secrets (e.g., DB credentials, TLS keys):
- Use Kubernetes Secrets for storage, not CR spec fields.
- Avoid logging sensitive values.
- Ensure status fields do not expose credentials.
- Support integration with external secret management solutions where appropriate.
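A sketch of that pattern: the CR carries only a reference to a Secret, and the Secret itself is created separately or synced from an external manager. The credentials field name is an assumption.

```yaml
apiVersion: database.example.com/v1
kind: PostgresInstance
metadata:
  name: app-db
spec:
  size: small
  credentials:
    existingSecretRef: app-db-credentials   # values live in the Secret, not the CR
---
apiVersion: v1
kind: Secret
metadata:
  name: app-db-credentials
type: Opaque
stringData:
  username: app_user
  password: REPLACE_ME   # or injected by an external secret management tool
```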
Example Design Pattern: Application Stack Operator
To illustrate how components fit together, consider a simplified pattern for a “WebApp” Operator that standardizes an application stack for development teams.
- CRD: WebApp with fields:
- spec.image
- spec.replicas
- spec.databaseRef
- spec.ingressHost
- Operator behavior:
- On WebApp creation:
- Create a Deployment for the app.
- Create or reference a DB CR (e.g., PostgresInstance).
- Create a Service and Route.
- On updates:
- Perform rolling updates of the Deployment.
- Adjust replicas.
- Update Route if host changes.
- On deletion:
- Clean up application Deployment, Service, and Route.
- Optionally delete or retain the DB instance based on a policy in spec.
Developers then submit a single WebApp YAML instead of manually wiring multiple resources, while platform teams encode best practices (resource limits, labels, security profiles, logging sidecars) inside the Operator.
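Such a WebApp manifest might look like the following; the API group and the values shown are illustrative.

```yaml
apiVersion: apps.example.com/v1alpha1
kind: WebApp
metadata:
  name: storefront
spec:
  image: registry.example.com/storefront:1.4.2
  replicas: 3
  databaseRef: storefront-db        # name of a PostgresInstance CR
  ingressHost: storefront.apps.example.com
```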
Best Practices for Custom Operators
When building custom Operators for OpenShift, follow these guidelines:
- Keep CRDs focused and stable
- Avoid exposing every tunable as a field; start with essential ones.
- Evolve API versions as usage grows; don’t break existing users without migration paths.
- Leverage OpenShift-native features
- Use Routes, SCCs, and OpenShift-specific objects where appropriate.
- Align with cluster-wide observability and logging.
- Integrate with OpenShift authentication/authorization patterns where relevant.
- Automate testing
- Unit tests for reconciliation logic.
- Integration tests on an ephemeral cluster (e.g., using kind, CRC, or test OpenShift clusters).
- Upgrade tests across Operator versions, particularly for stateful workflows.
- Document and support your APIs
- Provide examples and clear field descriptions.
- Explain typical failure modes and remediation steps.
- Treat custom CRDs as public contracts with your internal users.
- Plan for deprecation
- Mark fields or CR versions as deprecated when needed.
- Provide migration recommendations and tooling when APIs change.
By carefully designing, implementing, and operating custom Operators with these practices, you can turn OpenShift into a powerful, self-service platform tailored to your organization’s applications and operational standards.