Why Distributed Tracing Matters on OpenShift
In a microservices-based application on OpenShift, a single user request often flows through many services: API gateways, backends, caches, databases, and external APIs. Traditional logs and metrics tell you what is happening on each component, but not easily how a single request moved across the system.
Distributed tracing solves that by:
- Following a request end-to-end across services.
- Measuring latency at each hop.
- Revealing where time is spent and where failures originate.
- Providing context to correlate with logs and metrics.
In OpenShift, distributed tracing is especially valuable because:
- Applications are dynamic: pods scale, move, and restart.
- Network paths are not static: service meshes, ingresses, and sidecars add complexity.
- You frequently debug multi-tenant, multi-team systems.
Core Concepts: Traces, Spans, and Context
Distributed tracing relies on a few core concepts you must understand to use it effectively on OpenShift:
- Trace
  Represents a single end-to-end request or workflow.
  Example: “GET /checkout” from the user’s browser through all backend services.
- Span
  A single operation within a trace, such as:
  - An HTTP request from one service to another.
  - A database query.
  - A cache lookup.
  Each span has:
  - A name (e.g., `GET /payments`, `SELECT orders`).
  - A start timestamp and duration.
  - Tags/attributes (key–value pairs such as `http.status_code`, `db.system`).
  - A parent–child relationship (except the root span).
- Trace context (propagation)
  Metadata (trace ID, span ID, sampling info) that is passed between services.
  For HTTP, this is typically carried in headers such as:
  - `traceparent` (W3C Trace Context standard)
  - `b3` / `x-b3-*` (Zipkin/B3)
On OpenShift, propagation headers must be forwarded by your applications and by any service mesh or ingress in the path, or your traces will be fragmented.
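As a sketch, the formats an OpenTelemetry SDK uses for propagation can be selected with the standard `OTEL_PROPAGATORS` environment variable in a container spec (the value shown is one common combination, not a requirement):

```yaml
# Container env selecting which propagation formats the OpenTelemetry SDK
# reads and writes: tracecontext = W3C traceparent, b3multi = Zipkin x-b3-*.
env:
  - name: OTEL_PROPAGATORS
    value: tracecontext,baggage,b3multi
```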
Distributed Tracing Tooling on OpenShift
OpenShift does not enforce a single tracing tool but commonly works with:
- OpenTelemetry
- Vendor-neutral standard for APIs, SDKs, and collectors.
- Often the primary way to instrument applications and export traces.
- Supports many backends (Jaeger, Tempo, Zipkin, commercial APM tools).
- Jaeger
- Open-source tracing backend frequently deployed on OpenShift.
- Provides UI for searching, viewing, and analyzing traces.
- Often installed via an Operator.
- Service mesh tracing (e.g., Istio/Red Hat OpenShift Service Mesh)
- Sidecars (Envoy proxies) automatically create spans for inbound/outbound traffic.
- Integrates with Jaeger or other backends for visualization.
- Reduces the need for manual instrumentation of basic HTTP/RPC calls.
Depending on your environment, your platform team may already provide:
- A cluster-level tracing backend (Jaeger, Tempo).
- An OpenTelemetry Collector deployed as a central service.
- A service mesh with baked-in tracing support.
How Tracing Fits into OpenShift Observability
Within OpenShift’s overall observability stack, tracing complements metrics and logs:
- Metrics: “Is the service healthy? Is error rate high? Is latency increasing?”
- Logs: “What happened inside a specific component at a certain time?”
- Traces: “How did this particular request flow? Where was the slowdown or failure?”
Useful combined workflows:
- Use metrics to detect a slow endpoint.
- Jump into traces for that endpoint to see:
- Which downstream service is slow.
- How much time network hops add.
- Drill into logs from the spans that show errors.
On OpenShift you often:
- Attach tracing to Routes, Ingress, or mesh Gateways to see entry points.
- Correlate trace IDs with application logs (e.g., log the trace ID in your app).
Implementing Distributed Tracing for Applications on OpenShift
1. Decide what to trace and the sampling strategy
You rarely trace every single request in production due to overhead and storage:
- Head-based sampling at the edge:
- Decide at the start of the trace (e.g., 1% of requests).
- Common for high-traffic systems.
- Tail-based sampling (often via the collector):
- Sample based on trace outcome (e.g., errors, high latency).
- More expensive but focuses on interesting traces.
On OpenShift, sampling configuration is typically managed via:
- Environment variables in deployments/configurations.
- OpenTelemetry Collector configuration `ConfigMap`s.
- Mesh and ingress settings for the sampling rate.
Common pattern in production:
- Low default sample rate (e.g., 0.1–1%).
- Always-sample for error traces or specific routes.
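That pattern can be sketched with the Collector's `tail_sampling` processor; the policy names and thresholds below are illustrative:

```yaml
# Illustrative tail_sampling processor config: keep all error traces and
# all slow traces, plus a 1% probabilistic baseline of everything else.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```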
2. Instrumenting applications
You instrument code to create spans and propagate context. In practice:
- Use OpenTelemetry SDKs for your language.
- Use framework integrations (e.g., Spring Boot, Quarkus, Node.js frameworks) for HTTP and DB calls.
- Configure exporters to send traces to:
- An in-cluster OpenTelemetry Collector service.
- Directly to Jaeger/Tempo/APM endpoint (less flexible).
Typical OpenShift-specific configuration:
- Configure the collector endpoint via environment variables:
  - `OTEL_EXPORTER_OTLP_ENDPOINT`
  - `OTEL_SERVICE_NAME`
- Use a `ConfigMap` or `Secret` to manage those settings.
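Those variables are often centralized in a `ConfigMap`; a sketch (names and values are illustrative):

```yaml
# Illustrative ConfigMap holding shared tracing settings; containers can
# load it with envFrom -> configMapRef: {name: tracing-env}.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tracing-env
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
  OTEL_TRACES_SAMPLER: parentbased_traceidratio
  OTEL_TRACES_SAMPLER_ARG: "0.1"
```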
Example deployment snippet showing tracing-related environment variables:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  template:
    spec:
      containers:
        - name: app
          image: myorg/checkout:latest
          env:
            - name: OTEL_SERVICE_NAME
              value: checkout-service
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector:4317
            - name: OTEL_TRACES_SAMPLER
              value: parentbased_traceidratio
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
```

3. Using OpenTelemetry Collector on OpenShift
The OpenTelemetry Collector acts as a central point in the cluster to:
- Receive spans from applications (OTLP/HTTP, OTLP/gRPC).
- Process and sample traces (e.g., tail-based sampling, filtering).
- Export data to one or multiple backends (Jaeger, Tempo, vendor APM).
On OpenShift, it is commonly deployed as:
- A Deployment (centralized).
- A DaemonSet (agent on every node).
- Or via an Operator, which manages its lifecycle and configuration.
A minimal Collector configuration often includes:
- receivers: OTLP (from apps and mesh sidecars).
- processors: batching, sampling.
- exporters: Jaeger, logging (for debugging), others.
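A sketch of such a configuration; the Jaeger service name is an assumption for your cluster, and newer Collector releases use the `debug` exporter in place of `logging`:

```yaml
# Minimal Collector pipeline: receive OTLP, batch, export to Jaeger (OTLP)
# and to the console for debugging.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # assumed in-cluster Jaeger service
    tls:
      insecure: true
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
```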
OpenShift Service Mesh and Automatic Tracing
If you use Red Hat OpenShift Service Mesh:
- Each pod gets an Envoy sidecar.
- The sidecar automatically:
- Starts spans for inbound requests.
- Creates child spans for outbound calls to other services.
- Adds mesh-level metadata (e.g., source/destination service, namespace).
- You still benefit from manual instrumentation in application code for:
- Business logic spans (e.g., `calculate_discount`, `render_invoice`).
- Non-HTTP operations (asynchronous jobs, background workers).
Key considerations on OpenShift:
- Mesh configuration determines sampling and backend (Jaeger/OTel).
- You must ensure ingress/egress gateways propagate trace headers.
- Non-mesh traffic (e.g., direct DB connections) still needs explicit spans if you care about them.
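As a sketch, in Istio-based meshes the trace sampling rate can be set declaratively with the Telemetry API; the namespace, percentage, and API version below are illustrative and vary by mesh release:

```yaml
# Mesh-wide trace sampling via Istio's Telemetry API; applying it in the
# control-plane namespace makes it the default for all workloads.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - randomSamplingPercentage: 1.0
```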
Tracing Across OpenShift Boundaries
Distributed traces often need to cross:
- Ingress/Routes: from external clients into the cluster.
- Other clusters or clouds: hybrid or multi-cluster systems.
- External services: SaaS APIs, external databases.
To keep a single coherent trace:
- Use standard trace propagation formats (W3C `traceparent` is recommended).
- Configure edge components (ingress controllers, API gateways) to:
- Accept incoming trace headers from clients.
- Start a new trace only if none exists.
- Ensure your external services:
- Forward or maintain trace headers (if under your control).
- Or at least log incoming trace IDs for partial correlation.
Hybrid HPC + OpenShift or batch workloads can also be traced if:
- Job launchers and batch workers are instrumented.
- Trace context is propagated via job metadata, queues, or environment variables.
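One way to carry context into a batch worker is an environment variable that the worker's instrumentation reads at startup; `TRACEPARENT` is a convention some OpenTelemetry tooling supports, and the image name and header value below are made-up examples:

```yaml
# Sketch: handing a W3C trace context to a Kubernetes Job so the worker's
# spans join the trace that launched it.
apiVersion: batch/v1
kind: Job
metadata:
  name: invoice-batch
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: myorg/invoice-worker:latest   # illustrative image
          env:
            - name: TRACEPARENT               # read by the worker's tracer
              value: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```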
Working with Traces in the UI
Once tracing is enabled and data is flowing to Jaeger or another backend, you typically:
- Search for traces
- Filter by service, operation (span name), HTTP status, error flag, or duration.
- Narrow to a time window when an incident occurred.
- Inspect a single trace
- View the “waterfall” chart of spans.
- Quickly see:
- Which services were involved.
- Where most of the latency lies.
- Which spans are marked as errors.
- Analyze patterns
- Compare traces before and after a deployment.
- Look for new spans or changes in duration.
- Detect “fan-out” patterns (one service calling many others) that cause latency spikes.
- Correlate with logs
- Many teams include the trace ID in logs.
- From a trace, you can copy its ID and search for it in Loki/Elasticsearch or your log backend.
Common Patterns and Use Cases on OpenShift
Typical use cases where distributed tracing shines in OpenShift environments:
- Debugging slow requests
- Identify whether slowness is in:
- Application logic.
- Downstream service call.
- Network or TLS handshake.
- Database queries.
- Analyzing retries and errors
- Observe retries introduced by:
- Circuit breakers and timeouts.
- Service mesh or client libraries.
- Determine if retries actually improve reliability or just add load.
- Understanding complex call graphs
- Visualize which services depend on which.
- Identify “hotspots” or overly chatty services.
- Validating architecture changes
- After splitting a monolith into microservices, verify:
- Number of calls per request.
- End-to-end latency impact.
- Correct propagation of context.
Best Practices for Distributed Tracing on OpenShift
- Standardize trace propagation
- Choose W3C `traceparent` as the default where possible.
- Ensure all services and ingress/mesh components respect it.
- Instrument at meaningful boundaries
- In addition to automatic HTTP spans:
- Create spans around business operations.
- Group related low-level calls into higher-level spans.
- Tune sampling thoughtfully
- Low sample rate for routine production traffic.
- Higher or full sampling for:
- Staging/pre-production.
- Critical transactions.
- Troubleshooting windows.
- Align naming and tagging conventions
- Consistent `service.name`, `http.target`, `db.system` tags.
- Use attributes for tenant/namespace, version, region/zone when helpful.
- Integrate with deployment workflows
- Include tracing configuration in:
- `Deployment`/`DeploymentConfig` manifests.
- Helm templates or GitOps repos.
- Validate tracing in CI/CD (e.g., smoke tests that verify spans appear).
- Watch cost and storage
- Monitor trace volume and storage usage.
- Use retention policies and sampling to balance visibility and cost.
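For example, trace retention in a Grafana Tempo backend is a single compactor setting (the value shown is illustrative):

```yaml
# Tempo: delete trace blocks older than 48 hours.
compactor:
  compaction:
    block_retention: 48h
```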
Typical Pitfalls on OpenShift
Common issues you may encounter:
- Broken traces due to header loss
- Ingress or internal proxies strip or override trace headers.
- Fix: configure them to forward `traceparent`/`b3` headers properly.
- Multiple, incompatible propagation formats
- Some services use B3, others W3C only.
- Fix: configure OpenTelemetry and mesh to support and translate where needed.
- No spans for background jobs or async flows
- Asynchronous processing systems are left uninstrumented.
- Fix: carry trace context through queues, schedulers, or job metadata.
- High overhead from too much detail
- Very fine-grained spans (e.g., every loop iteration).
- Fix: focus on meaningful operations and aggregate where possible.
By combining proper instrumentation, consistent propagation, and a well-managed tracing backend, distributed tracing on OpenShift becomes a powerful tool for understanding and debugging modern, containerized applications.