Observability & Logging

Duration: 55 minutes
Format: Hands-on observability configuration

The Scenario

Something’s wrong with your cluster. How do you know? Where do you look? If you can’t answer that in 30 seconds, your observability stack isn’t ready for production.

In this module you’ll explore the metrics and alerting stack that’s already running (Prometheus, Alertmanager), set up a logging pipeline (Loki, Vector) so you can actually search logs, and deploy distributed tracing (OpenTelemetry, Tempo) to follow requests across services.

OpenShift Monitoring Architecture
If you were logged in as a different user in a previous module, log back in as admin. Console: Username {openshift_cluster_admin_username} / Password {openshift_cluster_admin_password}. Terminal: oc login -u {openshift_cluster_admin_username} -p {openshift_cluster_admin_password}

Built-in Metrics with Prometheus

OpenShift ships with a fully managed Prometheus stack - no installation required. It collects metrics from every component in the cluster.

Explore Cluster Metrics

View the Prometheus pods that are already running:

oc get pods -n openshift-monitoring -l app.kubernetes.io/name=prometheus

Query cluster CPU usage using the console metrics viewer:

  1. Switch to the OCP Console tab

  2. Navigate to Observe → Metrics

  3. Enter the query: cluster:cpu_usage_cores:sum

  4. Click Run Queries

You’ll see a time-series graph showing cluster CPU usage over time - far more useful than a point-in-time number.

Observe Metrics page showing cluster CPU usage query result
Try other queries like node_memory_MemAvailable_bytes or kubelet_running_pods to explore what metrics are available.
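
A few more queries worth trying (standard metric names from the built-in exporters, so they should return data on most clusters):

# total pods running across the cluster
sum(kubelet_running_pods)

# available memory per node, in GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

# per-namespace CPU usage over the last 5 minutes
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))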

For quick CLI snapshots, you can also use:

oc adm top nodes
oc adm top pods -A --sort-by=cpu | head -15

Dashboards in the Console

Navigate to Observe → Dashboards in the OpenShift console. Key dashboards include:

Observe Dashboards showing Kubernetes Compute Resources Cluster dashboard
  • etcd - Control plane database health

  • Kubernetes / Compute Resources / Cluster - Cluster-wide CPU and memory

  • Kubernetes / Compute Resources / Namespace (Pods) - Per-namespace breakdown

  • Node Exporter / USE Method / Cluster - Utilization, Saturation, Errors

User Workload Monitoring

OpenShift can also scrape metrics from your own applications. User workload monitoring is not enabled by default, but it has been pre-enabled on this workshop cluster. Verify it’s running:

oc get pods -n openshift-user-workload-monitoring

If pods are running, any application that exposes a /metrics endpoint can be scraped by creating a ServiceMonitor or PodMonitor resource.

This is already enabled on your workshop cluster - you don’t need to run this. On your own cluster, you would enable it with:
oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF

Once a ServiceMonitor points at your application’s /metrics endpoint, those metrics flow into the same query and alerting stack you just explored - this is also how the Horizontal Pod Autoscaler gets custom metrics for scaling decisions.
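
As a sketch, a ServiceMonitor for an application that exposes metrics on a Service port named metrics might look like this (the app name, namespace, labels, and port are hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # hypothetical application name
  namespace: my-app             # the namespace your workload runs in
spec:
  selector:
    matchLabels:
      app: my-app               # must match the labels on your Service
  endpoints:
  - port: metrics               # the named Service port that serves /metrics
    path: /metrics
    interval: 30s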

Alerting with Alertmanager

OpenShift includes Alertmanager for routing alerts to notification channels (email, Slack, PagerDuty, webhooks).

View Current Alerts

Check what alerting rules exist:

oc get prometheusrules -A | head -15

You can also view alerts in the console under Observe → Alerting.

Create a Custom Alert

You can define your own alerts for application namespaces. Here’s an example that fires when a pod is restarting repeatedly. We’ll create it in the openshift-monitoring namespace, where the built-in alerting rules live:

cat <<EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ops-track-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: ops-track
    rules:
    - alert: HighPodRestartRate
      expr: increase(kube_pod_container_status_restarts_total[5m]) > 2
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ \$labels.pod }} restarting frequently"
        description: "Pod {{ \$labels.pod }} in {{ \$labels.namespace }} has restarted more than 2 times in 5 minutes."
EOF

Verify the rule was created:

oc get prometheusrule ops-track-alerts -n openshift-monitoring

The rule exists but nothing is triggering it. Let’s fix that - deploy a pod that crash-loops so you can see the alert go from inactive to pending to firing:

oc new-project alert-demo 2>/dev/null || oc project alert-demo
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-test
  namespace: alert-demo
spec:
  containers:
  - name: crash
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["/bin/sh", "-c", "echo 'Starting...'; sleep 2; exit 1"]
  restartPolicy: Always
EOF
echo "Crash-looping pod deployed - it will restart every few seconds"

While the restarts accumulate, navigate to Observe → Alerting → Alerting Rules in the console and search for HighPodRestartRate.

It may take a minute or two for Prometheus to pick up the new rule. If you don’t see it, refresh the page. You may also need to click Clear all filters or remove any existing filter chips (click the x on them) before the rule appears in the search results.
Alerting Rules showing custom HighPodRestartRate rule

Click on the rule to see its details. After about 2-3 minutes (once the pod has restarted a few times and the for: 1m duration passes), you’ll see the alert transition from Inactive → Pending → Firing:

HighPodRestartRate rule detail showing expression and active alerts

Now switch to the Alerts tab. Filter by Alert State: Firing - you should see HighPodRestartRate with crashloop-test as the source pod:

Alerts tab showing currently firing alerts with severity and state

This is the view on-call engineers live in. You just created a rule, triggered it with a real failure, and watched it fire. Clean up the crash-looping pod:

oc delete pod crashloop-test -n alert-demo --ignore-not-found

Send Alerts to Email

Your alert fired - but who got notified? Right now, nobody. Alertmanager has receivers configured (Default, Watchdog, Critical) but none of them actually send notifications anywhere. Alerts fire, appear in the console, and that’s it.

In production, that means an overnight incident goes unnoticed until someone checks the console in the morning. Let’s fix that by wiring up email notifications.

Deploy a Mail Server

Deploy Mailpit - a lightweight mail server that catches emails and shows them in a web inbox. This simulates what an internal mail relay does in production:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/deploy-mailpit.sh)

Configure Alertmanager

Now configure Alertmanager to send emails through Mailpit. This routes warning alerts to ops-team@example.com and critical alerts to oncall@example.com:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/configure-alertmanager-email.sh)

The script will print the Mailpit inbox URL. Click it or copy it into your browser. Within a minute you should see alert emails arriving:

Mailpit inbox showing alert notification emails from Alertmanager

Notice the routing:

  • Warning alerts (like PodDisruptionBudgetAtLimit, KubeCPUOvercommit) go to ops-team@example.com

  • Critical alerts go to oncall@example.com

Click on any email to see the full alert details - alert name, namespace, severity, description, and a link to the Alertmanager UI.

In production, you would replace mailpit.alert-demo.svc:1025 with your organization’s SMTP relay and use real email addresses. Alertmanager also supports Slack, PagerDuty, and webhooks using the same routing structure - see Sending notifications to external systems.
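
For reference, the routing the script configures corresponds roughly to an alertmanager.yaml like this (the relay host and recipient addresses are the workshop values; smtp_from and the receiver names are assumptions - check the script or the alertmanager-main secret in openshift-monitoring for the exact config):

global:
  smtp_smarthost: mailpit.alert-demo.svc:1025   # replace with your SMTP relay in production
  smtp_from: alertmanager@example.com           # assumed sender address
  smtp_require_tls: false                       # Mailpit accepts plain SMTP
route:
  receiver: Default
  routes:
  - matchers:
    - severity = warning
    receiver: ops-team-email
  - matchers:
    - severity = critical
    receiver: oncall-email
receivers:
- name: Default
- name: ops-team-email
  email_configs:
  - to: ops-team@example.com
- name: oncall-email
  email_configs:
  - to: oncall@example.com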

Restore Alertmanager

Reset Alertmanager to its default configuration and clean up:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/restore-alertmanager.sh)

Takeaway: An alert that nobody receives is the same as no alert at all. Configuring receivers is one of the first things to do on a new cluster - without it, your monitoring is just decoration.

Set Up the Logging Stack

While metrics tell you what is happening, logs tell you why. OpenShift’s logging stack has three layers - it’s worth understanding each one, because when logging breaks in production you need to know which layer to troubleshoot.

The Architecture

The logging pipeline works like this:

  1. Collectors (Vector) run on every node and gather logs from containers, the OS, and the Kubernetes API

  2. Loki stores and indexes the logs (backed by S3-compatible object storage)

  3. Console UI plugin lets you query logs from the OpenShift web console

Each layer is managed by a separate operator, so you can upgrade or troubleshoot them independently.

Install the Logging Operators

Three operators are needed. Each one manages a different part of the pipeline. First, create the required namespaces and operator groups:

oc create namespace openshift-operators-redhat 2>/dev/null || true
oc create namespace openshift-logging 2>/dev/null || true
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-operators-redhat
  namespace: openshift-operators-redhat
spec: {}
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  targetNamespaces:
  - openshift-logging
EOF

Install the Loki Operator from the Software Catalog - this is the same process for any operator in OperatorHub.

Loki Operator (log storage)

  1. Navigate to Ecosystem → Software Catalog

  2. Search for Loki Operator

  3. Click the Loki Operator tile (select the one labeled Red Hat, not snapshot-lightspeed)

    Loki Operator details with Install button highlighted
  4. Click Install

  5. On the install configuration page, leave all defaults (channel, namespace, approval) and click Install

    Loki Operator install configuration page

You’ll see the "Installing Operator" confirmation:

Loki Operator installing confirmation

Remaining Operators

You’ve seen how OperatorHub works. The remaining operators use the same process - let’s script them:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/install-operators.sh)

Configure Log Storage (Loki + S3)

Loki needs object storage for log data. This cluster has OpenShift Data Foundation (ODF) which provides S3-compatible storage via NooBaa. In production, you’d point this at AWS S3, Azure Blob, or any S3-compatible store.

Create an S3 bucket for log storage:

cat <<EOF | oc apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: loki-bucket
  namespace: openshift-logging
spec:
  generateBucketName: loki-bucket
  storageClassName: openshift-storage.noobaa.io  # ODF's S3 provider
EOF

Wait for the bucket to be provisioned:

oc wait --for=jsonpath='{.status.phase}'=Bound obc/loki-bucket -n openshift-logging --timeout=60s

Now extract the S3 credentials, create the CA cert ConfigMap, and deploy the LokiStack. The script below does all three in a single pass so the intermediate values stay consistent:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/setup-loki-storage.sh)
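
For reference, the LokiStack the script deploys looks roughly like this (the size, schema date, secret/ConfigMap names, and storage class are assumptions - the script derives them from the ObjectBucketClaim you just created):

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.demo                        # smallest footprint; use 1x.small or larger in production
  storage:
    schemas:
    - version: v13
      effectiveDate: "2024-01-01"      # date from which this schema applies (assumed)
    secret:
      name: loki-bucket-credentials    # S3 endpoint, bucket, and keys from the OBC (name assumed)
      type: s3
    tls:
      caName: loki-bucket-ca           # ConfigMap holding the ODF S3 CA certificate (name assumed)
  storageClassName: ocs-storagecluster-ceph-rbd   # ODF block storage class (assumed)
  tenants:
    mode: openshift-logging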

Set Up Log Collection

The ClusterLogForwarder tells Vector (the collector) what to collect and where to send it. The collector runs as a DaemonSet - one pod per node - so it captures logs from every container on every node.

Create the collector service account with the required permissions:

oc create sa collector -n openshift-logging
oc adm policy add-cluster-role-to-user logging-collector-logs-writer -z collector -n openshift-logging
oc adm policy add-cluster-role-to-user collect-application-logs -z collector -n openshift-logging
oc adm policy add-cluster-role-to-user collect-audit-logs -z collector -n openshift-logging
oc adm policy add-cluster-role-to-user collect-infrastructure-logs -z collector -n openshift-logging

Wait for the LokiStack gateway CA bundle to be created (the ClusterLogForwarder needs it):

echo "Waiting for LokiStack gateway CA bundle..."
ELAPSED=0
until oc get configmap logging-loki-gateway-ca-bundle -n openshift-logging 2>/dev/null; do
  sleep 5; ELAPSED=$((ELAPSED+5))
  [ $ELAPSED -ge 300 ] && echo "ERROR: Timed out waiting for CA bundle" && break
done
echo "CA bundle ready"

Create the ClusterLogForwarder - this defines the pipeline:

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/setup-log-forwarding.sh)
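
The forwarder the script creates is roughly equivalent to the following (the resource name and TLS details are assumptions; the service account and LokiStack names match what you created above):

apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector                      # name assumed
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector                    # the SA you granted the collect-* roles to
  outputs:
  - name: default-lokistack
    type: lokiStack
    lokiStack:
      target:
        name: logging-loki
        namespace: openshift-logging
      authentication:
        token:
          from: serviceAccount
    tls:
      ca:
        key: service-ca.crt
        configMapName: logging-loki-gateway-ca-bundle
  pipelines:
  - name: all-logs
    inputRefs:
    - application
    - infrastructure
    - audit
    outputRefs:
    - default-lokistack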

Enable the Logging UI

The UIPlugin adds the Observe → Logs view to the OpenShift console:

cat <<EOF | oc apply -f -
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: logging
spec:
  type: Logging
  logging:
    lokiStack:
      name: logging-loki
EOF

Verify and View Logs

Wait for the collector pods to start before verifying the full stack:

echo "Waiting for collector pods to start..."
ELAPSED=0
until oc get pods -n openshift-logging -l app.kubernetes.io/component=collector --no-headers 2>/dev/null | grep -q Running; do
  sleep 10; ELAPSED=$((ELAPSED+10))
  [ $ELAPSED -ge 300 ] && echo "ERROR: Timed out - check 'oc get pods -n openshift-logging'" && break
done
oc get pods -n openshift-logging

You should see the logging operator, collector pods (one per node), and multiple Loki component pods all Running.

After the UIPlugin is created, you may see a notification in the console:

Web console update notification

Click Refresh to reload the console with the new Logs view. This may log you out of the console - if so, log back in with:

  • Username: {openshift_cluster_admin_username}

  • Password: {openshift_cluster_admin_password}

Navigate to Observe → Logs in the console. You’ll see three log types:

Observe Logs view showing infrastructure logs with severity filter
  • Application - Logs from user workloads

  • Infrastructure - Logs from OpenShift components

  • Audit - Kubernetes API audit logs

Log type selector

Use the filters to search by namespace, pod name, or severity level:

Log filters

Filter by severity (Critical, Error, Warning, Info, Debug):

Severity filter

Try it now: select Infrastructure logs, then filter by severity Warning or Error. These are the log entries that matter during an incident - pod evictions, OOM kills, certificate expiry warnings.

The histogram at the top shows log volume over time - spikes in the error count correlate with incidents. Click on a spike to zoom into that time window.

Log histogram

It may take 1-2 minutes for logs to start appearing after the collector pods are running. If you see "No datapoints found", wait a moment and refresh.
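
If you prefer typing queries over clicking filters, the Logs view also accepts LogQL directly (toggle Show Query). A couple of sketches - the label names follow the OpenShift log schema, so adjust if yours differ:

# infrastructure logs containing the word "error"
{ log_type="infrastructure" } |= "error"

# application logs from a specific namespace
{ log_type="application", kubernetes_namespace_name="alert-demo" }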

OpenTelemetry for Distributed Tracing

Metrics tell you what is happening. Logs tell you why. Traces tell you where - when a request crosses multiple microservices, traces show the exact path and timing across every service. If a user reports "checkout is slow", a trace shows you: frontend (2ms) → API gateway (5ms) → inventory service (800ms) → payment (3ms) - so you instantly see inventory is the bottleneck.

Developers instrument their apps with OpenTelemetry SDKs, but ops teams own the tracing infrastructure - deploying the collector, managing the storage backend, and making traces visible in the console.

The Tempo and OpenTelemetry operators were already installed in the scripted block earlier. Let’s deploy the tracing backend.

Deploy Tempo (Trace Storage)

Tempo is the backend that stores and indexes traces - similar to how Loki stores logs. We’ll use TempoMonolithic, a single-binary deployment suitable for workshops and small clusters.

Create the namespace and deploy TempoMonolithic with multi-tenancy enabled:

oc create namespace tracing-system 2>/dev/null || true
cat <<EOF | oc apply -f -
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoMonolithic
metadata:
  name: sample
  namespace: tracing-system
spec:
  storage:
    traces:
      backend: pv
      size: 10Gi
  jaegerui:
    enabled: true
    route:
      enabled: true
  multitenancy:
    enabled: true
    mode: openshift
    authentication:
      - tenantName: dev
        tenantId: "1610b0c3-c509-4592-a256-a1871353dbfa"
EOF

Multi-tenancy is required for the console Traces UI to detect the Tempo instance. The openshift mode uses OpenShift’s built-in authentication - no external OIDC setup needed.

Wait for Tempo to be ready:

echo "Waiting for Tempo pod..."
ELAPSED=0
until oc get pods -n tracing-system -l app.kubernetes.io/component=tempo 2>/dev/null | grep -q Running; do
  sleep 10; ELAPSED=$((ELAPSED+10))
  [ $ELAPSED -ge 180 ] && echo "ERROR: Timed out" && break
done
oc get pods -n tracing-system -l app.kubernetes.io/component=tempo

Set Up RBAC

The tracing pipeline needs two sets of permissions: the collector needs write access to push traces, and users need read access to query them.

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/setup-tracing-rbac.sh)

Deploy the OpenTelemetry Collector

The collector is the central receiving station for trace data. Apps send traces to the collector, and the collector forwards them to Tempo for storage. Think of it like Vector for logs - a pipeline component that sits between producers and storage.

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/09-observability/setup-otel-collector.sh)

This collector:

  • Receives traces via OTLP on ports 4317 (gRPC) and 4318 (HTTP)

  • Batches incoming spans for efficiency

  • Exports to the Tempo gateway, authenticating with the service account token
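
For context, a minimal OpenTelemetryCollector CR has the same shape as the one the script deploys - receivers, processors, exporters, and a service pipeline wiring them together. This simplified sketch just logs received spans instead of exporting to Tempo (the script’s version adds the Tempo gateway exporter and token authentication):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel                       # the operator creates a Service named <name>-collector
  namespace: tracing-system
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}                 # listens on 4317
          http: {}                 # listens on 4318
    processors:
      batch: {}                    # batch spans before export
    exporters:
      debug: {}                    # print spans to the collector log (stand-in for the Tempo exporter)
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]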

Verify the collector is running:

oc get pods -n tracing-system

You should see both the Tempo pod and the collector pod Running.

Enable the Traces Console UI

Like the logging UI, the Traces view needs a UIPlugin:

cat <<EOF | oc apply -f -
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: distributed-tracing
spec:
  type: DistributedTracing
EOF

You may see a console notification to refresh - click Refresh to reload.

Send a Test Trace

The collector is ready but nothing is sending traces yet. Let’s send a test trace to verify the full pipeline:

oc run test-trace --rm -i --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi-minimal \
  -n tracing-system -- sh -c '
TIMESTAMP=$(date +%s)
curl -s -X POST http://otel-collector.tracing-system.svc:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d "{
    \"resourceSpans\": [{
      \"resource\": {
        \"attributes\": [{\"key\": \"service.name\", \"value\": {\"stringValue\": \"workshop-demo\"}}]
      },
      \"scopeSpans\": [{
        \"spans\": [{
          \"traceId\": \"11112222333344445555666677778888\",
          \"spanId\": \"aaaa222233334444\",
          \"name\": \"GET /api/weather\",
          \"kind\": 2,
          \"startTimeUnixNano\": \"${TIMESTAMP}000000000\",
          \"endTimeUnixNano\": \"${TIMESTAMP}500000000\",
          \"attributes\": [{\"key\": \"http.method\", \"value\": {\"stringValue\": \"GET\"}}]
        }]
      }]
    }]
  }" && echo "Trace sent successfully"
'

This simulates what an instrumented application does - sends a span describing a single operation (GET /api/weather) with timing data.

View Traces in the Console

Loading the Traces view may log you out of the console. If so, log back in with:

  • Username: {openshift_cluster_admin_username}

  • Password: {openshift_cluster_admin_password}

Navigate to Observe → Traces in the console. Select the Tempo instance (tracing-system / sample) and tenant (dev):

Observe Traces view showing workshop-demo trace with scatter graph

You should see the workshop-demo trace. Click on it to see the trace detail view - a timing waterfall showing each span:

Trace detail view showing GET /api/weather span with 500ms duration

With a single service this shows one span. In a real microservices application, you’d see multiple spans stacked in a waterfall - frontend → API → database - making it immediately obvious where latency is coming from.

The collector is now ready for production use. Developers point their apps at otel-collector.tracing-system.svc:4317 using OpenTelemetry SDKs, and traces automatically appear here.
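
In practice that usually means setting the standard OpenTelemetry environment variables on the workload, with the SDK or auto-instrumentation agent in the image. A sketch (the deployment and values are hypothetical):

# snippet from a Deployment's container spec
env:
- name: OTEL_SERVICE_NAME
  value: checkout                                         # how the service appears in the Traces view
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: http://otel-collector.tracing-system.svc:4318    # OTLP over HTTP; use 4317 for gRPC
- name: OTEL_TRACES_EXPORTER
  value: otlp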

Cleanup & Summary

Remove the observability resources you deployed. The Prometheus monitoring stack is built-in and doesn’t need cleanup.

bash <(curl -sL https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/cleanup-scripts/cleanup-observability.sh)

Summary

What you learned:

  • OpenShift’s built-in Prometheus stack provides metrics and dashboards with no setup

  • How to create custom alerting rules with PrometheusRule and configure email notifications

  • How to install and configure the full logging stack (Loki + Vector)

  • How to view and filter logs in the OpenShift console

  • How to deploy the OpenTelemetry tracing pipeline (Tempo + OTel Collector)

  • How to view distributed traces in the console

Together, metrics, logs, and traces give you the three pillars of observability - each answers a different question about your cluster and workloads.

Key operational commands:

# Cluster metrics
oc adm top nodes
oc adm top pods -A --sort-by=cpu | head -15

# View alerting rules
oc get prometheusrules -A

# View logging pods
oc get pods -n openshift-logging

# Check LokiStack status
oc get lokistack -n openshift-logging

# Check ClusterLogForwarder status
oc get clusterlogforwarder -n openshift-logging

# View tracing pods
oc get pods -n tracing-system

# Check TempoMonolithic status
oc get tempomonolithic -n tracing-system

Additional Resources