Debugging & Troubleshooting

Module Overview

Duration: 30 minutes
Format: Hands-on Lab — Real Problem Solving
Audience: IT Operations, SRE, Platform Engineers

Your cluster is running a weather application made up of several microservices. Some of them have just broken. Your job is to diagnose and fix each one — these are the same kinds of issues that generate real support tickets.

Learning Objectives

By the end of this module, you will be able to:

Read pod events and logs to identify root causes
Fix ImagePullBackOff, CrashLoopBackOff, ConfigMap errors, resource constraint violations, and service routing issues
Use the Observe → Diagnose → Fix → Verify workflow

Your Diagnostic Toolkit

These are the commands you’ll use constantly:

# What's broken?
oc get pods -n <namespace>

# Why is it broken?
oc describe pod <pod-name> -n <namespace>

# What did the app say before it died?
oc logs <pod-name> -n <namespace>
oc logs <pod-name> -n <namespace> --previous

# Check deployment events (for pods that never got created)
oc describe deployment <deployment-name> -n <namespace>
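
The first question, "what's broken?", can be narrowed to a short list by filtering the pod table. The following is a minimal sketch, not part of the lab materials: the function name unhealthy_pods is hypothetical, and it assumes the default column order of oc get pods (NAME, READY, STATUS).

```shell
# Hypothetical helper: list pods that need attention in a namespace.
# Assumes the default `oc get pods` columns: NAME READY STATUS ...
unhealthy_pods() {
  oc get pods -n "$1" --no-headers | awk '
    {
      split($2, ready, "/")            # READY column, e.g. 0/1
      if ($3 == "Completed") next      # finished pods are not a problem
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
    }'
}
```

Usage: unhealthy_pods <namespace>. This only catches failures that are visible on a pod. A pod that was never created, or a Running pod behind a misconfigured service, will not show up here.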

Deploy the Broken Applications

First, create the namespace and deploy the intentionally broken applications:

oc create namespace ops-track-demo 2>/dev/null || true
oc apply -f https://raw.githubusercontent.com/rhpds/openshift-days-ops-showroom/main/support/broken-apps.yaml

See What’s Broken

Five microservices have been deployed with intentional misconfigurations. Let’s triage the damage.

Switch to the OCP Console tab and navigate to Workloads → Pods (select project ops-track-demo from the dropdown). You’ll immediately see colour-coded status indicators:

Figure: Pod list showing multiple failure states (ImagePullBackOff, CrashLoopBackOff, and CreateContainerConfigError)

Red — weather-frontend in ImagePullBackOff, weather-api in CrashLoopBackOff
Yellow — weather-backend in CreateContainerConfigError
Green — weather-proxy shows Running (but has a hidden issue)
Missing — weather-cache doesn’t appear at all

This is how real incident triage works — visual scanning of pod status to quickly identify what needs attention.

You can also verify via CLI:

oc get pods -n ops-track-demo

The CLI output confirms two things the status column alone can hide: weather-cache has no pod at all, and weather-proxy reports Running even though it has a problem of its own.
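
A pod that was never created cannot appear in the pod list, so it helps to look one level up at the deployments. The sketch below is an illustration only: unready_deployments is a hypothetical name, and it assumes the default oc get deployments columns (NAME, READY, ...).

```shell
# Hypothetical helper: list deployments whose ready count is below the
# desired count. Assumes the default columns: NAME READY UP-TO-DATE ...
unready_deployments() {
  oc get deployments -n "$1" --no-headers | awk '
    {
      split($2, r, "/")                # READY column, e.g. 0/1
      if (r[1] != r[2]) print $1, $2
    }'
}
```

Usage: unready_deployments ops-track-demo. A deployment listed here with no matching pod in oc get pods is a hint that pod creation itself is being rejected.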

Your mission: Fix all five.

Problem 1: ImagePullBackOff (weather-frontend)

Diagnose

In the OCP Console, click on the weather-frontend pod. The Events tab at the bottom shows the error with colour-coded severity — you’ll see the image pull failure immediately.

Figure: weather-frontend pod Events tab showing ImagePullBackOff error

You can also check via CLI:

oc describe pod -l app=weather-frontend -n ops-track-demo | tail -10

Look at the Events section. You should see:

Warning  Failed   Back-off pulling image "registry.access.redhat.com/ubi9/httpd-24:9.9"

The image tag 9.9 doesn’t exist. Let’s confirm the image reference:

oc get deployment weather-frontend -n ops-track-demo -o jsonpath='{.spec.template.spec.containers[0].image}'
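
When the image reference is long, it can be easier to isolate just the tag. This is a small sketch using shell parameter expansion. The image_tag function is hypothetical, and it only reads the reference string. It does not contact the registry, so it cannot tell you whether the tag exists.

```shell
# Hypothetical helper: print the tag portion of an image reference.
image_tag() {
  local ref=$1 last
  last=${ref##*/}                  # final path component, e.g. httpd-24:9.9
  case $last in
    *@*) echo "(digest)" ;;        # pinned by digest, no tag to inspect
    *:*) echo "${last##*:}" ;;     # text after the last colon
    *)   echo "latest" ;;          # no tag given, so latest is assumed
  esac
}
```

Usage: image_tag "$(oc get deployment weather-frontend -n ops-track-demo -o jsonpath='{.spec.template.spec.containers[0].image}')". Looking only at the final path component keeps a registry port such as localhost:5000 from being mistaken for a tag.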

Fix

Update the image to the correct tag (latest):

oc set image deployment/weather-frontend weather-frontend=registry.access.redhat.com/ubi9/httpd-24:latest -n ops-track-demo

Verify

oc get pods -l app=weather-frontend -n ops-track-demo

You should see 1/1 Running.

Takeaway: ImagePullBackOff = wrong image name, a tag that doesn’t exist, or missing pull credentials. Always check oc describe pod → Events.

Problem 2: CrashLoopBackOff (weather-api)

Diagnose

The pod keeps starting and crashing. In the OCP Console, click the weather-api pod, then the Logs tab. The log viewer shows the error message with real-time streaming. If the pod has restarted, use the Container dropdown to select the previous container’s logs.

Figure: weather-api pod Logs tab showing missing config file error

You can also check via CLI:

oc logs deployment/weather-api -n ops-track-demo

You should see:

Error: Missing required config file /etc/weather/config.yaml

The app is trying to read a config file that doesn’t exist.

Fix

Create the missing config file as a ConfigMap and mount it into the container. The patch below also replaces the container command with a stand-in that prints the config file and then sleeps, so the pod stays running:

oc create configmap weather-api-config --from-literal=config.yaml="port: 8080" -n ops-track-demo

oc patch deployment weather-api -n ops-track-demo --type=json -p='[
  {"op":"add","path":"/spec/template/spec/volumes","value":[{"name":"config","configMap":{"name":"weather-api-config"}}]},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts","value":[{"name":"config","mountPath":"/etc/weather"}]},
  {"op":"replace","path":"/spec/template/spec/containers/0/command","value":["/bin/sh","-c","cat /etc/weather/config.yaml && echo Config loaded successfully && sleep 3600"]}
]'

Verify

oc get pods -l app=weather-api -n ops-track-demo

You should see 1/1 Running.

Takeaway: CrashLoopBackOff = the app starts then exits. Always check oc logs and oc logs --previous. Common causes: missing config, bad command, failed health checks, out of memory.
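
The common causes listed above leave different traces in the pod status. The sketch below prints each container's restart count and the reason and exit code of its last termination (for example, a reason of OOMKilled points at memory rather than configuration). The function name last_exit is hypothetical. The fields it reads come from the pod's status.containerStatuses.

```shell
# Hypothetical helper: show restart count plus last termination reason
# and exit code for every container in a pod.
# $1 = pod name, $2 = namespace
last_exit() {
  oc get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Usage: last_exit <pod-name> ops-track-demo. The last two columns stay empty for a container that has not been terminated yet.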

Problem 3: Resource Constraint Violation (weather-cache)

Diagnose

You may have noticed weather-cache doesn’t appear in the pod list at all. In the OCP Console, navigate to Workloads → Deployments (project: ops-track-demo) and click weather-cache. The Events tab immediately shows the FailedCreate error explaining why no pod was created.

Figure: weather-cache deployment Events showing FailedCreate due to LimitRange violation

Let’s also check via CLI:

oc get deployment weather-cache -n ops-track-demo

It shows 0/1 ready — the pod was never created. When a pod can’t even be created, check the ReplicaSet events:

oc describe replicaset -l app=weather-cache -n ops-track-demo | tail -5

You should see:

Warning  FailedCreate  ...  maximum memory usage per Container is 2Gi, but limit is 1Ti

A LimitRange in the namespace caps each container at 2Gi of memory, but the deployment sets a memory limit of 1Ti and a request of 512Gi. Both are unit typos (they should be 1Gi and 512Mi). Let’s confirm:

oc get deployment weather-cache -n ops-track-demo -o jsonpath='{.spec.template.spec.containers[0].resources}' | python3 -m json.tool

oc get limitrange -n ops-track-demo -o yaml | grep -A4 "max:"
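
To see how far off the values are, it helps to convert the quantities to bytes. The following is a worked sketch: to_bytes and fits_limit are hypothetical helpers that handle only whole numbers with the binary suffixes Ki, Mi, Gi, and Ti. Decimal suffixes and fractional values are not covered.

```shell
# Hypothetical helper: convert a memory quantity such as 512Mi to bytes.
# Handles only integer values with binary suffixes (Ki, Mi, Gi, Ti).
to_bytes() {
  local q=$1
  case $q in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *Ti) echo $(( ${q%Ti} * 1024 * 1024 * 1024 * 1024 )) ;;
    *)   echo "$q" ;;              # no suffix: already plain bytes
  esac
}

# Succeeds when quantity $1 does not exceed maximum $2.
fits_limit() {
  [ "$(to_bytes "$1")" -le "$(to_bytes "$2")" ]
}
```

For example, to_bytes 1Ti gives 1099511627776 while to_bytes 2Gi gives 2147483648, so the 1Ti limit is 512 times the LimitRange maximum. fits_limit 1Gi 2Gi succeeds, which is why the corrected value is accepted.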

Fix

Fix the memory typos: change the request from 512Gi to 512Mi and the limit from 1Ti to 1Gi:

oc set resources deployment/weather-cache -n ops-track-demo --requests=cpu=100m,memory=512Mi --limits=cpu=500m,memory=1Gi

Verify

oc get pods -l app=weather-cache -n ops-track-demo

You should see 1/1 Running. The pod can now be created because its memory limit fits within the LimitRange maximum.

Takeaway: When a pod never appears at all, check the ReplicaSet events (oc describe replicaset). LimitRanges and ResourceQuotas can silently prevent pods from being created.

Problem 4: CreateContainerConfigError (weather-backend)

Diagnose

oc describe pod -l app=weather-backend -n ops-track-demo | grep -E "Warning|Error|configmap"

You should see:

Warning  Failed  Error: configmap "weather-config" not found

The deployment references a ConfigMap that doesn’t exist:

oc get deployment weather-backend -n ops-track-demo -o jsonpath='{.spec.template.spec.containers[0].envFrom}' | python3 -m json.tool

Fix

Create the missing ConfigMap with the keys the app expects:

oc create configmap weather-config \
  --from-literal=API_ENDPOINT="http://weather-api:8080" \
  --from-literal=CACHE_TTL="300" \
  -n ops-track-demo

You can verify the ConfigMap in the console at Workloads → ConfigMaps (project: ops-track-demo). Click weather-config to see its keys and values.

Restart the pod to pick up the new ConfigMap:

oc delete pod -l app=weather-backend -n ops-track-demo

Verify

oc get pods -l app=weather-backend -n ops-track-demo

You should see 1/1 Running. Confirm the environment variables are injected:

oc exec deployment/weather-backend -n ops-track-demo -- env | grep -E "API_ENDPOINT|CACHE_TTL"

You should see:

API_ENDPOINT=http://weather-api:8080
CACHE_TTL=300

In the console, click the weather-backend pod and select the Environment tab to see the injected variables without shelling into the container.

Takeaway: CreateContainerConfigError = a referenced ConfigMap or Secret doesn’t exist. Check oc describe pod for "not found" messages. Create the missing resource and restart the pod.
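
The same check can be run before a pod ever fails. The sketch below lists ConfigMaps that a deployment references through envFrom but that do not exist in the namespace. The function name missing_configmaps is hypothetical, and the check is deliberately narrow: it ignores Secrets, volume mounts, and individual valueFrom references.

```shell
# Hypothetical helper: print envFrom ConfigMaps that a deployment
# references but that are absent from the namespace.
# $1 = deployment name, $2 = namespace
missing_configmaps() {
  local deploy=$1 ns=$2 name
  for name in $(oc get deployment "$deploy" -n "$ns" \
      -o jsonpath='{.spec.template.spec.containers[*].envFrom[*].configMapRef.name}'); do
    # A failed lookup means the ConfigMap is missing (or not readable).
    oc get configmap "$name" -n "$ns" >/dev/null 2>&1 || echo "$name"
  done
}
```

Usage: missing_configmaps weather-backend ops-track-demo. Before the fix this would be expected to print weather-config, and nothing afterwards.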

Problem 5: Service Routing Broken (weather-proxy)

Diagnose

This pod is Running — so what’s wrong? In the OCP Console, navigate to Networking → Services (project: ops-track-demo) and click weather-proxy. The Pods tab shows 0 Pods — immediately revealing that the service can’t find any matching pods.

Figure: weather-proxy Service showing 0 Pods due to selector mismatch

Check via CLI as well:

oc get endpointslice -l kubernetes.io/service-name=weather-proxy -n ops-track-demo

The ENDPOINTS column should be empty (<unset>), which means the service has no backends.

Compare the service selector with the pod labels:

echo "=== Service selector ==="
oc get svc weather-proxy -n ops-track-demo -o jsonpath='{.spec.selector}' | python3 -m json.tool
echo "=== Pod labels ==="
oc get pod -l app=weather-proxy -n ops-track-demo -o jsonpath='{.items[0].metadata.labels}' | python3 -m json.tool

Spot the difference? The service selector says weather-proxi (typo) but the pod label is weather-proxy.

Fix

Patch the service selector to match the actual pod label:

oc patch svc weather-proxy -n ops-track-demo -p '{"spec":{"selector":{"app":"weather-proxy"}}}'

Verify

oc get endpointslice -l kubernetes.io/service-name=weather-proxy -n ops-track-demo

You should now see the pod’s IP in the ENDPOINTS column. Traffic will flow to the pod.

Takeaway: When a pod is Running but the service isn’t working, check if the service selector matches the pod labels. A single character typo can break routing. Use oc get endpointslice to verify.
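
Instead of comparing selector and labels by eye, you can ask the cluster which pods the selector actually matches. This is a sketch under stated assumptions: service_selector and pods_for_service are hypothetical names, and the service is assumed to have a non-empty selector.

```shell
# Hypothetical helper: render a Service's selector map as a label
# selector string, e.g. app=weather-proxy,tier=web.
# $1 = service name, $2 = namespace
service_selector() {
  oc get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//'                   # drop the trailing comma
}

# Hypothetical helper: list the pods that the Service's selector matches.
pods_for_service() {
  local sel
  sel=$(service_selector "$1" "$2")
  oc get pods -n "$2" -l "$sel" --no-headers
}
```

Usage: pods_for_service weather-proxy ops-track-demo. If no pods are listed while the application pod is Running, the selector and the pod labels disagree.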

All Fixed

Switch to the OCP Console tab and open Workloads → Pods (project: ops-track-demo). All five pods should now show green Running status. Compare this to the red and yellow statuses you saw at the start; that visual progression is the troubleshooting workflow in action.

Figure: All five weather pods showing green Running status

Verify via CLI:

oc get pods -n ops-track-demo

All five pods should now be Running:

NAME                                READY   STATUS    RESTARTS   AGE
weather-api-xxxxx                   1/1     Running   0          1m
weather-backend-xxxxx               1/1     Running   0          1m
weather-cache-xxxxx                 1/1     Running   0          1m
weather-frontend-xxxxx              1/1     Running   0          1m
weather-proxy-xxxxx                 1/1     Running   0          5m
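
If you prefer a check that returns a pass or fail result, the verify step can be scripted with oc rollout status. The sketch below is illustrative: verify_rollouts is a hypothetical name, the 60s timeout is an arbitrary choice, and it assumes each deployment carries the same name as its application (for example, deployment/weather-proxy).

```shell
# Hypothetical helper: wait for each named deployment to finish rolling
# out, and return non-zero if any of them does not.
# $1 = namespace, remaining arguments = deployment names
verify_rollouts() {
  local ns=$1 failed=0 d
  shift
  for d in "$@"; do
    oc rollout status "deployment/$d" -n "$ns" --timeout=60s || failed=1
  done
  return "$failed"
}
```

Usage: verify_rollouts ops-track-demo weather-api weather-backend weather-cache weather-frontend weather-proxy. A successful rollout does not prove that the service selector is correct, so the endpoint check from Problem 5 still applies.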

Quick Reference

Status	Meaning	First Check
ImagePullBackOff	Can’t pull container image	`oc describe pod` → Events → Check image name/tag
CrashLoopBackOff	Container starts then crashes	`oc logs <pod> --previous` → Check app logs
CreateContainerConfigError	ConfigMap/Secret missing	`oc describe pod` → Events → "not found" message
No pod created	LimitRange/ResourceQuota violation	`oc describe replicaset` → Events → Check resource requests
Pending	Can’t schedule pod	`oc describe pod` → Events → Check resources/node capacity
Running but not working	Service selector mismatch or app bug	`oc get endpointslice` + `oc logs <pod>`
PVC Pending	Storage can’t be provisioned	`oc describe pvc` → Events → Check StorageClass name/existence

Summary

You diagnosed and fixed five common OpenShift issues:

ImagePullBackOff — wrong image tag (9.9 doesn’t exist) → oc set image
CrashLoopBackOff — missing config file → create ConfigMap + volume mount
Resource constraint — memory typo exceeded LimitRange → oc set resources
CreateContainerConfigError — missing ConfigMap → oc create configmap
Service selector mismatch — typo in selector → oc patch svc

The pattern is always the same: check status → describe pod/replicaset → read logs/events → fix → verify.

Cleanup

If the Developer Hub module is part of your workshop, skip this cleanup — it uses the weather app pods in ops-track-demo to demonstrate the Kubernetes plugin.

oc delete namespace ops-track-demo --ignore-not-found

echo "Cleanup complete"