RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Common Issues

Exact symptoms, root causes, and fix commands for the most frequently encountered LiteMaaS problems.

Issue Index

  1. Model sync not working (models added in LiteLLM but not showing in LiteMaaS)
  2. Subscription failures — foreign key constraint violation
  3. Frontend showing old version (init container image mismatch)
  4. LiteLLM DB migration fails or schema is out of date
  5. Key sync between LiteMaaS and LiteLLM broken
  6. Redis cache issues — stale model data after changes
  7. OAuth callback fails — users cannot log in
  8. Model requests timing out (large context, long inference)
  9. LiteLLM returning 401 / invalid API key
  10. Pods getting OOMKilled

1. Model sync not working — models added in LiteLLM but not showing in LiteMaaS
Symptoms Model appears in the LiteLLM admin UI under Models, and curl /v1/models with master key shows it. However, the LiteMaaS frontend model catalog does not show the new model. Users trying to subscribe get a 404 or the model is absent from the dropdown.
Root Cause LiteMaaS backend maintains its own models table in PostgreSQL. Models added via the LiteLLM admin UI are stored in LiteLLM's internal LiteLLM_ModelTable only. The two databases must be explicitly synced.
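
The drift is easy to picture as a set difference. A minimal offline sketch (model names are made up for illustration) of what the sync playbook has to reconcile:

```shell
# Hypothetical model lists standing in for LiteLLM_ModelTable and the
# LiteMaaS models table (real values come from the two databases).
litellm_models=$'granite-8b\nllama-scout-17b\nmistral-7b'
litemaas_models=$'granite-8b\nmistral-7b'

# Models known to LiteLLM but absent from LiteMaaS -- these are the
# rows the sync has to insert before users can see or subscribe to them.
missing=$(comm -23 <(sort <<<"$litellm_models") <(sort <<<"$litemaas_models"))
echo "needs sync: $missing"
```

Until that difference is empty, the catalog and the proxy disagree.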

Fix — Run the Sync Playbook

# Get credentials
NS=litellm-rhpds
LITELLM_URL=$(oc get route litellm-prod -n $NS -o jsonpath='https://{.spec.host}')
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)

# Run the model sync playbook
ansible-playbook playbooks/manage_models.yml \
  -e litellm_url="${LITELLM_URL}" \
  -e litellm_master_key="${LITELLM_KEY}" \
  -e ocp4_workload_litemaas_models_namespace="${NS}" \
  -e ocp4_workload_litemaas_models_sync_from_litellm=true \
  -e '{"ocp4_workload_litemaas_models_list": []}'

Fix — Via Admin API Sync Endpoint

# If the backend exposes a sync endpoint:
ADMIN_KEY=$(oc get secret backend-secret -n $NS \
  -o jsonpath='{.data.ADMIN_API_KEY}' | base64 -d)
BACKEND_URL=$(oc get route litellm-prod-admin -n $NS \
  -o jsonpath='https://{.spec.host}')

curl -X POST "${BACKEND_URL}/api/admin/sync-models" \
  -H "Authorization: Bearer ${ADMIN_KEY}" \
  -H "Content-Type: application/json"

Fix — Enable Auto-Sync for Ongoing Changes

oc set env deployment/litellm-backend LITELLM_AUTO_SYNC=true -n $NS
oc rollout restart deployment/litellm-backend -n $NS

Verify

# Model should now appear in backend DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "SELECT id, name FROM models;" | grep your-model-name

2. Subscription failures — foreign key constraint violation
Symptoms Users can log in and see models in the catalog, but clicking "Subscribe" fails with an error. Backend logs show: violates foreign key constraint "subscriptions_model_id_fkey".
Root Cause The subscriptions table has a foreign key reference to models.id. The model exists in LiteLLM's database but not in LiteMaaS's models table. This happens when models are added directly via the LiteLLM admin UI without running the sync playbook.
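
The failing insert can be reproduced as a toy sketch (model ids are hypothetical, the functions stand in for the backend and the FK check): a subscription may only reference an id that already exists in the models table.

```shell
# Models currently present in the LiteMaaS models table (hypothetical).
models=$'granite-8b\nmistral-7b'

# subscribe <model_id> -- fails the way the FK does if the id is unknown.
subscribe() {
  if grep -qxF "$1" <<<"$models"; then
    echo "subscribed:$1"
  else
    echo "error: violates foreign key constraint subscriptions_model_id_fkey" >&2
    return 1
  fi
}

subscribe granite-8b                                  # succeeds
subscribe llama-scout-17b || echo "sync the model first"
```

The fix is always the same: make the referenced row exist (sync or insert), never relax the constraint.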

Diagnose

# Check if model exists in LiteMaaS models table
NS=litellm-rhpds
MODEL_ID=your-model-name

oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, name, provider FROM models WHERE id = '${MODEL_ID}';"
# If 0 rows returned, the model is not in LiteMaaS DB

# Check LiteLLM has it
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -s "https://$ROUTE/model/info" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq ".data[] | select(.model_name == \"${MODEL_ID}\")"

Fix — Sync Model to LiteMaaS DB

# Option 1: Run sync playbook (syncs ALL models)
ansible-playbook playbooks/manage_models.yml \
  -e litellm_url="https://$ROUTE" \
  -e litellm_master_key="${LITELLM_KEY}" \
  -e ocp4_workload_litemaas_models_namespace="${NS}" \
  -e ocp4_workload_litemaas_models_sync_from_litellm=true \
  -e '{"ocp4_workload_litemaas_models_list": []}'

# Option 2: Insert directly (for emergency fix)
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "INSERT INTO models (id, name, display_name, provider, availability, created_at, updated_at)
   VALUES ('${MODEL_ID}', '${MODEL_ID}', '${MODEL_ID}', 'openshift-ai', 'available', NOW(), NOW())
   ON CONFLICT (id) DO NOTHING;"

3. Frontend showing old version after upgrade
Symptoms After updating the frontend image tag, the LiteMaaS portal still shows the old version number in the footer. oc get deployment/litellm-frontend -o yaml shows the new image tag, but the running pods still serve old content. A hard refresh (Ctrl+Shift+R) does not help.
Root Cause The frontend deployment uses an init container to inject static assets into a shared volume. If the init container image tag is different from the main container image tag (e.g., the main container was updated but the init container was not, or vice versa), the init container will inject old assets even though the main container image is new. The main Nginx container only serves what the init container placed in the volume.
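
The mismatch check boils down to comparing the two image tags. A small sketch (the image references are examples, not what your deployment necessarily runs):

```shell
# Image refs as they would come back from the two jsonpath queries below.
main_image="quay.io/rh-aiservices-bu/litemaas-frontend:0.5.0"
init_image="quay.io/rh-aiservices-bu/litemaas-frontend:0.4.2"

# Strip everything up to the last colon to isolate the tag.
main_tag=${main_image##*:}
init_tag=${init_image##*:}

if [ "$main_tag" != "$init_tag" ]; then
  echo "MISMATCH: init container ($init_tag) injects assets from a different build than main ($main_tag)"
fi
```

Whatever the init container injected wins, so a newer main image alone changes nothing the user can see.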

Diagnose

# Check both the main container and init container image tags
NS=litellm-rhpds

oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
echo ""
oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}'
echo ""

# They must match for the correct version to be served

Fix — Update Both Container Images Together

NS=litellm-rhpds
NEW_TAG=0.5.0
IMAGE=quay.io/rh-aiservices-bu/litemaas-frontend

# Update the main container
oc set image deployment/litellm-frontend \
  frontend=${IMAGE}:${NEW_TAG} \
  -n $NS

# Update the init container (if present)
oc set image deployment/litellm-frontend \
  init-frontend=${IMAGE}:${NEW_TAG} \
  -n $NS

# Force a rollout to apply changes
oc rollout restart deployment/litellm-frontend -n $NS
oc rollout status deployment/litellm-frontend -n $NS

# Verify running pod is using new image
oc get pods -n $NS -l app=litellm-frontend \
  -o jsonpath='{.items[0].spec.initContainers[*].image}'

Prevention

Always update init container and main container to the same tag simultaneously. When using Ansible, the role handles both; when patching manually, update both in a single oc set image call or via a single deployment patch.

4. LiteLLM DB migration fails or schema is out of date
Symptoms LiteLLM pods are in CrashLoopBackOff or restart repeatedly after an upgrade. Logs show Prisma-related errors: Table 'LiteLLM_VerificationToken' doesn't exist, column does not exist, or The column 'LiteLLM_XXX.field' does not exist. Alternatively, LiteLLM starts but key operations fail with database errors.
Root Cause LiteLLM uses Prisma ORM with schema migrations. When upgrading between minor/major LiteLLM versions, new columns or tables may be added. The migration sometimes fails if the database is locked, if the previous migration was incomplete, or if the container hit a startup timeout before migration finished.
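
An incomplete migration leaves a row in _prisma_migrations with an empty finished_at, and that is what blocks every subsequent startup. A sketch of spotting it (sample rows are invented, pipe-delimited roughly as psql -A would print them):

```shell
# migration_name|finished_at -- the second row started but never completed.
rows=$'20240101000000_init|2024-01-01 00:00:05\n20240301000000_add_budget|'

# Any row with an empty finished_at is the migration Prisma is stuck on.
stuck=$(awk -F'|' '$2 == "" {print $1}' <<<"$rows")
echo "stuck migration: ${stuck:-none}"
```

The migration name printed here is the one to look for in the pod logs and, if it was in fact already applied, the one to mark finished in the last-resort fix below.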

Diagnose

# Check LiteLLM startup logs for migration output
NS=litellm-rhpds
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')

oc logs -n $NS $LITELLM_POD | grep -iE "migrat|prisma|error|schema"

# Check the PostgreSQL migration history table
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT * FROM \"_prisma_migrations\" ORDER BY finished_at DESC LIMIT 10;"

Fix — Trigger Manual Migration

# Scale down LiteLLM to 1 replica to avoid concurrent migration attempts
oc scale deployment/litellm --replicas=1 -n $NS

# Wait for the single pod to start
oc rollout status deployment/litellm -n $NS

# Check if migration ran successfully
oc logs -n $NS deployment/litellm --tail=100 | grep -i migrat

# If migration still fails, exec into the pod and run it manually
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')

oc exec -n $NS $LITELLM_POD -- \
  python -c "
import litellm
from litellm.proxy.proxy_server import ProxyStartupEvent
import asyncio
asyncio.run(ProxyStartupEvent.run_migrations())
"

# Scale back to 3 replicas after migration succeeds
oc scale deployment/litellm --replicas=3 -n $NS

Fix — Reset Migration State (Last Resort)

Warning: Only use this if migration is stuck on a specific migration that has already been applied to the database. This marks the failed migration as applied without re-running it. Take a full database backup first.

# Mark a failed migration as applied (replace with actual migration name from logs)
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE \"_prisma_migrations\" \
   SET finished_at = NOW(), applied_steps_count = 1 \
   WHERE migration_name = '20240101000000_add_some_column' \
   AND finished_at IS NULL;"

# Restart LiteLLM
oc rollout restart deployment/litellm -n $NS

5. Key sync between LiteMaaS and LiteLLM broken
Symptoms User created an API key in LiteMaaS portal (shows as Active in the portal), but requests to LiteLLM return 401 Unauthorized or Invalid API key. Alternatively, a key shows as revoked in the portal but still works when used against the LiteLLM endpoint. The sync_status column in the api_keys table shows 'error'.
Root Cause The LiteMaaS backend creates keys in LiteLLM via the /key/generate API, then stores the returned key token and alias in its own api_keys table. If the LiteLLM API call succeeded but the database write failed (or vice versa), the two systems are out of sync. Also happens when keys are deleted directly in LiteLLM's admin UI without going through LiteMaaS's key management flow.
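
The two-phase create is the fragile part. A sketch of the flow with stand-in functions (none of these are real LiteMaaS or LiteLLM calls), showing the rollback that keeps the two systems aligned when the DB write fails:

```shell
litellm_keys=""   # keys that exist on the LiteLLM side (simulated)

litellm_generate() { key="sk-abc123"; litellm_keys="$litellm_keys $key"; }
litellm_delete()   { litellm_keys=${litellm_keys/ $1/}; }
db_insert()        { return 1; }   # simulate the api_keys write failing

# Phase 1: create the key in LiteLLM; Phase 2: record it in LiteMaaS.
litellm_generate
if ! db_insert "$key"; then
  litellm_delete "$key"            # roll back so no orphan key survives
  echo "key creation failed cleanly"
fi
```

When either phase runs without the other (or keys are deleted directly in the LiteLLM admin UI), you get exactly the orphans the queries below hunt for.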

Diagnose — Find Keys with Sync Errors

NS=litellm-rhpds

# List all keys with sync errors
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, user_id, litellm_key_alias, is_active, sync_status, sync_error
   FROM api_keys
   WHERE sync_status = 'error' OR sync_status = 'pending'
   ORDER BY updated_at DESC
   LIMIT 20;"

# Find LiteMaaS-active keys that don't exist in LiteLLM
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT ak.id, ak.litellm_key_alias, ak.is_active
   FROM api_keys ak
   WHERE ak.is_active = true
     AND ak.litellm_key_alias IS NOT NULL
     AND NOT EXISTS (
       SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
       WHERE lv.key_alias = ak.litellm_key_alias
     );"

Fix — Mark Orphaned LiteMaaS Keys as Inactive

# Mark all LiteMaaS keys inactive if they have no LiteLLM counterpart
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE api_keys
   SET is_active = false,
       revoked_at = NOW(),
       sync_status = 'error',
       sync_error = 'Key not found in LiteLLM - manual cleanup',
       updated_at = NOW()
   WHERE is_active = true
     AND litellm_key_alias IS NOT NULL
     AND NOT EXISTS (
       SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
       WHERE lv.key_alias = api_keys.litellm_key_alias
     );"

Fix — Re-sync a Specific Key

# If a key exists in LiteLLM but not tracked in LiteMaaS,
# the user should revoke the old key and create a new one via the portal.
# The new key creation will go through the proper sync flow.

# Alternatively, delete the orphaned LiteLLM key:
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')

curl -X POST "https://$ROUTE/key/delete" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-the-orphaned-key-token"]}'

Fix — Run the Key Cleanup Cronjob Immediately

# The cleanup cron handles orphaned keys automatically
# Run it manually for immediate effect
sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh
sudo tail -50 /var/log/litemaas-key-cleanup.log

6. Redis cache issues — stale model data after model changes
Symptoms After adding, removing, or updating a model in LiteLLM, the LiteLLM proxy continues to serve stale model information for several minutes. In multi-replica deployments, different LiteLLM pods may return different model lists. The LiteMaaS backend cache may also show outdated model data immediately after a change.
Root Cause LiteLLM maintains an in-memory cache of model configurations and virtual key data, synchronized via Redis in multi-replica deployments. If Redis is down, each pod maintains its own independent cache. If Redis is running but the cache is stale, a TTL-based expiry or an explicit flush is needed.
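
Why a flush helps can be sketched as a cache-aside read: a pod keeps answering from its cached entry until the entry is invalidated, then falls through to the database. All data here is made up:

```shell
db_model_list="granite-8b mistral-7b"   # source of truth (PostgreSQL)
cache="granite-8b"                      # stale entry cached before the change

get_models() {
  if [ -n "$cache" ]; then echo "$cache"       # hit: possibly stale
  else cache="$db_model_list"; echo "$cache"   # miss: reload from the DB
  fi
}

get_models        # -> granite-8b              (stale answer)
cache=""          # the FLUSHALL equivalent
get_models        # -> granite-8b mistral-7b   (fresh answer)
```

This is also why different replicas can disagree when Redis is down: each pod then holds its own private copy of `cache`.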

Diagnose

NS=litellm-rhpds

# Check Redis is running
oc get pods -n $NS -l app=litellm-redis

# Check Redis memory usage and key count
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli INFO memory | grep used_memory_human
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE

# Check if LiteLLM pods can reach Redis
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'echo REDIS_HOST=$REDIS_HOST REDIS_PORT=$REDIS_PORT'

Fix — Flush the Redis Cache

# Flush all Redis keys (forces LiteLLM pods to reload from DB)
# WARNING: This briefly increases DB load as all pods reload their caches
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL

# Verify flush
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE
# Should return: 0

Fix — Restart LiteLLM Pods (Forces Cache Reload)

# Rolling restart — no downtime
oc rollout restart deployment/litellm -n $NS
oc rollout status deployment/litellm -n $NS

Fix — Redis Pod Down (Restart Redis)

# If Redis pod is in error state
oc delete pod -n $NS -l app=litellm-redis
# The deployment controller will create a new pod automatically

# If Redis deployment is misconfigured, check logs
oc logs -n $NS deployment/litellm-redis --tail=50

7. OAuth callback fails — users cannot log in
Symptoms Users click "Login" on the LiteMaaS frontend, are redirected to the OpenShift OAuth page, authenticate successfully, but then see a generic error page or are redirected back to the frontend with an error. Backend logs show: invalid_grant, redirect_uri_mismatch, or oauth_id not found.
Root Cause (most common): The OAuthClient's redirectURIs list does not include the actual route hostname. This happens when route hostnames change (e.g., after cluster migration) or when the OAuthClient was created with old hostnames. Secondary cause: users were directly inserted into the database before logging in via OAuth, so their oauth_id is null.
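
The OAuth server's check is an exact-match lookup of the callback URI against the client's allowlist, which is why a changed route hostname breaks login outright. A sketch with hypothetical hostnames:

```shell
# redirectURIs as registered on the OAuthClient (old cluster's hostnames).
redirect_uris=$'https://api.apps.old.example.com/api/auth/callback\nhttps://litemaas.apps.old.example.com/api/auth/callback'

# Callback actually sent after a cluster migration changed the route host.
callback="https://litemaas.apps.new.example.com/api/auth/callback"

if grep -qxF "$callback" <<<"$redirect_uris"; then
  result=ok
else
  result=redirect_uri_mismatch   # the error that surfaces in backend logs
fi
echo "$result"
```

There is no fuzzy matching: the patched redirectURIs must contain the current hostnames verbatim.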

Diagnose — Check OAuthClient Redirect URIs

NS=litellm-rhpds

# Get the OAuthClient configuration
oc get oauthclient $NS -o yaml | grep -A 20 redirectURIs

# Get current route hostnames
oc get routes -n $NS -o jsonpath='{range .items[*]}{.spec.host}{"\n"}{end}'

# The OAuthClient redirectURIs must include:
# https://<api-route>/api/auth/callback
# https://<frontend-route>/api/auth/callback

Fix — Update OAuthClient Redirect URIs

NS=litellm-rhpds
API_ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
FRONTEND_ROUTE=$(oc get route litellm-prod-frontend -n $NS -o jsonpath='{.spec.host}')

# Patch the OAuthClient
oc patch oauthclient $NS --type=merge -p \
  "{\"redirectURIs\": [
    \"https://${API_ROUTE}/api/auth/callback\",
    \"https://${FRONTEND_ROUTE}/api/auth/callback\"
  ]}"

# Verify
oc get oauthclient $NS -o jsonpath='{.redirectURIs}'

Fix — User oauth_id Mismatch (after migration or manual insert)

# LiteMaaS v0.2.1+ has email fallback: it looks up by email if oauth_id doesn't match
# and auto-updates the oauth_id. No manual fix needed for v0.2.1+.
# For older versions, update oauth_id manually:

# Get the OpenShift user's UID (this is the oauth_id)
oc get user user@redhat.com -o jsonpath='{.metadata.uid}'

# Update the oauth_id in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE users SET oauth_id = 'the-openshift-user-uid-here' \
   WHERE email = 'user@redhat.com';"

8. Model requests timing out (large context, long inference)
Symptoms Requests to models with large context windows (especially Llama Scout 17B with 400K context) return a 504 Gateway Timeout before the model finishes generating. The LiteLLM proxy returns TimeoutError in logs. Streaming requests are cut off mid-response.
Root Cause HAProxy (OpenShift's default ingress router) applies a 30-second default timeout to backend connections. Long inference requests, especially those with large context windows, routinely exceed it.
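
Rough arithmetic shows why 30 seconds cannot be enough; the throughput figure below is an assumption for illustration only, not a measured number for this deployment:

```shell
output_tokens=8000        # a long completion
tokens_per_second=20      # assumed decode speed for a large model
needed=$(( output_tokens / tokens_per_second ))
echo "generation needs ~${needed}s; the router default is 30s"
```

Hence the 600-second annotation below, which leaves headroom even for worst-case generations.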

Fix — Set HAProxy Timeout on Routes

NS=litellm-rhpds

# Add 600-second timeout annotation to the LiteLLM API route
oc annotate route litellm-prod -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Also apply to backend and frontend routes if needed
oc annotate route litellm-prod-admin -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite
oc annotate route litellm-prod-frontend -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Verify
oc get route litellm-prod -n $NS \
  -o jsonpath='{.metadata.annotations.haproxy\.router\.openshift\.io/timeout}'

The production RHDP deployment already has this annotation applied on all routes. If you re-deploy or the routes are recreated, re-apply the annotation. Consider adding it to the deployment playbook via the route template to make it permanent.

9. LiteLLM returning 401 / invalid API key
Symptoms API calls with a previously working key start returning {"error": {"message": "Invalid API Key"}} with HTTP 401. The key appears active in the LiteMaaS portal. Using the master key works fine.

Diagnose

NS=litellm-rhpds
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')

# Check if the key exists in LiteLLM
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'

# Check key in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, is_active, sync_status, revoked_at, litellm_key_alias
   FROM api_keys
   WHERE litellm_key_alias = 'key-alias-here';"

Fix — Key Budget Exhausted

# Check if the key hit its budget limit
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq '.info | {max_budget, spend, budget_reset_at}'

# If spent >= max_budget, reset the spend or increase the budget
curl -X POST "https://$ROUTE/key/update" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-the-user-key-here", "max_budget": 200}'
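
The enforcement behind the rejection is a plain threshold check; a two-line sketch with hypothetical figures shows why a key the portal calls "active" can still be refused:

```shell
max_budget=100   # key's budget (hypothetical)
spend=100        # accumulated spend as reported by /key/info

if [ "$spend" -ge "$max_budget" ]; then
  status="rejected: budget exceeded"
else
  status="accepted"
fi
echo "$status"
```

Either raise max_budget (as above) or reset the spend; the portal's Active flag alone does not make the key usable again.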

Fix — Redis Cache Has Stale Key State

# Flush Redis so LiteLLM reloads all key data from PostgreSQL
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL

10. Pods getting OOMKilled
Symptoms LiteLLM pods show OOMKilled as the last termination reason. oc describe pod shows exit code: 137. The pod restarts and the cycle repeats, especially under high concurrent load.

Diagnose

NS=litellm-rhpds

# Check OOM history
oc describe pods -n $NS -l app=litellm | grep -A 5 "Last State\|OOMKilled"

# Check current resource limits
oc get deployment/litellm -n $NS \
  -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq .

Fix — Increase Memory Limits

# Increase LiteLLM memory limit to 4Gi
oc set resources deployment/litellm \
  --limits=memory=4Gi,cpu=2000m \
  --requests=memory=1Gi,cpu=500m \
  -n $NS

# Or patch directly
oc patch deployment/litellm -n $NS --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/resources/limits/memory",
    "value": "4Gi"
  }
]'

oc rollout status deployment/litellm -n $NS

LiteLLM's memory usage grows with the number of concurrent requests and the size of model response payloads. Llama Scout 17B with 400K context can produce very large responses. For production at scale, 2-4 Gi per LiteLLM replica is recommended. Also consider reducing concurrent request limits via LiteLLM's max_parallel_requests configuration.
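
A back-of-envelope sizing for max_parallel_requests, with every figure an assumption to be replaced by values measured in your own cluster:

```shell
limit_mib=4096            # container memory limit (4Gi)
baseline_mib=512          # assumed idle footprint of the proxy
per_request_mib=64        # assumed worst-case buffered payload per request

# Parallel requests that fit inside the limit after the baseline is paid.
max_parallel=$(( (limit_mib - baseline_mib) / per_request_mib ))
echo "max_parallel_requests ~ $max_parallel"
```

Cap LiteLLM at or below that figure and the worst case stays inside the limit instead of ending in exit code 137.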

General Diagnostic Commands

NS=litellm-rhpds

# Full cluster state
oc get all -n $NS

# Pod events (shows crash reasons, scheduling issues)
oc get events -n $NS --sort-by='.lastTimestamp' | tail -30

# LiteLLM proxy health
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -sk "https://$ROUTE/health/liveliness"
curl -sk "https://$ROUTE/health/readiness"

# LiteLLM request logs (last 5 minutes)
oc logs -n $NS deployment/litellm --since=5m | grep -E "ERROR|WARN|Exception"

# Backend API logs
oc logs -n $NS deployment/litellm-backend --since=5m | grep -E "error|warn|Error"

# Database connectivity check (from backend pod)
oc exec -n $NS deployment/litellm-backend -- \
  sh -c 'echo "DB: $DATABASE_URL" | sed "s/:.*@/:REDACTED@/"'

# Check PostgreSQL is accepting connections
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "SELECT 1 as connected;"

# Redis connectivity (from LiteLLM pod)
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'redis-cli -h $REDIS_HOST ping 2>/dev/null || echo "Redis not reachable"'