Model as a Service for Red Hat Demo Platform
Exact symptoms, root causes, and fix commands for the most frequently encountered LiteMaaS problems.
A newly added model is visible in LiteLLM (curl /v1/models with the master key shows it),
but the LiteMaaS frontend model catalog does not show it.
Users trying to subscribe get a 404, or the model is absent from the dropdown.
LiteMaaS serves its catalog from its own models table in PostgreSQL. Models added via the LiteLLM admin UI
are stored in LiteLLM's internal LiteLLM_ModelTable only. The two databases must be explicitly synced.
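To see the drift at a glance, the model IDs from the two sources can be diffed locally. A minimal sketch, using sample data (hypothetical model names) in place of the real psql and /model/info output:

```shell
# Stand-ins for the two model lists; in practice, pipe the psql and
# curl /model/info output through sort into comm
litemaas_models="granite-8b"
litellm_models="granite-8b
llama-scout-17b"

# Lines unique to the LiteLLM list = models missing from LiteMaaS
comm -13 <(printf '%s\n' "$litemaas_models" | sort) \
         <(printf '%s\n' "$litellm_models" | sort)
# → llama-scout-17b
```

Any model printed here needs the sync playbook run before users can subscribe to it.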
# Get credentials
NS=litellm-rhpds
LITELLM_URL=$(oc get route litellm-prod -n $NS -o jsonpath='https://{.spec.host}')
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)

# Run the model sync playbook
ansible-playbook playbooks/manage_models.yml \
  -e litellm_url="${LITELLM_URL}" \
  -e litellm_master_key="${LITELLM_KEY}" \
  -e ocp4_workload_litemaas_models_namespace="${NS}" \
  -e ocp4_workload_litemaas_models_sync_from_litellm=true \
  -e '{"ocp4_workload_litemaas_models_list": []}'
# If the backend exposes a sync endpoint:
ADMIN_KEY=$(oc get secret backend-secret -n $NS \
-o jsonpath='{.data.ADMIN_API_KEY}' | base64 -d)
BACKEND_URL=$(oc get route litellm-prod-admin -n $NS \
-o jsonpath='https://{.spec.host}')
curl -X POST "${BACKEND_URL}/api/admin/sync-models" \
-H "Authorization: Bearer ${ADMIN_KEY}" \
-H "Content-Type: application/json"
oc set env deployment/litellm-backend LITELLM_AUTO_SYNC=true -n $NS
oc rollout restart deployment/litellm-backend -n $NS
# Model should now appear in backend DB
oc exec -n $NS litellm-postgres-0 -- \
psql -U litellm -d litellm -c "SELECT id, name FROM models;" | grep your-model-name
Subscription attempts fail with an error like
insert or update on table "subscriptions" violates foreign key constraint "subscriptions_model_id_fkey".
The subscriptions table has a foreign key reference to models.id.
The model exists in LiteLLM's database but not in LiteMaaS's models table.
This happens when models are added directly via the LiteLLM admin UI without running the sync playbook.
# Check if model exists in LiteMaaS models table
NS=litellm-rhpds
MODEL_ID=your-model-name
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, name, provider FROM models WHERE id = '${MODEL_ID}';"
# If 0 rows returned, the model is not in LiteMaaS DB

# Check LiteLLM has it
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -s "https://$ROUTE/model/info" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq ".data[] | select(.model_name == \"${MODEL_ID}\")"
# Option 1: Run sync playbook (syncs ALL models)
ansible-playbook playbooks/manage_models.yml \
  -e litellm_url="https://$ROUTE" \
  -e litellm_master_key="${LITELLM_KEY}" \
  -e ocp4_workload_litemaas_models_namespace="${NS}" \
  -e ocp4_workload_litemaas_models_sync_from_litellm=true \
  -e '{"ocp4_workload_litemaas_models_list": []}'

# Option 2: Insert directly (for emergency fix)
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "INSERT INTO models (id, name, display_name, provider, availability, created_at, updated_at)
   VALUES ('${MODEL_ID}', '${MODEL_ID}', '${MODEL_ID}', 'openshift-ai', 'available', NOW(), NOW())
   ON CONFLICT (id) DO NOTHING;"
After a frontend image update, oc get deployment/litellm-frontend -o yaml shows the new image tag,
but the running pods still serve old content. A hard refresh (Ctrl+Shift+F5) does not help.
# Check both the main container and init container image tags
NS=litellm-rhpds
oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
echo ""
oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}'
echo ""
# They must match for the correct version to be served
NS=litellm-rhpds
NEW_TAG=0.5.0
IMAGE=quay.io/rh-aiservices-bu/litemaas-frontend

# Update the main container
oc set image deployment/litellm-frontend \
  frontend=${IMAGE}:${NEW_TAG} \
  -n $NS

# Update the init container (if present)
oc set image deployment/litellm-frontend \
  init-frontend=${IMAGE}:${NEW_TAG} \
  -n $NS

# Force a rollout to apply changes
oc rollout restart deployment/litellm-frontend -n $NS
oc rollout status deployment/litellm-frontend -n $NS

# Verify running pod is using new image
oc get pods -n $NS -l app=litellm-frontend \
  -o jsonpath='{.items[0].spec.initContainers[*].image}'
Always update the init container and the main container to the same tag simultaneously. When deploying via Ansible, the role handles both; when patching manually, update both in a single oc set image call or via a single deployment patch.
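The single-call form looks like this (image tag shown is illustrative); because both containers are changed in one API update, the tags can never diverge between rollouts:

```shell
# One oc set image call updates both containers atomically
oc set image deployment/litellm-frontend \
  frontend=quay.io/rh-aiservices-bu/litemaas-frontend:0.5.0 \
  init-frontend=quay.io/rh-aiservices-bu/litemaas-frontend:0.5.0 \
  -n litellm-rhpds
```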
LiteLLM fails to start with errors such as Table 'LiteLLM_VerificationToken' doesn't exist,
column does not exist, or The column 'LiteLLM_XXX.field' does not exist.
Alternatively, LiteLLM starts but key operations fail with database errors.
# Check LiteLLM startup logs for migration output
NS=litellm-rhpds
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc logs -n $NS $LITELLM_POD | grep -iE "migrat|prisma|error|schema"

# Check the PostgreSQL migration history table
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT * FROM \"_prisma_migrations\" ORDER BY finished_at DESC LIMIT 10;"
# Scale down LiteLLM to 1 replica to avoid concurrent migration attempts
oc scale deployment/litellm --replicas=1 -n $NS

# Wait for the single pod to start
oc rollout status deployment/litellm -n $NS

# Check if migration ran successfully
oc logs -n $NS deployment/litellm --tail=100 | grep -i migrat

# If migration still fails, exec into the pod and run it manually
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  python -c "
import litellm
from litellm.proxy.proxy_server import ProxyStartupEvent
import asyncio
asyncio.run(ProxyStartupEvent.run_migrations())
"

# Scale back to 3 replicas after migration succeeds
oc scale deployment/litellm --replicas=3 -n $NS
Warning: Only use this if migration is stuck on a specific migration that has already been applied to the database. This marks the failed migration as applied without re-running it. Take a full database backup first.
# Mark a failed migration as applied (replace with actual migration name from logs)
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE \"_prisma_migrations\" \
   SET finished_at = NOW(), applied_steps_count = 1 \
   WHERE migration_name = '20240101000000_add_some_column' \
   AND finished_at IS NULL;"

# Restart LiteLLM
oc rollout restart deployment/litellm -n $NS
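If the Prisma CLI happens to be available inside the LiteLLM image (an assumption; check with which prisma in the pod), the supported alternative to editing _prisma_migrations by hand is prisma migrate resolve, which records the same state through Prisma's own tooling:

```shell
# Mark the failed migration as applied via Prisma's CLI
# (migration name is a placeholder; take the real one from the logs)
oc exec -n $NS $LITELLM_POD -- \
  prisma migrate resolve --applied 20240101000000_add_some_column
```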
User API keys return 401 Unauthorized or Invalid API key. Alternatively, a key shows as revoked
in the portal but still works when used against the LiteLLM endpoint. The sync_status column
in the api_keys table shows 'error'.
LiteMaaS creates keys through LiteLLM's /key/generate API, then stores the returned
key token and alias in its own api_keys table. If the LiteLLM API call succeeded but the database
write failed (or vice versa), the two systems are out of sync. This also happens when keys are deleted
directly in LiteLLM's admin UI without going through LiteMaaS's key management flow.
NS=litellm-rhpds

# List all keys with sync errors
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, user_id, litellm_key_alias, is_active, sync_status, sync_error
   FROM api_keys
   WHERE sync_status = 'error' OR sync_status = 'pending'
   ORDER BY updated_at DESC LIMIT 20;"

# Find LiteMaaS-active keys that don't exist in LiteLLM
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT ak.id, ak.litellm_key_alias, ak.is_active
   FROM api_keys ak
   WHERE ak.is_active = true
   AND ak.litellm_key_alias IS NOT NULL
   AND NOT EXISTS (
     SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
     WHERE lv.key_alias = ak.litellm_key_alias
   );"
# Mark all LiteMaaS keys inactive if they have no LiteLLM counterpart
oc exec -n $NS litellm-postgres-0 -- \
psql -U litellm -d litellm -c \
"UPDATE api_keys
SET is_active = false,
revoked_at = NOW(),
sync_status = 'error',
sync_error = 'Key not found in LiteLLM - manual cleanup',
updated_at = NOW()
WHERE is_active = true
AND litellm_key_alias IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
WHERE lv.key_alias = api_keys.litellm_key_alias
);"
# If a key exists in LiteLLM but not tracked in LiteMaaS,
# the user should revoke the old key and create a new one via the portal.
# The new key creation will go through the proper sync flow.

# Alternatively, delete the orphaned LiteLLM key:
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -X POST "https://$ROUTE/key/delete" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-the-orphaned-key-token"]}'
# The cleanup cron handles orphaned keys automatically
# Run it manually for immediate effect
sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh
sudo tail -50 /var/log/litemaas-key-cleanup.log
NS=litellm-rhpds

# Check Redis is running
oc get pods -n $NS -l app=litellm-redis

# Check Redis memory usage and key count
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli INFO memory | grep used_memory_human
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE

# Check if LiteLLM pods can reach Redis
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'echo REDIS_HOST=$REDIS_HOST REDIS_PORT=$REDIS_PORT'
# Flush all Redis keys (forces LiteLLM pods to reload from DB)
# WARNING: This briefly increases DB load as all pods reload their caches
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL

# Verify flush
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE
# Should return: 0
# Rolling restart — no downtime
oc rollout restart deployment/litellm -n $NS
oc rollout status deployment/litellm -n $NS
# If Redis pod is in error state
oc delete pod -n $NS -l app=litellm-redis
# The deployment controller will create a new pod automatically

# If Redis deployment is misconfigured, check logs
oc logs -n $NS deployment/litellm-redis --tail=50
OAuth login fails with invalid_grant, redirect_uri_mismatch, or
oauth_id not found.
The OAuthClient's redirectURIs list does not include
the actual route hostname. This happens when route hostnames change (e.g., after a cluster migration) or when
the OAuthClient was created with old hostnames. A secondary cause: users were inserted directly into the database
before logging in via OAuth, so their oauth_id is null.
NS=litellm-rhpds

# Get the OAuthClient configuration
oc get oauthclient $NS -o yaml | grep -A 20 redirectURIs

# Get current route hostnames
oc get routes -n $NS -o jsonpath='{range .items[*]}{.spec.host}{"\n"}{end}'

# The OAuthClient redirectURIs must include:
#   https://<api-route>/api/auth/callback
#   https://<frontend-route>/api/auth/callback
NS=litellm-rhpds
API_ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
FRONTEND_ROUTE=$(oc get route litellm-prod-frontend -n $NS -o jsonpath='{.spec.host}')
# Patch the OAuthClient
oc patch oauthclient $NS --type=merge -p \
"{\"redirectURIs\": [
\"https://${API_ROUTE}/api/auth/callback\",
\"https://${FRONTEND_ROUTE}/api/auth/callback\"
]}"
# Verify
oc get oauthclient $NS -o jsonpath='{.redirectURIs}'
# LiteMaaS v0.2.1+ has email fallback: it looks up by email if oauth_id doesn't match
# and auto-updates the oauth_id. No manual fix needed for v0.2.1+.
# For older versions, update oauth_id manually:

# Get the OpenShift user's UID (this is the oauth_id)
oc get user user@redhat.com -o jsonpath='{.metadata.uid}'

# Update the oauth_id in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE users SET oauth_id = 'the-openshift-user-uid-here' \
   WHERE email = 'user@redhat.com';"
Long-running requests return 504 Gateway Timeout before the model finishes generating. The LiteLLM proxy
logs TimeoutError. Streaming requests are cut off mid-response.
NS=litellm-rhpds

# Add 600-second timeout annotation to the LiteLLM API route
oc annotate route litellm-prod -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Also apply to backend and frontend routes if needed
oc annotate route litellm-prod-admin -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite
oc annotate route litellm-prod-frontend -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Verify
oc get route litellm-prod -n $NS \
  -o jsonpath='{.metadata.annotations.haproxy\.router\.openshift\.io/timeout}'
The production RHDP deployment already has this annotation applied to all routes. If you redeploy or the routes are recreated, re-apply the annotation. Consider adding it to the deployment playbook via the route template to make it permanent.
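A sketch of what baking the annotation into a route template could look like (a fragment only; metadata values are illustrative, and the real template lives in the deployment playbook):

```shell
# Print a Route manifest fragment carrying the timeout annotation,
# so recreated routes come up with it already set
cat <<'EOF'
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: litellm-prod
  annotations:
    haproxy.router.openshift.io/timeout: 600s
EOF
```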
{"error": {"message": "Invalid API Key"}}
with HTTP 401. The key appears active in the LiteMaaS portal. Using the master key works fine.
NS=litellm-rhpds
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
-o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
# Check if the key exists in LiteLLM
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
-H "Authorization: Bearer $LITELLM_KEY" | jq '.'
# Check key in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
psql -U litellm -d litellm -c \
"SELECT id, is_active, sync_status, revoked_at, litellm_key_alias
FROM api_keys
WHERE litellm_key_alias = 'key-alias-here';"
# Check if the key hit its budget limit
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq '.info | {max_budget, spend, budget_reset_at}'

# If spend >= max_budget, reset the spend or increase the budget
curl -X POST "https://$ROUTE/key/update" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-the-user-key-here", "max_budget": 200}'
# Flush Redis so LiteLLM reloads all key data from PostgreSQL
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL
LiteLLM pods terminate with OOMKilled as the last termination reason; oc describe pod shows
exit code: 137. The pod restarts and the cycle repeats, especially under high concurrent load.
NS=litellm-rhpds

# Check OOM history
oc describe pods -n $NS -l app=litellm | grep -A 5 "Last State\|OOMKilled"

# Check current resource limits
oc get deployment/litellm -n $NS \
  -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq .
# Increase LiteLLM memory limit to 4Gi
oc set resources deployment/litellm \
  --limits=memory=4Gi,cpu=2000m \
  --requests=memory=1Gi,cpu=500m \
  -n $NS

# Or patch directly
oc patch deployment/litellm -n $NS --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/resources/limits/memory",
    "value": "4Gi"
  }
]'

oc rollout status deployment/litellm -n $NS
LiteLLM's memory usage grows with the number of concurrent requests and the size of model response payloads.
Llama Scout 17B with 400K context can produce very large responses. For production at scale, 2-4 Gi per
LiteLLM replica is recommended. Also consider reducing concurrent request limits via LiteLLM's
max_parallel_requests configuration.
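A sketch of what that could look like in the LiteLLM proxy config (an assumption: that max_parallel_requests sits under general_settings in the deployed LiteLLM version; verify placement against the LiteLLM docs, and the value 100 is illustrative):

```shell
# Print a LiteLLM proxy config fragment capping parallel requests per key
cat <<'EOF'
general_settings:
  max_parallel_requests: 100
EOF
```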
NS=litellm-rhpds

# Full cluster state
oc get all -n $NS

# Pod events (shows crash reasons, scheduling issues)
oc get events -n $NS --sort-by='.lastTimestamp' | tail -30

# LiteLLM proxy health
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -sk "https://$ROUTE/health/liveliness"
curl -sk "https://$ROUTE/health/readiness"

# LiteLLM request logs (last 5 minutes)
oc logs -n $NS deployment/litellm --since=5m | grep -E "ERROR|WARN|Exception"

# Backend API logs
oc logs -n $NS deployment/litellm-backend --since=5m | grep -E "error|warn|Error"

# Database connectivity check (from backend pod)
oc exec -n $NS deployment/litellm-backend -- \
  sh -c 'echo "DB: $DATABASE_URL" | sed "s/:.*@/:REDACTED@/"'

# Check PostgreSQL is accepting connections
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "SELECT 1 as connected;"

# Redis connectivity (from LiteLLM pod)
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'redis-cli -h $REDIS_HOST ping 2>/dev/null || echo "Redis not reachable"'