RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Common Issues

Exact symptoms, root causes, and fix commands for the most frequently encountered LiteMaaS problems.

Issue Index

  1. Model sync not working (models added in LiteLLM but not showing in LiteMaaS)
  2. Subscription failures — foreign key constraint violation
  3. Frontend showing old version (init container image mismatch)
  4. LiteLLM DB migration fails or schema is out of date
  5. Key sync between LiteMaaS and LiteLLM broken
  6. Redis cache issues — stale model data after changes
  7. OAuth callback fails — users cannot log in
  8. Model requests timing out (large context, long inference)
  9. LiteLLM returning 401 / invalid API key
  10. Pods getting OOMKilled
  11. All models and keys lost after LiteLLM pod restart
  12. Model predictor pod stuck in Pending — Insufficient GPU
  13. Same model appears twice in LiteMaaS
1. Model sync not working — models added in LiteLLM but not showing in LiteMaaS
Symptoms Model appears in the LiteLLM admin UI under Models, and curl /v1/models with master key shows it. However, the LiteMaaS frontend model catalog does not show the new model. Users trying to subscribe get a 404 or the model is absent from the dropdown.
Root Cause LiteMaaS backend maintains its own models table in PostgreSQL. Models added via the LiteLLM admin UI are stored in LiteLLM's internal LiteLLM_ModelTable only. The two databases must be explicitly synced.
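Conceptually, the sync is a set difference between the two model tables. A minimal sketch of that check (names and data here are illustrative; the real backend works against PostgreSQL and the LiteLLM REST API):

```python
def find_unsynced_models(litellm_models, litemaas_models):
    """Return model names registered in LiteLLM but missing from LiteMaaS.

    litellm_models:  names from LiteLLM's LiteLLM_ModelTable (/model/info)
    litemaas_models: ids from the LiteMaaS `models` table
    """
    return sorted(set(litellm_models) - set(litemaas_models))

# Hypothetical state: one model was added via the LiteLLM admin UI only
missing = find_unsynced_models(
    ["granite-3-8b", "mistral-7b", "new-model"],
    ["granite-3-8b", "mistral-7b"],
)
# `missing` is what the sync (or the manual INSERT below) has to reconcile
```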

Fix — Trigger sync via Admin UI or API

Go to LiteMaaS Frontend → Admin → Models and verify the model appears; adding it there writes the row into the LiteMaaS models table. If the model is also missing from LiteLLM itself, register it there via the API:

export ADMIN_KEY=sk-1234567890abcdef1234   # use your real admin key
export LITELLM_URL=https://litellm-prod.apps.maas.redhatworkshops.io

curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "my-model",
    "litellm_params": {
      "model": "openai/my-model",
      "api_base": "http://my-model-predictor.llm-hosting.svc.cluster.local/v1",
      "custom_llm_provider": "openai"
    }
  }'

Fix — Via Admin API Sync Endpoint

# If the backend exposes a sync endpoint:
NS=litellm-rhpds
ADMIN_KEY=$(oc get secret backend-secret -n $NS \
  -o jsonpath='{.data.ADMIN_API_KEY}' | base64 -d)
BACKEND_URL=$(oc get route litellm-prod-admin -n $NS \
  -o jsonpath='https://{.spec.host}')

curl -X POST "${BACKEND_URL}/api/admin/sync-models" \
  -H "Authorization: Bearer ${ADMIN_KEY}" \
  -H "Content-Type: application/json"

Fix — Enable Auto-Sync for Ongoing Changes

oc set env deployment/litellm-backend LITELLM_AUTO_SYNC=true -n $NS
oc rollout restart deployment/litellm-backend -n $NS

Verify

# Model should now appear in backend DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "SELECT id, name FROM models;" | grep your-model-name
2. Subscription failures — foreign key constraint violation
Symptoms User can log in and see models listed in the catalog, but clicking "Subscribe" returns an error. Backend logs show: insert or update on table "subscriptions" violates foreign key constraint "subscriptions_model_id_fkey".
Root Cause The subscriptions table has a foreign key reference to models.id. The model exists in LiteLLM's database but not in LiteMaaS's models table. This happens when models are added directly via the LiteLLM admin UI without running the sync playbook.
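The failure mode can be reproduced in miniature with an in-memory SQLite database (the real schema lives in PostgreSQL; table and column names below only mirror the error message):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this enabled explicitly
con.execute("CREATE TABLE models (id TEXT PRIMARY KEY)")
con.execute(
    "CREATE TABLE subscriptions ("
    "  id INTEGER PRIMARY KEY,"
    "  model_id TEXT NOT NULL REFERENCES models(id))"
)

def subscribe(model_id):
    try:
        con.execute("INSERT INTO subscriptions (model_id) VALUES (?)", (model_id,))
        return "ok"
    except sqlite3.IntegrityError:
        return "foreign key violation"

first = subscribe("granite-3-8b")   # model row missing: the user-visible failure
con.execute("INSERT INTO models (id) VALUES ('granite-3-8b')")  # the "sync" step
second = subscribe("granite-3-8b")  # now succeeds
```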

Diagnose

# Check if model exists in LiteMaaS models table
NS=litellm-rhpds
MODEL_ID=your-model-name

oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, name, provider FROM models WHERE id = '${MODEL_ID}';"
# If 0 rows returned, the model is not in LiteMaaS DB

# Check LiteLLM has it
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -s "https://$ROUTE/model/info" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq ".data[] | select(.model_name == \"${MODEL_ID}\")"

Fix — Add model via Admin UI or API

Go to LiteMaaS Frontend → Admin → Models → Add Model, or use the API:

# Emergency fallback: insert the row directly into the LiteMaaS models table
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "INSERT INTO models (id, name, display_name, provider, availability, created_at, updated_at)
   VALUES ('${MODEL_ID}', '${MODEL_ID}', '${MODEL_ID}', 'openshift-ai', 'available', NOW(), NOW())
   ON CONFLICT (id) DO NOTHING;"
3. Frontend showing old version after upgrade
Symptoms After updating the frontend image tag, the LiteMaaS portal still shows the old version number in the footer. oc get deployment/litellm-frontend -o yaml shows the new image tag, but the running pods still serve old content. A hard refresh (Ctrl+F5, or Cmd+Shift+R on macOS) does not help.
Root Cause The frontend deployment uses an init container to inject static assets into a shared volume. If the init container image tag is different from the main container image tag (e.g., the main container was updated but the init container was not, or vice versa), the init container will inject old assets even though the main container image is new. The main Nginx container only serves what the init container placed in the volume.
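The check amounts to comparing tags across the two container lists in the pod spec. A small sketch of that comparison against deployment JSON (the container names and image below are hypothetical):

```python
def image_tag(image):
    """Extract the tag from an image reference like repo/name:tag."""
    last = image.rsplit("/", 1)[-1]
    return image.rsplit(":", 1)[1] if ":" in last else "latest"

def mismatched_tags(pod_spec):
    """Return (main_tags, init_tags) if they differ, else None."""
    main = {image_tag(c["image"]) for c in pod_spec.get("containers", [])}
    init = {image_tag(c["image"]) for c in pod_spec.get("initContainers", [])}
    return None if main == init else (sorted(main), sorted(init))

# Hypothetical spec after a partial upgrade: main bumped, init left behind
spec = {
    "containers":     [{"image": "quay.io/rh-aiservices-bu/litemaas-frontend:0.5.0"}],
    "initContainers": [{"image": "quay.io/rh-aiservices-bu/litemaas-frontend:0.4.2"}],
}
```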

Diagnose

# Check both the main container and init container image tags
NS=litellm-rhpds

oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
echo ""
oc get deployment/litellm-frontend -n $NS \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}'
echo ""

# They must match for the correct version to be served

Fix — Update Both Container Images Together

NS=litellm-rhpds
NEW_TAG=0.5.0
IMAGE=quay.io/rh-aiservices-bu/litemaas-frontend

# Update the main container and the init container in a single call
oc set image deployment/litellm-frontend \
  frontend=${IMAGE}:${NEW_TAG} \
  init-frontend=${IMAGE}:${NEW_TAG} \
  -n $NS

# Force a rollout to apply changes
oc rollout restart deployment/litellm-frontend -n $NS
oc rollout status deployment/litellm-frontend -n $NS

# Verify running pod is using new image
oc get pods -n $NS -l app=litellm-frontend \
  -o jsonpath='{.items[0].spec.initContainers[*].image}'

Prevention

Always update init container and main container to the same tag simultaneously. When using Ansible, the role handles both; when patching manually, update both in a single oc set image call or via a single deployment patch.

4. LiteLLM DB migration fails or schema is out of date
Symptoms LiteLLM pods are in CrashLoopBackOff or restart repeatedly after an upgrade. Logs show Prisma-related errors: Table 'LiteLLM_VerificationToken' doesn't exist, column does not exist, or The column 'LiteLLM_XXX.field' does not exist. Alternatively, LiteLLM starts but key operations fail with database errors.
Root Cause LiteLLM uses Prisma ORM with schema migrations. When upgrading between minor/major LiteLLM versions, new columns or tables may be added. The migration sometimes fails if the database is locked, if the previous migration was incomplete, or if the container hit a startup timeout before migration finished.
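The telltale sign in `_prisma_migrations` is a row whose finished_at is NULL: the migration started but never completed. A sketch of that check over fetched rows (shapes are illustrative, not the full table schema):

```python
def incomplete_migrations(rows):
    """Given (migration_name, finished_at) rows from _prisma_migrations,
    return names that started but never finished, the usual culprit when
    LiteLLM crash-loops after an upgrade."""
    return [name for name, finished_at in rows if finished_at is None]

# Hypothetical migration history after an interrupted upgrade
rows = [
    ("20240101000000_add_budget_column", "2024-01-02 10:00:00"),
    ("20240301000000_add_team_table", None),  # started, never completed
]
stuck = incomplete_migrations(rows)
```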

Diagnose

# Check LiteLLM startup logs for migration output
NS=litellm-rhpds
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')

oc logs -n $NS $LITELLM_POD | grep -iE "migrat|prisma|error|schema"

# Check the PostgreSQL migration history table
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT * FROM \"_prisma_migrations\" ORDER BY finished_at DESC LIMIT 10;"

Fix — Trigger Manual Migration

# Scale down LiteLLM to 1 replica to avoid concurrent migration attempts
oc scale deployment/litellm --replicas=1 -n $NS

# Wait for the single pod to start
oc rollout status deployment/litellm -n $NS

# Check if migration ran successfully
oc logs -n $NS deployment/litellm --tail=100 | grep -i migrat

# If migration still fails, exec into the pod and run it manually.
# Note: this calls a LiteLLM-internal API; the exact entry point can
# change between LiteLLM versions, so check the release notes first.
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')

oc exec -n $NS $LITELLM_POD -- \
  python -c "
import litellm
from litellm.proxy.proxy_server import ProxyStartupEvent
import asyncio
asyncio.run(ProxyStartupEvent.run_migrations())
"

# Scale back to 3 replicas after migration succeeds
oc scale deployment/litellm --replicas=3 -n $NS

Fix — Reset Migration State (Last Resort)

Warning: Only use this if migration is stuck on a specific migration that has already been applied to the database. This marks the failed migration as applied without re-running it. Take a full database backup first.

# Mark a failed migration as applied (replace with actual migration name from logs)
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE \"_prisma_migrations\" \
   SET finished_at = NOW(), applied_steps_count = 1 \
   WHERE migration_name = '20240101000000_add_some_column' \
   AND finished_at IS NULL;"

# Restart LiteLLM
oc rollout restart deployment/litellm -n $NS
5. Key sync between LiteMaaS and LiteLLM broken
Symptoms User created an API key in LiteMaaS portal (shows as Active in the portal), but requests to LiteLLM return 401 Unauthorized or Invalid API key. Alternatively, a key shows as revoked in the portal but still works when used against the LiteLLM endpoint. The sync_status column in the api_keys table shows 'error'.
Root Cause The LiteMaaS backend creates keys in LiteLLM via the /key/generate API, then stores the returned key token and alias in its own api_keys table. If the LiteLLM API call succeeded but the database write failed (or vice versa), the two systems are out of sync. Also happens when keys are deleted directly in LiteLLM's admin UI without going through LiteMaaS's key management flow.
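Reconciliation is a two-way set difference over key aliases. A minimal sketch (aliases below are made up; the real comparison runs between the api_keys table and LiteLLM_VerificationToken, as in the SQL that follows):

```python
def reconcile_keys(litemaas_active_aliases, litellm_aliases):
    """Compare key aliases tracked on each side.

    Returns (orphaned_in_litemaas, orphaned_in_litellm):
    - orphaned_in_litemaas: active in LiteMaaS but absent from LiteLLM
      (users get 401); mark these inactive in LiteMaaS.
    - orphaned_in_litellm: live in LiteLLM but untracked by LiteMaaS
      (revoked keys that still work); delete these via /key/delete.
    """
    a, b = set(litemaas_active_aliases), set(litellm_aliases)
    return sorted(a - b), sorted(b - a)

orphaned_401, orphaned_live = reconcile_keys(
    ["alias-alice", "alias-bob"],          # active rows in api_keys
    ["alias-bob", "alias-deleted-in-ui"],  # key_alias values in LiteLLM
)
```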

Diagnose — Find Keys with Sync Errors

NS=litellm-rhpds

# List all keys with sync errors
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, user_id, litellm_key_alias, is_active, sync_status, sync_error
   FROM api_keys
   WHERE sync_status = 'error' OR sync_status = 'pending'
   ORDER BY updated_at DESC
   LIMIT 20;"

# Find LiteMaaS-active keys that don't exist in LiteLLM
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT ak.id, ak.litellm_key_alias, ak.is_active
   FROM api_keys ak
   WHERE ak.is_active = true
     AND ak.litellm_key_alias IS NOT NULL
     AND NOT EXISTS (
       SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
       WHERE lv.key_alias = ak.litellm_key_alias
     );"

Fix — Mark Orphaned LiteMaaS Keys as Inactive

# Mark all LiteMaaS keys inactive if they have no LiteLLM counterpart
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE api_keys
   SET is_active = false,
       revoked_at = NOW(),
       sync_status = 'error',
       sync_error = 'Key not found in LiteLLM - manual cleanup',
       updated_at = NOW()
   WHERE is_active = true
     AND litellm_key_alias IS NOT NULL
     AND NOT EXISTS (
       SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
       WHERE lv.key_alias = api_keys.litellm_key_alias
     );"

Fix — Re-sync a Specific Key

# If a key exists in LiteLLM but not tracked in LiteMaaS,
# the user should revoke the old key and create a new one via the portal.
# The new key creation will go through the proper sync flow.

# Alternatively, delete the orphaned LiteLLM key:
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')

curl -X POST "https://$ROUTE/key/delete" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-the-orphaned-key-token"]}'

Fix — Run the Key Cleanup Cronjob Immediately

# The cleanup cron handles orphaned keys automatically.
# Run it manually (on the host where the cron is installed) for immediate effect:
sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh
sudo tail -50 /var/log/litemaas-key-cleanup.log
6. Redis cache issues — stale model data after model changes
Symptoms After adding, removing, or updating a model in LiteLLM, the LiteLLM proxy continues to serve stale model information for several minutes. In multi-replica deployments, different LiteLLM pods may return different model lists. The LiteMaaS backend cache may also show outdated model data immediately after a change.
Root Cause LiteLLM maintains an in-memory cache of model configurations and virtual key data, synchronized via Redis in multi-replica deployments. If Redis is down, each pod maintains its own independent cache. If Redis is running but the cache is stale, a TTL-based expiry or an explicit flush is needed.
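The interaction between TTL expiry and an explicit flush can be modeled in a few lines. This is a toy model of the caching behavior, not LiteLLM's implementation; flush() plays the role of `redis-cli FLUSHALL`:

```python
class TTLCache:
    """Entries expire after `ttl` seconds; flush() clears everything so the
    next get() misses, forcing a reload from the database."""

    def __init__(self, ttl):
        self.ttl, self.store = ttl, {}

    def set(self, key, value, now):
        self.store[key] = (value, now)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry and now - entry[1] < self.ttl:
            return entry[0]
        return None  # miss: caller reloads from PostgreSQL

    def flush(self):
        self.store.clear()

cache = TTLCache(ttl=60)
cache.set("model_list", ["granite-3-8b"], now=0)
stale = cache.get("model_list", now=30)  # within TTL: still the old list
cache.flush()                            # explicit invalidation
fresh = cache.get("model_list", now=31)  # miss: forces a reload
```

This is why a stale entry can outlive a model change for up to one TTL unless you flush.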

Diagnose

NS=litellm-rhpds

# Check Redis is running
oc get pods -n $NS -l app=litellm-redis

# Check Redis memory usage and key count
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli INFO memory | grep used_memory_human
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE

# Check if LiteLLM pods can reach Redis
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'echo REDIS_HOST=$REDIS_HOST REDIS_PORT=$REDIS_PORT'

Fix — Flush the Redis Cache

# Flush all Redis keys (forces LiteLLM pods to reload from DB)
# WARNING: This briefly increases DB load as all pods reload their caches
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL

# Verify flush
oc exec -n $NS $REDIS_POD -- redis-cli DBSIZE
# Should return: 0

Fix — Restart LiteLLM Pods (Forces Cache Reload)

# Rolling restart — no downtime
oc rollout restart deployment/litellm -n $NS
oc rollout status deployment/litellm -n $NS

Fix — Redis Pod Down (Restart Redis)

# If Redis pod is in error state
oc delete pod -n $NS -l app=litellm-redis
# The deployment controller will create a new pod automatically

# If Redis deployment is misconfigured, check logs
oc logs -n $NS deployment/litellm-redis --tail=50
7. OAuth callback fails — users cannot log in
Symptoms Users click "Login" on the LiteMaaS frontend, are redirected to the OpenShift OAuth page, authenticate successfully, but then see a generic error page or are redirected back to the frontend with an error. Backend logs show: invalid_grant, redirect_uri_mismatch, or oauth_id not found.
Root Cause (most common): The OAuthClient's redirectURIs list does not include the actual route hostname. This happens when route hostnames change (e.g., after cluster migration) or when the OAuthClient was created with old hostnames. Secondary cause: users were directly inserted into the database before logging in via OAuth, so their oauth_id is null.
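Non-wildcard redirect URIs are matched exactly (scheme, host, and path), so a changed route hostname fails even when the callback path is unchanged. A sketch of that check with made-up hostnames:

```python
def callback_allowed(redirect_uri, registered_uris):
    """Exact-match check, as applied to non-wildcard redirect URIs:
    any difference in scheme, host, or path rejects the callback."""
    return redirect_uri in set(registered_uris)

# Hypothetical: the OAuthClient still lists the pre-migration hostname
registered = ["https://old-host.apps.example.com/api/auth/callback"]
after_migration = "https://new-host.apps.example.com/api/auth/callback"
```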

Diagnose — Check OAuthClient Redirect URIs

NS=litellm-rhpds

# Get the OAuthClient configuration
oc get oauthclient $NS -o yaml | grep -A 20 redirectURIs

# Get current route hostnames
oc get routes -n $NS -o jsonpath='{range .items[*]}{.spec.host}{"\n"}{end}'

# The OAuthClient redirectURIs must include:
# https://<api-route>/api/auth/callback
# https://<frontend-route>/api/auth/callback

Fix — Update OAuthClient Redirect URIs

NS=litellm-rhpds
API_ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
FRONTEND_ROUTE=$(oc get route litellm-prod-frontend -n $NS -o jsonpath='{.spec.host}')

# Patch the OAuthClient
oc patch oauthclient $NS --type=merge -p \
  "{\"redirectURIs\": [
    \"https://${API_ROUTE}/api/auth/callback\",
    \"https://${FRONTEND_ROUTE}/api/auth/callback\"
  ]}"

# Verify
oc get oauthclient $NS -o jsonpath='{.redirectURIs}'

Fix — User oauth_id Mismatch (after migration or manual insert)

# LiteMaaS v0.2.1+ has email fallback: it looks up by email if oauth_id doesn't match
# and auto-updates the oauth_id. No manual fix needed for v0.2.1+.
# For older versions, update oauth_id manually:

# Get the OpenShift user's UID (this is the oauth_id)
oc get user user@redhat.com -o jsonpath='{.metadata.uid}'

# Update the oauth_id in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "UPDATE users SET oauth_id = 'the-openshift-user-uid-here' \
   WHERE email = 'user@redhat.com';"
8. Model requests timing out (large context, long inference)
Symptoms Requests to models with large context windows (especially Llama Scout 17B with 400K context) return a 504 Gateway Timeout before the model finishes generating. The LiteLLM proxy returns TimeoutError in logs. Streaming requests are cut off mid-response.
Root Cause HAProxy (OpenShift's default ingress router) applies a 30-second default timeout to route backends. Long inference requests, especially those with large context windows, exceed this limit and are cut off with a 504.
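A back-of-envelope check makes the timeout choice concrete: worst-case generation time is roughly max new tokens divided by decode throughput. The throughput figure below is illustrative, not a measured number for any model here:

```python
def needs_timeout_bump(max_new_tokens, tokens_per_second, route_timeout_s=30):
    """Return (True/False, estimated seconds): will a worst-case completion
    outlive the route timeout? Throughput is an assumed ballpark figure."""
    eta = max_new_tokens / tokens_per_second
    return eta > route_timeout_s, eta

# e.g. 4096 new tokens at ~20 tok/s is roughly 205 s,
# far beyond HAProxy's 30 s default but inside the 600 s annotation below
bump, eta = needs_timeout_bump(max_new_tokens=4096, tokens_per_second=20)
```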

Fix — Set HAProxy Timeout on Routes

NS=litellm-rhpds

# Add 600-second timeout annotation to the LiteLLM API route
oc annotate route litellm-prod -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Also apply to backend and frontend routes if needed
oc annotate route litellm-prod-admin -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite
oc annotate route litellm-prod-frontend -n $NS \
  haproxy.router.openshift.io/timeout=600s --overwrite

# Verify
oc get route litellm-prod -n $NS \
  -o jsonpath='{.metadata.annotations.haproxy\.router\.openshift\.io/timeout}'

The production RHDP deployment already has this annotation applied on all routes. If you re-deploy or the routes are recreated, re-apply the annotation. Consider adding it to the deployment playbook via the route template to make it permanent.

9. LiteLLM returning 401 / invalid API key
Symptoms API calls with a previously working key start returning {"error": {"message": "Invalid API Key"}} with HTTP 401. The key appears active in the LiteMaaS portal. Using the master key works fine.

Diagnose

NS=litellm-rhpds
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')

# Check if the key exists in LiteLLM
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'

# Check key in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, is_active, sync_status, revoked_at, litellm_key_alias
   FROM api_keys
   WHERE litellm_key_alias = 'key-alias-here';"

Fix — Key Budget Exhausted

# Check if the key hit its budget limit
curl "https://$ROUTE/key/info?key=sk-the-user-key-here" \
  -H "Authorization: Bearer $LITELLM_KEY" | \
  jq '.info | {max_budget, spend, budget_reset_at}'

# If spent >= max_budget, reset the spend or increase the budget
curl -X POST "https://$ROUTE/key/update" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-the-user-key-here", "max_budget": 200}'

Fix — Redis Cache Has Stale Key State

# Flush Redis so LiteLLM reloads all key data from PostgreSQL
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL
10. Pods getting OOMKilled
Symptoms LiteLLM pods show OOMKilled as the last termination reason. oc describe pod shows exit code: 137. The pod restarts and the cycle repeats, especially under high concurrent load.

Diagnose

NS=litellm-rhpds

# Check OOM history
oc describe pods -n $NS -l app=litellm | grep -A 5 "Last State\|OOMKilled"

# Check current resource limits
oc get deployment/litellm -n $NS \
  -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq .

Fix — Increase Memory Limits

# Increase LiteLLM memory limit to 4Gi
oc set resources deployment/litellm \
  --limits=memory=4Gi,cpu=2000m \
  --requests=memory=1Gi,cpu=500m \
  -n $NS

# Or patch directly
oc patch deployment/litellm -n $NS --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/resources/limits/memory",
    "value": "4Gi"
  }
]'

oc rollout status deployment/litellm -n $NS

LiteLLM's memory usage grows with the number of concurrent requests and the size of model response payloads. Llama Scout 17B with 400K context can produce very large responses. For production at scale, 2-4 Gi per LiteLLM replica is recommended. Also consider reducing concurrent request limits via LiteLLM's max_parallel_requests configuration.
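A crude sizing heuristic follows from that observation: resident base plus in-flight buffers. This is an assumption for capacity planning, not a LiteLLM formula; measure real usage before committing limits:

```python
def estimated_memory_mib(base_mib, max_parallel_requests, avg_payload_mib):
    """Rough per-replica estimate (assumed model): process baseline plus one
    request/response buffer per concurrent in-flight request."""
    return base_mib + max_parallel_requests * avg_payload_mib

# 512 MiB base, 50 concurrent requests, ~8 MiB average payload -> 912 MiB,
# comfortably under a 4Gi limit; large-context responses shift this fast
est = estimated_memory_mib(512, 50, 8)
```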

11. All models and keys lost after LiteLLM pod restart
Symptoms After a LiteLLM pod restart or rolling update, all registered models disappear from the portal. All virtual keys stop working (HTTP 401). The LiteMaaS model list shows empty. Everything appears to reset to a blank state on every restart.

Root Cause

LiteLLM requires the DATABASE_URL environment variable to persist state to PostgreSQL. Without it, LiteLLM uses in-memory storage — all models, keys, and configuration exist only in the running pod and are lost the moment the pod restarts. This was a common issue in deployments before v0.4.0 where DATABASE_URL was not set automatically.
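The auto-construction mentioned above amounts to assembling a postgresql:// URL from the DB credentials. A sketch of that assembly (the exact format is an assumption; check what your litemaas-db secret actually contains). Percent-encoding the password matters because characters like '@' would otherwise break URL parsing:

```python
from urllib.parse import quote

def build_database_url(user, password, host, port, dbname):
    """Assemble a postgresql:// connection URL from its parts.
    The password is percent-encoded so characters like '@' survive."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{dbname}"

# Hypothetical values matching the deployment's naming conventions
url = build_database_url(
    "litellm", "s3cr@t!",
    "litellm-postgres.litellm-rhpds.svc.cluster.local", 5432, "litellm",
)
```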

Diagnose

# Check if DATABASE_URL is set on the LiteLLM pod
oc exec -n litellm-rhpds deployment/litellm -- sh -c "echo \$DATABASE_URL"

# If empty or not set — this is the cause
# Models and keys are stored in memory only and lost on restart

Fix — Set DATABASE_URL

# Get the PostgreSQL connection string from the DB secret
DB_URL=$(oc get secret litemaas-db -n litellm-rhpds \
  -o jsonpath='{.data.DATABASE_URL}' | base64 -d)

# Set it on the LiteLLM deployment
oc set env deployment/litellm -n litellm-rhpds DATABASE_URL="$DB_URL"

oc rollout status deployment/litellm -n litellm-rhpds

Fixed in v0.4.0: The rhpds.litemaas Ansible collection v0.3.2+ and the LiteMaaS Helm chart v0.4.0 both auto-construct and inject DATABASE_URL from the PostgreSQL credentials when postgresql.enabled: true. If you are on v0.4.0 and still see this issue, verify the secret litemaas-db exists and contains a valid DATABASE_URL key.

After fixing — re-register models

The model-serving ArgoCD app will re-create InferenceServices, but LiteLLM model registrations must be re-added via the Admin UI → Models → Add Model or the API. Then restart the backend to trigger sync:

oc rollout restart deployment/litellm-backend -n litellm-rhpds
12. Model predictor pod stuck in Pending — Insufficient GPU
Symptoms A model predictor pod stays in Pending for hours or days. oc describe pod shows: "0/16 nodes are available: 2 Insufficient nvidia.com/gpu". The model is not serving requests. This happens when all GPU nodes of the required instance type are fully allocated.

Diagnose

# Find pending predictor pods
oc get pods -n llm-hosting --field-selector=status.phase=Pending

# Check the scheduling error
oc describe pod <pending-pod-name> -n llm-hosting | grep -A5 "Events:"

# Check GPU capacity vs allocation per node
oc get nodes -l 'node-role.kubernetes.io/worker-gpu' \
  -o custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels['node\.kubernetes\.io/instance-type'],GPU:.status.capacity['nvidia\.com/gpu']"

# See which pods are using GPUs on a specific node
oc describe node <node-name> | grep -A20 "Allocated resources"

Options

# Option 1: Scale down another model on the same node type to free a GPU slot
oc patch inferenceservice <other-model> -n llm-hosting \
  --type merge -p '{"spec":{"predictor":{"maxReplicas":0,"minReplicas":0}}}'

# Option 2: Change the node selector to a different instance type with free GPUs
# Edit the InferenceService in the models-aas GitOps repo and update nodeSelector

# Option 3: Add more GPU nodes (requires cluster admin / AWS console)

The RHDP MaaS cluster has a fixed set of GPU nodes. GPU slots are finite — if all L4 nodes are full, any new model requiring an L4 will stay Pending until another model is scaled down or a new node is added. Coordinate with the infra team before adding new large models to avoid displacing existing ones.
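The scheduling arithmetic behind "Insufficient nvidia.com/gpu" is just capacity minus allocation per node. A sketch with made-up node names (the diagnose commands above give you the real numbers):

```python
def free_gpus(nodes, pending_request):
    """Per-node free GPU count (capacity minus allocated), and whether a
    pending pod's GPU request fits on any node."""
    free = {name: cap - alloc for name, (cap, alloc) in nodes.items()}
    schedulable = any(n >= pending_request for n in free.values())
    return free, schedulable

# Hypothetical cluster state: every L4 slot is taken
nodes = {"gpu-node-1": (4, 4), "gpu-node-2": (4, 4)}
free, fits = free_gpus(nodes, pending_request=1)
# fits is False: the pod stays Pending until a slot is freed or a node added
```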

13. Same model appears twice in LiteMaaS
Symptoms A model name appears twice in the LiteMaaS models page or in GET /model/info. Users may get inconsistent responses — some requests go to one backend, others to the other. This usually happens after a model is added manually via the API and then also picked up by the sync, or when a model is re-registered without removing the old entry.

Diagnose

MASTER_KEY=$(oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)

# List all models and spot duplicates
curl -s https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer $MASTER_KEY" | \
  python3 -c "
import sys,json
d=json.load(sys.stdin)
names=[m.get('model_name') for m in d.get('data',[])]
dups=[n for n in names if names.count(n)>1]
print('Duplicates:', set(dups))
for m in d.get('data',[]):
    if m.get('model_name') in dups:
        print(m.get('model_name'), m.get('model_info',{}).get('id'))
"

Fix

# Delete the duplicate by its model_info.id
curl -X DELETE https://litellm-prod.apps.maas.redhatworkshops.io/model/delete \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"id": "<model-info-id-to-remove>"}'

# Verify it is removed
curl -s https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer $MASTER_KEY" | \
  python3 -c "import sys,json; [print(m['model_name']) for m in json.load(sys.stdin).get('data',[])]"

General Diagnostic Commands

NS=litellm-rhpds

# Full cluster state
oc get all -n $NS

# Pod events (shows crash reasons, scheduling issues)
oc get events -n $NS --sort-by='.lastTimestamp' | tail -30

# LiteLLM proxy health
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -sk "https://$ROUTE/health/liveliness"
curl -sk "https://$ROUTE/health/readiness"

# LiteLLM request logs (last 5 minutes)
oc logs -n $NS deployment/litellm --since=5m | grep -E "ERROR|WARN|Exception"

# Backend API logs
oc logs -n $NS deployment/litellm-backend --since=5m | grep -E "error|warn|Error"

# Database connectivity check (from backend pod)
oc exec -n $NS deployment/litellm-backend -- \
  sh -c 'echo "DB: $DATABASE_URL" | sed "s/:.*@/:REDACTED@/"'

# Check PostgreSQL is accepting connections
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "SELECT 1 as connected;"

# Redis connectivity (from LiteLLM pod)
LITELLM_POD=$(oc get pods -n $NS -l app=litellm \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n $NS $LITELLM_POD -- \
  sh -c 'redis-cli -h $REDIS_HOST ping 2>/dev/null || echo "Redis not reachable"'

Intel Gaudi Cluster — Troubleshooting

The Gaudi cluster at maas00.rs-dfw3.infra.demo.redhat.com runs models via direct KServe routes — there is no LiteMaaS or LiteLLM proxy. Use these commands when troubleshooting Gaudi-specific issues. For access credentials, contact Ashok.

Check model status

# List InferenceServices and their ready state
oc get inferenceservice -n llm-hosting

# Detailed status for a specific model
oc describe inferenceservice deepseek-r1-distill-qwen-14b -n llm-hosting

Test endpoint directly

# Test the KServe route directly
curl -s https://deepseek-r1-distill-qwen-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Quick chat test
curl -s https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"ping"}]}'

Check Gaudi hardware and pods

# Check Gaudi device allocation on the node
oc describe node maas00-sno | grep -A5 "habana\|Allocated"

# Check predictor pod logs
oc logs -n llm-hosting -l serving.kserve.io/inferenceservice=deepseek-r1-distill-qwen-14b --tail=50

# Check Habana AI operator
oc get pods -n habana-ai-operator

Check Grafana for Gaudi metrics

# Get Grafana route
oc get route grafana-route -n llm-hosting
# URL: https://grafana-route-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
Common Gaudi issue: If a model pod is stuck in Pending, check if the habana.ai/gaudi resource limit is set correctly in the InferenceService and that the Habana AI Operator is running. The node only has 8 Gaudi cards — if all are allocated, new pods will not schedule.