Model as a Service for Red Hat Demo Platform
Upgrading, scaling, model management, key cleanup, and database synchronization for a running LiteMaaS deployment.
In production, upgrades are handled by updating the deployment image tags. LiteMaaS uses rolling deployments — pods are replaced one at a time, so the service remains available during upgrades.
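Whether pods really roll one at a time depends on the deployment's `rollingUpdate` strategy. A minimal sketch of inspecting it from captured JSON — the inline sample and the `maxSurge=1/maxUnavailable=0` values are illustrative; in practice capture the live spec with `oc get deployment/litellm -n litellm-rhpds -o json`:

```bash
# Sketch: read a deployment's rolling-update strategy from captured JSON.
# maxSurge=1 / maxUnavailable=0 yields strict one-at-a-time replacement
# with no capacity dip; the sample below is illustrative.
cat > /tmp/deploy.json <<'EOF'
{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}
EOF
jq -r '.spec.strategy | "\(.type): maxSurge=\(.rollingUpdate.maxSurge) maxUnavailable=\(.rollingUpdate.maxUnavailable)"' \
  /tmp/deploy.json
```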
| Deployment | Current Image | Version |
|---|---|---|
| litellm | quay.io/rh-aiservices-bu/litellm-non-root | main-v1.81.0-stable-custom |
| litellm-backend | quay.io/rh-aiservices-bu/litemaas-backend | 0.4.0 |
| litellm-frontend | quay.io/rh-aiservices-bu/litemaas-frontend | 0.4.0 |
| litellm-redis | registry.redhat.io/rhel9/redis-7 | latest |
```bash
# Set namespace variable
NS=litellm-rhpds
NEW_VERSION=0.5.0

# Update backend image
oc set image deployment/litellm-backend \
  backend=quay.io/rh-aiservices-bu/litemaas-backend:${NEW_VERSION} \
  -n ${NS}

# Update frontend image
oc set image deployment/litellm-frontend \
  frontend=quay.io/rh-aiservices-bu/litemaas-frontend:${NEW_VERSION} \
  -n ${NS}

# Monitor rollout
oc rollout status deployment/litellm-backend -n ${NS}
oc rollout status deployment/litellm-frontend -n ${NS}
```
Frontend version mismatch: If the frontend shows a stale version after upgrade, check that the init container image (if used for static asset injection) matches the new frontend tag. A mismatch between the main container and init container is a common cause of version confusion. See Troubleshooting for the fix.
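A quick way to spot such a mismatch is to compare the two image fields in the deployment spec. A sketch with `jq` on an inline sample — the container names (`frontend`, `assets`) and image tags here are assumptions for illustration; in practice capture the spec with `oc get deployment/litellm-frontend -n litellm-rhpds -o json`:

```bash
# Sketch: flag a main-vs-init container image mismatch from captured JSON.
cat > /tmp/frontend.json <<'EOF'
{"spec":{"template":{"spec":{
  "containers":[{"name":"frontend","image":"quay.io/rh-aiservices-bu/litemaas-frontend:0.5.0"}],
  "initContainers":[{"name":"assets","image":"quay.io/rh-aiservices-bu/litemaas-frontend:0.4.0"}]}}}}
EOF
MAIN=$(jq -r '.spec.template.spec.containers[0].image' /tmp/frontend.json)
INIT=$(jq -r '.spec.template.spec.initContainers[0].image // empty' /tmp/frontend.json)
# No init container, or matching images: nothing to report
[ -z "$INIT" ] || [ "$MAIN" = "$INIT" ] || echo "MISMATCH: main=$MAIN init=$INIT"
```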
```bash
# Before upgrading LiteLLM, take a database backup
oc exec -n ${NS} litellm-postgres-0 -- \
  pg_dump -U litellm litellm | gzip > /tmp/litellm-pre-upgrade.sql.gz

# Update LiteLLM image
NEW_LITELLM_TAG=main-v1.85.0-stable-custom
oc set image deployment/litellm \
  litellm=quay.io/rh-aiservices-bu/litellm-non-root:${NEW_LITELLM_TAG} \
  -n ${NS}

# Watch the rollout -- LiteLLM runs DB migrations on startup
oc rollout status deployment/litellm -n ${NS}

# Check logs for migration output
oc logs -n ${NS} deployment/litellm --tail=50 | grep -i migrat
```
```bash
# Override image tags in your vars file, then re-run the playbook
ansible-playbook playbooks/deploy_litemaas_ha.yml \
  -e ocp4_workload_litemaas_namespace=litellm-rhpds \
  -e ocp4_workload_litemaas_litellm_tag=main-v1.85.0-stable-custom \
  -e ocp4_workload_litemaas_version=0.5.0
```
All three main deployments (LiteLLM, Backend, Frontend) support horizontal scaling. Scaling is stateless for Backend and Frontend; LiteLLM uses Redis for shared session state across replicas.
```bash
# Scale LiteLLM to 5 replicas
oc scale deployment/litellm --replicas=5 -n litellm-rhpds

# Or via patch
oc patch deployment/litellm -n litellm-rhpds \
  --type=json -p='[{"op":"replace","path":"/spec/replicas","value":5}]'

# Verify
oc get deployment/litellm -n litellm-rhpds
```
```bash
oc scale deployment/litellm-backend --replicas=3 -n litellm-rhpds
oc scale deployment/litellm-frontend --replicas=3 -n litellm-rhpds
```
```bash
oc get deployments -n litellm-rhpds -o wide
```
Redis is required for multi-replica LiteLLM. Without it, each LiteLLM pod keeps its own
in-memory session cache, and key validation becomes inconsistent across replicas.
Verify Redis is healthy before scaling LiteLLM beyond 1 replica:
```bash
oc get pods -n litellm-rhpds -l app=litellm-redis
```
In a production deployment, virtual API keys accumulate over time. Workshop participants receive 30-day keys, and after the workshop ends, those keys remain in LiteLLM's database consuming storage and complicating admin views. The key cleanup cronjob handles this automatically.
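The core test the cleanup relies on is simple: a key is expired when its `expires` timestamp is earlier than now. A sketch of that comparison (GNU `date` assumed, as on a Linux bastion; the sample timestamp is illustrative):

```bash
# Sketch: decide whether a key's ISO 8601 expiry is in the past.
EXPIRES="2024-01-01T00:00:00Z"   # sample value; real keys carry their own
if [ "$(date -d "$EXPIRES" +%s)" -lt "$(date +%s)" ]; then
  echo "expired: $EXPIRES"
else
  echo "still valid: $EXPIRES"
fi
```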
What the cronjob does:

- Finds keys whose `expires` timestamp is in the past (expired keys, derived from the key's `expires - duration` metadata)
- Deletes each expired key from LiteLLM via `POST /key/delete`
- Updates the `api_keys` table in LiteMaaS PostgreSQL to mark the key inactive, with a `revoked_at` timestamp and `sync_status='error'`
- Marks `api_keys` records inactive if their corresponding `LiteLLM_VerificationToken` no longer exists
- Logs to `/var/log/litemaas-key-cleanup.log` with logrotate (30-day retention)

```bash
# Run from the rhpds.litemaas repository root
# Detects whether running on bastion (direct write) or workstation (SSH)
./setup-key-cleanup-cronjob.sh litellm-rhpds
```
The script creates:
- `/usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh` — the cleanup script
- A root crontab entry (daily at 02:00): `0 2 * * * /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh`
- `/etc/logrotate.d/litemaas-key-cleanup-litellm-rhpds` — logrotate config

```bash
# Run manually (dry-run not supported — this will actually delete keys)
sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh

# Tail the log
sudo tail -f /var/log/litemaas-key-cleanup.log

# Verify cronjob is installed
sudo crontab -l | grep cleanup-litemaas
```
```bash
# Remove the cronjob entry and the cleanup script
sudo crontab -l | grep -v cleanup-litemaas-keys-litellm-rhpds.sh | sudo crontab -
sudo rm /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh
```
Models must be registered in two places: the LiteLLM proxy (which handles actual inference routing) and the LiteMaaS backend database (which powers the user-facing model catalog). Failing to sync both causes the "subscription foreign key violation" error.
To add a model via the LiteLLM admin UI:

- Open the admin UI (e.g. `https://litellm-prod.apps.maas.redhatworkshops.io`)
- Model name: `llama-3-1-70b-instruct`
- API base: `http://llama-3-1-70b-predictor.llm-hosting.svc.cluster.local/v1`
- API key: `sk-placeholder` if no auth required

```bash
# After adding via UI, sync to LiteMaaS backend database
LITELLM_URL=$(oc get route litellm-prod -n litellm-rhpds \
  -o jsonpath='https://{.spec.host}')
LITELLM_KEY=$(oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)

ansible-playbook playbooks/manage_models.yml \
  -e litellm_url="${LITELLM_URL}" \
  -e litellm_master_key="${LITELLM_KEY}" \
  -e ocp4_workload_litemaas_models_namespace=litellm-rhpds \
  -e ocp4_workload_litemaas_models_sync_from_litellm=true \
  -e '{"ocp4_workload_litemaas_models_list": []}'
```
```bash
# Create a model config file
cat > new-model.yml <<'EOF'
litellm_url: "https://litellm-prod.apps.maas.redhatworkshops.io"
litellm_master_key: "sk-xxxxx"
ocp4_workload_litemaas_models_namespace: "litellm-rhpds"
ocp4_workload_litemaas_models_backend_enabled: true
ocp4_workload_litemaas_models_list:
  - model_name: "llama-3-1-70b-instruct"
    litellm_model: "openai/llama-3-1-70b-instruct"
    api_base: "http://llama-3-1-70b-predictor.llm-hosting.svc.cluster.local/v1"
    api_key: "sk-placeholder"
    display_name: "Llama 3.1 70B Instruct"
    description: "Meta Llama 3.1 70B instruction-tuned model"
    provider: "openshift-ai"
    category: "chat"
    context_length: 131072
    rpm: 30
    tpm: 500000
EOF

ansible-playbook playbooks/manage_models.yml -e @new-model.yml
```
```bash
# Check LiteLLM has the model
curl -X GET "${LITELLM_URL}/model/info" \
  -H "Authorization: Bearer ${LITELLM_KEY}" | \
  jq '.data[] | select(.model_name == "llama-3-1-70b-instruct") | .model_name'

# Check LiteMaaS backend DB has the model
oc exec -n litellm-rhpds litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT id, name, provider, availability FROM models WHERE id = 'llama-3-1-70b-instruct';"
```
| Parameter | Required | Default | Description |
|---|---|---|---|
| model_name | Yes | — | Unique identifier passed to LiteLLM and used as model ID |
| litellm_model | Yes | — | LiteLLM model format: openai/model-id for OpenAI-compatible endpoints |
| api_base | Yes | — | Inference endpoint URL. Use internal ClusterIP for KServe models. |
| api_key | Yes | — | Auth token. Use sk-placeholder if the endpoint requires no auth. |
| display_name | No | model_name | Human-readable name shown in LiteMaaS UI |
| provider | No | openshift-ai | Provider label (informational) |
| category | No | general | Type: chat, code, general, embeddings |
| context_length | No | null | Context window in tokens (display only) |
| rpm | No | null | Requests per minute limit enforced by LiteLLM |
| tpm | No | null | Tokens per minute limit enforced by LiteLLM |
| supports_streaming | No | true | Enable streaming responses |
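Given the defaults above, a minimal entry needs only the four required parameters; everything else falls back to the table's defaults. A sketch — the model name and endpoint below are placeholders, not real services:

```yaml
ocp4_workload_litemaas_models_list:
  - model_name: "my-model"                 # placeholder id
    litellm_model: "openai/my-model"
    api_base: "http://my-model-predictor.llm-hosting.svc.cluster.local/v1"
    api_key: "sk-placeholder"
```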
The LITELLM_AUTO_SYNC environment variable controls whether the LiteMaaS backend automatically
synchronizes its model database from LiteLLM on startup and at periodic intervals. When enabled, the backend
pulls the current model list from LiteLLM and updates its own models table accordingly.
```bash
# Check the current setting on a running backend pod
oc exec -n litellm-rhpds deployment/litellm-backend -- \
  sh -c 'echo LITELLM_AUTO_SYNC=$LITELLM_AUTO_SYNC'
```
```bash
# Set env var on the backend deployment
oc set env deployment/litellm-backend \
  LITELLM_AUTO_SYNC=true \
  -n litellm-rhpds

# Restart to apply
oc rollout restart deployment/litellm-backend -n litellm-rhpds
```
| Scenario | Recommendation |
|---|---|
| Models added only via Ansible playbook | Auto-sync unnecessary — playbook handles both LiteLLM and backend DB |
| Models frequently added/removed via LiteLLM admin UI | Enable auto-sync to avoid running the sync playbook after each change |
| Production with tight change control | Disable auto-sync, use explicit sync playbook so changes are deliberate |
Auto-sync and model deletion: If a model is deleted from LiteLLM and auto-sync is enabled, the backend will remove it from its database too — which will cascade-delete user subscriptions to that model. Be cautious about deleting models from LiteLLM when users have active subscriptions.
LiteMaaS maintains its own database tables (users, subscriptions, api_keys, models) that must stay in sync
with LiteLLM's LiteLLM_VerificationToken and LiteLLM_ModelTable tables.
Divergence between the two databases is the root cause of most operational issues.
Both LiteMaaS backend and LiteLLM proxy connect to the same PostgreSQL instance (litellm-postgres-0), but use different schemas/tables:
| Table | Owner | Purpose |
|---|---|---|
| users | LiteMaaS backend | User accounts, roles, OAuth IDs |
| models | LiteMaaS backend | Model catalog with capability metadata |
| subscriptions | LiteMaaS backend | User-to-model subscription records |
| api_keys | LiteMaaS backend | Virtual key tracking with LiteLLM alias references |
| LiteLLM_VerificationToken | LiteLLM proxy | The actual virtual key records with spend/limits |
| LiteLLM_ModelTable | LiteLLM proxy | LiteLLM's internal model registry |
| LiteLLM_SpendLogs | LiteLLM proxy | Per-request spend audit trail |
```bash
# List active LiteMaaS keys whose LiteLLM token no longer exists
oc exec -n litellm-rhpds litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "
    SELECT ak.id, ak.litellm_key_alias, ak.is_active, ak.created_at
    FROM api_keys ak
    WHERE ak.is_active = true
      AND ak.litellm_key_alias IS NOT NULL
      AND NOT EXISTS (
        SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
        WHERE lv.key_alias = ak.litellm_key_alias
      )
    LIMIT 20;"
```
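The same orphan check can be run outside SQL by diffing sorted alias lists, which is handy for eyeballing the damage before marking anything inactive. A sketch with `comm` on sample data — in practice you would populate the two files from `psql` output and the LiteLLM `/key/list` API:

```bash
# Sketch: aliases present in LiteMaaS but absent from LiteLLM are orphans.
# Sample data for illustration; comm requires sorted input.
printf 'alias-a\nalias-b\nalias-c\n' | sort > /tmp/litemaas-aliases.txt
printf 'alias-a\nalias-c\n'          | sort > /tmp/litellm-aliases.txt
# -23 suppresses lines unique to file2 and lines in both,
# leaving only the orphaned LiteMaaS aliases
comm -23 /tmp/litemaas-aliases.txt /tmp/litellm-aliases.txt
```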
```bash
# Mark orphaned LiteMaaS keys as inactive
oc exec -n litellm-rhpds litellm-postgres-0 -- \
  psql -U litellm -d litellm -c "
    UPDATE api_keys
    SET is_active = false,
        revoked_at = NOW(),
        sync_status = 'error',
        sync_error = 'Key not found in LiteLLM - manual cleanup',
        updated_at = NOW()
    WHERE is_active = true
      AND litellm_key_alias IS NOT NULL
      AND NOT EXISTS (
        SELECT 1 FROM \"LiteLLM_VerificationToken\" lv
        WHERE lv.key_alias = api_keys.litellm_key_alias
      );"
```
```bash
# List LiteLLM models missing from LiteMaaS models table
curl -X GET "${LITELLM_URL}/model/info" \
  -H "Authorization: Bearer ${LITELLM_KEY}" | \
  jq '.data[].model_name' | sort > /tmp/litellm-models.txt

oc exec -n litellm-rhpds litellm-postgres-0 -- \
  psql -U litellm -d litellm -t -c "SELECT id FROM models;" | \
  sort > /tmp/litemaas-models.txt

# Show models in LiteLLM but not in LiteMaaS
diff /tmp/litellm-models.txt /tmp/litemaas-models.txt
```
```bash
# Backend exposes a sync endpoint (admin auth required)
ADMIN_KEY=$(oc get secret backend-secret -n litellm-rhpds \
  -o jsonpath='{.data.ADMIN_API_KEY}' | base64 -d)
BACKEND_URL=$(oc get route litellm-prod-admin -n litellm-rhpds \
  -o jsonpath='https://{.spec.host}')

curl -X POST "${BACKEND_URL}/api/admin/sync-models" \
  -H "Authorization: Bearer ${ADMIN_KEY}"
```
The user must have logged in via OAuth at least once before being promoted.
```bash
# Using the helper script (recommended)
./promote-admin.sh litellm-rhpds user@redhat.com

# Or directly via psql
oc exec -n litellm-rhpds \
  $(oc get pods -n litellm-rhpds -l app=litellm-postgres -o name | head -1) -- \
  psql -U litellm -d litellm -c \
  "UPDATE users SET roles = ARRAY['admin', 'user'] WHERE email = 'user@redhat.com';"

# Verify
oc exec -n litellm-rhpds litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT username, email, roles FROM users WHERE 'admin' = ANY(roles);"
```
```bash
# LiteMaaS admin API key
oc get secret backend-secret -n litellm-rhpds \
  -o jsonpath='{.data.ADMIN_API_KEY}' | base64 -d

# LiteLLM master key
oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d

# JWT secret (for session debugging)
oc get secret backend-secret -n litellm-rhpds \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d
```
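Secret values come back base64-encoded, which is why each command above pipes through `base64 -d`. A round-trip sketch with a made-up key value:

```bash
# Sketch: encode then decode a sample value, mirroring how Secret data is stored.
# sk-example-master-key is a made-up value for illustration.
ENCODED=$(printf 'sk-example-master-key' | base64)   # form stored in the Secret
printf '%s' "$ENCODED" | base64 -d                   # recovers the raw key
echo
```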
```bash
# Requires S3 bucket with IAM role attached to bastion EC2 instance
./setup-litemaas-backup-cronjob.sh litellm-rhpds maas-db-backup
```
```bash
# Backup LiteMaaS + LiteLLM shared database
oc exec -n litellm-rhpds litellm-postgres-0 -- \
  pg_dump -U litellm litellm | gzip > litellm-backup-$(date +%Y%m%d).sql.gz

# Upload to S3
aws s3 cp litellm-backup-$(date +%Y%m%d).sql.gz \
  s3://maas-db-backup/litemaas-backups/
```
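Before uploading (or deleting local copies), it is worth confirming the archive is readable; `gunzip -t` checks integrity without extracting. A sketch of the same dump-and-compress pipeline on sample data instead of a live `pg_dump`:

```bash
# Sketch: round-trip a sample file through the gzip stage and verify it.
printf 'SELECT 1;\n' > /tmp/sample.sql
gzip -c /tmp/sample.sql > /tmp/sample.sql.gz
# -t tests archive integrity without writing the decompressed output
gunzip -t /tmp/sample.sql.gz && echo "archive OK"
```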
```bash
aws s3 ls s3://maas-db-backup/litemaas-backups/ --human-readable | sort -r
```
```bash
# Namespace shortcut
NS=litellm-rhpds

# Get all pod status
oc get pods -n $NS

# Get all routes
oc get routes -n $NS

# Tail LiteLLM logs (model routing)
oc logs -n $NS deployment/litellm -f --tail=100

# Tail LiteMaaS backend logs (subscription/key ops)
oc logs -n $NS deployment/litellm-backend -f --tail=100

# Tail frontend logs
oc logs -n $NS deployment/litellm-frontend -f --tail=100

# Health check
ROUTE=$(oc get route litellm-prod -n $NS -o jsonpath='{.spec.host}')
curl -sk https://$ROUTE/health/livenessz

# List all virtual keys (paginated)
LITELLM_KEY=$(oc get secret litellm-secret -n $NS \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
curl "https://$ROUTE/key/list?return_full_object=true&size=100&page=1" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.keys | length'

# Count active subscriptions in LiteMaaS DB
oc exec -n $NS litellm-postgres-0 -- \
  psql -U litellm -d litellm -c \
  "SELECT COUNT(*) FROM subscriptions WHERE status = 'active';"

# Force Redis cache flush (after model changes)
REDIS_POD=$(oc get pods -n $NS -l app=litellm-redis -o name | head -1)
oc exec -n $NS $REDIS_POD -- redis-cli FLUSHALL

# Restart all LiteMaaS components (in order)
oc rollout restart deployment/litellm-backend -n $NS
oc rollout restart deployment/litellm -n $NS
oc rollout restart deployment/litellm-frontend -n $NS
```