Model as a Service for Red Hat Demo Platform
How LiteLLM routes traffic — GPU load balancing across on-prem servers, fallback configuration, and cost-based routing between cloud providers (Vertex AI and AWS Bedrock).
The LiteLLM Router is a Python Router class that runs inside the LiteLLM proxy process — it is not a separate service or sidecar.
When the proxy starts, it reads all registered model deployments and groups them by model_name.
For each incoming request, the Router:
1. Looks up the deployment group matching the requested model_name.
2. Filters out deployments that are currently in cooldown.
3. Picks one of the remaining backends according to the configured routing strategy.
Because state (health, cooldown timers, request counts) is held in memory per pod, each replica makes independent routing decisions. There is no shared routing state across replicas.
Use load balancing when: the same model runs on 2 or more GPU servers and you want to distribute traffic across all of them simultaneously. Users see one model name; all backends serve requests in parallel.
Use fallback when: you want server B as a backup if server A goes down, not for simultaneous load distribution. Fallback only activates after the primary group fails, so under normal conditions all traffic goes to server A.
LiteLLM automatically load balances when two or more model registrations share the same model_name.
No explicit router configuration is needed — registering a second endpoint with the same name is all it takes.
Users see one model; LiteLLM round-robins requests across all registered backends.
Set your admin key once, then use it for all commands below:
export ADMIN_KEY=sk-1234567890abcdef1234
export LITELLM_URL=https://litellm-prod.apps.maas.redhatworkshops.io
Same model_name, different api_base — load balancing activates automatically:
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "qwen3-14b",
    "litellm_params": {
      "model": "openai/qwen3-14b",
      "api_base": "https://qwen3-14b-llm-hosting.apps.server2.example.com/v1",
      "custom_llm_provider": "openai"
    }
  }'
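To confirm both deployments landed under the same group, list the registrations. This sketch assumes your LiteLLM version exposes GET /model/info (present in recent releases):

curl -sk "$LITELLM_URL/model/info" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  | python3 -c "import sys,json; [print(m.get('model_name'), m.get('litellm_params',{}).get('api_base','-')) for m in json.load(sys.stdin).get('data',[])]"

Two lines with model_name qwen3-14b and different api_base values means the group is set up correctly.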
Run this during events — shows which backends are active and which are in cooldown:
curl -sk "$LITELLM_URL/health?model=qwen3-14b" -H "Authorization: Bearer $ADMIN_KEY" | python3 -c "import sys,json; d=json.load(sys.stdin); [print('OK ',e['api_base']) for e in d.get('healthy_endpoints',[]) if isinstance(e,dict)]; [print('DOWN',e['api_base']) for e in d.get('unhealthy_endpoints',[]) if isinstance(e,dict)]"
Expected output — one line per backend:
OK   https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1
OK   https://qwen3-14b-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com/v1
Each response includes x-litellm-model-id — different IDs confirm different backends are being hit:
for i in 1 2 3 4 5; do
  curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
    -H "Authorization: Bearer $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
    -D - -o /dev/null | grep -i "x-litellm-model-id"
done | sort | uniq -c
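Illustrative output only; the IDs below are placeholders for whatever your deployment generates. With two healthy backends, the five requests should split across both:

      3 x-litellm-model-id: <id-of-backend-a>
      2 x-litellm-model-id: <id-of-backend-b>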
Current strategy and tunable parameters. Change via Admin UI → Router Settings.
curl -sk "$LITELLM_URL/router/settings" -H "Authorization: Bearer $ADMIN_KEY" | python3 -c "import sys,json; cv=json.load(sys.stdin).get('current_values',{}); [print(f'{k}: {v}') for k,v in cv.items() if v is not None and v != {} and v != []]"
A fallback routes requests to a backup model only when the primary group fails — it does not distribute traffic simultaneously. Under normal conditions, all requests go to the primary group; the fallback activates only after the primary is exhausted or in cooldown.
Configure fallbacks in the LiteLLM config or via the Admin UI. Example config.yaml snippet:
# litellm config.yaml
router_settings:
  fallbacks:
    - qwen3-14b:
        - qwen3-7b  # used only when qwen3-14b group is fully down
You can also set fallbacks per-request using the fallbacks field in the API payload:
curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
-H "Authorization: Bearer $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"fallbacks": ["qwen3-7b"],
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 50
}'
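To exercise the fallback path without taking qwen3-14b down, LiteLLM also accepts a mock_testing_fallbacks flag that forces the primary to fail. Treat this as a sketch and confirm the flag against your LiteLLM version's docs:

curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "fallbacks": ["qwen3-7b"],
    "mock_testing_fallbacks": true,
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'

A successful response served by qwen3-7b confirms the fallback chain works end to end.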
Fallback vs. load balancing: If you want both maas00 and smc00 to serve qwen3-14b simultaneously, register both under the same model_name (load balancing). If you want smc00 only as a safety net, use fallback configuration.
The same model_name grouping that drives GPU load balancing also works across cloud providers.
RHDP LiteMaaS uses this in production to route four OSS models between Google Vertex AI and AWS Bedrock,
using LiteLLM's cost-based-routing strategy instead of round-robin.
Each model has two registrations — one Vertex AI backend and one Bedrock backend — both under the same model_name.
For every request, LiteLLM estimates the cost on each provider and dispatches to the cheapest one.
If that provider accumulates two failures, it enters a 60-second cooldown and all traffic shifts to the other automatically.
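Registration mirrors the GPU case: same model_name, but the litellm_params point at different providers and carry per-token prices so the router can compare them. A sketch for gpt-oss-120b; the provider model strings are placeholders (the exact identifiers are on the provider pages linked at the bottom), and the per-token costs are the pricing table below divided by 1,000,000:

# Vertex AI backend: $0.09 in / $0.36 out per 1M tokens
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-oss-120b",
    "litellm_params": {
      "model": "vertex_ai/<vertex-model-id>",
      "input_cost_per_token": 0.00000009,
      "output_cost_per_token": 0.00000036
    }
  }'
# Bedrock backend: $0.15 in / $0.60 out per 1M tokens
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-oss-120b",
    "litellm_params": {
      "model": "bedrock/<bedrock-model-id>",
      "input_cost_per_token": 0.00000015,
      "output_cost_per_token": 0.0000006
    }
  }'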
This differs from GPU load balancing in a key way:
| Strategy | Traffic distribution | Primary driver | Example |
|---|---|---|---|
| Round-robin (GPU) | Every request rotates across all backends | Even load spread | qwen3-14b on maas00 + smc00 |
| Cost-based (cloud) | Cheapest provider gets all traffic; other is standby | Lowest cost per request | gpt-oss-120b on Vertex + Bedrock |
No user-facing change. Both providers are registered under the same model_name. Users call gpt-oss-120b — the router selects the provider silently. Bedrock backends are hidden from the LiteMaaS UI entirely.
Cost-based routing isn't only about cost: it also gives you automatic failover with no extra configuration.
When the preferred provider (Vertex AI) starts failing, LiteLLM doesn't drop the request.
After allowed_fails: 2 consecutive errors, the Vertex backend enters a 60-second cooldown
and every subsequent request is automatically routed to Bedrock until Vertex recovers.
This means a full Vertex AI outage is handled transparently — users see no errors,
just slightly higher latency from the first two failed attempts before the switch kicks in.
The same works in reverse: if the router has selected Bedrock (as it can for equal-price models like minimax-m2)
and Bedrock goes down, traffic shifts to Vertex.
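The x-litellm-model-id header check from the GPU section works here too; during a provider incident you can watch the ID flip once the cooldown takes effect. This assumes gpt-oss-120b is registered as sketched above:

curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
  -D - -o /dev/null | grep -i "x-litellm-model-id"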
Cost-based routing = cost optimisation + automatic failover in one. No separate fallback config is needed when you register both providers under the same model_name with cost-based routing enabled.
| model_name | Vertex cost (in/out per 1M tokens) | Bedrock cost (in/out per 1M tokens) | Router picks |
|---|---|---|---|
| gpt-oss-120b | $0.09 / $0.36 | $0.15 / $0.60 | Vertex (cheaper) |
| gpt-oss-20b | $0.07 / $0.25 | $0.07 / $0.30 | Vertex (cheaper) |
| minimax-m2 | $0.30 / $1.20 | $0.30 / $1.20 | Either (same price) |
| qwen3-235b | $0.22 / $0.88 | $0.22 / $0.88 | Either (same price) |
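As a quick sanity check on the router's math: a gpt-oss-120b request with 1,000 input and 500 output tokens costs about 1,000 × $0.09/1M + 500 × $0.36/1M ≈ $0.00027 on Vertex versus $0.00045 on Bedrock, so at these prices Vertex wins every comparison.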
The routing_strategy is set globally in the litellm-router-config ConfigMap and applies to all multi-backend model groups:
# litellm-router-config ConfigMap
router_settings:
  routing_strategy: cost-based-routing
  num_retries: 3
  retry_after: 5
  allowed_fails: 2
  cooldown_time: 60
For the full implementation — IAM setup, OCP secret creation, Bedrock model registration commands, and master key rotation gotcha — see the Vertex AI and AWS Bedrock provider pages.