Model as a Service for Red Hat Demo Platform
Distributing traffic across multiple GPU backends and configuring fallback when a server goes down.
The LiteLLM Router is a Python Router class that runs inside the LiteLLM proxy process — it is not a separate service or sidecar.
When the proxy starts, it reads all registered model deployments and groups them by model_name.
For each incoming request, the Router:
1. Looks up every deployment registered under the requested model_name.
2. Filters out deployments currently in cooldown after recent failures.
3. Selects one of the remaining deployments according to the configured routing strategy (simple-shuffle by default).
4. Forwards the request; on failure, it places that deployment in cooldown and retries another.
Because state (health, cooldown timers, request counts) is held in memory per pod, each replica makes independent routing decisions. There is no shared routing state across replicas.
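The per-replica behavior above can be sketched in a few lines of Python. This is an illustrative model only — the ReplicaRouter class, the 60-second default, and the backend URLs are assumptions for the sketch, not LiteLLM's actual implementation:

```python
import time

class ReplicaRouter:
    """Illustrative per-pod router state: cooldowns live only in this process."""

    def __init__(self, deployments, cooldown_seconds=60):
        self.deployments = deployments      # api_base strings in one model_name group
        self.cooldown_seconds = cooldown_seconds
        self.cooldown_until = {}            # api_base -> unix timestamp, in memory only

    def available(self):
        now = time.time()
        return [d for d in self.deployments
                if self.cooldown_until.get(d, 0) <= now]

    def mark_failed(self, api_base):
        # Another replica never sees this: cooldown state is not shared across pods.
        self.cooldown_until[api_base] = time.time() + self.cooldown_seconds

# Two proxy replicas with the same deployment list make independent decisions:
backends = ["https://server-a/v1", "https://server-b/v1"]
pod1, pod2 = ReplicaRouter(backends), ReplicaRouter(backends)
pod1.mark_failed("https://server-a/v1")
print(pod1.available())  # only server-b remains for pod1
print(pod2.available())  # pod2 still sees both: it has not observed the failure
```

The practical consequence: after one backend fails, some replicas may keep sending it traffic briefly until they observe failures of their own.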
Use load balancing when: the same model runs on 2 or more GPU servers and you want to distribute traffic across all of them simultaneously. Users see one model name; all backends serve requests in parallel.
Use fallback when: you want server B as a backup if server A goes down — not for simultaneous load distribution. Fallback only activates after the primary group fails, so under normal conditions all traffic goes to server A.
LiteLLM automatically load balances when two or more model registrations share the same model_name.
No explicit router configuration is needed — registering a second endpoint with the same name is all it takes.
Users see one model; LiteLLM round-robins requests across all registered backends.
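The user-visible effect can be sketched with a simple round-robin picker. This is a simplification for illustration (the model_group dict and URLs are hypothetical; LiteLLM's actual default strategy is simple-shuffle, not strict round-robin):

```python
from itertools import cycle

# One logical model name maps to every backend registered under it.
model_group = {
    "qwen3-14b": [
        "https://qwen3-14b.apps.server1.example.com/v1",
        "https://qwen3-14b.apps.server2.example.com/v1",
    ],
}

# One rotation iterator per model group.
rotation = {name: cycle(backends) for name, backends in model_group.items()}

def pick_backend(model_name):
    """Return the next backend for this model; callers only know the model name."""
    return next(rotation[model_name])

# Six requests for "qwen3-14b" alternate between the two registered backends.
for _ in range(6):
    print(pick_backend("qwen3-14b"))
```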
Set your admin key once, then use it for all commands below:
export ADMIN_KEY=sk-1234567890abcdef1234
export LITELLM_URL=https://litellm-prod.apps.maas.redhatworkshops.io
Same model_name, different api_base — load balancing activates automatically:
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "qwen3-14b",
    "litellm_params": {
      "model": "openai/qwen3-14b",
      "api_base": "https://qwen3-14b-llm-hosting.apps.server2.example.com/v1",
      "custom_llm_provider": "openai"
    }
  }'
Run this during events — shows which backends are active and which are in cooldown:
curl -sk "$LITELLM_URL/health?model=qwen3-14b" -H "Authorization: Bearer $ADMIN_KEY" \
  | python3 -c "
import sys, json

d = json.load(sys.stdin)
for e in d.get('healthy_endpoints', []):
    if isinstance(e, dict):
        print('OK  ', e['api_base'])
for e in d.get('unhealthy_endpoints', []):
    if isinstance(e, dict):
        print('DOWN', e['api_base'])
"
Expected output — one line per backend:
OK   https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1
OK   https://qwen3-14b-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com/v1
Each response includes x-litellm-model-id — different IDs confirm different backends are being hit:
for i in 1 2 3 4 5; do
  curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
    -H "Authorization: Bearer $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
    -D - -o /dev/null | grep -i "x-litellm-model-id"
done | sort | uniq -c
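The sort | uniq -c tail simply counts distinct model IDs. The same tally in Python, over hypothetical captured header values, shows what a healthy result looks like — more than one distinct ID means more than one backend served traffic:

```python
from collections import Counter

# Hypothetical x-litellm-model-id values captured from five responses.
model_ids = ["id-server-a", "id-server-b", "id-server-a", "id-server-b", "id-server-a"]

counts = Counter(model_ids)
for model_id, n in counts.items():
    print(n, model_id)

# Two distinct IDs here: both backends received traffic.
```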
Current strategy and tunable parameters. Change via Admin UI → Router Settings.
curl -sk "$LITELLM_URL/router/settings" -H "Authorization: Bearer $ADMIN_KEY" \
  | python3 -c "
import sys, json

cv = json.load(sys.stdin).get('current_values', {})
for k, v in cv.items():
    if v is not None and v != {} and v != []:
        print(f'{k}: {v}')
"
A fallback routes requests to a backup model only when the primary group fails — it does not distribute traffic simultaneously. Under normal conditions, all requests go to the primary group; the fallback activates only after the primary is exhausted or in cooldown.
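That ordering — exhaust the primary group, only then touch the fallback — can be sketched as follows. This is a hedged illustration of the semantics, not LiteLLM's code; BackendDown, route_with_fallback, and send are invented names for the sketch:

```python
class BackendDown(Exception):
    """Raised by the transport when a backend fails to serve a request."""

def route_with_fallback(primary, fallback, send):
    """Try every primary backend first; use fallbacks only if all primaries fail."""
    # Normal operation: only primary backends are ever tried.
    for backend in primary:
        try:
            return send(backend)
        except BackendDown:
            continue
    # Fallback activates only after every primary backend has failed.
    for backend in fallback:
        try:
            return send(backend)
        except BackendDown:
            continue
    raise BackendDown("all primary and fallback backends are down")

def send(backend):
    # Hypothetical transport: pretend both primary backends are down.
    if "primary" in backend:
        raise BackendDown(backend)
    return f"served by {backend}"

print(route_with_fallback(["primary-a", "primary-b"], ["fallback-c"], send))
# served by fallback-c
```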
Configure fallbacks in the LiteLLM config or via the Admin UI. Example config.yaml snippet:
# litellm config.yaml
router_settings:
  fallbacks:
    - qwen3-14b:
        - qwen3-7b  # used only when qwen3-14b group is fully down
You can also set fallbacks per-request using the fallbacks field in the API payload:
curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
-H "Authorization: Bearer $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"fallbacks": ["qwen3-7b"],
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 50
}'
Fallback vs. load balancing: If you want both maas00 and smc00 to serve qwen3-14b simultaneously, register both under the same model_name (load balancing). If you want smc00 only as a safety net, use fallback configuration.