Model as a Service for Red Hat Demo Platform
How LiteLLM routes traffic — GPU load balancing across on-prem servers, fallback configuration, and cost-based routing between cloud providers (Vertex AI and AWS Bedrock).
The LiteLLM Router is a Python Router class that runs inside the LiteLLM proxy process — it is not a separate service or sidecar.
When the proxy starts, it reads all registered model deployments and groups them by model_name.
For each incoming request, the Router:
1. Looks up the deployment group matching the requested model_name.
2. Filters out deployments that are currently in cooldown.
3. Picks one of the remaining backends according to the configured routing strategy.
Because state (health, cooldown timers, request counts) is held in memory per pod, each replica makes independent routing decisions. There is no shared routing state across replicas.
Use load balancing when: the same model runs on 2 or more GPU servers and you want to distribute traffic across all of them simultaneously. Users see one model name; all backends serve requests in parallel.
Use fallback when: you want server B as a backup if server A goes down, not for simultaneous load distribution. Fallback only activates after the primary group fails, so under normal conditions all traffic goes to server A.
LiteLLM automatically load balances when two or more model registrations share the same model_name.
No explicit router configuration is needed — registering a second endpoint with the same name is all it takes.
Users see one model; LiteLLM round-robins requests across all registered backends.
Set your admin key once, then use it for all commands below:
export ADMIN_KEY=sk-1234567890abcdef1234
export LITELLM_URL=https://litellm-prod.apps.maas.redhatworkshops.io
Same model_name, different api_base — load balancing activates automatically:
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "qwen3-14b",
    "litellm_params": {
      "model": "openai/qwen3-14b",
      "api_base": "https://qwen3-14b-llm-hosting.apps.server2.example.com/v1",
      "custom_llm_provider": "openai"
    }
  }'
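To confirm both deployments landed under the same group, list the registrations. This sketch assumes your LiteLLM version exposes GET /model/info (present in recent releases):

curl -sk "$LITELLM_URL/model/info" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  | python3 -c "import sys,json; [print(m.get('model_name'), m.get('litellm_params',{}).get('api_base','-')) for m in json.load(sys.stdin).get('data',[])]"

Two lines with model_name qwen3-14b and different api_base values means the group is set up correctly.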
Run this during events — shows which backends are active and which are in cooldown:
curl -sk "$LITELLM_URL/health?model=qwen3-14b" -H "Authorization: Bearer $ADMIN_KEY" | python3 -c "import sys,json; d=json.load(sys.stdin); [print('OK ',e['api_base']) for e in d.get('healthy_endpoints',[]) if isinstance(e,dict)]; [print('DOWN',e['api_base']) for e in d.get('unhealthy_endpoints',[]) if isinstance(e,dict)]"
Expected output — one line per backend:
OK   https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1
OK   https://qwen3-14b-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com/v1
Each response includes x-litellm-model-id — different IDs confirm different backends are being hit:
for i in 1 2 3 4 5; do
  curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
    -H "Authorization: Bearer $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
    -D - -o /dev/null | grep -i "x-litellm-model-id"
done | sort | uniq -c
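Illustrative output only; the IDs below are placeholders for whatever your deployment generates. With two healthy backends, the five requests should split across both:

      3 x-litellm-model-id: <id-of-backend-a>
      2 x-litellm-model-id: <id-of-backend-b>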
Current strategy and tunable parameters. Change via Admin UI → Router Settings.
curl -sk "$LITELLM_URL/router/settings" -H "Authorization: Bearer $ADMIN_KEY" | python3 -c "import sys,json; cv=json.load(sys.stdin).get('current_values',{}); [print(f'{k}: {v}') for k,v in cv.items() if v is not None and v != {} and v != []]"
A fallback routes requests to a backup model only when the primary group fails — it does not distribute traffic simultaneously. Under normal conditions, all requests go to the primary group; the fallback activates only after the primary is exhausted or in cooldown.
Configure fallbacks in the LiteLLM config or via the Admin UI. Example config.yaml snippet:
# litellm config.yaml
router_settings:
  fallbacks:
    - qwen3-14b:
        - qwen3-7b  # used only when qwen3-14b group is fully down
You can also set fallbacks per-request using the fallbacks field in the API payload:
curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
-H "Authorization: Bearer $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"fallbacks": ["qwen3-7b"],
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 50
}'
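To exercise the fallback path without taking qwen3-14b down, LiteLLM also accepts a mock_testing_fallbacks flag that forces the primary to fail. Treat this as a sketch and confirm the flag against your LiteLLM version's docs:

curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "fallbacks": ["qwen3-7b"],
    "mock_testing_fallbacks": true,
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'

A successful response served by qwen3-7b confirms the fallback chain works end to end.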
Fallback vs. load balancing: If you want both maas00 and smc00 to serve qwen3-14b simultaneously, register both under the same model_name (load balancing). If you want smc00 only as a safety net, use fallback configuration.
The same model_name grouping that drives GPU load balancing also works across cloud providers.
RHDP LiteMaaS uses this in production to route four OSS models between Google Vertex AI and AWS Bedrock,
using LiteLLM's cost-based-routing strategy instead of round-robin.
Each model has two registrations — one Vertex AI backend and one Bedrock backend — both under the same model_name.
For every request, LiteLLM estimates the cost on each provider and dispatches to the cheapest one.
If that provider accumulates two failures, it enters a 60-second cooldown and all traffic shifts to the other automatically.
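Registration mirrors the GPU case: same model_name, but the litellm_params point at different providers and carry per-token prices so the router can compare them. A sketch for gpt-oss-120b; the provider model strings are placeholders (the exact identifiers are on the provider pages linked at the bottom), and the per-token costs are the pricing table below divided by 1,000,000:

# Vertex AI backend: $0.09 in / $0.36 out per 1M tokens
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-oss-120b",
    "litellm_params": {
      "model": "vertex_ai/<vertex-model-id>",
      "input_cost_per_token": 0.00000009,
      "output_cost_per_token": 0.00000036
    }
  }'
# Bedrock backend: $0.15 in / $0.60 out per 1M tokens
curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-oss-120b",
    "litellm_params": {
      "model": "bedrock/<bedrock-model-id>",
      "input_cost_per_token": 0.00000015,
      "output_cost_per_token": 0.0000006
    }
  }'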
This differs from GPU load balancing in a key way:
| Strategy | Traffic distribution | Primary driver | Example |
|---|---|---|---|
| Round-robin (GPU) | Every request rotates across all backends | Even load spread | qwen3-14b on maas00 + smc00 |
| Cost-based (cloud) | Cheapest provider gets all traffic; other is standby | Lowest cost per request | gpt-oss-120b on Vertex + Bedrock |
No user-facing change. Both providers are registered under the same model_name. Users call gpt-oss-120b — the router selects the provider silently. Bedrock backends are hidden from the LiteMaaS UI entirely.
Cost-based routing isn't only about cost: it also gives you automatic failover with no extra configuration.
When the preferred provider (Vertex AI) starts failing, LiteLLM doesn't drop the request.
After allowed_fails: 2 consecutive errors, the Vertex backend enters a 60-second cooldown
and every subsequent request is automatically routed to Bedrock until Vertex recovers.
This means a full Vertex AI outage is handled transparently — users see no errors,
just slightly higher latency from the first two failed attempts before the switch kicks in.
The same works in reverse: if the router has selected Bedrock (as it can for equal-price models like minimax-m2)
and Bedrock goes down, traffic shifts to Vertex.
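The x-litellm-model-id header check from the GPU section works here too; during a provider incident you can watch the ID flip once the cooldown takes effect. This assumes gpt-oss-120b is registered as sketched above:

curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
  -D - -o /dev/null | grep -i "x-litellm-model-id"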
Cost-based routing = cost optimisation + automatic failover in one. No separate fallback config is needed when you register both providers under the same model_name with cost-based routing enabled.
| model_name | Vertex cost (in/out per 1M tokens) | Bedrock cost (in/out per 1M tokens) | Router picks |
|---|---|---|---|
| gpt-oss-120b | $0.09 / $0.36 | $0.15 / $0.60 | Vertex (cheaper) |
| gpt-oss-20b | $0.07 / $0.25 | $0.07 / $0.30 | Vertex (cheaper) |
| minimax-m2 | $0.30 / $1.20 | $0.30 / $1.20 | Either (same price) |
| qwen3-235b | $0.22 / $0.88 | $0.22 / $0.88 | Either (same price) |
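As a quick sanity check on the router's math: a gpt-oss-120b request with 1,000 input and 500 output tokens costs about 1,000 × $0.09/1M + 500 × $0.36/1M ≈ $0.00027 on Vertex versus $0.00045 on Bedrock, so at these prices Vertex wins every comparison.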
The routing_strategy is set globally in the litellm-router-config ConfigMap and applies to all multi-backend model groups:
# litellm-router-config ConfigMap
router_settings:
  routing_strategy: cost-based-routing
  num_retries: 3
  retry_after: 5
  allowed_fails: 2
  cooldown_time: 60
For the full implementation — IAM setup, OCP secret creation, Bedrock model registration commands, and master key rotation gotcha — see the Vertex AI and AWS Bedrock provider pages.