RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

LiteLLM Router & Load Balancing

Distributing traffic across multiple GPU backends and configuring fallback when a server goes down.


What Is the LiteLLM Router?

The LiteLLM Router is a Python Router class that runs inside the LiteLLM proxy process — it is not a separate service or sidecar. When the proxy starts, it reads all registered model deployments and groups them by model_name.

For each incoming request, the Router selects one healthy deployment from the group matching the requested model_name, applying the configured routing strategy and skipping any backend currently in cooldown after recent failures.

Because state (health, cooldown timers, request counts) is held in memory per pod, each replica makes independent routing decisions. There is no shared routing state across replicas.
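The per-replica behavior described above can be sketched in a few lines. This is a simplified stdlib-only illustration, not LiteLLM's actual implementation: each proxy pod holds bookkeeping like this in local memory, which is why two replicas can make different routing decisions.

```python
import time
from itertools import cycle

class MiniRouter:
    """Toy per-pod router: round-robin over one model_name group, with
    in-memory cooldown state (a sketch, not LiteLLM internals)."""

    def __init__(self, deployments, cooldown_seconds=60):
        self.deployments = deployments      # api_base URLs in one model_name group
        self.cooldown_until = {}            # api_base -> timestamp it becomes eligible again
        self.cooldown_seconds = cooldown_seconds
        self._rr = cycle(range(len(deployments)))

    def pick(self):
        """Return the next healthy deployment, skipping cooled-down backends."""
        now = time.time()
        for _ in range(len(self.deployments)):
            d = self.deployments[next(self._rr)]
            if self.cooldown_until.get(d, 0) <= now:
                return d
        raise RuntimeError("all deployments in cooldown")

    def report_failure(self, api_base):
        """A failed request puts that backend into cooldown."""
        self.cooldown_until[api_base] = time.time() + self.cooldown_seconds

# Hypothetical two-backend group for illustration:
router = MiniRouter(["https://maas00.example/v1", "https://smc00.example/v1"])
router.report_failure("https://maas00.example/v1")
# While maas00 is cooling down, every pick() returns the smc00 backend.
```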

When to Use This

Use load balancing when: the same model runs on 2 or more GPU servers and you want to distribute traffic across all of them simultaneously. Users see one model name; all backends serve requests in parallel.

  • Register each backend with the same model_name — the router activates automatically
  • All backends must serve the same model weights for consistent responses

Use fallback when: you want server B as a backup if server A goes down — not for simultaneous load distribution. Fallback only activates after the primary group fails, so under normal conditions all traffic goes to server A.

  • Useful for event safety: primary GPU server + warm standby
  • Fallback can point to a smaller or different model if no identical standby is available
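The ordering rule behind fallback can be sketched as follows. This is a simplified illustration of the semantics described above (fallback consulted only after the entire primary group fails), not LiteLLM's actual code; the deployment names are hypothetical.

```python
def complete_with_fallback(primary_group, fallback_group, send):
    """Try every primary deployment first; only if all of them fail,
    try the fallback group. `send` raises RuntimeError on failure."""
    errors = []
    for group in (primary_group, fallback_group):
        for deployment in group:
            try:
                return send(deployment)
            except RuntimeError as e:
                errors.append(str(e))
    raise RuntimeError(f"all deployments failed: {errors}")

# Hypothetical deployments for illustration:
primary = ["qwen3-14b@maas00", "qwen3-14b@smc00"]
fallback = ["qwen3-7b@smc00"]
```

Under normal conditions the first primary deployment answers and the fallback list is never touched, which matches the "warm standby" behavior described above.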

Load Balancing — Adding Multiple Backends

LiteLLM automatically load balances when two or more model registrations share the same model_name. No explicit router configuration is needed — registering a second endpoint with the same name is all it takes. Users see one model; LiteLLM round-robins requests across all registered backends.


graph LR
    U[User Request] --> R[LiteLLM Router<br/>model: qwen3-14b]
    R -->|round-robin| A[maas00<br/>qwen3-14b]
    R -->|round-robin| B[smc00<br/>qwen3-14b]
    style A fill:#d4edda,stroke:#28a745
    style B fill:#d4edda,stroke:#28a745
    style R fill:#e8f4fd,stroke:#0066cc
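The same two-backend group can also be declared statically in the proxy's config.yaml instead of via /model/new — entries sharing a model_name form one load-balanced group. The api_base values below are the two backends shown in the health-check output later in this page:

```yaml
model_list:
  - model_name: qwen3-14b
    litellm_params:
      model: openai/qwen3-14b
      api_base: https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1
      custom_llm_provider: openai
  - model_name: qwen3-14b          # same name -> joins the same group
    litellm_params:
      model: openai/qwen3-14b
      api_base: https://qwen3-14b-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com/v1
      custom_llm_provider: openai
```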

Set your admin key once, then use it for all commands below:

export ADMIN_KEY=sk-1234567890abcdef1234
export LITELLM_URL=https://litellm-prod.apps.maas.redhatworkshops.io

Add a second backend (zero disruption)

Same model_name, different api_base — load balancing activates automatically:

curl -sk -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "qwen3-14b",
    "litellm_params": {
      "model": "openai/qwen3-14b",
      "api_base": "https://qwen3-14b-llm-hosting.apps.server2.example.com/v1",
      "custom_llm_provider": "openai"
    }
  }'

Check live health of both backends

Run this during events — shows which backends are active and which are in cooldown:

curl -sk "$LITELLM_URL/health?model=qwen3-14b" -H "Authorization: Bearer $ADMIN_KEY" |
  python3 -c "import sys,json; d=json.load(sys.stdin); [print('OK  ',e['api_base']) for e in d.get('healthy_endpoints',[]) if isinstance(e,dict)]; [print('DOWN',e['api_base']) for e in d.get('unhealthy_endpoints',[]) if isinstance(e,dict)]"

Expected output — one line per backend:

OK   https://qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com/v1
OK   https://qwen3-14b-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com/v1

Confirm router is distributing requests

Each response includes x-litellm-model-id — different IDs confirm different backends are being hit:

for i in 1 2 3 4 5; do
  curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
    -H "Authorization: Bearer $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
    -D - -o /dev/null | grep -i "x-litellm-model-id"
done | sort | uniq -c

View router settings

Current strategy and tunable parameters. Change via Admin UI → Router Settings.

curl -sk "$LITELLM_URL/router/settings" -H "Authorization: Bearer $ADMIN_KEY" |
  python3 -c "import sys,json; cv=json.load(sys.stdin).get('current_values',{}); [print(f'{k}: {v}') for k,v in cv.items() if v is not None and v != {} and v != []]"
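LiteLLM's default routing strategy, "simple-shuffle", is a weighted random pick across deployments rather than strict round-robin. The sketch below illustrates the idea with the standard library only (it is not the library's actual code, and the backend names and weights are hypothetical):

```python
import random

def simple_shuffle(deployments, weights=None, rng=random):
    """Weighted random pick across deployments; weight defaults to 1 per
    backend and can be raised to send more traffic to larger servers."""
    weights = weights or [1] * len(deployments)
    return rng.choices(deployments, weights=weights, k=1)[0]

# Give a hypothetical 2-GPU backend twice the traffic of a 1-GPU backend:
counts = {"maas00": 0, "smc00": 0}
for _ in range(10000):
    counts[simple_shuffle(["maas00", "smc00"], weights=[2, 1])] += 1
# counts["maas00"] ends up roughly twice counts["smc00"]
```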

Fallback — Primary + Standby

A fallback routes requests to a backup model only when the primary group fails — it does not distribute traffic simultaneously. Under normal conditions, all requests go to the primary group; the fallback activates only after the primary is exhausted or in cooldown.


graph LR
    U[User Request] --> R[LiteLLM Router]
    R -->|normal| A[Primary<br/>qwen3-14b on maas00]
    A -->|fails / cooldown| FB[Fallback<br/>qwen3-7b on smc00]
    style A fill:#d4edda,stroke:#28a745
    style FB fill:#fff3cd,stroke:#f0a500
    style R fill:#e8f4fd,stroke:#0066cc

Configure fallbacks in the LiteLLM config or via the Admin UI. Example config.yaml snippet:

# litellm config.yaml
router_settings:
  fallbacks:
    - qwen3-14b:
        - qwen3-7b   # used only when qwen3-14b group is fully down

You can also set fallbacks per-request using the fallbacks field in the API payload:

curl -sk -X POST "$LITELLM_URL/v1/chat/completions" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "fallbacks": ["qwen3-7b"],
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'
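The same per-request fallback payload can be built from Python with only the standard library. The URL and key below are the placeholders exported earlier on this page; the final line is left commented so the snippet is runnable without network access:

```python
import json
import urllib.request

LITELLM_URL = "https://litellm-prod.apps.maas.redhatworkshops.io"  # from the export above
ADMIN_KEY = "sk-1234567890abcdef1234"                              # placeholder admin key

# Per-request fallback: the proxy tries qwen3-14b first and only falls
# back to qwen3-7b if the whole qwen3-14b group is down or in cooldown.
payload = {
    "model": "qwen3-14b",
    "fallbacks": ["qwen3-7b"],
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50,
}
req = urllib.request.Request(
    f"{LITELLM_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {ADMIN_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# resp = urllib.request.urlopen(req)  # uncomment to actually send the request
```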

Fallback vs. load balancing: If you want both maas00 and smc00 to serve qwen3-14b simultaneously, register both under the same model_name (load balancing). If you want smc00 only as a safety net, use fallback configuration.