Module 2: GuideLLM benchmarking
GuideLLM is an open source tool for benchmarking LLM inference endpoints. It measures throughput, latency, time-to-first-token, and concurrency behavior — exactly the metrics that matter when serving models via MaaS on OpenShift.
In this module, you’ll benchmark several models from the MaaS platform and compare their performance profiles.
Learning objectives
By the end of this module, you’ll be able to:
- Install and configure GuideLLM
- Run benchmarks against MaaS model endpoints
- Interpret throughput, latency, and TTFT metrics
- Compare model performance across different architectures and GPU allocations
Exercise 1: Install GuideLLM
- Install GuideLLM via pip:

  ```shell
  pip install guidellm
  ```

- Verify the installation:

  ```shell
  guidellm --version
  ```

- Export your API token:

  ```shell
  export TOKEN=$(oc get secret maas-secret -o jsonpath='{.data.token}' | base64 -d)
  echo "Token obtained: ${TOKEN:0:20}..."
  ```
Exercise 2: Benchmark kimi-k2-5 (reasoning model)
kimi-k2-5 is the top-performing model on the WordSwarm leaderboard. Let’s measure its raw inference performance.
- Run the benchmark:

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/kimi-k25/kimi-k2-5/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate 10 \
    --model kimi-k2-5 \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/kimi-k2-5/concurrent-10 \
    --outputs benchmark.json,benchmark.csv
  ```

- Review the output metrics:
  - Throughput (req/s) — how many requests per second the endpoint can handle
  - Latency (p50/p95/p99) — response time distribution
  - TTFT — time to first token
  - Tokens/sec — output token generation speed
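To make the percentile metrics concrete, here is an illustrative sketch (made-up latency values, nearest-rank method) of how p50/p95/p99 summarize a latency distribution. The numbers are not from any real run:

```shell
# Illustrative only: nearest-rank p50/p95/p99 over ten made-up
# per-request latencies (seconds). Note how p95/p99 expose the
# slow-tail requests that the median hides.
printf '%s\n' 0.21 0.25 0.19 0.30 0.22 0.95 0.24 0.26 0.23 1.40 \
  | sort -n \
  | awk '{ v[NR] = $1 }
    END {
      split("50 95 99", pct)
      for (i = 1; i <= 3; i++) {
        r = int(pct[i] / 100 * NR)
        if (r < pct[i] / 100 * NR) r++   # nearest-rank: ceil(p * N)
        printf "p%s = %.2fs\n", pct[i], v[r]
      }
    }'
# prints: p50 = 0.24s, p95 = 1.40s, p99 = 1.40s
```

With only ten samples p95 and p99 coincide; over a 30-second benchmark run they usually differ, and a large gap between p50 and p99 signals queueing or contention at the endpoint.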
Exercise 3: Benchmark Nemotron Cascade (efficient MoE)
Nemotron-Cascade-2-30B uses a cascade architecture: 30B total parameters but only 3B active per forward pass. It runs on a single 71 GB MIG slice.
- Run the benchmark:

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/nemotron-cascade-2-30b/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate 10 \
    --model nemotron-cascade-2-30b \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/nemotron-cascade-2-30b/concurrent-10 \
    --outputs benchmark.json,benchmark.csv
  ```
Exercise 4: Benchmark Llama 4 Scout (quantized)
Llama-4-Scout uses W4A16 quantization to fit on 2x MIG slices. Let’s see how quantization affects inference speed.
- Run the benchmark:

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/llama-4-scout-17b-16e-w4a16/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate 10 \
    --model llama-4-scout-17b-16e-w4a16 \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/llama-4-scout-17b/concurrent-10 \
    --outputs benchmark.json,benchmark.csv
  ```
Exercise 5: Benchmark Llama 3.2 3B (small dense model)
The smallest model in the lineup. Runs on a single 18 GB MIG slice.
- Run the benchmark:

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/llama-32-3b/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate 10 \
    --model llama-32-3b \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/llama-32-3b/concurrent-10 \
    --outputs benchmark.json,benchmark.csv
  ```
Exercise 6: Compare results
Collect your benchmark results and fill in this comparison table:
| Model | GPU Allocation | Throughput (req/s) | p50 Latency | TTFT (p50) | Tokens/s |
|---|---|---|---|---|---|
| kimi-k2-5 | 8x H200 | _ | _ | _ | _ |
| nemotron-cascade-2-30b | 1x MIG-3g.71GB | _ | _ | _ | _ |
| llama-4-scout-17b | 2x MIG-3g.71GB | _ | _ | _ | _ |
| llama-3.2-3b-instruct | 1x MIG-1g.18GB | _ | _ | _ | _ |
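Since each run above wrote a `benchmark.csv` into its `--output-dir`, you can script the table-filling. This is a sketch only: the column positions (`$2`-`$4`) are placeholders, because the CSV layout depends on your GuideLLM version — inspect the header of one of your files first and adjust:

```shell
# Print one markdown table row per results directory.
# NOTE: $2..$4 are placeholder column positions, not GuideLLM's actual
# schema -- run `head -1 <file>` on one benchmark.csv and substitute
# the real throughput/latency/TTFT columns before relying on this.
summarize_results() {
  for f in ./results/*/concurrent-10/benchmark.csv; do
    [ -f "$f" ] || continue
    # directory layout from the exercises: ./results/<model>/concurrent-10/
    model=$(basename "$(dirname "$(dirname "$f")")")
    awk -F, -v m="$model" 'END { printf "| %s | %s | %s | %s |\n", m, $2, $3, $4 }' "$f"
  done
}
summarize_results
```

Paste the printed rows into the comparison table above, or redirect them to a file.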
Consider these questions:
- Which model has the highest throughput?
- Which model has the lowest latency?
- How does GPU allocation correlate with performance?
- Does quantization (Llama 4 Scout) help or hurt throughput?
Compare with game performance
Now compare your synthetic benchmarks with the actual WordSwarm game results:
| Model | Benchmark Tokens/s | Game Tokens/s | Game Score | Game Accuracy |
|---|---|---|---|---|
| kimi-k2-5 | _ | 58.6 | 1,038 | 65% |
| nemotron-cascade-2-30b | _ | 128.6 | 264 | 59% |
| llama-4-scout-17b | _ | 34.9 | 153 | 76% |
| llama-3.2-3b-instruct | _ | 54.9 | 36 | 80% |
Key insight: Raw throughput doesn’t directly predict game performance. nemotron-cascade has the highest game tokens/s (128.6) but scores lower than kimi-k2-5. The reasoning capability of the model matters more than raw speed for complex tasks.
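You can quantify this with the four data points from the table above. A quick Pearson correlation between game tokens/s and game score (n=4, so treat it as a sanity check, not statistics):

```shell
# Pearson correlation between game tokens/s and game score, using the
# four models from the comparison table above.
awk 'BEGIN {
  split("58.6 128.6 34.9 54.9", x)   # game tokens/s
  split("1038 264 153 36", y)        # game score
  n = 4
  for (i = 1; i <= n; i++) { sx += x[i]; sy += y[i] }
  mx = sx / n; my = sy / n
  for (i = 1; i <= n; i++) {
    dx = x[i] - mx; dy = y[i] - my
    sxy += dx * dy; sxx += dx * dx; syy += dy * dy
  }
  printf "r = %.2f\n", sxy / sqrt(sxx * syy)
}'
# prints: r = -0.02
```

A correlation of essentially zero: across these four models, generation speed tells you almost nothing about game score.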
Exercise 7: Try different benchmark profiles
GuideLLM supports different load profiles. Try sweeping concurrency:
- Synchronous (1 request at a time):

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/kimi-k25/kimi-k2-5/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile synchronous \
    --model kimi-k2-5 \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/kimi-k2-5/synchronous \
    --outputs benchmark.json,benchmark.csv
  ```
- Concurrent with higher rate:

  ```shell
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/kimi-k25/kimi-k2-5/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate 50 \
    --model kimi-k2-5 \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir ./results/kimi-k2-5/concurrent-50 \
    --outputs benchmark.json,benchmark.csv
  ```
Compare how throughput and latency change under different concurrency levels.
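For a fuller picture than two data points, you can script a sweep over several rates. This is a sketch using the same flags as the exercises above; the rate values (1, 5, 10, 25, 50) are an arbitrary choice, and each run writes to its own results directory:

```shell
# Sweep several concurrency levels against kimi-k2-5, one run per rate.
# Same flags as the exercises above; rate values are an arbitrary ladder.
for RATE in 1 5 10 25 50; do
  guidellm benchmark \
    --target "http://maas.apps.ocp.cloud.rhai-tmm.dev/kimi-k25/kimi-k2-5/v1" \
    --backend-kwargs '{"api_key": "'$TOKEN'"}' \
    --profile concurrent \
    --rate "$RATE" \
    --model kimi-k2-5 \
    --data "prompt_tokens=64,output_tokens=128" \
    --max-seconds 30 \
    --processor "gpt2" \
    --output-dir "./results/kimi-k2-5/concurrent-$RATE" \
    --outputs benchmark.json,benchmark.csv
done
```

Plotting throughput and p95 latency against rate typically shows throughput flattening while latency climbs once the endpoint saturates — the knee of that curve is your practical concurrency limit.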
Module summary
What you accomplished:
- Installed and configured GuideLLM
- Benchmarked 4 models with different architectures and GPU allocations
- Compared throughput, latency, and TTFT across models
- Correlated synthetic benchmarks with real-world game performance
- Explored different concurrency profiles
Key takeaways:
- Synthetic benchmarks measure raw inference speed — useful for capacity planning
- Real-world performance depends on model capability, not just speed
- MoE models (kimi-k2-5, Nemotron) show different throughput profiles than dense models
- Quantization (W4A16) reduces memory but may affect throughput
- GPU allocation directly impacts maximum concurrency and throughput
Next: Head to the Conclusion for a summary and next steps.