Module 2: GuideLLM benchmarking
GuideLLM is an open source tool for benchmarking LLM inference endpoints. It measures throughput, latency, time-to-first-token, and concurrency behavior — exactly the metrics that matter when serving models via MaaS on OpenShift.
In this module, you’ll benchmark the two models you just tested in WordSwarm and discover how GPU allocation, model size and model capabilities affect performance.
Learning objectives
By the end of this module, you’ll be able to:
- Install and configure GuideLLM
- Run benchmarks against MaaS model endpoints
- Interpret standard performance metrics like throughput, latency, TTFT, and inter-token latency (ITL)
- Understand how GPU allocation affects model performance
Exercise 1: Install GuideLLM
Go back to the terminal in your JupyterLab instance.
- Install GuideLLM via pip:

    pip install guidellm

  Installation takes 2-3 minutes. While you wait, go back to WordSwarm and play a few rounds yourself against the AI. Try to beat the agents you tested earlier!
- Verify the installation:

    guidellm --version

- Export your API token:

    export TOKEN=$(oc get secret maas-secret -o jsonpath='{.data.token}' | base64 -d)
    echo "Token obtained: ${TOKEN:0:20}..."
Exercise 2: Benchmark kimi-k2-6 (reasoning model)
kimi-k2-6 is the top-performing model on the WordSwarm leaderboard. Let’s measure its raw inference performance.
kimi-k2-6 is a reasoning model with thinking enabled by default. When thinking is on, the model buffers its internal chain-of-thought and then flushes all output tokens in a single chunk, so GuideLLM can't measure per-token timing and TTFT and ITL will show as 0.0ms. To get real TTFT/ITL metrics, the command below disables thinking via the extras backend kwarg.
The benchmark output is a wide table. Maximize your terminal window or drag it wider to see all columns clearly.
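The thinking toggle described above is passed to GuideLLM as a JSON string, and hand-quoting JSON around $TOKEN in the shell is easy to get wrong. As a minimal sketch, assuming the TOKEN environment variable was exported as in Exercise 1, the following Python snippet builds the same structure that this exercise passes to --backend-kwargs. The helper name build_backend_kwargs is hypothetical and not part of GuideLLM.

```python
import json
import os
import shlex


def build_backend_kwargs(api_key: str, thinking: bool = False) -> str:
    """Return the JSON string for --backend-kwargs (hypothetical helper)."""
    kwargs = {
        "api_key": api_key,
        # Extra request-body fields; "thinking": False turns reasoning off,
        # mirroring the value used in this exercise.
        "extras": {"body": {"chat_template_kwargs": {"thinking": thinking}}},
    }
    return json.dumps(kwargs)


# Assumption: TOKEN was exported in the shell that launched this interpreter.
token = os.environ.get("TOKEN", "")

# shlex.quote wraps the JSON so it can be pasted as one shell argument.
print(shlex.quote(build_backend_kwargs(token)))
```

The printed value can be pasted directly after --backend-kwargs. Building the string with json.dumps avoids mismatched quotes and guarantees valid JSON.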
- Run the benchmark with thinking disabled:

    guidellm benchmark \
      --target "https://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/kimi-k2-6/v1" \
      --backend-kwargs '{"api_key": "'$TOKEN'", "extras": {"body": {"chat_template_kwargs": {"thinking": false}}}}' \
      --profile concurrent \
      --rate 10 \
      --model kimi-k2-6 \
      --data "prompt_tokens=64,output_tokens=128" \
      --max-seconds 30 \
      --processor "gpt2" \
      --output-dir ./results/kimi-k2-6/concurrent-10 \
      --outputs benchmark.json,benchmark.csv

  This runs for 30 seconds with 10 concurrent request streams. With the concurrent profile, --rate sets the number of simultaneous requests rather than a requests-per-second target.
- Look at the bottom table "Server Throughput Statistics":
  - Requests Per Sec (Mean): actual requests/second handled (should be ~4.7)
  - Output Tokens Per Sec: token generation speed (should be ~605)
- Look at the "Request Latency Statistics" table:
  - Request Latency (Mdn): median total time per request (should be ~2.1s)
  - TTFT (Mdn): median time to first token (should be ~292ms)
  - ITL (Mdn): median time between tokens (should be ~14ms)
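The numbers in these two tables are related, and a quick cross-check helps confirm that a run looks sane. The sketch below rests on two assumptions that are not stated in the tool output: request latency is roughly TTFT plus one ITL gap for each remaining output token, and each of the 10 concurrent streams sends a new request as soon as the previous one finishes (Little's Law). The function names are illustrative only.

```python
def expected_latency_s(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate latency: time to first token plus one ITL gap per later token."""
    return (ttft_ms + itl_ms * (output_tokens - 1)) / 1000.0


def expected_requests_per_sec(concurrency: int, latency_s: float) -> float:
    """Little's Law for a closed loop: throughput = requests in flight / latency."""
    return concurrency / latency_s


# Approximate medians from the kimi-k2-6 run in this exercise.
latency = expected_latency_s(ttft_ms=292, itl_ms=14, output_tokens=128)
rps = expected_requests_per_sec(concurrency=10, latency_s=2.1)
tokens_per_sec = rps * 128  # 128 output tokens per request

print(f"latency         ~ {latency:.2f} s")
print(f"requests/s      ~ {rps:.2f}")
print(f"output tokens/s ~ {tokens_per_sec:.0f}")
```

With these inputs the estimates come out at about 2.07 s, 4.76 requests/s, and 610 tokens/s, close to the ~2.1s, ~4.7, and ~605 reported above. A large gap between estimate and measurement would be a hint to look for errors, queuing, or truncated outputs.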
Exercise 3: Benchmark Llama 3.2 3B (small dense model)
Now let’s benchmark the small model you tested in WordSwarm. Llama 3.2 3B is a dense model with just 3 billion parameters, running on a single small MIG slice.
Based on your WordSwarm experience, which model do you expect to be faster: the massive reasoning model or this tiny one?
- Run the benchmark:

    guidellm benchmark \
      --target "https://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/llama-32-3b/v1" \
      --backend-kwargs '{"api_key": "'$TOKEN'"}' \
      --profile concurrent \
      --rate 10 \
      --model llama-32-3b \
      --data "prompt_tokens=64,output_tokens=128" \
      --max-seconds 30 \
      --processor "gpt2" \
      --output-dir ./results/llama-32-3b/concurrent-10 \
      --outputs benchmark.json,benchmark.csv

- Look at the same tables you checked before:

  Server Throughput Statistics (bottom table):
  - Requests Per Sec: should be ~3.7
  - Output Tokens Per Sec: should be ~496

  Request Latency Statistics:
  - Request Latency (Mdn): should be ~2.6s
  - TTFT (Mdn): should be ~328ms
  - ITL (Mdn): should be ~17ms
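ITL also says how fast a single stream decodes, which helps put the aggregate Output Tokens Per Sec figure in context. The following is a rough sketch, assuming the median ITL is representative of every token gap and that all 10 streams decode in parallel. The variable names are illustrative.

```python
def per_stream_tokens_per_sec(itl_ms: float) -> float:
    """Decode speed of one stream: one token every itl_ms milliseconds."""
    return 1000.0 / itl_ms


CONCURRENCY = 10          # value passed to --rate with the concurrent profile
OBSERVED_TOKENS_PER_SEC = 496  # approximate Llama 3.2 3B result from above

stream_rate = per_stream_tokens_per_sec(itl_ms=17)
decode_ceiling = stream_rate * CONCURRENCY
share_of_ceiling = OBSERVED_TOKENS_PER_SEC / decode_ceiling

print(f"per-stream decode rate ~ {stream_rate:.1f} tokens/s")
print(f"ceiling for 10 streams ~ {decode_ceiling:.0f} tokens/s")
print(f"observed / ceiling     ~ {share_of_ceiling:.0%}")
```

The observed throughput sits below the decode-only ceiling because each request also spends time waiting for its first token, during which the stream produces no output.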
Exercise 4: Compare the results
Looking at your two benchmark runs, you might notice something surprising:
Expected: The tiny 3B model should be faster than the huge reasoning model
Reality: Compare your numbers:
| Model | GPU Allocation | Latency (Mdn) | TTFT (Mdn) | Output Tokens/s |
|---|---|---|---|---|
| kimi-k2-6 | 8x H200 | ~2.1s | ~292ms | ~605 |
| llama-3.2-3b | 1x MIG-1g.18GB | ~2.6s | ~328ms | ~496 |
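To put sizes on the differences in the table, the sketch below computes the relative change of each kimi-k2-6 metric against the Llama 3.2 3B baseline. The values are the approximate figures quoted in this module, so substitute your own measurements. The dictionary layout and function name are illustrative.

```python
results = {
    "kimi-k2-6": {"latency_s": 2.1, "ttft_ms": 292, "tokens_per_s": 605},
    "llama-3.2-3b": {"latency_s": 2.6, "ttft_ms": 328, "tokens_per_s": 496},
}


def percent_change(value: float, baseline: float) -> float:
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


kimi = results["kimi-k2-6"]
llama = results["llama-3.2-3b"]

# Negative is better for latency and TTFT; positive is better for tokens/s.
changes = {metric: percent_change(kimi[metric], llama[metric]) for metric in kimi}

for metric, delta in changes.items():
    print(f"{metric}: {delta:+.1f}% vs llama-3.2-3b")
```

With the figures above, kimi-k2-6 shows roughly 19% lower latency, 11% lower TTFT, and 22% higher token throughput.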
kimi-k2-6 is faster across every metric despite being a massive reasoning model!
Why does this happen?
GPU allocation matters. kimi-k2-6 runs on 8x H200 GPUs with massive memory bandwidth and compute. Llama 3.2 3B runs on a single tiny MIG slice with limited resources. The benchmark therefore compares two deployments, not two models on equal hardware.
With that hardware behind it, the larger model outperforms the smaller one in both raw speed and quality.
Compare with WordSwarm game performance
This matches what you saw in the game:
| Model | Benchmark Tokens/s | Game Score | Game Accuracy |
|---|---|---|---|
| kimi-k2-6 | ~605 | 1,038 | 65% |
| llama-3.2-3b | ~496 | 36 | 80% |
kimi-k2-6 generates tokens faster, finds better words, and scores ~29x higher despite lower accuracy. Quality matters more than speed for complex reasoning tasks.
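The "~29x" figure, and how it compares with the throughput gap, can be reproduced with a few lines of arithmetic. This is a small sketch using the values from the table above. The variable names are illustrative.

```python
game_score = {"kimi-k2-6": 1038, "llama-3.2-3b": 36}
tokens_per_s = {"kimi-k2-6": 605, "llama-3.2-3b": 496}

# How many times higher kimi-k2-6 scores in the game.
score_ratio = game_score["kimi-k2-6"] / game_score["llama-3.2-3b"]

# How many times faster kimi-k2-6 generates tokens in the benchmark.
speed_ratio = tokens_per_s["kimi-k2-6"] / tokens_per_s["llama-3.2-3b"]

print(f"game score ratio: {score_ratio:.1f}x")
print(f"throughput ratio: {speed_ratio:.2f}x")
```

The score gap (about 29x) is far larger than the throughput gap (about 1.2x), so token speed alone cannot account for the game results.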
Module summary
What you accomplished:
- Installed and configured GuideLLM
- Benchmarked two models with very different architectures and GPU allocations
- Interpreted throughput, latency, TTFT, and ITL metrics from GuideLLM output
- Discovered that GPU allocation matters more than model size for raw speed
- Correlated synthetic benchmarks with real-world WordSwarm game performance
Key takeaways:
- GPU allocation directly impacts inference speed: more resources = faster inference
- A well-resourced large model can outperform a small model on limited hardware
- Synthetic benchmarks measure raw speed, but capability matters for task performance
- Model quality (reasoning, accuracy) often matters more than raw token throughput
- GuideLLM provides detailed metrics for understanding LLM serving performance
Next: Head to the Conclusion for a summary and next steps.