Module 2: GuideLLM benchmarking

GuideLLM is an open source tool for benchmarking LLM inference endpoints. It measures throughput, latency, time-to-first-token, and concurrency behavior — exactly the metrics that matter when serving models via MaaS on OpenShift.

In this module, you’ll benchmark the two models you just tested in WordSwarm and discover how GPU allocation, model size and model capabilities affect performance.

Learning objectives

By the end of this module, you’ll be able to:

  • Install and configure GuideLLM

  • Run benchmarks against MaaS model endpoints

  • Interpret standard performance metrics like throughput, latency, TTFT, and inter-token latency (ITL)

  • Understand how GPU allocation affects model performance

Exercise 1: Install GuideLLM

Go back to the terminal in your JupyterLab instance.

  1. Install GuideLLM via pip:

    pip install guidellm

    Note: Installation takes 2-3 minutes. While you wait, go back to WordSwarm and play a few rounds yourself against the AI. Try to beat the agents you tested earlier!

  2. Verify the installation:

    guidellm --version
  3. Export your API token:

    export TOKEN=$(oc get secret maas-secret -o jsonpath='{.data.token}' | base64 -d)
    echo "Token obtained: ${TOKEN:0:20}..."
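
If the secret lookup fails, TOKEN ends up empty and the benchmarks later fail with authentication errors. The following is a minimal sketch of a check, assuming you save it as a script and run it with python from the same terminal where you exported TOKEN (a notebook kernel started earlier will not see the variable). The helper name mask_token is hypothetical.

```python
import os


def mask_token(token: str, visible: int = 20) -> str:
    """Return the first `visible` characters plus '...', mirroring the echo above."""
    if not token:
        raise ValueError("TOKEN is empty; re-run the export step")
    return token[:visible] + "..."


if __name__ == "__main__":
    # Read the variable exported in the terminal; never print the full token.
    token = os.environ.get("TOKEN", "")
    if token:
        print("Token obtained:", mask_token(token))
    else:
        print("TOKEN is not set in this environment")
```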

Verify

✓ GuideLLM is installed
✓ guidellm --version returns a version number
✓ Token is exported

Exercise 2: Benchmark kimi-k2-6 (reasoning model)

kimi-k2-6 is the top-performing model on the WordSwarm leaderboard. Let’s measure its raw inference performance.

kimi-k2-6 is a reasoning model with thinking enabled by default. When thinking is on, the model buffers its internal chain-of-thought and then flushes all output tokens in a single chunk, so GuideLLM can’t measure per-token timing and reports TTFT and ITL as 0.0ms. To get real TTFT/ITL metrics, the command below disables thinking through the extras backend kwarg.

Tip: The benchmark output is a wide table. Maximize your terminal window or drag it wider to see all columns clearly.
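
The --backend-kwargs value in step 1 below is nested JSON inside shell quotes, which is easy to mistype. As a hedged sketch, you could generate the string in Python instead. The JSON layout simply mirrors the command in this exercise; the function name build_backend_kwargs and the EXAMPLE_TOKEN placeholder are hypothetical.

```python
import json
import shlex
from typing import Optional


def build_backend_kwargs(token: str, thinking: Optional[bool] = None) -> str:
    """Build the JSON string passed to --backend-kwargs in this module.

    The key layout is copied from the lab's command; it is not taken from
    GuideLLM's documentation.
    """
    kwargs: dict = {"api_key": token}
    if thinking is not None:
        # Same nesting as in the lab command: extras -> body -> chat_template_kwargs
        kwargs["extras"] = {"body": {"chat_template_kwargs": {"thinking": thinking}}}
    return json.dumps(kwargs)


if __name__ == "__main__":
    # shlex.quote wraps the JSON so it can be pasted after --backend-kwargs
    print(shlex.quote(build_backend_kwargs("EXAMPLE_TOKEN", thinking=False)))
```

Calling build_backend_kwargs(token) without the thinking argument produces the shorter form used for models that have no thinking mode.
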
  1. Run the benchmark with thinking disabled:

    guidellm benchmark \
      --target "https://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/kimi-k2-6/v1" \
      --backend-kwargs '{"api_key": "'$TOKEN'", "extras": {"body": {"chat_template_kwargs": {"thinking": false}}}}' \
      --profile concurrent \
      --rate 10 \
      --model kimi-k2-6 \
      --data "prompt_tokens=64,output_tokens=128" \
      --max-seconds 30 \
      --processor "gpt2" \
      --output-dir ./results/kimi-k2-6/concurrent-10 \
      --outputs benchmark.json,benchmark.csv

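    This runs for 30 seconds with 10 concurrent request streams. With the concurrent profile, --rate sets the number of simultaneous requests, not a requests-per-second target, which is why the measured request rate is below 10.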

  2. Look at the bottom table "Server Throughput Statistics":

    • Requests Per Sec (Mean) — actual requests/second handled (should be ~4.7)

    • Output Tokens Per Sec — token generation speed (should be ~605)

  3. Look at the "Request Latency Statistics" table:

    • Request Latency (Mdn) — median total time per request (should be ~2.1s)

    • TTFT (Mdn) — median time to first token (should be ~292ms)

    • ITL (Mdn) — median time between tokens (should be ~14ms)
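
The expected values above are tied together by simple arithmetic, which gives you a quick way to sanity-check your own run. The sketch below assumes 10 requests are always in flight, that every response contains exactly 128 output tokens, and that the median latency is close to the mean; the function names are hypothetical.

```python
def expected_requests_per_sec(concurrency: int, latency_s: float) -> float:
    """Little's law: completed requests per second with a fixed number in flight."""
    return concurrency / latency_s


def expected_output_tokens_per_sec(requests_per_sec: float, output_tokens: int) -> float:
    """Each completed request contributes `output_tokens` generated tokens."""
    return requests_per_sec * output_tokens


def expected_latency_s(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Time to the first token plus one inter-token gap per remaining token."""
    return ttft_s + (output_tokens - 1) * itl_s


if __name__ == "__main__":
    rps = expected_requests_per_sec(concurrency=10, latency_s=2.1)
    print(f"requests/s      ~{rps:.1f}")
    print(f"output tokens/s ~{expected_output_tokens_per_sec(rps, 128):.0f}")
    print(f"latency         ~{expected_latency_s(0.292, 0.014, 128):.2f}s")
```

With the numbers from this exercise, the estimates come out near 4.8 requests/s, about 610 output tokens/s, and about 2.07s per request, all close to the reported ~4.7, ~605, and ~2.1s. If your results diverge widely from these relationships, check whether requests failed or thinking was left enabled.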

Verify

✓ Benchmark completes in ~30 seconds
✓ Output shows real TTFT/ITL values (not 0.0ms)
✓ You can read the throughput and latency tables

Exercise 3: Benchmark Llama 3.2 3B (small dense model)

Now let’s benchmark the small model you tested in WordSwarm. Llama 3.2 3B is a dense model with just 3 billion parameters, running on a single small MIG slice.

Based on your WordSwarm experience, which model do you expect to be faster: the massive reasoning model or this tiny one?

  1. Run the benchmark:

    guidellm benchmark \
      --target "https://maas.apps.ocp.cloud.rhai-tmm.dev/prelude-maas/llama-32-3b/v1" \
      --backend-kwargs '{"api_key": "'$TOKEN'"}' \
      --profile concurrent \
      --rate 10 \
      --model llama-32-3b \
      --data "prompt_tokens=64,output_tokens=128" \
      --max-seconds 30 \
      --processor "gpt2" \
      --output-dir ./results/llama-32-3b/concurrent-10 \
      --outputs benchmark.json,benchmark.csv

  2. Look at the same two tables you checked in Exercise 2:

Server Throughput Statistics (bottom table):

  • Requests Per Sec — should be ~3.7

  • Output Tokens Per Sec — should be ~496

Request Latency Statistics:

  • Request Latency (Mdn) — should be ~2.6s

  • TTFT (Mdn) — should be ~328ms

  • ITL (Mdn) — should be ~17ms
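
If the terminal table is hard to read, the same numbers should also be in the benchmark.json file written to the output directory. The layout of that file is not described in this module, so the sketch below makes no assumption about it: it walks whatever JSON it finds and lists keys whose name contains a search term. The helper name find_keys is hypothetical, and the file path is assumed from the --output-dir and --outputs flags in the command above.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Any, Iterator, Tuple


def find_keys(node: Any, needle: str, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every dict key containing `needle` (case-insensitive)."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if needle.lower() in key.lower():
                yield child, value
            yield from find_keys(value, needle, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_keys(item, needle, f"{path}[{index}]")


if __name__ == "__main__":
    # Assumed location, derived from the flags used in this exercise.
    report = Path("results/llama-32-3b/concurrent-10/benchmark.json")
    if report.exists():
        data = json.loads(report.read_text())
        # Show only the first 20 matches; the file may contain per-request data.
        for key_path, value in islice(find_keys(data, "latency"), 20):
            shown = value if not isinstance(value, (dict, list)) else type(value).__name__
            print(key_path, "->", shown)
    else:
        print(f"{report} not found; run the benchmark first")
```

Swap "latency" for another term such as "token" to explore other parts of the report.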

Verify

✓ Benchmark completes in ~30 seconds
✓ You can compare the numbers with kimi-k2-6’s results

Exercise 4: Compare the results

Looking at your two benchmark runs, you might notice something surprising:

Expected: The tiny 3B model should be faster than the huge reasoning model

Reality: Compare your numbers:

Model          GPU Allocation    Latency (Mdn)   TTFT (Mdn)   Output Tokens/s
kimi-k2-6      8x H200           ~2.1s           ~292ms       ~605
llama-3.2-3b   1x MIG-1g.18GB    ~2.6s           ~328ms       ~496

kimi-k2-6 is faster on every metric in the table (lower latency, lower TTFT, and higher token throughput) despite being the far larger model!
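
To put a size on the gap, you can express each kimi-k2-6 metric as a percentage change relative to Llama 3.2 3B. This is a small sketch using the approximate values from the comparison above; replace them with the numbers from your own runs. The dictionary layout and the percent_change helper are illustrative only.

```python
def percent_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline` (negative means lower)."""
    return (value - baseline) / baseline * 100.0


# Approximate figures from this module; substitute your own results.
RESULTS = {
    "kimi-k2-6": {"latency_s": 2.1, "ttft_ms": 292, "output_tokens_per_s": 605},
    "llama-3.2-3b": {"latency_s": 2.6, "ttft_ms": 328, "output_tokens_per_s": 496},
}

if __name__ == "__main__":
    baseline = RESULTS["llama-3.2-3b"]
    candidate = RESULTS["kimi-k2-6"]
    for metric, base_value in baseline.items():
        change = percent_change(base_value, candidate[metric])
        print(f"{metric:22s} {change:+.1f}% vs llama-3.2-3b")
```

With the module's figures, kimi-k2-6 shows roughly 19% lower latency, 11% lower TTFT, and 22% higher output token throughput.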

Why does this happen?

GPU allocation matters. kimi-k2-6 runs on 8x H200 GPUs with massive memory bandwidth and compute. Llama 3.2 3B runs on a single tiny MIG slice with limited resources.

The larger model has the compute budget to outperform the smaller one — both in raw speed AND in quality.

Compare with WordSwarm game performance

This matches what you saw in the game:

Model          Benchmark Tokens/s   Game Score   Game Accuracy
kimi-k2-6      ~605                 1,038        65%
llama-3.2-3b   ~496                 36           80%

kimi-k2-6 generates tokens faster, finds better words, and scores ~29x higher despite lower accuracy. Quality matters more than speed for complex reasoning tasks.
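
A quick calculation shows how little of the score gap raw speed can account for. The sketch below uses the figures from the comparison above; the helper name advantage is hypothetical.

```python
def advantage(kimi_value: float, llama_value: float) -> float:
    """How many times larger the kimi-k2-6 figure is than the Llama 3.2 3B figure."""
    return kimi_value / llama_value


if __name__ == "__main__":
    speed = advantage(605, 496)   # benchmark output tokens/s
    score = advantage(1038, 36)   # WordSwarm game score
    print(f"throughput advantage: {speed:.2f}x")
    print(f"game score advantage: {score:.1f}x")
```

The throughput advantage is only about 1.2x while the score advantage is about 29x, so most of the difference in game results has to come from model capability rather than token speed.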

Verify

✓ Understood why the larger model is faster (GPU allocation)
✓ Connected benchmark results to game performance
✓ Recognized that model capability matters more than raw speed

Module summary

What you accomplished:

  • Installed and configured GuideLLM

  • Benchmarked two models with very different architectures and GPU allocations

  • Interpreted throughput, latency, TTFT, and ITL metrics from GuideLLM output

  • Discovered that GPU allocation matters more than model size for raw speed

  • Correlated synthetic benchmarks with real-world WordSwarm game performance

Key takeaways:

  • GPU allocation directly impacts inference speed — more resources = faster inference

  • A well-resourced large model can outperform a small model on limited hardware

  • Synthetic benchmarks measure raw speed, but capability matters for task performance

  • Model quality (reasoning, accuracy) often matters more than raw token throughput

  • GuideLLM provides detailed metrics for understanding LLM serving performance

Next: Head to the Conclusion for a summary and next steps.