Conclusion

Congratulations! You’ve completed the WordSwarm AI workshop.

What you accomplished

  • Module 1: Reasoning Prompting — Toggled thinking on/off via API calls to control LLM reasoning behavior, measured the token consumption and latency differences, and compared reasoning vs non-reasoning models in the WordSwarm game

  • Module 2: GuideLLM Benchmarking — Benchmarked kimi-k2-6 and Llama 3.2 3B with very different GPU allocations, discovered that infrastructure matters more than model size for raw speed, and connected synthetic metrics to real-world game performance

Key takeaways

Reasoning capability matters more than raw speed

kimi-k2-6 dominates the WordSwarm leaderboard because it reasons through ambiguity and follows the game loop reliably. The two Llama models respond faster and guess more accurately per word, but they struggle with complex reasoning tasks and complete far fewer words, leading to dramatically lower scores.

Model                       GPUs             Score   Level   Words   Accuracy   Avg Latency
--------------------------  ---------------  ------  ------  ------  ---------  -----------
kimi-k2-6                   8x H200          1,038   21      284     65%        977ms
nemotron-cascade-2-30b      1x MIG-3g.71GB   264     6       76      59%        2.0s
llama-4-scout-17b (W4A16)   2x MIG-3g.71GB   153     4       34      76%        516ms
llama-3.2-3b-instruct       1x MIG-1g.18GB   36      1       10      80%        567ms

The massive reasoning model (kimi-k2-6) scored about 29x higher than the tiny model (Llama 3.2 3B): 1,038 vs 36, despite lower accuracy (65% vs 80%) and higher in-game latency (977ms vs 567ms). Reasoning capability matters more than raw speed for complex tasks.
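The 29x figure can be checked directly from the leaderboard. The snippet below is a minimal sketch that hard-codes the rows from the table above (with 2.0s written as 2000 ms); the tuple layout and the helper name `score_ratio` are illustrative choices, not part of the workshop code.

```python
# Leaderboard rows copied from the table above:
# (model, score, words, accuracy_pct, avg_latency_ms)
LEADERBOARD = [
    ("kimi-k2-6", 1038, 284, 65, 977),
    ("nemotron-cascade-2-30b", 264, 76, 59, 2000),
    ("llama-4-scout-17b (W4A16)", 153, 34, 76, 516),
    ("llama-3.2-3b-instruct", 36, 10, 80, 567),
]


def score_ratio(rows, top, bottom):
    """Return how many times higher `top` scored than `bottom`."""
    scores = {row[0]: row[1] for row in rows}
    return scores[top] / scores[bottom]


ratio = score_ratio(LEADERBOARD, "kimi-k2-6", "llama-3.2-3b-instruct")

# Rank the same rows two ways: best score first, lowest latency first.
by_score = [row[0] for row in sorted(LEADERBOARD, key=lambda r: r[1], reverse=True)]
by_latency = [row[0] for row in sorted(LEADERBOARD, key=lambda r: r[4])]

print(f"score ratio: {ratio:.1f}x")
print("ranked by score:  ", by_score)
print("ranked by latency:", by_latency)
```

The two rankings disagree: the model with the lowest in-game latency is not the one with the highest score.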

GPU allocation directly impacts performance

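You might expect the tiny 3B model to beat the massive reasoning model on raw speed, but in the Module 2 GuideLLM benchmarks the opposite is true. kimi-k2-6 (8x H200 GPUs) outperforms Llama 3.2 3B (1x MIG-1g.18GB slice) on every benchmark metric: latency, TTFT, and tokens/sec. The in-game average latency in the leaderboard table is a separate measurement and does not follow the same ordering. Infrastructure matters.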

Thinking mode has significant trade-offs

Enabling reasoning (thinking mode) provides better-quality responses, but:

  • Consumes far more tokens (reasoning + content vs content alone)

  • Without token caps, responses include both reasoning and content

  • With token caps, reasoning can consume the entire budget, leaving content: null

Understanding when to enable/disable thinking is critical for balancing quality, cost, and latency.
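Client code can detect the case where reasoning used up the whole token budget. The following is a minimal sketch, assuming an OpenAI-compatible chat completion response in which the reasoning text arrives in a separate message field. The field name `reasoning_content`, the helper `summarize_response`, and the sample payload are assumptions for illustration; check what your serving stack actually returns.

```python
from typing import Any


def summarize_response(response: dict[str, Any],
                       reasoning_field: str = "reasoning_content") -> dict[str, Any]:
    """Classify a chat completion as ok, budget_exhausted, or no_content."""
    choice = response["choices"][0]
    message = choice["message"]
    content = message.get("content")
    # The reasoning field name is an assumption; it varies between servers.
    reasoning = message.get(reasoning_field) or ""

    if content:
        status = "ok"
    elif choice.get("finish_reason") == "length":
        # Generation stopped at the token cap before any content was produced.
        status = "budget_exhausted"
    else:
        status = "no_content"

    return {
        "status": status,
        "content": content,
        "has_reasoning": bool(reasoning),
        "completion_tokens": response.get("usage", {}).get("completion_tokens"),
    }


# Hypothetical response where thinking consumed the whole 64-token cap.
capped = {
    "choices": [{
        "finish_reason": "length",
        "message": {"content": None, "reasoning_content": "Let me consider..."},
    }],
    "usage": {"completion_tokens": 64},
}

print(summarize_response(capped))
```

When the status is `budget_exhausted`, the usual options are to raise the token cap or to retry with thinking disabled.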

Synthetic benchmarks don’t tell the whole story

GuideLLM measures raw inference capacity: tokens/sec, latency, TTFT. Real-world performance depends on how well the model uses those tokens. The highest-scoring models in WordSwarm aren’t always the fastest at token generation — reasoning quality matters more than raw speed for complex tasks.
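To make the three raw metrics concrete, here is a minimal sketch that derives them from per-token arrival timestamps of a single streamed response. It uses common textbook definitions and is not GuideLLM's implementation; the function name `stream_metrics` and the sample timestamps are hypothetical.

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, end-to-end latency, and tokens/sec for one streamed response.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each output token, in seconds, ascending.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")

    ttft = token_times[0] - request_start       # time to first token
    latency = token_times[-1] - request_start   # time until the last token
    tokens_per_sec = len(token_times) / latency if latency > 0 else float("inf")

    return {"ttft_s": ttft, "latency_s": latency, "tokens_per_sec": tokens_per_sec}


# Hypothetical stream: four tokens arriving every 250 ms.
metrics = stream_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
print(metrics)
```

None of these numbers say whether the generated tokens solved the puzzle, which is why the game leaderboard and the benchmark can rank models differently.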

Play to beat the AI

  • Play the game: https://red.ht/wordswarm

  • Watch the agent: Select a model, click START AGENT, and see how different models approach the same puzzle

  • Beat the AI: No human scores on the production leaderboard yet — can you beat kimi-k2-6’s 1,038?

Thank you!

We hope you enjoyed this workshop. The combination of reasoning prompting and model benchmarking gives you practical tools for evaluating and deploying LLMs on your own MaaS infrastructure.