Conclusion
Congratulations! You’ve completed the WordSwarm AI workshop.
What you accomplished
- Module 1 (Reasoning Prompting): Toggled thinking on/off via API calls to control LLM reasoning behavior, measured the token consumption and latency differences, and compared reasoning vs non-reasoning models in the WordSwarm game
- Module 2 (GuideLLM Benchmarking): Benchmarked kimi-k2-6 and Llama 3.2 3B with very different GPU allocations, discovered that infrastructure matters more than model size for raw speed, and connected synthetic metrics to real-world game performance
Key takeaways
Reasoning capability matters more than raw speed
kimi-k2-6 dominates the WordSwarm leaderboard because it reasons through ambiguity and follows the game loop reliably. Smaller models respond faster but struggle with complex reasoning tasks, leading to dramatically lower scores.
| Model | GPUs | Score | Level | Words | Accuracy | Avg Latency |
|---|---|---|---|---|---|---|
| kimi-k2-6 | 8x H200 | 1,038 | 21 | 284 | 65% | 977ms |
| nemotron-cascade-2-30b | 1x MIG-3g.71GB | 264 | 6 | 76 | 59% | 2.0s |
| llama-4-scout-17b (W4A16) | 2x MIG-3g.71GB | 153 | 4 | 34 | 76% | 516ms |
| llama-3.2-3b-instruct | 1x MIG-1g.18GB | 36 | 1 | 10 | 80% | 567ms |
The massive reasoning model (kimi-k2-6) scored about 29x higher than the tiny model (Llama 3.2 3B), 1,038 vs 36, despite lower accuracy (65% vs 80%). For complex tasks, reasoning capability matters more than raw speed.
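The 29x figure can be checked directly from the leaderboard. The snippet below is only a small arithmetic sketch: the scores and accuracy values are copied from the table, while the `leaderboard` dictionary and the `score_ratio` helper are hypothetical names introduced for illustration.

```python
# Scores and accuracy copied from the leaderboard table above.
# The dictionary layout and helper name are illustrative assumptions.
leaderboard = {
    "kimi-k2-6": {"score": 1038, "accuracy": 0.65},
    "llama-3.2-3b-instruct": {"score": 36, "accuracy": 0.80},
}


def score_ratio(top: str, baseline: str) -> float:
    """Return how many times higher the top model's score is than the baseline's."""
    return leaderboard[top]["score"] / leaderboard[baseline]["score"]


ratio = score_ratio("kimi-k2-6", "llama-3.2-3b-instruct")
print(f"{ratio:.1f}x")  # 28.8x, which rounds to the 29x quoted above
```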
GPU allocation directly impacts performance
You might expect the tiny 3B model to outperform the massive reasoning model on raw speed, but in the GuideLLM benchmarks from Module 2 the opposite is true. kimi-k2-6 (8x H200 GPUs) delivers better performance than Llama 3.2 3B (1x tiny MIG slice) across every benchmark metric: latency, TTFT, and tokens/sec. Infrastructure matters.
Thinking mode has significant trade-offs
Enabling reasoning (thinking mode) provides better quality responses but:
* Consumes far more tokens (reasoning + content vs content alone)
* Without token caps, responses include both reasoning and content
* With token caps, reasoning can consume the entire budget, leaving content: null
Understanding when to enable/disable thinking is critical for balancing quality, cost, and latency.
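The three cases above can be told apart by inspecting the reply message. The sketch below assumes an OpenAI-compatible chat completion response in which the reasoning text arrives in a `reasoning_content` field and a token-capped reply reports `finish_reason == "length"`. Both are assumptions about the serving stack, and the field names in your deployment may differ. The `summarize_reply` helper and the sample messages are hypothetical.

```python
from typing import Any


def summarize_reply(message: dict[str, Any], finish_reason: str) -> str:
    """Classify a chat reply by which parts the model managed to produce."""
    # Assumed field name for the reasoning text; adjust for your server.
    reasoning = message.get("reasoning_content")
    content = message.get("content")

    if content is None and reasoning:
        # Reasoning was produced but no final answer followed.
        if finish_reason == "length":
            return "budget exhausted by reasoning"
        return "reasoning only"
    if content is not None and reasoning:
        return "reasoning + content"
    if content is not None:
        return "content only"
    return "empty"


# Hypothetical sample messages illustrating the cases from the list above.
capped = {"reasoning_content": "Let me think about the letters...", "content": None}
uncapped = {"reasoning_content": "Let me think...", "content": "SWARM"}
no_thinking = {"content": "SWARM"}

print(summarize_reply(capped, "length"))
print(summarize_reply(uncapped, "stop"))
print(summarize_reply(no_thinking, "stop"))
```

A caller that sees the capped case could retry with a larger token budget or with thinking disabled, depending on whether quality or cost matters more for that request.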
Synthetic benchmarks don’t tell the whole story
GuideLLM measures raw inference capacity: tokens/sec, latency, TTFT. Real-world performance depends on how well the model uses those tokens. The highest-scoring models in WordSwarm aren’t always the fastest at token generation — reasoning quality matters more than raw speed for complex tasks.
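A quick way to see the gap between speed and outcome is to rank the same models two ways. The sketch below uses the score and average latency from the leaderboard table. Note that this latency is the game's measured value, used here as a rough stand-in for speed because this section does not list the GuideLLM tokens/sec figures. The `results` dictionary is a hypothetical structure for illustration.

```python
# (score, average latency in ms) copied from the leaderboard table above.
# 2.0s for nemotron-cascade-2-30b is written as 2000 ms.
results = {
    "kimi-k2-6": (1038, 977),
    "nemotron-cascade-2-30b": (264, 2000),
    "llama-4-scout-17b (W4A16)": (153, 516),
    "llama-3.2-3b-instruct": (36, 567),
}

# Rank by game score (higher is better) and by latency (lower is better).
by_score = sorted(results, key=lambda m: results[m][0], reverse=True)
by_latency = sorted(results, key=lambda m: results[m][1])

print("Highest score: ", by_score[0])
print("Lowest latency:", by_latency[0])
print("Rankings agree:", by_score == by_latency)
```

The two orderings disagree: the model with the lowest latency is not the one at the top of the leaderboard.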
Play to beat the AI
- Play the game: https://red.ht/wordswarm
- Watch the agent: Select a model, click START AGENT, and see how different models approach the same puzzle
- Beat the AI: No human scores on the production leaderboard yet. Can you beat kimi-k2-6’s 1,038?