# Conclusion

Congratulations! You’ve completed the WordSwarm AI workshop.

## What you accomplished
- **Module 1: Reasoning Prompting** — Explored `/think` and `/no-think` to control LLM reasoning behavior, measured the latency-quality tradeoff, and watched reasoning models outperform non-reasoning models on the WordSwarm game
- **Module 2: GuideLLM Benchmarking** — Benchmarked four models across different architectures and GPU allocations, compared synthetic throughput with real-world game performance, and explored concurrency profiles
## Key takeaways

### Reasoning matters for complex tasks
kimi-k2-5 dominates the WordSwarm leaderboard because it reasons through ambiguity. Smaller models are faster but can’t follow a game loop reliably. The `/think` and `/no-think` tags give you direct control over this tradeoff.
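As a minimal sketch of how the tags are applied: the tag is simply appended to the user message before the request is sent to the model's chat endpoint. The model name and prompt here are illustrative, not from a real deployment.

```python
# Sketch: toggling reasoning behavior with /think and /no-think tags.
# Assumes a model that honors these tags when they appear in the prompt,
# served behind an OpenAI-compatible chat-completions API.

def tag_prompt(prompt: str, reasoning: bool) -> dict:
    """Build a chat-completions payload with the reasoning tag appended."""
    tag = "/think" if reasoning else "/no-think"
    return {
        "model": "kimi-k2-5",  # hypothetical model id for illustration
        "messages": [{"role": "user", "content": f"{prompt} {tag}"}],
    }

# Fast, low-latency mode: skip the reasoning trace.
fast = tag_prompt("Guess a 5-letter word containing A and R", reasoning=False)
print(fast["messages"][0]["content"])
```

Sending `fast` to the endpoint (e.g. via an OpenAI-compatible client) trades answer quality for latency; flipping `reasoning=True` reverses the trade.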
### Benchmarks vs. reality
Synthetic benchmarks measure raw inference capacity — tokens/sec, latency percentiles, TTFT. Real-world performance depends on how well the model uses those tokens. nemotron-cascade has the highest raw throughput in-game (128.6 t/s) but scores 4x lower than kimi-k2-5.
### GPU cost-efficiency varies
| Model | GPU Memory | Best Score | Memory per Point |
|---|---|---|---|
| kimi-k2-5 | 1,128 GB | 1,038 | 1.09 GB/pt |
| nemotron-cascade-2-30b | 71 GB | 264 | 0.27 GB/pt |
| llama-4-scout-17b | 142 GB | 153 | 0.93 GB/pt |
| llama-3.2-3b-instruct | 18 GB | 36 | 0.50 GB/pt |
nemotron-cascade is the most GPU-efficient scorer at 0.27 GB per point. kimi-k2-5 wins on absolute score but needs over 1 TB of HBM3e for a word game.
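The "Memory per Point" column is just GPU memory divided by best score; a short sketch makes the ranking easy to recompute if the leaderboard changes (the figures below are copied from the table above):

```python
# Sketch: recomputing GB-per-point efficiency from the table above.
# (model name -> (GPU memory in GB, best WordSwarm score))
models = {
    "kimi-k2-5": (1128, 1038),
    "nemotron-cascade-2-30b": (71, 264),
    "llama-4-scout-17b": (142, 153),
    "llama-3.2-3b-instruct": (18, 36),
}

# Sort by memory cost per point, most efficient first.
ranked = sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1])
for name, (gb, score) in ranked:
    print(f"{name:24s} {gb / score:.2f} GB/pt")
```

Lower is better: nemotron-cascade-2-30b tops this ranking even though kimi-k2-5 holds the highest absolute score.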
## Try it yourself
- **Play the game:** https://red.ht/wordswarm
- **Watch the agent:** Select a model, click START AGENT, and see how different models approach the same puzzle
- **Beat the AI:** No human scores on the production leaderboard yet — can you beat kimi-k2-5’s 1,038?