Conclusion

Congratulations! You’ve completed the WordSwarm AI workshop.

What you accomplished

Module 1: Reasoning Prompting — Toggled thinking on/off via API calls to control LLM reasoning behavior, measured the token consumption and latency differences, and compared reasoning vs non-reasoning models in the WordSwarm game
Module 2: GuideLLM Benchmarking — Benchmarked kimi-k2-6 and Llama 3.2 3B with very different GPU allocations, discovered that infrastructure matters more than model size for raw speed, and connected synthetic metrics to real-world game performance

Key takeaways

Reasoning capability matters more than raw speed

kimi-k2-6 dominates the WordSwarm leaderboard because it reasons through ambiguity and follows the game loop reliably. Smaller models respond faster but struggle with complex reasoning tasks, leading to dramatically lower scores.

Model	GPUs	Score	Level	Words	Accuracy	Avg Latency
kimi-k2-6	8x H200	1,038	21	284	65%	977ms
nemotron-cascade-2-30b	1x MIG-3g.71GB	264	6	76	59%	2.0s
llama-4-scout-17b (W4A16)	2x MIG-3g.71GB	153	4	34	76%	516ms
llama-3.2-3b-instruct	1x MIG-1g.18GB	36	1	10	80%	567ms

The massive reasoning model (kimi-k2-6) scored 29x higher than the tiny model (Llama 3.2 3B) despite lower accuracy — reasoning capability matters more than raw speed for complex tasks.

GPU allocation directly impacts performance

You might expect the tiny 3B model to outperform the massive reasoning model, but the opposite is true. kimi-k2-6 (8x H200 GPUs) delivers better performance than Llama 3.2 3B (1x tiny MIG slice) across every metric: latency, TTFT, and tokens/sec. Infrastructure matters.

Thinking mode has significant trade-offs

Enabling reasoning (thinking mode) provides better quality responses but: * Consumes far more tokens (reasoning + content vs content alone) * Without token caps, responses include both reasoning and content * With token caps, reasoning can consume the entire budget, leaving content: null

Understanding when to enable/disable thinking is critical for balancing quality, cost, and latency.

Synthetic benchmarks don’t tell the whole story

GuideLLM measures raw inference capacity: tokens/sec, latency, TTFT. Real-world performance depends on how well the model uses those tokens. The highest-scoring models in WordSwarm aren’t always the fastest at token generation — reasoning quality matters more than raw speed for complex tasks.

Play to beat the AI

Play the game: https://red.ht/wordswarm
Watch the agent: Select a model, click START AGENT, and see how different models approach the same puzzle
Beat the AI: No human scores on the production leaderboard yet — can you beat kimi-k2-6’s 1,038?

Resources

Thank you!

We hope you enjoyed this workshop. The combination of reasoning prompting and model benchmarking gives you practical tools for evaluating and deploying LLMs on your own MaaS infrastructure.