Conclusion

Congratulations! You’ve completed the Voice Agents workshop.

What you accomplished

  • Architecture — Learned the voice sandwich pattern (STT → LLM → TTS), the model stack, WebSocket data flow, and gen-x performance considerations for real-time speech

  • Speech Models — Deployed Whisper (STT) and Higgs-Audio (TTS) on GPU MIG slices using KServe and vLLM, tested transcription and speech generation APIs, and measured TTS generation speed

  • Pizza Shop Demo — Deployed a multi-agent voice application with Helm, connecting speech models to a LangGraph supervisor that routes conversations through specialist agents (pizza, order, delivery)

Key takeaways

The voice sandwich is simple but powerful

Wrapping an LLM agent between STT and TTS layers turns any text-based agent into a voice agent. The pattern is model-agnostic — swap Whisper for another STT model or Higgs-Audio for another TTS model without changing the agent logic.
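The swap-ability described above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's implementation: the type aliases, `make_voice_agent`, and the stand-in stages (`fake_stt`, `echo_agent`, `fake_tts`) are hypothetical names, and the real Whisper and Higgs-Audio services are called over network APIs rather than as local functions.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for this sketch only).
Transcribe = Callable[[bytes], str]   # STT: audio in, text out
Respond = Callable[[str], str]        # agent: text in, text out
Synthesize = Callable[[str], bytes]   # TTS: text in, audio out


def make_voice_agent(stt: Transcribe, agent: Respond, tts: Synthesize) -> Callable[[bytes], bytes]:
    """Wrap a text-only agent between an STT stage and a TTS stage."""
    def handle_turn(audio_in: bytes) -> bytes:
        text_in = stt(audio_in)      # speech -> text
        text_out = agent(text_in)    # the agent only ever sees text
        return tts(text_out)         # text -> speech
    return handle_turn


# Stand-in stages so the sketch runs without any model servers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def shouting_tts(text: str) -> bytes:
    # A second TTS stand-in, used to show that stages can be swapped.
    return text.upper().encode("utf-8")


voice_agent = make_voice_agent(fake_stt, echo_agent, fake_tts)
reply = voice_agent(b"one large margherita")

# Swapping the TTS stage leaves the agent function untouched.
swapped_agent = make_voice_agent(fake_stt, echo_agent, shouting_tts)
swapped_reply = swapped_agent(b"one large margherita")
```

Because `echo_agent` only receives and returns text, replacing either speech stage requires no change to it, which is the property the pattern relies on.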

Gen-x determines voice quality

If TTS generation speed (gen-x) drops below 1.0, audio is generated more slowly than it plays back, so users hear choppy audio or gaps of silence. GPU selection matters: a smaller GPU can’t keep up with real-time playback, while MIG slices on modern GPUs (2-3x gen-x) provide smooth, natural-sounding playback.
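As a rough illustration of the 1.0 threshold, the sketch below treats gen-x as seconds of audio produced per second of wall-clock generation time. That definition is an assumption inferred from the text above, and the function names and sample timings are hypothetical, not measurements from the workshop.

```python
def gen_x(audio_seconds: float, generation_seconds: float) -> float:
    """Return seconds of audio produced per second spent generating it.

    Assumed definition: values above 1.0 mean audio is produced faster
    than it plays back.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def keeps_up_with_playback(ratio: float, threshold: float = 1.0) -> bool:
    """True when generation is at least as fast as playback."""
    return ratio >= threshold


# Hypothetical timings: 5 s of speech generated in 2 s of wall-clock time.
fast = gen_x(audio_seconds=5.0, generation_seconds=2.0)

# Hypothetical timings: 3 s of speech that took 4 s to generate.
slow = gen_x(audio_seconds=3.0, generation_seconds=4.0)

fast_ok = keeps_up_with_playback(fast)
slow_ok = keeps_up_with_playback(slow)
```

In the second case the player would run out of audio before the next chunk arrives, which is what listeners experience as gaps.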

Multi-agent routing keeps conversations natural

The LangGraph supervisor pattern lets each specialist agent handle its own domain (pizza selection, order totals, delivery) without needing to understand the full conversation. The supervisor analyzes intent and routes each turn to the right agent, which keeps the conversation flowing smoothly.
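The routing idea can be shown without any framework. The sketch below is plain Python and deliberately does not use the LangGraph API. The keyword table, `classify_intent`, and the three agent functions are hypothetical stand-ins, and the keyword match is a simplification used in place of whatever intent analysis the workshop's supervisor performs.

```python
from typing import Callable, Dict, Tuple

Agent = Callable[[str], str]


# Hypothetical specialist agents; each one only knows its own domain.
def pizza_agent(message: str) -> str:
    return "pizza: here are today's pizzas"


def order_agent(message: str) -> str:
    return "order: calculating your total"


def delivery_agent(message: str) -> str:
    return "delivery: checking delivery details"


AGENTS: Dict[str, Agent] = {
    "pizza": pizza_agent,
    "order": order_agent,
    "delivery": delivery_agent,
}

# Keyword matching stands in for intent analysis (an assumption of this sketch).
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "order": ("total", "price", "pay", "checkout"),
    "delivery": ("deliver", "address", "arrive"),
}


def classify_intent(message: str) -> str:
    """Pick the agent name whose keywords appear in the message."""
    lowered = message.lower()
    for name, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return name
    return "pizza"  # default specialist when nothing else matches


def supervisor(message: str) -> str:
    """Route one user turn to the matching specialist agent."""
    return AGENTS[classify_intent(message)](message)
```

Calling `supervisor("When will it arrive?")` dispatches to `delivery_agent`, while a message with no matching keyword falls through to `pizza_agent`. Each specialist stays small because only the supervisor needs to know that the others exist.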

Thank you!

We hope you enjoyed this workshop. The combination of speech models and multi-agent orchestration gives you a practical foundation for building voice-enabled AI applications on OpenShift AI.