# Conclusion

Congratulations! You’ve completed the Voice Agents workshop.

## What you accomplished
- Architecture — Learned the voice sandwich pattern (STT → LLM → TTS), the model stack, WebSocket data flow, and gen-x performance considerations for real-time speech
- Speech Models — Deployed Whisper (STT) and Higgs-Audio (TTS) on GPU MIG slices using KServe and vLLM, tested transcription and speech generation, and measured TTS generation speed
- Pizza Shop Demo — Deployed a multi-agent voice application with Helm, connecting speech models to a LangGraph supervisor that routes conversations through specialist agents (pizza, order, delivery)
- Observability — Explored MLflow tracing and Llama Stack for monitoring agent interactions, latency, and model behaviour across the pipeline
- Guardrails — Configured TrustyAI FMS and NeMo Guardrails for input/output screening, prompt injection detection, and content safety
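The supervisor routing from the Pizza Shop demo can be sketched in a few lines. This is a keyword-based stand-in for the workshop's LangGraph supervisor, with hypothetical specialist handlers; the real demo routes via the LLM, but the dispatch shape is the same:

```python
"""Minimal sketch of supervisor routing to specialist agents.

This is an illustrative keyword-based stand-in, not the workshop's
LangGraph implementation; the handlers and replies are hypothetical.
"""

def pizza_agent(msg: str) -> str:
    return "Pizza agent: let's pick your toppings."

def order_agent(msg: str) -> str:
    return "Order agent: checking your order status."

def delivery_agent(msg: str) -> str:
    return "Delivery agent: tracking your driver."

# Map a routing keyword to the specialist that handles it
ROUTES = {
    "pizza": pizza_agent,
    "order": order_agent,
    "delivery": delivery_agent,
}

def supervisor(msg: str) -> str:
    """Route to the first specialist whose keyword appears in the message."""
    text = msg.lower()
    for keyword, agent in ROUTES.items():
        if keyword in text:
            return agent(msg)
    return "Supervisor: could you rephrase that?"

print(supervisor("Where is my delivery?"))  # Delivery agent handles this turn
```

In the real demo the supervisor is itself an LLM node, so routing decisions come from the model rather than a keyword table, but each turn still ends with exactly one specialist producing the reply.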
## Key takeaways
### The voice sandwich is simple but powerful
Wrapping an LLM agent between STT and TTS layers turns any text-based agent into a voice agent. The pattern is model-agnostic — swap Whisper for another STT model or Higgs-Audio for another TTS model without changing the agent logic.
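One turn of the sandwich can be expressed as three composed callables. The stub models below are placeholders to show the wiring, not the workshop's actual clients; any STT/TTS implementation matching these signatures can be swapped in without touching the agent logic:

```python
"""Sketch of one voice-sandwich turn: STT -> LLM agent -> TTS.

The stub models are stand-ins; real deployments call the
Whisper, LangGraph, and Higgs-Audio endpoints instead.
"""
from typing import Callable

STT = Callable[[bytes], str]    # audio bytes -> transcript
Agent = Callable[[str], str]    # user text -> agent reply
TTS = Callable[[str], bytes]    # reply text -> audio bytes

def voice_turn(audio: bytes, stt: STT, agent: Agent, tts: TTS) -> bytes:
    """Run one conversational turn through the sandwich."""
    transcript = stt(audio)     # speech to text (e.g. Whisper)
    reply = agent(transcript)   # text agent (e.g. LangGraph supervisor)
    return tts(reply)           # text to speech (e.g. Higgs-Audio)

# Wiring demo with stand-ins
stub_stt: STT = lambda audio: "one large pepperoni, please"
stub_agent: Agent = lambda text: f"Got it: {text}"
stub_tts: TTS = lambda text: text.encode("utf-8")

print(voice_turn(b"\x00", stub_stt, stub_agent, stub_tts))
```

Because `voice_turn` depends only on the three signatures, swapping models is a one-line change at the call site.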
### Gen-x determines voice quality

If TTS generation speed (gen-x) drops below 1.0, the model produces less than one second of audio per second of wall-clock time: synthesis falls behind playback and users hear choppy audio or silence gaps. GPU selection matters — an L4 (0.78x) can’t keep up with real time, while an L40S (1.95x) or an H200 MIG slice (2-3x) provides smooth playback.
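The ratio itself is simple to compute. The timings below are illustrative values chosen to reproduce the quoted ratios, not measurements from the workshop:

```python
"""Gen-x (TTS real-time factor): seconds of audio generated per
second of wall-clock time. Timings here are illustrative, picked
to match the quoted 0.78x and 1.95x ratios."""

def gen_x(audio_seconds: float, wall_clock_seconds: float) -> float:
    """> 1.0 means synthesis outpaces playback; < 1.0 means gaps."""
    return audio_seconds / wall_clock_seconds

# 10 s of speech generated in 12.8 s: below real time (L4-class GPU)
print(round(gen_x(10.0, 12.8), 2))   # 0.78 — choppy playback
# 10 s of speech generated in 5.13 s: above real time (L40S-class GPU)
print(round(gen_x(10.0, 5.13), 2))   # 1.95 — smooth playback
```

To measure this for a deployed model, time the synthesis request and divide the duration of the returned audio by that elapsed time.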