Voice Agents: Build an AI Voice Agent Workshop

Welcome to the Voice Agents workshop!

The live demo is a voice-enabled AI agent that can take pizza orders via spoken conversation.

What you’ll learn

In this workshop, you will:

  • Deploy speech models — set up Whisper (STT) and Higgs-Audio (TTS) on GPU MIG slices using KServe and vLLM

  • Build a voice agent — deploy a multi-agent LangGraph application that takes pizza orders via spoken conversation

  • Measure TTS performance — understand gen-x (generation speed relative to real time) and why GPU selection matters for voice quality

  • Add observability — monitor agent interactions, latency, and model behaviour with MLflow tracing

  • Apply guardrails — configure TrustyAI FMS and NeMo for prompt injection detection and content safety

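The gen-x metric mentioned above is just a ratio: seconds of audio produced per second of wall-clock generation time. A value above 1.0 means the model synthesizes speech faster than it plays back, which a live voice agent needs to avoid audible gaps. A minimal sketch (the function name is illustrative, not part of the workshop code):

```python
def generation_speed_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Ratio of audio duration to the time spent generating it.

    > 1.0: faster than real time (suitable for live conversation)
    < 1.0: slower than real time (the listener would hear pauses)
    """
    return audio_seconds / wall_clock_seconds

# e.g. 12 s of speech synthesized in 3 s of GPU time
print(generation_speed_factor(12.0, 3.0))  # 4.0 → "4x" generation speed
```

This is why GPU selection matters: a smaller MIG slice can push the factor below 1.0 for the same model, which shows up as stuttering rather than lower audio quality.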
The app

The pizza shop voice agent follows the voice sandwich pattern — STT and TTS layers wrap an LLM agent graph. You speak into the microphone, a supervisor agent routes your request to specialist agents (pizza, order, delivery), and the response is spoken back to you in real time.

Pizza Shop Conversation
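The voice sandwich above can be sketched in a few lines. This is a hedged illustration only — `transcribe`, `route`, and `synthesize` are hypothetical stand-ins for the Whisper, LangGraph supervisor, and Higgs-Audio pieces you will deploy in the workshop, not its actual API:

```python
def transcribe(audio: bytes) -> str:
    # STT layer (Whisper in the workshop); stubbed here for illustration.
    return "one large pepperoni pizza please"

def route(text: str) -> str:
    # Supervisor agent: pick a specialist based on the request.
    if "pizza" in text:
        return "pizza_agent"
    if "deliver" in text:
        return "delivery_agent"
    return "order_agent"

def synthesize(text: str) -> bytes:
    # TTS layer (Higgs-Audio in the workshop); stubbed as raw bytes.
    return text.encode()

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    text = transcribe(audio)
    agent = route(text)                      # supervisor routing
    reply = f"[{agent}] Got it: {text}"      # a specialist agent would respond here
    return synthesize(reply)
```

The key design point is the sandwich itself: the agent graph in the middle only ever sees and produces text, so the STT and TTS layers can be swapped or scaled independently.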

Who this is for

This workshop is designed for AI/ML engineers, platform engineers, and developers who want hands-on experience with:

  • Voice-enabled AI applications (STT + LLM + TTS)

  • Multi-agent orchestration with LangGraph

  • Model serving on OpenShift AI with vLLM and KServe

  • AI safety guardrails (TrustyAI, NeMo)

Experience level: Beginner to Intermediate

Prerequisites

  • Access to an OpenShift AI cluster with GPU nodes

  • A Hugging Face account and API token

  • A web browser with microphone access (for the voice demo)

Estimated time

This workshop takes approximately 20 minutes to complete.

All images have a lightbox attached: click any image to overlay it at full size on the page, and click again to minimize it.

Let’s get started!

Click on Workshop Overview in the navigation to begin.