Voice Agents: Build an AI Voice Agent Workshop

Welcome to the Voice Agents workshop! In this hands-on session, you’ll build a production voice AI pipeline by deploying speech models, orchestrating multi-agent workflows, and measuring real-time performance.

You’ll work in notebooks and a terminal, deploying models to OpenShift AI and benchmarking TTS generation speed (the gen-x metric). The Pizza Shop Demo showcases the complete voice agent in action: during the exercises you’ll deploy the models powering it, measure their performance, and have a spoken conversation about pizza.

What you’ll learn

In this workshop, you will:

Deploy speech models — set up Whisper (STT) and Higgs-Audio (TTS) on GPU MIG slices using KServe and vLLM
Build a voice agent — deploy a multi-agent LangGraph application that takes pizza orders via spoken conversation
Measure TTS performance — understand gen-x (generation speed vs real-time) and why GPU selection matters for voice quality

Understanding and navigating the workshop UI

The workshop interface is designed to let you read instructions while working in your environment.

Left panel: Workshop instructions and content

Right panel: Embedded environment for hands-on exercises

View modes (top navigation):

Instructions: Full-page instructions (hide environment tabs)
Split: Side-by-side view (current default) - see both instructions and environment
Tabs: Full-page environment (hide instructions)

Adjusting the layout: In Split mode, drag the middle divider left or right to resize the panels.

The app

The pizza shop voice agent follows the voice sandwich pattern — STT and TTS layers wrap an LLM agent graph. You speak into the microphone, a supervisor agent routes your request to specialist agents (pizza, order, delivery), and the response is spoken back to you in real-time.

The voice agent operates as a multi-agent graph using LangGraph:

Browser captures microphone audio and sends WAV over WebSocket
Whisper transcribes the audio to text
A supervisor agent analyses intent and routes to the right specialist
Specialist agents (pizza, order, delivery) handle their domain using tools
Higgs-Audio converts the response text to speech
PCM audio streams back to the browser in real time (~20 ms chunks)
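
The steps above can be sketched as a small pipeline. This is a minimal illustration, not the workshop’s implementation: the class name VoiceSandwich, the stub callables, and the audio format (24 kHz, 16-bit mono PCM) are all assumptions made for the example. In the real app, the three callables would be backed by Whisper, the LangGraph agent graph, and Higgs-Audio.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

# Assumed audio format, for illustration only.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM
CHUNK_MS = 20


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     bytes_per_sample: int = BYTES_PER_SAMPLE,
                     chunk_ms: int = CHUNK_MS) -> int:
    """Number of bytes in one chunk of chunk_ms milliseconds of PCM audio."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    return samples_per_chunk * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Split a PCM buffer into fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


@dataclass
class VoiceSandwich:
    """STT and TTS layers wrapped around an agent (all three are hypothetical stubs)."""
    transcribe: Callable[[bytes], str]   # stands in for Whisper
    respond: Callable[[str], str]        # stands in for the agent graph
    synthesize: Callable[[str], bytes]   # stands in for Higgs-Audio

    def handle_turn(self, wav_bytes: bytes) -> Iterator[bytes]:
        text = self.transcribe(wav_bytes)   # audio -> text
        reply = self.respond(text)          # text -> reply text
        pcm = self.synthesize(reply)        # reply text -> PCM audio
        yield from iter_chunks(pcm, chunk_size_bytes())


# Demo with stand-in functions instead of real models.
pipeline = VoiceSandwich(
    transcribe=lambda wav: "one margherita please",
    respond=lambda text: f"You said: {text}",
    synthesize=lambda reply: b"\x00\x00" * 1200,  # 1200 silent samples
)
chunks = list(pipeline.handle_turn(b"fake-wav-bytes"))
print(chunk_size_bytes(), [len(c) for c in chunks])
```

Under the assumed format, a 20 ms chunk holds 480 samples, so each chunk sent to the browser is 960 bytes. A different sample rate or bit depth changes that number but not the shape of the pipeline.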

The AI agent

The system uses a supervisor pattern — one agent coordinates several specialists:

Agent	Role
Supervisor	Analyses user intent and routes to the right specialist. Handles general conversation directly.
Pizza Agent	Knows the menu, helps you pick toppings and pizza types. Has access to menu lookup tools.
Order Agent	Calculates totals, manages quantities, and summarises your order.
Delivery Agent	Collects delivery address and timing preferences.

The agent uses LangGraph with the ReAct (Reason + Act) pattern — at each step it observes the conversation state, decides which specialist to route to, and adapts based on your requests.
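
The supervisor pattern can be illustrated without any framework. The sketch below is a deliberately simplified stand-in: it routes by keyword matching, whereas the workshop’s supervisor uses an LLM inside a LangGraph graph to analyse intent. The keyword lists, the route function, and the handler replies are hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword lists. The real supervisor classifies intent with an LLM.
SPECIALIST_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "pizza": ("menu", "topping", "margherita", "pepperoni"),
    "order": ("total", "how much", "quantity", "my order"),
    "delivery": ("deliver", "address", "arrive"),
}


def route(utterance: str) -> str:
    """Return the name of the agent that should handle the utterance."""
    text = utterance.lower()
    for specialist, keywords in SPECIALIST_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return specialist
    # No specialist matched: the supervisor answers general conversation itself.
    return "supervisor"


# Hypothetical handlers standing in for the specialist agents and their tools.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "pizza": lambda text: "Here is the menu.",
    "order": lambda text: "Here is your order summary.",
    "delivery": lambda text: "Where should we deliver?",
    "supervisor": lambda text: "Hi! How can I help with your pizza today?",
}


def handle(utterance: str) -> Tuple[str, str]:
    """Route the utterance, then let the chosen agent produce a reply."""
    agent = route(utterance)
    return agent, HANDLERS[agent](utterance)


for line in ("What toppings do you have?",
             "How much is my order?",
             "Can you deliver to 12 Main Street?",
             "Hello there"):
    print(handle(line))
```

The point of the design is the single entry point: every turn goes through the supervisor, which either answers directly or hands off to exactly one specialist.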

Understanding gen-x

For real-time voice agents, gen-x (generation speed) is the critical metric. It measures how fast the TTS model produces audio compared to real-time playback:

gen-x = audio seconds produced / wall-clock seconds elapsed

gen-x	What it means
< 1.0x	Choppy playback with gaps in speech. The model can’t keep up with real time.
1.0–2.0x	Smooth but with occasional micro-pauses during complex responses.
> 2.0x	Smooth, natural-sounding speech with no perceptible delay.

GPU selection drives gen-x. An L4 GPU delivers ~0.78x (too slow for smooth voice), while an H200 MIG slice achieves 2–3x. You’ll measure this yourself during the speech models module.
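
As a worked example of the gen-x definition and the bands in the table, here is a minimal sketch. The function names and the PCM format used to convert bytes into audio seconds (24 kHz, 16-bit mono) are assumptions for illustration; the workshop notebooks may measure it differently.

```python
def pcm_duration_seconds(num_bytes: int,
                         sample_rate_hz: int = 24_000,
                         bytes_per_sample: int = 2) -> float:
    """Seconds of audio in a mono PCM buffer (format is an assumption)."""
    return num_bytes / (sample_rate_hz * bytes_per_sample)


def gen_x(audio_seconds: float, wall_clock_seconds: float) -> float:
    """gen-x = audio seconds produced / wall-clock seconds elapsed."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


def describe(value: float) -> str:
    """Map a gen-x value onto the three bands from the table."""
    if value < 1.0:
        return "choppy: the model can't keep up with real time"
    if value <= 2.0:
        return "smooth, with occasional micro-pauses"
    return "smooth, no perceptible delay"


# 7.8 s of audio generated in 10 s of wall-clock time gives about 0.78x.
slow = gen_x(7.8, 10.0)
# 25 s of audio generated in 10 s of wall-clock time gives 2.5x.
fast = gen_x(25.0, 10.0)
print(f"{slow:.2f}x -> {describe(slow)}")
print(f"{fast:.2f}x -> {describe(fast)}")
```

In practice the wall-clock time would come from timing a TTS request, for example with time.perf_counter() before and after the call, and the audio seconds from the size of the returned PCM.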

Workshop modules

In this workshop you’ll work through these modules:

Architecture — The voice sandwich pattern, model stack, WebSocket data flow, and gen-x performance
Speech Models — Deploy Whisper (STT) and Higgs-Audio (TTS) on GPU MIG slices, test the APIs, and measure generation speed
Pizza Shop Demo — Deploy the multi-agent voice app with Helm, connect everything together, and have a spoken conversation about pizza

Who this is for

This workshop is designed for AI/ML engineers, platform engineers, and developers who want hands-on experience with:

Voice-enabled AI applications (STT + LLM + TTS)
Multi-agent orchestration with LangGraph
Model serving on OpenShift AI with vLLM and KServe

Experience level: Beginner to Intermediate

Every image in the instructions opens in a lightbox overlay: click an image to enlarge it, and click again to close it.

Let’s get started!

Ready to build your voice agent? Let’s begin!