Architecture

Discover how the voice agent application is built — the voice sandwich pattern, model stack, data flow, and performance considerations.

The Voice Sandwich

The "voice sandwich" pattern wraps an LLM agent between speech-to-text (STT) and text-to-speech (TTS) layers:

Voice Sandwich

Current Stack

  • Python WebSocket server

  • Next.js client

  • LangChain agents SDK (LangGraph)

  • WebSockets for client-server communication

  • RHOAI 3.x Platform on AWS

Simplifications

  • No VAD — manual start/stop recording via a button

  • WebSockets instead of WebRTC (simpler, sufficient for demo)

Model Selection

Model Name License Description Notes

Whisper

apache-2.0

Speech to Text

Quantized version of openai/whisper-large-v3-turbo whisper-large-v3-turbo-quantized.w4a16

Higgs-Audio

Higgs Audio 2 (Meta 3) Community

Text to Speech

Audio foundation model github higgs-audio-v2-generation-3B-base

Llama4 Scout

LLama 4 Community

Agent Model

Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

Voice Flow

Voice Flow Diagram

The user records audio in the browser, the UI sends the WAV to the Python WebSocket server, the server runs STT + the agent graph, then streams TTS audio back to the browser for playback.

Infrastructure Performance

Prevent buffering and choppy sound when generating speech.

GPU Performance

Measure Text to Speech generation speed to ensure smooth, real-time playback:

(gen x) = (audio seconds produced) / (wall clock seconds elapsed)
  • If gen x < 1.0 for significant periods, the TTS stream is arriving slower than realtime — you will get underruns/chops unless you add more buffering (delay) or pause/rebuffer.

  • If gen x >= 1.0 consistently, smooth playback is achievable. Above 2.0x provides excellent real-time performance.

In this workshop, you’ll deploy TTS on a MIG 2g.35gb slice and measure generation speed during the models module. Expect gen-x values around 2–3x real-time.

Gen-X Performance Chart

Client Architecture

Javascript client architecture using SharedArrayBuffer ring — shared-memory ring buffer (SharedArrayBuffer + Atomics) between main thread and worklet to eliminate per-chunk messaging entirely.


Next Steps

Now that you understand the voice sandwich architecture and performance considerations, let’s deploy the speech models and measure their generation speed in action.

Continue to Speech Models to deploy Whisper (STT) and Higgs-Audio (TTS).