# Architecture
Discover how the voice agent application is built — the voice sandwich pattern, model stack, data flow, and performance considerations.
## The Voice Sandwich
The "voice sandwich" pattern wraps an LLM agent between speech-to-text (STT) and text-to-speech (TTS) layers:
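As a minimal sketch, the sandwich is just a composition of three stages. The stub `stt`, `agent`, and `tts` callables below are illustrative placeholders for the real model calls (Whisper, the agent graph, Higgs-Audio), not the application's actual API:

```python
from typing import Callable

def voice_sandwich(
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Compose the three layers into one audio-in / audio-out function."""
    def handle(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)  # speech-to-text
        reply = agent(transcript)   # LLM agent turn
        return tts(reply)           # text-to-speech
    return handle

# Stub stages, just to show the data flow through the sandwich:
pipeline = voice_sandwich(
    stt=lambda wav: "hello",
    agent=lambda text: text.upper(),
    tts=lambda text: text.encode(),
)
print(pipeline(b"\x00\x01"))  # b'HELLO'
```

Each layer only sees the representation the previous one produced, so any STT, agent, or TTS implementation with the same shape can be swapped in.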
## Current Stack

- Python WebSocket server
- Next.js client
- LangChain agents SDK (LangGraph)
- MLflow / Llama Stack for observability (optional)
- WebSockets for client-server communication
- RHOAI 3.x Platform on AWS
## Model Selection

| Model Name | Purpose | Description | Notes |
|---|---|---|---|
| Whisper | Speech to Text | Quantized version of `openai/whisper-large-v3-turbo` (`whisper-large-v3-turbo-quantized.w4a16`) | |
| Higgs-Audio | Text to Speech | Audio foundation model (github): `higgs-audio-v2-generation-3B-base` | |
| Llama4 Scout | Agent Model | | |
## Voice Flow
The user records audio in the browser, the UI sends the WAV to the Python WebSocket server, the server runs STT + the agent graph, then streams TTS audio back to the browser for playback.
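One way to realize the "streams TTS audio back" step is to frame the generated audio into small fixed-size chunks as it is produced, rather than waiting for the full utterance. The helper below is a sketch; the name `stream_chunks`, the 16 kHz / 16-bit mono assumption, and the 100 ms chunk size are illustrative choices, not taken from the codebase:

```python
from typing import Iterator

def stream_chunks(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Yield successive fixed-size chunks of raw PCM audio.

    At 16 kHz mono 16-bit, 3200 bytes is 100 ms of audio: small enough
    for low-latency playback, large enough to amortize per-message overhead.
    """
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

# 8000 bytes of silence splits into two full chunks and one partial chunk:
chunks = list(stream_chunks(b"\x00" * 8000, chunk_bytes=3200))
print([len(c) for c in chunks])  # [3200, 3200, 1600]
```

On the server, each yielded chunk would be sent as a binary WebSocket message so the browser can begin playback before generation finishes.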
## Infrastructure Performance

The goal is to prevent buffer underruns and choppy sound when generating speech.
### GPU Card

Measure text-to-speech generation speed early:

`gen x = (audio seconds produced) / (wall clock seconds elapsed)`
- If `gen x < 1.0` for significant periods, the TTS stream is arriving slower than realtime: you will get underruns/chops unless you add more buffering (delay) or pause and rebuffer.
- If `gen x >= 1.0` consistently, then underruns are coming from client-side issues.
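The ratio can be computed directly from the emitted sample count and a wall-clock timer around the generation loop. The function below is a sketch; the 24 kHz sample rate in the example is an assumption, not a documented property of the TTS model:

```python
def gen_x(samples_produced: int, sample_rate: int, wall_seconds: float) -> float:
    """Realtime factor for TTS: > 1.0 means generation outpaces playback."""
    audio_seconds = samples_produced / sample_rate
    return audio_seconds / wall_seconds

# Example: 10 s of 24 kHz audio generated in 12.8 s of wall time,
# i.e. slower than realtime, so buffering is needed.
print(round(gen_x(240_000, 24_000, 12.8), 2))  # 0.78
```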
| Card Details | Cores | EC2 Instance Type | Gen-X |
|---|---|---|---|
| NVIDIA L4 - 24 GB | 7,424 CUDA Cores, 240 Tensor Cores (Gen 4), 60 RT Cores (Gen 3) | g6.8xlarge | 0.78 |
| NVIDIA L40S - 48 GB | 18,176 CUDA Cores, 568 Tensor Cores (Gen 4), 142 RT Cores (Gen 3) | g6e.xlarge | 1.95 |
## Client Architecture

The JavaScript client uses a shared-memory ring buffer (SharedArrayBuffer + Atomics) between the main thread and the audio worklet, eliminating per-chunk messaging entirely.
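For illustration, the head/tail logic of such a single-producer/single-consumer ring can be sketched in Python. In the browser the same indices would live in a SharedArrayBuffer and be updated with Atomics so the main thread (producer) and the AudioWorklet (consumer) never exchange per-chunk messages; this sketch models only the index arithmetic, not the atomic memory operations:

```python
class RingBuffer:
    """Single-producer/single-consumer byte ring with one slot kept empty."""

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # next write position (owned by the producer)
        self.tail = 0  # next read position (owned by the consumer)

    def write(self, data: bytes) -> int:
        """Copy as much of `data` as fits; return the number of bytes written."""
        free = self.capacity - 1 - ((self.head - self.tail) % self.capacity)
        n = min(len(data), free)
        for i in range(n):
            self.buf[(self.head + i) % self.capacity] = data[i]
        self.head = (self.head + n) % self.capacity
        return n

    def read(self, max_bytes: int) -> bytes:
        """Drain up to `max_bytes` from the buffer."""
        avail = (self.head - self.tail) % self.capacity
        n = min(max_bytes, avail)
        out = bytes(self.buf[(self.tail + i) % self.capacity] for i in range(n))
        self.tail = (self.tail + n) % self.capacity
        return out

rb = RingBuffer(8)
rb.write(b"abcdef")
print(rb.read(4))  # b'abcd'
```

Because the producer only advances `head` and the consumer only advances `tail`, no lock is needed; in the worklet the consumer drains the ring inside the audio callback while the main thread writes incoming network chunks.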