Architecture

Discover how the voice agent application is built — the voice sandwich pattern, model stack, data flow, and performance considerations.

The Voice Sandwich

The "voice sandwich" pattern wraps an LLM agent between speech-to-text (STT) and text-to-speech (TTS) layers:

[Figure: Voice Sandwich]

Current Stack

  • Python WebSocket server

  • Next.js client

  • LangChain agents SDK (LangGraph)

  • MLflow / Llama Stack for observability (optional)

  • WebSockets for client-server communication

  • RHOAI 3.x Platform on AWS

Simplifications

  • No voice activity detection (VAD): recording is started and stopped manually with a button

  • WebSockets instead of WebRTC (simpler, sufficient for demo)

Model Selection

| Model Name   | License                          | Description    | Notes |
|--------------|----------------------------------|----------------|-------|
| Whisper      | apache-2.0                       | Speech to Text | Quantized version of openai/whisper-large-v3-turbo: whisper-large-v3-turbo-quantized.w4a16 |
| Higgs-Audio  | Higgs Audio 2 (Meta 3) Community | Text to Speech | Audio foundation model (GitHub): higgs-audio-v2-generation-3B-base |
| Llama4 Scout | Llama 4 Community                | Agent Model    | Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 |

Voice Flow

[Figure: Voice Flow Diagram]

The user records audio in the browser, the UI sends the WAV to the Python WebSocket server, the server runs STT + the agent graph, then streams TTS audio back to the browser for playback.
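The flow above can be sketched as a single asynchronous handler. This is a minimal illustration, not the application's actual code: `transcribe`, `run_agent`, `synthesize`, and `handle_utterance` are hypothetical stand-ins for the real STT, agent graph, and TTS calls, and `send` stands in for whatever writes a binary frame to the WebSocket.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


# Hypothetical stand-ins for the real model calls (assumptions, not the app's API).
async def transcribe(wav_bytes: bytes) -> str:
    """Pretend STT: a real server would send the WAV to the Whisper endpoint."""
    return "hello"


async def run_agent(text: str) -> str:
    """Pretend agent: a real server would invoke the agent graph."""
    return f"You said: {text}"


async def synthesize(text: str) -> AsyncIterator[bytes]:
    """Pretend streaming TTS: yields one fake audio chunk per word."""
    for word in text.split():
        yield word.encode()


async def handle_utterance(
    wav_bytes: bytes, send: Callable[[bytes], Awaitable[None]]
) -> str:
    """Run STT, then the agent, then stream TTS chunks back as they arrive."""
    transcript = await transcribe(wav_bytes)
    reply = await run_agent(transcript)
    async for chunk in synthesize(reply):
        await send(chunk)  # forward each chunk immediately instead of buffering
    return reply


async def demo() -> list[bytes]:
    """Collect what would have been sent to the browser."""
    sent: list[bytes] = []

    async def send(chunk: bytes) -> None:
        sent.append(chunk)

    await handle_utterance(b"RIFF....", send)
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Forwarding each TTS chunk as soon as it is produced, rather than waiting for the full reply, is what lets playback begin before synthesis has finished.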

Infrastructure Performance

Prevent buffering and choppy sound when generating speech.

GPU Card

Measure Text to Speech generation speed early:

gen x = (audio seconds produced) / (wall-clock seconds elapsed)

  • If gen x < 1.0 for significant periods, the TTS stream is arriving slower than real time. Playback will underrun and sound choppy unless you add more buffering (delay) or pause and rebuffer.

  • If gen x >= 1.0 consistently, generation is keeping up with playback, so any underruns are coming from client-side issues.
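One way to take this measurement is to count audio bytes as TTS chunks arrive and divide by elapsed wall-clock time. The sketch below assumes raw PCM chunks; the `GenXMeter` name, the 24 kHz sample rate, and the 16-bit mono format are illustrative assumptions, so substitute the values your TTS endpoint actually emits.

```python
import time
from typing import Callable, Optional


class GenXMeter:
    """Tracks gen x = audio seconds produced / wall-clock seconds elapsed."""

    def __init__(
        self,
        sample_rate: int,
        bytes_per_sample: int = 2,  # assumption: 16-bit PCM
        channels: int = 1,          # assumption: mono
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.bytes_per_second = sample_rate * bytes_per_sample * channels
        self.clock = clock
        self.started_at: Optional[float] = None
        self.audio_bytes = 0

    def start(self) -> None:
        """Call when the TTS request is issued, so time-to-first-chunk counts."""
        self.started_at = self.clock()
        self.audio_bytes = 0

    def add_chunk(self, chunk: bytes) -> None:
        """Call for every PCM chunk received from the TTS stream."""
        self.audio_bytes += len(chunk)

    def gen_x(self) -> float:
        """Return the current ratio, or 0.0 if no time has elapsed yet."""
        if self.started_at is None:
            return 0.0
        elapsed = self.clock() - self.started_at
        if elapsed <= 0:
            return 0.0
        return (self.audio_bytes / self.bytes_per_second) / elapsed


if __name__ == "__main__":
    meter = GenXMeter(sample_rate=24_000)
    meter.start()
    meter.add_chunk(b"\x00" * 48_000)  # one second of 16-bit mono audio at 24 kHz
    time.sleep(0.1)
    print(f"gen x = {meter.gen_x():.2f}")
```

Starting the clock when the request is issued, rather than at the first chunk, includes model warm-up in the ratio, which is the delay the listener actually experiences.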

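| Card Details        | Cores                                                             | AMI Instance Type | Gen-X |
|---------------------|-------------------------------------------------------------------|-------------------|-------|
| NVIDIA L4 - 24 GB   | 7,424 CUDA Cores, 240 Tensor Cores (Gen 4), 60 RT Cores (Gen 3)   | g6.8xlarge        | 0.78  |
| NVIDIA L40S - 48 GB | 18,176 CUDA Cores, 568 Tensor Cores (Gen 4), 142 RT Cores (Gen 3) | g6e.xlarge        | 1.95  |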
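The measured ratios can be turned into a rough buffering budget. Assuming generation runs at a constant rate g = gen x and playback consumes audio in real time, a clip of D seconds finishes generating at D / g, so playback must start at least D(1 - g) / g seconds after generation begins to avoid an underrun. The helper below is a hypothetical illustration of that arithmetic. Real generation speed fluctuates, so treat the result as a lower bound.

```python
def min_prebuffer_seconds(audio_seconds: float, gen_x: float) -> float:
    """Smallest start delay that avoids underruns at a constant generation rate.

    Derivation: audio produced by time t is gen_x * t, audio consumed is t - delay.
    Requiring produced >= consumed up to the end of generation (t = D / gen_x)
    gives delay >= D * (1 - gen_x) / gen_x.
    """
    if gen_x <= 0:
        raise ValueError("gen_x must be positive")
    if gen_x >= 1.0:
        return 0.0  # generation keeps up with playback; no extra delay needed
    return audio_seconds * (1.0 - gen_x) / gen_x


if __name__ == "__main__":
    # Gen-X values taken from the table above; the 10-second reply is an assumption.
    for card, gen_x in [("NVIDIA L4", 0.78), ("NVIDIA L40S", 1.95)]:
        delay = min_prebuffer_seconds(10.0, gen_x)
        print(f"{card}: prebuffer at least {delay:.2f} s for a 10 s reply")
```

Under these assumptions, a 10-second reply on the L4 needs roughly 2.8 seconds of prebuffering, while the L40S needs none.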

[Figure: Gen-X Performance Chart]

Client Architecture

The JavaScript client uses a shared-memory ring buffer (SharedArrayBuffer + Atomics) between the main thread and the worklet, which eliminates per-chunk messaging entirely.
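The index arithmetic behind such a ring buffer can be modeled in a few lines. The sketch below is a single-threaded Python model written only to illustrate the idea; it is not the client's code. In the browser, the sample storage and the two counters would live in a SharedArrayBuffer and be read and written with Atomics, with the main thread as the only producer and the worklet as the only consumer. The `RingBuffer` class and its method names are hypothetical.

```python
from array import array
from typing import List, Sequence


class RingBuffer:
    """Single-producer/single-consumer ring buffer over a fixed float array."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data = array("f", [0.0]) * capacity
        self.write_idx = 0  # total samples ever written (advanced by producer only)
        self.read_idx = 0   # total samples ever read (advanced by consumer only)

    def available_read(self) -> int:
        return self.write_idx - self.read_idx

    def available_write(self) -> int:
        return self.capacity - self.available_read()

    def push(self, samples: Sequence[float]) -> int:
        """Producer side: copy in as many samples as fit, return the count."""
        n = min(len(samples), self.available_write())
        for i in range(n):
            self.data[(self.write_idx + i) % self.capacity] = samples[i]
        self.write_idx += n  # publish only after the samples are stored
        return n

    def pop(self, out_len: int) -> List[float]:
        """Consumer side: always return out_len samples, padding with silence."""
        n = min(out_len, self.available_read())
        out = [self.data[(self.read_idx + i) % self.capacity] for i in range(n)]
        self.read_idx += n
        out.extend([0.0] * (out_len - n))  # underrun: fill the gap with zeros
        return out


if __name__ == "__main__":
    rb = RingBuffer(4)
    rb.push([1.0, 2.0, 3.0])
    print(rb.pop(2))
    rb.push([4.0, 5.0, 6.0, 7.0])  # only three samples fit; the last is dropped
    print(rb.pop(5))
```

Because each counter is advanced by only one side, the producer and consumer never contend for the same index, which is what makes a lock-free shared-memory version feasible.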