Speech Models
Deploy and test the Speech to Text (Whisper) and Text to Speech (Higgs-Audio) models using RHOAI.
These two models form the speech layer of the voice agent pipeline:
User Speech → [Whisper STT] → Text → [LLM Agent] → Text → [Higgs-Audio TTS] → Speech
| Model | Type | GPU | API Endpoint |
|---|---|---|---|
| Whisper | Speech to Text | MIG 1g.18gb | /v1/audio/transcriptions |
| Higgs-Audio | Text to Speech | MIG 2g.35gb | /v1/audio/speech |
Run the Notebook
The fastest way through this section is the notebook. It deploys both models, waits for them, and tests the APIs — all from your workbench.
Prerequisites
Open a Terminal in JupyterLab, log in to OpenShift, and clone the repository:
git clone https://github.com/rhai-code/voice-agents.git
Open and run
In the File Explorer, navigate to voice-agents/content/notebooks/ and open the notebook for this section.
Run all cells (Run > Run All Cells). The notebook will:
- Create the Hugging Face secret (you need to paste your token)
- Deploy Whisper and verify it is ready
- Generate a test WAV and transcribe it with Whisper
- Deploy Higgs-Audio and verify it is ready
- Send text to the TTS endpoint and play the audio inline
- Measure TTS generation speed (gen x)
Models need GPU resources (MIG slices) and take a few minutes to start. Re-run the wait cells until each model reports that it is ready.
If the notebook completes successfully, skip to the next section. The rest of this page describes each step in detail.
Step-by-step Reference
Speech to Text — Whisper
Whisper converts spoken audio into text. We deploy the whisper-large-v3-turbo model (quantized W4A16) from the Red Hat model registry using the KServe LLMInferenceService custom resource. It runs on a small GPU slice (MIG 1g.18gb).
Test
Set the model URL:
export MODEL_URL=https://inference.apps.ocp.cloud.rhai-tmm.dev/voice-agents/whisper
Export your MaaS API token:
export STT_TOKEN=$(oc get secret maas-secret -o jsonpath='{.data.stt-token}' | base64 -d)
echo "Token obtained: ${STT_TOKEN:0:20}..."
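Because the model can take a few minutes to start, it may help to poll the endpoint before sending audio. The following is a minimal sketch, not part of the notebook: `wait_until` is a hypothetical helper, and the `/v1/models` probe in the commented usage line assumes the service exposes that OpenAI-compatible route through the gateway, which this page does not confirm.

```shell
# wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been made.
wait_until() {
  local max_tries=$1 delay=$2 try=1
  shift 2
  while [ "$try" -le "$max_tries" ]; do
    if "$@"; then
      return 0
    fi
    echo "not ready (attempt ${try}/${max_tries})" >&2
    # Sleep only if another attempt follows
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$delay"
    fi
    try=$((try + 1))
  done
  return 1
}

# Assumed readiness probe (the /v1/models route is an assumption):
# wait_until 30 10 curl -sf -o /dev/null \
#   -H "Authorization: Bearer ${STT_TOKEN}" "${MODEL_URL}/v1/models"
```

The helper returns non-zero if the command never succeeds, so it can gate the transcription request in a script.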
Send an audio file to the transcription endpoint:
curl -s -X POST "${MODEL_URL}/v1/audio/transcriptions" \
  -H "Authorization: Bearer ${STT_TOKEN}" \
  --form file=@test.wav \
  --form model=whisper | jq .
Expected response:
{
  "text": " Hello.",
  "usage": {
    "type": "duration",
    "seconds": 3
  }
}
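In a script you usually want only the transcript, not the whole JSON body. The following is a minimal sketch that assumes the response shape shown above; `extract_transcript` is a hypothetical helper name, and the sample response is hard-coded for illustration rather than fetched from the endpoint.

```shell
# Sample response copied from the example above; in practice this would be
# the body returned by the curl call.
response='{"text": " Hello.", "usage": {"type": "duration", "seconds": 3}}'

# extract_transcript: reads the JSON response on stdin and prints the "text"
# field with leading and trailing whitespace removed.
extract_transcript() {
  jq -r '.text' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

if command -v jq >/dev/null 2>&1; then
  printf '%s' "$response" | extract_transcript   # prints: Hello.
else
  echo "jq is required for this example" >&2
fi
```

Whisper output often starts with a space, as in the sample, so trimming before passing the text to the LLM agent keeps prompts tidy.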
Text to Speech — Higgs-Audio
Higgs-Audio generates natural-sounding speech from text, completing the voice agent loop. It runs as a standard Kubernetes Deployment with vLLM on a GPU (MIG 2g.35gb slice) and downloads the model from Hugging Face.
Test
Send a text prompt to the TTS endpoint:
curl -X POST https://higgs-audio-predictor-voice-agents.apps.ocp.cloud.rhai-tmm.dev/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "higgs-audio-v2-generation-3B-base",
    "voice": "belinda",
    "input": "What would you like on your pizza?",
    "response_format": "pcm"
  }' \
  --output - | ffmpeg -f s16le -ar 24000 -ac 1 -i pipe:0 -f wav - | ffplay -nodisp -autoexit -
This sends text to the TTS model, receives raw PCM audio (signed 16-bit little-endian, 24 kHz, mono), converts it to WAV with ffmpeg, and plays it with ffplay.
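If the machine running the command has no audio device, for example a remote workbench terminal, saving the result to a WAV file may be more practical than piping to ffplay. The following is a minimal sketch using the same ffmpeg input flags as the command above; `pcm_to_wav` and the file names are hypothetical.

```shell
# pcm_to_wav IN.pcm OUT.wav
# Wraps raw signed 16-bit little-endian, 24 kHz, mono PCM in a WAV container.
# -nostdin stops ffmpeg from consuming the script's standard input.
pcm_to_wav() {
  ffmpeg -nostdin -loglevel error -y -f s16le -ar 24000 -ac 1 -i "$1" "$2"
}

# Example, after saving the curl output with --output /tmp/speech.pcm:
# pcm_to_wav /tmp/speech.pcm /tmp/speech.wav
```

The resulting file can then be downloaded or played by any tool that reads WAV.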
Measure TTS Generation Speed (gen x)
For a real-time voice agent, TTS must generate audio faster than real-time — otherwise the user hears silence while waiting. We measure this as gen x:
gen x = audio seconds produced / wall clock seconds elapsed
A gen x of 1.0 means audio is generated exactly as fast as it plays back. Above 1.0, the model generates faster than playback, and higher is better. Below 1.0, playback outruns generation and the user hears gaps.
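As a worked example of the formula above, here is a minimal sketch that assumes the PCM format used on this page (16-bit mono at 24 kHz, so 48,000 bytes per second of audio). The `gen_x` helper name and the sample numbers are illustrative, not measurements.

```shell
# gen_x PCM_BYTES WALL_MS
# Prints audio duration divided by wall-clock time, to two decimals.
# Assumes 24000 samples/s x 2 bytes/sample x 1 channel.
gen_x() {
  awk -v bytes="$1" -v wall_ms="$2" 'BEGIN {
    audio_ms = bytes * 1000 / (24000 * 2)
    printf "%.2f\n", audio_ms / wall_ms
  }'
}

gen_x 144000 1200   # 3000 ms of audio in 1200 ms -> 2.50
gen_x 96000 2500    # 2000 ms of audio in 2500 ms -> 0.80
```

The first case is comfortably faster than real-time; the second would leave the listener waiting.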
Send a few prompts and measure the generation speed:
for prompt in \
  "What would you like on your pizza?" \
  "Your order has been placed and will be ready in about twenty minutes." \
  "We have pepperoni, mushrooms, olives, onions, and extra cheese available as toppings."
do
  # Wall-clock timestamps in nanoseconds (%N requires GNU date)
  START=$(date +%s%N)
  curl -s -X POST https://higgs-audio-predictor-voice-agents.apps.ocp.cloud.rhai-tmm.dev/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"higgs-audio-v2-generation-3B-base\",\"voice\":\"belinda\",\"input\":\"${prompt}\",\"response_format\":\"pcm\"}" \
    --output /tmp/tts-bench.pcm
  END=$(date +%s%N)
  WALL_MS=$(( (END - START) / 1000000 ))
  # Audio duration from file size: 24000 samples/s x 2 bytes/sample, mono
  # (stat -c%s is the GNU form)
  PCM_BYTES=$(stat -c%s /tmp/tts-bench.pcm)
  AUDIO_MS=$(( PCM_BYTES * 1000 / (24000 * 2) ))
  echo "${prompt}: audio=${AUDIO_MS}ms wall=${WALL_MS}ms gen_x=$(echo "scale=1; ${AUDIO_MS} / ${WALL_MS}" | bc)x"
done
On a MIG 2g.35gb slice, expect gen x values around 2–3x real-time once the model is warm (the first request may be slower due to model warm-up).