Speech Models

Deploy and test the Speech to Text (Whisper) and Text to Speech (Higgs-Audio) models on Red Hat OpenShift AI (RHOAI).

These two models form the speech layer of the voice agent pipeline:

User Speech → [Whisper STT] → Text → [LLM Agent] → Text → [Higgs-Audio TTS] → Speech

Model         Type             GPU           API Endpoint
Whisper       Speech to Text   MIG 1g.18gb   /v1/audio/transcriptions
Higgs-Audio   Text to Speech   MIG 2g.35gb   /v1/audio/speech

Run the Notebook

The fastest way through this section is the notebook. It deploys both models, waits for them, and tests the APIs — all from your workbench.

Prerequisites

Open a Terminal in JupyterLab, log in to OpenShift, and clone the voice-agents repository:

git clone https://github.com/rhai-code/voice-agents.git

Open and run

In the File Explorer, navigate to voice-agents/content/notebooks/ and open:

Run all cells (Run > Run All Cells). The notebook will:

  1. Create the Hugging Face secret (you need to paste your token)

  2. Deploy Whisper and verify it is ready

  3. Generate a test WAV and transcribe it with Whisper

  4. Deploy Higgs-Audio and verify it is ready

  5. Send text to the TTS endpoint and play the audio inline

  6. Measure TTS generation speed (gen x)

[Image: Models Notebook]

Models need GPU resources (MIG slices) and take a few minutes to start. Re-run the wait cells until you see Ready = True for Whisper and 1/1 Running for Higgs-Audio.
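If you prefer to wait from a terminal instead of re-running notebook cells, the polling idea can be sketched as a small shell helper. This is a minimal sketch and not part of the notebook: the wait_until function name is made up for illustration, and the resource name in the commented usage line is a placeholder to replace with the names from your deployment.

```shell
# Re-run a command until it succeeds or a timeout expires.
# Hypothetical helper for illustration; not taken from the notebook.
wait_until() {
  # $1 = timeout in seconds, $2 = polling interval in seconds, rest = command to run
  wu_deadline=$(( $(date +%s) + $1 ))
  wu_interval=$2
  shift 2
  until "$@"; do
    # Give up once the deadline has passed
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
}

# Example usage (placeholder resource name, adjust to your cluster):
# wait_until 600 15 oc rollout status deployment/<higgs-audio-deployment> --timeout=10s
```

The helper returns 0 as soon as the command succeeds and 1 if the timeout is reached first, so it can gate the test steps that follow.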

If the notebook completes successfully, skip to the next section. The rest of this page describes each step in detail.


Step-by-step Reference

Speech to Text — Whisper

Whisper converts spoken audio into text. We deploy the whisper-large-v3-turbo model (quantized W4A16) from the Red Hat model registry using the KServe LLMInferenceService custom resource. It runs on a small GPU slice (MIG 1g.18gb).

Test

Set the model URL:

export MODEL_URL=https://inference.apps.ocp.cloud.rhai-tmm.dev/voice-agents/whisper

Export your MaaS API token:

export STT_TOKEN=$(oc get secret maas-secret -o jsonpath='{.data.stt-token}' | base64 -d)
echo "Token obtained: ${STT_TOKEN:0:20}..."

Send an audio file to the transcription endpoint:

# Upload test.wav as multipart form data and pretty-print the JSON reply
curl -s -X POST "${MODEL_URL}/v1/audio/transcriptions" \
  -H "Authorization: Bearer ${STT_TOKEN}" \
  --form file=@test.wav \
  --form model=whisper | jq .

Example response (the text and duration depend on the audio you send):

{
  "text": " Hello.",
  "usage": {
    "type": "duration",
    "seconds": 3
  }
}

Text to Speech — Higgs-Audio

Higgs-Audio generates natural-sounding speech from text, completing the voice agent loop. It runs as a standard Kubernetes Deployment with vLLM on a MIG 2g.35gb GPU slice and downloads the model from Hugging Face.

Test

Send a text prompt to the TTS endpoint:

curl -X POST https://higgs-audio-predictor-voice-agents.apps.ocp.cloud.rhai-tmm.dev/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "higgs-audio-v2-generation-3B-base",
    "voice": "belinda",
    "input": "What would you like on your pizza?",
    "response_format": "pcm"
  }' \
  --output - | ffmpeg -f s16le -ar 24000 -ac 1 -i pipe:0 -f wav - | ffplay -nodisp -autoexit -

This sends text to the TTS model, receives raw PCM audio (signed 16-bit, 24 kHz, mono), converts it to WAV with ffmpeg, and plays it with ffplay.
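The request body in the command above is hand-written JSON. When the input text comes from a variable, a double quote or backslash in the text would break the JSON. The following is a minimal sketch of building the body safely in plain shell. The json_escape and tts_payload function names are hypothetical, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so a string can sit inside a JSON string.
# Hypothetical helper; does not handle newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the TTS request body for the given text.
# Model and voice are the values used in the curl example on this page.
tts_payload() {
  printf '{"model":"%s","voice":"%s","input":"%s","response_format":"pcm"}\n' \
    "higgs-audio-v2-generation-3B-base" "belinda" "$(json_escape "$1")"
}

tts_payload 'She said "extra cheese", please.'
```

The output can be passed to curl as -d "$(tts_payload "$prompt")" in place of an inline JSON string.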

Measure TTS Generation Speed (gen x)

For a real-time voice agent, TTS must generate audio faster than it plays back; otherwise playback stalls and the user hears silence while waiting. We measure this as gen x:

gen x = audio seconds produced / wall clock seconds elapsed

A gen x of 1.0 means audio is generated exactly as fast as it plays back. Above 1.0, the model generates faster than playback, and higher is better. Below 1.0, playback outruns generation and the user experiences latency gaps.
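To make the arithmetic concrete, here is a small sketch that computes gen x from the size of a raw PCM response and the wall-clock time of the request. It assumes the format described on this page (16-bit mono PCM at 24 kHz, so 48000 bytes per second of audio). The gen_x function name is made up for illustration.

```shell
# Compute gen x from a PCM byte count and a wall-clock time.
# Assumes 16-bit mono PCM at 24 kHz: 24000 samples/s * 2 bytes = 48000 bytes/s.
gen_x() {
  # $1 = PCM size in bytes, $2 = wall-clock time in milliseconds (must be > 0)
  awk -v bytes="$1" -v wall_ms="$2" 'BEGIN {
    audio_s = bytes / (24000 * 2)   # seconds of audio produced
    wall_s  = wall_ms / 1000        # seconds elapsed
    printf "%.2f\n", audio_s / wall_s
  }'
}

gen_x 144000 1500   # 3 s of audio generated in 1.5 s, prints 2.00
gen_x 96000 4000    # 2 s of audio generated in 4 s, prints 0.50
```

The first call is comfortably faster than real-time; the second would leave the listener waiting.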

Send a few prompts and measure the generation speed:

# TTS endpoint used for every benchmark request
TTS_URL=https://higgs-audio-predictor-voice-agents.apps.ocp.cloud.rhai-tmm.dev/v1/audio/speech

for prompt in \
  "What would you like on your pizza?" \
  "Your order has been placed and will be ready in about twenty minutes." \
  "We have pepperoni, mushrooms, olives, onions, and extra cheese available as toppings."
do
  # Nanosecond timestamps require GNU date
  START=$(date +%s%N)
  # -f makes curl fail on HTTP errors, so an error body is never measured as audio
  if ! curl -sf -X POST "${TTS_URL}" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"higgs-audio-v2-generation-3B-base\",\"voice\":\"belinda\",\"input\":\"${prompt}\",\"response_format\":\"pcm\"}" \
    --output /tmp/tts-bench.pcm
  then
    echo "${prompt}: request failed" >&2
    continue
  fi
  END=$(date +%s%N)

  WALL_MS=$(( (END - START) / 1000000 ))
  # File size in bytes (GNU stat)
  PCM_BYTES=$(stat -c%s /tmp/tts-bench.pcm)
  # 16-bit mono PCM at 24 kHz is 24000 * 2 bytes per second
  AUDIO_MS=$(( PCM_BYTES * 1000 / (24000 * 2) ))
  echo "${prompt}: audio=${AUDIO_MS}ms wall=${WALL_MS}ms gen_x=$(echo "scale=1; ${AUDIO_MS} / ${WALL_MS}" | bc)x"
done

On a MIG 2g.35gb slice, expect gen x values of around 2–3x once the model is warm; the first request may be slower while the model warms up.
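Because the first request can be slower, one reasonable way to summarize a benchmark run is to discard the first measurement and average the rest. The sketch below assumes you have collected gen x values in request order. The warm_mean function name is hypothetical, and the input values are made-up numbers for illustration, not measured results.

```shell
# Average gen x values, treating the first one as a warm-up and skipping it.
# Hypothetical helper for illustration.
warm_mean() {
  # Arguments: gen x values in request order
  if [ "$#" -lt 2 ]; then
    echo "need at least two measurements" >&2
    return 1
  fi
  shift   # drop the warm-up measurement
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

warm_mean 0.8 2.0 3.0   # illustrative values; prints 2.50
```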