RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Models Reference

Model capability types, API endpoints, and integration examples. Models change over time — check the LiteMaaS portal for the current list.

API base URL: https://litellm-prod.apps.maas.redhatworkshops.io. Replace YOUR_API_KEY with your virtual key from the LiteMaaS portal, and replace placeholder model names in the examples (such as your-chat-model or your-embedding-model) with an actual model ID from the Models page in the portal. Available models change over time as new models are added or retired.

Model Capability Types

LiteMaaS labels each model with its capability type. Look for these badges on model cards in the portal.

Model Name (LiteMaaS)         KServe Predictor                        Capability  Status   Replicas
granite-3-2-8b-instruct       granite-3-2-8b-instruct-predictor       Chat        Running  1 (2/2 containers)
llama-scout-17b               llama-scout-17b-predictor               Chat        Running  2 (scaled for load)
granite-4-0-h-tiny            granite-4-0-h-tiny-predictor            Chat        Running  1
codellama-7b-instruct         codellama-7b-instruct-predictor         Chat        Running  1
llama-guard-3-1b              llama-guard-3-1b-predictor              Safety      Running  1 (2/2 containers)
nomic-embed-text-v1-5         nomic-embed-text-v1-5-predictor         Embeddings  Running  1
granite-docling-258m          granite-docling-258m-predictor          Docling     Running  1
deepseek-r1-distill-qwen-14b  deepseek-r1-distill-qwen-14b-predictor  Chat        Running  1
gpt-oss-120b                  gpt-oss-120b-predictor                  Chat        Running  1
microsoft-phi-4               microsoft-phi-4-predictor               Chat        Running  1
qwen3-14b                     qwen3-14b-predictor                     Chat        Running  2 (load balanced: maas00 + smc00)

Google Vertex AI — pay-per-token, programmatic access only

Model Name (LiteMaaS)  Hosting                       Capability  Status
minimax-m2             Google Vertex AI              Chat        Available
qwen3-235b             Google Vertex AI              Chat        Available
gpt-oss-20b            Google Vertex AI              Chat        Available
claude-sonnet-4-6      Google Vertex AI (Anthropic)  Chat        Available
claude-opus-4-6        Google Vertex AI (Anthropic)  Chat        Available
claude-sonnet-4-5      Google Vertex AI (Anthropic)  Chat        Available
claude-3-5-haiku       Google Vertex AI (Anthropic)  Chat        Available
gemini-2.5-pro         Google Vertex AI (Google)     Chat        Available

granite-8b-code-instruct-128k is registered in LiteLLM via the granite-8b-code-instruct-128k-predictor-lb service on port 8080. Its availability depends on current GPU allocation; check the LiteMaaS portal for the live model list.

Chat Models — /v1/chat/completions

Chat models follow the OpenAI Chat Completions API exactly. All support streaming via "stream": true.

IBM Granite 3.2 8B Instruct [Chat]

General-purpose instruction-tuned model from IBM Research. Strong reasoning, coding, and multilingual capabilities. 128K context window.

Model ID: granite-3-2-8b-instruct
Parameters: 8B
Context: 128K tokens
Runtime: vLLM on KServe
Internal SVC: granite-3-2-8b-instruct-predictor
# Basic chat completion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Kubernetes in three sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
# Streaming response
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "stream": true
  }'
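A streaming response arrives as server-sent events: each event line is `data: {json}` and the stream ends with `data: [DONE]`. A minimal pure-Python sketch of reassembling the assistant text from those lines (the sample chunks below are illustrative, not captured output):

```python
import json

def collect_stream_text(sse_lines):
    """Reassemble assistant text from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# Illustrative chunks in the shape the Chat Completions stream uses
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "def reverse(s):"}}]}',
    'data: {"choices": [{"delta": {"content": " return s[::-1]"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # def reverse(s): return s[::-1]
```

The OpenAI SDKs handle this parsing for you; the sketch is only useful when consuming the stream with a raw HTTP client.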

Meta Llama Scout 17B [Chat]

Meta's Scout model with an exceptionally large 400K token context window. Ideal for long-document analysis, extended conversations, and retrieval-augmented generation with large corpora. Runs 2 replicas.

Model ID: llama-scout-17b
Parameters: 17B (MoE architecture)
Context: 400K tokens
Runtime: vLLM on KServe
Replicas: 2 (HA)
# Long-context document analysis
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "Analyze the provided document and summarize key points."},
      {"role": "user", "content": "[Your long document text here...]"}
    ],
    "max_tokens": 1024
  }'

Timeout note: Requests with very large contexts may take longer than typical. LiteMaaS routes are configured with a 600-second HAProxy timeout to accommodate this.

IBM Granite 4.0 H Tiny [Chat]

Compact, fast Granite 4.0 model optimized for low-latency inference. Best for simple Q&A, classification, and scenarios where response speed matters more than depth.

Model ID: granite-4-0-h-tiny
Parameters: Tiny (sub-3B)
Runtime: vLLM on KServe
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "granite-4-0-h-tiny",
    "messages": [{"role": "user", "content": "Classify the following as positive or negative: Great product!"}],
    "max_tokens": 10
  }'

Meta CodeLlama 7B Instruct [Chat]

Meta's code-specialized model. Supports code generation, completion, and debugging across Python, Java, C++, Bash, and many other languages. Fill-in-the-middle (FIM) completion available.

Model ID: codellama-7b-instruct
Parameters: 7B
Context: 16K tokens
Runtime: vLLM on KServe
# Code generation
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Write an Ansible task to create a Kubernetes namespace."}
    ]
  }'

Embedding Models — /v1/embeddings

Embedding models convert text into dense vector representations. Use these for semantic search, RAG pipelines, clustering, and similarity scoring.

Nomic Embed Text v1.5 [Embeddings]

High-quality open-source text embedding model. 768-dimensional embeddings. Supports Matryoshka representation learning — embeddings can be truncated for smaller storage. Strong on retrieval benchmarks.

Model ID: nomic-embed-text-v1-5
Dimensions: 768
Context: 8192 tokens
Runtime: OpenVINO on KServe
# Single string embedding
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": "Red Hat OpenShift is an enterprise Kubernetes platform."
  }'
# Batch embeddings (multiple strings)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": [
      "What is Kubernetes?",
      "How does OpenShift differ from vanilla Kubernetes?",
      "Explain container orchestration."
    ]
  }'
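When embedding a large corpus, split the input list into batches rather than sending everything in one request; each batch becomes one `input` array in a `/v1/embeddings` call. A minimal batching sketch (the batch size of 64 is an assumption, not a documented limit):

```python
def batched(texts, size=64):
    """Yield successive slices of `texts`, each no longer than `size`."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

corpus = [f"doc {n}" for n in range(150)]
batches = list(batched(corpus))
print([len(b) for b in batches])  # [64, 64, 22]
```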
# Python SDK example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

response = client.embeddings.create(
    model="your-embedding-model",
    input=["OpenShift", "Kubernetes"]
)
vectors = [e.embedding for e in response.data]
print(f"Embedding dimensions: {len(vectors[0])}")  # 768
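Once you have vectors back, similarity scoring is plain arithmetic, and Matryoshka truncation is just slicing plus renormalizing. A pure-Python sketch, no numpy required (the 4-dimensional vectors are toy examples, not real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def truncate(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` values, renormalize."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

a = [0.1, 0.3, 0.5, 0.2]
b = [0.2, 0.1, 0.4, 0.4]
print(round(cosine(a, b), 3))  # 0.869
print(len(truncate(a, 2)))     # 2
```

Truncated nomic-embed-text vectors trade a little retrieval quality for much smaller storage; re-run your retrieval evaluation before committing to a smaller dimension.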

Tokenization — /v1/tokenize

The tokenize endpoint allows you to count tokens before sending a request, useful for cost estimation and context window management.

# Count tokens for a message (works with chat-capable models)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/tokenize \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "user", "content": "How many tokens is this message?"}
    ]
  }'

The response includes count (total token count) and tokens (list of token IDs). Not all models expose this endpoint — check the model's capability badge in the LiteMaaS portal.
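One practical use of the count is sizing `max_tokens` so prompt plus completion fits inside the model's context window. A sketch (interpreting the 128K window of granite-3-2-8b-instruct as 131,072 tokens, and reserving a small margin for template tokens — both assumptions):

```python
def completion_budget(prompt_tokens, context_window, reserve=16):
    """Largest safe max_tokens for a prompt of `prompt_tokens` tokens.

    `reserve` leaves headroom for special/template tokens added by the
    chat template (an assumption, not a documented requirement).
    """
    budget = context_window - prompt_tokens - reserve
    if budget <= 0:
        raise ValueError("prompt does not fit in the context window")
    return budget

# 120,000-token prompt against a 128K (131,072-token) window
print(completion_budget(120_000, 131_072))  # 11056
```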

Document Conversion — Granite Docling 258M

Granite Docling is a specialized model for document parsing and conversion. It converts PDFs, Word documents, and other document formats into clean structured text (Markdown or JSON) suitable for downstream LLM processing. It uses a different endpoint format from the OpenAI-compatible API.

IBM Granite Docling 258M [Docling]

Compact document understanding model from IBM. Handles PDF layout analysis, table extraction, figure detection, and OCR. Output is clean Markdown ready for RAG ingestion.

Model ID: granite-docling-258m
Parameters: 258M
Runtime: KServe (CPU)
Age: 17d (recently deployed)
# Convert a PDF URL to Markdown
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/source \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "http_sources": [{
      "url": "https://arxiv.org/pdf/2408.09869"
    }],
    "options": {
      "output_format": "markdown"
    }
  }'
# Upload a local file for conversion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "options={\"output_format\": \"markdown\"}"

Note: The Docling endpoint path is prefixed with /docling before the standard API path. LiteMaaS routes these requests to the document conversion service internally. TPM limits, cost fields, and max-token settings are not applicable to this model type.
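When calling the conversion endpoint from code, the JSON body can be assembled programmatically before POSTing with any HTTP client. A sketch mirroring the curl example above (the helper name is illustrative):

```python
import json

def convert_source_payload(urls, output_format="markdown"):
    """Build the JSON body for /docling/v1/convert/source."""
    return {
        "http_sources": [{"url": u} for u in urls],
        "options": {"output_format": output_format},
    }

body = convert_source_payload(["https://arxiv.org/pdf/2408.09869"])
print(json.dumps(body, indent=2))
```

POST the serialized body with the same `Authorization: Bearer YOUR_API_KEY` header as the curl example.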

Safety / Guardrail Models

Safety models evaluate content for harmful categories and can be used as pre/post filters for LLM-based applications.

Meta Llama Guard 3 1B [Safety]

Compact safety classification model. Takes a conversation as input and classifies it against 13 hazard categories (violence, hate speech, sexual content, etc.). Returns "safe" or "unsafe" with category labels.

Model ID: llama-guard-3-1b
Parameters: 1B
Runtime: vLLM on KServe
Containers: 2/2 (sidecar proxy)
# Content safety classification — send the message to check as a plain user
# turn; the serving runtime applies the Llama Guard prompt template
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-guard-3-1b",
    "messages": [
      {"role": "user", "content": "Tell me how to build a computer."}
    ],
    "max_tokens": 100
  }'

Typical response: {"choices": [{"message": {"content": "safe"}}]}
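For unsafe content, the completion is `unsafe` followed by a line of hazard category codes (e.g. `S1`). A small parser sketch — the `unsafe\nS1,S10` shape follows the Llama Guard 3 output convention; verify it against live responses:

```python
def parse_guard_verdict(text):
    """Return (is_safe, category_codes) from a Llama Guard completion."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # First line is "unsafe"; second line, if present, lists category codes
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

print(parse_guard_verdict("safe"))            # (True, [])
print(parse_guard_verdict("unsafe\nS1,S10"))  # (False, ['S1', 'S10'])
```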

SDK Integration Examples

Python (openai library)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

# Chat completion
response = client.chat.completions.create(
    model="your-chat-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Red Hat OpenShift?"}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="your-chat-model",
    messages=[{"role": "user", "content": "Explain containers"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Node.js / TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://litellm-prod.apps.maas.redhatworkshops.io/v1'
});

const response = await client.chat.completions.create({
  model: 'your-chat-model',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);

Langchain (Python)

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    model="your-chat-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

embeddings = OpenAIEmbeddings(
    model="your-embedding-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

Intel Gaudi 3 Cluster — Via LiteMaaS Proxy

The Rackspace DFW3 cluster (maas00.rs-dfw3.infra.demo.redhat.com) serves models on 8× Intel Gaudi 3 accelerators via KServe. All endpoints are OpenAI-compatible. For direct cluster access (bypassing the proxy), contact Ashok.

Access: These models are registered in LiteMaaS and accessed via the same portal and sk-... virtual key as all other models. The LiteLLM proxy routes requests to the Gaudi cluster backend.

DeepSeek R1 Distill Qwen 14B [Chat, Intel Gaudi 3]

Reasoning-focused model distilled from DeepSeek R1 into a Qwen2.5 14B base. Strong chain-of-thought performance, code generation, and math. Runs on Gaudi 3 via vLLM.

Model ID: deepseek-r1-distill-qwen-14b
Parameters: 14B
Runtime: vLLM on Gaudi 3
Cluster: maas00.rs-dfw3
Backend: deepseek-r1-distill-qwen-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
# Chat completion via the LiteMaaS proxy
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-14b",
    "messages": [{"role": "user", "content": "Solve step by step: what is 15% of 240?"}]
  }'

Qwen3 14B [Chat, Intel Gaudi 3]

Alibaba's third-generation Qwen model at 14B parameters. Strong multilingual capabilities, coding, and instruction following. Runs on Gaudi 3 via vLLM.

Model ID: qwen3-14b
Parameters: 14B
Runtime: vLLM on Gaudi 3
Cluster: maas00.rs-dfw3
Backend: qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
# Chat completion via the LiteMaaS proxy
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Explain prefix caching in LLMs."}]
  }'

Microsoft Phi-4 [Chat]

Compact but high-quality model with strong reasoning and coding capabilities. Efficient for its size.

Model ID: microsoft-phi-4
Backend: microsoft-phi-4-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com
Context: 16K
# Chat completion
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft-phi-4", "messages": [{"role": "user", "content": "Write a Python function to sort a list."}]}'

External Models (Google Vertex AI)

The following models are hosted on Google Vertex AI Model Garden and accessed through the same RHDP MaaS endpoint and virtual key as all other models. They are fully managed MaaS APIs — no GPU allocation is required on the RHDP side. Billing is pay-per-token; costs are tracked and capped per virtual key.

These are external models with real cost. Each request consumes tokens billed against the RHDP GCP project. Choose the lightest model that meets your use case — prefer on-cluster models for casual testing and prototyping.
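Per-request spend is easy to estimate from the `usage` block returned with each response and the per-1M-token prices listed on this page. A sketch using the minimax-m2 rates ($0.30 input, $1.20 output per 1M tokens):

```python
def request_cost(usage, input_per_m, output_per_m):
    """Dollar cost of one request given usage counts and $/1M-token rates."""
    return (usage["prompt_tokens"] / 1e6 * input_per_m
            + usage["completion_tokens"] / 1e6 * output_per_m)

# Example usage block shape returned by /v1/chat/completions
usage = {"prompt_tokens": 2_000, "completion_tokens": 500}
print(f"${request_cost(usage, 0.30, 1.20):.6f}")  # $0.001200
```

Running this against your own traffic logs gives a quick sanity check on the per-key spend caps.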

MiniMax M2 [Chat, Agentic, Vertex AI]

Strong at multi-step tool use, coding, and office workflows. Large 197K context window makes it well-suited for document-heavy agentic pipelines.

Model ID: minimax-m2
Context: 197K tokens
Input: $0.30 / 1M tokens
Output: $1.20 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Agentic / tool-use example
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2",
    "messages": [{"role": "user", "content": "Draft a project status report from the following notes: ..."}],
    "max_tokens": 1024
  }'

Qwen3 235B [Chat, Reasoning, Vertex AI]

Large multilingual mixture-of-experts model from Alibaba. Excels at complex reasoning, multilingual tasks, and large-context workflows. 131K context window.

Model ID: qwen3-235b
Parameters: 235B (MoE)
Context: 131K tokens
Input: $0.22 / 1M tokens
Output: $0.88 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Multilingual reasoning
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-235b",
    "messages": [{"role": "user", "content": "Explain the trade-offs between RAG and fine-tuning for domain adaptation."}],
    "max_tokens": 1024
  }'

GPT OSS 120B [Chat, Reasoning, Vertex AI]

Large open-source 120B model with strong function calling support. Suited for complex generation, agentic workflows, and tasks that benefit from a large parameter count. 131K context window.

Model ID: gpt-oss-120b
Parameters: 120B
Context: 131K tokens
Input: $0.09 / 1M tokens
Output: $0.36 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Function calling / complex generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarise this document and extract action items."}],
    "max_tokens": 1024
  }'

GPT OSS 20B [Chat, Vertex AI]

Cost-effective option for general-purpose chat tasks. The lightest external model — best starting point when you need a Vertex AI model but want to minimize spend. 131K context window.

Model ID: gpt-oss-20b
Parameters: 20B
Context: 131K tokens
Input: $0.07 / 1M tokens
Output: $0.25 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# General chat — cost-effective default
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the difference between a Pod and a Deployment in Kubernetes?"}],
    "max_tokens": 512
  }'

Claude Sonnet 4.6 [Chat, Vertex AI]

Anthropic Claude Sonnet 4.6 via Google Vertex AI. Strong general-purpose reasoning, coding, and analysis. Balanced capability-to-cost ratio.

Model ID: claude-sonnet-4-6
Input: $3.00 / 1M tokens
Output: $15.00 / 1M tokens
Hosted on: Google Vertex AI
# General reasoning and coding
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Explain how Kubernetes resource quotas work."}],
    "max_tokens": 1024
  }'

Claude Opus 4.6 [Chat, Vertex AI]

Anthropic's most capable Claude model. Best for complex multi-step reasoning, long-document analysis, and high-stakes generation tasks.

Model ID: claude-opus-4-6
Input: $5.00 / 1M tokens
Output: $25.00 / 1M tokens
Hosted on: Google Vertex AI
# Complex analysis and long-form generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "messages": [{"role": "user", "content": "Analyze this architecture design and identify scalability bottlenecks."}],
    "max_tokens": 2048
  }'

Claude Sonnet 4.5 [Chat, Vertex AI]

Previous-generation Claude Sonnet. Strong coding and reasoning at the same price point as Sonnet 4.6.

Model ID: claude-sonnet-4-5
Input: $3.00 / 1M tokens
Output: $15.00 / 1M tokens
Hosted on: Google Vertex AI
# Coding and structured output
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a Python function to parse Kubernetes manifests."}],
    "max_tokens": 1024
  }'

Claude 3.5 Haiku [Chat, Vertex AI]

Anthropic's fastest and most cost-efficient Claude model. Ideal for high-throughput tasks, classification, summarization, and light Q&A where cost sensitivity matters.

Model ID: claude-3-5-haiku
Input: $1.00 / 1M tokens
Output: $5.00 / 1M tokens
Hosted on: Google Vertex AI
# Fast, cost-efficient tasks
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-5-haiku",
    "messages": [{"role": "user", "content": "Classify this log line as error, warning, or info."}],
    "max_tokens": 64
  }'

Gemini 2.5 Pro [Chat, Vertex AI]

Google Gemini 2.5 Pro with a 1M token context window. Native Google model on Vertex AI — ideal for long-document analysis, multimodal workflows, and tasks requiring very large context.

Model ID: gemini-2.5-pro
Context: 1M tokens
Input: $1.25 / 1M tokens
Output: $10.00 / 1M tokens
Hosted on: Google Vertex AI
# Long-document analysis
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Summarize this entire repository and identify the main architecture patterns."}],
    "max_tokens": 2048
  }'

Listing Available Models via API

# List all models available to your key
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Full model info including capabilities (admin key required)
curl https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer LITELLM_MASTER_KEY"
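The `/v1/models` response follows the OpenAI list shape: `{"object": "list", "data": [{"id": ...}, ...]}`. Extracting just the model IDs in Python (the response string below is illustrative, not captured output):

```python
import json

# Illustrative /v1/models response in the OpenAI list shape
raw = '{"object": "list", "data": [{"id": "granite-3-2-8b-instruct"}, {"id": "qwen3-14b"}]}'

model_ids = [m["id"] for m in json.loads(raw)["data"]]
print(model_ids)  # ['granite-3-2-8b-instruct', 'qwen3-14b']
```

With the openai SDK, `client.models.list()` returns the same data already parsed.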