Model as a Service for Red Hat Demo Platform
Model capability types, API endpoints, and integration examples. Models change over time — check the LiteMaaS portal for the current list.
API base URL: https://litellm-prod.apps.maas.redhatworkshops.io
Replace YOUR_API_KEY with your virtual key from the LiteMaaS portal.
Replace your-model-name in examples with the actual model ID from the Models page in the portal.
Available models change over time as new models are added or retired.
LiteMaaS labels each model with its capability type. Look for these badges on model cards in the portal.
| Model Name (LiteMaaS) | KServe Predictor | Capability | Status | Replicas |
|---|---|---|---|---|
| granite-3-2-8b-instruct | granite-3-2-8b-instruct-predictor | Chat | Running | 1 (2/2 containers) |
| llama-scout-17b | llama-scout-17b-predictor | Chat | Running | 2 (scaled for load) |
| granite-4-0-h-tiny | granite-4-0-h-tiny-predictor | Chat | Running | 1 |
| codellama-7b-instruct | codellama-7b-instruct-predictor | Chat | Running | 1 |
| llama-guard-3-1b | llama-guard-3-1b-predictor | Safety | Running | 1 (2/2 containers) |
| nomic-embed-text-v1-5 | nomic-embed-text-v1-5-predictor | Embeddings | Running | 1 |
| granite-docling-258m | granite-docling-258m-predictor | Docling | Running | 1 |
| deepseek-r1-distill-qwen-14b | deepseek-r1-distill-qwen-14b-predictor | Chat | Running | 1 |
| gpt-oss-120b | gpt-oss-120b-predictor | Chat | Running | 1 |
| microsoft-phi-4 | microsoft-phi-4-predictor | Chat | Running | 1 |
| qwen3-14b | qwen3-14b-predictor | Chat | Running | 2 (load balanced: maas00 + smc00) |
| **Google Vertex AI — pay-per-token, programmatic access only** | | | | |
| minimax-m2 | Google Vertex AI | Chat | Available | — |
| qwen3-235b | Google Vertex AI | Chat | Available | — |
| gpt-oss-20b | Google Vertex AI | Chat | Available | — |
| claude-sonnet-4-6 | Google Vertex AI (Anthropic) | Chat | Available | — |
| claude-opus-4-6 | Google Vertex AI (Anthropic) | Chat | Available | — |
| claude-sonnet-4-5 | Google Vertex AI (Anthropic) | Chat | Available | — |
| claude-3-5-haiku | Google Vertex AI (Anthropic) | Chat | Available | — |
| gemini-2.5-pro | Google Vertex AI (Google) | Chat | Available | — |
granite-8b-code-instruct-128k is registered in LiteLLM via the granite-8b-code-instruct-128k-predictor-lb ClusterIP service (port 8080). Its availability depends on current GPU allocation; check the LiteMaaS portal for live model availability.
/v1/chat/completions

Chat models follow the OpenAI Chat Completions API exactly. All support streaming via "stream": true.
General-purpose instruction-tuned model from IBM Research. Strong reasoning, coding, and multilingual capabilities. 128K context window.
# Basic chat completion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Kubernetes in three sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
# Streaming response
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "stream": true
  }'
Meta's Scout model with an exceptionally large 400K token context window. Ideal for long-document analysis, extended conversations, and retrieval-augmented generation with large corpora. Runs 2 replicas.
# Long-context document analysis
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "Analyze the provided document and summarize key points."},
      {"role": "user", "content": "[Your long document text here...]"}
    ],
    "max_tokens": 1024
  }'
Timeout note: Requests with very large contexts may take longer than typical. LiteMaaS routes are configured with a 600-second HAProxy timeout to accommodate this.
Compact, fast Granite 4.0 model optimized for low-latency inference. Best for simple Q&A, classification, and scenarios where response speed matters more than depth.
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "granite-4-0-h-tiny",
"messages": [{"role": "user", "content": "Classify the following as positive or negative: Great product!"}],
"max_tokens": 10
}'
Meta's code-specialized model. Supports code generation, completion, and debugging across Python, Java, C++, Bash, and many other languages. Fill-in-the-middle (FIM) completion available.
# Code generation
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Write an Ansible task to create a Kubernetes namespace."}
    ]
  }'
/v1/embeddings

Embedding models convert text into dense vector representations. Use these for semantic search, RAG pipelines, clustering, and similarity scoring.
High-quality open-source text embedding model. 768-dimensional embeddings. Supports Matryoshka representation learning — embeddings can be truncated for smaller storage. Strong on retrieval benchmarks.
# Single string embedding
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": "Red Hat OpenShift is an enterprise Kubernetes platform."
  }'

# Batch embeddings (multiple strings)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": [
      "What is Kubernetes?",
      "How does OpenShift differ from vanilla Kubernetes?",
      "Explain container orchestration."
    ]
  }'

# Python SDK example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

response = client.embeddings.create(
    model="your-embedding-model",
    input=["OpenShift", "Kubernetes"]
)
vectors = [e.embedding for e in response.data]
print(f"Embedding dimensions: {len(vectors[0])}")  # 768
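To illustrate the similarity scoring and Matryoshka truncation mentioned above, here is a minimal pure-Python sketch. The short vectors below are hypothetical placeholders standing in for real 768-dimensional /v1/embeddings output, and the helper names are illustrative, not part of any API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def truncate_and_renormalize(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Hypothetical 8-dim vectors standing in for real embeddings
doc_vec = [0.12, -0.45, 0.33, 0.08, -0.21, 0.55, -0.02, 0.19]
query_vec = [0.10, -0.40, 0.30, 0.05, -0.25, 0.50, 0.01, 0.22]

score = cosine_similarity(doc_vec, query_vec)

# Truncated (smaller-storage) variants still support similarity scoring
small_score = cosine_similarity(
    truncate_and_renormalize(doc_vec, 4),
    truncate_and_renormalize(query_vec, 4),
)
```

With real nomic-embed-text-v1-5 vectors, you would truncate the 768-dimensional output to, say, 256 dimensions before storing it in a vector database.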
/v1/tokenize

The tokenize endpoint allows you to count tokens before sending a request, useful for cost estimation and context window management.
# Count tokens for a message (works with chat-capable models)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/tokenize \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "user", "content": "How many tokens is this message?"}
    ]
  }'
The response includes count (total token count) and tokens (list of token IDs). Not all models expose this endpoint — check the model's capability badge in the LiteMaaS portal.
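The count returned by /v1/tokenize can drive context-window management directly. A minimal sketch of the budgeting arithmetic — the helper name and window sizes are illustrative, not part of the API:

```python
def remaining_budget(prompt_tokens, context_window, desired_output):
    """Cap max_tokens so prompt + completion fits in the context window.

    prompt_tokens: the `count` value from /v1/tokenize
    context_window: the model's advertised context size
    desired_output: how many completion tokens you would like
    """
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return min(desired_output, available)

# Example: a 120K-token prompt against a 128K-token window
max_tokens = remaining_budget(prompt_tokens=120_000,
                              context_window=128_000,
                              desired_output=16_000)
# max_tokens is capped at the 8,000 tokens actually remaining
```

Pass the result as "max_tokens" in the subsequent /v1/chat/completions request.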
Granite Docling is a specialized model for document parsing and conversion. It converts PDFs, Word documents, and other document formats into clean structured text (Markdown or JSON) suitable for downstream LLM processing. It uses a different endpoint format from the OpenAI-compatible API.
Compact document understanding model from IBM. Handles PDF layout analysis, table extraction, figure detection, and OCR. Output is clean Markdown ready for RAG ingestion.
# Convert a PDF URL to Markdown
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/source \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "http_sources": [{ "url": "https://arxiv.org/pdf/2408.09869" }],
    "options": { "output_format": "markdown" }
  }'

# Upload a local file for conversion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "options={\"output_format\": \"markdown\"}"
Note: The Docling endpoint path is prefixed with /docling before the standard API path. LiteMaaS routes these requests to the document conversion service internally. TPM limits, cost fields, and max token settings are not applicable to this model type.
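Once a conversion returns, you typically want just the Markdown body for RAG ingestion. A small sketch of pulling it out of the response — the "document" / "md_content" field names are an assumption based on typical docling-serve response shapes, so verify them against an actual response from this deployment:

```python
def extract_markdown(payload: dict) -> str:
    """Pull converted Markdown from a Docling conversion response.

    ASSUMPTION: the response nests the output under
    payload["document"]["md_content"]; check your deployment.
    """
    document = payload.get("document") or {}
    md = document.get("md_content")
    if md is None:
        raise KeyError("No md_content in conversion response")
    return md

# Hypothetical response body for illustration
sample = {
    "status": "success",
    "document": {"md_content": "# Title\n\nBody text."},
}
print(extract_markdown(sample))
```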
Safety models evaluate content for harmful categories and can be used as pre/post filters for LLM-based applications.
Compact safety classification model. Takes a conversation as input and classifies it against 13 hazard categories (violence, hate speech, sexual content, etc.). Returns "safe" or "unsafe" with category labels.
# Content safety classification (the endpoint applies the Llama Guard
# safety prompt template to the messages automatically)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-guard-3-1b",
    "messages": [
      {"role": "user", "content": "Tell me how to build a computer."}
    ],
    "max_tokens": 100
  }'
Typical response: {"choices": [{"message": {"content": "safe"}}]}
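To use the verdict as a pre-filter in an application, parse the guard model's text output before forwarding the user's message to a chat model. The parser below assumes Llama Guard 3's documented output format: "safe", or "unsafe" followed by a line of comma-separated hazard codes such as "S1,S10".

```python
def parse_guard_verdict(content: str):
    """Parse a Llama Guard response into (is_safe, categories).

    Llama Guard 3 emits 'safe', or 'unsafe' followed by a line of
    comma-separated hazard category codes (e.g. 'S1,S10').
    """
    lines = [line.strip() for line in content.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty guard response")
    if lines[0].lower() == "safe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

# Pre-filter pattern: only forward the user message if the guard says safe
is_safe, cats = parse_guard_verdict("safe")
blocked, hazards = parse_guard_verdict("unsafe\nS1,S10")
```

In a real pipeline you would call llama-guard-3-1b first, feed choices[0].message.content to this parser, and only invoke the main chat model when is_safe is True.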
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

# Chat completion
response = client.chat.completions.create(
    model="your-chat-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Red Hat OpenShift?"}
    ]
)
print(response.choices[0].message.content)

# Streaming (delta.content may be None on some chunks, so guard it)
stream = client.chat.completions.create(
    model="your-chat-model",
    messages=[{"role": "user", "content": "Explain containers"}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://litellm-prod.apps.maas.redhatworkshops.io/v1'
});

const response = await client.chat.completions.create({
  model: 'your-chat-model',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    model="your-chat-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

embeddings = OpenAIEmbeddings(
    model="your-embedding-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)
The Rackspace DFW3 cluster (maas00.rs-dfw3.infra.demo.redhat.com) runs models on 8× Intel Gaudi 3 accelerators via KServe. The KServe endpoints are OpenAI-compatible and can also be reached directly, bypassing the LiteMaaS proxy; for direct-access credentials, contact Ashok.
Through RHDP MaaS, these models use the same sk-... virtual key as all other models, and the LiteLLM proxy routes requests to the Gaudi cluster backend.
Reasoning-focused model distilled from DeepSeek R1 into a Qwen2.5 14B base. Strong chain-of-thought performance, code generation, and math. Runs on Gaudi 3 via vLLM.
# Chat completion via the LiteMaaS proxy (Gaudi 3 backend)
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-qwen-14b",
"messages": [{"role": "user", "content": "Solve step by step: what is 15% of 240?"}]
}'
Alibaba's third-generation Qwen model at 14B parameters. Strong multilingual capabilities, coding, and instruction following. Runs on Gaudi 3 via vLLM.
# Chat completion via the LiteMaaS proxy (Gaudi 3 backend)
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"messages": [{"role": "user", "content": "Explain prefix caching in LLMs."}]
}'
Compact but high-quality model with strong reasoning and coding capabilities. Efficient for its size.
# Chat completion
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "microsoft-phi-4", "messages": [{"role": "user", "content": "Write a Python function to sort a list."}]}'
The following models are hosted on Google Vertex AI Model Garden and accessed through the same RHDP MaaS endpoint and virtual key as all other models. They are fully managed MaaS APIs — no GPU allocation is required on the RHDP side. Billing is pay-per-token; costs are tracked and capped per virtual key.
These are external models with real cost. Each request consumes tokens billed against the RHDP GCP project. Choose the lightest model that meets your use case — prefer on-cluster models for casual testing and prototyping.
Strong at multi-step tool use, coding, and office workflows. Large 197K context window makes it well-suited for document-heavy agentic pipelines.
# Agentic / tool-use example
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "minimax-m2",
"messages": [{"role": "user", "content": "Draft a project status report from the following notes: ..."}],
"max_tokens": 1024
}'
Large multilingual mixture-of-experts model from Alibaba. Excels at complex reasoning, multilingual tasks, and large-context workflows. 131K context window.
# Multilingual reasoning
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-235b",
"messages": [{"role": "user", "content": "Explain the trade-offs between RAG and fine-tuning for domain adaptation."}],
"max_tokens": 1024
}'
Large open-source 120B model with strong function calling support. Suited for complex generation, agentic workflows, and tasks that benefit from a large parameter count. 131K context window.
# Function calling / complex generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"messages": [{"role": "user", "content": "Summarise this document and extract action items."}],
"max_tokens": 1024
}'
Cost-effective option for general-purpose chat tasks. The lightest external model — best starting point when you need a Vertex AI model but want to minimize spend. 131K context window.
# General chat — cost-effective default
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-20b",
"messages": [{"role": "user", "content": "What is the difference between a Pod and a Deployment in Kubernetes?"}],
"max_tokens": 512
}'
Anthropic Claude Sonnet 4.6 via Google Vertex AI. Strong general-purpose reasoning, coding, and analysis. Balanced capability-to-cost ratio.
# General reasoning and coding
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"messages": [{"role": "user", "content": "Explain how Kubernetes resource quotas work."}],
"max_tokens": 1024
}'
Anthropic's most capable Claude model. Best for complex multi-step reasoning, long-document analysis, and high-stakes generation tasks.
# Complex analysis and long-form generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-opus-4-6",
"messages": [{"role": "user", "content": "Analyze this architecture design and identify scalability bottlenecks."}],
"max_tokens": 2048
}'
Previous-generation Claude Sonnet. Strong coding and reasoning at the same price point as Sonnet 4.6.
# Coding and structured output
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-5",
"messages": [{"role": "user", "content": "Write a Python function to parse Kubernetes manifests."}],
"max_tokens": 1024
}'
Anthropic's fastest and most cost-efficient Claude model. Ideal for high-throughput tasks, classification, summarization, and light Q&A where cost sensitivity matters.
# Fast, cost-efficient tasks
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-3-5-haiku",
"messages": [{"role": "user", "content": "Classify this log line as error, warning, or info."}],
"max_tokens": 64
}'
Google Gemini 2.5 Pro with a 1M token context window. Native Google model on Vertex AI — ideal for long-document analysis, multimodal workflows, and tasks requiring very large context.
# Long-document analysis
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.5-pro",
"messages": [{"role": "user", "content": "Summarize this entire repository and identify the main architecture patterns."}],
"max_tokens": 2048
}'
# List all models available to your key
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Full model info including capabilities (admin key required)
curl https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer LITELLM_MASTER_KEY"
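If you query /v1/models programmatically, the response follows the standard OpenAI list shape ({"object": "list", "data": [{"id": ...}, ...]}). A small sketch for pulling out the model IDs — the sample payload here is illustrative, not the live model list:

```python
def list_model_ids(models_payload: dict) -> list:
    """Extract sorted model IDs from an OpenAI-style /v1/models response."""
    return sorted(item["id"] for item in models_payload.get("data", []))

# Hypothetical response body for illustration
sample = {
    "object": "list",
    "data": [
        {"id": "granite-3-2-8b-instruct", "object": "model"},
        {"id": "nomic-embed-text-v1-5", "object": "model"},
    ],
}
print(list_model_ids(sample))
# → ['granite-3-2-8b-instruct', 'nomic-embed-text-v1-5']
```

In practice you would feed the JSON body from the curl command above (or client.models.list() in the Python SDK) into this helper before choosing a model.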