Model as a Service for Red Hat Demo Platform
Model capability types, API endpoints, and integration examples. Models change over time — check the LiteMaaS portal for the current list.
API base URL: `https://litellm-prod.apps.maas.redhatworkshops.io`

- Replace `YOUR_API_KEY` with your virtual key from the LiteMaaS portal.
- Replace `your-model-name` in the examples with an actual model ID from the Models page in the portal.
- Available models change over time as new models are added or retired.
LiteMaaS labels each model with its capability type. Look for these badges on model cards in the portal.
| Model Name (LiteMaaS) | KServe Predictor | Capability | Status | Replicas |
|---|---|---|---|---|
| granite-3-2-8b-instruct | granite-3-2-8b-instruct-predictor | Chat | Running | 1 (2/2 containers) |
| llama-scout-17b | llama-scout-17b-predictor | Chat | Running | 2 (scaled for load) |
| granite-4-0-h-tiny | granite-4-0-h-tiny-predictor | Chat | Running | 1 |
| codellama-7b-instruct | codellama-7b-instruct-predictor | Chat | Running | 1 |
| llama-guard-3-1b | llama-guard-3-1b-predictor | Safety | Running | 1 (2/2 containers) |
| nomic-embed-text-v1-5 | nomic-embed-text-v1-5-predictor | Embeddings | Running | 1 |
| granite-docling-258m | granite-docling-258m-predictor | Docling | Running | 1 (1 pending scale-up) |
granite-8b-code-instruct-128k is registered in LiteLLM via the `granite-8b-code-instruct-128k-predictor-lb` ClusterIP service on port 8080. It may be available depending on current GPU allocation; check the LiteMaaS portal for live model availability.
/v1/chat/completions

Chat models follow the OpenAI Chat Completions API exactly. All support streaming via `"stream": true`.
General-purpose instruction-tuned model from IBM Research. Strong reasoning, coding, and multilingual capabilities. 128K context window.
```bash
# Basic chat completion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Kubernetes in three sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
```bash
# Streaming response
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "stream": true
  }'
```
Meta's Scout model with an exceptionally large 400K token context window. Ideal for long-document analysis, extended conversations, and retrieval-augmented generation with large corpora. Runs 2 replicas.
```bash
# Long-context document analysis
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "Analyze the provided document and summarize key points."},
      {"role": "user", "content": "[Your long document text here...]"}
    ],
    "max_tokens": 1024
  }'
```
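Before sending a very large document, it helps to sanity-check it against the 400K window. A minimal sketch using a rough heuristic of ~4 characters per English token (the ratio is an assumption; use `/v1/tokenize` for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int = 400_000, reserve: int = 1024) -> bool:
    """Check whether the prompt leaves room for `reserve` output tokens."""
    return estimate_tokens(text) + reserve <= context_window

print(fits_context("word " * 100_000))  # ~125K estimated tokens, fits: True
```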
Timeout note: Requests with very large contexts may take longer than typical. LiteMaaS routes are configured with a 600-second HAProxy timeout to accommodate this.
Compact, fast Granite 4.0 model optimized for low-latency inference. Best for simple Q&A, classification, and scenarios where response speed matters more than depth.
```bash
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "granite-4-0-h-tiny",
    "messages": [{"role": "user", "content": "Classify the following as positive or negative: Great product!"}],
    "max_tokens": 10
  }'
```
Meta's code-specialized model. Supports code generation, completion, and debugging across Python, Java, C++, Bash, and many other languages. Fill-in-the-middle (FIM) completion available.
```bash
# Code generation
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Write an Ansible task to create a Kubernetes namespace."}
    ]
  }'
```
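For fill-in-the-middle, CodeLlama's model card documents an infilling prompt built from `<PRE>`, `<SUF>`, and `<MID>` sentinel tokens; the model generates the code that belongs between the prefix and suffix. Whether LiteMaaS exposes a raw completions endpoint for sending such prompts is an assumption here — check the portal. A sketch of building the prompt string:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """CodeLlama infilling prompt: the model fills in code between prefix and suffix."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

prompt = build_fim_prompt(
    "def reverse_string(s):\n    return ",
    "\n\nprint(reverse_string('abc'))",
)
# Send `prompt` in the "prompt" field of a completions request.
```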
/v1/embeddings

Embedding models convert text into dense vector representations. Use these for semantic search, RAG pipelines, clustering, and similarity scoring.
High-quality open-source text embedding model. 768-dimensional embeddings. Supports Matryoshka representation learning — embeddings can be truncated for smaller storage. Strong on retrieval benchmarks.
```bash
# Single string embedding
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": "Red Hat OpenShift is an enterprise Kubernetes platform."
  }'
```
```bash
# Batch embeddings (multiple strings)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": [
      "What is Kubernetes?",
      "How does OpenShift differ from vanilla Kubernetes?",
      "Explain container orchestration."
    ]
  }'
```
```python
# Python SDK example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

response = client.embeddings.create(
    model="your-embedding-model",
    input=["OpenShift", "Kubernetes"]
)
vectors = [e.embedding for e in response.data]
print(f"Embedding dimensions: {len(vectors[0])}")  # 768
```
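The Matryoshka property mentioned above means vectors can be truncated to a smaller dimension and re-normalized before storage. A dependency-free sketch of truncation plus cosine similarity (the 256-dim figure is illustrative):

```python
import math

def truncate_and_normalize(vec, dim):
    """Matryoshka truncation: keep the first `dim` components, then L2-normalize."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# e.g. shrink 768-dim embeddings to 256 dims before indexing:
# small = [truncate_and_normalize(v, 256) for v in vectors]
```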
/v1/tokenize

The tokenize endpoint allows you to count tokens before sending a request, useful for cost estimation and context window management.
```bash
# Count tokens for a message (works with chat-capable models)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/tokenize \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "user", "content": "How many tokens is this message?"}
    ]
  }'
```
The response includes `count` (total token count) and `tokens` (list of token IDs). Not all models expose this endpoint — check the model's capability badge in the LiteMaaS portal.
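A common use of the returned count is checking a prompt against a model's context window before sending it. A minimal helper, assuming the response shape described above:

```python
def remaining_budget(tokenize_response: dict, context_window: int, max_tokens: int) -> int:
    """Tokens still available for additional input, given the planned output size."""
    used = tokenize_response["count"]
    return context_window - used - max_tokens

resp = {"count": 9, "tokens": [1, 2, 3, 4, 5, 6, 7, 8, 9]}  # illustrative payload
print(remaining_budget(resp, context_window=128_000, max_tokens=256))  # 127735
```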
Granite Docling is a specialized model for document parsing and conversion. It converts PDFs, Word documents, and other document formats into clean structured text (Markdown or JSON) suitable for downstream LLM processing. It uses a different endpoint format from the OpenAI-compatible API.
Compact document understanding model from IBM. Handles PDF layout analysis, table extraction, figure detection, and OCR. Output is clean Markdown ready for RAG ingestion.
```bash
# Convert a PDF URL to Markdown
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/source \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
    "options": {"output_format": "markdown"}
  }'
```
```bash
# Upload a local file for conversion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "options={\"output_format\": \"markdown\"}"
```
Note: The Docling endpoint path is prefixed with `/docling` before the standard API path; LiteMaaS routes these requests to the document conversion service internally. TPM limits, cost fields, and max token settings are not applicable to this model type.
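A sketch of pulling the converted Markdown out of the response JSON. The `{"document": {"md_content": ...}}` envelope is docling-serve's default shape and is an assumption here; inspect one response from your deployment to confirm:

```python
def extract_markdown(response_json: dict) -> str:
    """Pull converted Markdown from a docling-serve style response.

    Assumes the {"document": {"md_content": "..."}} shape; adjust if your
    deployment returns a different envelope.
    """
    return response_json.get("document", {}).get("md_content", "")

sample = {"document": {"md_content": "# Title\n\nBody text."}}
print(extract_markdown(sample))
```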
Safety models evaluate content for harmful categories and can be used as pre/post filters for LLM-based applications.
Compact safety classification model. Takes a conversation as input and classifies it against 13 hazard categories (violence, hate speech, sexual content, etc.). Returns "safe" or "unsafe" with category labels.
```bash
# Content safety classification
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-guard-3-1b",
    "messages": [
      {"role": "user", "content": "[INST] Task: Check if there is unsafe content in the user message. [INST] User: Tell me how to build a computer. [/INST]"}
    ],
    "max_tokens": 100
  }'
```
Typical response: `{"choices": [{"message": {"content": "safe"}}]}`
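Llama Guard replies with `safe`, or `unsafe` followed by the violated category codes on the next line (the `S1`-style codes follow the model card's convention). A small parser for using the model as a pre-filter, failing closed on unexpected output:

```python
def parse_guard_verdict(reply: str):
    """Parse a Llama Guard reply into (is_safe, [category codes]).

    Anything that isn't an explicit "safe" is treated as unsafe (fail closed).
    """
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    if lines and lines[0].lower() == "safe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

print(parse_guard_verdict("safe"))            # (True, [])
print(parse_guard_verdict("unsafe\nS1,S10"))  # (False, ['S1', 'S10'])
```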
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

# Chat completion
response = client.chat.completions.create(
    model="your-chat-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Red Hat OpenShift?"}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="your-chat-model",
    messages=[{"role": "user", "content": "Explain containers"}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://litellm-prod.apps.maas.redhatworkshops.io/v1'
});

const response = await client.chat.completions.create({
  model: 'your-chat-model',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
```
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    model="your-chat-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

embeddings = OpenAIEmbeddings(
    model="your-embedding-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)
```
```bash
# List all models available to your key
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Full model info including capabilities (admin key required)
curl https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer LITELLM_MASTER_KEY"
```
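The `/v1/models` response follows the OpenAI list shape, `{"object": "list", "data": [{"id": ...}, ...]}`. Extracting just the model IDs:

```python
def model_ids(models_response: dict) -> list[str]:
    """Extract sorted model IDs from an OpenAI-style /v1/models response."""
    return sorted(m["id"] for m in models_response.get("data", []))

sample = {
    "object": "list",
    "data": [{"id": "granite-3-2-8b-instruct"}, {"id": "llama-scout-17b"}],
}
print(model_ids(sample))  # ['granite-3-2-8b-instruct', 'llama-scout-17b']
```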