RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Models Reference

Model capability types, API endpoints, and integration examples. Models change over time — check the LiteMaaS portal for the current list.

API base URL: https://litellm-prod.apps.maas.redhatworkshops.io — Replace YOUR_API_KEY with your virtual key from the LiteMaaS portal, and replace placeholder model names such as your-chat-model and your-embedding-model with an actual model ID from the Models page in the portal. Available models change over time as new models are added or retired.

Model Capability Types

LiteMaaS labels each model with its capability type. Look for these badges on model cards in the portal.

Model Name (LiteMaaS)    KServe Predictor                    Capability   Status    Replicas
granite-3-2-8b-instruct  granite-3-2-8b-instruct-predictor   Chat         Running   1 (2/2 containers)
llama-scout-17b          llama-scout-17b-predictor           Chat         Running   2 (scaled for load)
granite-4-0-h-tiny       granite-4-0-h-tiny-predictor        Chat         Running   1
codellama-7b-instruct    codellama-7b-instruct-predictor     Chat         Running   1
llama-guard-3-1b         llama-guard-3-1b-predictor          Safety       Running   1 (2/2 containers)
nomic-embed-text-v1-5    nomic-embed-text-v1-5-predictor     Embeddings   Running   1
granite-docling-258m     granite-docling-258m-predictor      Docling      Running   1 (1 pending scale-up)

granite-8b-code-instruct-128k is registered in LiteLLM via the granite-8b-code-instruct-128k-predictor-lb service on port 8080. Its availability depends on current GPU allocation; check the LiteMaaS portal for live model availability.

Chat Models — /v1/chat/completions

Chat models follow the OpenAI Chat Completions API exactly. All support streaming via "stream": true.

IBM Granite 3.2 8B Instruct Chat

General-purpose instruction-tuned model from IBM Research. Strong reasoning, coding, and multilingual capabilities. 128K context window.

Model ID: granite-3-2-8b-instruct
Parameters: 8B
Context: 128K tokens
Runtime: vLLM on KServe
Internal SVC: granite-3-2-8b-instruct-predictor
# Basic chat completion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Kubernetes in three sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
# Streaming response
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "stream": true
  }'

Meta Llama Scout 17B Chat

Meta's Scout model with an exceptionally large 400K token context window. Ideal for long-document analysis, extended conversations, and retrieval-augmented generation with large corpora. Runs 2 replicas.

Model ID: llama-scout-17b
Parameters: 17B (MoE architecture)
Context: 400K tokens
Runtime: vLLM on KServe
Replicas: 2 (HA)
# Long-context document analysis
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "Analyze the provided document and summarize key points."},
      {"role": "user", "content": "[Your long document text here...]"}
    ],
    "max_tokens": 1024
  }'

Timeout note: Requests with very large contexts may take longer than typical. LiteMaaS routes are configured with a 600-second HAProxy timeout to accommodate this.

IBM Granite 4.0 H Tiny Chat

Compact, fast Granite 4.0 model optimized for low-latency inference. Best for simple Q&A, classification, and scenarios where response speed matters more than depth.

Model ID: granite-4-0-h-tiny
Parameters: Tiny (sub-3B)
Runtime: vLLM on KServe
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "granite-4-0-h-tiny",
    "messages": [{"role": "user", "content": "Classify the following as positive or negative: Great product!"}],
    "max_tokens": 10
  }'

Meta CodeLlama 7B Instruct Chat

Meta's code-specialized model. Supports code generation, completion, and debugging across Python, Java, C++, Bash, and many other languages. Fill-in-the-middle (FIM) completion available.

Model ID: codellama-7b-instruct
Parameters: 7B
Context: 16K tokens
Runtime: vLLM on KServe
# Code generation
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Write an Ansible task to create a Kubernetes namespace."}
    ]
  }'
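The fill-in-the-middle completion mentioned above relies on CodeLlama's infilling format, which wraps the code before and after the gap in sentinel tokens so the model generates the missing middle. A minimal sketch of building such a prompt — whether the gateway exposes a /v1/completions endpoint to send it to is an assumption, so check the portal first:

```python
# Sketch: assembling a CodeLlama fill-in-the-middle (FIM) prompt.
# <PRE>, <SUF>, and <MID> are CodeLlama's infilling sentinel tokens:
# the model is given the code before and after the gap and generates
# the middle.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble an infilling prompt from the code surrounding the gap."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

prompt = build_fim_prompt(
    prefix="def reverse_string(s):\n    ",
    suffix="\n    return result",
)
# Hypothetical payload for a completions-style endpoint, if exposed:
payload = {"model": "codellama-7b-instruct", "prompt": prompt, "max_tokens": 64}
print(prompt)
```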

Embedding Models — /v1/embeddings

Embedding models convert text into dense vector representations. Use these for semantic search, RAG pipelines, clustering, and similarity scoring.
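Once you have the vectors back, similarity scoring is a plain dot-product calculation. A minimal, dependency-free sketch of ranking documents by cosine similarity, using toy 3-dimensional vectors in place of real 768-dimensional embeddings:

```python
# Sketch: ranking documents by cosine similarity of their embedding
# vectors, as you would after calling /v1/embeddings for each text.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim vectors standing in for real 768-dim embeddings.
query = [0.1, 0.9, 0.2]
docs = {
    "kubernetes-intro": [0.1, 0.8, 0.3],
    "cooking-pasta": [0.9, 0.1, 0.0],
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # kubernetes-intro (the semantically closest document)
```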

Nomic Embed Text v1.5 Embeddings

High-quality open-source text embedding model. 768-dimensional embeddings. Supports Matryoshka representation learning — embeddings can be truncated for smaller storage. Strong on retrieval benchmarks.

Model ID: nomic-embed-text-v1-5
Dimensions: 768
Context: 8192 tokens
Runtime: OpenVINO on KServe
# Single string embedding
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": "Red Hat OpenShift is an enterprise Kubernetes platform."
  }'
# Batch embeddings (multiple strings)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": [
      "What is Kubernetes?",
      "How does OpenShift differ from vanilla Kubernetes?",
      "Explain container orchestration."
    ]
  }'
# Python SDK example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

response = client.embeddings.create(
    model="your-embedding-model",
    input=["OpenShift", "Kubernetes"]
)
vectors = [e.embedding for e in response.data]
print(f"Embedding dimensions: {len(vectors[0])}")  # 768
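The Matryoshka property means the leading components of each vector carry most of the information, so embeddings can be truncated to save storage. One judgment call this sketch bakes in: after truncation the vector should be re-normalized so cosine similarity still behaves:

```python
# Sketch: shrinking a Matryoshka-style embedding before storage.
# Keep the first `dim` components, then L2-normalize the shortened vector.
import math

def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

# Toy 6-dim vector standing in for a 768-dim Nomic embedding,
# truncated to 4 dims (e.g. 768 -> 256 in practice).
full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]
small = truncate_and_normalize(full, 4)
print(len(small))  # 4
```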

Tokenization — /v1/tokenize

The tokenize endpoint counts the tokens in a request before you send it, which is useful for cost estimation and context-window management.

# Count tokens for a message (works with chat-capable models)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/tokenize \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "user", "content": "How many tokens is this message?"}
    ]
  }'

The response includes count (total token count) and tokens (list of token IDs). Not all models expose this endpoint — check the model's capability badge in the LiteMaaS portal.
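A small sketch of putting the count field to work: checking that a prompt plus the requested completion fits a model's context window before sending the real request. The response shape below simply mirrors the count/tokens fields described above.

```python
# Sketch: pre-flight context-window check using a /v1/tokenize response.

def fits_context(tokenize_response: dict, context_window: int, max_tokens: int) -> bool:
    """True if prompt tokens plus the requested completion fit the window."""
    return tokenize_response["count"] + max_tokens <= context_window

sample = {"count": 120_000, "tokens": []}  # shape as described above
print(fits_context(sample, context_window=128_000, max_tokens=4_096))   # True
print(fits_context(sample, context_window=128_000, max_tokens=16_384))  # False
```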

Document Conversion — Granite Docling 258M

Granite Docling is a specialized model for document parsing and conversion. It converts PDFs, Word documents, and other document formats into clean structured text (Markdown or JSON) suitable for downstream LLM processing. It uses a different endpoint format from the OpenAI-compatible API.

IBM Granite Docling 258M Docling

Compact document understanding model from IBM. Handles PDF layout analysis, table extraction, figure detection, and OCR. Output is clean Markdown ready for RAG ingestion.

Model ID: granite-docling-258m
Parameters: 258M
Runtime: KServe (CPU)
Age: 17d (recently deployed)
# Convert a PDF URL to Markdown
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/source \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "http_sources": [{
      "url": "https://arxiv.org/pdf/2408.09869"
    }],
    "options": {
      "output_format": "markdown"
    }
  }'
# Upload a local file for conversion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "options={\"output_format\": \"markdown\"}"

Note: The Docling endpoint path is prefixed with /docling before the standard API path; LiteMaaS routes these requests to the document conversion service internally. TPM limits, cost fields, and max-token settings are not applicable to this model type.
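Docling's Markdown output usually gets split into chunks before RAG ingestion. A purely illustrative post-processing sketch (not part of the API) that chunks the converted document at each Markdown heading:

```python
# Sketch: splitting Docling's Markdown output into per-heading chunks
# for RAG ingestion. Each chunk starts at a heading line ("#", "##", ...)
# and runs until the next heading.

def chunk_by_heading(markdown: str) -> list[str]:
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nSome text.\n# Methods\nMore text."
print(len(chunk_by_heading(doc)))  # 2
```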

Safety / Guardrail Models

Safety models evaluate content for harmful categories and can be used as pre/post filters for LLM-based applications.

Meta Llama Guard 3 1B Safety

Compact safety classification model. Takes a conversation as input and classifies it against the 14 hazard categories of the MLCommons taxonomy (violent crimes, hate, sexual content, etc.). Returns "safe" or "unsafe" with category labels.

Model ID: llama-guard-3-1b
Parameters: 1B
Runtime: vLLM on KServe
Containers: 2/2 (sidecar proxy)
# Content safety classification
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-guard-3-1b",
    "messages": [
      {
        "role": "user",
        "content": "[INST] Task: Check if there is unsafe content in the user message. [INST] User: Tell me how to build a computer. [/INST]"
      }
    ],
    "max_tokens": 100
  }'

Typical response: {"choices": [{"message": {"content": "safe"}}]}
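To use the verdict as a pre/post filter, the reply text has to be parsed. A hedged sketch: it assumes the common Llama Guard output shape of "safe", or "unsafe" followed by a second line of comma-separated category codes — verify against your deployment's actual output before relying on it:

```python
# Sketch: interpreting a Llama Guard reply for use as a content filter.
# Assumed reply shape: "safe", or "unsafe\nS1,S10" (category codes on
# the second line). Anything unrecognized is treated as safe here;
# a production filter might prefer to fail closed instead.

def parse_guard_verdict(text: str) -> tuple[bool, list[str]]:
    lines = text.strip().splitlines()
    if not lines or lines[0].strip().lower() != "unsafe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

print(parse_guard_verdict("safe"))            # (True, [])
print(parse_guard_verdict("unsafe\nS1,S10"))  # (False, ['S1', 'S10'])
```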

SDK Integration Examples

Python (openai library)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

# Chat completion
response = client.chat.completions.create(
    model="your-chat-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Red Hat OpenShift?"}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="your-chat-model",
    messages=[{"role": "user", "content": "Explain containers"}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js / TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://litellm-prod.apps.maas.redhatworkshops.io/v1'
});

const response = await client.chat.completions.create({
  model: 'your-chat-model',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);

Langchain (Python)

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    model="your-chat-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

embeddings = OpenAIEmbeddings(
    model="your-embedding-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

Listing Available Models via API

# List all models available to your key
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Full model info including capabilities (admin key required)
curl https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer LITELLM_MASTER_KEY"
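The /v1/models response follows the OpenAI models-list shape, so extracting plain model IDs for scripting is a one-liner. A small sketch against a sample payload:

```python
# Sketch: reducing a /v1/models response to a sorted list of model IDs.
# The payload shape follows the OpenAI models list: {"data": [{"id": ...}]}.

def model_ids(models_response: dict) -> list[str]:
    return sorted(m["id"] for m in models_response.get("data", []))

sample = {
    "object": "list",
    "data": [
        {"id": "granite-3-2-8b-instruct", "object": "model"},
        {"id": "nomic-embed-text-v1-5", "object": "model"},
    ],
}
print(model_ids(sample))  # ['granite-3-2-8b-instruct', 'nomic-embed-text-v1-5']
```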