RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Models Reference

Model capability types, API endpoints, and integration examples. Models change over time — check the LiteMaaS portal for the current list.

API base URL: https://litellm-prod.apps.maas.redhatworkshops.io. Replace YOUR_API_KEY with your virtual key from the LiteMaaS portal, and replace placeholder model names in the examples (such as your-chat-model or your-embedding-model) with an actual model ID from the Models page in the portal. Available models change over time as new models are added or retired.

Model Capability Types

LiteMaaS labels each model with its capability type. Look for these badges on model cards in the portal.

Model Name (LiteMaaS)         KServe Predictor                        Capability  Status   Replicas
granite-3-2-8b-instruct       granite-3-2-8b-instruct-predictor       Chat        Running  1 (2/2 containers)
llama-scout-17b               llama-scout-17b-predictor               Chat        Running  2 (scaled for load)
granite-4-0-h-tiny            granite-4-0-h-tiny-predictor            Chat        Running  1
codellama-7b-instruct         codellama-7b-instruct-predictor         Chat        Running  1
llama-guard-3-1b              llama-guard-3-1b-predictor              Safety      Running  1 (2/2 containers)
nomic-embed-text-v1-5         nomic-embed-text-v1-5-predictor         Embeddings  Running  1
granite-docling-258m          granite-docling-258m-predictor          Docling     Running  1
deepseek-r1-distill-qwen-14b  deepseek-r1-distill-qwen-14b-predictor  Chat        Running  1
gpt-oss-120b                  gpt-oss-120b-predictor                  Chat        Running  1
microsoft-phi-4               microsoft-phi-4-predictor               Chat        Running  1
qwen3-14b                     qwen3-14b-predictor                     Chat        Running  2 (load balanced: maas00 + smc00)

Google Vertex AI — pay-per-token, programmatic access only

Model Name (LiteMaaS)  Hosting                       Capability  Status
minimax-m2             Google Vertex AI              Chat        Available
qwen3-235b             Google Vertex AI              Chat        Available
gpt-oss-20b            Google Vertex AI              Chat        Available
claude-sonnet-4-6      Google Vertex AI (Anthropic)  Chat        Available
claude-opus-4-6        Google Vertex AI (Anthropic)  Chat        Available
claude-sonnet-4-5      Google Vertex AI (Anthropic)  Chat        Available
claude-3-5-haiku       Google Vertex AI (Anthropic)  Chat        Available
gemini-2.5-pro         Google Vertex AI (Google)     Chat        Available

granite-8b-code-instruct-128k is registered in LiteLLM via the granite-8b-code-instruct-128k-predictor-lb service on port 8080. Its availability depends on current GPU allocation; check the LiteMaaS portal for the live model list.

Chat Models — /v1/chat/completions

Chat models follow the OpenAI Chat Completions API exactly. All support streaming via "stream": true.

IBM Granite 3.2 8B Instruct [Chat]

General-purpose instruction-tuned model from IBM Research. Strong reasoning, coding, and multilingual capabilities. 128K context window.

Model ID: granite-3-2-8b-instruct
Parameters: 8B
Context: 128K tokens
Runtime: vLLM on KServe
Internal SVC: granite-3-2-8b-instruct-predictor
# Basic chat completion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Kubernetes in three sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
# Streaming response
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string"}],
    "stream": true
  }'
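A streaming response arrives as server-sent events: each event line is `data: {json}` and the stream ends with `data: [DONE]`. A minimal pure-Python sketch of reassembling the assistant text from those lines (the sample chunks below are illustrative, not captured output):

```python
import json

def collect_stream_text(sse_lines):
    """Reassemble assistant text from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# Illustrative chunks in the shape the Chat Completions stream uses
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "def reverse(s):"}}]}',
    'data: {"choices": [{"delta": {"content": " return s[::-1]"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # def reverse(s): return s[::-1]
```

The OpenAI SDKs handle this parsing for you; the sketch is only useful when consuming the stream with a raw HTTP client.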

Meta Llama Scout 17B [Chat]

Meta's Scout model with an exceptionally large 400K token context window. Ideal for long-document analysis, extended conversations, and retrieval-augmented generation with large corpora. Runs 2 replicas.

Model ID: llama-scout-17b
Parameters: 17B (MoE architecture)
Context: 400K tokens
Runtime: vLLM on KServe
Replicas: 2 (HA)
# Long-context document analysis
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "system", "content": "Analyze the provided document and summarize key points."},
      {"role": "user", "content": "[Your long document text here...]"}
    ],
    "max_tokens": 1024
  }'

Timeout note: Requests with very large contexts may take longer than typical. LiteMaaS routes are configured with a 600-second HAProxy timeout to accommodate this.

IBM Granite 4.0 H Tiny [Chat]

Compact, fast Granite 4.0 model optimized for low-latency inference. Best for simple Q&A, classification, and scenarios where response speed matters more than depth.

Model ID: granite-4-0-h-tiny
Parameters: Tiny (sub-3B)
Runtime: vLLM on KServe
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "granite-4-0-h-tiny",
    "messages": [{"role": "user", "content": "Classify the following as positive or negative: Great product!"}],
    "max_tokens": 10
  }'

Meta CodeLlama 7B Instruct [Chat]

Meta's code-specialized model. Supports code generation, completion, and debugging across Python, Java, C++, Bash, and many other languages. Fill-in-the-middle (FIM) completion available.

Model ID: codellama-7b-instruct
Parameters: 7B
Context: 16K tokens
Runtime: vLLM on KServe
# Code generation
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Write an Ansible task to create a Kubernetes namespace."}
    ]
  }'

Embedding Models — /v1/embeddings

Embedding models convert text into dense vector representations. Use these for semantic search, RAG pipelines, clustering, and similarity scoring.

Nomic Embed Text v1.5 [Embeddings]

High-quality open-source text embedding model. 768-dimensional embeddings. Supports Matryoshka representation learning — embeddings can be truncated for smaller storage. Strong on retrieval benchmarks.

Model ID: nomic-embed-text-v1-5
Dimensions: 768
Context: 8192 tokens
Runtime: OpenVINO on KServe
# Single string embedding
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": "Red Hat OpenShift is an enterprise Kubernetes platform."
  }'
# Batch embeddings (multiple strings)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-embedding-model",
    "input": [
      "What is Kubernetes?",
      "How does OpenShift differ from vanilla Kubernetes?",
      "Explain container orchestration."
    ]
  }'
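When embedding a large corpus, split the input list into batches rather than sending everything in one request; each batch becomes one `input` array in a `/v1/embeddings` call. A minimal batching sketch (the batch size of 64 is an assumption, not a documented limit):

```python
def batched(texts, size=64):
    """Yield successive slices of `texts`, each no longer than `size`."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

corpus = [f"doc {n}" for n in range(150)]
batches = list(batched(corpus))
print([len(b) for b in batches])  # [64, 64, 22]
```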
# Python SDK example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

response = client.embeddings.create(
    model="your-embedding-model",
    input=["OpenShift", "Kubernetes"]
)
vectors = [e.embedding for e in response.data]
print(f"Embedding dimensions: {len(vectors[0])}")  # 768
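Once you have vectors back, similarity scoring is plain arithmetic, and Matryoshka truncation is just slicing plus renormalizing. A pure-Python sketch, no numpy required (the 4-dimensional vectors are toy examples, not real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def truncate(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` values, renormalize."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

a = [0.1, 0.3, 0.5, 0.2]
b = [0.2, 0.1, 0.4, 0.4]
print(round(cosine(a, b), 3))  # 0.869
print(len(truncate(a, 2)))     # 2
```

Truncated nomic-embed-text vectors trade a little retrieval quality for much smaller storage; re-run your retrieval evaluation before committing to a smaller dimension.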

Tokenization — /v1/tokenize

The tokenize endpoint allows you to count tokens before sending a request, useful for cost estimation and context window management.

# Count tokens for a message (works with chat-capable models)
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/tokenize \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-chat-model",
    "messages": [
      {"role": "user", "content": "How many tokens is this message?"}
    ]
  }'

The response includes count (total token count) and tokens (list of token IDs). Not all models expose this endpoint — check the model's capability badge in the LiteMaaS portal.
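One practical use of the count is sizing `max_tokens` so prompt plus completion fits inside the model's context window. A sketch (interpreting the 128K window of granite-3-2-8b-instruct as 131,072 tokens, and reserving a small margin for template tokens — both assumptions):

```python
def completion_budget(prompt_tokens, context_window, reserve=16):
    """Largest safe max_tokens for a prompt of `prompt_tokens` tokens.

    `reserve` leaves headroom for special/template tokens added by the
    chat template (an assumption, not a documented requirement).
    """
    budget = context_window - prompt_tokens - reserve
    if budget <= 0:
        raise ValueError("prompt does not fit in the context window")
    return budget

# 120,000-token prompt against a 128K (131,072-token) window
print(completion_budget(120_000, 131_072))  # 11056
```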

Document Conversion — Granite Docling 258M

Granite Docling is a specialized model for document parsing and conversion. It converts PDFs, Word documents, and other document formats into clean structured text (Markdown or JSON) suitable for downstream LLM processing. It uses a different endpoint format from the OpenAI-compatible API.

IBM Granite Docling 258M [Docling]

Compact document understanding model from IBM. Handles PDF layout analysis, table extraction, figure detection, and OCR. Output is clean Markdown ready for RAG ingestion.

Model ID: granite-docling-258m
Parameters: 258M
Runtime: KServe (CPU)
Age: 17d (recently deployed)
# Convert a PDF URL to Markdown
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/source \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "http_sources": [{
      "url": "https://arxiv.org/pdf/2408.09869"
    }],
    "options": {
      "output_format": "markdown"
    }
  }'
# Upload a local file for conversion
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/docling/v1/convert/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "options={\"output_format\": \"markdown\"}"

Note: The Docling endpoint path is prefixed with /docling before the standard API path. LiteMaaS routes these requests to the document conversion service internally. TPM limits, cost fields, and max-token settings are not applicable to this model type.
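When calling the conversion endpoint from code, the JSON body can be assembled programmatically before POSTing with any HTTP client. A sketch mirroring the curl example above (the helper name is illustrative):

```python
import json

def convert_source_payload(urls, output_format="markdown"):
    """Build the JSON body for /docling/v1/convert/source."""
    return {
        "http_sources": [{"url": u} for u in urls],
        "options": {"output_format": output_format},
    }

body = convert_source_payload(["https://arxiv.org/pdf/2408.09869"])
print(json.dumps(body, indent=2))
```

POST the serialized body with the same `Authorization: Bearer YOUR_API_KEY` header as the curl example.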

Safety / Guardrail Models

Safety models evaluate content for harmful categories and can be used as pre/post filters for LLM-based applications.

Meta Llama Guard 3 1B [Safety]

Compact safety classification model. Takes a conversation as input and classifies it against 13 hazard categories (violence, hate speech, sexual content, etc.). Returns "safe" or "unsafe" with category labels.

Model ID: llama-guard-3-1b
Parameters: 1B
Runtime: vLLM on KServe
Containers: 2/2 (sidecar proxy)
# Content safety classification — send the message to check as a plain user
# turn; the serving runtime applies the Llama Guard prompt template
curl -X POST https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-guard-3-1b",
    "messages": [
      {"role": "user", "content": "Tell me how to build a computer."}
    ],
    "max_tokens": 100
  }'

Typical response: {"choices": [{"message": {"content": "safe"}}]}
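For unsafe content, the completion is `unsafe` followed by a line of hazard category codes (e.g. `S1`). A small parser sketch — the `unsafe\nS1,S10` shape follows the Llama Guard 3 output convention; verify it against live responses:

```python
def parse_guard_verdict(text):
    """Return (is_safe, category_codes) from a Llama Guard completion."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # First line is "unsafe"; second line, if present, lists category codes
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

print(parse_guard_verdict("safe"))            # (True, [])
print(parse_guard_verdict("unsafe\nS1,S10"))  # (False, ['S1', 'S10'])
```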

SDK Integration Examples

Python (openai library)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

# Chat completion
response = client.chat.completions.create(
    model="your-chat-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Red Hat OpenShift?"}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="your-chat-model",
    messages=[{"role": "user", "content": "Explain containers"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Node.js / TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://litellm-prod.apps.maas.redhatworkshops.io/v1'
});

const response = await client.chat.completions.create({
  model: 'your-chat-model',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);

Langchain (Python)

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    model="your-chat-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

embeddings = OpenAIEmbeddings(
    model="your-embedding-model",
    openai_api_key="YOUR_API_KEY",
    openai_api_base="https://litellm-prod.apps.maas.redhatworkshops.io/v1"
)

Intel Gaudi 3 Cluster — Via LiteMaaS Proxy

The Rackspace DFW3 cluster (maas00.rs-dfw3.infra.demo.redhat.com) serves models on 8× Intel Gaudi 3 accelerators via KServe. All endpoints are OpenAI-compatible. For direct cluster access (bypassing the proxy), contact Ashok.

Access: These models are registered in LiteMaaS and accessed via the same portal and sk-... virtual key as all other models. The LiteLLM proxy routes requests to the Gaudi cluster backend.

DeepSeek R1 Distill Qwen 14B [Chat, Intel Gaudi 3]

Reasoning-focused model distilled from DeepSeek R1 into a Qwen2.5 14B base. Strong chain-of-thought performance, code generation, and math. Runs on Gaudi 3 via vLLM.

Model ID: deepseek-r1-distill-qwen-14b
Parameters: 14B
Runtime: vLLM on Gaudi 3
Cluster: maas00.rs-dfw3
Backend: deepseek-r1-distill-qwen-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
# Chat completion via the LiteMaaS proxy
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-14b",
    "messages": [{"role": "user", "content": "Solve step by step: what is 15% of 240?"}]
  }'

Qwen3 14B [Chat, Intel Gaudi 3]

Alibaba's third-generation Qwen model at 14B parameters. Strong multilingual capabilities, coding, and instruction following. Runs on Gaudi 3 via vLLM.

Model ID: qwen3-14b
Parameters: 14B
Runtime: vLLM on Gaudi 3
Cluster: maas00.rs-dfw3
Backend: qwen3-14b-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
# Chat completion via the LiteMaaS proxy
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Explain prefix caching in LLMs."}]
  }'

Microsoft Phi-4 [Chat]

Compact but high-quality model with strong reasoning and coding capabilities. Efficient for its size.

Model ID: microsoft-phi-4
Backend: microsoft-phi-4-llm-hosting.apps.smc00.rs-dfw3.infra.demo.redhat.com
Context: 16K
# Chat completion
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft-phi-4", "messages": [{"role": "user", "content": "Write a Python function to sort a list."}]}'

External Models (Google Vertex AI)

The following models are hosted on Google Vertex AI Model Garden and accessed through the same RHDP MaaS endpoint and virtual key as all other models. They are fully managed MaaS APIs — no GPU allocation is required on the RHDP side. Billing is pay-per-token; costs are tracked and capped per virtual key.

These are external models with real cost. Each request consumes tokens billed against the RHDP GCP project. Choose the lightest model that meets your use case — prefer on-cluster models for casual testing and prototyping.
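Per-request spend is easy to estimate from the `usage` block returned with each response and the per-1M-token prices listed on this page. A sketch using the minimax-m2 rates ($0.30 input, $1.20 output per 1M tokens):

```python
def request_cost(usage, input_per_m, output_per_m):
    """Dollar cost of one request given usage counts and $/1M-token rates."""
    return (usage["prompt_tokens"] / 1e6 * input_per_m
            + usage["completion_tokens"] / 1e6 * output_per_m)

# Example usage block shape returned by /v1/chat/completions
usage = {"prompt_tokens": 2_000, "completion_tokens": 500}
print(f"${request_cost(usage, 0.30, 1.20):.6f}")  # $0.001200
```

Running this against your own traffic logs gives a quick sanity check on the per-key spend caps.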

MiniMax M2 [Chat, Agentic, Vertex AI]

Strong at multi-step tool use, coding, and office workflows. Large 197K context window makes it well-suited for document-heavy agentic pipelines.

Model ID: minimax-m2
Context: 197K tokens
Input: $0.30 / 1M tokens
Output: $1.20 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Agentic / tool-use example
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2",
    "messages": [{"role": "user", "content": "Draft a project status report from the following notes: ..."}],
    "max_tokens": 1024
  }'

Qwen3 235B [Chat, Reasoning, Vertex AI]

Large multilingual mixture-of-experts model from Alibaba. Excels at complex reasoning, multilingual tasks, and large-context workflows. 131K context window.

Model ID: qwen3-235b
Parameters: 235B (MoE)
Context: 131K tokens
Input: $0.22 / 1M tokens
Output: $0.88 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Multilingual reasoning
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-235b",
    "messages": [{"role": "user", "content": "Explain the trade-offs between RAG and fine-tuning for domain adaptation."}],
    "max_tokens": 1024
  }'

GPT OSS 120B [Chat, Reasoning, Vertex AI]

Large open-source 120B model with strong function calling support. Suited for complex generation, agentic workflows, and tasks that benefit from a large parameter count. 131K context window.

Model ID: gpt-oss-120b
Parameters: 120B
Context: 131K tokens
Input: $0.09 / 1M tokens
Output: $0.36 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# Function calling / complex generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarise this document and extract action items."}],
    "max_tokens": 1024
  }'

GPT OSS 20B [Chat, Vertex AI]

Cost-effective option for general-purpose chat tasks. The lightest external model — best starting point when you need a Vertex AI model but want to minimize spend. 131K context window.

Model ID: gpt-oss-20b
Parameters: 20B
Context: 131K tokens
Input: $0.07 / 1M tokens
Output: $0.25 / 1M tokens
Function Calling: Supported
Hosted on: Google Vertex AI
# General chat — cost-effective default
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the difference between a Pod and a Deployment in Kubernetes?"}],
    "max_tokens": 512
  }'

Claude Sonnet 4.6 [Chat, Vertex AI]

Anthropic Claude Sonnet 4.6 via Google Vertex AI. Strong general-purpose reasoning, coding, and analysis. Balanced capability-to-cost ratio.

Model ID: claude-sonnet-4-6
Input: $3.00 / 1M tokens
Output: $15.00 / 1M tokens
Hosted on: Google Vertex AI
# General reasoning and coding
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Explain how Kubernetes resource quotas work."}],
    "max_tokens": 1024
  }'

Claude Opus 4.6 [Chat, Vertex AI]

Anthropic's most capable Claude model. Best for complex multi-step reasoning, long-document analysis, and high-stakes generation tasks.

Model ID: claude-opus-4-6
Input: $5.00 / 1M tokens
Output: $25.00 / 1M tokens
Hosted on: Google Vertex AI
# Complex analysis and long-form generation
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "messages": [{"role": "user", "content": "Analyze this architecture design and identify scalability bottlenecks."}],
    "max_tokens": 2048
  }'

Claude Sonnet 4.5 [Chat, Vertex AI]

Previous-generation Claude Sonnet. Strong coding and reasoning at the same price point as Sonnet 4.6.

Model ID: claude-sonnet-4-5
Input: $3.00 / 1M tokens
Output: $15.00 / 1M tokens
Hosted on: Google Vertex AI
# Coding and structured output
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a Python function to parse Kubernetes manifests."}],
    "max_tokens": 1024
  }'

Claude 3.5 Haiku [Chat, Vertex AI]

Anthropic's fastest and most cost-efficient Claude model. Ideal for high-throughput tasks, classification, summarization, and light Q&A where cost sensitivity matters.

Model ID: claude-3-5-haiku
Input: $1.00 / 1M tokens
Output: $5.00 / 1M tokens
Hosted on: Google Vertex AI
# Fast, cost-efficient tasks
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-5-haiku",
    "messages": [{"role": "user", "content": "Classify this log line as error, warning, or info."}],
    "max_tokens": 64
  }'

Gemini 2.5 Pro [Chat, Vertex AI]

Google Gemini 2.5 Pro with a 1M token context window. Native Google model on Vertex AI — ideal for long-document analysis, multimodal workflows, and tasks requiring very large context.

Model ID: gemini-2.5-pro
Context: 1M tokens
Input: $1.25 / 1M tokens
Output: $10.00 / 1M tokens
Hosted on: Google Vertex AI
# Long-document analysis
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Summarize this entire repository and identify the main architecture patterns."}],
    "max_tokens": 2048
  }'

Listing Available Models via API

# List all models available to your key
curl https://litellm-prod.apps.maas.redhatworkshops.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

# Full model info including capabilities (admin key required)
curl https://litellm-prod.apps.maas.redhatworkshops.io/model/info \
  -H "Authorization: Bearer LITELLM_MASTER_KEY"
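The `/v1/models` response follows the OpenAI list shape: `{"object": "list", "data": [{"id": ...}, ...]}`. Extracting just the model IDs in Python (the response string below is illustrative, not captured output):

```python
import json

# Illustrative /v1/models response in the OpenAI list shape
raw = '{"object": "list", "data": [{"id": "granite-3-2-8b-instruct"}, {"id": "qwen3-14b"}]}'

model_ids = [m["id"] for m in json.loads(raw)["data"]]
print(model_ids)  # ['granite-3-2-8b-instruct', 'qwen3-14b']
```

With the openai SDK, `client.models.list()` returns the same data already parsed.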