Model as a Service for Red Hat Demo Platform
What LiteMaaS is, what problem it solves, and how it works.
RHDP MaaS (Model as a Service) is the Red Hat Demo Platform's AI model access platform. It provides users with a unified, governed interface to production-grade AI models hosted on OpenShift AI — without requiring each user to manage credentials, endpoints, or rate limits themselves.
RHDP MaaS is built on top of LiteMaaS, an open-source platform developed by the Red Hat AI Services BU team. LiteMaaS wraps the open-source LiteLLM proxy with a subscription and key management layer — adding per-user API key lifecycle management, subscription approval workflows, usage analytics, and branding. RHDP uses and extends LiteMaaS to run its own MaaS infrastructure.
Production deployment: The RHDP production instance runs at:
- `litellm-prod.apps.maas.redhatworkshops.io` (API)
- `litellm-prod-frontend.apps.maas.redhatworkshops.io` (user portal)
- `litellm-prod-admin.apps.maas.redhatworkshops.io/api` (admin API)
It currently serves 8 running model predictors across chat, embedding, safety, and document conversion workloads.
RHDP delivers hundreds of workshops and demos per year, many of which require participants to call AI model APIs. Without a managed service, every workshop had to solve the same problems independently: distributing credentials, configuring endpoints, enforcing rate limits, and tracking usage.
LiteMaaS addresses all of these at the platform level, so individual workshops don't have to.
Each user receives a personal virtual API key scoped to their subscription. Keys can have per-key TPM, RPM, and budget ceilings independently of other users.
Users subscribe to the models they need. Admins can mark models as "restricted", requiring approval before access is granted. All subscription state changes carry a full audit trail.
Day-by-day incremental usage caching with multi-dimensional filtering (user, model, provider, API key). Export to CSV/JSON. Admin-only system-wide view.
Per-user configurable TPM (tokens per minute), RPM (requests per minute), max budget, budget duration, and soft budget thresholds enforced at the LiteLLM proxy layer.
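As a sketch of how these per-key ceilings fit together: the LiteLLM proxy's key-management API accepts per-key `tpm_limit`, `rpm_limit`, `max_budget`, `budget_duration`, and `soft_budget` fields. The helper, key alias scheme, and concrete values below are illustrative assumptions, not RHDP configuration.

```python
import json

def build_key_request(user_id: str, models: list) -> dict:
    """JSON body for a per-user virtual key with its own ceilings."""
    return {
        "key_alias": f"litemaas-{user_id}",  # hypothetical naming scheme
        "models": models,             # models this key is allowed to call
        "tpm_limit": 10_000,          # tokens per minute, this key only
        "rpm_limit": 60,              # requests per minute, this key only
        "max_budget": 5.0,            # hard spend ceiling
        "budget_duration": "30d",     # budget window resets every 30 days
        "soft_budget": 4.0,           # warning threshold below the hard cap
    }

body = build_key_request("alice", ["granite-3-2-8b-instruct"])
print(json.dumps(body, indent=2))
```

POSTing a body like this to the proxy's `/key/generate` endpoint with an admin bearer token returns the new virtual key; LiteMaaS performs the equivalent provisioning on the user's behalf.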
Built-in browser-based chat UI in the frontend. Users can interactively test models they are subscribed to without writing code. Only chat-capable models are shown.
Color-coded labels on model cards: Chat (blue), Embeddings (green), Tokenize (red-orange), Document Conversion (orange). Curl examples adapt to capability type.
Three-tier RBAC (admin / adminReadonly / user) backed by OpenShift OAuth. Users log in with their OpenShift credentials — no separate account needed.
All components run with multiple replicas. LiteLLM runs 3 replicas behind Redis for session and key caching. PostgreSQL 16 on a dedicated PVC. Stateless frontend and backend scale horizontally.
Admin-controlled login page branding: custom logo, title, subtitle, header logos (light/dark), and footer text. Stored in database, served via public API endpoint.
Daily cron job on the bastion host purges expired or stale keys (older than 30 days) from LiteLLM and syncs revocation status back to the LiteMaaS database.
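The purge rule amounts to a simple staleness predicate: a key is removed if it has expired, or if it has gone unused for more than 30 days. A minimal sketch, with illustrative field names rather than the actual LiteMaaS schema:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)

def is_stale(key: dict, now: datetime) -> bool:
    """True if the key has expired or has been idle for over 30 days."""
    expired = key.get("expires_at") is not None and key["expires_at"] < now
    last_used = key.get("last_used_at") or key["created_at"]
    return expired or (now - last_used) > STALE_AFTER

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
keys = [
    {"id": "k1", "created_at": now - timedelta(days=2),
     "last_used_at": now - timedelta(days=1), "expires_at": None},
    {"id": "k2", "created_at": now - timedelta(days=90),
     "last_used_at": now - timedelta(days=45), "expires_at": None},
]
stale = [k["id"] for k in keys if is_stale(k, now)]
print(stale)  # only k2 has been idle past the 30-day window
```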
Monthly automated pg_dump to S3 with VolumeSnapshot fallback. Admin-initiated backup and test restore from the Settings UI. 12-month retention for SQL dumps.
Granite Docling 258M model exposed via a dedicated document conversion endpoint. Frontend hides irrelevant fields (TPM costs, max tokens) for this model type.
These three terms are often confused. Here is the precise distinction:
An open-source Python library that provides a single interface to 100+ LLM providers. It knows how to translate requests to the format each provider expects — OpenAI, Anthropic, Vertex AI, Bedrock, etc. — and converts responses back to a standard shape. This is the translator/engine.
A FastAPI HTTP server built on top of the LiteLLM library. It exposes an OpenAI-compatible endpoint so any OpenAI client can connect to it without modification. It adds virtual keys, rate limiting (TPM/RPM), spend tracking, load balancing, and model routing. This is the gateway — raw infrastructure, no user-facing UI.
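Because the proxy is OpenAI-compatible, a client only changes the base URL and supplies its virtual key; the request body is the standard chat-completions shape. The model name and key below are placeholders:

```python
import json

BASE_URL = "https://litellm-prod.apps.maas.redhatworkshops.io"

headers = {
    "Authorization": "Bearer sk-your-virtual-key",  # per-user virtual key
    "Content-Type": "application/json",
}
payload = {
    "model": "granite-3-2-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Any OpenAI SDK pointed at BASE_URL sends exactly this request:
print(f"POST {BASE_URL}/v1/chat/completions")
print(json.dumps(payload))
```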
Built by Red Hat AI Services BU on top of the LiteLLM Proxy. It adds the self-service portal layer: users log in with Red Hat SSO, browse a model catalog, subscribe to models, create and manage API keys, and view their usage — without talking to an admin. LiteMaaS is what turns the raw proxy into a product.
Simple analogy: LiteLLM library = engine | LiteLLM Proxy = car | LiteMaaS = dealership with showroom, customer accounts, and self-service.
When something breaks at the routing or key level → LiteLLM Proxy. When something breaks at the user, subscription, or catalog level → LiteMaaS. When a new AI provider needs support → LiteLLM library.
LiteMaaS is composed of three distinct layers. Each layer has a clear responsibility and communicates with the others over well-defined internal interfaces.
The LiteMaaS custom application: React + PatternFly 6 frontend, Fastify backend API. Handles user authentication (OpenShift OAuth), subscription management, API key lifecycle, usage analytics, admin workflows, and branding. Runs as litellm-frontend and litellm-backend deployments. Does not handle LLM inference directly — all model calls go through Layer 2.
The open-source LiteLLM proxy (quay.io/rh-aiservices-bu/litellm-non-root:main-v1.81.0-stable-custom) running as 3 HA replicas. Handles all actual model routing, virtual key enforcement (TPM/RPM/budget), request rate limiting, caching via Redis, and spend tracking. Stores its state in PostgreSQL. Exposes the OpenAI-compatible API on the litellm-prod route.
The actual model inference layer — Red Hat OpenShift AI (RHOAI) running KServe predictors in the llm-hosting namespace. Models are deployed as InferenceServices on GPU or CPU nodes. LiteLLM calls these internal ClusterIP services directly over the cluster network (no external routing required for model traffic). Currently hosting Granite, Llama Scout, CodeLlama, Nomic Embed, Llama Guard, and Granite Docling models.
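Since the predictors are plain ClusterIP services, LiteLLM can reach them over in-cluster DNS. The pattern below follows KServe's default naming (a `<name>-predictor` service in the model namespace); treat it as an assumption about the RHDP setup, not a documented value:

```python
NAMESPACE = "llm-hosting"  # namespace hosting the InferenceServices

def internal_url(isvc_name: str) -> str:
    """In-cluster endpoint for a KServe predictor, per its default naming."""
    return f"http://{isvc_name}-predictor.{NAMESPACE}.svc.cluster.local/v1"

print(internal_url("granite-3-2-8b-instruct"))
```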
LiteMaaS implements a three-tier RBAC system enforced at both the backend API level and the frontend UI level.
| Role | How Assigned | Capabilities |
|---|---|---|
| `admin` | Manually via `promote-admin.sh` or `psql` UPDATE | Full platform control: manage users, models, budgets, subscriptions, approve restricted access, view audit logs, configure branding, backup/restore |
| `adminReadonly` | Manually assigned | Read-only admin views: view all users, analytics, audit logs, and system state — cannot modify |
| `user` | Automatically on first OAuth login | Browse and subscribe to models, manage own API keys (create/revoke), view own usage, use Chat Playground |
Important: Users must log in via OAuth at least once before being promoted to admin. Direct database inserts will break OAuth login because the OpenShift OAuth ID (a UUID) will not be set. See Admin Setup for the correct procedure.
LiteMaaS supports more than just chat models. Each model is tagged with one or more capability types that control how it appears in the UI, which endpoints are available, and what curl examples are shown to users.
| Capability | Badge | API Endpoint | Example Models |
|---|---|---|---|
| Chat | Chat | `/v1/chat/completions` | `granite-3-2-8b-instruct`, `llama-scout-17b`, `granite-4-0-h-tiny`, `codellama-7b-instruct` |
| Embeddings | Embeddings | `/v1/embeddings` | `nomic-embed-text-v1-5` |
| Tokenize | Tokenize | `/v1/tokenize` | Additive flag on chat models |
| Document Conversion | Docling | `/docling` | `granite-docling-258m` |
| Safety / Guardrails | Safety | `/v1/chat/completions` | `llama-guard-3-1b` |
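The request shape differs by capability. A client-side sketch of the two most common shapes, using model names from the table above:

```python
import json

# Chat models take a messages array; embedding models take an input list.
chat_request = {
    "path": "/v1/chat/completions",
    "body": {
        "model": "granite-3-2-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize LiteMaaS."}],
    },
}
embeddings_request = {
    "path": "/v1/embeddings",
    "body": {
        "model": "nomic-embed-text-v1-5",
        "input": ["LiteMaaS turns the raw proxy into a product."],
    },
}
for req in (chat_request, embeddings_request):
    print(req["path"], json.dumps(req["body"]))
```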
| Route Name | URL | Purpose | Audience |
|---|---|---|---|
| `litellm-prod` | https://litellm-prod.apps.maas.redhatworkshops.io | OpenAI-compatible API endpoint (LiteLLM proxy) | All API users, SDK clients |
| `litellm-prod-frontend` | https://litellm-prod-frontend.apps.maas.redhatworkshops.io | LiteMaaS user portal (React UI) | Workshop participants, end users |
| `litellm-prod-admin` | https://litellm-prod-admin.apps.maas.redhatworkshops.io/api | LiteMaaS admin API (backend) | Admin scripts, automation |
All routes use edge TLS termination with automatic redirect from HTTP. HAProxy timeout is set to 600 seconds to support long-running inference with large context windows.
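A hypothetical Route manifest matching this behavior, shown for illustration (the actual RHDP manifests are not reproduced here): edge TLS with HTTP-to-HTTPS redirect and the standard OpenShift HAProxy timeout annotation raised to 600 seconds.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: litellm-prod
  annotations:
    # Raise the router timeout for long-running inference requests
    haproxy.router.openshift.io/timeout: 600s
spec:
  host: litellm-prod.apps.maas.redhatworkshops.io
  to:
    kind: Service
    name: litellm          # backing service name is an assumption
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
```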