Model as a Service for Red Hat Demo Platform
What LiteMaaS is, what problem it solves, and how it works.
LiteMaaS (Model as a Service) is the AI model access platform for the Red Hat Demo Platform (RHDP). It provides workshop attendees, partners, and Red Hat engineers with a unified, governed interface to production-grade AI models hosted on OpenShift AI — without requiring each user to manage credentials, endpoints, or rate limits themselves.
At its core, LiteMaaS wraps the open-source LiteLLM proxy with a custom subscription and key management layer (the LiteMaaS backend and frontend), adding per-user API key lifecycle management, subscription approval workflows, usage analytics, and branding — all deployed natively on OpenShift.
Production deployment: The RHDP production instance runs at
litellm-prod.apps.maas.redhatworkshops.io (API),
litellm-prod-frontend.apps.maas.redhatworkshops.io (user portal), and
litellm-prod-admin.apps.maas.redhatworkshops.io/api (admin API).
It currently serves 8 running model predictors across chat, embedding, safety, and document conversion workloads.
RHDP delivers hundreds of workshops and demos per year, many of which require participants to call AI model APIs. Without a managed service, every workshop had to solve the same problems independently: distributing credentials, managing model endpoints, and enforcing rate limits. LiteMaaS addresses all of these at the platform level, so individual workshops don't have to.
Each user receives a personal virtual API key scoped to their subscription. Each key can carry its own TPM, RPM, and budget ceilings, independent of other users' keys.
Users subscribe to the models they need. Admins can mark models as "restricted," requiring approval before access is granted. A full audit trail records all state changes.
Day-by-day incremental usage caching with multi-dimensional filtering (user, model, provider, API key), export to CSV/JSON, and an admin-only system-wide view.
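As a sketch of what multi-dimensional filtering means in practice, the helper below filters usage records by any combination of user, model, provider, and API key. The record fields and function are illustrative stand-ins, not the backend's actual query code:

```python
# Hedged sketch of multi-dimensional usage filtering. Record fields and the
# helper name are hypothetical; the real backend queries PostgreSQL.
def filter_usage(records, *, user=None, model=None, provider=None, api_key=None):
    """Return records matching every filter dimension that was supplied."""
    wanted = {k: v for k, v in
              {"user": user, "model": model,
               "provider": provider, "api_key": api_key}.items()
              if v is not None}
    return [r for r in records
            if all(r.get(k) == v for k, v in wanted.items())]

records = [
    {"user": "alice", "model": "granite-3-2-8b-instruct",
     "provider": "openshift-ai", "tokens": 120},
    {"user": "bob", "model": "nomic-embed-text-v1-5",
     "provider": "openshift-ai", "tokens": 300},
]
print(filter_usage(records, user="alice"))
```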
Per-user configurable TPM (tokens per minute), RPM (requests per minute), max budget, budget duration, and soft budget thresholds enforced at the LiteLLM proxy layer.
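To illustrate how these per-key ceilings map onto LiteLLM's virtual key API, the sketch below assembles the JSON body for a `/key/generate` call. The field names (`tpm_limit`, `rpm_limit`, `max_budget`, `budget_duration`, `soft_budget`) follow LiteLLM's key management API; the helper function and the specific values are illustrative:

```python
# Illustrative sketch: assembling a LiteLLM /key/generate request body for
# a user's per-key limits. Function name and values are hypothetical.
import json

def build_key_request(user_id, models, *, tpm, rpm, max_budget,
                      budget_duration="30d", soft_budget=None):
    """Build the JSON body POSTed to the LiteLLM proxy's /key/generate."""
    body = {
        "user_id": user_id,
        "models": models,          # models this virtual key may call
        "tpm_limit": tpm,          # tokens per minute
        "rpm_limit": rpm,          # requests per minute
        "max_budget": max_budget,  # hard spend ceiling
        "budget_duration": budget_duration,
    }
    if soft_budget is not None:
        body["soft_budget"] = soft_budget  # alert threshold, not a hard stop
    return body

payload = build_key_request(
    "workshop-user-42", ["granite-3-2-8b-instruct"],
    tpm=10_000, rpm=60, max_budget=5.0, soft_budget=4.0,
)
print(json.dumps(payload, indent=2))
```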
Built-in browser-based chat UI in the frontend. Users can interactively test models they are subscribed to without writing code. Only chat-capable models are shown.
Color-coded labels on model cards: Chat (blue), Embeddings (green), Tokenize (red-orange), Document Conversion (orange). Curl examples adapt to capability type.
Three-tier RBAC (admin / adminReadonly / user) backed by OpenShift OAuth. Users log in with their OpenShift credentials — no separate account needed.
All components run with multiple replicas. LiteLLM runs 3 replicas behind Redis for session and key caching. PostgreSQL 16 on a dedicated PVC. Stateless frontend and backend scale horizontally.
Admin-controlled login page branding: custom logo, title, subtitle, header logos (light/dark), and footer text. Stored in database, served via public API endpoint.
Daily cron job on the bastion host purges expired or stale keys (older than 30 days) from LiteLLM and syncs revocation status back to the LiteMaaS database.
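The 30-day staleness rule can be sketched as follows. The real job runs on the bastion host against LiteLLM's key list; the record shape and function here are illustrative:

```python
# Hedged sketch of the nightly cleanup's selection logic: purge keys that
# are past their expiry, or unused for more than 30 days.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)

def keys_to_purge(keys, now):
    """Return IDs of keys that are expired or stale (unused > 30 days)."""
    purge = []
    for k in keys:
        expired = k.get("expires") is not None and k["expires"] < now
        stale = (now - k["last_used"]) > STALE_AFTER
        if expired or stale:
            purge.append(k["id"])
    return purge

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
keys = [
    {"id": "sk-a", "expires": None, "last_used": now - timedelta(days=40)},
    {"id": "sk-b", "expires": now + timedelta(days=5),
     "last_used": now - timedelta(days=1)},
    {"id": "sk-c", "expires": now - timedelta(days=1), "last_used": now},
]
print(keys_to_purge(keys, now))  # sk-a is stale, sk-c is expired
```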
Monthly automated pg_dump to S3 with VolumeSnapshot fallback. Admin-initiated backup and test restore from the Settings UI. 12-month retention for SQL dumps.
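The 12-month retention rule for SQL dumps amounts to a simple age cutoff, sketched below. The dates stand in for parsed dump timestamps; the helper is illustrative, not the actual backup tooling:

```python
# Sketch of 12-month retention: monthly dumps older than 12 months are
# eligible for deletion. Helper name and inputs are hypothetical.
from datetime import date

def dumps_to_delete(dump_dates, today):
    """Keep the last 12 monthly dumps; anything older may be deleted."""
    months_old = lambda d: (today.year - d.year) * 12 + (today.month - d.month)
    return [d for d in dump_dates if months_old(d) > 12]

today = date(2025, 6, 1)
dumps = [date(2024, 4, 1), date(2024, 7, 1), date(2025, 5, 1)]
print(dumps_to_delete(dumps, today))  # only the 2024-04 dump exceeds 12 months
```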
Granite Docling 258M model exposed via a dedicated document conversion endpoint. Frontend hides irrelevant fields (TPM costs, max tokens) for this model type.
LiteMaaS is composed of three distinct layers. Each layer has a clear responsibility and communicates with the others over well-defined internal interfaces.
Layer 1: the LiteMaaS custom application. React + PatternFly 6 frontend, Fastify backend API. Handles user authentication (OpenShift OAuth), subscription management, API key lifecycle, usage analytics, admin workflows, and branding. Runs as litellm-frontend and litellm-backend deployments. Does not handle LLM inference directly; all model calls go through Layer 2.
Layer 2: the open-source LiteLLM proxy (quay.io/rh-aiservices-bu/litellm-non-root:main-v1.81.0-stable-custom) running as 3 HA replicas. Handles all actual model routing, virtual key enforcement (TPM/RPM/budget), request rate limiting, caching via Redis, and spend tracking. Stores its state in PostgreSQL. Exposes the OpenAI-compatible API on the litellm-prod route.
Layer 3: the model inference layer, Red Hat OpenShift AI (RHOAI) running KServe predictors in the llm-hosting namespace. Models are deployed as InferenceServices on GPU or CPU nodes. LiteLLM calls these internal ClusterIP services directly over the cluster network (no external routing required for model traffic). Currently hosting Granite, Llama Scout, CodeLlama, Nomic Embed, Llama Guard, and Granite Docling models.
LiteMaaS implements a three-tier RBAC system enforced at both the backend API level and the frontend UI level.
| Role | How Assigned | Capabilities |
|---|---|---|
| `admin` | Manually via `promote-admin.sh` or `psql` UPDATE | Full platform control: manage users, models, budgets, and subscriptions; approve restricted access; view audit logs; configure branding; backup/restore |
| `adminReadonly` | Manually assigned | Read-only admin views: view all users, analytics, audit logs, and system state; cannot modify |
| `user` | Automatically on first OAuth login | Browse and subscribe to models, manage own API keys (create/revoke), view own usage, use Chat Playground |
Important: Users must log in via OAuth at least once before being promoted to admin. Direct database inserts will break OAuth login because the OpenShift OAuth ID (a UUID) will not be set. See Admin Setup for the correct procedure.
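As a minimal sketch of the three-tier check, the helper below maps each role to an action set. The role names match the table above; the specific permission sets and function name are illustrative, not the backend's actual authorization code:

```python
# Illustrative three-tier RBAC check. Role names are from the docs;
# the action names and groupings are hypothetical.
READ_ACTIONS = {"view_users", "view_analytics", "view_audit_log"}
WRITE_ACTIONS = {"manage_users", "manage_models", "approve_subscriptions",
                 "configure_branding", "backup_restore"}
SELF_ACTIONS = {"subscribe", "manage_own_keys", "view_own_usage",
                "chat_playground"}

def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    if role == "admin":
        return action in READ_ACTIONS | WRITE_ACTIONS | SELF_ACTIONS
    if role == "adminReadonly":
        return action in READ_ACTIONS | SELF_ACTIONS
    if role == "user":
        return action in SELF_ACTIONS
    return False  # unknown roles get nothing

print(is_allowed("adminReadonly", "manage_models"))  # False: read-only role
```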
LiteMaaS supports more than just chat models. Each model is tagged with one or more capability types that control how it appears in the UI, which endpoints are available, and what curl examples are shown to users.
| Capability | Badge | API Endpoint | Example Models |
|---|---|---|---|
| Chat | Chat | `/v1/chat/completions` | granite-3-2-8b-instruct, llama-scout-17b, granite-4-0-h-tiny, codellama-7b-instruct |
| Embeddings | Embeddings | `/v1/embeddings` | nomic-embed-text-v1-5 |
| Tokenize | Tokenize | `/v1/tokenize` | Additive flag on chat models |
| Document Conversion | Docling | `/docling` | granite-docling-258m |
| Safety / Guardrails | Safety | `/v1/chat/completions` | llama-guard-3-1b |
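The capability-to-endpoint mapping in the table could drive which curl example the UI renders. A sketch, with the mapping taken from the table and the helper name being hypothetical:

```python
# Capability -> endpoint lookup, mirroring the table above.
ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "embeddings": "/v1/embeddings",
    "tokenize": "/v1/tokenize",
    "document_conversion": "/docling",
    "safety": "/v1/chat/completions",  # guard models use the chat endpoint
}

def endpoint_for(capability):
    """Return the API path for a capability, or raise on unknown input."""
    try:
        return ENDPOINTS[capability]
    except KeyError:
        raise ValueError(f"unknown capability: {capability}") from None

print(endpoint_for("embeddings"))  # /v1/embeddings
```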
| Route Name | URL | Purpose | Audience |
|---|---|---|---|
| `litellm-prod` | https://litellm-prod.apps.maas.redhatworkshops.io | OpenAI-compatible API endpoint (LiteLLM proxy) | All API users, SDK clients |
| `litellm-prod-frontend` | https://litellm-prod-frontend.apps.maas.redhatworkshops.io | LiteMaaS user portal (React UI) | Workshop participants, end users |
| `litellm-prod-admin` | https://litellm-prod-admin.apps.maas.redhatworkshops.io/api | LiteMaaS admin API (backend) | Admin scripts, automation |
All routes use edge TLS termination with automatic redirect from HTTP. HAProxy timeout is set to 600 seconds to support long-running inference with large context windows.
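Putting the routes together, a chat call against the production API route looks like a standard OpenAI-style request with the virtual key as a bearer token. The sketch below builds the equivalent curl command rather than sending anything; the key and prompt are placeholders:

```python
# Hedged sketch: the OpenAI-compatible chat request users send to the
# litellm-prod route. The key is a placeholder; this only prints the
# equivalent curl command, it does not perform the request.
import json
import shlex

BASE_URL = "https://litellm-prod.apps.maas.redhatworkshops.io"

def chat_curl(api_key, model, prompt):
    """Build a curl command for a chat completion against the prod route."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return (f"curl {BASE_URL}/v1/chat/completions "
            f"-H 'Authorization: Bearer {api_key}' "
            f"-H 'Content-Type: application/json' "
            f"-d {shlex.quote(json.dumps(payload))}")

print(chat_curl("sk-your-virtual-key", "granite-3-2-8b-instruct", "Hello!"))
```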