RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Product Overview

What LiteMaaS is, what problem it solves, and how it works.

What is RHDP LiteMaaS?

LiteMaaS (Model as a Service) is the AI model access platform for the Red Hat Demo Platform (RHDP). It provides workshop attendees, partners, and Red Hat engineers with a unified, governed interface to production-grade AI models hosted on OpenShift AI — without requiring each user to manage credentials, endpoints, or rate limits themselves.

At its core, LiteMaaS wraps the open-source LiteLLM proxy with a custom subscription and key management layer (the LiteMaaS backend and frontend), adding per-user API key lifecycle management, subscription approval workflows, usage analytics, and branding — all deployed natively on OpenShift.

Production deployment: The RHDP production instance runs at litellm-prod.apps.maas.redhatworkshops.io (API), litellm-prod-frontend.apps.maas.redhatworkshops.io (user portal), and litellm-prod-admin.apps.maas.redhatworkshops.io/api (admin API). It currently serves 8 running model predictors across chat, embedding, safety, and document conversion workloads.

The Problem It Solves

RHDP delivers hundreds of workshops and demos per year, many of which require participants to call AI model APIs. Without a managed service, every workshop had to solve the same problems independently: distributing credentials, discovering model endpoints, enforcing rate limits and budgets, and tracking usage.

LiteMaaS addresses all of these at the platform level, so individual workshops don't have to.

Key Features

Virtual API Key Management

Each user receives a personal virtual API key scoped to their subscription. Each key can carry its own TPM, RPM, and budget ceilings, independent of any other user's keys.
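The per-key ceilings can be pictured as a small record attached to each virtual key. A minimal sketch in Python, with field names that are illustrative rather than the actual LiteMaaS schema:

```python
from dataclasses import dataclass

@dataclass
class VirtualKeyLimits:
    """Illustrative per-key ceilings; field names are hypothetical."""
    tpm: int           # tokens per minute
    rpm: int           # requests per minute
    max_budget: float  # spend ceiling for this key

def within_budget(limits: VirtualKeyLimits, spent: float) -> bool:
    """A key stays usable only while its accumulated spend is under budget."""
    return spent < limits.max_budget

# Two users' keys can carry different ceilings independently.
alice = VirtualKeyLimits(tpm=10_000, rpm=60, max_budget=5.0)
bob = VirtualKeyLimits(tpm=1_000, rpm=10, max_budget=0.5)
```

The point of the sketch is that limits live on the key, not on the model or the platform, so one workshop's heavy user cannot exhaust another's quota.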

Model Subscriptions

Users subscribe to the models they need. Admins can mark a model as "restricted," requiring approval before access is granted. All state changes carry a full audit trail.

Usage Tracking & Analytics

Day-by-day incremental usage caching with multi-dimensional filtering (user, model, provider, API key). Export to CSV/JSON. Admin-only system-wide view.
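The day-by-day aggregation with multi-dimensional filtering can be sketched as follows; the event shape and field names are illustrative, not the actual PostgreSQL schema:

```python
from collections import defaultdict

# Illustrative raw usage events; the real data lives in PostgreSQL.
events = [
    {"day": "2025-06-01", "user": "alice", "model": "granite-3-2-8b-instruct", "tokens": 1200},
    {"day": "2025-06-01", "user": "bob",   "model": "nomic-embed-text-v1-5",   "tokens": 300},
    {"day": "2025-06-02", "user": "alice", "model": "granite-3-2-8b-instruct", "tokens": 800},
]

def daily_usage(events, **filters):
    """Aggregate token counts per day, applying optional equality filters
    (user=..., model=...) to mirror the multi-dimensional filtering idea."""
    totals = defaultdict(int)
    for e in events:
        if all(e.get(k) == v for k, v in filters.items()):
            totals[e["day"]] += e["tokens"]
    return dict(totals)
```

Because totals are keyed by day, a cache only needs to recompute the current day incrementally; past days never change.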

Per-Key Rate & Budget Limits

Per-user configurable TPM (tokens per minute), RPM (requests per minute), max budget, budget duration, and soft budget thresholds enforced at the LiteLLM proxy layer.
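To make the RPM half of this concrete, here is a toy sliding-window limiter. LiteLLM's real enforcement is Redis-backed and more involved; this only illustrates the idea:

```python
from collections import deque

class RequestRateLimiter:
    """Toy per-key RPM check over a 60-second sliding window."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.timestamps = deque()  # arrival times of accepted requests

    def allow(self, now: float) -> bool:
        # Evict requests that fell out of the one-minute window.
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) < self.rpm:
            self.timestamps.append(now)
            return True
        return False  # key is over its RPM ceiling right now
```

TPM enforcement works the same way, except the window accumulates token counts instead of request counts.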

Chat Playground

Built-in browser-based chat UI in the frontend. Users can interactively test models they are subscribed to without writing code. Only chat-capable models are shown.

Model Capability Labels

Color-coded labels on model cards: Chat (blue), Embeddings (green), Tokenize (red-orange), Document Conversion (orange). Curl examples adapt to capability type.

OpenShift OAuth Login

Three-tier RBAC (admin / adminReadonly / user) backed by OpenShift OAuth. Users log in with their OpenShift credentials — no separate account needed.

HA Architecture

All components run with multiple replicas. LiteLLM runs 3 replicas behind Redis for session and key caching. PostgreSQL 16 on a dedicated PVC. Stateless frontend and backend scale horizontally.

Branding Customization

Admin-controlled login page branding: custom logo, title, subtitle, header logos (light/dark), and footer text. Stored in database, served via public API endpoint.

Automated Key Cleanup

Daily cron job on the bastion host purges expired or stale keys (older than 30 days) from LiteLLM and syncs revocation status back to the LiteMaaS database.
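The selection logic of that nightly purge can be sketched as date arithmetic; the key record shape here is illustrative, not the actual LiteMaaS schema:

```python
from datetime import date, timedelta

def stale_keys(keys, today, max_age_days=30):
    """Pick keys the nightly job would purge: already expired,
    or not used within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return [k["id"] for k in keys
            if (k["expires"] and k["expires"] < today) or k["last_used"] < cutoff]

# Hypothetical key records for illustration.
keys = [
    {"id": "sk-1", "expires": date(2025, 5, 1), "last_used": date(2025, 5, 1)},
    {"id": "sk-2", "expires": None,             "last_used": date(2025, 6, 10)},
    {"id": "sk-3", "expires": None,             "last_used": date(2025, 4, 1)},
]
```

After the purge, the job writes the revoked status back so the LiteMaaS database and LiteLLM never disagree about which keys are live.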

Backup & Restore

Monthly automated pg_dump to S3 with VolumeSnapshot fallback. Admin-initiated backup and test restore from the Settings UI. 12-month retention for SQL dumps.
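The 12-month retention rule amounts to simple month arithmetic. A sketch, not the actual backup job (which may round month boundaries differently):

```python
from datetime import date

def dumps_to_delete(dump_dates, today, retention_months=12):
    """Return dump dates older than the retention window.
    Month arithmetic is simplified for illustration."""
    months = lambda d: d.year * 12 + d.month
    return [d for d in dump_dates if months(today) - months(d) > retention_months]
```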

Docling Integration

Granite Docling 258M model exposed via a dedicated document conversion endpoint. Frontend hides irrelevant fields (TPM costs, max tokens) for this model type.

Three-Layer Architecture

LiteMaaS is composed of three distinct layers. Each layer has a clear responsibility and communicates with the others over well-defined internal interfaces.

Layer 1 — User Portal (LiteMaaS Frontend + Backend)

The LiteMaaS custom application: React + PatternFly 6 frontend, Fastify backend API. Handles user authentication (OpenShift OAuth), subscription management, API key lifecycle, usage analytics, admin workflows, and branding. Runs as litellm-frontend and litellm-backend deployments. Does not handle LLM inference directly — all model calls go through Layer 2.

Layer 2 — LiteLLM Proxy

The open-source LiteLLM proxy (quay.io/rh-aiservices-bu/litellm-non-root:main-v1.81.0-stable-custom) running as 3 HA replicas. Handles all actual model routing, virtual key enforcement (TPM/RPM/budget), request rate limiting, caching via Redis, and spend tracking. Stores its state in PostgreSQL. Exposes the OpenAI-compatible API on the litellm-prod route.
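Because the proxy speaks the OpenAI API, any OpenAI-style client pointed at the litellm-prod route works unchanged. A minimal stdlib sketch of the request shape (the key is a placeholder; actually sending this requires a valid virtual key from the portal):

```python
import json

API_KEY = "sk-example"  # placeholder, not a real virtual key
BASE_URL = "https://litellm-prod.apps.maas.redhatworkshops.io"

def chat_request(model: str, prompt: str):
    """Build the URL, headers, and JSON body for an OpenAI-style
    chat completion against the LiteLLM proxy."""
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = chat_request("granite-3-2-8b-instruct", "Hello!")
```

The same key and base URL work with the official OpenAI SDKs by overriding their base URL.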

Layer 3 — Model Serving (OpenShift AI / KServe)

The actual model inference layer — Red Hat OpenShift AI (RHOAI) running KServe predictors in the llm-hosting namespace. Models are deployed as InferenceServices on GPU or CPU nodes. LiteLLM calls these internal ClusterIP services directly over the cluster network (no external routing required for model traffic). Currently hosting Granite, Llama Scout, CodeLlama, Nomic Embed, Llama Guard, and Granite Docling models.

```mermaid
graph TD
    U[User / Workshop Participant] -->|HTTPS| FE[litellm-prod-frontend<br/>LiteMaaS Frontend]
    U -->|API calls with virtual key| PROXY[litellm-prod<br/>LiteLLM Proxy x3]
    FE -->|REST API| BE[litellm-backend<br/>LiteMaaS Backend x3]
    BE -->|virtual key management| PROXY
    PROXY -->|model routing| M1[granite-3-2-8b-instruct]
    PROXY -->|model routing| M2[llama-scout-17b]
    PROXY -->|model routing| M3[granite-4-0-h-tiny]
    PROXY -->|model routing| M4[nomic-embed-text-v1-5]
    PROXY -->|model routing| M5[granite-docling-258m]
    PROXY -->|model routing| M6[codellama-7b-instruct]
    PROXY -->|model routing| M7[llama-guard-3-1b]
    BE -->|store state| PG[(PostgreSQL 16<br/>litellm-postgres-0)]
    PROXY -->|cache / sessions| RD[(Redis 7<br/>litellm-redis)]
    PROXY -->|spend tracking| PG
    style PROXY fill:#fef0f0,stroke:#cc0000
    style BE fill:#f0f4ff,stroke:#0066cc
    style FE fill:#f0faf2,stroke:#3e8635
```

User Role Hierarchy

LiteMaaS implements a three-tier RBAC system enforced at both the backend API level and the frontend UI level.

| Role | How Assigned | Capabilities |
| --- | --- | --- |
| admin | Manually via promote-admin.sh or psql UPDATE | Full platform control: manage users, models, budgets, subscriptions, approve restricted access, view audit logs, configure branding, backup/restore |
| adminReadonly | Manually assigned | Read-only admin views: view all users, analytics, audit logs, and system state — cannot modify |
| user | Automatically on first OAuth login | Browse and subscribe to models, manage own API keys (create/revoke), view own usage, use Chat Playground |

Important: Users must log in via OAuth at least once before being promoted to admin. Direct database inserts will break OAuth login because the OpenShift OAuth ID (a UUID) will not be set. See Admin Setup for the correct procedure.
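The tiered permission model can be sketched as a simple role-to-capability map; the action names here are illustrative, not the backend's actual permission identifiers:

```python
# Illustrative capability sets for the three roles.
ROLE_CAPABILITIES = {
    "admin": {"view_all", "modify", "approve_restricted", "backup_restore"},
    "adminReadonly": {"view_all"},
    "user": {"manage_own_keys", "view_own_usage", "chat_playground"},
}

def can(role: str, action: str) -> bool:
    """Backend-style permission check: unknown roles get no capabilities."""
    return action in ROLE_CAPABILITIES.get(role, set())
```

Note how adminReadonly shares the admin's read scope but none of its write scope, which is why the same check must run on both the backend API and the frontend UI.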

Model Capability System

LiteMaaS supports more than just chat models. Each model is tagged with one or more capability types that control how it appears in the UI, which endpoints are available, and what curl examples are shown to users.

| Capability | Badge | API Endpoint | Example Models |
| --- | --- | --- | --- |
| Chat | Chat | /v1/chat/completions | granite-3-2-8b-instruct, llama-scout-17b, granite-4-0-h-tiny, codellama-7b-instruct |
| Embeddings | Embeddings | /v1/embeddings | nomic-embed-text-v1-5 |
| Tokenize | Tokenize | /v1/tokenize | Additive flag on chat models |
| Document Conversion | Docling | /docling | granite-docling-258m |
| Safety / Guardrails | Safety | /v1/chat/completions | llama-guard-3-1b |
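The way capability tags drive the per-model curl examples can be sketched as a lookup table; the dict and rendering function are an illustration of the idea, not the frontend's actual code:

```python
# Capability-to-endpoint mapping, taken from the table above.
CAPABILITY_ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "embeddings": "/v1/embeddings",
    "tokenize": "/v1/tokenize",
    "docling": "/docling",
    "safety": "/v1/chat/completions",
}

def curl_example(base_url: str, capability: str, model: str) -> str:
    """Render a capability-appropriate curl command for a model card."""
    endpoint = CAPABILITY_ENDPOINTS[capability]
    return (f"curl {base_url}{endpoint} -H 'Authorization: Bearer $KEY' "
            f"-d '{{\"model\": \"{model}\"}}'")
```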

End-to-End User Flow

```mermaid
sequenceDiagram
    participant U as User
    participant FE as LiteMaaS Frontend
    participant BE as LiteMaaS Backend
    participant LL as LiteLLM Proxy
    participant M as Model (KServe)
    U->>FE: Visit portal, click Login
    FE->>BE: GET /api/auth/login
    BE->>BE: Redirect to OpenShift OAuth
    U->>BE: Complete OAuth, callback
    BE->>BE: Create/update user in DB, issue JWT
    BE->>FE: Set session cookie
    U->>FE: Browse model catalog
    FE->>BE: GET /api/models
    BE->>FE: Return model list with capability badges
    U->>FE: Click "Subscribe" on a model
    FE->>BE: POST /api/subscriptions
    BE->>LL: POST /key/generate (scoped to model)
    LL->>BE: Return virtual key (sk-...)
    BE->>BE: Store key in api_keys table
    BE->>FE: Return key to user
    U->>LL: POST /v1/chat/completions with sk-...
    LL->>LL: Validate key, check TPM/RPM/budget
    LL->>M: Forward request to KServe predictor
    M->>LL: Return inference response
    LL->>U: Stream response back
```
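The key-generation step in this flow can be sketched as the payload the backend might send to LiteLLM. The field names follow LiteLLM's key-generation parameters, but treat the exact shape as an assumption rather than the backend's actual request:

```python
import json

def key_generate_payload(model: str, tpm: int, rpm: int, max_budget: float) -> str:
    """Illustrative body for LiteLLM's POST /key/generate when a
    subscription is approved; values here are examples only."""
    return json.dumps({
        "models": [model],       # scope the virtual key to one model
        "tpm_limit": tpm,
        "rpm_limit": rpm,
        "max_budget": max_budget,
    })
```

LiteLLM answers with a sk-... virtual key, which the backend stores and hands back to the user; from then on the user talks to the proxy directly.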

Production Endpoints

| Route Name | URL | Purpose | Audience |
| --- | --- | --- | --- |
| litellm-prod | https://litellm-prod.apps.maas.redhatworkshops.io | OpenAI-compatible API endpoint (LiteLLM proxy) | All API users, SDK clients |
| litellm-prod-frontend | https://litellm-prod-frontend.apps.maas.redhatworkshops.io | LiteMaaS user portal (React UI) | Workshop participants, end users |
| litellm-prod-admin | https://litellm-prod-admin.apps.maas.redhatworkshops.io/api | LiteMaaS admin API (backend) | Admin scripts, automation |

All routes use edge TLS termination with automatic redirect from HTTP. HAProxy timeout is set to 600 seconds to support long-running inference with large context windows.