RHDP LiteMaaS — Monitoring

Grafana dashboards, ServiceMonitors, usage analytics, and automated key cleanup

Grafana — Infrastructure Monitoring

Grafana is deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics Ansible role. It monitors vLLM model performance, GPU utilization, and KServe InferenceService health using OpenShift User Workload Monitoring as the data source.

Access

Production Grafana: grafana-route-llm-hosting.apps.maas.redhatworkshops.io (reencrypt TLS termination; the route has been up for 94 days).
# Get Grafana route
oc get route grafana-route -n llm-hosting

# Get admin credentials
oc get secret -n llm-hosting | grep grafana-admin
oc get secret <grafana-admin-secret> -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d

What's Deployed

| Component | Status | Age | Details |
|---|---|---|---|
| grafana-deployment | Running | 94d | 2/2 containers (Grafana + sidecar dashboard loader) |
| grafana-operator-controller-manager-v5 | Running | 85d | Grafana Operator v5 — manages GrafanaInstance CRs |

ServiceMonitors — What Gets Scraped

Each model predictor service has a corresponding ServiceMonitor that tells OpenShift's Prometheus to scrape its /metrics endpoint on port 8080.
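A ServiceMonitor for one of these predictors might look roughly like the sketch below. Only the namespace, port, and path come from this document; the metadata, scrape interval, and selector labels are assumptions, not the role's actual manifest:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: granite-3-2-8b-instruct-metrics
  namespace: llm-hosting
spec:
  endpoints:
    - targetPort: 8080        # the predictor's /metrics port
      path: /metrics
      interval: 30s           # scrape interval (assumed)
  selector:
    matchLabels:
      app: granite-3-2-8b-instruct   # selector label is an assumption
```

User Workload Monitoring's Prometheus picks this up automatically as long as the ServiceMonitor lives in the same namespace as the Service it selects.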

| ServiceMonitor | Age | Scrapes |
|---|---|---|
| granite-3-2-8b-instruct-metrics | 175d | vLLM metrics — throughput, TTFT, queue depth, KV cache |
| llama-scout-17b-metrics | 99d | vLLM metrics |
| granite-4-0-h-tiny-metrics | 93d | vLLM metrics |
| codellama-7b-instruct-metrics | 99d | vLLM metrics |
| llama-guard-3-1b-metrics | 114d | vLLM metrics |
| nomic-embed-text-v1-5-metrics | 175d | OpenVINO metrics |
| granite-docling-258m-metrics | 17d | vLLM metrics |
| vllm-models | 94d | Catch-all vLLM ServiceMonitor |

# Verify ServiceMonitors
oc get servicemonitor -n llm-hosting

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | grep vllm_requests

Grafana Dashboards

vLLM Performance

| Panel | What it shows | Metric / query |
|---|---|---|
| Request throughput | Requests/sec per model | rate(vllm:request_success_total[5m]) |
| Time to First Token | P50/P95/P99 latency | vllm:time_to_first_token_seconds_bucket |
| Queue Depth | Requests waiting | vllm:num_requests_waiting |
| KV Cache Usage | GPU memory for context | vllm:gpu_cache_usage_perc |
| Tokens/sec | Generation throughput | rate(vllm:generation_tokens_total[5m]) |
| Running Requests | In-flight requests | vllm:num_requests_running |

GPU — DCGM Exporter

| Panel | What it shows | Metric |
|---|---|---|
| GPU Utilization | % time executing kernels | DCGM_FI_DEV_GPU_UTIL |
| GPU Memory Used | Framebuffer memory | DCGM_FI_DEV_FB_USED |
| GPU Temperature | Device temperature (°C) | DCGM_FI_DEV_GPU_TEMP |
| Power Draw | Current watts | DCGM_FI_DEV_POWER_USAGE |

Dashboard Details

Dashboards are bundled in the ocp4_workload_rhoai_metrics role under files/grafana/ — no external downloads. Three dashboards are deployed:

| Dashboard | What it shows | Data source |
|---|---|---|
| vLLM Model Performance | Per-model throughput (req/s), Time to First Token (P50/P95/P99), scheduler queue depth, KV cache utilization, prompt/completion token histograms, generation tokens/sec | vLLM /metrics on port 8080 via ServiceMonitor |
| GPU Node Overview | Per-GPU utilization %, framebuffer memory used/free, temperature, power draw, NVLink bandwidth, PCIe throughput — across all model serving nodes | NVIDIA DCGM Exporter on port 9400 |
| OpenVINO Server | Inference request count, request duration histogram, in-flight requests — for CPU-based models (e.g. Nomic embeddings) | OpenVINO Model Server /metrics on port 8080 |

Key Metrics to Watch Before a Workshop

| Metric | Healthy range | Action if outside |
|---|---|---|
| TTFT P95 | <15s (≤30 users), <30s (≤60 users), <60s (120+) | Run benchmark; reduce attendees or add replicas |
| Queue depth | <5 waiting requests | Scale model predictor replicas |
| GPU memory used | <90% of framebuffer | Reduce model context length or scale |
| KV cache utilization | <80% | Reduce concurrent requests or max tokens |
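These thresholds can be spot-checked outside Grafana with direct PromQL queries. A sketch, using the vLLM metric names from the dashboards above; the model_name grouping label is an assumption and may differ per deployment:

```promql
# P95 Time to First Token per model over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name))

# Current scheduler queue depth per model (healthy: < 5)
sum(vllm:num_requests_waiting) by (model_name)
```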

Deploying the Monitoring Stack

Deployed via the ocp4_workload_rhoai_metrics role in the rhpds.litemaas collection.

# Deploy Grafana + ServiceMonitors to llm-hosting namespace
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e ocp4_workload_rhoai_metrics_enable_gpu=true

LiteMaaS — Built-in Usage Analytics

The LiteMaaS admin portal provides application-level analytics from LiteLLM spend logs — per-user spend, per-model call counts, per-API-key usage, trends, and exports. No Grafana needed for this.

Access

Log in to litellm-prod-frontend.apps.maas.redhatworkshops.io as an admin, then go to Admin → Analytics.

Available Views

| View | Data |
|---|---|
| System overview | Total requests, tokens, cost — with trend vs previous period |
| By user | Per-user spend, request count, top models used |
| By model | Per-model usage, cost breakdown |
| By provider | On-cluster vs external (WatsonX, Bedrock) split |
| By API key | Per-key usage and budget consumption |
| Export | CSV or JSON export with filters applied |

Caching

Historical data (past days) is cached permanently. Current day is refreshed every 5 minutes. Use Refresh Today in the UI to force an immediate update.

Benchmark — Pre-Event Capacity Validation

The ocp4_workload_litemaas_benchmark role (in rhpds.litemaas) runs a multi-turn conversation load test against LiteMaaS. Use it before Summit or large workshops to verify the platform can handle the expected load. Results are published as an HTML report on an OpenShift route.

What It Tests

Simulates N concurrent users each asking K questions in sequence. Each turn uses the same seed document — this tests prefix caching effectiveness (the speedup ratio). Reports P50/P95/P99 Time to First Token and requests per second.
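To illustrate how the reported numbers relate to each other, here is a minimal sketch (with made-up latencies, not real benchmark output) of deriving the TTFT percentiles and the prefix-cache speedup ratio from per-turn samples:

```python
from statistics import mean, quantiles

# ttft[user][turn] = seconds to first token (illustrative numbers only)
ttft = [
    [12.0, 4.1, 3.9, 4.0],  # user 0: cold first turn, later turns hit the prefix cache
    [11.5, 4.3, 4.2, 4.4],  # user 1
]

samples = sorted(t for turns in ttft for t in turns)
cuts = quantiles(samples, n=100, method="inclusive")  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# Prefix-cache speedup: mean first (uncached) turn vs. mean of all later turns
speedup = mean(t[0] for t in ttft) / mean(x for t in ttft for x in t[1:])
print(f"P50 {p50:.2f}s  P95 {p95:.2f}s  speedup {speedup:.2f}x")
```

With these sample values the speedup comes out around 2.8x, which the cache table below would rate EXCELLENT.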

Pass/Fail Criteria (auto-calculated by attendee count)

| Attendees | P95 PASS | P95 WARN | P95 FAIL |
|---|---|---|---|
| 5–29 users | < 15s | 15–30s | > 30s |
| 30–59 users | < 25s | 25–50s | > 50s |
| 60–119 users | < 30s | 30–60s | > 60s |
| 120+ users | < 60s | 60–120s | > 120s |
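The auto-calculated verdict amounts to a simple banding function. A sketch of the bands above; how the role treats values exactly on a WARN/FAIL boundary is an assumption:

```python
def p95_verdict(attendees: int, p95_seconds: float) -> str:
    """PASS/WARN/FAIL for a measured P95 TTFT, per the attendee bands above."""
    # (band floor, PASS below, WARN up to) -- FAIL beyond the WARN bound
    bands = [(120, 60, 120), (60, 30, 60), (30, 25, 50), (5, 15, 30)]
    for floor, ok, warn in bands:
        if attendees >= floor:
            if p95_seconds < ok:
                return "PASS"
            return "WARN" if p95_seconds <= warn else "FAIL"
    raise ValueError("benchmark expects at least 5 simulated users")

print(p95_verdict(60, 28))    # PASS: 60-119 band, under 30s
print(p95_verdict(150, 100))  # WARN: 120+ band, between 60s and 120s
```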

Cache Performance

| Speedup ratio | Cache status | Meaning |
|---|---|---|
| > 2.5x | EXCELLENT | Later turns significantly faster than first — prefix caching working well |
| 1.5–2.5x | GOOD | Acceptable for workshops |
| < 1.5x | POOR | Check whether the model supports prefix caching |

MCP / Tool Calling Multiplier

LibreChat/MCP workshops apply a 3× multiplier to question count — each user question may trigger 2–3 AI calls (tool invocation, result processing, response). Enable this with benchmark_model_supports_mcp: true.
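The effect of the multiplier on sizing is plain arithmetic, sketched here:

```python
# Each user question on an MCP workshop may fan out into up to 3 AI calls
# (tool invocation, result processing, response), so the benchmark triples
# the effective question count.
def total_ai_calls(users: int, questions: int, mcp: bool = False) -> int:
    return users * questions * (3 if mcp else 1)

print(total_ai_calls(60, 10))           # 600 calls for a plain workshop
print(total_ai_calls(60, 10, mcp=True)) # 1800 calls with tool calling enabled
```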

Running a Benchmark

# Example: test 60 attendees, 10 questions each, granite model
# Run from the RHDP catalog: Tests → LiteMaaS Benchmark CI
# Key parameters in AgnosticV common.yaml:

benchmark_conversations_count: 60    # simulated users (min 5, max 200)
benchmark_sessions_count: 1          # back-to-back sessions
benchmark_turns_count: 10            # questions per user (min 5, max 50)
benchmark_model_supports_mcp: false  # true = applies 3x multiplier
benchmark_model_granite_3_2_8b: true # select which model to test

Interpreting Results

The benchmark deploys an HTML report to an OpenShift route. The report covers TTFT percentiles (P50/P95/P99), requests per second, the cache speedup ratio, and the pass/fail verdict for the configured attendee count:

# Get the report URL after benchmark completes
oc get route benchmark-report -n <guid> -o jsonpath='{.spec.host}'

Parallel Workers (auto-calculated)

| Attendees | Parallel workers |
|---|---|
| 1–30 | 2 |
| 31–60 | 4 |
| 61–100 | 8 |
| 101+ | 16 |
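The same bands as a sketch in code, with the edges taken directly from the table above:

```python
def parallel_workers(attendees: int) -> int:
    """Auto-calculated worker count per the attendee bands above."""
    if attendees <= 30:
        return 2
    if attendees <= 60:
        return 4
    if attendees <= 100:
        return 8
    return 16

print(parallel_workers(60))   # 4
```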

Automated Key Cleanup — Bastion Cronjob

A shell script runs daily at 2 AM on the bastion host. It deletes expired and old virtual keys from LiteMaaS and keeps the LiteMaaS database in sync.

What It Does

| Step | Action |
|---|---|
| 1 | Fetches all virtual keys from the LiteMaaS API (paginated) |
| 2 | Identifies expired keys (expires < now) and old keys (created > 30 days ago) |
| 3 | Deletes each key from LiteMaaS via POST /key/delete |
| 4 | Marks the matching api_keys record inactive in the LiteMaaS DB |
| 5 | Final sweep — marks any remaining orphaned api_keys records inactive |
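Step 2's classification can be sketched as below. Only the 30-day threshold and the expired/old distinction come from this document; the key-record field names (expires, created_at) are assumptions about the API response shape:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)

def needs_cleanup(key: dict, now: datetime) -> bool:
    """True if the key is past its expiry or older than 30 days."""
    expires = key.get("expires")
    if expires and datetime.fromisoformat(expires) < now:
        return True                      # expired
    created = datetime.fromisoformat(key["created_at"])
    return now - created > MAX_AGE       # old

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
fresh = {"expires": None, "created_at": "2025-05-20T00:00:00+00:00"}
stale = {"expires": None, "created_at": "2025-04-01T00:00:00+00:00"}
print(needs_cleanup(fresh, now), needs_cleanup(stale, now))  # False True
```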

Setup

# Run from workstation — installs script on bastion and creates crontab entry
./setup-key-cleanup-cronjob.sh litellm-rhpds

Operations

# Check the cronjob is scheduled
ssh bastion 'sudo crontab -l | grep cleanup'

# View recent cleanup runs
ssh bastion 'sudo tail -100 /var/log/litemaas-key-cleanup.log'

# Run manually
ssh bastion 'sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh'

The script auto-discovers the LiteMaaS URL and master key from OpenShift secrets — it requires oc to be installed and logged in on the bastion host.