Grafana dashboards, ServiceMonitors, usage analytics, and automated key cleanup
Grafana is deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics Ansible role. It monitors vLLM model performance, GPU utilization, and KServe InferenceService health using OpenShift User Workload Monitoring as the data source.
# Get Grafana route
oc get route grafana-route -n llm-hosting

# Get admin credentials
oc get secret -n llm-hosting | grep grafana-admin
oc get secret <grafana-admin-secret> -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d
| Component | Status | Age | Details |
|---|---|---|---|
| grafana-deployment | Running | 94d | 2/2 containers (Grafana + sidecar dashboard loader) |
| grafana-operator-controller-manager-v5 | Running | 85d | Grafana Operator v5 — manages GrafanaInstance CRs |
Each model predictor service has a corresponding ServiceMonitor that tells OpenShift's Prometheus to scrape its /metrics endpoint on port 8080.
| ServiceMonitor | Age | Scrapes |
|---|---|---|
| granite-3-2-8b-instruct-metrics | 175d | vLLM metrics — throughput, TTFT, queue depth, KV cache |
| llama-scout-17b-metrics | 99d | vLLM metrics |
| granite-4-0-h-tiny-metrics | 93d | vLLM metrics |
| codellama-7b-instruct-metrics | 99d | vLLM metrics |
| llama-guard-3-1b-metrics | 114d | vLLM metrics |
| nomic-embed-text-v1-5-metrics | 175d | OpenVino metrics |
| granite-docling-258m-metrics | 17d | vLLM metrics |
| vllm-models | 94d | Catch-all vLLM ServiceMonitor |
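Each of these follows the same shape; a minimal sketch of one ServiceMonitor is shown below. The label selector and port name are assumptions for illustration — check the actual predictor Service for the real values, or the manifests bundled in the role:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: granite-3-2-8b-instruct-metrics
  namespace: llm-hosting
spec:
  selector:
    matchLabels:
      app: granite-3-2-8b-instruct   # assumed label on the predictor Service
  endpoints:
    - targetPort: 8080               # vLLM serves /metrics on the same port as inference
      path: /metrics
      interval: 30s
```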
# Verify ServiceMonitors
oc get servicemonitor -n llm-hosting

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | grep vllm_requests
Key vLLM metrics:

- rate(vllm:request_success_total[5m])
- vllm:time_to_first_token_seconds_bucket
- vllm:num_requests_waiting
- vllm:gpu_cache_usage_perc
- rate(vllm:generation_tokens_total[5m])
- vllm:num_requests_running

Key GPU (DCGM) metrics:

- DCGM_FI_DEV_GPU_UTIL
- DCGM_FI_DEV_FB_USED
- DCGM_FI_DEV_GPU_TEMP
- DCGM_FI_DEV_POWER_USAGE

Dashboards are bundled in the ocp4_workload_rhoai_metrics role under files/grafana/ — no external downloads. Three dashboards are deployed:
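TTFT percentile panels are typically derived from the bucket metric with histogram_quantile — a standard PromQL pattern. The query below is a representative sketch; the exact label grouping in the bundled dashboards may differ:

```promql
histogram_quantile(
  0.95,
  sum by (le, model_name) (
    rate(vllm:time_to_first_token_seconds_bucket[5m])
  )
)
```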
| Dashboard | What it shows | Data source |
|---|---|---|
| vLLM Model Performance | Per-model throughput (req/s), Time to First Token (P50/P95/P99), scheduler queue depth, KV cache utilization, prompt/completion token histograms, generation tokens/sec | vLLM /metrics on port 8080 via ServiceMonitor |
| GPU Node Overview | Per-GPU utilization %, framebuffer memory used/free, temperature, power draw, NVLink bandwidth, PCIe throughput — across all model serving nodes | NVIDIA DCGM Exporter on port 9400 |
| OpenVino Server | Inference request count, request duration histogram, in-flight requests — for CPU-based models (e.g. Nomic embeddings) | OpenVino Model Server /metrics on port 8080 |
| Metric | Healthy range | Action if outside |
|---|---|---|
| TTFT P95 | <15s (≤30 users), <30s (≤60 users), <60s (120+) | Run benchmark, reduce attendees or add replicas |
| Queue depth | <5 waiting requests | Scale model predictor replicas |
| GPU memory used | <90% framebuffer | Reduce model context length or scale |
| KV cache utilization | <80% | Reduce concurrent requests or max tokens |
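When "scale model predictor replicas" is the remedy, that usually means raising the replica bounds on the KServe InferenceService. A minimal sketch of such a patch — the model name is an example, and spare GPU capacity must exist for the new replica to schedule:

```yaml
# Apply with:
#   oc patch inferenceservice granite-3-2-8b-instruct -n llm-hosting \
#     --type=merge --patch-file patch.yaml
spec:
  predictor:
    minReplicas: 2   # raise from 1 to add a second serving replica
    maxReplicas: 2
```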
Deployed via the ocp4_workload_rhoai_metrics role in the rhpds.litemaas collection.
# Deploy Grafana + ServiceMonitors to llm-hosting namespace
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
-e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
-e ocp4_workload_rhoai_metrics_enable_gpu=true
The LiteMaaS admin portal provides application-level analytics from LiteLLM spend logs — per-user spend, per-model call counts, per-API-key usage, trends, and exports. No Grafana needed for this.
| View | Data |
|---|---|
| System overview | Total requests, tokens, cost — with trend vs previous period |
| By user | Per-user spend, request count, top models used |
| By model | Per-model usage, cost breakdown |
| By provider | On-cluster vs external (WatsonX, Bedrock) split |
| By API key | Per-key usage and budget consumption |
| Export | CSV or JSON export with filters applied |
Historical data (past days) is cached permanently. Current day is refreshed every 5 minutes. Use Refresh Today in the UI to force an immediate update.
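The refresh policy above can be summarized as a small predicate — this is an illustrative sketch of the rule, not LiteMaaS source code:

```python
from datetime import date, datetime, timedelta

REFRESH_INTERVAL = timedelta(minutes=5)

def needs_refresh(day: date, last_fetched: datetime, now: datetime) -> bool:
    """Past days are cached permanently; only the current day is re-fetched."""
    if day < now.date():
        return False  # historical day: always serve from cache
    return now - last_fetched >= REFRESH_INTERVAL
```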
The ocp4_workload_litemaas_benchmark role (in rhpds.litemaas) runs a multi-turn conversation load test against LiteMaaS. Use it before Summit or large workshops to verify the platform can handle the expected load. Results are published as an HTML report on an OpenShift route.
Simulates N concurrent users each asking K questions in sequence. Each turn uses the same seed document — this tests prefix caching effectiveness (the speedup ratio). Reports P50/P95/P99 Time to First Token and requests per second.
| Attendees | P95 PASS | P95 WARN | P95 FAIL |
|---|---|---|---|
| 5–29 users | < 15s | 15–30s | > 30s |
| 30–59 users | < 25s | 25–50s | > 50s |
| 60–119 users | < 30s | 30–60s | > 60s |
| 120+ users | < 60s | 60–120s | > 120s |
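The table maps directly onto a small classifier. The sketch below mirrors the bands above for illustration; it is not the benchmark role's actual code:

```python
def ttft_status(attendees: int, p95_seconds: float) -> str:
    """Classify a benchmark P95 TTFT result per the attendee bands above."""
    bands = [          # (min_attendees, pass_limit_s, warn_limit_s)
        (120, 60, 120),
        (60, 30, 60),
        (30, 25, 50),
        (5, 15, 30),
    ]
    for min_att, pass_limit, warn_limit in bands:
        if attendees >= min_att:
            if p95_seconds < pass_limit:
                return "PASS"
            return "WARN" if p95_seconds <= warn_limit else "FAIL"
    raise ValueError("benchmark requires at least 5 simulated users")
```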
| Speedup ratio | Cache status | Meaning |
|---|---|---|
| > 2.5x | EXCELLENT | Later turns significantly faster than first — prefix caching working well |
| 1.5–2.5x | GOOD | Acceptable for workshops |
| < 1.5x | POOR | Check if model supports prefix caching |
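The speedup verdict follows the same pattern — a sketch of the thresholds above, not the role's implementation:

```python
def cache_status(speedup_ratio: float) -> str:
    """Map first-turn vs later-turn speedup ratio to a prefix-cache verdict."""
    if speedup_ratio > 2.5:
        return "EXCELLENT"
    if speedup_ratio >= 1.5:
        return "GOOD"
    return "POOR"
```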
LibreChat/MCP workshops apply a 3× multiplier to question count — each user question may trigger 2–3 AI calls (tool invocation, result processing, response). Enable this with benchmark_model_supports_mcp: true.
# Example: test 60 attendees, 10 questions each, granite model
# Run from the RHDP catalog: Tests → LiteMaaS Benchmark CI
# Key parameters in AgnosticV common.yaml:
benchmark_conversations_count: 60    # simulated users (min 5, max 200)
benchmark_sessions_count: 1          # back-to-back sessions
benchmark_turns_count: 10            # questions per user (min 5, max 50)
benchmark_model_supports_mcp: false  # true = applies 3x multiplier
benchmark_model_granite_3_2_8b: true # select which model to test
The benchmark publishes its results as an HTML report on an OpenShift route:
# Get the report URL after benchmark completes
oc get route benchmark-report -n <guid> -o jsonpath='{.spec.host}'
| Attendees | Parallel workers |
|---|---|
| 1–30 | 2 |
| 31–60 | 4 |
| 61–100 | 8 |
| 101+ | 16 |
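The worker-pool sizing above is a simple step function. The following sketch mirrors the table for illustration; it is not the role's actual code:

```python
def parallel_workers(attendees: int) -> int:
    """Benchmark worker pool size for a given attendee count (per the table above)."""
    if attendees <= 30:
        return 2
    if attendees <= 60:
        return 4
    if attendees <= 100:
        return 8
    return 16
```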
A shell script runs daily at 2 AM on the bastion host. It deletes expired and old virtual keys from LiteMaaS and keeps the LiteMaaS database in sync.
| Step | Action |
|---|---|
| 1 | Fetches all virtual keys from LiteMaaS API (paginated) |
| 2 | Identifies expired keys (expires < now) and old keys (created > 30 days ago) |
| 3 | Deletes each key from LiteMaaS via POST /key/delete |
| 4 | Marks matching api_keys record inactive in LiteMaaS DB |
| 5 | Final sweep — marks any remaining orphaned api_keys records inactive |
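The selection logic in steps 1–2 can be sketched as follows. Field names such as expires and created_at are assumptions about the key records, not copied from the script:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)

def keys_to_delete(keys: list[dict], now: datetime) -> list[str]:
    """Return IDs of keys that are expired, or created more than 30 days ago."""
    doomed = []
    for key in keys:
        expired = key.get("expires") is not None and key["expires"] < now
        too_old = now - key["created_at"] > MAX_AGE
        if expired or too_old:
            doomed.append(key["id"])
    return doomed
```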
# Run from workstation — installs script on bastion and creates crontab entry
./setup-key-cleanup-cronjob.sh litellm-rhpds
# Check the cronjob is scheduled
ssh bastion 'sudo crontab -l | grep cleanup'

# View recent cleanup runs
ssh bastion 'sudo tail -100 /var/log/litemaas-key-cleanup.log'

# Run manually
ssh bastion 'sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh'
The setup script requires oc to be installed and logged in on the bastion host.