Grafana dashboards, ServiceMonitors, usage analytics, and automated key cleanup
Grafana is deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics Ansible role. It monitors vLLM model performance, GPU utilization, and KServe InferenceService health using OpenShift User Workload Monitoring as the data source.
# Get Grafana route
oc get route grafana-route -n llm-hosting

# Get admin credentials
oc get secret -n llm-hosting | grep grafana-admin
oc get secret <grafana-admin-secret> -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d
| Component | Status | Age | Details |
|---|---|---|---|
| grafana-deployment | Running | 94d | 2/2 containers (Grafana + sidecar dashboard loader) |
| grafana-operator-controller-manager-v5 | Running | 85d | Grafana Operator v5 — manages GrafanaInstance CRs |
Each model predictor service has a corresponding ServiceMonitor that tells OpenShift's Prometheus to scrape its /metrics endpoint on port 8080.
| ServiceMonitor | Age | Scrapes |
|---|---|---|
| granite-3-2-8b-instruct-metrics | 175d | vLLM metrics — throughput, TTFT, queue depth, KV cache |
| llama-scout-17b-metrics | 99d | vLLM metrics |
| granite-4-0-h-tiny-metrics | 93d | vLLM metrics |
| codellama-7b-instruct-metrics | 99d | vLLM metrics |
| llama-guard-3-1b-metrics | 114d | vLLM metrics |
| nomic-embed-text-v1-5-metrics | 175d | OpenVino metrics |
| granite-docling-258m-metrics | 17d | vLLM metrics |
| vllm-models | 94d | Catch-all vLLM ServiceMonitor |
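Each of these follows the same shape; a minimal sketch of one ServiceMonitor is shown below. The label selector and port name are assumptions for illustration — check the actual predictor Service for the real values, or the manifests bundled in the role:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: granite-3-2-8b-instruct-metrics
  namespace: llm-hosting
spec:
  selector:
    matchLabels:
      app: granite-3-2-8b-instruct   # assumed label on the predictor Service
  endpoints:
    - targetPort: 8080               # vLLM serves /metrics on the same port as inference
      path: /metrics
      interval: 30s
```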
# Verify ServiceMonitors
oc get servicemonitor -n llm-hosting

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | grep vllm_requests
Key vLLM metrics:

- rate(vllm:request_success_total[5m])
- vllm:time_to_first_token_seconds_bucket
- vllm:num_requests_waiting
- vllm:gpu_cache_usage_perc
- rate(vllm:generation_tokens_total[5m])
- vllm:num_requests_running

Key GPU (DCGM) metrics:

- DCGM_FI_DEV_GPU_UTIL
- DCGM_FI_DEV_FB_USED
- DCGM_FI_DEV_GPU_TEMP
- DCGM_FI_DEV_POWER_USAGE

Dashboards are bundled in the ocp4_workload_rhoai_metrics role under files/grafana/ — no external downloads. Three dashboards are deployed:
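TTFT percentile panels are typically derived from the bucket metric with histogram_quantile — a standard PromQL pattern. The query below is a representative sketch; the exact label grouping in the bundled dashboards may differ:

```promql
histogram_quantile(
  0.95,
  sum by (le, model_name) (
    rate(vllm:time_to_first_token_seconds_bucket[5m])
  )
)
```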
| Dashboard | What it shows | Data source |
|---|---|---|
| vLLM Model Performance | Per-model throughput (req/s), Time to First Token (P50/P95/P99), scheduler queue depth, KV cache utilization, prompt/completion token histograms, generation tokens/sec | vLLM /metrics on port 8080 via ServiceMonitor |
| GPU Node Overview | Per-GPU utilization %, framebuffer memory used/free, temperature, power draw, NVLink bandwidth, PCIe throughput — across all model serving nodes | NVIDIA DCGM Exporter on port 9400 |
| OpenVino Server | Inference request count, request duration histogram, in-flight requests — for CPU-based models (e.g. Nomic embeddings) | OpenVino Model Server /metrics on port 8080 |
| Metric | Healthy range | Action if outside |
|---|---|---|
| TTFT P95 | <15s (≤30 users), <30s (≤60 users), <60s (120+) | Run benchmark, reduce attendees or add replicas |
| Queue depth | <5 waiting requests | Scale model predictor replicas |
| GPU memory used | <90% framebuffer | Reduce model context length or scale |
| KV cache utilization | <80% | Reduce concurrent requests or max tokens |
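When "scale model predictor replicas" is the remedy, that usually means raising the replica bounds on the KServe InferenceService. A minimal sketch of such a patch — the model name is an example, and spare GPU capacity must exist for the new replica to schedule:

```yaml
# Apply with:
#   oc patch inferenceservice granite-3-2-8b-instruct -n llm-hosting \
#     --type=merge --patch-file patch.yaml
spec:
  predictor:
    minReplicas: 2   # raise from 1 to add a second serving replica
    maxReplicas: 2
```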
Deployed via the ocp4_workload_rhoai_metrics role in the rhpds.litemaas collection.
# Deploy Grafana + ServiceMonitors to llm-hosting namespace
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
-e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
-e ocp4_workload_rhoai_metrics_enable_gpu=true
The LiteMaaS admin portal provides application-level analytics from LiteLLM spend logs — per-user spend, per-model call counts, per-API-key usage, trends, and exports. No Grafana needed for this.
| View | Data |
|---|---|
| System overview | Total requests, tokens, cost — with trend vs previous period |
| By user | Per-user spend, request count, top models used |
| By model | Per-model usage, cost breakdown |
| By provider | On-cluster vs external (WatsonX, Bedrock) split |
| By API key | Per-key usage and budget consumption |
| Export | CSV or JSON export with filters applied |
Historical data (past days) is cached permanently. Current day is refreshed every 5 minutes. Use Refresh Today in the UI to force an immediate update.
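The refresh policy above can be summarized as a small predicate — this is an illustrative sketch of the rule, not LiteMaaS source code:

```python
from datetime import date, datetime, timedelta

REFRESH_INTERVAL = timedelta(minutes=5)

def needs_refresh(day: date, last_fetched: datetime, now: datetime) -> bool:
    """Past days are cached permanently; only the current day is re-fetched."""
    if day < now.date():
        return False  # historical day: always serve from cache
    return now - last_fetched >= REFRESH_INTERVAL
```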
The ocp4_workload_litemaas_benchmark role (in rhpds.litemaas) runs a multi-turn conversation load test against LiteMaaS. Use it before Summit or large workshops to verify the platform can handle the expected load. Results are published as an HTML report on an OpenShift route.
Simulates N concurrent users each asking K questions in sequence. Each turn uses the same seed document — this tests prefix caching effectiveness (the speedup ratio). Reports P50/P95/P99 Time to First Token and requests per second.
| Attendees | P95 PASS | P95 WARN | P95 FAIL |
|---|---|---|---|
| 5–29 users | < 15s | 15–30s | > 30s |
| 30–59 users | < 25s | 25–50s | > 50s |
| 60–119 users | < 30s | 30–60s | > 60s |
| 120+ users | < 60s | 60–120s | > 120s |
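The table maps directly onto a small classifier. The sketch below mirrors the bands above for illustration; it is not the benchmark role's actual code:

```python
def ttft_status(attendees: int, p95_seconds: float) -> str:
    """Classify a benchmark P95 TTFT result per the attendee bands above."""
    bands = [          # (min_attendees, pass_limit_s, warn_limit_s)
        (120, 60, 120),
        (60, 30, 60),
        (30, 25, 50),
        (5, 15, 30),
    ]
    for min_att, pass_limit, warn_limit in bands:
        if attendees >= min_att:
            if p95_seconds < pass_limit:
                return "PASS"
            return "WARN" if p95_seconds <= warn_limit else "FAIL"
    raise ValueError("benchmark requires at least 5 simulated users")
```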
| Speedup ratio | Cache status | Meaning |
|---|---|---|
| > 2.5x | EXCELLENT | Later turns significantly faster than first — prefix caching working well |
| 1.5–2.5x | GOOD | Acceptable for workshops |
| < 1.5x | POOR | Check if model supports prefix caching |
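The speedup verdict follows the same pattern — a sketch of the thresholds above, not the role's implementation:

```python
def cache_status(speedup_ratio: float) -> str:
    """Map first-turn vs later-turn speedup ratio to a prefix-cache verdict."""
    if speedup_ratio > 2.5:
        return "EXCELLENT"
    if speedup_ratio >= 1.5:
        return "GOOD"
    return "POOR"
```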
LibreChat/MCP workshops apply a 3× multiplier to question count — each user question may trigger 2–3 AI calls (tool invocation, result processing, response). Enable this with benchmark_model_supports_mcp: true.
# Example: test 60 attendees, 10 questions each, granite model
# Run from the RHDP catalog: Tests → LiteMaaS Benchmark CI
# Key parameters in AgnosticV common.yaml:
benchmark_conversations_count: 60    # simulated users (min 5, max 200)
benchmark_sessions_count: 1          # back-to-back sessions
benchmark_turns_count: 10            # questions per user (min 5, max 50)
benchmark_model_supports_mcp: false  # true = applies 3x multiplier
benchmark_model_granite_3_2_8b: true # select which model to test
The benchmark publishes its results as an HTML report on an OpenShift route:
# Get the report URL after benchmark completes
oc get route benchmark-report -n <guid> -o jsonpath='{.spec.host}'
| Attendees | Parallel workers |
|---|---|
| 1–30 | 2 |
| 31–60 | 4 |
| 61–100 | 8 |
| 101+ | 16 |
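The worker-pool sizing above is a simple step function. The following sketch mirrors the table for illustration; it is not the role's actual code:

```python
def parallel_workers(attendees: int) -> int:
    """Benchmark worker pool size for a given attendee count (per the table above)."""
    if attendees <= 30:
        return 2
    if attendees <= 60:
        return 4
    if attendees <= 100:
        return 8
    return 16
```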
A shell script runs daily at 2 AM on the bastion host. It deletes expired and old virtual keys from LiteMaaS and keeps the LiteMaaS database in sync.
| Step | Action |
|---|---|
| 1 | Fetches all virtual keys from LiteMaaS API (paginated) |
| 2 | Identifies expired keys (expires < now) and old keys (created > 30 days ago) |
| 3 | Deletes each key from LiteMaaS via POST /key/delete |
| 4 | Marks matching api_keys record inactive in LiteMaaS DB |
| 5 | Final sweep — marks any remaining orphaned api_keys records inactive |
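The selection logic in steps 1–2 can be sketched as follows. Field names such as expires and created_at are assumptions about the key records, not copied from the script:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)

def keys_to_delete(keys: list[dict], now: datetime) -> list[str]:
    """Return IDs of keys that are expired, or created more than 30 days ago."""
    doomed = []
    for key in keys:
        expired = key.get("expires") is not None and key["expires"] < now
        too_old = now - key["created_at"] > MAX_AGE
        if expired or too_old:
            doomed.append(key["id"])
    return doomed
```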
# Run from workstation — installs script on bastion and creates crontab entry
./setup-key-cleanup-cronjob.sh litellm-rhpds
# Check the cronjob is scheduled
ssh bastion 'sudo crontab -l | grep cleanup'

# View recent cleanup runs
ssh bastion 'sudo tail -100 /var/log/litemaas-key-cleanup.log'

# Run manually
ssh bastion 'sudo /usr/local/bin/cleanup-litemaas-keys-litellm-rhpds.sh'
The setup script requires oc to be installed and logged in on the bastion host.