Model as a Service for Red Hat Demo Platform
Infrastructure observability for LiteMaaS — model serving metrics, GPU utilization, vLLM performance, and Grafana dashboards.
LiteMaaS observability is implemented at two levels:

- **Application level**: the LiteMaaS admin portal reads usage data from LiteLLM's LiteLLM_SpendLogs table and aggregates it with intelligent day-by-day caching.
- **Infrastructure level**: the ocp4_workload_rhoai_metrics Ansible role deploys the Grafana operator and dashboards to monitor the underlying model serving infrastructure: vLLM queue depth, request rates, GPU utilization, DCGM exporter metrics, and OpenVINO throughput.
Grafana and its operator are deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics role.
The production deployment has been running for 94 days.
| Pod | Status | Containers | Role |
|---|---|---|---|
| grafana-deployment-* | Running (94d) | 2/2 | Grafana server + sidecar dashboard loader |
| grafana-operator-controller-manager-v5-* | Running (85d) | 1/1 | Grafana Operator v5 controller |
```shell
# Get the Grafana route
oc get route -n llm-hosting | grep grafana

# Or check all services and find the grafana service
oc get svc -n llm-hosting | grep grafana

# Grafana service is on port 3000
# Service: grafana-service (ClusterIP 172.30.169.229) ports: 9091/TCP, 3000/TCP

# The route should be exposed at a hostname in llm-hosting namespace
oc get route -n llm-hosting
```
The Grafana operator version 5 channel is used (ocp4_workload_rhoai_metrics_grafana_operator_channel: "v5").
The Grafana instance name is rhoai-grafana.
Grafana uses OpenShift user workload monitoring (UWM) as its Prometheus data source.
The ocp4_workload_rhoai_metrics role handles everything from enabling user workload monitoring
to deploying dashboards. Key defaults from roles/ocp4_workload_rhoai_metrics/defaults/main.yml:
| Variable | Default | Description |
|---|---|---|
| ocp4_workload_rhoai_metrics_enable_uwm | true | Enable OpenShift User Workload Monitoring (required for Grafana data source) |
| ocp4_workload_rhoai_metrics_uwm_retention | 7d | Prometheus data retention period |
| ocp4_workload_rhoai_metrics_enable_kserve | true | Enable ServiceMonitors for KServe InferenceServices |
| ocp4_workload_rhoai_metrics_kserve_namespace | redhat-ods-applications | KServe control plane namespace |
| ocp4_workload_rhoai_metrics_scrape_interval | 30s | Prometheus scrape interval for model endpoints |
| ocp4_workload_rhoai_metrics_scrape_timeout | 10s | Prometheus scrape timeout |
| ocp4_workload_rhoai_metrics_vllm_port | 8080 | vLLM metrics port |
| ocp4_workload_rhoai_metrics_vllm_path | /metrics | vLLM metrics endpoint path |
| ocp4_workload_rhoai_metrics_openvino_port | 8080 | OpenVINO metrics port |
| ocp4_workload_rhoai_metrics_enable_gpu | true | Enable NVIDIA DCGM Exporter metrics collection |
| ocp4_workload_rhoai_metrics_gpu_operator_namespace | nvidia-gpu-operator | GPU Operator namespace |
| ocp4_workload_rhoai_metrics_dcgm_port | 9400 | DCGM Exporter metrics port |
| ocp4_workload_rhoai_metrics_install_grafana_operator | true | Install Grafana Operator if not present |
| ocp4_workload_rhoai_metrics_grafana_operator_channel | v5 | Grafana Operator OLM channel |
| ocp4_workload_rhoai_metrics_grafana_instance_name | rhoai-grafana | Name of the GrafanaInstance custom resource |
| ocp4_workload_rhoai_metrics_grafana_overlay | overlays/grafana-uwm-user-app | Kustomize overlay for dashboard deployment |
| ocp4_workload_rhoai_metrics_runtimes | [vllm, openvino] | Model runtimes to create ServiceMonitors for |
```shell
# Deploy RHOAI metrics and Grafana dashboards
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e ocp4_workload_rhoai_metrics_enable_gpu=true

# Monitor specific models only
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e '{"ocp4_workload_rhoai_metrics_models": ["llama-scout-17b", "granite-3-2-8b-instruct"]}'
```
Dashboards are bundled directly in the role's files/grafana/ directory — no external git clone
is needed. They are deployed via Kustomize using the grafana-uwm-user-app overlay.
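For reference, Grafana Operator v5 loads dashboards through GrafanaDashboard custom resources. The sketch below shows the general shape of such a resource; the dashboard name, instance labels, and JSON payload are illustrative, not copied from the role's files:

```yaml
# Illustrative sketch of a Grafana Operator v5 dashboard resource.
# Names and labels are assumptions, not the role's actual values.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: vllm-overview            # hypothetical dashboard name
  namespace: llm-hosting
spec:
  # Selects which Grafana instance(s) load this dashboard;
  # the label here is assumed to match the rhoai-grafana instance.
  instanceSelector:
    matchLabels:
      dashboards: rhoai-grafana
  json: |
    { "title": "vLLM Overview", "panels": [] }
```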
Monitors all vLLM-based KServe predictors in the llm-hosting namespace.
| Panel | Metric |
|---|---|
| Requests per second per model | rate(vllm:request_success_total[5m]) |
| Latency for first token generation | vllm:time_to_first_token_seconds_bucket |
| Generation throughput | rate(vllm:generation_tokens_total[5m]) |
| Requests waiting in vLLM scheduler queue | vllm:num_requests_waiting |
| GPU memory used for key/value cache | vllm:gpu_cache_usage_perc |
| Distribution of input token counts | vllm:request_prompt_tokens_bucket |
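The `_bucket` metrics above are Prometheus histograms, which dashboards typically render with histogram_quantile(). As a hedged sketch (the model_name label is an assumption; vLLM label names vary by version):

```promql
# p95 time to first token per model over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)
```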
Monitors NVIDIA GPU resources across all model serving nodes.
| Panel | Metric |
|---|---|
| Percentage of time GPU is executing kernels | DCGM_FI_DEV_GPU_UTIL |
| Framebuffer memory in use | DCGM_FI_DEV_FB_USED |
| Device temperature in Celsius | DCGM_FI_DEV_GPU_TEMP |
| Current power consumption in Watts | DCGM_FI_DEV_POWER_USAGE |
| Multi-GPU communication bandwidth | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL |
| Data transfer rate over PCIe bus | DCGM_FI_DEV_PCIE_TX_THROUGHPUT |
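These gauges can be aggregated per node or per GPU in panel queries. A rough sketch, assuming the label names the DCGM exporter emits by default (Hostname, gpu); your exporter configuration may relabel these:

```promql
# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# Framebuffer memory used per individual GPU, in MiB
sum by (Hostname, gpu) (DCGM_FI_DEV_FB_USED)
```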
Monitors OpenVINO Model Server predictors (used for nomic-embed-text-v1-5 and similar CPU-based models).
| Panel | Metric |
|---|---|
| Total inference requests served | ovms_requests_success |
| Request processing time histogram | ovms_request_time_us_bucket |
| In-flight requests being processed | ovms_current_requests |
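Note that the OVMS latency histogram is reported in microseconds (the `_us` suffix), so panel queries usually convert units. A hedged sketch of a p95 latency query (aggregation labels omitted for brevity):

```promql
# p95 request latency in milliseconds over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(ovms_request_time_us_bucket[5m]))) / 1000
```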
The role creates Prometheus ServiceMonitor resources that tell the OpenShift User Workload Monitoring (UWM) Prometheus instance how to scrape metrics from each model predictor service.
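A sketch of the kind of ServiceMonitor this produces, assembled from the role defaults listed earlier. The resource name and the service selector label are assumptions for illustration; the role's actual templates may differ:

```yaml
# Illustrative ServiceMonitor built from the role's defaults.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: granite-3-2-8b-instruct-vllm   # hypothetical name
  namespace: llm-hosting
spec:
  selector:
    matchLabels:
      app: granite-3-2-8b-instruct     # assumed label on the predictor service
  endpoints:
    - targetPort: 8080                 # ocp4_workload_rhoai_metrics_vllm_port
      path: /metrics                   # ocp4_workload_rhoai_metrics_vllm_path
      interval: 30s                    # ocp4_workload_rhoai_metrics_scrape_interval
      scrapeTimeout: 10s               # ocp4_workload_rhoai_metrics_scrape_timeout
```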
Production services with metrics endpoints in llm-hosting:
| Service | Metrics Port | Runtime | ClusterIP |
|---|---|---|---|
| granite-3-2-8b-instruct-metrics | 8080 | vLLM | 172.30.63.126 |
| granite-4-0-h-tiny-metrics | 8080 | vLLM | 172.30.203.138 |
| codellama-7b-instruct-metrics | 8080 | vLLM | 172.30.54.173 |
| llama-guard-3-1b-metrics | 8080 | vLLM | 172.30.56.6 |
| llama-scout-17b-metrics | 8080 | vLLM | 172.30.32.164 |
```shell
# List ServiceMonitors
oc get servicemonitor -n llm-hosting

# Check Prometheus can reach them
# In OpenShift UWM, targets are listed in the monitoring console
# Platform: Observe → Targets → filter by llm-hosting namespace

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | head -20
```
User Workload Monitoring (UWM) must be enabled for Grafana to have a Prometheus data source for model metrics.
The role enables it automatically if ocp4_workload_rhoai_metrics_enable_uwm: true.
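Enabling UWM is the standard OpenShift mechanism: a one-line setting in the cluster-monitoring-config ConfigMap, which the role applies for you when the flag is true:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```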
```shell
# Check the cluster monitoring config
oc get configmap cluster-monitoring-config \
  -n openshift-monitoring -o yaml | grep enableUserWorkload

# Expected output
# enableUserWorkload: true

# Check UWM pods are running
oc get pods -n openshift-user-workload-monitoring
```
```shell
# The role sets retention to 7d via ConfigMap patch
# To change it manually:
oc edit configmap user-workload-monitoring-config \
  -n openshift-user-workload-monitoring
```
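The retention setting lives in the standard UWM ConfigMap; the 7d value below corresponds to the role's ocp4_workload_rhoai_metrics_uwm_retention default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 7d
```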
```shell
# The role creates a GrafanaDataSource CR automatically
# Verify it exists:
oc get grafanadatasource -n llm-hosting

# If missing, the role's UWM overlay creates it pointing to:
# https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
```
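For orientation, a Grafana Operator v5 data source pointing at the UWM Thanos querier generally takes the shape below. This is a heavily hedged sketch: the resource name, instance labels, and token-injection mechanism are assumptions, not the role's actual manifest:

```yaml
# Illustrative GrafanaDatasource for the UWM Thanos querier.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: uwm-prometheus               # hypothetical name
  namespace: llm-hosting
spec:
  instanceSelector:
    matchLabels:
      dashboards: rhoai-grafana      # assumed instance labels
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
    jsonData:
      tlsSkipVerify: true
      httpHeaderName1: Authorization
    secureJsonData:
      # Placeholder: queries need a bearer token for a service account
      # with the cluster-monitoring-view role; the role injects the real value.
      httpHeaderValue1: Bearer ${TOKEN}
```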
```shell
# Look for a route in llm-hosting namespace
oc get routes -n llm-hosting

# Or port-forward to access locally
oc port-forward svc/grafana-service 3000:3000 -n llm-hosting &
open http://localhost:3000
```
The Grafana instance deployed by the role uses basic authentication by default. The admin credentials are set in the GrafanaInstance custom resource. Check the Grafana secret:
```shell
# Get Grafana admin credentials
oc get secret -n llm-hosting | grep grafana

# Retrieve the secret (name may vary)
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_USER}' | base64 -d
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d
```
After logging in, navigate to Dashboards → Browse. Dashboards deployed by the role appear in the RHOAI Monitoring folder.
Separately from Grafana, the LiteMaaS admin portal provides application-level analytics: per-user spend, per-model call counts, per-API-key usage, and trend analysis. This data comes from LiteLLM's spend logs.
The LiteMaaS admin portal is available at https://litellm-prod-frontend.apps.maas.redhatworkshops.io. The backend uses intelligent day-by-day incremental caching to minimize database load.
The LiteLLM admin UI also provides spend tracking directly against the LiteLLM_SpendLogs table:
```shell
# Access LiteLLM admin portal
# URL: https://litellm-prod.apps.maas.redhatworkshops.io
# Navigate to: Usage tab → Spend Logs

# Or query directly via API
LITELLM_KEY=$(oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n litellm-rhpds -o jsonpath='{.spec.host}')

# Get spend per key
curl "https://$ROUTE/spend/keys" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'

# Get spend per model
curl "https://$ROUTE/spend/models" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'
```