RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Grafana & RHOAI Metrics

Infrastructure observability for LiteMaaS — model serving metrics, GPU utilization, vLLM performance, and Grafana dashboards.

Overview

LiteMaaS observability is implemented at two levels:

  1. Application metrics (LiteMaaS built-in): The LiteMaaS backend provides usage analytics per user, per model, and per API key through its admin UI. This data comes from LiteLLM's LiteLLM_SpendLogs table and is aggregated with intelligent day-by-day caching.
  2. Infrastructure metrics (RHOAI + Grafana): The ocp4_workload_rhoai_metrics Ansible role deploys Grafana operator and dashboards to monitor the underlying model serving infrastructure — vLLM queue depth, request rates, GPU utilization, DCGM exporter metrics, and OpenVino throughput. This is deployed in the llm-hosting namespace.
graph LR
    A[LiteLLM Proxy] -->|spend logs| B[PostgreSQL]
    B --> C[LiteMaaS Backend]
    C --> D[Admin Usage Dashboard<br/>per user / model / key]
    E[vLLM Predictors] -->|/metrics port 8080| F[ServiceMonitor]
    G[OpenVino Predictors] -->|/metrics port 8080| F
    H[DCGM Exporter] -->|port 9400| F
    F --> I[OpenShift UWM<br/>Prometheus]
    I --> J[Grafana<br/>llm-hosting ns]
    J --> K[vLLM Dashboard]
    J --> L[GPU Dashboard]
    J --> M[OpenVino Dashboard]
    style D fill:#f0f4ff,stroke:#0066cc
    style J fill:#f9a825,stroke:#f9a825,color:#000

Production Grafana Instance

Grafana is deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics role. At the time this snapshot was taken, the production deployment had been running for 94 days.

Running Components (from oc get pods -n llm-hosting)

| Pod | Status | Containers | Role |
|---|---|---|---|
| grafana-deployment-* | Running (94d) | 2/2 | Grafana server + sidecar dashboard loader |
| grafana-operator-controller-manager-v5-* | Running (85d) | 1/1 | Grafana Operator v5 controller |

Access Grafana

# Get the Grafana route
oc get route -n llm-hosting | grep grafana

# Or check all services and find the grafana service
oc get svc -n llm-hosting | grep grafana

# Grafana service is on port 3000
# Service: grafana-service (ClusterIP 172.30.169.229) ports: 9091/TCP, 3000/TCP
# The route should be exposed at a hostname in llm-hosting namespace
oc get route -n llm-hosting

The Grafana operator version 5 channel is used (ocp4_workload_rhoai_metrics_grafana_operator_channel: "v5"). The Grafana instance name is rhoai-grafana. Grafana uses OpenShift user workload monitoring (UWM) as its Prometheus data source.
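Under the Grafana Operator v5 API, the instance the role creates plausibly looks like the sketch below. The name and namespace come from this document; the label and config values are illustrative assumptions, not the role's actual manifest:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: rhoai-grafana
  namespace: llm-hosting
  labels:
    dashboards: rhoai-grafana     # assumption: label matched by dashboard/datasource instanceSelectors
spec:
  config:
    auth:
      disable_login_form: "false"
    security:
      admin_user: admin           # illustrative; real credentials live in a Secret
```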

RHOAI Metrics Role Configuration

The ocp4_workload_rhoai_metrics role handles everything from enabling user workload monitoring to deploying dashboards. Key defaults from roles/ocp4_workload_rhoai_metrics/defaults/main.yml:

| Variable | Default | Description |
|---|---|---|
| ocp4_workload_rhoai_metrics_enable_uwm | true | Enable OpenShift User Workload Monitoring (required for Grafana data source) |
| ocp4_workload_rhoai_metrics_uwm_retention | 7d | Prometheus data retention period |
| ocp4_workload_rhoai_metrics_enable_kserve | true | Enable ServiceMonitors for KServe InferenceServices |
| ocp4_workload_rhoai_metrics_kserve_namespace | redhat-ods-applications | KServe control plane namespace |
| ocp4_workload_rhoai_metrics_scrape_interval | 30s | Prometheus scrape interval for model endpoints |
| ocp4_workload_rhoai_metrics_scrape_timeout | 10s | Prometheus scrape timeout |
| ocp4_workload_rhoai_metrics_vllm_port | 8080 | vLLM metrics port |
| ocp4_workload_rhoai_metrics_vllm_path | /metrics | vLLM metrics endpoint path |
| ocp4_workload_rhoai_metrics_openvino_port | 8080 | OpenVino metrics port |
| ocp4_workload_rhoai_metrics_enable_gpu | true | Enable NVIDIA DCGM Exporter metrics collection |
| ocp4_workload_rhoai_metrics_gpu_operator_namespace | nvidia-gpu-operator | GPU Operator namespace |
| ocp4_workload_rhoai_metrics_dcgm_port | 9400 | DCGM Exporter metrics port |
| ocp4_workload_rhoai_metrics_install_grafana_operator | true | Install Grafana Operator if not present |
| ocp4_workload_rhoai_metrics_grafana_operator_channel | v5 | Grafana Operator OLM channel |
| ocp4_workload_rhoai_metrics_grafana_instance_name | rhoai-grafana | Name of the GrafanaInstance custom resource |
| ocp4_workload_rhoai_metrics_grafana_overlay | overlays/grafana-uwm-user-app | Kustomize overlay for dashboard deployment |
| ocp4_workload_rhoai_metrics_runtimes | [vllm, openvino] | Model runtimes to create ServiceMonitors for |
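Overriding these defaults might look like the following vars file. The variable names come from the table above; the values and the file name metrics-vars.yml are illustrative:

```yaml
# Example override vars for the ocp4_workload_rhoai_metrics role
# (values illustrative); pass with: -e @metrics-vars.yml
ocp4_workload_rhoai_metrics_uwm_retention: 15d
ocp4_workload_rhoai_metrics_scrape_interval: 15s
ocp4_workload_rhoai_metrics_enable_gpu: false   # skip DCGM collection on CPU-only clusters
ocp4_workload_rhoai_metrics_runtimes:
  - vllm                                        # drop openvino if no OVMS predictors exist
```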

Deploy the Metrics Role

# Deploy RHOAI metrics and Grafana dashboards
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e ocp4_workload_rhoai_metrics_enable_gpu=true

# Monitor specific models only
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e '{"ocp4_workload_rhoai_metrics_models": ["llama-scout-17b", "granite-3-2-8b-instruct"]}'

Grafana Dashboards

Dashboards are bundled directly in the role's files/grafana/ directory — no external git clone is needed. They are deployed via Kustomize using the grafana-uwm-user-app overlay.

vLLM Performance Dashboard

Monitors all vLLM-based KServe predictors in the llm-hosting namespace.

| Panel | Description | Metric / Query |
|---|---|---|
| Request Throughput | Requests per second per model | rate(vllm:request_success_total[5m]) |
| Time to First Token (TTFT) | Latency for first token generation | vllm:time_to_first_token_seconds_bucket |
| Tokens Per Second | Generation throughput | rate(vllm:generation_tokens_total[5m]) |
| Queue Depth | Requests waiting in vLLM scheduler queue | vllm:num_requests_waiting |
| KV Cache Utilization | GPU memory used for key/value cache | vllm:gpu_cache_usage_perc |
| Prompt Token Length | Distribution of input token counts | vllm:request_prompt_tokens_bucket |
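For panels built on histogram buckets such as TTFT, a typical Grafana query computes a percentile with histogram_quantile. This is a sketch; the model_name grouping label is an assumption about the vLLM metric labels in this deployment:

```promql
# p95 time-to-first-token per model over the last 5 minutes (illustrative)
histogram_quantile(
  0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name)
)
```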

GPU Dashboard (DCGM Exporter)

Monitors NVIDIA GPU resources across all model serving nodes.

| Panel | Description | Metric |
|---|---|---|
| GPU Utilization | Percentage of time GPU is executing kernels | DCGM_FI_DEV_GPU_UTIL |
| GPU Memory Used | Framebuffer memory in use | DCGM_FI_DEV_FB_USED |
| GPU Temperature | Device temperature in Celsius | DCGM_FI_DEV_GPU_TEMP |
| Power Draw | Current power consumption in Watts | DCGM_FI_DEV_POWER_USAGE |
| NVLink Bandwidth | Multi-GPU communication bandwidth | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL |
| PCIe Throughput | Data transfer rate over PCIe bus | DCGM_FI_DEV_PCIE_TX_THROUGHPUT |
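The raw DCGM gauges are usually aggregated per node in panels. A sketch, assuming the exporter's default Hostname label is present:

```promql
# Mean GPU utilization per node (illustrative)
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# Hottest GPU per node, in Celsius (illustrative)
max by (Hostname) (DCGM_FI_DEV_GPU_TEMP)
```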

OpenVino Dashboard

Monitors OpenVino Model Server predictors (used for nomic-embed-text-v1-5 and similar CPU-based models).

| Panel | Description | Metric |
|---|---|---|
| Inference Requests | Total inference requests served | ovms_requests_success |
| Inference Duration | Request processing time histogram | ovms_request_time_us_bucket |
| Current Requests | In-flight requests being processed | ovms_current_requests |

ServiceMonitors — How Metrics are Scraped

The role creates Prometheus ServiceMonitor resources that tell the OpenShift User Workload Monitoring (UWM) Prometheus instance how to scrape metrics from each model predictor service.
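A sketch of the kind of ServiceMonitor the role generates for a vLLM predictor. The port, path, interval, and timeout come from the role defaults above; the resource name and the selector label are assumptions, not the role's actual output:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics                # illustrative name
  namespace: llm-hosting
spec:
  endpoints:
    - targetPort: 8080              # ocp4_workload_rhoai_metrics_vllm_port
      path: /metrics                # ocp4_workload_rhoai_metrics_vllm_path
      interval: 30s                 # scrape interval default
      scrapeTimeout: 10s            # scrape timeout default
  selector:
    matchLabels:
      # assumption: KServe predictor services carry an InferenceService label
      serving.kserve.io/inferenceservice: granite-3-2-8b-instruct
```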

Production services with metrics endpoints in llm-hosting:

| Service | Metrics Port | Runtime | ClusterIP |
|---|---|---|---|
| granite-3-2-8b-instruct-metrics | 8080 | vLLM | 172.30.63.126 |
| granite-4-0-h-tiny-metrics | 8080 | vLLM | 172.30.203.138 |
| codellama-7b-instruct-metrics | 8080 | vLLM | 172.30.54.173 |
| llama-guard-3-1b-metrics | 8080 | vLLM | 172.30.56.6 |
| llama-scout-17b-metrics | 8080 | vLLM | 172.30.32.164 |

Verify ServiceMonitors are Working

# List ServiceMonitors
oc get servicemonitor -n llm-hosting

# Check Prometheus can reach them
# In OpenShift UWM, targets are listed in the monitoring console
# Platform: Observe → Targets → filter by llm-hosting namespace

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | head -20

OpenShift User Workload Monitoring

User Workload Monitoring (UWM) must be enabled for Grafana to have a Prometheus data source for model metrics. The role enables it automatically if ocp4_workload_rhoai_metrics_enable_uwm: true.
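Enabling UWM amounts to setting the enableUserWorkload flag in the cluster monitoring ConfigMap (the key shown in the verification step below):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```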

Verify UWM is Enabled

# Check the cluster monitoring config
oc get configmap cluster-monitoring-config \
  -n openshift-monitoring -o yaml | grep enableUserWorkload

# Expected output
# enableUserWorkload: true

# Check UWM pods are running
oc get pods -n openshift-user-workload-monitoring

Configure Prometheus Retention

# The role sets retention to 7d via ConfigMap patch
# To change it manually:
oc edit configmap user-workload-monitoring-config \
  -n openshift-user-workload-monitoring
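The retention setting lives under the prometheus key of the UWM ConfigMap; the 7d value matches the role default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 7d
```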

Add Grafana Data Source Pointing to UWM Prometheus

# The role creates a GrafanaDataSource CR automatically
# Verify it exists:
oc get grafanadatasource -n llm-hosting

# If missing, the role's UWM overlay creates it pointing to:
# https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
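A sketch of a Grafana Operator v5 GrafanaDatasource CR pointing at the Thanos querier. The URL comes from this document; the datasource name, instanceSelector label, and auth wiring (a Bearer token for a service account with the cluster-monitoring-view role) are assumptions:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: uwm-prometheus              # illustrative name
  namespace: llm-hosting
spec:
  instanceSelector:
    matchLabels:
      dashboards: rhoai-grafana     # assumption: label on the Grafana instance
  datasource:
    name: UWM Prometheus
    type: prometheus
    access: proxy
    url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
    jsonData:
      tlsSkipVerify: true
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: Bearer <serviceaccount-token>   # placeholder, not a real token
```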

Accessing Grafana

Find the Grafana URL

# Look for a route in llm-hosting namespace
oc get routes -n llm-hosting

# Or port-forward to access locally
oc port-forward svc/grafana-service 3000:3000 -n llm-hosting &
open http://localhost:3000

Authentication

The Grafana instance deployed by the role uses basic authentication by default. The admin credentials are set in the GrafanaInstance custom resource. Check the Grafana secret:

# Get Grafana admin credentials
oc get secret -n llm-hosting | grep grafana

# Retrieve the secret (name may vary)
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_USER}' | base64 -d
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d

Dashboard Navigation

After logging in, navigate to Dashboards → Browse. Dashboards deployed by the role appear in the RHOAI Monitoring folder.

LiteMaaS Built-in Usage Analytics

Separately from Grafana, the LiteMaaS admin portal provides application-level analytics: per-user spend, per-model call counts, per-API-key usage, and trend analysis. This data comes from LiteLLM's spend logs.

Access Admin Analytics

  1. Log in to the LiteMaaS frontend: https://litellm-prod-frontend.apps.maas.redhatworkshops.io
  2. Navigate to Admin → Analytics
  3. Filter by: date range, user, model, provider, or API key
  4. Export to CSV or JSON using the export button

Analytics Caching Architecture

The backend uses day-by-day incremental caching to minimize database load.
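The document does not spell out the caching rule, but a plausible sketch of day-by-day incremental caching is that aggregates for completed days are cached and reused, while the current (still-changing) day is always recomputed. The cache_state function below is a hypothetical illustration, not a LiteMaaS API:

```shell
# Hypothetical sketch of day-by-day cache invalidation (not LiteMaaS code):
# completed days are served from cache; only today's aggregate is refreshed.
cache_state() {
  # $1 = day in YYYY-MM-DD form
  today=$(date -u +%F)
  if [ "$1" = "$today" ]; then
    echo "refresh"      # today's spend logs may still change
  else
    echo "cache-hit"    # day is closed; serve the cached aggregate
  fi
}

cache_state 2024-01-15          # a long-closed day
cache_state "$(date -u +%F)"    # the current day
```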

LiteLLM Admin Portal Spend Tracking

The LiteLLM admin UI also provides spend tracking directly against the LiteLLM_SpendLogs table:

# Access LiteLLM admin portal
# URL: https://litellm-prod.apps.maas.redhatworkshops.io
# Navigate to: Usage tab → Spend Logs

# Or query directly via API
LITELLM_KEY=$(oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n litellm-rhpds -o jsonpath='{.spec.host}')

# Get spend per key
curl "https://$ROUTE/spend/keys" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'

# Get spend per model
curl "https://$ROUTE/spend/models" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'