Model as a Service for Red Hat Demo Platform
Infrastructure observability for LiteMaaS — model serving metrics, GPU utilization, vLLM performance, and Grafana dashboards.
LiteMaaS observability is implemented at two levels:

- **Application level**: the LiteMaaS admin portal reads usage data from LiteLLM's LiteLLM_SpendLogs table and aggregates it with intelligent day-by-day caching.
- **Infrastructure level**: the ocp4_workload_rhoai_metrics Ansible role deploys the Grafana operator and dashboards to monitor the underlying model serving infrastructure: vLLM queue depth, request rates, GPU utilization, DCGM exporter metrics, and OpenVINO throughput.
Grafana and its operator are deployed in the llm-hosting namespace by the ocp4_workload_rhoai_metrics role.
The production deployment has been running for 94 days.
| Pod | Status | Containers | Role |
|---|---|---|---|
| grafana-deployment-* | Running (94d) | 2/2 | Grafana server + sidecar dashboard loader |
| grafana-operator-controller-manager-v5-* | Running (85d) | 1/1 | Grafana Operator v5 controller |
```shell
# Get the Grafana route
oc get route -n llm-hosting | grep grafana

# Or check all services and find the grafana service
oc get svc -n llm-hosting | grep grafana

# Grafana service is on port 3000
# Service: grafana-service (ClusterIP 172.30.169.229) ports: 9091/TCP, 3000/TCP

# The route should be exposed at a hostname in llm-hosting namespace
oc get route -n llm-hosting
```
The Grafana operator version 5 channel is used (ocp4_workload_rhoai_metrics_grafana_operator_channel: "v5").
The Grafana instance name is rhoai-grafana.
Grafana uses OpenShift user workload monitoring (UWM) as its Prometheus data source.
The ocp4_workload_rhoai_metrics role handles everything from enabling user workload monitoring
to deploying dashboards. Key defaults from roles/ocp4_workload_rhoai_metrics/defaults/main.yml:
| Variable | Default | Description |
|---|---|---|
| ocp4_workload_rhoai_metrics_enable_uwm | true | Enable OpenShift User Workload Monitoring (required for Grafana data source) |
| ocp4_workload_rhoai_metrics_uwm_retention | 7d | Prometheus data retention period |
| ocp4_workload_rhoai_metrics_enable_kserve | true | Enable ServiceMonitors for KServe InferenceServices |
| ocp4_workload_rhoai_metrics_kserve_namespace | redhat-ods-applications | KServe control plane namespace |
| ocp4_workload_rhoai_metrics_scrape_interval | 30s | Prometheus scrape interval for model endpoints |
| ocp4_workload_rhoai_metrics_scrape_timeout | 10s | Prometheus scrape timeout |
| ocp4_workload_rhoai_metrics_vllm_port | 8080 | vLLM metrics port |
| ocp4_workload_rhoai_metrics_vllm_path | /metrics | vLLM metrics endpoint path |
| ocp4_workload_rhoai_metrics_openvino_port | 8080 | OpenVINO metrics port |
| ocp4_workload_rhoai_metrics_enable_gpu | true | Enable NVIDIA DCGM Exporter metrics collection |
| ocp4_workload_rhoai_metrics_gpu_operator_namespace | nvidia-gpu-operator | GPU Operator namespace |
| ocp4_workload_rhoai_metrics_dcgm_port | 9400 | DCGM Exporter metrics port |
| ocp4_workload_rhoai_metrics_install_grafana_operator | true | Install Grafana Operator if not present |
| ocp4_workload_rhoai_metrics_grafana_operator_channel | v5 | Grafana Operator OLM channel |
| ocp4_workload_rhoai_metrics_grafana_instance_name | rhoai-grafana | Name of the GrafanaInstance custom resource |
| ocp4_workload_rhoai_metrics_grafana_overlay | overlays/grafana-uwm-user-app | Kustomize overlay for dashboard deployment |
| ocp4_workload_rhoai_metrics_runtimes | [vllm, openvino] | Model runtimes to create ServiceMonitors for |
```shell
# Deploy RHOAI metrics and Grafana dashboards
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e ocp4_workload_rhoai_metrics_enable_gpu=true

# Monitor specific models only
ansible-playbook playbooks/deploy_rhoai_metrics.yml \
  -e ocp4_workload_rhoai_metrics_namespace=llm-hosting \
  -e '{"ocp4_workload_rhoai_metrics_models": ["llama-scout-17b", "granite-3-2-8b-instruct"]}'
```
Dashboards are bundled directly in the role's files/grafana/ directory — no external git clone
is needed. They are deployed via Kustomize using the grafana-uwm-user-app overlay.
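For reference, Grafana Operator v5 loads dashboards through GrafanaDashboard custom resources. The sketch below shows the general shape of such a resource; the dashboard name, instance labels, and JSON payload are illustrative, not copied from the role's files:

```yaml
# Illustrative sketch of a Grafana Operator v5 dashboard resource.
# Names and labels are assumptions, not the role's actual values.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: vllm-overview            # hypothetical dashboard name
  namespace: llm-hosting
spec:
  # Selects which Grafana instance(s) load this dashboard;
  # the label here is assumed to match the rhoai-grafana instance.
  instanceSelector:
    matchLabels:
      dashboards: rhoai-grafana
  json: |
    { "title": "vLLM Overview", "panels": [] }
```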
Monitors all vLLM-based KServe predictors in the llm-hosting namespace.
| Panel | Metric |
|---|---|
| Requests per second per model | rate(vllm:request_success_total[5m]) |
| Latency for first token generation | vllm:time_to_first_token_seconds_bucket |
| Generation throughput | rate(vllm:generation_tokens_total[5m]) |
| Requests waiting in vLLM scheduler queue | vllm:num_requests_waiting |
| GPU memory used for key/value cache | vllm:gpu_cache_usage_perc |
| Distribution of input token counts | vllm:request_prompt_tokens_bucket |
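The `_bucket` metrics above are Prometheus histograms, which dashboards typically render with histogram_quantile(). As a hedged sketch (the model_name label is an assumption; vLLM label names vary by version):

```promql
# p95 time to first token per model over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)
```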
Monitors NVIDIA GPU resources across all model serving nodes.
| Panel | Metric |
|---|---|
| Percentage of time GPU is executing kernels | DCGM_FI_DEV_GPU_UTIL |
| Framebuffer memory in use | DCGM_FI_DEV_FB_USED |
| Device temperature in Celsius | DCGM_FI_DEV_GPU_TEMP |
| Current power consumption in Watts | DCGM_FI_DEV_POWER_USAGE |
| Multi-GPU communication bandwidth | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL |
| Data transfer rate over PCIe bus | DCGM_FI_DEV_PCIE_TX_THROUGHPUT |
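These gauges can be aggregated per node or per GPU in panel queries. A rough sketch, assuming the label names the DCGM exporter emits by default (Hostname, gpu); your exporter configuration may relabel these:

```promql
# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# Framebuffer memory used per individual GPU, in MiB
sum by (Hostname, gpu) (DCGM_FI_DEV_FB_USED)
```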
Monitors OpenVINO Model Server predictors (used for nomic-embed-text-v1-5 and similar CPU-based models).
| Panel | Metric |
|---|---|
| Total inference requests served | ovms_requests_success |
| Request processing time histogram | ovms_request_time_us_bucket |
| In-flight requests being processed | ovms_current_requests |
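Note that the OVMS latency histogram is reported in microseconds (the `_us` suffix), so panel queries usually convert units. A hedged sketch of a p95 latency query (aggregation labels omitted for brevity):

```promql
# p95 request latency in milliseconds over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(ovms_request_time_us_bucket[5m]))) / 1000
```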
The role creates Prometheus ServiceMonitor resources that tell the OpenShift User Workload Monitoring (UWM) Prometheus instance how to scrape metrics from each model predictor service.
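A sketch of the kind of ServiceMonitor this produces, assembled from the role defaults listed earlier. The resource name and the service selector label are assumptions for illustration; the role's actual templates may differ:

```yaml
# Illustrative ServiceMonitor built from the role's defaults.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: granite-3-2-8b-instruct-vllm   # hypothetical name
  namespace: llm-hosting
spec:
  selector:
    matchLabels:
      app: granite-3-2-8b-instruct     # assumed label on the predictor service
  endpoints:
    - targetPort: 8080                 # ocp4_workload_rhoai_metrics_vllm_port
      path: /metrics                   # ocp4_workload_rhoai_metrics_vllm_path
      interval: 30s                    # ocp4_workload_rhoai_metrics_scrape_interval
      scrapeTimeout: 10s               # ocp4_workload_rhoai_metrics_scrape_timeout
```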
Production services with metrics endpoints in llm-hosting:
| Service | Metrics Port | Runtime | ClusterIP |
|---|---|---|---|
| granite-3-2-8b-instruct-metrics | 8080 | vLLM | 172.30.63.126 |
| granite-4-0-h-tiny-metrics | 8080 | vLLM | 172.30.203.138 |
| codellama-7b-instruct-metrics | 8080 | vLLM | 172.30.54.173 |
| llama-guard-3-1b-metrics | 8080 | vLLM | 172.30.56.6 |
| llama-scout-17b-metrics | 8080 | vLLM | 172.30.32.164 |
```shell
# List ServiceMonitors
oc get servicemonitor -n llm-hosting

# Check Prometheus can reach them
# In OpenShift UWM, targets are listed in the monitoring console
# Platform: Observe → Targets → filter by llm-hosting namespace

# Test a metrics endpoint directly
oc exec -n llm-hosting \
  $(oc get pods -n llm-hosting -l app=granite-3-2-8b-instruct -o name | head -1) -- \
  curl -s http://localhost:8080/metrics | head -20
```
User Workload Monitoring (UWM) must be enabled for Grafana to have a Prometheus data source for model metrics.
The role enables it automatically if ocp4_workload_rhoai_metrics_enable_uwm: true.
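Enabling UWM is the standard OpenShift mechanism: a one-line setting in the cluster-monitoring-config ConfigMap, which the role applies for you when the flag is true:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```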
```shell
# Check the cluster monitoring config
oc get configmap cluster-monitoring-config \
  -n openshift-monitoring -o yaml | grep enableUserWorkload

# Expected output
# enableUserWorkload: true

# Check UWM pods are running
oc get pods -n openshift-user-workload-monitoring
```
```shell
# The role sets retention to 7d via ConfigMap patch
# To change it manually:
oc edit configmap user-workload-monitoring-config \
  -n openshift-user-workload-monitoring
```
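The retention setting lives in the standard UWM ConfigMap; the 7d value below corresponds to the role's ocp4_workload_rhoai_metrics_uwm_retention default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 7d
```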
```shell
# The role creates a GrafanaDataSource CR automatically
# Verify it exists:
oc get grafanadatasource -n llm-hosting

# If missing, the role's UWM overlay creates it pointing to:
# https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
```
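For orientation, a Grafana Operator v5 data source pointing at the UWM Thanos querier generally takes the shape below. This is a heavily hedged sketch: the resource name, instance labels, and token-injection mechanism are assumptions, not the role's actual manifest:

```yaml
# Illustrative GrafanaDatasource for the UWM Thanos querier.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: uwm-prometheus               # hypothetical name
  namespace: llm-hosting
spec:
  instanceSelector:
    matchLabels:
      dashboards: rhoai-grafana      # assumed instance labels
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
    jsonData:
      tlsSkipVerify: true
      httpHeaderName1: Authorization
    secureJsonData:
      # Placeholder: queries need a bearer token for a service account
      # with the cluster-monitoring-view role; the role injects the real value.
      httpHeaderValue1: Bearer ${TOKEN}
```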
```shell
# Look for a route in llm-hosting namespace
oc get routes -n llm-hosting

# Or port-forward to access locally
oc port-forward svc/grafana-service 3000:3000 -n llm-hosting &
open http://localhost:3000
```
The Grafana instance deployed by the role uses basic authentication by default. The admin credentials are set in the GrafanaInstance custom resource. Check the Grafana secret:
```shell
# Get Grafana admin credentials
oc get secret -n llm-hosting | grep grafana

# Retrieve the secret (name may vary)
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_USER}' | base64 -d
oc get secret rhoai-grafana-admin-credentials -n llm-hosting \
  -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d
```
After logging in, navigate to Dashboards → Browse. Dashboards deployed by the role appear in the RHOAI Monitoring folder.
Separately from Grafana, the LiteMaaS admin portal provides application-level analytics: per-user spend, per-model call counts, per-API-key usage, and trend analysis. This data comes from LiteLLM's spend logs.
The LiteMaaS admin portal is available at https://litellm-prod-frontend.apps.maas.redhatworkshops.io. The backend uses intelligent day-by-day incremental caching to minimize database load.
The LiteLLM admin UI also provides spend tracking directly against the LiteLLM_SpendLogs table:
```shell
# Access LiteLLM admin portal
# URL: https://litellm-prod.apps.maas.redhatworkshops.io
# Navigate to: Usage tab → Spend Logs

# Or query directly via API
LITELLM_KEY=$(oc get secret litellm-secret -n litellm-rhpds \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)
ROUTE=$(oc get route litellm-prod -n litellm-rhpds -o jsonpath='{.spec.host}')

# Get spend per key
curl "https://$ROUTE/spend/keys" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'

# Get spend per model
curl "https://$ROUTE/spend/models" \
  -H "Authorization: Bearer $LITELLM_KEY" | jq '.'
```