RHDP MaaS — Infrastructure & Hardware

Multi-cluster deployment on AWS (NVIDIA GPU) and on-prem bare metal (Intel Gaudi 3)

Cluster Overview

RHDP MaaS runs across two clusters — a primary AWS cluster for NVIDIA GPU workloads and an on-prem Intel Gaudi 3 cluster for bare-metal AI acceleration.

AWS Cluster — Primary (NVIDIA GPU)

Attribute             Value
Cluster               maas.redhatworkshops.io
Cloud                 AWS us-west-2
OpenShift version     4.17 (Kubernetes 1.31)
GPU namespace         llm-hosting — all model servers run here
Management namespace  litellm-rhpds — LiteMaaS platform runs here
Grafana               grafana-route-llm-hosting.apps.maas.redhatworkshops.io

Intel Gaudi Cluster — Rackspace Bare Metal

Attribute        Value
Cluster          maas00.rs-dfw3.infra.demo.redhat.com
Location         Rackspace DFW3 data center (bare metal)
Topology         Single Node OpenShift (SNO)
Server           Dell PowerEdge XE9680
AI accelerator   8× Intel Gaudi 3 (device ID 1060)
CPU              256 cores
RAM              ~2 TB
Driver version   Habana Labs 1.22.1
Firmware         hl-gaudi3-1.22.0-fw-61.3.2
GPU namespace    llm-hosting
Grafana          grafana-route-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com
Same access for all clusters: Models on the Intel Gaudi cluster are registered in the same LiteMaaS/LiteLLM instance as AWS models. Users access them via the same portal, same endpoint, and same sk-... virtual key — no difference in how you connect.
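Because both clusters sit behind the same LiteLLM endpoint, a client request looks identical either way: only the model field selects which cluster serves it. The sketch below builds an OpenAI-compatible chat request with the standard library; the sk-example key is a placeholder, and the model names are taken from the placement tables later in this document.

```python
import json

def build_chat_request(api_key: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion request (headers + JSON body)."""
    return {
        "headers": {
            "Authorization": f"Bearer {api_key}",  # same sk-... virtual key for both clusters
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,  # only this field picks an NVIDIA- vs Gaudi-hosted model
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Placeholder key; one AWS-hosted and one Gaudi-hosted model from this doc.
aws_req = build_chat_request("sk-example", "granite-3-2-8b-instruct", "Hello")
gaudi_req = build_chat_request("sk-example", "qwen3-14b", "Hello")

assert aws_req["headers"] == gaudi_req["headers"]  # auth and request shape are identical
```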

AWS — GPU Nodes

The AWS cluster has 10 GPU worker nodes across four AWS instance types, providing a mix of GPU VRAM capacities. All nodes carry the label node-role.kubernetes.io/worker-gpu.

Instance Types in Use

Instance type   Nodes   GPU per node     VRAM per node       vCPU   RAM      Model tier
g6e.12xlarge    2       4× NVIDIA L40S   192 GB (4× 48 GB)   48     384 GB   Large models
g6e.2xlarge     3       1× NVIDIA L40S   48 GB               8      62 GB    Mid-size models
g6.2xlarge      2       1× NVIDIA L4     24 GB               8      30 GB    Small models
g4dn.2xlarge    3       1× NVIDIA T4     16 GB               8      30 GB    Compact models
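As a quick sanity check on the fleet sizing, the per-type figures above can be totaled; this back-of-the-envelope sketch just restates the list:

```python
# (nodes, VRAM per node in GB), taken from the instance type list above
fleet = {
    "g6e.12xlarge": (2, 192),  # 4× L40S per node
    "g6e.2xlarge":  (3, 48),   # 1× L40S per node
    "g6.2xlarge":   (2, 24),   # 1× L4 per node
    "g4dn.2xlarge": (3, 16),   # 1× T4 per node
}

total_nodes = sum(nodes for nodes, _ in fleet.values())
total_vram = sum(nodes * vram for nodes, vram in fleet.values())
print(total_nodes, total_vram)  # → 10 624
```

In other words, the AWS fleet offers 624 GB of GPU VRAM across 10 GPU worker nodes.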

GPU Hardware Reference

GPU           VRAM          Architecture   FP16 TFLOPS   Best for
NVIDIA L40S   48 GB GDDR6   Ada Lovelace   362           Large models, multi-GPU inference
NVIDIA L4     24 GB GDDR6   Ada Lovelace   121           Mid-size models, efficient inference
NVIDIA T4     16 GB GDDR6   Turing         65            Compact models, CPU-offload capable

Intel Gaudi 3 — Hardware Details

The Rackspace cluster is a single Dell PowerEdge XE9680 server with 8 Intel Gaudi 3 AI accelerators managed by the Habana AI Operator on OpenShift.

Credentials: Contact Ashok for access credentials to the Intel Gaudi cluster.
Intel Gaudi 3 (on-prem SNO)
8 accelerators · 96 GB HBM2e each (768 GB total) · ~1,835 TFLOPS BF16 across 8 cards · NIC integrated on-chip · 900 GB/s HBM bandwidth

Hardware Comparison

Attribute           Intel Gaudi 3               NVIDIA L40S (AWS)
VRAM per card       96 GB HBM2e                 48 GB GDDR6
Cards in server     8                           up to 4 (g6e.12xlarge)
Total VRAM          768 GB                      192 GB
BF16 performance    ~1,835 TFLOPS (8 cards)     ~1,448 TFLOPS (4 cards)
Inference runtime   vLLM on Gaudi / TGI-Gaudi   vLLM
Operator            Habana AI Operator          NVIDIA GPU Operator

Gaudi-Specific Serving Runtimes

Runtime                   Type            Used for
vllm-gaudi-runtime        vLLM on Gaudi   Base runtime for Gaudi vLLM deployments
llama-4-scout-tgi-gaudi   TGI on Gaudi    Llama 4 Scout via HuggingFace TGI
gpt-oss-20b-tgi-gaudi     TGI on Gaudi    GPT-OSS 20B via HuggingFace TGI

Gaudi Model Placement

Model                          Runtime         Gaudi cards   Status
deepseek-r1-distill-qwen-14b   vLLM on Gaudi   TBD           Ready
qwen3-14b                      vLLM on Gaudi   TBD           Ready
Check live allocation:
oc get inferenceservice -n llm-hosting --context=gaudi

Model Placement — Current Allocation

Each model is pinned to a specific node via KServe's node selector. The table below shows where each model is currently running, how many GPUs it uses, and on which AWS instance type.

Model                     Instance type   GPU        VRAM           GPUs   Replicas   Runtime
llama-scout-17b           g6e.12xlarge    L40S × 4   192 GB total   4      2 (HA)     vLLM
granite-3-2-8b-instruct   g6e.2xlarge     L40S × 1   48 GB          1      1          vLLM
codellama-7b-instruct     g6e.2xlarge     L40S × 1   48 GB          1      1          vLLM
granite-4-0-h-tiny        g6.2xlarge      L4 × 1     24 GB          1      1          vLLM
granite-docling-258m      g6.2xlarge      L4 × 1     24 GB          1      1          Docling Serve
llama-guard-3-1b          g4dn.2xlarge    T4 × 1     16 GB          1      1          vLLM
nomic-embed-text-v1-5     g4dn.2xlarge    T4 × 1     16 GB          1      1          TEI (GPU)
Why 4 GPUs for llama-scout-17b? Despite being a 17B parameter model, Llama Scout has a 400K token context window. The full KV cache for long contexts requires significantly more VRAM than the model weights alone. 4× L40S (192 GB total) ensures both the weights and context can be served without degradation.
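The KV-cache arithmetic behind this can be sketched with the standard per-token formula (2 for K and V, times layers, KV heads, head dimension, and dtype width). The layer and head numbers below are illustrative assumptions, not published Llama 4 Scout specs; the point is that at 400K tokens the cache alone dwarfs a single 48 GB card:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size = 2 (K and V) × layers × kv_heads × head_dim × dtype bytes × tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed architecture values for illustration only (not official Scout numbers).
layers, kv_heads, head_dim = 48, 8, 128
ctx = 400_000  # 400K token context window

cache_gib = kv_cache_bytes(ctx, layers, kv_heads, head_dim) / 2**30
weights_gib = 17e9 * 2 / 2**30  # 17B params at 2 bytes (BF16)

print(f"weights ≈ {weights_gib:.0f} GiB, full-context KV cache ≈ {cache_gib:.0f} GiB")
# → weights ≈ 32 GiB, full-context KV cache ≈ 73 GiB
```

Under these assumptions, weights plus a full-context cache already exceed 100 GiB, well beyond a single 48 GB L40S, which is why the model spans 4 cards.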

How Models are Deployed — OpenShift AI & KServe

All model servers are deployed as KServe InferenceServices via OpenShift AI (RHOAI). Each model is a separate Kubernetes custom resource that KServe translates into a Deployment, Service, and optional HorizontalPodAutoscaler.

Deployment Stack

Layer               Component                         Role
Orchestration       OpenShift AI (RHOAI)              Manages KServe, model runtimes, GPU scheduling via NVIDIA GPU Operator
Serving framework   KServe                            Translates InferenceService CRs into pods, handles scaling and routing
Inference runtime   vLLM                              LLM inference engine with continuous batching, PagedAttention, prefix caching
Embedding runtime   TEI (Text Embeddings Inference)   GPU-accelerated embedding model serving
Docling runtime     Docling Serve                     Document conversion service (PDF → Markdown/JSON)
GPU driver          NVIDIA GPU Operator               Manages drivers, DCGM exporter, device plugin across all GPU nodes

InferenceService Structure

Each model is defined as an InferenceService custom resource. Here is an annotated example:

# Example: granite-3-2-8b-instruct InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3-2-8b-instruct
  namespace: llm-hosting
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM               # tells KServe which runtime to use
      runtime: granite-3-2-8b-instruct  # references a ServingRuntime CR
      storageUri: oci://...      # model weights location (OCI registry or S3)
      resources:                 # resources sit on the model container
        limits:
          nvidia.com/gpu: "1"    # number of GPUs
          memory: 45Gi
          cpu: "6"
    nodeSelector:
      node.kubernetes.io/instance-type: g6e.2xlarge  # pin to L40S node

ServingRuntime

Each InferenceService references a ServingRuntime CR that defines the container image, environment variables, and default resource limits for that model family.
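For orientation, a ServingRuntime in this setup might look roughly like the sketch below. The image, args, and values are illustrative assumptions rather than the actual manifest; inspect the real CRs with the oc commands in this section.

```yaml
# Illustrative ServingRuntime sketch (not the live manifest)
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: granite-3-2-8b-instruct
  namespace: llm-hosting
spec:
  supportedModelFormats:
    - name: vLLM                 # matched against the InferenceService modelFormat
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/example/vllm:latest   # placeholder image
      args:
        - --model=/mnt/models              # weights mounted by KServe
        - --max-model-len=8192             # illustrative value
      resources:
        limits:
          nvidia.com/gpu: "1"
```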

# List all serving runtimes in the namespace
oc get servingruntimes -n llm-hosting

# Inspect a specific runtime
oc get servingruntime granite-3-2-8b-instruct -n llm-hosting -o yaml

GitOps Management

All InferenceService and ServingRuntime manifests are managed via the rhpds/models-aas repository and synced automatically. See the GitOps section below for full details.

GitOps — Model Deployment Repository

Model servers are managed entirely through a dedicated GitOps repository, separate from the LiteMaaS platform repo. Any change committed to the repo is automatically applied to the cluster — no manual oc apply needed.

ArgoCD Application

Field                 Value
Application name      model-serving
ArgoCD namespace      openshift-gitops
Git repository        github.com/rhpds/models-aas
Path in repo          model-serving/
Target revision       v1.0.5 (pinned tag)
Destination cluster   https://kubernetes.default.svc (in-cluster)
Sync policy           Automated — commits sync without manual trigger
Health status         Healthy

How It Works

ArgoCD watches the model-serving/ directory at the pinned tag. When a commit changes a manifest, ArgoCD detects the drift and applies the changes automatically. The Validate=false annotation allows KServe CRDs to be applied even if CRD versions temporarily mismatch during cluster upgrades.
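In manifest terms, that behavior corresponds roughly to the sketch below. Field values come from the table above, but this is illustrative rather than the live Application resource.

```yaml
# Illustrative sketch of the model-serving ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving
  namespace: openshift-gitops
spec:
  destination:
    server: https://kubernetes.default.svc   # in-cluster
  source:
    repoURL: https://github.com/rhpds/models-aas
    path: model-serving/
    targetRevision: v1.0.5                   # pinned tag, not main
  syncPolicy:
    automated: {}                            # commits sync without a manual trigger
    syncOptions:
      - Validate=false                       # tolerate KServe CRD version skew
```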

Checking Sync Status

# Check the model-serving app status
oc get applications.argoproj.io model-serving \
  -n openshift-gitops \
  -o custom-columns="NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status"

# See all managed resources (InferenceServices, ServingRuntimes, etc.)
oc get applications.argoproj.io model-serving -n openshift-gitops \
  -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.name}{"\n"}{end}'

Adding or Updating a Model

# 1. Clone the models-aas repo
git clone https://github.com/rhpds/models-aas
cd models-aas

# 2. Add or edit the InferenceService in model-serving/

# 3. Commit and push — ArgoCD auto-syncs
git add model-serving/my-new-model.yaml
git commit -m "Add my-new-model InferenceService"
git push

# 4. Watch the sync
oc get applications.argoproj.io model-serving -n openshift-gitops -w
Pinned tag: The ArgoCD app tracks tag v1.0.5, not main. New commits only deploy if the tag is updated or the ArgoCD app's targetRevision is changed. Coordinate with the team before updating the revision.

Forcing a Manual Sync

# Trigger a sync if the app shows OutOfSync (force: false keeps a normal apply)
oc patch applications.argoproj.io model-serving -n openshift-gitops \
  --type merge -p '{"operation":{"sync":{"syncStrategy":{"apply":{"force":false}}}}}'

Operational Commands

Check all model pods and their nodes

oc get pods -n llm-hosting -o wide | grep predictor

Check GPU allocation per node

# Total GPU capacity across GPU nodes
oc get nodes -l 'node-role.kubernetes.io/worker-gpu' \
  -o custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels['node\.kubernetes\.io/instance-type'],GPU:.status.capacity['nvidia\.com/gpu']"

# GPU currently allocated (requested by pods)
oc describe node <node-name> | grep -A5 "Allocated resources"

Check InferenceService status

oc get inferenceservice -n llm-hosting

View model server logs

# Tail logs for a specific model
oc logs -n llm-hosting -l app=granite-3-2-8b-instruct --tail=50

# Check vLLM startup — shows model load time and VRAM usage
oc logs -n llm-hosting <predictor-pod-name> | grep -E "GPU|VRAM|loaded|error"

Scale a model (increase replicas)

# Scale via InferenceService (preferred — GitOps-managed)
oc patch inferenceservice llama-scout-17b -n llm-hosting \
  --type merge -p '{"spec":{"predictor":{"minReplicas":3}}}'

# Or scale the underlying deployment directly (not persisted in Git)
oc scale deployment llama-scout-17b-predictor -n llm-hosting --replicas=3