Multi-cluster deployment on AWS (NVIDIA GPU) and on-prem bare metal (Intel Gaudi 3)
RHDP MaaS runs across two clusters — a primary AWS cluster for NVIDIA GPU workloads and an on-prem Intel Gaudi 3 cluster for bare-metal AI acceleration.
AWS cluster:

| Attribute | Value |
|---|---|
| Cluster | maas.redhatworkshops.io |
| Cloud | AWS us-west-2 |
| OpenShift version | 4.17 (Kubernetes 1.31) |
| GPU namespace | llm-hosting — all model servers run here |
| Management namespace | litellm-rhpds — LiteMaaS platform runs here |
| Grafana | grafana-route-llm-hosting.apps.maas.redhatworkshops.io |
Rackspace Gaudi cluster:

| Attribute | Value |
|---|---|
| Cluster | maas00.rs-dfw3.infra.demo.redhat.com |
| Location | Rackspace DFW3 data center (bare metal) |
| Topology | Single Node OpenShift (SNO) |
| Server | Dell PowerEdge XE9680 |
| AI accelerator | 8× Intel Gaudi 3 (device ID 1060) |
| CPU | 256 cores |
| RAM | ~2 TB |
| Driver version | Habana Labs 1.22.1 |
| Firmware | hl-gaudi3-1.22.0-fw-61.3.2 |
| GPU namespace | llm-hosting |
| Grafana | grafana-route-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com |
Both clusters sit behind the same LiteLLM gateway, so clients authenticate with the same sk-... virtual key — no difference in how you connect.
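As a sketch, a chat completion looks the same regardless of which cluster serves the model. The base URL and key below are placeholders, not real credentials:

```shell
# Placeholder values — substitute your LiteMaaS gateway URL and sk-... key
BASE_URL="https://litellm.example.com"
API_KEY="sk-your-virtual-key"

# OpenAI-compatible chat completion payload
payload='{"model":"granite-3-2-8b-instruct","messages":[{"role":"user","content":"Hello"}]}'
echo "$payload"

# Send it to the gateway (commented out — requires live credentials):
# curl -s "$BASE_URL/v1/chat/completions" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```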
The AWS cluster has 9 GPU worker nodes across three AWS instance types, providing a mix of GPU VRAM capacities. All nodes have the label node-role.kubernetes.io/worker-gpu.
| GPU | VRAM | Architecture | FP16 TFLOPS | Best for |
|---|---|---|---|---|
| NVIDIA L40S | 48 GB GDDR6 | Ada Lovelace | 362 | Large models, multi-GPU inference |
| NVIDIA L4 | 24 GB GDDR6 | Ada Lovelace | 121 | Mid-size models, efficient inference |
| NVIDIA T4 | 16 GB GDDR6 | Turing | 65 | Compact models, CPU-offload capable |
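A rough rule of thumb for matching models to these cards (an approximation, not from the deployment docs): FP16/BF16 weights take about 2 bytes per parameter, before KV cache and activation overhead:

```shell
# FP16 weights ≈ 2 bytes per parameter; KV cache and activations need extra headroom
awk 'BEGIN {
  printf "8B model FP16 weights:  ~%.0f GiB\n", 8e9  * 2 / 2^30
  printf "17B model FP16 weights: ~%.0f GiB\n", 17e9 * 2 / 2^30
}'
```

This is why an 8B model fits comfortably on a single 48 GB L40S, while larger models need multiple cards once long-context KV cache and batching headroom are factored in.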
The Rackspace cluster is a single Dell PowerEdge XE9680 server with 8 Intel Gaudi 3 AI accelerators managed by the Habana AI Operator on OpenShift.
| Attribute | Intel Gaudi 3 | NVIDIA L40S (AWS) |
|---|---|---|
| VRAM per card | 96 GB HBM2e | 48 GB GDDR6 |
| Cards in server | 8 | up to 4 (g6e.12xlarge) |
| Total VRAM | 768 GB | 192 GB |
| BF16 performance | ~1,835 TFLOPS (8 cards) | ~1,448 TFLOPS (4 cards) |
| Inference runtime | vLLM on Gaudi / TGI-Gaudi | vLLM |
| Operator | Habana AI Operator | NVIDIA GPU Operator |
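The Habana device plugin advertises the cards as the extended resource `habana.ai/gaudi` (the plugin's standard resource name; the sample capacity block below is illustrative, not live cluster output):

```shell
# On the live cluster:
#   oc get node <node-name> -o jsonpath='{.status.capacity.habana\.ai/gaudi}'
# Illustrative capacity block for the SNO node:
cat <<'EOF' | grep -o '"habana.ai/gaudi": "[0-9]*"'
{"capacity": {"cpu": "256", "memory": "2Ti", "habana.ai/gaudi": "8"}}
EOF
```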
| Runtime | Type | Used for |
|---|---|---|
| vllm-gaudi-runtime | vLLM on Gaudi | Base runtime for Gaudi vLLM deployments |
| llama-4-scout-tgi-gaudi | TGI on Gaudi | Llama 4 Scout via HuggingFace TGI |
| gpt-oss-20b-tgi-gaudi | TGI on Gaudi | GPT-OSS 20B via HuggingFace TGI |
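As a sketch of what one of the Gaudi runtimes above might contain — the image reference and env values are illustrative assumptions, not the actual manifests (the real ones live in the rhpds/models-aas repo):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-gaudi-runtime
  namespace: llm-hosting
spec:
  supportedModelFormats:
    - name: vLLM            # matched against the InferenceService's modelFormat
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/example/vllm-gaudi:latest   # illustrative image reference
      env:
        - name: HABANA_VISIBLE_DEVICES
          value: "all"      # illustrative; exposes Gaudi devices to the container
      resources:
        limits:
          habana.ai/gaudi: "1"
```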
| Model | Runtime | Gaudi cards | Status |
|---|---|---|---|
| deepseek-r1-distill-qwen-14b | vLLM on Gaudi | TBD | Ready |
| qwen3-14b | vLLM on Gaudi | TBD | Ready |
```shell
# List Gaudi-hosted models (kubeconfig context "gaudi")
oc get inferenceservice -n llm-hosting --context=gaudi
```
Each model is pinned to a specific node via KServe's node selector. The table below shows where each model is currently running, how many GPUs it uses, and on which AWS instance type.
| Model | Instance Type | GPU | VRAM | GPUs allocated | Replicas | Runtime |
|---|---|---|---|---|---|---|
| llama-scout-17b | g6e.12xlarge | L40S × 4 | 192 GB total | 4 | 2 (HA) | vLLM |
| granite-3-2-8b-instruct | g6e.2xlarge | L40S × 1 | 48 GB | 1 | 1 | vLLM |
| codellama-7b-instruct | g6e.2xlarge | L40S × 1 | 48 GB | 1 | 1 | vLLM |
| granite-4-0-h-tiny | g6.2xlarge | L4 × 1 | 24 GB | 1 | 1 | vLLM |
| granite-docling-258m | g6.2xlarge | L4 × 1 | 24 GB | 1 | 1 | Docling Serve |
| llama-guard-3-1b | g4dn.2xlarge | T4 × 1 | 16 GB | 1 | 1 | vLLM |
| nomic-embed-text-v1-5 | g4dn.2xlarge | T4 × 1 | 16 GB | 1 | 1 | TEI (GPU) |
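Summing the table above (assuming each replica holds its own GPU allocation):

```shell
# llama-scout-17b: 4 GPUs x 2 replicas; the other six models: 1 GPU x 1 replica each
awk 'BEGIN { print 4*2 + 6*1, "GPUs allocated across all replicas" }'
```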
All model servers are deployed as KServe InferenceServices via OpenShift AI (RHOAI). Each model is a separate Kubernetes custom resource that KServe translates into a Deployment, Service, and optional HorizontalPodAutoscaler.
| Layer | Component | Role |
|---|---|---|
| Orchestration | OpenShift AI (RHOAI) | Manages KServe, model runtimes, GPU scheduling via NVIDIA GPU Operator |
| Serving framework | KServe | Translates InferenceService CRs into pods, handles scaling and routing |
| Inference runtime | vLLM | LLM inference engine with continuous batching, PagedAttention, prefix caching |
| Embedding runtime | TEI (Text Embeddings Inference) | GPU-accelerated embedding model serving |
| Docling runtime | Docling Serve | Document conversion service (PDF → Markdown/JSON) |
| GPU driver | NVIDIA GPU Operator | Manages drivers, DCGM exporter, device plugin across all GPU nodes |
Each model is defined as an InferenceService custom resource. Here is an annotated example:
```yaml
# Example: granite-3-2-8b-instruct InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3-2-8b-instruct
  namespace: llm-hosting
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                        # tells KServe which runtime to use
      runtime: granite-3-2-8b-instruct    # references a ServingRuntime CR
      storageUri: oci://...               # model weights location (OCI registry or S3)
      resources:
        limits:
          nvidia.com/gpu: "1"             # number of GPUs
          memory: 45Gi
          cpu: "6"
    nodeSelector:
      node.kubernetes.io/instance-type: g6e.2xlarge  # pin to L40S node
```

ServingRuntime

Each InferenceService references a ServingRuntime CR that defines the container image, environment variables, and default resource limits for that model family.

```shell
# List all serving runtimes in the namespace
oc get servingruntimes -n llm-hosting

# Inspect a specific runtime
oc get servingruntime granite-3-2-8b-instruct -n llm-hosting -o yaml
```

GitOps Management
All InferenceService and ServingRuntime manifests are managed via the rhpds/models-aas repository and synced automatically. See the GitOps section below for full details.
Model servers are managed entirely through a dedicated GitOps repository, separate from the LiteMaaS platform repo. Any change committed to the repo is automatically applied to the cluster — no manual oc apply needed.
| Field | Value |
|---|---|
| Application name | model-serving |
| ArgoCD namespace | openshift-gitops |
| Git repository | github.com/rhpds/models-aas |
| Path in repo | model-serving/ |
| Target revision | v1.0.5 (pinned tag) |
| Destination cluster | https://kubernetes.default.svc (in-cluster) |
| Sync policy | Automated — commits sync without manual trigger |
| Health status | Healthy |
ArgoCD watches the model-serving/ directory at the pinned tag. When a commit changes a manifest, ArgoCD detects the drift and applies the changes automatically. The Validate=false annotation allows KServe CRDs to be applied even if CRD versions temporarily mismatch during cluster upgrades.
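Concretely, this is ArgoCD's standard sync-option annotation on the affected manifests, shown here on a hypothetical InferenceService:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model            # hypothetical example
  namespace: llm-hosting
  annotations:
    argocd.argoproj.io/sync-options: Validate=false   # skip kubectl schema validation
```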
```shell
# Check the model-serving app status
oc get applications.argoproj.io model-serving \
  -n openshift-gitops \
  -o custom-columns="NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status"

# See all managed resources (InferenceServices, ServingRuntimes, etc.)
oc get applications.argoproj.io model-serving -n openshift-gitops \
  -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.name}{"\n"}{end}'
```
```shell
# 1. Clone the models-aas repo
git clone https://github.com/rhpds/models-aas
cd models-aas

# 2. Add or edit the InferenceService in model-serving/

# 3. Commit and push — ArgoCD auto-syncs
git add model-serving/my-new-model.yaml
git commit -m "Add my-new-model InferenceService"
git push

# 4. Watch the sync
oc get applications.argoproj.io model-serving -n openshift-gitops -w
```
The ArgoCD application tracks the pinned tag v1.0.5, not main. New commits deploy only when the tag is moved or the app's targetRevision is changed. Coordinate with the team before updating the revision.
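The tracked revision lives at `.spec.source.targetRevision` on the ArgoCD Application resource; the sample spec below is illustrative, not live cluster output:

```shell
# On the live cluster:
#   oc get applications.argoproj.io model-serving -n openshift-gitops \
#     -o jsonpath='{.spec.source.targetRevision}'
# Illustrative Application spec:
cat <<'EOF' | grep -o 'targetRevision: v[0-9.]*'
spec:
  source:
    repoURL: https://github.com/rhpds/models-aas
    path: model-serving/
    targetRevision: v1.0.5
EOF
```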
```shell
# Trigger a sync if the app shows OutOfSync (force:false keeps a normal, non-forced apply)
oc patch applications.argoproj.io model-serving -n openshift-gitops \
  --type merge -p '{"operation":{"sync":{"syncStrategy":{"apply":{"force":false}}}}}'
```
```shell
# List predictor pods with their node placement
oc get pods -n llm-hosting -o wide | grep predictor
```
```shell
# Total GPU capacity across GPU nodes
oc get nodes -l 'node-role.kubernetes.io/worker-gpu' \
  -o custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels['node\.kubernetes\.io/instance-type'],GPU:.status.capacity['nvidia\.com/gpu']"

# GPU currently allocated (requested by pods)
oc describe node <node-name> | grep -A5 "Allocated resources"
```
```shell
# List all InferenceServices and their readiness
oc get inferenceservice -n llm-hosting
```
```shell
# Tail logs for a specific model
oc logs -n llm-hosting -l app=granite-3-2-8b-instruct --tail=50

# Check vLLM startup — shows model load time and VRAM usage
oc logs -n llm-hosting <predictor-pod-name> | grep -E "GPU|VRAM|loaded|error"
```
```shell
# Scale via InferenceService (preferred — GitOps-managed)
oc patch inferenceservice llama-scout-17b -n llm-hosting \
  --type merge -p '{"spec":{"predictor":{"minReplicas":3}}}'

# Or scale the underlying deployment directly (not persisted in Git)
oc scale deployment llama-scout-17b-predictor -n llm-hosting --replicas=3
```