Multi-cluster deployment on AWS (NVIDIA GPU) and on-prem bare metal (Intel Gaudi 3)
RHDP MaaS runs across two clusters — a primary AWS cluster for NVIDIA GPU workloads and an on-prem Intel Gaudi 3 cluster for bare-metal AI acceleration.
AWS cluster:

| Attribute | Value |
|---|---|
| Cluster | maas.redhatworkshops.io |
| Cloud | AWS us-west-2 |
| OpenShift version | 4.17 (Kubernetes 1.31) |
| GPU namespace | llm-hosting — all model servers run here |
| Management namespace | litellm-rhpds — LiteMaaS platform runs here |
| Grafana | grafana-route-llm-hosting.apps.maas.redhatworkshops.io |
Rackspace Gaudi cluster:

| Attribute | Value |
|---|---|
| Cluster | maas00.rs-dfw3.infra.demo.redhat.com |
| Location | Rackspace DFW3 data center (bare metal) |
| Topology | Single Node OpenShift (SNO) |
| Server | Dell PowerEdge XE9680 |
| AI accelerator | 8× Intel Gaudi 3 (device ID 1060) |
| CPU | 256 cores |
| RAM | ~2 TB |
| Driver version | Habana Labs 1.22.1 |
| Firmware | hl-gaudi3-1.22.0-fw-61.3.2 |
| GPU namespace | llm-hosting |
| Grafana | grafana-route-llm-hosting.apps.maas00.rs-dfw3.infra.demo.redhat.com |
Both clusters sit behind the same LiteLLM gateway, so clients authenticate with the same sk-... virtual key — no difference in how you connect.
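As a sketch, a chat completion looks the same regardless of which cluster serves the model. The base URL and key below are placeholders, not real credentials:

```shell
# Placeholder values — substitute your LiteMaaS gateway URL and sk-... key
BASE_URL="https://litellm.example.com"
API_KEY="sk-your-virtual-key"

# OpenAI-compatible chat completion payload
payload='{"model":"granite-3-2-8b-instruct","messages":[{"role":"user","content":"Hello"}]}'
echo "$payload"

# Send it to the gateway (commented out — requires live credentials):
# curl -s "$BASE_URL/v1/chat/completions" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```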
The AWS cluster has 9 GPU worker nodes across three AWS instance types, providing a mix of GPU VRAM capacities. All nodes have the label node-role.kubernetes.io/worker-gpu.
| GPU | VRAM | Architecture | FP16 TFLOPS | Best for |
|---|---|---|---|---|
| NVIDIA L40S | 48 GB GDDR6 | Ada Lovelace | 362 | Large models, multi-GPU inference |
| NVIDIA L4 | 24 GB GDDR6 | Ada Lovelace | 121 | Mid-size models, efficient inference |
| NVIDIA T4 | 16 GB GDDR6 | Turing | 65 | Compact models, CPU-offload capable |
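A rough rule of thumb for matching models to these cards (an approximation, not from the deployment docs): FP16/BF16 weights take about 2 bytes per parameter, before KV cache and activation overhead:

```shell
# FP16 weights ≈ 2 bytes per parameter; KV cache and activations need extra headroom
awk 'BEGIN {
  printf "8B model FP16 weights:  ~%.0f GiB\n", 8e9  * 2 / 2^30
  printf "17B model FP16 weights: ~%.0f GiB\n", 17e9 * 2 / 2^30
}'
```

This is why an 8B model fits comfortably on a single 48 GB L40S, while larger models need multiple cards once long-context KV cache and batching headroom are factored in.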
The Rackspace cluster is a single Dell PowerEdge XE9680 server with 8 Intel Gaudi 3 AI accelerators managed by the Habana AI Operator on OpenShift.
| Attribute | Intel Gaudi 3 | NVIDIA L40S (AWS) |
|---|---|---|
| VRAM per card | 96 GB HBM2e | 48 GB GDDR6 |
| Cards in server | 8 | up to 4 (g6e.12xlarge) |
| Total VRAM | 768 GB | 192 GB |
| BF16 performance | ~1,835 TFLOPS (8 cards) | ~1,448 TFLOPS (4 cards) |
| Inference runtime | vLLM on Gaudi / TGI-Gaudi | vLLM |
| Operator | Habana AI Operator | NVIDIA GPU Operator |
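The Habana device plugin advertises the cards as the extended resource `habana.ai/gaudi` (the plugin's standard resource name; the sample capacity block below is illustrative, not live cluster output):

```shell
# On the live cluster:
#   oc get node <node-name> -o jsonpath='{.status.capacity.habana\.ai/gaudi}'
# Illustrative capacity block for the SNO node:
cat <<'EOF' | grep -o '"habana.ai/gaudi": "[0-9]*"'
{"capacity": {"cpu": "256", "memory": "2Ti", "habana.ai/gaudi": "8"}}
EOF
```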
| Runtime | Type | Used for |
|---|---|---|
| vllm-gaudi-runtime | vLLM on Gaudi | Base runtime for Gaudi vLLM deployments |
| llama-4-scout-tgi-gaudi | TGI on Gaudi | Llama 4 Scout via HuggingFace TGI |
| gpt-oss-20b-tgi-gaudi | TGI on Gaudi | GPT-OSS 20B via HuggingFace TGI |
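As a sketch of what one of the Gaudi runtimes above might contain — the image reference and env values are illustrative assumptions, not the actual manifests (the real ones live in the rhpds/models-aas repo):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-gaudi-runtime
  namespace: llm-hosting
spec:
  supportedModelFormats:
    - name: vLLM            # matched against the InferenceService's modelFormat
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/example/vllm-gaudi:latest   # illustrative image reference
      env:
        - name: HABANA_VISIBLE_DEVICES
          value: "all"      # illustrative; exposes Gaudi devices to the container
      resources:
        limits:
          habana.ai/gaudi: "1"
```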
| Model | Runtime | Gaudi cards | Status |
|---|---|---|---|
| deepseek-r1-distill-qwen-14b | vLLM on Gaudi | TBD | Ready |
| qwen3-14b | vLLM on Gaudi | TBD | Ready |
```shell
# List Gaudi-hosted models (kubeconfig context "gaudi")
oc get inferenceservice -n llm-hosting --context=gaudi
```
Each model is pinned to a specific node via KServe's node selector. The table below shows where each model is currently running, how many GPUs it uses, and on which AWS instance type.
| Model | Instance Type | GPU | VRAM | GPUs allocated | Replicas | Runtime |
|---|---|---|---|---|---|---|
| llama-scout-17b | g6e.12xlarge | L40S × 4 | 192 GB total | 4 | 2 (HA) | vLLM |
| granite-3-2-8b-instruct | g6e.2xlarge | L40S × 1 | 48 GB | 1 | 1 | vLLM |
| codellama-7b-instruct | g6e.2xlarge | L40S × 1 | 48 GB | 1 | 1 | vLLM |
| granite-4-0-h-tiny | g6.2xlarge | L4 × 1 | 24 GB | 1 | 1 | vLLM |
| granite-docling-258m | g6.2xlarge | L4 × 1 | 24 GB | 1 | 1 | Docling Serve |
| llama-guard-3-1b | g4dn.2xlarge | T4 × 1 | 16 GB | 1 | 1 | vLLM |
| nomic-embed-text-v1-5 | g4dn.2xlarge | T4 × 1 | 16 GB | 1 | 1 | TEI (GPU) |
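Summing the table above (assuming each replica holds its own GPU allocation):

```shell
# llama-scout-17b: 4 GPUs x 2 replicas; the other six models: 1 GPU x 1 replica each
awk 'BEGIN { print 4*2 + 6*1, "GPUs allocated across all replicas" }'
```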
All model servers are deployed as KServe InferenceServices via OpenShift AI (RHOAI). Each model is a separate Kubernetes custom resource that KServe translates into a Deployment, Service, and optional HorizontalPodAutoscaler.
| Layer | Component | Role |
|---|---|---|
| Orchestration | OpenShift AI (RHOAI) | Manages KServe, model runtimes, GPU scheduling via NVIDIA GPU Operator |
| Serving framework | KServe | Translates InferenceService CRs into pods, handles scaling and routing |
| Inference runtime | vLLM | LLM inference engine with continuous batching, PagedAttention, prefix caching |
| Embedding runtime | TEI (Text Embeddings Inference) | GPU-accelerated embedding model serving |
| Docling runtime | Docling Serve | Document conversion service (PDF → Markdown/JSON) |
| GPU driver | NVIDIA GPU Operator | Manages drivers, DCGM exporter, device plugin across all GPU nodes |
Each model is defined as an InferenceService custom resource. Here is an annotated example:
```yaml
# Example: granite-3-2-8b-instruct InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3-2-8b-instruct
  namespace: llm-hosting
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                        # tells KServe which runtime to use
      runtime: granite-3-2-8b-instruct    # references a ServingRuntime CR
      storageUri: oci://...               # model weights location (OCI registry or S3)
      resources:
        limits:
          nvidia.com/gpu: "1"             # number of GPUs
          memory: 45Gi
          cpu: "6"
    nodeSelector:
      node.kubernetes.io/instance-type: g6e.2xlarge  # pin to L40S node
```

ServingRuntime

Each InferenceService references a ServingRuntime CR that defines the container image, environment variables, and default resource limits for that model family.

```shell
# List all serving runtimes in the namespace
oc get servingruntimes -n llm-hosting

# Inspect a specific runtime
oc get servingruntime granite-3-2-8b-instruct -n llm-hosting -o yaml
```

GitOps Management
All InferenceService and ServingRuntime manifests are managed via the rhpds/models-aas repository and synced automatically. See the GitOps section below for full details.
Model servers are managed entirely through a dedicated GitOps repository, separate from the LiteMaaS platform repo. Any change committed to the repo is automatically applied to the cluster — no manual oc apply needed.
| Field | Value |
|---|---|
| Application name | model-serving |
| ArgoCD namespace | openshift-gitops |
| Git repository | github.com/rhpds/models-aas |
| Path in repo | model-serving/ |
| Target revision | v1.0.5 (pinned tag) |
| Destination cluster | https://kubernetes.default.svc (in-cluster) |
| Sync policy | Automated — commits sync without manual trigger |
| Health status | Healthy |
ArgoCD watches the model-serving/ directory at the pinned tag. When a commit changes a manifest, ArgoCD detects the drift and applies the changes automatically. The Validate=false annotation allows KServe CRDs to be applied even if CRD versions temporarily mismatch during cluster upgrades.
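Concretely, this is ArgoCD's standard sync-option annotation on the affected manifests, shown here on a hypothetical InferenceService:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model            # hypothetical example
  namespace: llm-hosting
  annotations:
    argocd.argoproj.io/sync-options: Validate=false   # skip kubectl schema validation
```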
```shell
# Check the model-serving app status
oc get applications.argoproj.io model-serving \
  -n openshift-gitops \
  -o custom-columns="NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status"

# See all managed resources (InferenceServices, ServingRuntimes, etc.)
oc get applications.argoproj.io model-serving -n openshift-gitops \
  -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.name}{"\n"}{end}'
```
```shell
# 1. Clone the models-aas repo
git clone https://github.com/rhpds/models-aas
cd models-aas

# 2. Add or edit the InferenceService in model-serving/

# 3. Commit and push — ArgoCD auto-syncs
git add model-serving/my-new-model.yaml
git commit -m "Add my-new-model InferenceService"
git push

# 4. Watch the sync
oc get applications.argoproj.io model-serving -n openshift-gitops -w
```
The ArgoCD application tracks the pinned tag v1.0.5, not main. New commits deploy only when the tag is moved or the app's targetRevision is changed. Coordinate with the team before updating the revision.
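The tracked revision lives at `.spec.source.targetRevision` on the ArgoCD Application resource; the sample spec below is illustrative, not live cluster output:

```shell
# On the live cluster:
#   oc get applications.argoproj.io model-serving -n openshift-gitops \
#     -o jsonpath='{.spec.source.targetRevision}'
# Illustrative Application spec:
cat <<'EOF' | grep -o 'targetRevision: v[0-9.]*'
spec:
  source:
    repoURL: https://github.com/rhpds/models-aas
    path: model-serving/
    targetRevision: v1.0.5
EOF
```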
```shell
# Trigger a sync if the app shows OutOfSync (force:false keeps a normal, non-forced apply)
oc patch applications.argoproj.io model-serving -n openshift-gitops \
  --type merge -p '{"operation":{"sync":{"syncStrategy":{"apply":{"force":false}}}}}'
```
```shell
# List predictor pods with their node placement
oc get pods -n llm-hosting -o wide | grep predictor
```
```shell
# Total GPU capacity across GPU nodes
oc get nodes -l 'node-role.kubernetes.io/worker-gpu' \
  -o custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels['node\.kubernetes\.io/instance-type'],GPU:.status.capacity['nvidia\.com/gpu']"

# GPU currently allocated (requested by pods)
oc describe node <node-name> | grep -A5 "Allocated resources"
```
```shell
# List all InferenceServices and their readiness
oc get inferenceservice -n llm-hosting
```
```shell
# Tail logs for a specific model
oc logs -n llm-hosting -l app=granite-3-2-8b-instruct --tail=50

# Check vLLM startup — shows model load time and VRAM usage
oc logs -n llm-hosting <predictor-pod-name> | grep -E "GPU|VRAM|loaded|error"
```
```shell
# Scale via InferenceService (preferred — GitOps-managed)
oc patch inferenceservice llama-scout-17b -n llm-hosting \
  --type merge -p '{"spec":{"predictor":{"minReplicas":3}}}'

# Or scale the underlying deployment directly (not persisted in Git)
oc scale deployment llama-scout-17b-predictor -n llm-hosting --replicas=3
```