Model as a Service for Red Hat Demo Platform
Six-month cost breakdown for the RHDP MaaS cluster — GPU spend, per-model token costs, and self-hosted vs external provider economics.
Six months of production cluster costs across all node types, broken down between GPU compute and supporting infrastructure.
| Month | Total Cost |
|---|---|
| Sep 2025 | $18,900 |
| Oct 2025 | $20,267 |
| Nov 2025 | $23,455 |
| Dec 2025 | $18,717 |
| Jan 2026 | $17,089 |
| Feb 2026 | $18,555 |
| Total | $116,984 |
The overhead multiplier is the critical number for cost planning. Total cluster cost (worker nodes, control plane, PostgreSQL, networking, storage, plus the GPU nodes themselves) runs 5.40× the raw GPU spend, meaning supporting infrastructure adds 4.40× on top of every GPU dollar. Any model running on dedicated GPU hardware must be used heavily enough to justify both the GPU cost and that 4.40× overhead.
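As a quick sanity check, here is the multiplier math as a minimal Python sketch; the constants come from the figures in this report, and the variable names are illustrative:

```python
# Constants from this report's AWS billing data (6-month totals, USD).
TOTAL_CLUSTER_COST = 116_984
TOTAL_GPU_COST = 21_665

overhead_multiplier = TOTAL_CLUSTER_COST / TOTAL_GPU_COST  # ~5.40
infra_per_gpu_dollar = overhead_multiplier - 1             # ~4.40

def all_in_monthly_cost(gpu_monthly_usd: float) -> float:
    """True monthly cost of a model once cluster overhead is included."""
    return gpu_monthly_usd * overhead_multiplier

# e.g. a model on half a g6e.2xlarge ($614/month GPU share) really costs
# ~$3,315/month all-in, matching the ~$3,314 figure in the per-model table.
print(f"{overhead_multiplier:.2f}x -> ${all_in_monthly_cost(614):,.0f}/month")
```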
Each GPU node type is dedicated to specific models. Costs are amortized over the 6-month window; a short sketch of the amortization and cost-split arithmetic follows the table.
| Instance Type | GPU | 6-Month Cost | Cost/Month | Models Hosted |
|---|---|---|---|---|
| g6e.12xlarge | L40S × 4 | $7,899 | $1,317 | llama-scout-17b (2 replicas) |
| g6e.2xlarge | L40S × 1 | $7,365 | $1,227 | granite-3-2-8b-instruct, codellama-7b-instruct |
| g6.2xlarge | L4 × 1 | $3,804 | $634 | granite-4-0-h-tiny, granite-docling-258m |
| g4dn.2xlarge | T4 × 1 | $2,597 | $433 | llama-guard-3-1b, nomic-embed-text-v1-5 |
| Total | | $21,665 | $3,611 | |
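A minimal sketch of that allocation, assuming the equal-split rule described in the methodology section (the single-tenant g6e.12xlarge is omitted since it is not shared):

```python
# Amortize each shared instance type's 6-month cost to a monthly figure,
# then split it equally among the models hosted on that node.
instances = {
    # instance type: (6-month cost USD, models hosted)
    "g6e.2xlarge":  (7_365, ["granite-3-2-8b-instruct", "codellama-7b-instruct"]),
    "g6.2xlarge":   (3_804, ["granite-4-0-h-tiny", "granite-docling-258m"]),
    "g4dn.2xlarge": (2_597, ["llama-guard-3-1b", "nomic-embed-text-v1-5"]),
}

for itype, (six_month_cost, models) in instances.items():
    per_model_monthly = six_month_cost / 6 / len(models)
    for model in models:
        print(f"{model}: ${per_model_monthly:,.0f}/month on {itype}")
# granite-3-2-8b-instruct: $614/month, granite-4-0-h-tiny: $317/month,
# llama-guard-3-1b: $216/month -- matching the per-model table below.
```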
llama-scout-17b accounts for 36% of GPU spend ($7,899 over 6 months) due to the 4-GPU instance type required for 2 replicas. Models requiring multi-GPU instances have a much steeper cost cliff when token utilization is low.
The GPU-only cost divides the monthly GPU node cost by observed token throughput. The full cost applies the 5.40× overhead multiplier to reflect the true all-in infrastructure cost per token. Only the 7 InferenceServices confirmed running on this cluster are included below; a worked sketch of the arithmetic follows the table.
| Model | Node / Instance | GPU $/month | Full $/month (×5.40) | 3-Month Token Usage | Full $/1M tokens |
|---|---|---|---|---|---|
| granite-3-2-8b-instruct | g6e.2xlarge (L40S) | $614 | $3,314 | 135M | $73.64 |
| llama-scout-17b | g6e.12xlarge ×2 (L40S×4) | $2,633 | $14,218 | 99.9M | $426.82 |
| codellama-7b-instruct | g6e.2xlarge (L40S) | $614 | $3,314 | 13.2M | $750.90 |
| granite-4-0-h-tiny | g6.2xlarge (L4) | $317 | $1,712 | 5.0M | $1,024.43 |
| granite-docling-258m | g6.2xlarge (L4) | $317 | $1,712 | ~0 | N/A |
| llama-guard-3-1b | g4dn.2xlarge (T4) | $216 | $1,168 | 4.6M | $766.55 |
| nomic-embed-text-v1-5 | g4dn.2xlarge (T4) | $216 | $1,168 | 1.5M | $2,351.43 |
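To make the table reproducible, here is the per-token formula as a small Python helper, a sketch of the methodology described above rather than production tooling:

```python
OVERHEAD = 5.40  # total cluster cost / GPU cost, derived in the methodology section

def full_cost_per_1m_tokens(gpu_monthly_usd: float, tokens_3_months: int) -> float:
    """Fully-loaded $/1M tokens: GPU share x overhead, divided by monthly volume."""
    monthly_tokens_millions = tokens_3_months / 3 / 1_000_000
    return gpu_monthly_usd * OVERHEAD / monthly_tokens_millions

print(full_cost_per_1m_tokens(614, 135_000_000))  # ~73.7   (granite-3-2-8b-instruct)
print(full_cost_per_1m_tokens(216, 1_500_000))    # ~2,333  (nomic-embed; the table's
                                                  # $2,351 uses unrounded inputs)
```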
Models on maas00/smc00 servers (qwen3-14b, deepseek-r1-distill-qwen-14b, microsoft-phi-4) are excluded
— their infrastructure costs are not captured in this AWS billing data. Only the 7 InferenceServices
running on this cluster (in the llm-hosting namespace) are included above.
Full cost = GPU cost × 5.40 (overhead multiplier). Self-hosted cost is high for low-usage models — consider moving them to external providers. granite-3-2-8b-instruct has sufficient token volume to amortize the fixed GPU cost efficiently.
This is a utilization problem, not a model problem. Two factors combine: the GPU node costs the same whether it serves 1M or 1B tokens per month, and the 5.40× overhead multiplier scales that fixed cost up further.
At 10× usage (15M tokens/3 months), cost would drop to ~$235/1M. At 100× (150M tokens), it would reach ~$24/1M — competitive with any external provider.
At current usage (1.5M tokens over 3 months), self-hosting nomic-embed costs roughly $3,500 in fully-loaded infrastructure (3 × $1,168/month) versus a few cents for the same volume via an external embedding API (e.g. OpenAI text-embedding-3-small at $0.02/1M or Vertex AI embeddings at $0.025/1M). Consider moving to an external provider unless usage is expected to grow significantly.
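The scaling claim is easy to verify: with the fully-loaded monthly cost fixed at ~$1,168, the per-token price falls linearly with volume. A sketch using the table's figures:

```python
FULL_MONTHLY_USD = 1_168  # nomic-embed's fully-loaded monthly share (from the table)

for multiplier in (1, 10, 100):
    tokens_3_months = 1_500_000 * multiplier
    per_1m = FULL_MONTHLY_USD / (tokens_3_months / 3 / 1_000_000)
    print(f"{multiplier:>3}x usage: ~${per_1m:,.0f} per 1M tokens")
# 1x: ~$2,336 | 10x: ~$234 | 100x: ~$23 -- in line with the ~$235 and ~$24
# estimates above (small differences are rounding).
```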
Fixed GPU costs only make sense when token volume is high enough to spread that cost thin. These examples show where self-hosting competes well — and where it does not.
| Model | Self-Hosted $/1M (full) | External $/1M | Verdict |
|---|---|---|---|
| nomic-embed-text-v1-5 | $2,351 | $0.02–$0.03 (external) | Move to external — ~80,000× cheaper per token at current volume |
| codellama-7b-instruct | $751 | No direct equivalent | Evaluate alternatives; GPU utilization is too low to justify cost |
| granite-3-2-8b-instruct | $74 | No direct equivalent (proprietary Red Hat model) | Self-hosting justified — competitive cost and data sovereignty |
Self-hosting makes sense for high-usage models (granite-3-2-8b-instruct, llama-scout-17b) where the fixed GPU cost is amortized across many tokens and data sovereignty matters. Low-usage models cost significantly more per token self-hosted than via external APIs — the fixed node cost is the same whether the GPU serves 1M or 1B tokens per month.
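One way to frame the decision is break-even volume: the monthly token count at which the fixed, fully-loaded node cost matches an external per-token rate. A sketch using the rates cited above:

```python
def breakeven_tokens_millions(full_monthly_usd: float, external_per_1m_usd: float) -> float:
    """Monthly tokens (in millions) where self-hosted $/1M equals the external rate."""
    return full_monthly_usd / external_per_1m_usd

# nomic-embed's fully-loaded share vs. a $0.02/1M external embedding API:
print(breakeven_tokens_millions(1_168, 0.02))  # 58,400M (~58B tokens/month),
                                               # far beyond any plausible usage here
```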
External models routed through Vertex AI are billed purely on consumption with no fixed infrastructure cost. Input and output tokens are priced separately.
| Model | Input $/1M tokens | Output $/1M tokens |
|---|---|---|
| minimax-m2 | $0.30 | $1.20 |
| qwen3-235b | $0.22 | $0.88 |
| gpt-oss-120b | $0.09 | $0.36 |
| gpt-oss-20b | $0.07 | $0.25 |
| claude-sonnet-4-6 | $3.00 | $15.00 |
| claude-opus-4-6 | $5.00 | $25.00 |
| claude-sonnet-4-5 | $3.00 | $15.00 |
| claude-3-5-haiku | $1.00 | $5.00 |
| gemini-2.5-pro | $1.25 | $10.00 |
External model costs are official published Vertex AI rates. Actual rates may differ under Red Hat partnership agreements. For models with very low usage on the cluster (under 5M tokens per month), the Vertex AI pay-per-token model is almost always cheaper than maintaining a dedicated GPU node.
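Because external billing is pure consumption, estimating a monthly bill is simple arithmetic over the rate table above. A sketch (the token volumes are illustrative, not observed usage):

```python
VERTEX_RATES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gpt-oss-120b":      (0.09, 0.36),
    "claude-sonnet-4-5": (3.00, 15.00),
}

def external_monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    """Consumption-based monthly cost: input and output tokens billed separately."""
    input_rate, output_rate = VERTEX_RATES[model]
    return input_millions * input_rate + output_millions * output_rate

# e.g. 40M input + 5M output tokens per month:
print(external_monthly_cost("gpt-oss-120b", 40, 5))       # $5.40
print(external_monthly_cost("claude-sonnet-4-5", 40, 5))  # $195.00
```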
The cost data came from an AWS Cost Explorer CSV export covering 6 months (September 2025 – February 2026). Here is the exact methodology used to arrive at the per-model $/1M token figures.
The cost attribution covers only the 7 InferenceServices running on this cluster (in the
llm-hosting namespace). Models running on separate inference servers (maas00/smc00) such as
qwen3-14b, deepseek-r1-distill-qwen-14b, and microsoft-phi-4 are excluded because their infrastructure
costs are tracked separately and are not reflected in this AWS billing data.
The CSV had one row per AWS instance type per month. We summed each GPU instance type over the full 6-month period:
```
g6e.12xlarge (L40S×4)  → $7,899 total
g6e.2xlarge  (L40S×1)  → $7,365 total
g6.2xlarge   (L4×1)    → $3,804 total
g4dn.2xlarge (T4×1)    → $2,597 total
Total GPU cost         → $21,665
```
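A sketch of that aggregation with pandas; the column names (`InstanceType`, `UnblendedCost`) and filename are assumptions, since Cost Explorer exports vary with the chosen dimensions:

```python
import pandas as pd

GPU_INSTANCE_TYPES = {"g6e.12xlarge", "g6e.2xlarge", "g6.2xlarge", "g4dn.2xlarge"}

# Column names are assumed; adjust to match the actual export.
df = pd.read_csv("cost_explorer_export.csv")
gpu_rows = df[df["InstanceType"].isin(GPU_INSTANCE_TYPES)]
totals = gpu_rows.groupby("InstanceType")["UnblendedCost"].sum()

print(totals)        # per-type 6-month totals (should match the list above)
print(totals.sum())  # ~21,665 -- total GPU spend
```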
The total cluster cost over 6 months was $116,984. GPU nodes represent only 18.5% of that. The remaining 81.5% covers workers, control plane, database, networking, storage, and other overhead.
```
Overhead multiplier = Total cluster cost / GPU cost
                    = $116,984 / $21,665
                    = 5.40×
```
This means for every $1 spent on GPU hardware, $4.40 goes to supporting infrastructure.
We queried the OpenShift cluster to confirm exactly which InferenceServices are running and which instance type each uses:
```
oc get inferenceservice -n llm-hosting
```
This confirmed exactly 7 models deployed as InferenceServices in the
llm-hosting namespace: granite-3-2-8b-instruct, llama-scout-17b, codellama-7b-instruct,
granite-4-0-h-tiny, granite-docling-258m, llama-guard-3-1b, and nomic-embed-text-v1-5.
Each model was assigned its instance's share of the monthly GPU cost. Models sharing an instance
(e.g. granite-4-0-h-tiny and granite-docling-258m on the same g6.2xlarge, or llama-guard-3-1b and
nomic-embed-text-v1-5 on the same g4dn.2xlarge) split the cost equally.
We queried the LiteLLM spend logs for the last 3 months:
```sql
SELECT model, SUM(total_tokens)
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= NOW() - INTERVAL '3 months'
GROUP BY model
ORDER BY 2 DESC;
```
Model names were normalised (e.g. openai/granite-3-2-8b-instruct and granite-3-2-8b-instruct were combined) to get total tokens per logical model.
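A sketch of that normalisation step; the prefix-stripping rule covers the simple case shown, and real LiteLLM model strings may need more handling:

```python
from collections import defaultdict

def normalise(model: str) -> str:
    """Strip any 'provider/' prefix so aliases collapse to one logical model."""
    return model.split("/")[-1]

def combine_token_counts(rows: list[tuple[str, int]]) -> dict[str, int]:
    totals: dict[str, int] = defaultdict(int)
    for model, tokens in rows:
        totals[normalise(model)] += tokens
    return dict(totals)

rows = [("openai/granite-3-2-8b-instruct", 100_000_000),
        ("granite-3-2-8b-instruct", 35_000_000)]
print(combine_token_counts(rows))  # {'granite-3-2-8b-instruct': 135000000}
```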
Per-token costs were then computed as:

```
GPU-only $/1M tokens = monthly GPU cost / (3-month tokens / 3 / 1,000,000)
Full $/1M tokens     = GPU-only $/1M tokens × 5.40
```
Embedding models use input cost only (output tokens = 0). Chat models use a combined token count.
Important caveat: These are internal infrastructure cost estimates, not billing rates. They reflect the amortised cost of running the cluster assuming current utilisation levels. Models with low usage will show a higher per-token cost because the fixed GPU cost is divided across fewer tokens.