RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Model Cost Analysis

Six-month cost breakdown for the RHDP MaaS cluster — GPU spend, per-model token costs, and self-hosted vs external provider economics.

Cluster Cost Overview (Sep 2025 – Feb 2026)

Six months of production cluster costs across all node types, broken down between GPU compute and supporting infrastructure.

Total Cluster Cost:   $116,984 over 6 months (Sep 2025 – Feb 2026)
GPU Node Costs:       $21,665 (18.5% of total spend)
Overhead:             $95,319 (81.5% — workers, control plane, DB, networking)
Overhead Multiplier:  5.40× (for every $1 of GPU cost, $4.40 goes to overhead)

Monthly Breakdown

Month       Total Cost
Sep 2025    $18,900
Oct 2025    $20,267
Nov 2025    $23,455
Dec 2025    $18,717
Jan 2026    $17,089
Feb 2026    $18,555
Total       $116,984

The overhead multiplier is the critical number for cost planning. The cluster infrastructure (worker nodes, control plane, PostgreSQL, networking, storage) costs 5.40× the raw GPU spend. Any model running on dedicated GPU hardware must be used heavily enough to justify both the GPU cost and that 4.40× overhead on top.

GPU Node Cost Breakdown

Each GPU node type is dedicated to specific models. Costs are amortized over the 6-month window.

Instance Type   GPU        6-Month Cost   Cost/Month   Models Hosted
g6e.12xlarge    L40S × 4   $7,899         $1,317       llama-scout-17b (2 replicas)
g6e.2xlarge     L40S × 1   $7,365         $1,227       granite-3-2-8b-instruct, codellama-7b-instruct
g6.2xlarge      L4 × 1     $3,804         $634         granite-4-0-h-tiny, granite-docling-258m
g4dn.2xlarge    T4 × 1     $2,597         $433         llama-guard-3-1b, nomic-embed-text-v1-5
Total                      $21,665        $3,611

llama-scout-17b accounts for 36% of GPU spend ($7,899 over 6 months) due to the 4-GPU instance type required for 2 replicas. Models requiring multi-GPU instances have a much steeper cost cliff when token utilization is low.

Cost Per 1M Tokens (GPU-Only vs Full Infrastructure)

The GPU-only cost divides the monthly GPU node cost by observed token throughput. The full cost applies the 5.40× overhead multiplier to reflect the true all-in infrastructure cost per token. Only the 7 InferenceServices confirmed running on this cluster are included below.

Model                     Node / Instance            GPU $/month   Full $/month (×5.40)   3-Month Tokens   Full $/1M tokens
granite-3-2-8b-instruct   g6e.2xlarge (L40S)         $614          $3,314                 135M             $73.64
llama-scout-17b           g6e.12xlarge ×2 (L40S×4)   $2,633        $14,218                99.9M            $426.82
codellama-7b-instruct     g6e.2xlarge (L40S)         $614          $3,314                 13.2M            $750.90
granite-4-0-h-tiny        g6.2xlarge (L4)            $317          $1,712                 5.0M             $1,024.43
granite-docling-258m      g6.2xlarge (L4)            $317          $1,712                 ~0               N/A
llama-guard-3-1b          g4dn.2xlarge (T4)          $216          $1,168                 4.6M             $766.55
nomic-embed-text-v1-5     g4dn.2xlarge (T4)          $216          $1,168                 1.5M             $2,351.43

Models on maas00/smc00 servers (qwen3-14b, deepseek-r1-distill-qwen-14b, microsoft-phi-4) are excluded — their infrastructure costs are not captured in this AWS billing data.

Full cost = GPU cost × 5.40 (overhead multiplier). Self-hosted cost is high for low-usage models — consider moving them to external providers. granite-3-2-8b-instruct has sufficient token volume to amortize the fixed GPU cost efficiently.
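
As a quick consistency check of the table above, the full monthly figure for each model is its per-model GPU share (the instance's 6-month cost, amortised monthly and split between the models on it) times the 5.40× multiplier:

```python
# Consistency check: Full $/month = per-model GPU share x 5.40 overhead multiplier.
OVERHEAD = 5.40

# granite-3-2-8b-instruct shares a g6e.2xlarge ($7,365 over 6 months) with codellama.
gpu_share = 7365 / 6 / 2          # ~ $613.75/month (shown rounded to $614 in the table)
full_monthly = gpu_share * OVERHEAD

print(round(full_monthly))        # 3314 — matches the $3,314 shown for granite-3-2-8b
```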

Why nomic-embed-text-v1-5 Shows $2,351/1M

This is a utilization problem, not a model problem. Two factors combine: the fixed node cost accrues whether the GPU serves one token or one billion, and the 5.40× overhead multiplier scales that fixed cost up further. Dividing the result across only 1.5M tokens over 3 months produces the extreme per-token figure.

At 10× usage (15M tokens/3 months), cost would drop to ~$235/1M. At 100× (150M tokens), it would reach ~$24/1M — competitive with any external provider.

At current usage (1.5M tokens over 3 months), self-hosting nomic-embed costs $858 compared to roughly $0.03–$0.04 for the same volume via an external embedding API (e.g. OpenAI text-embedding-3-small at $0.02/1M or Vertex AI embeddings at $0.025/1M). Consider moving to an external provider unless usage is expected to grow significantly.
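
The scaling above is simple division: the node cost is fixed, so multiplying token volume by N divides the per-token cost by N. A minimal sketch, using the figures from this page:

```python
# Per-token cost scales inversely with usage: fixed node cost / more tokens.
BASE_COST_PER_1M = 2351.43   # nomic-embed-text-v1-5 at current usage (1.5M tokens / 3 months)

def cost_at_usage_multiple(n: float) -> float:
    """Full $/1M tokens if token volume grows by a factor of n."""
    return BASE_COST_PER_1M / n

print(round(cost_at_usage_multiple(10)))    # 235 — ~$235/1M at 10x usage
print(round(cost_at_usage_multiple(100)))   # 24  — ~$24/1M at 100x usage
```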

Self-Hosted vs External Provider Comparison

Fixed GPU costs only make sense when token volume is high enough to spread that cost thin. These examples show where self-hosting competes well — and where it does not.

Model                     Self-Hosted $/1M (full)   External $/1M                                      Verdict
nomic-embed-text-v1-5     $2,351                    $0.02–$0.03                                        Move to external — ~80,000× cheaper per token at current volume
codellama-7b-instruct     $751                      No direct equivalent                               Evaluate alternatives; GPU utilization is too low to justify cost
granite-3-2-8b-instruct   $74                       No direct equivalent (proprietary Red Hat model)   Self-hosting justified — competitive cost and data sovereignty

Self-hosting makes sense for high-usage models (granite-3-2-8b-instruct, llama-scout-17b) where the fixed GPU cost is amortized across many tokens and data sovereignty matters. Low-usage models cost significantly more per token self-hosted than via external APIs — the fixed node cost is the same whether the GPU serves 1M or 1B tokens per month.
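
One way to make the breakeven explicit: the monthly token volume at which a self-hosted model's full monthly cost equals the external pay-per-token rate. A sketch under the page's assumptions (5.40× overhead, equal cost split on shared nodes):

```python
# Breakeven sketch: tokens/month where self-hosted full $/1M = external rate.
OVERHEAD = 5.40

def breakeven_tokens_per_month(gpu_monthly_usd: float, external_per_1m: float) -> float:
    """Monthly tokens needed before self-hosting beats the external API."""
    full_monthly = gpu_monthly_usd * OVERHEAD
    return full_monthly / external_per_1m * 1_000_000

# nomic-embed's share of the T4 node ($216/mo GPU) vs a $0.02/1M embedding API:
tokens = breakeven_tokens_per_month(216, 0.02)
print(f"{tokens / 1e9:.1f}B tokens/month")   # ~58.3B tokens/month to break even
```

At ~0.5M tokens/month of actual usage, nomic-embed is about five orders of magnitude short of that breakeven, which is why the verdict above is to move it external.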

Vertex AI Models (Pay-per-Token)

External models routed through Vertex AI are billed purely on consumption with no fixed infrastructure cost. Input and output tokens are priced separately.

Model               Input $/1M tokens   Output $/1M tokens
minimax-m2          $0.30               $1.20
qwen3-235b          $0.22               $0.88
gpt-oss-120b        $0.09               $0.36
gpt-oss-20b         $0.07               $0.25
claude-sonnet-4-6   $3.00               $15.00
claude-opus-4-6     $5.00               $25.00
claude-sonnet-4-5   $3.00               $15.00
claude-3-5-haiku    $1.00               $5.00
gemini-2.5-pro      $1.25               $10.00

External model costs are official published Vertex AI rates. Actual rates may differ under Red Hat partnership agreements. For models with very low usage on the cluster (under 5M tokens per month), the Vertex AI pay-per-token model is almost always cheaper than maintaining a dedicated GPU node.
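
Pay-per-token billing is a straightforward linear function of input and output token counts. A minimal sketch using a few of the published rates from the table above (actual partner rates may differ):

```python
# Pay-per-token cost: input and output tokens priced separately, no fixed cost.
RATES = {  # model: (input $/1M, output $/1M) — from the Vertex AI table above
    "gpt-oss-120b": (0.09, 0.36),
    "claude-sonnet-4-5": (3.00, 15.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at published per-1M-token rates."""
    inp, out = RATES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# 1M input + 1M output tokens on gpt-oss-120b:
print(round(request_cost("gpt-oss-120b", 1_000_000, 1_000_000), 2))   # 0.45
```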

How This Was Calculated

The cost data came from an AWS Cost Explorer CSV export covering 6 months (September 2025 – February 2026). Here is the exact methodology used to arrive at the per-model $/1M token figures.

The cost attribution covers only the 7 InferenceServices running on this cluster (in the llm-hosting namespace). Models running on separate inference servers (maas00/smc00) such as qwen3-14b, deepseek-r1-distill-qwen-14b, and microsoft-phi-4 are excluded because their infrastructure costs are tracked separately and are not reflected in this AWS billing data.

Step 1 — Extract GPU instance costs from the CSV

The CSV had one row per AWS instance type per month. We summed each GPU instance type over the full 6-month period:

g6e.12xlarge (L40S×4)  → $7,899 total
g6e.2xlarge  (L40S×1)  → $7,365 total
g6.2xlarge   (L4×1)    → $3,804 total
g4dn.2xlarge (T4×1)    → $2,597 total
Total GPU cost          → $21,665
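
The summation in Step 1 can be sketched as follows. The column names (`instance_type`, `month`, `cost_usd`) and sample rows are illustrative, not the actual Cost Explorer export schema:

```python
# Sketch of Step 1: sum cost per GPU instance type across all monthly rows.
import csv
import io
from collections import defaultdict

# Illustrative stand-in for the Cost Explorer CSV (one row per type per month).
sample = """instance_type,month,cost_usd
g6e.2xlarge,2025-09,1227.50
g6e.2xlarge,2025-10,1227.50
g4dn.2xlarge,2025-09,432.83
"""

totals: dict[str, float] = defaultdict(float)
for row in csv.DictReader(io.StringIO(sample)):
    totals[row["instance_type"]] += float(row["cost_usd"])

print(dict(totals))   # running 6-month sum per GPU instance type
```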

Step 2 — Calculate the overhead multiplier

The total cluster cost over 6 months was $116,984. GPU nodes represent only 18.5% of that. The remaining 81.5% covers workers, control plane, database, networking, storage, and other overhead.

Overhead multiplier = Total cluster cost / GPU cost
                    = $116,984 / $21,665
                    = 5.40x

This means for every $1 spent on GPU hardware, $4.40 goes to supporting infrastructure.

Step 3 — Map models to instance types

We queried the OpenShift cluster to confirm exactly which InferenceServices are running and which instance type each uses:

oc get inferenceservice -n llm-hosting

This confirmed exactly 7 models deployed as InferenceServices in the llm-hosting namespace: granite-3-2-8b-instruct, llama-scout-17b, codellama-7b-instruct, granite-4-0-h-tiny, granite-docling-258m, llama-guard-3-1b, and nomic-embed-text-v1-5. Each model was assigned its instance's share of the monthly GPU cost. Models sharing an instance (e.g. granite-4-0-h-tiny and granite-docling-258m on the same g6.2xlarge, or llama-guard-3-1b and nomic-embed-text-v1-5 on the same g4dn.2xlarge) split the cost equally.
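
The equal-split attribution in Step 3 can be sketched directly from the instance-to-model mapping above (monthly figures rounded as in the GPU node table):

```python
# Sketch of Step 3: split each instance's monthly GPU cost equally among
# the models hosted on it (mapping taken from this page).
INSTANCE_MONTHLY = {"g6e.2xlarge": 1227, "g6.2xlarge": 634, "g4dn.2xlarge": 433}
MODELS_ON = {
    "g6e.2xlarge": ["granite-3-2-8b-instruct", "codellama-7b-instruct"],
    "g6.2xlarge": ["granite-4-0-h-tiny", "granite-docling-258m"],
    "g4dn.2xlarge": ["llama-guard-3-1b", "nomic-embed-text-v1-5"],
}

model_gpu_monthly = {
    model: INSTANCE_MONTHLY[inst] / len(models)
    for inst, models in MODELS_ON.items()
    for model in models
}
print(model_gpu_monthly["granite-3-2-8b-instruct"])   # 613.5 (~$614/mo in the table)
```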

Step 4 — Get token usage from the database

We queried the LiteLLM spend logs for the last 3 months:

SELECT model, SUM(total_tokens)
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= NOW() - INTERVAL '3 months'
GROUP BY model ORDER BY 2 DESC;

Model names were normalised (e.g. openai/granite-3-2-8b-instruct and granite-3-2-8b-instruct were combined) to get total tokens per logical model.
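
The normalisation step can be sketched as stripping any provider prefix before aggregating (the exact rule used may differ; this assumes a simple `provider/model` prefix form):

```python
# Sketch of Step 4's normalisation: merge "openai/granite-3-2-8b-instruct"
# with "granite-3-2-8b-instruct" before summing tokens per logical model.
from collections import defaultdict

def normalise(model: str) -> str:
    """Drop any 'provider/' prefix, keeping only the model name."""
    return model.split("/")[-1]

# Illustrative rows as (model, million_tokens) pairs from the spend logs.
rows = [("openai/granite-3-2-8b-instruct", 100), ("granite-3-2-8b-instruct", 35)]
totals: dict[str, int] = defaultdict(int)
for model, tokens in rows:
    totals[normalise(model)] += tokens

print(dict(totals))   # {'granite-3-2-8b-instruct': 135}
```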

Step 5 — Calculate GPU-only $/1M tokens

Monthly GPU cost for model / (tokens used in 3 months / 3 / 1,000,000)

Step 6 — Apply overhead multiplier

Full cost per 1M tokens = GPU-only cost × 5.40

Embedding models use input cost only (output tokens = 0). Chat models use a combined token count.
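
Steps 5 and 6 combine into one function. Slight differences from the table come from rounding of the inputs shown here; embedding models pass input tokens only, chat models pass input + output combined:

```python
# Steps 5 + 6: GPU-only $/1M tokens, then apply the overhead multiplier.
OVERHEAD = 5.40

def full_cost_per_1m_tokens(gpu_monthly_usd: float, tokens_3mo: int) -> float:
    """Full infrastructure $/1M tokens for a self-hosted model."""
    gpu_only = gpu_monthly_usd / (tokens_3mo / 3 / 1_000_000)   # Step 5
    return gpu_only * OVERHEAD                                   # Step 6

# codellama-7b-instruct: $614/mo GPU share, 13.2M tokens over 3 months
print(round(full_cost_per_1m_tokens(614, 13_200_000), 2))
# ~$753.55 here; the table's $750.90 uses unrounded cost and token inputs.
```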

Important caveat: These are internal infrastructure cost estimates, not billing rates. They reflect the amortised cost of running the cluster assuming current utilisation levels. Models with low usage will show a higher per-token cost because the fixed GPU cost is divided across fewer tokens.