RHDP LiteMaaS

Model as a Service for Red Hat Demo Platform

Google Vertex AI Integration

How LiteMaaS connects to Google Cloud Vertex AI — service accounts, model registration, credential mounting, and operational configuration.

Overview

Google Cloud Vertex AI is the primary provider in LiteMaaS, serving the majority of its models: all Claude models (via Anthropic on Vertex), Gemini 2.5 Pro, and the Model Garden OSS models such as GPT-OSS, Qwen3, and MiniMax.

LiteMaaS routes requests to Vertex AI through the LiteLLM proxy, using two distinct GCP service accounts — one per GCP billing project. All credentials are stored as OpenShift secrets and mounted as read-only volumes inside the litellm pod.

Traffic stays on Google's network. All Vertex AI calls use Private Service Connect endpoints, keeping inference traffic off the public internet.

Why Vertex AI

Vertex AI is preferred over direct API access (e.g., api.anthropic.com) for several operational reasons:

- Private connectivity: all calls use Private Service Connect endpoints, so inference traffic never crosses the public internet.
- Consolidated billing: each GCP service account pins spend to a specific billing project (rhdp-infra for OSS models, itpc-gcp-product-all-claude for Claude and Gemini).
- Single auth model: access is governed by GCP IAM (roles/aiplatform.user) rather than per-provider API keys.
- One provider surface: Claude, Gemini, and the Model Garden OSS models are all served through the same Vertex endpoints and quotas.

Two GCP Service Accounts

LiteMaaS uses two separate GCP service accounts, each associated with a different GCP project. The SA selection determines both the billing project and which models are accessible.

| Service Account | GCP Project | Models Served | OCP Secret | Mount Path |
| --- | --- | --- | --- | --- |
| litemaas-vertex-sa@rhdp-infra.iam.gserviceaccount.com | rhdp-infra | OSS models: GPT-OSS 120B, GPT-OSS 20B, Qwen3-235B, MiniMax M2 | vertex-ai-sa-key | /etc/vertex-credentials/key.json |
| rhdp-vertex-app-sa@itpc-gcp-product-all-claude.iam.gserviceaccount.com | itpc-gcp-product-all-claude | Claude (all versions), Gemini 2.5 Pro | vertex-claude-sa-key | /etc/vertex-claude-credentials/key.json |

Required IAM role: Both service accounts must have roles/aiplatform.user granted on their respective GCP projects. Without this role, all inference calls return HTTP 403.

Do not mix SA paths. Registering a Claude model with /etc/vertex-credentials/key.json (the OSS SA) will fail — that SA is not authorized on the itpc-gcp-product-all-claude project. Always match the vertex_credentials path to the correct GCP project.
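A small guard can catch a mismatched pairing before it reaches `/model/new`. The helper below is a hypothetical sketch (not part of LiteMaaS) that encodes the two documented path-to-project pairings:

```shell
# Hypothetical guard (not a LiteMaaS tool): map each documented credential
# mount path to the GCP project its SA is authorized on, so a planned
# registration can be sanity-checked before calling /model/new.
expected_project() {
  case "$1" in
    /etc/vertex-credentials/key.json)        echo "rhdp-infra" ;;
    /etc/vertex-claude-credentials/key.json) echo "itpc-gcp-product-all-claude" ;;
    *)                                       echo "unknown" ;;
  esac
}

# A Claude registration must resolve to itpc-gcp-product-all-claude:
expected_project /etc/vertex-claude-credentials/key.json
```

If the resolved project does not match the `vertex_project` you are about to register, stop and swap the credentials path.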

Models on Vertex AI

The following models are currently registered in LiteMaaS via Vertex AI backends. Costs are listed per 1M tokens (input / output).

| LiteMaaS Name | Vertex Backend | GCP Project | Cost (in / out per 1M) |
| --- | --- | --- | --- |
| claude-sonnet-4-6 | vertex_ai/claude-sonnet-4-6 | itpc-gcp-product-all-claude | $3.00 / $15.00 |
| claude-opus-4-6 | vertex_ai/claude-opus-4-6-v1 | itpc-gcp-product-all-claude | $5.00 / $25.00 |
| claude-sonnet-4-5 | vertex_ai/claude-sonnet-4-5 | itpc-gcp-product-all-claude | $3.00 / $15.00 |
| claude-3-5-haiku | vertex_ai/claude-3-5-haiku | itpc-gcp-product-all-claude | $1.00 / $5.00 |
| gemini-2.5-pro | vertex_ai/gemini-2.5-pro | itpc-gcp-product-all-claude (Gemini AI Studio key) | $1.25 / $10.00 |
| gpt-oss-120b | vertex_ai/openai/gpt-oss-120b-maas | rhdp-infra | $0.09 / $0.36 |
| gpt-oss-20b | vertex_ai/openai/gpt-oss-20b-maas | rhdp-infra | $0.07 / $0.25 |
| minimax-m2 | vertex_ai/minimaxai/minimax-m2-maas | rhdp-infra | $0.30 / $1.20 |
| qwen3-235b | vertex_ai/qwen/qwen3-235b-a22b-instruct-2507-maas | rhdp-infra | $0.22 / $0.88 |

Model Garden path format: OSS models served through Google's Model Garden use an extended path format — vertex_ai/<provider>/<model-id>-maas. The -maas suffix is required by the Model Garden endpoint and is not a LiteMaaS convention.
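Note that the registration API takes per-token costs while the table above lists per-1M-token prices; the conversion is a division by 1,000,000. A quick shell check (the helper name is illustrative):

```shell
# Convert a per-1M-token price (as listed in the table above) to the
# per-token value used in model_info at registration time.
per_token() { awk -v p="$1" 'BEGIN { printf "%.8f\n", p / 1000000 }'; }

per_token 3.00    # claude-sonnet-4-6 input  -> 0.00000300
per_token 0.09    # gpt-oss-120b input       -> 0.00000009
```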

Rate Limits

All Vertex AI-backed models share the following rate limits, which are set on the admin key rhdp-automation-admin in LiteLLM:

| Limit Type | Value | Scope |
| --- | --- | --- |
| Tokens per minute (TPM) | 400,000 | Per key |
| Requests per minute (RPM) | 500 | Per key |

These limits are enforced by LiteLLM at the virtual key level, not at the Vertex API level. Vertex-side quotas are provisioned separately per GCP project and are generally higher than the LiteMaaS-enforced limits.
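In practice the TPM limit usually binds first. For example, at an assumed average of 2,000 tokens per request (an illustrative figure, not a measured one), the token budget caps throughput well below the RPM ceiling:

```shell
# Back-of-envelope: with a 400,000 TPM budget and ~2,000 tokens per request,
# the token limit caps throughput at ~200 req/min -- users hit the token
# limit long before the 500 RPM ceiling.
echo $(( 400000 / 2000 ))   # -> 200
```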

If users report 429 errors, check whether the issue is a LiteLLM-enforced per-key limit or an actual Vertex quota. Run oc logs -n litellm-rhpds -l app=litellm and look for RateLimitError vs QuotaExceeded to distinguish the source.

Credential Mounting

GCP service account JSON key files are stored as OpenShift secrets and mounted as read-only volumes into the litellm deployment. LiteLLM reads the file path at inference time when the model's vertex_credentials parameter is set.

OCP Secrets

Create (or update) the secrets in the litellm-rhpds namespace. Each secret contains a single key named key.json holding the full GCP SA JSON key file content.

# Create the OSS model SA secret (rhdp-infra project)
oc create secret generic vertex-ai-sa-key \
  --from-file=key.json=/path/to/litemaas-vertex-sa-key.json \
  -n litellm-rhpds

# Create the Claude / Gemini SA secret (itpc-gcp-product-all-claude project)
oc create secret generic vertex-claude-sa-key \
  --from-file=key.json=/path/to/rhdp-vertex-app-sa-key.json \
  -n litellm-rhpds

Never commit SA key files to git. Treat them as passwords. Use oc create secret generic directly from the downloaded JSON file, then delete the local copy.

Volume Mounts

The secrets are projected into the pod via the following volume and volumeMount configuration in the litellm Deployment spec:

# Volumes (add to .spec.template.spec.volumes)
volumes:
  - name: vertex-sa-key
    secret:
      secretName: vertex-ai-sa-key
  - name: vertex-claude-sa-key
    secret:
      secretName: vertex-claude-sa-key

# Volume mounts (add to .spec.template.spec.containers[0].volumeMounts)
volumeMounts:
  - name: vertex-sa-key
    mountPath: /etc/vertex-credentials
    readOnly: true
  - name: vertex-claude-sa-key
    mountPath: /etc/vertex-claude-credentials
    readOnly: true

After patching the deployment, verify that both files are visible inside the running pod:

# Verify credential files are present in the pod
oc exec -n litellm-rhpds deploy/litellm -- ls -la /etc/vertex-credentials/
oc exec -n litellm-rhpds deploy/litellm -- ls -la /etc/vertex-claude-credentials/

# Expected output for each directory
total 8
drwxr-xr-x 2 root root  60 Apr 21 00:00 .
drwxr-xr-x 1 root root  60 Apr 21 00:00 ..
-rw-r--r-- 1 root root 2400 Apr 21 00:00 key.json

Model Registration

When registering a model in LiteLLM, the vertex_credentials field must point to the correct mounted key file path. LiteLLM reads this path at request time — it does not cache the credentials in memory on startup.

Model params are encrypted at rest. All fields in litellm_params — including vertex_credentials paths and vertex_project — are encrypted in PostgreSQL using the LiteLLM master key. Rotating the master key requires re-registering all models.
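Every registration below repeats the same litellm_params shape, so a small jq wrapper (hypothetical, not an existing LiteMaaS tool) can keep the payloads consistent and reduce copy-paste errors:

```shell
# Hypothetical helper: assemble a /model/new payload from its parts so the
# credentials path, project, and per-token costs stay consistent.
new_model_payload() {  # usage: name backend project location creds in_cost out_cost
  jq -n --arg n "$1" --arg m "$2" --arg p "$3" --arg l "$4" --arg c "$5" \
        --argjson i "$6" --argjson o "$7" '
    { model_name: $n,
      litellm_params: {
        model: $m,
        vertex_project: $p,
        vertex_location: $l,
        vertex_credentials: $c },
      model_info: { input_cost_per_token: $i, output_cost_per_token: $o } }'
}

# Example: the claude-3-5-haiku registration ($1.00 / $5.00 per 1M tokens)
new_model_payload claude-3-5-haiku vertex_ai/claude-3-5-haiku \
  itpc-gcp-product-all-claude us-east5 \
  /etc/vertex-claude-credentials/key.json 0.000001 0.000005
```

Pipe the output into the same `curl -X POST "$LITELLM_URL/model/new"` call shown below with `-d @-`.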

Registering Claude Models

Claude models use the itpc-gcp-product-all-claude GCP project and the SA key mounted at /etc/vertex-claude-credentials/key.json.

# Register claude-sonnet-4-6 via Vertex AI
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "claude-sonnet-4-6",
    "litellm_params": {
      "model": "vertex_ai/claude-sonnet-4-6",
      "vertex_project": "itpc-gcp-product-all-claude",
      "vertex_location": "us-east5",
      "vertex_credentials": "/etc/vertex-claude-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.000003,
      "output_cost_per_token": 0.000015
    }
  }'

# Register claude-opus-4-6
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "claude-opus-4-6",
    "litellm_params": {
      "model": "vertex_ai/claude-opus-4-6-v1",
      "vertex_project": "itpc-gcp-product-all-claude",
      "vertex_location": "us-east5",
      "vertex_credentials": "/etc/vertex-claude-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.000005,
      "output_cost_per_token": 0.000025
    }
  }'

# Register gemini-2.5-pro (also uses the claude SA / GCP project)
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gemini-2.5-pro",
    "litellm_params": {
      "model": "vertex_ai/gemini-2.5-pro",
      "vertex_project": "itpc-gcp-product-all-claude",
      "vertex_location": "us-central1",
      "vertex_credentials": "/etc/vertex-claude-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.00000125,
      "output_cost_per_token": 0.00001
    }
  }'

Registering OSS Models

Model Garden OSS models use the rhdp-infra project and the SA key mounted at /etc/vertex-credentials/key.json. The backend model path uses the extended vertex_ai/<provider>/<model-id>-maas format.

# Register gpt-oss-120b (Model Garden)
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-oss-120b",
    "litellm_params": {
      "model": "vertex_ai/openai/gpt-oss-120b-maas",
      "vertex_project": "rhdp-infra",
      "vertex_location": "us-central1",
      "vertex_credentials": "/etc/vertex-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.00000009,
      "output_cost_per_token": 0.00000036
    }
  }'

# Register qwen3-235b (Model Garden)
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "qwen3-235b",
    "litellm_params": {
      "model": "vertex_ai/qwen/qwen3-235b-a22b-instruct-2507-maas",
      "vertex_project": "rhdp-infra",
      "vertex_location": "us-central1",
      "vertex_credentials": "/etc/vertex-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.00000022,
      "output_cost_per_token": 0.00000088
    }
  }'

# Register minimax-m2 (Model Garden)
curl -X POST "$LITELLM_URL/model/new" \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "minimax-m2",
    "litellm_params": {
      "model": "vertex_ai/minimaxai/minimax-m2-maas",
      "vertex_project": "rhdp-infra",
      "vertex_location": "us-central1",
      "vertex_credentials": "/etc/vertex-credentials/key.json"
    },
    "model_info": {
      "input_cost_per_token": 0.0000003,
      "output_cost_per_token": 0.0000012
    }
  }'

Architecture Diagram

The diagram below shows how the two service accounts map to GCP projects and which models each SA serves. OCP secrets are projected into the pod as read-only volume mounts.

flowchart TD
    LiteLLM["LiteLLM Proxy\nlitellm-rhpds"] --> SA1["SA: litemaas-vertex-sa\nrhdp-infra project"]
    LiteLLM --> SA2["SA: rhdp-vertex-app-sa\nitpc-gcp-product-all-claude"]
    SA1 -->|"Model Garden OSS"| GPT120["GPT-OSS 120B"]
    SA1 -->|"Model Garden OSS"| GPT20["GPT-OSS 20B"]
    SA1 -->|"Model Garden OSS"| Qwen["Qwen3-235B"]
    SA1 -->|"Model Garden OSS"| MiniMax["MiniMax M2"]
    SA2 -->|"Anthropic on Vertex"| Claude["Claude Sonnet / Opus 4.x\nHaiku 3.5"]
    SA2 -->|"Gemini AI Studio"| Gemini["Gemini 2.5 Pro"]
    OCP1["OCP Secret\nvertex-ai-sa-key"] -. "mounted at\n/etc/vertex-credentials" .-> LiteLLM
    OCP2["OCP Secret\nvertex-claude-sa-key"] -. "mounted at\n/etc/vertex-claude-credentials" .-> LiteLLM

Request flow

When a user sends a request to the LiteMaaS API, the following sequence occurs:

  1. The request arrives at the LiteLLM proxy and is matched to a registered model by name.
  2. LiteLLM reads the model's litellm_params from PostgreSQL (decrypting with the master key).
  3. LiteLLM reads the SA key file from the mounted volume path specified in vertex_credentials.
  4. LiteLLM exchanges the SA key for a short-lived OAuth2 access token via Google's token endpoint.
  5. The inference request is forwarded to the Vertex AI endpoint for the appropriate GCP project and region.
  6. The response is streamed back through LiteLLM to the user, and spend is logged to PostgreSQL.
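Step 4 is handled internally by LiteLLM via Google's auth libraries, but the exchange can be sketched for illustration: build a JWT signed with the SA's private key and post it to Google's token endpoint. The key path is the documented mount; the flow itself is Google's standard service-account JWT grant. Requires bash (process substitution), jq, openssl, and curl.

```shell
# Illustrative only -- LiteLLM performs this exchange itself.
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

fetch_token() {  # usage: fetch_token /etc/vertex-claude-credentials/key.json
  local key=$1 now claims input sig
  now=$(date +%s)
  # JWT claim set: issuer is the SA email, scope covers all Cloud APIs
  claims=$(jq -n --arg iss "$(jq -r .client_email "$key")" --argjson iat "$now" '
    { iss: $iss,
      scope: "https://www.googleapis.com/auth/cloud-platform",
      aud: "https://oauth2.googleapis.com/token",
      iat: $iat, exp: ($iat + 3600) }')
  input="$(printf %s '{"alg":"RS256","typ":"JWT"}' | b64url).$(printf %s "$claims" | b64url)"
  # Sign header.claims with the SA private key (RS256)
  sig=$(printf %s "$input" \
    | openssl dgst -sha256 -sign <(jq -r .private_key "$key") | b64url)
  curl -s https://oauth2.googleapis.com/token \
    -d grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer \
    -d "assertion=$input.$sig"
}

# fetch_token /etc/vertex-claude-credentials/key.json | jq -r .access_token
```

The returned access token is what LiteLLM attaches as the Bearer credential on the Vertex call in step 5.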

Operational Configuration

Several environment variables on the litellm deployment govern how LiteMaaS interacts with the database and with the LiteMaaS backend. Understanding these is important for safe day-2 operations.

| Environment Variable | Value | Effect |
| --- | --- | --- |
| DISABLE_SCHEMA_UPDATE | true | Prevents Prisma from running schema migrations on pod restart. Without this, a LiteLLM version bump could drop LiteMaaS-specific tables. |
| STORE_MODEL_IN_DB | true | Model configs (including vertex_credentials paths and vertex_project) are persisted in PostgreSQL, not in a config file. Models survive pod restarts. |
| LITELLM_AUTO_SYNC | false | Models registered in LiteLLM are not automatically synced to the LiteMaaS backend model list. Models must be registered through the LiteMaaS API or UI to appear to users. |

Master key rotation requires model re-registration. All litellm_params (including SA key file paths and GCP project IDs) are encrypted with the LiteLLM master key. If the master key changes, every model must be deleted and re-registered so that params are re-encrypted under the new key.

Syncing models to LiteMaaS

Because LITELLM_AUTO_SYNC=false, a model can be registered in LiteLLM but not yet visible to users in the LiteMaaS UI or /v1/models endpoint. After registering a model in LiteLLM, explicitly add it to the LiteMaaS backend via the LiteMaaS admin API or UI to make it available.

# Check which models are registered in LiteLLM
curl -s "$LITELLM_URL/model/info" \
  -H "Authorization: Bearer $MASTER_KEY" | jq '.data[].model_name'

# Check which models are visible in LiteMaaS (user-facing endpoint)
curl -s "$LITEMAAS_URL/v1/models" \
  -H "Authorization: Bearer $USER_KEY" | jq '.data[].id'
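To spot registrations that still need syncing, diff the two lists. The `unsynced` helper below is illustrative (not an existing tool); it requires bash for process substitution:

```shell
# Print names present in the first list (LiteLLM) but not the second
# (LiteMaaS) -- these are the models still awaiting a sync.
unsynced() { comm -23 <(sort -u "$1") <(sort -u "$2"); }

# Usage: save each list from the commands above to a file, then:
#   unsynced litellm-models.txt litemaas-models.txt
```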

Troubleshooting

BXNIM0415E — credential error on multi-turn requests

This IBM-origin error code surfaces when per-model SA credentials fail during multi-turn conversations (i.e., the second or later message in a thread). The root cause is that LiteLLM re-reads the credential file on each request, and a timing or file-permission issue causes a transient failure.

Fix: Set the Vertex credentials as deployment-level environment variables instead of relying solely on the per-model vertex_credentials path. Add GOOGLE_APPLICATION_CREDENTIALS as an env var pointing to the primary SA key path, and LiteLLM will use it as a fallback.

# Add as env var on the litellm deployment
oc set env deploy/litellm \
  GOOGLE_APPLICATION_CREDENTIALS=/etc/vertex-claude-credentials/key.json \
  -n litellm-rhpds

Model appears in LiteLLM but not in LiteMaaS UI

This is expected when LITELLM_AUTO_SYNC=false. Register the model through the LiteMaaS admin interface or API to make it visible to users. See Model Management for the full workflow.

HTTP 403 on Vertex API calls

A 403 from Vertex means the SA does not have the required IAM role on the GCP project. Verify that the SA email has roles/aiplatform.user by checking the GCP IAM console or running:

# Verify IAM binding for the OSS SA
gcloud projects get-iam-policy rhdp-infra \
  --flatten="bindings[].members" \
  --filter="bindings.members:litemaas-vertex-sa@rhdp-infra.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Verify IAM binding for the Claude SA
gcloud projects get-iam-policy itpc-gcp-product-all-claude \
  --flatten="bindings[].members" \
  --filter="bindings.members:rhdp-vertex-app-sa@itpc-gcp-product-all-claude.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

Model shows LEGACY status

A model with LEGACY status in the LiteMaaS UI is still fully functional — it continues to serve inference requests. LEGACY status indicates that a newer version of the same model is available. Consider upgrading the registered backend model string to the latest version at your next maintenance window.

Checking Vertex connectivity from the pod

# Test that the Vertex endpoint is reachable from the pod. Without a valid
# OAuth token the call returns 401 -- that still proves network/PSC
# connectivity; only a timeout or connection error indicates a network issue.
oc exec -n litellm-rhpds deploy/litellm -- \
  curl -s -o /dev/null -w "%{http_code}" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/rhdp-infra/locations/us-central1/publishers/openai/models"

# View recent Vertex-related errors in the LiteLLM log
oc logs -n litellm-rhpds -l app=litellm --tail=100 | grep -i "vertex\|403\|quota\|rate"

When in doubt, check the pod logs first. LiteLLM logs the full upstream error message from Vertex including status code, error type, and GCP project ID. This is the fastest way to distinguish between an IAM issue, a quota issue, and a model path typo.