7. Challenges and Fixes
Challenge 1: $password undefined in the bootstrap Helm template
Symptom: LibreChat's user creation init job was creating accounts with the hardcoded username user1@example.com and a placeholder password string, not the actual tenant credentials.
Root cause: The Helm bootstrap template in tenant/bootstrap/templates/applications.yaml received tenant.password in .Values but never declared the $password variable. The template silently fell back to its hardcoded defaults rather than failing loudly.
Fix: Declared `{{- $password := .Values.tenant.password | default "" }}` at the top of the template, alongside the existing `$username` declaration. The email was then set as `"{{ $username }}@example.com"` and the password flowed through correctly. The silent fallback was the real problem: always verify that injected Helm values actually reach the places you expect.
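As a sketch, the top of the template can declare both values defensively. Using Helm's `required` function instead of `default` would make a missing password fail the render loudly rather than silently (a hypothetical variant, not the exact template from the repo):

```yaml
{{- $username := .Values.tenant.username | default "user1" }}
{{- /* Abort rendering with a clear message if the password was never
       injected, instead of silently falling back to a placeholder: */}}
{{- $password := required "tenant.password must be set" .Values.tenant.password }}
email: "{{ $username }}@example.com"
password: "{{ $password }}"
```

`required` is a standard Helm template function: it aborts the render with the given message whenever the value is empty, which turns this class of bug into an immediate, attributable failure.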
Challenge 2: LibreChat secrets showing placeholder values
Symptom: LibreChat crashed immediately after deployment. Inspecting the pod logs showed that creds_key, jwt_secret, and meili_master_key all contained literal placeholder strings like FROM_SECRETS or REPLACE_ME.
Root cause: The original monolithic role constructed these secrets by reading from a Kubernetes Secret object that it created per-user. In the new pattern, that Secret did not exist — there was no Ansible task creating it, and the Helm template was not generating it either.
Fix: Derive the secrets deterministically from a sha256sum of the tenant username, so no external Secret object is needed:
```yaml
# In tenant/bootstrap/templates/applications.yaml
creds_key: {{ sha256sum $username }}
jwt_secret: {{ sha256sum (printf "%s-jwt" $username) }}
meili_master_key: {{ sha256sum (printf "%s-meili" $username) | trunc 16 }}
```
These are stable across ArgoCD syncs — the same username always produces the same secret values.
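Helm's `sha256sum` returns the hex digest of its input and `trunc 16` keeps the first 16 characters, so the derivation can be sanity-checked outside Helm with Python's `hashlib` (a verification sketch, not part of the deployment):

```python
import hashlib


def derive_secrets(username: str) -> dict:
    """Mirror the Helm template's sha256sum / trunc logic for a tenant."""
    sha = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return {
        "creds_key": sha(username),
        "jwt_secret": sha(f"{username}-jwt"),
        # Helm's `trunc 16` keeps the first 16 characters of the digest.
        "meili_master_key": sha(f"{username}-meili")[:16],
    }


a = derive_secrets("mcpuser-abc123")
b = derive_secrets("mcpuser-abc123")
assert a == b                            # deterministic across ArgoCD syncs
assert len(a["creds_key"]) == 64         # full sha256 hex digest
assert len(a["meili_master_key"]) == 16  # truncated to 16 characters
```

Running this twice with the same username always yields identical values, which is exactly the stability property the ArgoCD syncs rely on.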
Challenge 3: OCP login broke after cluster re-provision
Symptom: Users could not log in via RHBK after a cluster was re-provisioned. The RHBK login page showed "Invalid client credentials." The RHBK realm and users were intact; only OCP authentication was broken.
Root cause: Re-running the cluster provisioner regenerated the OAuth client secret for the RHBK Identity Provider configured in OpenShift's OAuth config. The oauth-openshift pods in openshift-authentication had the old client secret cached in memory and continued using it, causing all authentication attempts to fail.
Fix: Delete the oauth-openshift pods in the openshift-authentication namespace after any OAuth configuration change or cluster re-provision:
```sh
oc delete pods -n openshift-authentication -l app=oauth-openshift
```
The pods restart and reload the current OAuth config, picking up the new client secret.
Challenge 4: git_url undefined when gitops_bootstrap ran
Symptom: The ocp4_workload_gitops_bootstrap role failed with a variable undefined error. The repo URL for the tenant's Gitea mirror was not set anywhere the role could find it.
Root cause: ocp4_workload_gitops_bootstrap_repo_url was expected to be available when the bootstrap role ran, but no task had set it. In the original lab this URL was hardcoded. In the new pattern, the URL depends on which Gitea instance is on the cluster and what the tenant's org name is — neither of which is known at AgV authoring time.
Fix: The ocp4_workload_tenant_gitea role now exports the mirrored repo URL as a set_fact at the end of its workload.yml:
```yaml
# At the end of ocp4_workload_tenant_gitea/tasks/workload.yml
- name: Export repo URL for gitops_bootstrap
  ansible.builtin.set_fact:
    ocp4_workload_gitops_bootstrap_repo_url: "{{ _gitea_repo_clone_url }}"
```
Because tenant_gitea runs before gitops_bootstrap in the workloads list, the variable is available when gitops_bootstrap needs it. No AgV change required.
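For context, the fact itself might be assembled along these lines before being exported. This is illustrative only: the `_gitea_route_hostname`, `_tenant_org_name`, and `_tenant_repo_name` variables are hypothetical names, and the real role's route discovery may differ:

```yaml
# Illustrative sketch - variable names other than the exported fact are invented
- name: Compose the tenant's mirrored repo clone URL
  ansible.builtin.set_fact:
    _gitea_repo_clone_url: >-
      https://{{ _gitea_route_hostname }}/{{ _tenant_org_name }}/{{ _tenant_repo_name }}.git

- name: Export repo URL for gitops_bootstrap
  ansible.builtin.set_fact:
    ocp4_workload_gitops_bootstrap_repo_url: "{{ _gitea_repo_clone_url }}"
```

The key design point is that the producing role, not AgV, owns the URL construction, because only the role knows which Gitea instance and org name are in play at run time.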
Challenge 5: Namespace naming was inconsistent
Symptom: Namespace names changed format between development iterations. Some ArgoCD Applications were looking for agent-mcpuser-abc123; other code was generating mcpuser-abc123-agent. Resources could not find their target namespaces.
Root cause: The naming convention was not pinned early in the project. The ocp4_workload_tenant_namespace role defaulted to one format; the bootstrap Helm chart was written assuming another format; ArgoCD AppProject target namespaces used a third format written before the others were finalised.
Fix: Standardised on {suffix}-{username} across all roles, all Helm templates, and all ArgoCD AppProject definitions: for example, agent-mcpuser-abc123 and librechat-mcpuser-abc123. All three sources (the tenant_namespace role, the bootstrap Helm chart, and the AppProject definitions) were updated in a single commit to prevent partial states.
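The convention is trivial to centralise. A sketch of a single shared helper (hypothetical; the actual roles and charts express this as Jinja and Helm template expressions):

```python
def tenant_namespace(suffix: str, username: str) -> str:
    """Single source of truth for the {suffix}-{username} naming convention."""
    return f"{suffix}-{username}"


assert tenant_namespace("agent", "mcpuser-abc123") == "agent-mcpuser-abc123"
assert tenant_namespace("librechat", "mcpuser-abc123") == "librechat-mcpuser-abc123"
```

Whatever the mechanism, the lesson is the same: one definition of the format, referenced everywhere, pinned before any consumer is written.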
Challenge 6: LibreChat log directory permission denied
Symptom: LibreChat pod entered CrashLoopBackOff immediately. Pod logs showed: mkdir /app/logs/: permission denied.
Root cause: The LibreChat container image attempts to create /app/logs and /app/api/logs directories at startup. On OpenShift, SCCs prevent containers from writing to paths owned by root, and the directories did not exist in the image layers — the container expected to create them at runtime.
Fix: Added emptyDir volume mounts for both log paths in the bootstrap Helm values. OpenShift mounts the emptyDir before the container starts, so the directories already exist and the container does not need to create them:
```yaml
extraVolumes:
  - name: logs
    emptyDir: {}
  - name: api-logs
    emptyDir: {}
extraVolumeMounts:
  - name: logs
    mountPath: /app/logs
  - name: api-logs
    mountPath: /app/api/logs
```
Challenge 7: LiteMaaS models rendered as a comma string, not a YAML list
Symptom: LibreChat rejected the generated configYaml — the models field contained "gpt-4o,llama-3.3-70b" as a plain string instead of a proper YAML list. LibreChat's YAML parser threw a type error.
Root cause: AgV passes litemaas.models to the bootstrap Helm values as a comma-joined string (produced by | join(',') in the Jinja template). The Helm chart was then passing this string directly into the configYamlContent field without splitting it back into a list.
Fix: Use the splitList function in the bootstrap template to convert the comma-separated string back into a YAML list before rendering configYamlContent:
```yaml
# In the Helm template
{{- $models := splitList "," .Values.litemaas.models }}
endpoints:
  - name: LiteMaaS
    models:
{{- range $models }}
      - name: {{ . }}
{{- end }}
```
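The type mismatch is easy to reproduce outside Helm. A quick Python illustration of why LibreChat's parser objected:

```python
raw = "gpt-4o,llama-3.3-70b"   # what AgV passes in: a plain string (YAML scalar)
models = raw.split(",")        # what splitList "," produces: a list (YAML sequence)

assert isinstance(raw, str)                       # rendered as a scalar -> type error
assert models == ["gpt-4o", "llama-3.3-70b"]      # rendered as a sequence -> accepted
```

The general rule: when a value crosses a string-only boundary (Jinja `join` into Helm values), something on the far side must restore its original type before it reaches a schema-validated consumer.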
8. Lessons Learned
- Monolithic roles do not scale. The ocp4_workload_mcp_user role was doing too much. When something broke, the failure could originate anywhere across hundreds of tasks. Splitting into single-responsibility roles made every failure immediately attributable to one role, and made updates surgical rather than all-or-nothing.
- "Delete the cluster" is not a destroy strategy. When you move to shared clusters, every role must clean up exactly what it created, and nothing more. Write remove_workload.yml at the same time you write workload.yml, not as an afterthought when destroy testing is on the schedule.
- Move cluster-wide infra out of the order. Installing Pipelines, OpenShift GitOps, ToolHive, and Gitea per order is wasteful. These are cluster-wide operators. Install them once in the cluster provisioner and let every order on that cluster benefit. The per-order provisioning time drops dramatically.
- One username variable, referenced everywhere. The original lab had sequential names (user1, user2) with no connection to the order GUID. The new pattern has a single ocp4_workload_tenant_keycloak_username: set once in AgV, used by every role, every Helm chart, and every ArgoCD Application. Tracing a problem from an ArgoCD app name back to an order is immediate.
- Deterministic passwords from the GUID. Both the old and new patterns derive passwords from the GUID rather than generating random values. Deterministic derivation means you never need to store the password anywhere; it can always be recalculated from the GUID. The new sha256 formula is stronger and satisfies modern password complexity requirements with the fixed prefix and suffix.
- ArgoCD cascade finalizer is essential for clean destroy. Without the cascade finalizer on the bootstrap Application, deleting it leaves all child Applications and all their synced Kubernetes resources as orphans in the cluster. Adding argocd.argoproj.io/finalizer: resources-finalizer.argocd.argoproj.io to the bootstrap Application ensures ArgoCD cleans up the entire tree before the parent Application is removed.
- Test destroy on every iteration, not just at the end. Destroy testing was done late in this migration and uncovered several ordering problems in remove_workloads: and missing remove_workload.yml implementations. Running a full provision-then-destroy cycle on every significant change would have caught these much earlier.
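The deterministic-password lesson can be sketched in a few lines. Note the prefix, suffix, and truncation length here are invented for illustration; the real formula lives in the roles and differs in its specifics:

```python
import hashlib


def derive_password(guid: str) -> str:
    """Recompute a tenant password from the order GUID.

    Illustrative formula only: the actual prefix, suffix, and digest
    truncation used by the roles are different.
    """
    digest = hashlib.sha256(guid.encode()).hexdigest()[:12]
    # A fixed prefix and suffix guarantee mixed character classes,
    # satisfying typical password complexity rules.
    return f"Pw-{digest}-x1"


# The same GUID always yields the same password, so nothing is ever stored.
assert derive_password("abc123") == derive_password("abc123")
```

Because the password is a pure function of the GUID, support workflows can always re-derive it on demand instead of looking it up in a vault.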
9. Open Questions — Unresolved Concerns
Concern 1 (raised by @Judd): Namespace termination deadlocks and etcd orphans
If we rely on namespace deletion to clean up resources, we will run into namespace termination deadlocks. When a namespace is stuck terminating and we force it by removing the finalizer, the resources inside the namespace are not deleted — they become orphans. They remain in etcd but are inaccessible. This causes two serious problems:
- Resource name conflicts on re-use. If a new order tries to create a resource with the same name as an orphaned resource, it will hit a conflict and the old orphaned data will win. New resources of the same name as orphans cannot be created cleanly.
- etcd fills up over time. Orphaned resources accumulate. On a long-running shared cluster with many orders, this will eventually exhaust etcd capacity.
The current implementation uses the ArgoCD cascade finalizer to delete managed resources before the namespace is removed. This helps when ArgoCD cleanup completes cleanly. But if ArgoCD cleanup itself gets stuck, or if there are resources not managed by ArgoCD, namespace force-deletion will still produce orphans.
Concern 2 (raised by @Judd): Namespaces not created by ArgoCD — portability problem
In the current pattern, namespaces are pre-created by Ansible (ocp4_workload_tenant_namespace) before ArgoCD syncs. ArgoCD is explicitly told CreateNamespace=false and does not own the namespaces.
This creates a portability problem: the GitOps repos cannot be used on non-OcpSandbox clusters without modification. If someone tries to use the same GitOps repo on a plain OpenShift cluster (without the Sandbox API or the Ansible tenant roles), the ArgoCD sync will fail immediately because the namespaces do not exist and ArgoCD is not allowed to create them.
Being good corporate citizens means our GitOps repos should be usable by customers in their own clusters, not just in RHDP. Right now they are not — they have a hidden dependency on the Ansible pre-provisioning step that is not expressed anywhere in the GitOps repo itself.
Possible fixes: (1) let ArgoCD create the namespaces (CreateNamespace=true); this makes the repo portable but means ArgoCD owns the namespace lifecycle, which has its own implications for multi-tenant shared clusters. (2) Document the dependency explicitly and provide a standalone mode in which the GitOps repo includes a namespace creation step that is skipped when Ansible pre-creates them. Neither option is implemented yet.
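If the ArgoCD-owned route were chosen, the standard mechanism is the CreateNamespace sync option on the Application. A sketch with placeholder names and repo values (only the syncOptions stanza is the point here):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-bootstrap            # placeholder name
  namespace: openshift-gitops
spec:
  project: default                  # placeholder project
  source:
    repoURL: https://example.com/tenant-gitops.git   # placeholder repo
    path: tenant/bootstrap
  destination:
    server: https://kubernetes.default.svc
    namespace: agent-mcpuser-abc123
  syncPolicy:
    syncOptions:
      - CreateNamespace=true        # ArgoCD creates the namespace if missing
```

The trade-off noted above still applies: with this option ArgoCD owns the namespace lifecycle, so per-tenant labels, quotas, and RBAC that Ansible currently applies at creation time would need another home.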
Challenge: Showroom fails with "invalid value bearer" — wrong role used
The ocp4_workload_showroom_ocp_integration role uses oc login --token internally. Under config: namespace this fails: the token format triggers an "invalid value bearer" error. The role was designed for config: openshift-workloads.
Fix: Remove ocp4_workload_showroom_ocp_integration from your workloads: list entirely. The ocp4_workload_ocp_console_embed role is already run once by the cluster provisioner, so do not add it to tenant workloads. Add these two vars to common.yaml:
```yaml
openshift_api_url: "{{ sandbox_openshift_api_url }}"
openshift_cluster_admin_token: "{{ cluster_admin_agnosticd_sa_token }}"
```
Challenge: Catalog item in summit-2026/ — Sandbox API cannot find cluster
The summit-2026/account.yaml sets sandbox_api: reservation: pgpu-event at account level. This restricts ALL catalog items in that directory to only schedule against Summit event clusters. Shared clusters (like cnv-us-east-ocp-3) are not in that reservation, so Sandbox API returns "No OCP shared cluster configuration found." Empty string override (reservation: "") is not supported by AgV and does not work.
Fix: Move the catalog item to agd_v2/ instead of summit-2026/. The summit-2026/ directory is specifically for Summit event clusters; shared cluster labs that run year-round belong in agd_v2/. The agd_v2/account.yaml does not set a reservation, so any matching cluster in the pool can be scheduled.
Challenge: catch_all set as a role var — does not work
Setting ocp4_workload_litellm_virtual_keys_catch_all: false as a regular role variable does not prevent the Sandbox API from deleting other tenants' LiteMaaS keys. The catch_all setting must be in __meta__.sandbox_api.actions.destroy where Sandbox API reads it directly.
Fix: Move the setting under __meta__ in the AgV catalog item:

```yaml
__meta__:
  sandbox_api:
    actions:
      destroy:
        catch_all: false
```
Challenge: Missing tag in cloud_selector — cluster not found
The Sandbox API does a subset match — a cluster qualifies only if it has ALL the tags you specify. Missing even one tag means zero clusters match and the order fails before provisioning starts. Common mistake: omitting cloud: cnv-dedicated-shared.
Fix: Specify every required tag in cloud_selector:

```yaml
cloud_selector:
  cloud: cnv-dedicated-shared   # required
  demo: your-lab-name           # required
  purpose: prod                 # required
```
10. All Links
Pull Requests
| Repository | PR | Title | Status |
|---|---|---|---|
| agnosticd/agnosticd-v2 | PR #123 | config/namespace: support explicit remove_workloads list on destroy | Merged |
| agnosticd/namespaced_workloads | PR #12 | feat: add tenant roles for shared cluster labs (keycloak_user, namespace, gitea) | Open |
| agnosticd/core_workloads | PR #62 | gitops_bootstrap: userinfo ConfigMap discovery + cascade cleanup on destroy | Open |
| rhpds/agnosticv | PR #25180 | MCP Sandbox: tenant roles pattern, scheduler-only sandbox, gitops_bootstrap | Open |
Repositories
| Repo | Branch / path | What it contains |
|---|---|---|
| agnosticd/namespaced_workloads | main (merged ✓) | New tenant roles: tenant_keycloak_user, tenant_namespace, tenant_gitea |
| agnosticd/core_workloads | main (merged ✓) | ocp4_workload_gitops_bootstrap — cascade finalizer, remove_workload, multi-app list, userinfo ConfigMap |
| agnosticd/agnosticd-v2 | tenant-roles-support | AgnosticD v2 deployer with explicit remove_workloads list support |
| rhpds/agnosticv | tenant-namespace-roles | AgV catalog item: mcp-with-openshift-sandbox/common.yaml |
| rhpds/ocpsandbox-mcp-with-openshift-gitops | main | GitOps repo: infra/bootstrap · platform/bootstrap · tenant/bootstrap · tenant/librechat · mcp-gitea · mcp-openshift · agent |
| rhpds/rhpds.ocpsandbox_mcp_with_openshift | main | Cluster provisioner: cluster-provision.yml (Infra + Platform bootstrap) |
| rhpds/rhpds.ocpsandbox_mcp_with_openshift | main | Original workshop (before migration): mcp-with-openshift/common.yaml |
Documentation
| Page | What it covers |
|---|---|
| Overview | What is Sandbox API, three-layer architecture, both patterns compared, variables reference |
| Summit 2026 / Scheduler-Only | Complete annotated AgV common.yaml for the scheduler-only pattern |
| OCP Sandbox API | Complete annotated AgV common.yaml for the full OCP Sandbox API pattern |