7. Challenges and Fixes
Challenge 1: $password undefined in the bootstrap Helm template
Symptom: LibreChat's user creation init job was creating accounts with the hardcoded username user1@example.com and a placeholder password string, not the actual tenant credentials.
Root cause: The Helm bootstrap template in tenant/bootstrap/templates/applications.yaml received tenant.password in .Values but never declared the $password variable. The template silently fell back to its hardcoded defaults rather than failing loudly.
Fix: Declared `{{- $password := .Values.tenant.password | default "" }}` at the top of the template, alongside the existing `$username` declaration. The email was then set as `"{{ $username }}@example.com"` and the password flowed through correctly. The silent fallback was the real problem: always verify that injected Helm values actually reach the places you expect.
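As a sketch, the top of the template can declare both values defensively. Using Helm's `required` function instead of `default` would make a missing password fail the render loudly rather than silently (a hypothetical variant, not the exact template from the repo):

```yaml
{{- $username := .Values.tenant.username | default "user1" }}
{{- /* Abort rendering with a clear message if the password was never
       injected, instead of silently falling back to a placeholder: */}}
{{- $password := required "tenant.password must be set" .Values.tenant.password }}
email: "{{ $username }}@example.com"
password: "{{ $password }}"
```

`required` is a standard Helm template function: it aborts the render with the given message whenever the value is empty, which turns this class of bug into an immediate, attributable failure.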
Challenge 2: LibreChat secrets showing placeholder values
Symptom: LibreChat crashed immediately after deployment. Inspecting the pod logs showed that creds_key, jwt_secret, and meili_master_key all contained literal placeholder strings like FROM_SECRETS or REPLACE_ME.
Root cause: The original monolithic role constructed these secrets by reading from a Kubernetes Secret object that it created per-user. In the new pattern, that Secret did not exist — there was no Ansible task creating it, and the Helm template was not generating it either.
Fix: Derive the secrets deterministically from a sha256sum of the tenant username, so no external Secret object is needed:
```yaml
# In tenant/bootstrap/templates/applications.yaml
creds_key: {{ sha256sum $username }}
jwt_secret: {{ sha256sum (printf "%s-jwt" $username) }}
meili_master_key: {{ sha256sum (printf "%s-meili" $username) | trunc 16 }}
```
These are stable across ArgoCD syncs — the same username always produces the same secret values.
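Helm's `sha256sum` returns the hex digest of its input and `trunc 16` keeps the first 16 characters, so the derivation can be sanity-checked outside Helm with Python's `hashlib` (a verification sketch, not part of the deployment):

```python
import hashlib


def derive_secrets(username: str) -> dict:
    """Mirror the Helm template's sha256sum / trunc logic for a tenant."""
    sha = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return {
        "creds_key": sha(username),
        "jwt_secret": sha(f"{username}-jwt"),
        # Helm's `trunc 16` keeps the first 16 characters of the digest.
        "meili_master_key": sha(f"{username}-meili")[:16],
    }


a = derive_secrets("mcpuser-abc123")
b = derive_secrets("mcpuser-abc123")
assert a == b                            # deterministic across ArgoCD syncs
assert len(a["creds_key"]) == 64         # full sha256 hex digest
assert len(a["meili_master_key"]) == 16  # truncated to 16 characters
```

Running this twice with the same username always yields identical values, which is exactly the stability property the ArgoCD syncs rely on.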
Challenge 3: OCP login broke after cluster re-provision
Symptom: Users could not log in via RHBK after a cluster was re-provisioned. The RHBK login page showed "Invalid client credentials." The RHBK realm and users were intact; only OCP authentication was broken.
Root cause: Re-running the cluster provisioner regenerated the OAuth client secret for the RHBK Identity Provider configured in OpenShift's OAuth config. The oauth-openshift pods in openshift-authentication had the old client secret cached in memory and continued using it, causing all authentication attempts to fail.
Fix: Delete the oauth-openshift pods in the openshift-authentication namespace after any OAuth configuration change or cluster re-provision:
```sh
oc delete pods -n openshift-authentication -l app=oauth-openshift
```
The pods restart and reload the current OAuth config, picking up the new client secret.
Challenge 4: git_url undefined when gitops_bootstrap ran
Symptom: The ocp4_workload_gitops_bootstrap role failed with a variable undefined error. The repo URL for the tenant's Gitea mirror was not set anywhere the role could find it.
Root cause: ocp4_workload_gitops_bootstrap_repo_url was expected to be available when the bootstrap role ran, but no task had set it. In the original lab this URL was hardcoded. In the new pattern, the URL depends on which Gitea instance is on the cluster and what the tenant's org name is — neither of which is known at AgV authoring time.
Fix: The ocp4_workload_tenant_gitea role now exports the mirrored repo URL as a set_fact at the end of its workload.yml:
```yaml
# At the end of ocp4_workload_tenant_gitea/tasks/workload.yml
- name: Export repo URL for gitops_bootstrap
  ansible.builtin.set_fact:
    ocp4_workload_gitops_bootstrap_repo_url: "{{ _gitea_repo_clone_url }}"
```
Because tenant_gitea runs before gitops_bootstrap in the workloads list, the variable is available when gitops_bootstrap needs it. No AgV change required.
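For context, the fact itself might be assembled along these lines before being exported. This is illustrative only: the `_gitea_route_hostname`, `_tenant_org_name`, and `_tenant_repo_name` variables are hypothetical names, and the real role's route discovery may differ:

```yaml
# Illustrative sketch - variable names other than the exported fact are invented
- name: Compose the tenant's mirrored repo clone URL
  ansible.builtin.set_fact:
    _gitea_repo_clone_url: >-
      https://{{ _gitea_route_hostname }}/{{ _tenant_org_name }}/{{ _tenant_repo_name }}.git

- name: Export repo URL for gitops_bootstrap
  ansible.builtin.set_fact:
    ocp4_workload_gitops_bootstrap_repo_url: "{{ _gitea_repo_clone_url }}"
```

The key design point is that the producing role, not AgV, owns the URL construction, because only the role knows which Gitea instance and org name are in play at run time.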
Challenge 5: Namespace naming was inconsistent
Symptom: Namespace names changed format between development iterations. Some ArgoCD Applications were looking for agent-mcpuser-abc123; other code was generating mcpuser-abc123-agent. Resources could not find their target namespaces.
Root cause: The naming convention was not pinned early in the project. The ocp4_workload_tenant_namespace role defaulted to one format; the bootstrap Helm chart was written assuming another format; ArgoCD AppProject target namespaces used a third format written before the others were finalised.
Fix: Standardised on {suffix}-{username} across all roles, all Helm templates, and all ArgoCD AppProject definitions: for example, agent-mcpuser-abc123 and librechat-mcpuser-abc123. All three sources (the tenant_namespace role, the bootstrap Helm chart, and the AppProject definitions) were updated in a single commit to prevent partial states.
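The convention is trivial to centralise. A sketch of a single shared helper (hypothetical; the actual roles and charts express this as Jinja and Helm template expressions):

```python
def tenant_namespace(suffix: str, username: str) -> str:
    """Single source of truth for the {suffix}-{username} naming convention."""
    return f"{suffix}-{username}"


assert tenant_namespace("agent", "mcpuser-abc123") == "agent-mcpuser-abc123"
assert tenant_namespace("librechat", "mcpuser-abc123") == "librechat-mcpuser-abc123"
```

Whatever the mechanism, the lesson is the same: one definition of the format, referenced everywhere, pinned before any consumer is written.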
Challenge 6: LibreChat log directory permission denied
Symptom: LibreChat pod entered CrashLoopBackOff immediately. Pod logs showed: mkdir /app/logs/: permission denied.
Root cause: The LibreChat container image attempts to create /app/logs and /app/api/logs directories at startup. On OpenShift, SCCs prevent containers from writing to paths owned by root, and the directories did not exist in the image layers — the container expected to create them at runtime.
Fix: Added emptyDir volume mounts for both log paths in the bootstrap Helm values. OpenShift mounts the emptyDir before the container starts, so the directories already exist and the container does not need to create them:
```yaml
extraVolumes:
  - name: logs
    emptyDir: {}
  - name: api-logs
    emptyDir: {}
extraVolumeMounts:
  - name: logs
    mountPath: /app/logs
  - name: api-logs
    mountPath: /app/api/logs
```
Challenge 7: LiteMaaS models rendered as a comma string, not a YAML list
Symptom: LibreChat rejected the generated configYaml — the models field contained "gpt-4o,llama-3.3-70b" as a plain string instead of a proper YAML list. LibreChat's YAML parser threw a type error.
Root cause: AgV passes litemaas.models to the bootstrap Helm values as a comma-joined string (produced by | join(',') in the Jinja template). The Helm chart was then passing this string directly into the configYamlContent field without splitting it back into a list.
Fix: Use the splitList function in the bootstrap template to convert the comma-separated string back into a YAML list before rendering configYamlContent:
```yaml
# In the Helm template
{{- $models := splitList "," .Values.litemaas.models }}
endpoints:
  - name: LiteMaaS
    models:
{{- range $models }}
      - name: {{ . }}
{{- end }}
```
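The type mismatch is easy to reproduce outside Helm. A quick Python illustration of why LibreChat's parser objected:

```python
raw = "gpt-4o,llama-3.3-70b"   # what AgV passes in: a plain string (YAML scalar)
models = raw.split(",")        # what splitList "," produces: a list (YAML sequence)

assert isinstance(raw, str)                       # rendered as a scalar -> type error
assert models == ["gpt-4o", "llama-3.3-70b"]      # rendered as a sequence -> accepted
```

The general rule: when a value crosses a string-only boundary (Jinja `join` into Helm values), something on the far side must restore its original type before it reaches a schema-validated consumer.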
8. Lessons Learned
- Monolithic roles do not scale. The ocp4_workload_mcp_user role was doing too much. When something broke, the failure could originate anywhere across hundreds of tasks. Splitting into single-responsibility roles made every failure immediately attributable to one role, and made updates surgical rather than all-or-nothing.
- "Delete the cluster" is not a destroy strategy. When you move to shared clusters, every role must clean up exactly what it created, and nothing more. Write remove_workload.yml at the same time you write workload.yml, not as an afterthought when destroy testing is on the schedule.
- Move cluster-wide infra out of the order. Installing Pipelines, OpenShift GitOps, ToolHive, and Gitea per order is wasteful. These are cluster-wide operators. Install them once in the cluster provisioner and let every order on that cluster benefit. The per-order provisioning time drops dramatically.
- One username variable, referenced everywhere. The original lab had sequential names (user1, user2) with no connection to the order GUID. The new pattern has a single ocp4_workload_tenant_keycloak_username: set once in AgV, used by every role, every Helm chart, and every ArgoCD Application. Tracing a problem from an ArgoCD app name back to an order is immediate.
- Deterministic passwords from the GUID. Both the old and new patterns derive passwords from the GUID rather than generating random values. Deterministic derivation means you never need to store the password anywhere; it can always be recalculated from the GUID. The new sha256 formula is stronger and satisfies modern password complexity requirements with the fixed prefix and suffix.
- ArgoCD cascade finalizer is essential for clean destroy. Without the cascade finalizer on the bootstrap Application, deleting it leaves all child Applications and all their synced Kubernetes resources as orphans in the cluster. Adding argocd.argoproj.io/finalizer: resources-finalizer.argocd.argoproj.io to the bootstrap Application ensures ArgoCD cleans up the entire tree before the parent Application is removed.
- Test destroy on every iteration, not just at the end. Destroy testing was done late in this migration and uncovered several ordering problems in remove_workloads: and missing remove_workload.yml implementations. Running a full provision-then-destroy cycle on every significant change would have caught these much earlier.
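The deterministic-password lesson can be sketched in a few lines. Note the prefix, suffix, and truncation length here are invented for illustration; the real formula lives in the roles and differs in its specifics:

```python
import hashlib


def derive_password(guid: str) -> str:
    """Recompute a tenant password from the order GUID.

    Illustrative formula only: the actual prefix, suffix, and digest
    truncation used by the roles are different.
    """
    digest = hashlib.sha256(guid.encode()).hexdigest()[:12]
    # A fixed prefix and suffix guarantee mixed character classes,
    # satisfying typical password complexity rules.
    return f"Pw-{digest}-x1"


# The same GUID always yields the same password, so nothing is ever stored.
assert derive_password("abc123") == derive_password("abc123")
```

Because the password is a pure function of the GUID, support workflows can always re-derive it on demand instead of looking it up in a vault.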
9. Open Questions — Unresolved Concerns
Concern 1 (raised by @Judd): Namespace termination deadlocks and etcd orphans
If we rely on namespace deletion to clean up resources, we will run into namespace termination deadlocks. When a namespace is stuck terminating and we force it by removing the finalizer, the resources inside the namespace are not deleted — they become orphans. They remain in etcd but are inaccessible. This causes two serious problems:
- Resource name conflicts on re-use. If a new order tries to create a resource with the same name as an orphaned resource, it will hit a conflict and the old orphaned data will win. New resources of the same name as orphans cannot be created cleanly.
- etcd fills up over time. Orphaned resources accumulate. On a long-running shared cluster with many orders, this will eventually exhaust etcd capacity.
The current implementation uses the ArgoCD cascade finalizer to delete managed resources before the namespace is removed. This helps when ArgoCD cleanup completes cleanly. But if ArgoCD cleanup itself gets stuck, or if there are resources not managed by ArgoCD, namespace force-deletion will still produce orphans.
Concern 2 (raised by @Judd): Namespaces not created by ArgoCD — portability problem
In the current pattern, namespaces are pre-created by Ansible (ocp4_workload_tenant_namespace) before ArgoCD syncs. ArgoCD is explicitly told CreateNamespace=false and does not own the namespaces.
This creates a portability problem: the GitOps repos cannot be used on non-OcpSandbox clusters without modification. If someone tries to use the same GitOps repo on a plain OpenShift cluster (without the Sandbox API or the Ansible tenant roles), the ArgoCD sync will fail immediately because the namespaces do not exist and ArgoCD is not allowed to create them.
Being good corporate citizens means our GitOps repos should be usable by customers in their own clusters, not just in RHDP. Right now they are not — they have a hidden dependency on the Ansible pre-provisioning step that is not expressed anywhere in the GitOps repo itself.
Possible fixes: (1) let ArgoCD create the namespaces (CreateNamespace=true); this makes the repo portable but means ArgoCD owns the namespace lifecycle, which has its own implications for multi-tenant shared clusters. (2) Document the dependency explicitly and provide a standalone mode in which the GitOps repo includes a namespace creation step that is skipped when Ansible pre-creates them. Neither option is implemented yet.
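If the ArgoCD-owned route were chosen, the standard mechanism is the CreateNamespace sync option on the Application. A sketch with placeholder names and repo values (only the syncOptions stanza is the point here):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-bootstrap            # placeholder name
  namespace: openshift-gitops
spec:
  project: default                  # placeholder project
  source:
    repoURL: https://example.com/tenant-gitops.git   # placeholder repo
    path: tenant/bootstrap
  destination:
    server: https://kubernetes.default.svc
    namespace: agent-mcpuser-abc123
  syncPolicy:
    syncOptions:
      - CreateNamespace=true        # ArgoCD creates the namespace if missing
```

The trade-off noted above still applies: with this option ArgoCD owns the namespace lifecycle, so per-tenant labels, quotas, and RBAC that Ansible currently applies at creation time would need another home.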
Challenge: Showroom fails with "invalid value bearer" — wrong role used
The ocp4_workload_showroom_ocp_integration role uses oc login --token internally. Under config: namespace this fails: the token format triggers an "invalid value bearer" error. The role was designed for config: openshift-workloads.
Fix: Remove ocp4_workload_showroom_ocp_integration from your workloads: list entirely. The ocp4_workload_ocp_console_embed role is already run once by the cluster provisioner, so do not add it to tenant workloads. Add these two vars to common.yaml:
```yaml
openshift_api_url: "{{ sandbox_openshift_api_url }}"
openshift_cluster_admin_token: "{{ cluster_admin_agnosticd_sa_token }}"
```
Challenge: Catalog item in summit-2026/ — Sandbox API cannot find cluster
The summit-2026/account.yaml sets sandbox_api: reservation: pgpu-event at account level. This restricts ALL catalog items in that directory to only schedule against Summit event clusters. Shared clusters (like cnv-us-east-ocp-3) are not in that reservation, so Sandbox API returns "No OCP shared cluster configuration found." Empty string override (reservation: "") is not supported by AgV and does not work.
Fix: Move the catalog item to agd_v2/ instead of summit-2026/. The summit-2026/ directory is specifically for Summit event clusters; shared cluster labs that run year-round belong in agd_v2/. The agd_v2/account.yaml does not set a reservation, so any matching cluster in the pool can be scheduled.
Challenge: catch_all set as a role var — does not work
Setting ocp4_workload_litellm_virtual_keys_catch_all: false as a regular role variable does not prevent the Sandbox API from deleting other tenants' LiteMaaS keys. The catch_all setting must be in __meta__.sandbox_api.actions.destroy where Sandbox API reads it directly.
Fix: Move the setting under __meta__ in the AgV catalog item:

```yaml
__meta__:
  sandbox_api:
    actions:
      destroy:
        catch_all: false
```
Challenge: Missing tag in cloud_selector — cluster not found
The Sandbox API does a subset match — a cluster qualifies only if it has ALL the tags you specify. Missing even one tag means zero clusters match and the order fails before provisioning starts. Common mistake: omitting cloud: cnv-dedicated-shared.
Fix: Specify every required tag in cloud_selector:

```yaml
cloud_selector:
  cloud: cnv-dedicated-shared   # required
  demo: your-lab-name           # required
  purpose: prod                 # required
```
10. All Links
Pull Requests
| Repository | PR | Title | Status |
|---|---|---|---|
| agnosticd/agnosticd-v2 | PR #123 | config/namespace: support explicit remove_workloads list on destroy | Merged |
| agnosticd/namespaced_workloads | PR #12 | feat: add tenant roles for shared cluster labs (keycloak_user, namespace, gitea) | Open |
| agnosticd/core_workloads | PR #62 | gitops_bootstrap: userinfo ConfigMap discovery + cascade cleanup on destroy | Open |
| rhpds/agnosticv | PR #25180 | MCP Sandbox: tenant roles pattern, scheduler-only sandbox, gitops_bootstrap | Open |
Repositories
| Repo | Branch / path | What it contains |
|---|---|---|
| agnosticd/namespaced_workloads | main (merged ✓) | New tenant roles: tenant_keycloak_user, tenant_namespace, tenant_gitea |
| agnosticd/core_workloads | main (merged ✓) | ocp4_workload_gitops_bootstrap — cascade finalizer, remove_workload, multi-app list, userinfo ConfigMap |
| agnosticd/agnosticd-v2 | tenant-roles-support | AgnosticD v2 deployer with explicit remove_workloads list support |
| rhpds/agnosticv | tenant-namespace-roles | AgV catalog item: mcp-with-openshift-sandbox/common.yaml |
| rhpds/ocpsandbox-mcp-with-openshift-gitops | main | GitOps repo: infra/bootstrap · platform/bootstrap · tenant/bootstrap · tenant/librechat · mcp-gitea · mcp-openshift · agent |
| rhpds/rhpds.ocpsandbox_mcp_with_openshift | main | Cluster provisioner: cluster-provision.yml (Infra + Platform bootstrap) |
| rhpds/rhpds.ocpsandbox_mcp_with_openshift | main | Original workshop (before migration): mcp-with-openshift/common.yaml |
Documentation
| Page | What it covers |
|---|---|
| Overview | What is Sandbox API, three-layer architecture, both patterns compared, variables reference |
| Summit 2026 / Scheduler-Only | Complete annotated AgV common.yaml for the scheduler-only pattern |
| OCP Sandbox API | Complete annotated AgV common.yaml for the full OCP Sandbox API pattern |