Module 3: Understanding your Agentic AIOps Team
In Module 2, you watched the Deep Agents pipeline analyze a failure. Now you’re going to look under the hood — understand the architecture of your AI operations team, explore how agents and skills are defined, and then build and deploy a new specialist agent yourself.
But first — what kind of agent architecture are we using, and why does it matter?
From First to Second Generation Agents
First generation AI agents — the classic ReAct (Reason + Act) pattern — work well for short, focused tasks: answer a question, call an API, summarize a document. But they struggle with complex, multi-step operations like incident investigation. Their context is limited to conversation history, which overflows on long tasks. As the context window fills with increasingly irrelevant history, agents enter what researchers call the "dumb zone" — the point where an agent with a 200K token context window starts acting like it has a 20K window because most of its attention is wasted on noise. With 50+ tool calls, agents drift off-topic and forget their original objectives. Researchers have identified five specific failure modes:
- **Context distraction** — beyond ~100K tokens, agents start repeating actions from their history rather than synthesizing new plans
- **Context confusion** — too many tools or too much irrelevant information overwhelms the model. One study found an 8B model failed with 46 tools but succeeded with just 19
- **Context poisoning** — early errors or hallucinations become embedded in the history and repeatedly referenced
- **Context clash** — conflicting information across conversation turns causes contradictory assumptions, with performance dropping up to 39%
- **Lost goals** — with ~50+ tool calls, agents drift off-topic or forget earlier objectives as the original task gets buried
LangChain’s Deep Agents framework represents a second generation of agent architecture — sometimes called Agent 2.0. Instead of a single agent with a long, degrading conversation, a Deep Agent uses four architectural pillars:
- **Explicit planning** — to-do lists as persistent context anchors that prevent goal drift
- **Persistent filesystem** — a virtual workspace for saving notes, incident data, and structured artifacts outside the conversation
- **Sub-agent spawning** — delegating complex subtasks to specialists, each with a clean context window
- **Context isolation** — each specialist sees only what it needs, preventing cross-contamination between unrelated analyses
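The four pillars can be sketched in a few lines of plain Python. This is a toy illustration only: the names `plan`, `workspace`, and `spawn` are invented for this sketch and are not the Deep Agents API.

```python
# Toy sketch of the four pillars — NOT the real Deep Agents API; the names
# plan, workspace, and spawn are invented for this illustration.

plan = ["classify failure", "delegate analysis", "review ticket"]  # explicit planning

# Persistent filesystem: artifacts live here, outside any conversation history.
workspace = {"incident.json": '{"error": "No match for argument: python3.14"}'}

def spawn(subtask: str, files: dict) -> str:
    """Sub-agent spawning plus context isolation: the specialist starts from a
    clean context and sees only the files it is handed."""
    context = [subtask, files["incident.json"]]  # fresh, minimal context
    return f"done: {subtask}"                    # context discarded on return

# The plan anchors the goal; each step runs in its own isolated context.
workspace.update({step: spawn(step, workspace) for step in plan})
print(sorted(workspace))
```

The key property: nothing a specialist reads ever leaks back into the shared `workspace` except its small final result.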
Agentic Harnesses
Deep Agents is one example of an agentic harness — a framework that wraps and drives an LLM rather than letting it run unsupervised. Other harnesses include Claude Code (Anthropic’s coding agent) and the Claude Agent SDK (for building custom agents). What they share: the harness manages context, tools, delegation, and output parsing. A well-designed harness makes smaller models perform like larger ones — which is why you can switch to open source models in Module 4 without losing quality.
| Dimension | First Generation (ReAct Only) | Second Generation (ReAct + Harness) |
|---|---|---|
| Best for | Short, focused tasks | Multi-step projects |
| Context | Conversation history only | Filesystem + sub-agents |
| Goal persistence | Degrades over time | Maintained via planning |
| Context isolation | None — everything in one window | Each sub-agent gets a clean context |
Athena uses all four pillars.
The ops_manager maintains an explicit plan (the triage protocol in AGENTS.md).
Skills and incident data live on the filesystem.
Specialist subagents are spawned on demand and destroyed after use.
And each subagent operates in context isolation — sre_linux never sees the networking error that sre_networking is investigating.
Meet Athena — your AIOps team
Athena is not a single AI agent. It’s a team of specialized agents, orchestrated by a manager, each with its own expertise, tools, and domain knowledge.
The ops_manager — your team lead
At the center is ops_manager, a long-running Deep Agent that uses the ReAct (Reason + Act) pattern. When a failure arrives via webhook, ops_manager:
1. **Reads** the incident data (job output, error details, metadata)
2. **Reasons** about which domain this failure belongs to
3. **Acts** by delegating to the right specialist via the `task` tool
4. **Reviews** the specialist's work via the `reviewer` agent
5. **Returns** the final ticket
The ops_manager stays running across the entire analysis — it’s the orchestrator that holds the conversation together.
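The steps above are the classic ReAct cycle. A stripped-down sketch of that loop follows; the hard-coded `model_step` stub stands in for the LLM, which in Athena produces each thought and action.

```python
# Minimal ReAct-style loop (illustrative stub — in Athena an LLM produces
# each thought/action; here a hard-coded policy stands in for the model).

def model_step(history: list[str]) -> tuple[str, str]:
    """Reason about the latest observation and choose the next action."""
    if "No match for argument" in history[-1]:
        return ("act", "task('sre_linux', incident)")  # delegate via the task tool
    return ("finish", "return final ticket")

history = ["fatal: FAILED! => No match for argument: python3.14"]
while True:
    kind, action = model_step(history)
    if kind == "finish":
        break
    history.append(f"Act: {action}")

print(history[-1])  # Act: task('sre_linux', incident)
```

Note how every observation is appended to `history`: this is exactly the unbounded context growth that the second-generation pillars are designed to contain.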
The sre_ subagents — your specialists
The four sre_* subagents are ephemeral — they’re created on demand when ops_manager calls task(), and discarded after they return their analysis. Each one is a focused expert:
| Subagent | Expertise |
|---|---|
| `sre_ansible` | Playbook errors, credential issues, execution environments, variable resolution |
| `sre_linux` | Systemd services, SELinux denials, filesystem/permissions |
| `sre_openshift` | Pod lifecycle, RBAC, operators, namespace/quota, security contexts |
| `sre_networking` | DNS, SSH connectivity, proxy/TLS, routing, firewall |
Why ephemeral subagents matter — context isolation
This is a critical architectural decision. Consider what would happen if a single agent handled everything:
1. `ops_manager` analyzes a package failure — 3,000 tokens of dnf output fill the context
2. Next, an SSH failure arrives — the package output is still in context, confusing the analysis
3. By the fifth failure, the context is bloated with irrelevant data from previous investigations
Ephemeral subagents solve this. Each sre_* agent starts with a fresh context — only the incident data and its domain-specific skills. When it finishes, its context is discarded. The ops_manager only sees the final TicketPayload result, not the specialist’s internal reasoning. This keeps every analysis clean and focused.
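A back-of-the-envelope calculation shows why this matters for token budgets. The numbers here are invented for illustration, not measured:

```python
# Illustrative token accounting — numbers are invented, not measured.
RAW_DIAGNOSTICS = 3000   # tokens of dnf/ssh output a specialist must read
TICKET_PAYLOAD = 150     # tokens in the compact result the manager receives
INCIDENTS = 5

monolith_context = INCIDENTS * RAW_DIAGNOSTICS   # one agent carries everything
manager_context = INCIDENTS * TICKET_PAYLOAD     # manager sees only final tickets

print(monolith_context, manager_context)  # 15000 750
```

Under these assumptions the orchestrator's context grows 20x more slowly, keeping it well away from the "dumb zone" even after many incidents.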
Skills — domain knowledge without code
Each subagent is loaded with skills — markdown documents that provide domain-specific diagnostic workflows, institutional knowledge, and output formatting guidance. Skills are the reference library that turns a general-purpose LLM into a domain expert.
Tools — how agents take action
Agents can call tools — functions that let them interact with external systems. In Athena, the key tool is `task()`, which lets `ops_manager` delegate to subagents. Subagents also have `web_search` for looking up documentation.
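Conceptually, a tool is just a named function the harness executes on the model's behalf. The sketch below is illustrative only (the stubs and the `TOOLS` registry are invented for this example, not Athena's actual implementation):

```python
# Sketch of tool dispatch (illustrative — not Athena's actual implementation).
# An agent framework exposes plain functions; the model picks one by name,
# the harness executes it and feeds the result back into the conversation.

def task(subagent: str, instructions: str) -> str:
    """Stub for the delegation tool: spawn a specialist, return its result."""
    return f"{subagent} analyzing: {instructions}"

def web_search(query: str) -> str:
    """Stub for documentation lookup."""
    return f"results for {query!r}"

TOOLS = {"task": task, "web_search": web_search}

# The harness receives a tool call chosen by the model, e.g.:
name, args = "task", ("sre_networking", "SSH timeout to rhel-node-01")
print(TOOLS[name](*args))  # sre_networking analyzing: SSH timeout to rhel-node-01
```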
Models — the reasoning engine
Each agent can use a different model from MaaS. The ops_manager might use a frontier model for complex reasoning, while subagents use a smaller, cheaper model for focused domain analysis. The reviewer can use an even lighter model since it only validates, not analyzes.
In the next exercises, you’ll see a gap in the current team’s capabilities, explore the architecture, add a specialist, and then watch the same failure get handled far better.
Exercise 1: See the gap
The AI/ML team at Meridian Financial has requested Python 3.14 on their RHEL servers. This job ran automatically when your environment was provisioned — so the ticket is already waiting for you in Kira.
The job failed with this error:
```
TASK [Install Python 3.14] ***
fatal: [rhel-node-01]: FAILED! => {"changed": false,
    "failures": ["No match for argument: python3.14"],
    "msg": "Failed to install some of the specified packages",
    "rc": 1, "results": []}
```
Step 1: Read the Kira ticket
1. In the Kira tab (log in with `user-12345` / `deeper-agents` if prompted), find the open ticket for **10 Install Python 3.14**

2. Read the ticket carefully. Notice what it doesn't contain:

   - No mention of Meridian Financial's Satellite servers or the specific content view model used by different teams
   - No reference to the Content View Request SOP (Meridian Standard Operating Procedure v2.3)
   - No awareness that the AI/ML team needs CRB (CodeReady Builder) and EPEL to be enabled in their content view
   - No suggestion of the `#platform-satellite` fast-track escalation path

   `sre_linux` gave a generic "check your package manager and repositories" answer. Technically accurate — but useless to anyone at Meridian. It doesn't know how your organization works.

This is the gap a specialist agent closes. Let's build one.
Exercise 2: Explore the Athena architecture
Before adding a new agent, let’s understand what’s already there.
1. Click the Gitea tab — no login required, the Athena repo is public. Navigate to the `athena-aiops-deep-agent` repository. This is the source code for the Deep Agent running in your namespace.

2. Browse to `subagents.yaml` in the repository root. This is the file that defines every specialist agent in the system:

   | Agent | Domain |
   |---|---|
   | `sre_ansible` | Playbook syntax, role/collection errors, credential issues, execution environments |
   | `sre_linux` | Systemd services, SELinux denials, filesystem/permissions |
   | `sre_openshift` | Pod lifecycle, image pull, RBAC, operators, namespace/quota |
   | `sre_networking` | DNS, SSH connectivity, proxy/TLS, routing, firewall |
   | `reviewer` | Quality validation — checks ticket coherence before submission |

   Each agent has the same structure: `description`, `model`, `system_prompt`, `tools`, and `skills`. The `ops_manager` reads the `description` field to decide which agent to route a failure to.

3. Browse to the `skills/` directory. Each subdirectory contains a `SKILL.md` file — a markdown document that gets loaded into the agent's context as reference knowledge. For example, `skills/analyze-linux-failure/SKILL.md` contains a structured diagnostic workflow.

   Skills are not code — they're instructions written in plain language. Anyone who understands the domain can write or improve a skill without touching Python.

4. Optionally, browse the Helm chart at `deploy/helm/athena/templates/`. The key files are `configmap.yaml` (agent config), `pvc.yaml` (skills volume), and `deployment.yaml` (which mounts both and includes an initContainer that pre-populates the skills PVC from the image on first boot).
Exercise 3: Inspect the extensibility layer
Here’s where it gets interesting. In your namespace, Athena’s agent configuration and skills are not baked into the container image. They’re mounted from Kubernetes resources that you can modify at runtime.
-
Click the Terminal tab and log in to your OpenShift namespace:
oc login --insecure-skip-tls-verify \ -u user-12345 \ -p deeper-agents \ https://openshift.example.com:6443 \ --namespace user-12345-agenticSample output:
Login successful. Using project "user-XXXXX-agentic".
-
Inspect the ConfigMap that holds the agent definitions:
oc get configmap athena-agent-config -o yaml | bat --language yamlSample output (your results may vary):
apiVersion: v1 data: AGENTS.md: | # Athena Ops Manager ... subagents.yaml: | sre_ansible: description: > Ansible/AAP2 specialist... ... kind: ConfigMap metadata: name: athena-agent-config ...The ConfigMap stores two files as keys in
data:. Each key becomes a file when mounted into the pod. This means you can add a new agent by updating this ConfigMap — no image rebuild needed.Use oc get configmap athena-agent-config -o yaml | bat --language yamlto browse the entire file — you’ll see the fullAGENTS.mdandsubagents.yamlcontents -
Inspect the PVC that holds the skills:
oc exec deployment/athena --container athena \ -- ls /app/skills/Sample output:
analyze-ansible-failure analyze-linux-failure analyze-networking-failure analyze-openshift-failure common create-ticket error-classifier lost+found review-ticket
The
skills/directory is mounted from a PersistentVolumeClaim. You can add new skills by copying files onto the PVC — again, no image rebuild needed.
Notice the lost+found directory? The PVC, unlike the ConfigMap, appears as a mounted filesystem to the container. You can confirm this with oc exec deployment/athena --container athena — df, this makes it simple to add new skills without rebuilding the image, or to iterate on them.
Understand the implications. With configuration mounted at runtime, you can:

- **Add a new specialist agent** — update the ConfigMap to add a new entry to `subagents.yaml`
- **Add domain knowledge** — copy a new skill directory onto the PVC
- **Improve existing agents** — edit the system prompt or skill content
- **Change models** — switch an agent to a cheaper or more capable model
- **Roll out changes** — restart the pod to pick up the new configuration

All without writing a single line of Python. Let's do exactly that.
Exercise 4: Add the sre_package_management agent
Currently, package management failures go to sre_linux — which knows the Linux ecosystem but nothing about Meridian Financial’s Satellite infrastructure, content view model, or escalation paths. You’re going to add a dedicated sre_package_management agent with a skill that encodes this institutional knowledge.
Step 3a: Create the skill directory on the PVC
1. Create the skill directory:

   ```shell
   oc exec deployment/athena --container athena \
     -- mkdir -p /app/skills/analyze-package-management
   ```
Step 3b: Download the skill and copy it to the PVC
1. Download the package management analysis skill from your Gitea repository and copy it directly onto the PVC:

   ```shell
   curl -sL \
     https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/analyze-package-management-SKILL.md \
     | oc exec --stdin deployment/athena --container athena \
       -- sh -c 'cat > /app/skills/analyze-package-management/SKILL.md'
   ```

2. Verify the skill was created:

   ```shell
   oc exec deployment/athena --container athena \
     -- cat /app/skills/analyze-package-management/SKILL.md \
     | bat --language markdown
   ```

   Sample output:

   ```markdown
   ---
   name: analyze-package-management-failure
   description: Package management failure analysis with Meridian Financial institutional knowledge
   ---

   # Analyze Package Management Failure

   Deep analysis of AAP2 job failures caused by package installation, repository,
   or content view issues on RHEL hosts.

   ## Institutional Knowledge — Meridian Financial Satellite Infrastructure

   **Satellite Infrastructure:**
   - Primary: satellite-primary.meridian.internal (RHEL 9 content)
   - Sync schedule: Tuesdays 02:00 UTC

   **Content View Model (per-team):**
   - base-rhel9: All RHEL servers — core OS packages only
   - ops-tooling: SRE team — monitoring, diagnostic tools
   - aiml-workloads: AI/ML team — Python 3.11+, CUDA, data science libs (CRB + EPEL enabled)
   ...
   ```

   This is the tribal knowledge that makes this agent smarter than the generic `sre_linux`: it knows about Meridian's Satellite servers, which content view belongs to which team, the SOP for requesting content view changes, and the fast-track escalation channel.
Step 3c: Update the ConfigMap with the new agent definition
Here’s what the new sre_package_management agent entry looks like:
```yaml
sre_package_management:
  description: >                  # (1)
    Package management specialist. Delegate all package installation failures:
    dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
    requirements, repository sync issues, and content view request escalation.
  model: claude-sonnet-4-6        # (2)
  system_prompt: |                # (3)
    You are a senior SRE specializing in Red Hat package management and Satellite
    content delivery. You receive incident data from failed AAP2 jobs and perform
    root-cause analysis on package-related failures.

    Always:
    - Read the incident context (incident.json) first
    - Identify the exact package(s) that failed to install
    - Determine which team's content view is involved (check host group)
    - Reference the analyze-package-management skill for Meridian-specific knowledge
    - Check whether CRB or EPEL is needed and whether the team's content view includes it
    - Provide the SOP reference and escalation path for content view requests

    Use the create-ticket skill to structure your analysis as a TicketPayload.
    Set area to "linux" for all package management issues.
  tools:                          # (4)
    - web_search
  skills:                         # (5)
    - ./skills/analyze-package-management/
    - ./skills/create-ticket/
    - ./skills/common/
```
| Callout | Explanation |
|---|---|
| 1 | **Description** — the `ops_manager` reads this to decide when to route a failure here. Mentioning "package installation", "dnf/yum", "Satellite", "content view", and "EPEL/CRB" ensures package failures get routed here instead of to `sre_linux` |
| 2 | **Model** — using `claude-sonnet-4-6` for strong reasoning. For a cost-sensitive deployment, you could use a smaller model |
| 3 | **System prompt** — instructs the agent to identify team content view gaps and reference the skill for institutional knowledge |
| 4 | **Tools** — `web_search` lets the agent look up package availability or upstream documentation if needed |
| 5 | **Skills** — points to the skill you just created, plus the shared `create-ticket` and `common` skills |
Rather than manually editing YAML in a terminal editor (where a single indentation error would break the configuration), download the complete pre-built `subagents.yaml` that includes `sre_package_management`:
1. Download the updated subagents configuration:

   ```shell
   curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/subagents-with-sre-package-mgmt.yaml \
     -o /tmp/subagents-new.yaml
   ```

2. Verify the download — confirm `sre_package_management` is present:

   ```shell
   grep "^sre_\|^reviewer" /tmp/subagents-new.yaml
   ```

   Expected output:

   ```
   sre_ansible:
   sre_linux:
   sre_openshift:
   sre_networking:
   sre_package_management:
   reviewer:
   ```

   Notice the new `sre_package_management:` entry between `sre_networking:` and `reviewer:`.
Step 3d: Patch the ConfigMap
You need to update two keys in the ConfigMap: `subagents.yaml` (to add the new agent) and `AGENTS.md` (to update `ops_manager`'s routing rules so it knows to delegate package failures to `sre_package_management` instead of `sre_linux`).
1. Download the updated `ops_manager` routing rules:

   ```shell
   curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/agents-with-sre-package-mgmt.md \
     -o /tmp/agents-new.md
   ```

2. Verify the routing update — confirm `sre_package_management` appears in Domain Awareness:

   ```shell
   grep "sre_package_management\|sre_linux" /tmp/agents-new.md
   ```

   Expected output:

   ```
   - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
   - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)
   ```

3. Apply both files to the ConfigMap in one command:

   ```shell
   oc create configmap athena-agent-config \
     --from-file=subagents.yaml=/tmp/subagents-new.yaml \
     --from-file=AGENTS.md=/tmp/agents-new.md \
     --dry-run=client -o yaml | oc apply -f -
   ```

   Expected output:

   ```
   configmap/athena-agent-config configured
   ```

   You may see a warning about a missing `kubectl.kubernetes.io/last-applied-configuration` annotation — this is safe to ignore. It only appears the first time, because the ConfigMap was originally created by Helm, not `oc apply`.

   This command rebuilds the ConfigMap from both downloaded files and applies it in place. Both `subagents.yaml` (new agent definition) and `AGENTS.md` (updated routing rules) are replaced in one atomic apply.
Step 3d.5: Update the error classifier skill on the PVC
The `ops_manager` uses an error-classifier skill to identify the failure domain before routing. The version on your PVC was initialized from the image at first boot and doesn't know about the `package_management` domain yet — so even with the new agent in place, the classifier would still emit `ansible` or `linux` for a dnf failure.
Adding a new domain to the classifier is the same as teaching a team lead a new category of problem. You’re telling it: "when you see 'No package X available', that’s a package management issue, not an Ansible issue."
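The decision the classifier skill encodes can be sketched as a keyword-to-domain lookup. This is illustrative only — the real classifier is a markdown skill the LLM reads, not Python, and these signature strings are examples rather than the skill's actual rule set:

```python
# Illustrative sketch — the real classifier is a markdown skill, not Python.
# It mimics the routing decision: map error signatures to a domain, now
# including the new package_management domain.

RULES = [
    ("No match for argument", "package_management"),
    ("No package", "package_management"),
    ("SELinux is preventing", "linux"),
    ("Name or service not known", "networking"),
    ("ImagePullBackOff", "openshift"),
]

def classify(error_text: str) -> str:
    for signature, domain in RULES:
        if signature in error_text:
            return domain
    return "ansible"  # fall back to the playbook-level specialist

print(classify("No match for argument: python3.14"))  # package_management
```

Without the first two rules, the dnf error would fall through to the generic fallback — which is exactly the misrouting you are fixing in this step.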
1. Download the updated error-classifier skill:

   ```shell
   curl -sL \
     https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/error-classifier-with-package-mgmt-SKILL.md \
     | oc exec --stdin deployment/athena --container athena \
       -- sh -c 'cat > /app/skills/error-classifier/SKILL.md'
   ```

2. Verify the classifier now includes the `package_management` domain:

   ```shell
   oc exec deployment/athena --container athena \
     -- grep "Package Management\|package_management" \
     /app/skills/error-classifier/SKILL.md
   ```

   Expected output:

   ```
   - **Package Management**: dnf/yum errors, "No match for argument", "No package X available", missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements, subscription-manager errors
   - `domain`: one of ansible, linux, package_management, openshift, networking
   ```
Step 3e: Roll out the change
1. Restart the Athena pod to pick up the new ConfigMap and skill:

   ```shell
   oc rollout restart deployment/athena
   ```

   Expected output:

   ```
   deployment.apps/athena restarted
   ```

2. Wait for the new pod to come up:

   ```shell
   oc rollout status deployment/athena --timeout=120s
   ```

   Expected output:

   ```
   Waiting for deployment "athena" rollout to finish: 1 old replicas are pending termination...
   deployment "athena" successfully rolled out
   ```
A rolling restart is one of the most powerful primitives OpenShift and Kubernetes give you. It spins up a new pod with the updated configuration, waits for it to pass health checks, and only then terminates the old one — zero downtime. You just reconfigured an AI operations team in production with two commands and no service interruption.
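The zero-downtime behavior comes from the Deployment's rolling update strategy. The fragment below shows the standard Kubernetes fields with their default values; the athena chart may override them, so treat this as a reference sketch rather than the chart's actual settings:

```yaml
# Kubernetes defaults shown for illustration; the athena Deployment may
# set its own values in the Helm chart.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra pods allowed above desired count during rollout
      maxUnavailable: 25%    # pods allowed to be unavailable at any moment
```

With a single replica, `maxSurge` is what matters: the new pod starts and passes health checks before the old one is terminated.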
Verify the new agent
1. Confirm the new skill is on the PVC:

   ```shell
   oc exec deployment/athena --container athena \
     -- ls /app/skills/analyze-package-management/
   ```

   Expected output:

   ```
   SKILL.md
   ```

2. Confirm `sre_package_management` is in the ConfigMap:

   ```shell
   oc get configmap athena-agent-config \
     -o jsonpath='{.data.subagents\.yaml}' \
     | grep -A3 "sre_package_management"
   ```

   Expected output:

   ```
   sre_package_management:
     description: >
       Package management specialist. Delegate all package installation failures:
       dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
   ```

3. Confirm `ops_manager` routing rules include `sre_package_management`:

   ```shell
   oc get configmap athena-agent-config \
     -o jsonpath='{.data.AGENTS\.md}' \
     | grep "sre_package_management\|sre_linux"
   ```

   Expected output:

   ```
   - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
   - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)
   ```

Both the agent definition (`subagents.yaml`) and the routing rules (`AGENTS.md`) are now updated. The `ops_manager` will route package management failures to `sre_package_management` instead of `sre_linux`.
Exercise 5: Same job, better specialist
Now let’s trigger the same failure that exposed the gap in Exercise 1.
1. In the AAP2 tab (log in with `user-12345` / `deeper-agents` if prompted), navigate to **Automation Execution → Templates**

2. Launch **10 Install Python 3.14** again by clicking the rocket icon

   

   Same playbook. Same RHEL host. Same error — `No match for argument: python3.14`.

3. Watch the job fail, then wait 1–3 minutes for the pipeline to process it

4. In the Kira tab (log in with `user-12345` / `deeper-agents` if prompted), look for the new ticket. Pay attention to:

   - Does the analysis identify the AI/ML team's content view (`aiml-workloads`) as the source of the gap?
   - Does it reference Satellite (`satellite-primary.meridian.internal`) and the weekly sync schedule?
   - Does it mention that the `aiml-workloads` content view requires CRB and EPEL to be enabled for Python 3.14?
   - Does the recommended action cite the Content View Request SOP v2.3 and the `#platform-satellite` fast-track escalation channel?

   This is the power of Skills — you encoded Meridian Financial's institutional knowledge into a markdown file, and the agent used it to provide context-aware analysis that goes far beyond generic package troubleshooting.

   Use the AI Chatbot to ask follow-up questions about the analysis.

5. In the Rocket.Chat tab (log in with `user-12345` / `deeper-agents` if prompted), check `#support` for the notification
Exercise 6: Compare the two tickets side by side
You now have two Kira tickets for the exact same job failure — 10 Install Python 3.14 — processed by two different agents.
-
In Kira, open both tickets and compare them:
First ticket — sre_linuxSecond ticket — sre_package_managementGeneric package manager advice
Identifies
aiml-workloadscontent view gapNo Satellite context
References
satellite-primary.meridian.internalNo SOP reference
Cites Content View Request SOP v2.3
No team routing
Escalates to
#platform-satellitefast-trackNo EPEL/CRB awareness
Specifies CRB + EPEL requirements for Python 3.14
-
The infrastructure was identical. The error was identical. The difference was entirely in what the agent knew about how Meridian Financial operates.
That institutional knowledge lives in a markdown file. You added it — and extended an AI operations team — without writing any Python.
Takeaways
- Athena's agents and skills are defined in configuration, not code
- OpenShift provides the extensibility mechanism: ConfigMap for agents, PVC for skills
- Adding a new specialist agent is a four-step process: create the skill, update the error classifier on the PVC, update the ConfigMap, restart the pod
- Skills encode institutional knowledge in plain language — anyone who understands the domain can write them
- The `ops_manager` automatically routes failures to the right specialist based on the agent's `description` field
- Same failure, same infrastructure, completely different analysis — the difference is domain expertise encoded in a skill
- You extended an AI operations team without writing any Python