Module 3: Understanding your Agentic AIOps Team

In Module 2, you watched the Deep Agents pipeline analyze a failure. Now you’re going to look under the hood — understand the architecture of your AI operations team, explore how agents and skills are defined, and then build and deploy a new specialist agent yourself.

But first — what kind of agent architecture are we using, and why does it matter?

From First to Second Generation Agents

First generation AI agents — the classic ReAct (Reason + Act) pattern — work well for short, focused tasks: answer a question, call an API, summarize a document. But they struggle with complex, multi-step operations like incident investigation. Their context is limited to conversation history, which overflows on long tasks. As the context window fills with increasingly irrelevant history, agents enter what researchers call the "dumb zone" — the point where an agent with a 200K token context window starts acting like it has a 20K window, because most of its attention is wasted on noise. Researchers have identified five specific failure modes:

  • Context distraction — beyond ~100K tokens, agents start repeating actions from their history rather than synthesizing new plans

  • Context confusion — too many tools or too much irrelevant information overwhelms the model. One study found an 8B model failed with 46 tools but succeeded with just 19

  • Context poisoning — early errors or hallucinations become embedded in the history and repeatedly referenced

  • Context clash — conflicting information across conversation turns causes contradictory assumptions, with performance dropping up to 39%

  • Lost goals — with ~50+ tool calls, agents drift off-topic or forget earlier objectives as the original task gets buried

LangChain’s Deep Agents framework represents a second generation of agent architecture — sometimes called Agent 2.0. Instead of a single agent with a long, degrading conversation, a Deep Agent uses four architectural pillars:

  • Explicit planning — to-do lists as persistent context anchors that prevent goal drift

  • Persistent filesystem — a virtual workspace for saving notes, incident data, and structured artifacts outside the conversation

  • Sub-agent spawning — delegating complex subtasks to specialists, each with a clean context window

  • Context isolation — each specialist sees only what it needs, preventing cross-contamination between unrelated analyses

Agentic Harnesses

Deep Agents is one example of an agentic harness — a framework that wraps and drives an LLM rather than letting it run unsupervised. Other harnesses include Claude Code (Anthropic’s coding agent) and the Claude Agent SDK (for building custom agents). What they share: the harness manages context, tools, delegation, and output parsing. A well-designed harness makes smaller models perform like larger ones — which is why you can switch to open source models in Module 4 without losing quality.

| Dimension | First Generation (ReAct Only) | Second Generation (ReAct + Harness) |
| --- | --- | --- |
| Best for | Short, focused tasks | Multi-step projects |
| Context | Conversation history only | Filesystem + sub-agents |
| Goal persistence | Degrades over time | Maintained via planning |
| Context isolation | None — everything in one window | Each sub-agent gets a clean context |

Athena uses all four pillars. The ops_manager maintains an explicit plan (the triage protocol in AGENTS.md). Skills and incident data live on the filesystem. Specialist subagents are spawned on demand and destroyed after use. And each subagent operates in context isolation — sre_linux never sees the networking error that sre_networking is investigating.

Meet Athena — your AIOps team

Athena is not a single AI agent. It’s a team of specialized agents, orchestrated by a manager, each with its own expertise, tools, and domain knowledge.

Athena team architecture — ops_manager delegates to specialist subagents

The ops_manager — your team lead

At the center is ops_manager, a long-running Deep Agent that uses the ReAct (Reason + Act) pattern. When a failure arrives via webhook, ops_manager:

  1. Reads the incident data (job output, error details, metadata)

  2. Reasons about which domain this failure belongs to

  3. Acts by delegating to the right specialist via the task tool

  4. Reviews the specialist’s work via the reviewer agent

  5. Returns the final ticket

The ops_manager stays running across the entire analysis — it’s the orchestrator that holds the conversation together.

The sre_ subagents — your specialists

The four sre_* subagents are ephemeral — they’re created on demand when ops_manager calls task(), and discarded after they return their analysis. Each one is a focused expert:

| Subagent | Expertise |
| --- | --- |
| sre_ansible | Playbook errors, credential issues, execution environments, variable resolution |
| sre_linux | Systemd services, SELinux denials, filesystem/permissions |
| sre_openshift | Pod lifecycle, RBAC, operators, namespace/quota, security contexts |
| sre_networking | DNS, SSH connectivity, proxy/TLS, routing, firewall |

Why ephemeral subagents matter — context isolation

This is a critical architectural decision. Consider what would happen if a single agent handled everything:

  • ops_manager analyzes a package failure — 3000 tokens of dnf output fills the context

  • Next, an SSH failure arrives — the package output is still in context, confusing the analysis

  • By the fifth failure, the context is bloated with irrelevant data from previous investigations

Ephemeral subagents solve this. Each sre_* agent starts with a fresh context — only the incident data and its domain-specific skills. When it finishes, its context is discarded. The ops_manager only sees the final TicketPayload result, not the specialist’s internal reasoning. This keeps every analysis clean and focused.

Skills — domain knowledge without code

Each subagent is loaded with skills — markdown documents that provide domain-specific diagnostic workflows, institutional knowledge, and output formatting guidance. Skills are the reference library that turns a general-purpose LLM into a domain expert.
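The format is simple: YAML frontmatter plus plain-language instructions. A minimal skeleton might look like this (the section names here are illustrative; the real skills you'll browse in Exercise 2 are the reference):

---
name: analyze-example-failure
description: One-line summary the agent uses to decide when this skill applies
---

# Analyze Example Failure

## Diagnostic workflow
1. Read the incident data; identify the failing task and the exact error message.
2. Work through the likely causes for this domain, most common first.

## Institutional knowledge
- Internal server names, content views, SOPs, and escalation channels live here.

## Output
Structure the findings as a TicketPayload, following the create-ticket skill.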

Tools — how agents take action

Agents can call tools — functions that let them interact with external systems. In Athena, the key tool is task() which lets ops_manager delegate to subagents. Subagents also have web_search for looking up documentation.
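Conceptually, each delegation is a single tool call. A hypothetical invocation, rendered as YAML, might look like this (the argument names follow the Deep Agents task tool convention, but treat the exact shape as illustrative):

tool: task
args:
  subagent_type: sre_networking    # which specialist to spawn
  description: >                   # the subtask, stated in full
    Analyze the SSH connectivity failure captured in incident.json
    and return a structured TicketPayload with root cause and remediation.

The spawned subagent runs with only this description and its own skills in context, then returns a single result to ops_manager.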

Models — the reasoning engine

Each agent can use a different model from MaaS. The ops_manager might use a frontier model for complex reasoning, while subagents use a smaller, cheaper model for focused domain analysis. The reviewer can use an even lighter model since it only validates, not analyzes.
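In configuration terms, that tiering is just each agent's model field. A hypothetical mix (the smaller model names are placeholders, not necessarily what your lab uses):

ops_manager:
  model: claude-sonnet-4-6       # frontier model: cross-domain reasoning and routing
sre_networking:
  model: small-efficient-model   # placeholder: focused single-domain analysis
reviewer:
  model: lightweight-model       # placeholder: validates ticket structure only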

In the next exercises, you’ll see a gap in the current team’s capabilities, explore the architecture, add a specialist, and then watch the same failure get handled far better.

Exercise 1: See the gap

The AI/ML team at Meridian Financial has requested Python 3.14 on their RHEL servers. This job ran automatically when your environment was provisioned — so the ticket is already waiting for you in Kira.

The job failed with this error:

TASK [Install Python 3.14] ***
fatal: [rhel-node-01]: FAILED! => {"changed": false,
  "failures": ["No match for argument: python3.14"],
  "msg": "Failed to install some of the specified packages",
  "rc": 1, "results": []}

Step 1: Read the Kira ticket

  1. In the Kira tab (log in with user-12345 / deeper-agents if prompted), find the open ticket for 10 Install Python 3.14

  2. Read the ticket carefully. Notice what it doesn’t contain:

    • No mention of Meridian Financial’s Satellite servers or the specific content view model used by different teams

    • No reference to the Content View Request SOP (Meridian Standard Operating Procedure v2.3)

    • No awareness that the AI/ML team needs CRB (CodeReady Builder) and EPEL to be enabled in their content view

    • No suggestion of the #platform-satellite fast-track escalation path

    sre_linux gave a generic "check your package manager and repositories" answer. Technically accurate — but useless to anyone at Meridian. It doesn’t know how your organization works.

    This is the gap a specialist agent closes. Let’s build one.

Exercise 2: Explore the Athena architecture

Before adding a new agent, let’s understand what’s already there.

  1. Click the Gitea tab — no login required, the Athena repo is public

    Navigate to the athena-aiops-deep-agent repository. This is the source code for the Deep Agent running in your namespace.

  2. Browse to subagents.yaml in the repository root

    This is the file that defines every specialist agent in the system:

    | Agent | Domain |
    | --- | --- |
    | sre_ansible | Playbook syntax, role/collection errors, credential issues, execution environments |
    | sre_linux | Systemd services, SELinux denials, filesystem/permissions |
    | sre_openshift | Pod lifecycle, image pull, RBAC, operators, namespace/quota |
    | sre_networking | DNS, SSH connectivity, proxy/TLS, routing, firewall |
    | reviewer | Quality validation — checks ticket coherence before submission |

    Each agent has the same structure: description, model, system_prompt, tools, and skills. The ops_manager reads the description field to decide which agent to route a failure to.
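    In outline, an entry looks like this (a skeleton, not a literal excerpt from the file):

    agent_name:
      description: >          # ops_manager reads this to decide routing
        One or two sentences describing when to delegate here.
      model: model-name       # which MaaS model this agent runs on
      system_prompt: |        # the agent's role and standing instructions
        You are a senior SRE specializing in ...
      tools:                  # callable functions, e.g. web_search
        - web_search
      skills:                 # paths to SKILL.md directories
        - ./skills/some-skill/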

  3. Browse to the skills/ directory

    Each subdirectory contains a SKILL.md file — a markdown document that gets loaded into the agent’s context as reference knowledge. For example, skills/analyze-linux-failure/SKILL.md contains a structured diagnostic workflow.

    Skills are not code — they’re instructions written in plain language. Anyone who understands the domain can write or improve a skill without touching Python.

  4. Optionally, browse the Helm chart at deploy/helm/athena/templates/

    The key files are configmap.yaml (agent config), pvc.yaml (skills volume), and deployment.yaml (which mounts both and includes an initContainer that pre-populates the skills PVC from the image on first boot).
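    If you're curious how these pieces connect, here is a simplified sketch of the relevant deployment.yaml stanzas (names, image references, and mount paths are illustrative; the actual template in the repo is authoritative):

    spec:
      template:
        spec:
          initContainers:
            - name: seed-skills                 # illustrative: copies bundled skills onto the PVC on first boot
              image: athena:latest              # illustrative image reference
              command: ["sh", "-c", "cp -rn /opt/skills-seed/. /app/skills/"]   # assumed copy logic
              volumeMounts:
                - name: skills
                  mountPath: /app/skills
          containers:
            - name: athena
              volumeMounts:
                - name: agent-config            # ConfigMap keys appear as files here
                  mountPath: /app/config        # assumed mount path
                - name: skills                  # PVC-backed, writable at runtime
                  mountPath: /app/skills
          volumes:
            - name: agent-config
              configMap:
                name: athena-agent-config
            - name: skills
              persistentVolumeClaim:
                claimName: athena-skills        # illustrative claim name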

Exercise 3: Inspect the extensibility layer

Here’s where it gets interesting. In your namespace, Athena’s agent configuration and skills are not baked into the container image. They’re mounted from Kubernetes resources that you can modify at runtime.

  1. Click the Terminal tab and log in to your OpenShift namespace:

    oc login --insecure-skip-tls-verify \
      -u user-12345 \
      -p deeper-agents \
      https://openshift.example.com:6443 \
      --namespace user-12345-agentic

    Sample output:

    Login successful.
    Using project "user-12345-agentic".
  2. Inspect the ConfigMap that holds the agent definitions:

    oc get configmap athena-agent-config -o yaml | bat --language yaml

    Sample output (your results may vary):

    apiVersion: v1
    data:
      AGENTS.md: |
        # Athena Ops Manager
        ...
      subagents.yaml: |
        sre_ansible:
          description: >
            Ansible/AAP2 specialist...
        ...
    kind: ConfigMap
    metadata:
      name: athena-agent-config
      ...

    The ConfigMap stores two files as keys in data:. Each key becomes a file when mounted into the pod. This means you can add a new agent by updating this ConfigMap — no image rebuild needed.

    Scroll through the full output — you'll see the complete AGENTS.md and subagents.yaml contents.
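    To extract a single key instead of the whole object, use a jsonpath query, the same pattern the verification steps at the end of Exercise 4 use (note the escaped dot in the key name):

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.subagents\.yaml}' | head -20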
  3. Inspect the PVC that holds the skills:

    oc exec deployment/athena --container athena \
      -- ls /app/skills/

    Sample output:

    analyze-ansible-failure
    analyze-linux-failure
    analyze-networking-failure
    analyze-openshift-failure
    common
    create-ticket
    error-classifier
    lost+found
    review-ticket

    The skills/ directory is mounted from a PersistentVolumeClaim. You can add new skills by copying files onto the PVC — again, no image rebuild needed.

Notice the lost+found directory? Unlike the ConfigMap, the PVC appears to the container as a real mounted filesystem, which is what makes it simple to add new skills, or iterate on existing ones, without rebuilding the image. You can confirm the mount with df, as shown below.
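oc exec deployment/athena --container athena -- df -h /app/skills

The output should show a dedicated filesystem mounted at /app/skills; the device names and sizes in your namespace will differ.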
  4. Understand the implications:

    • Add a new specialist agent — update the ConfigMap to add a new entry to subagents.yaml

    • Add domain knowledge — copy a new skill directory onto the PVC

    • Improve existing agents — edit the system prompt or skill content

    • Change models — switch an agent to a cheaper or more capable model

    • Roll out changes — restart the pod to pick up the new configuration

    All without writing a single line of Python. Let’s do exactly that.

Exercise 4: Add the sre_package_management agent

Currently, package management failures go to sre_linux — which knows the Linux ecosystem but nothing about Meridian Financial’s Satellite infrastructure, content view model, or escalation paths. You’re going to add a dedicated sre_package_management agent with a skill that encodes this institutional knowledge.

Step 4a: Create the skill directory on the PVC

  1. Create the skill directory:

    oc exec deployment/athena --container athena \
      -- mkdir -p /app/skills/analyze-package-management

Step 4b: Download the skill and copy it to the PVC

  1. Download the package management analysis skill from your Gitea repository and copy it directly onto the PVC:

    curl -sL \
      https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/analyze-package-management-SKILL.md \
      | oc exec --stdin deployment/athena --container athena \
      -- sh -c 'cat > /app/skills/analyze-package-management/SKILL.md'
  2. Verify the skill was created:

    oc exec deployment/athena --container athena \
      -- cat /app/skills/analyze-package-management/SKILL.md \
      | bat --language markdown

    Sample output:

    ---
    name: analyze-package-management-failure
    description: Package management failure analysis with Meridian Financial institutional knowledge
    ---
    
    # Analyze Package Management Failure
    
    Deep analysis of AAP2 job failures caused by package installation, repository, or
    content view issues on RHEL hosts.
    
    ## Institutional Knowledge — Meridian Financial Satellite Infrastructure
    
    **Satellite Infrastructure:**
    - Primary: satellite-primary.meridian.internal (RHEL 9 content)
    - Sync schedule: Tuesdays 02:00 UTC
    
    **Content View Model (per-team):**
    - base-rhel9: All RHEL servers — core OS packages only
    - ops-tooling: SRE team — monitoring, diagnostic tools
    - aiml-workloads: AI/ML team — Python 3.11+, CUDA, data science libs (CRB + EPEL enabled)
    ...

    This is the tribal knowledge that makes this agent smarter than the generic sre_linux: it knows about Meridian’s Satellite servers, which content view belongs to which team, the SOP for requesting content view changes, and the fast-track escalation channel.

Step 4c: Update the ConfigMap with the new agent definition

Here’s what the new sre_package_management agent entry looks like:

sre_package_management:
  description: > (1)
    Package management specialist. Delegate all package installation failures:
    dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
    requirements, repository sync issues, and content view request escalation.
  model: claude-sonnet-4-6 (2)
  system_prompt: | (3)
    You are a senior SRE specializing in Red Hat package management and Satellite
    content delivery. You receive incident data from failed AAP2 jobs and perform
    root-cause analysis on package-related failures.

    Always:
    - Read the incident context (incident.json) first
    - Identify the exact package(s) that failed to install
    - Determine which team's content view is involved (check host group)
    - Reference the analyze-package-management skill for Meridian-specific knowledge
    - Check whether CRB or EPEL is needed and whether the team's content view includes it
    - Provide the SOP reference and escalation path for content view requests

    Use the create-ticket skill to structure your analysis as a TicketPayload.
    Set area to "linux" for all package management issues.
  tools: (4)
    - web_search
  skills: (5)
    - ./skills/analyze-package-management/
    - ./skills/create-ticket/
    - ./skills/common/
1 Description — the ops_manager reads this to decide when to route a failure here. Mentioning "package installation", "dnf/yum", "Satellite", "content view", and "EPEL/CRB" ensures package failures get routed here instead of to sre_linux
2 Model — using claude-sonnet-4-6 for strong reasoning. For a cost-sensitive deployment, you could use a smaller model
3 System prompt — instructs the agent to identify team content view gaps and reference the skill for institutional knowledge
4 Tools — web_search lets the agent look up package availability or upstream documentation if needed
5 Skills — points to the skill you just created, plus the shared create-ticket and common skills

Rather than manually editing YAML in a terminal editor (where a single indentation error would break the configuration), download the complete pre-built subagents.yaml that includes sre_package_management:

  1. Download the updated subagents configuration:

    curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/subagents-with-sre-package-mgmt.yaml \
      -o /tmp/subagents-new.yaml
  2. Verify the download — confirm sre_package_management is present:

    grep "^sre_\|^reviewer" /tmp/subagents-new.yaml

    Expected output:

    sre_ansible:
    sre_linux:
    sre_openshift:
    sre_networking:
    sre_package_management:
    reviewer:

    Notice the new sre_package_management: entry between sre_networking: and reviewer:.

Step 4d: Patch the ConfigMap

You need to update two keys in the ConfigMap: subagents.yaml (to add the new agent) and AGENTS.md (to update ops_manager’s routing rules so it knows to delegate package failures to sre_package_management instead of sre_linux).

  1. Download the updated ops_manager routing rules:

    curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/agents-with-sre-package-mgmt.md \
      -o /tmp/agents-new.md
  2. Verify the routing update — confirm sre_package_management appears in Domain Awareness:

    grep "sre_package_management\|sre_linux" /tmp/agents-new.md

    Expected output:

    - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
    - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)
  3. Apply both files to the ConfigMap in one command:

    oc create configmap athena-agent-config \
      --from-file=subagents.yaml=/tmp/subagents-new.yaml \
      --from-file=AGENTS.md=/tmp/agents-new.md \
      --dry-run=client -o yaml | oc apply -f -

    Expected output:

    configmap/athena-agent-config configured

    You may see a warning about a missing kubectl.kubernetes.io/last-applied-configuration annotation — this is safe to ignore. It only appears the first time, because the ConfigMap was originally created by Helm, not oc apply.

    This command rebuilds the ConfigMap from both downloaded files and applies it in place. Both subagents.yaml (new agent definition) and AGENTS.md (updated routing rules) are replaced in one atomic apply.

Step 4e: Update the error classifier skill on the PVC

The ops_manager uses an error-classifier skill to identify the failure domain before routing. The version on your PVC was initialized from the image at first boot and doesn’t know about the package_management domain yet — so even with the new agent in place, the classifier would still emit ansible or linux for a dnf failure.

Adding a new domain to the classifier is the same as teaching a team lead a new category of problem. You’re telling it: "when you see 'No package X available', that’s a package management issue, not an Ansible issue."

  1. Download the updated error-classifier skill:

    curl -sL \
      https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/error-classifier-with-package-mgmt-SKILL.md \
      | oc exec --stdin deployment/athena --container athena \
      -- sh -c 'cat > /app/skills/error-classifier/SKILL.md'
  2. Verify the classifier now includes the package_management domain:

    oc exec deployment/athena --container athena \
      -- grep "Package Management\|package_management" \
      /app/skills/error-classifier/SKILL.md

    Expected output:

       - **Package Management**: dnf/yum errors, "No match for argument", "No package X available", missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements, subscription-manager errors
       - `domain`: one of ansible, linux, package_management, openshift, networking

Step 4f: Roll out the change

  1. Restart the Athena pod to pick up the new ConfigMap and skill:

    oc rollout restart deployment/athena

    Expected output:

    deployment.apps/athena restarted
  2. Wait for the new pod to come up:

    oc rollout status deployment/athena --timeout=120s

    Expected output:

    Waiting for deployment "athena" rollout to finish: 1 old replicas are pending termination...
    deployment "athena" successfully rolled out

    A rolling restart is one of the most powerful primitives OpenShift and Kubernetes give you. It spins up a new pod with the updated configuration, waits for it to pass health checks, and only then terminates the old one — zero downtime. You just reconfigured an AI operations team in production with two commands and no service interruption.
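    To see the handover yourself, run this in a second terminal during the rollout (the pod names and ages below are hypothetical):

    oc get pods -w

    NAME                      READY   STATUS              RESTARTS   AGE
    athena-7c9f5d8b6-x2k4p    1/1     Running             0          5d
    athena-6b8e4c7a5-q9j3m    0/1     ContainerCreating   0          2s
    athena-6b8e4c7a5-q9j3m    1/1     Running             0          9s
    athena-7c9f5d8b6-x2k4p    1/1     Terminating         0          5d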

Verify the new agent

  1. Confirm the new skill is on the PVC:

    oc exec deployment/athena --container athena \
      -- ls /app/skills/analyze-package-management/

    Expected output:

    SKILL.md
  2. Confirm sre_package_management is in the ConfigMap:

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.subagents\.yaml}' \
      | grep -A3 "sre_package_management"

    Expected output:

    sre_package_management:
      description: >
        Package management specialist. Delegate all package installation failures:
        dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
  3. Confirm ops_manager routing rules include sre_package_management:

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.AGENTS\.md}' \
      | grep "sre_package_management\|sre_linux"

    Expected output:

    - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
    - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)

    Both the agent definition (subagents.yaml) and the routing rules (AGENTS.md) are now updated. The ops_manager will route package management failures to sre_package_management instead of sre_linux.
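    Optionally, start tailing the Athena logs now so you can watch the routing decision happen live during the next exercise (log content varies by framework version, so treat what you see as informative rather than guaranteed):

    oc logs deployment/athena --container athena -f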

Exercise 5: Same job, better specialist

Now let’s trigger the same failure that exposed the gap in Exercise 1.

  1. In the AAP2 tab (log in with user-12345 / deeper-agents if prompted), navigate to Automation Execution → Templates

  2. Launch 10 Install Python 3.14 again by clicking the rocket icon

    Same playbook. Same RHEL host. Same error — No match for argument: python3.14.

  3. Watch the job fail, then wait 1-3 minutes for the pipeline to process it

  4. In the Kira tab (log in with user-12345 / deeper-agents if prompted), look for the new ticket. Pay attention to:

    • Does the analysis identify the AI/ML team’s content view (aiml-workloads) as the source of the gap?

    • Does it reference Satellite (satellite-primary.meridian.internal) and the weekly sync schedule?

    • Does it mention that the aiml-workloads content view requires CRB and EPEL to be enabled for Python 3.14?

    • Does the recommended action cite the Content View Request SOP v2.3 and the #platform-satellite fast-track escalation channel?

    This is the power of Skills — you encoded Meridian Financial’s institutional knowledge into a markdown file, and the agent used it to provide context-aware analysis that goes far beyond generic package troubleshooting.

    Use the AI Chatbot to ask follow-up questions about the analysis
  5. In the Rocket.Chat tab (log in with user-12345 / deeper-agents if prompted), check #support for the notification

Exercise 6: Compare the two tickets side by side

You now have two Kira tickets for the exact same job failure — 10 Install Python 3.14 — processed by two different agents.

  1. In Kira, open both tickets and compare them:

    | First ticket — sre_linux | Second ticket — sre_package_management |
    | --- | --- |
    | Generic package manager advice | Identifies aiml-workloads content view gap |
    | No Satellite context | References satellite-primary.meridian.internal |
    | No SOP reference | Cites Content View Request SOP v2.3 |
    | No team routing | Escalates to #platform-satellite fast-track |
    | No EPEL/CRB awareness | Specifies CRB + EPEL requirements for Python 3.14 |

  2. The infrastructure was identical. The error was identical. The difference was entirely in what the agent knew about how Meridian Financial operates.

That institutional knowledge lives in a markdown file. You added it — and extended an AI operations team — without writing any Python.

Takeaways

  • Athena’s agents and skills are defined in configuration, not code

  • OpenShift provides the extensibility mechanism: ConfigMap for agents, PVC for skills

  • Adding a new specialist agent is a four-step process: create the skill on the PVC, update the ConfigMap, update the error classifier on the PVC, and restart the pod

  • Skills encode institutional knowledge in plain language — anyone who understands the domain can write them

  • The ops_manager automatically routes failures to the right specialist based on the agent’s description field

  • Same failure, same infrastructure, completely different analysis — the difference is domain expertise encoded in a skill

  • You extended an AI operations team without writing any Python