Module 3: Understanding your Agentic AIOps Team

In Module 2, you watched the Deep Agents pipeline analyze a failure. Now you’re going to look under the hood — understand the architecture of your AI operations team, explore how agents and skills are defined, and then build and deploy a new specialist agent yourself.

But first — what kind of agent architecture are we using, and why does it matter?

From First to Second Generation Agents

First generation AI agents — the classic ReAct (Reason + Act) pattern — work well for short, focused tasks: answer a question, call an API, summarize a document. But they struggle with complex, multi-step operations like incident investigation. Their context is limited to conversation history, which overflows on long tasks. As the context window fills with increasingly irrelevant history, agents enter what researchers call the "dumb zone" — the point where an agent with a 200K token context window starts acting like it has a 20K window, because most of its attention is wasted on noise. Researchers have identified five specific failure modes:

  • Context distraction — beyond ~100K tokens, agents start repeating actions from their history rather than synthesizing new plans

  • Context confusion — too many tools or too much irrelevant information overwhelms the model. One study found an 8B model failed with 46 tools but succeeded with just 19

  • Context poisoning — early errors or hallucinations become embedded in the history and repeatedly referenced

  • Context clash — conflicting information across conversation turns causes contradictory assumptions, with performance dropping up to 39%

  • Lost goals — with ~50+ tool calls, agents drift off-topic or forget earlier objectives as the original task gets buried

LangChain’s Deep Agents framework represents a second generation of agent architecture — sometimes called Agent 2.0. Instead of a single agent with a long, degrading conversation, a Deep Agent uses four architectural pillars:

  • Explicit planning — to-do lists as persistent context anchors that prevent goal drift

  • Persistent filesystem — a virtual workspace for saving notes, incident data, and structured artifacts outside the conversation

  • Sub-agent spawning — delegating complex subtasks to specialists, each with a clean context window

  • Context isolation — each specialist sees only what it needs, preventing cross-contamination between unrelated analyses

Agentic Harnesses

Deep Agents is one example of an agentic harness — a framework that wraps and drives an LLM rather than letting it run unsupervised. Other harnesses include Claude Code (Anthropic’s coding agent) and the Claude Agent SDK (for building custom agents). What they share: the harness manages context, tools, delegation, and output parsing. A well-designed harness makes smaller models perform like larger ones — which is why you can switch to open source models in Module 4 without losing quality.

| Dimension | First Generation (ReAct Only) | Second Generation (ReAct + Harness) |
| --- | --- | --- |
| Best for | Short, focused tasks | Multi-step projects |
| Context | Conversation history only | Filesystem + sub-agents |
| Goal persistence | Degrades over time | Maintained via planning |
| Context isolation | None — everything in one window | Each sub-agent gets a clean context |

Athena uses all four pillars. The ops_manager maintains an explicit plan (the triage protocol in AGENTS.md). Skills and incident data live on the filesystem. Specialist subagents are spawned on demand and destroyed after use. And each subagent operates in context isolation — sre_linux never sees the networking error that sre_networking is investigating.

Meet Athena — your AIOps team

Athena is not a single AI agent. It’s a team of specialized agents, orchestrated by a manager, each with its own expertise, tools, and domain knowledge.

Athena team architecture — ops_manager delegates to specialist subagents

The ops_manager — your team lead

At the center is ops_manager, a long-running Deep Agent that uses the ReAct (Reason + Act) pattern. When a failure arrives via webhook, ops_manager:

  1. Reads the incident data (job output, error details, metadata)

  2. Reasons about which domain this failure belongs to

  3. Acts by delegating to the right specialist via the task tool

  4. Reviews the specialist’s work via the reviewer agent

  5. Returns the final ticket

The ops_manager stays running across the entire analysis — it’s the orchestrator that holds the conversation together.

The sre_ subagents — your specialists

The four sre_* subagents are ephemeral — they’re created on demand when ops_manager calls task(), and discarded after they return their analysis. Each one is a focused expert:

| Subagent | Expertise |
| --- | --- |
| sre_ansible | Playbook errors, credential issues, execution environments, variable resolution |
| sre_linux | Systemd services, SELinux denials, filesystem/permissions |
| sre_openshift | Pod lifecycle, RBAC, operators, namespace/quota, security contexts |
| sre_networking | DNS, SSH connectivity, proxy/TLS, routing, firewall |

Why ephemeral subagents matter — context isolation

This is a critical architectural decision. Consider what would happen if a single agent handled everything:

  • ops_manager analyzes a package failure — 3000 tokens of dnf output fills the context

  • Next, an SSH failure arrives — the package output is still in context, confusing the analysis

  • By the fifth failure, the context is bloated with irrelevant data from previous investigations

Ephemeral subagents solve this. Each sre_* agent starts with a fresh context — only the incident data and its domain-specific skills. When it finishes, its context is discarded. The ops_manager only sees the final TicketPayload result, not the specialist’s internal reasoning. This keeps every analysis clean and focused.

Skills — domain knowledge without code

Each subagent is loaded with skills — markdown documents that provide domain-specific diagnostic workflows, institutional knowledge, and output formatting guidance. Skills are the reference library that turns a general-purpose LLM into a domain expert.
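The format is simple: YAML frontmatter plus plain-language instructions. A minimal skeleton might look like this (the section names here are illustrative; the real skills you'll browse in Exercise 2 are the reference):

---
name: analyze-example-failure
description: One-line summary the agent uses to decide when this skill applies
---

# Analyze Example Failure

## Diagnostic workflow
1. Read the incident data; identify the failing task and the exact error message.
2. Work through the likely causes for this domain, most common first.

## Institutional knowledge
- Internal server names, content views, SOPs, and escalation channels live here.

## Output
Structure the findings as a TicketPayload, following the create-ticket skill.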

Tools — how agents take action

Agents can call tools — functions that let them interact with external systems. In Athena, the key tool is task() which lets ops_manager delegate to subagents. Subagents also have web_search for looking up documentation.
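Conceptually, each delegation is a single tool call. A hypothetical invocation, rendered as YAML, might look like this (the argument names follow the Deep Agents task tool convention, but treat the exact shape as illustrative):

tool: task
args:
  subagent_type: sre_networking    # which specialist to spawn
  description: >                   # the subtask, stated in full
    Analyze the SSH connectivity failure captured in incident.json
    and return a structured TicketPayload with root cause and remediation.

The spawned subagent runs with only this description and its own skills in context, then returns a single result to ops_manager.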

Models — the reasoning engine

Each agent can use a different model from MaaS. The ops_manager might use a frontier model for complex reasoning, while subagents use a smaller, cheaper model for focused domain analysis. The reviewer can use an even lighter model since it only validates, not analyzes.
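In configuration terms, that tiering is just each agent's model field. A hypothetical mix (the smaller model names are placeholders, not necessarily what your lab uses):

ops_manager:
  model: claude-sonnet-4-6       # frontier model: cross-domain reasoning and routing
sre_networking:
  model: small-efficient-model   # placeholder: focused single-domain analysis
reviewer:
  model: lightweight-model       # placeholder: validates ticket structure only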

In the next exercises, you’ll see a gap in the current team’s capabilities, explore the architecture, add a specialist, and then watch the same failure get handled far better.

Exercise 1: See the gap

The AI/ML team at Meridian Financial has requested Python 3.14 on their RHEL servers. This job ran automatically when your environment was provisioned — so the ticket is already waiting for you in Kira.

The job failed with this error:

TASK [Install Python 3.14] ***
fatal: [rhel-node-01]: FAILED! => {"changed": false,
  "failures": ["No match for argument: python3.14"],
  "msg": "Failed to install some of the specified packages",
  "rc": 1, "results": []}

Step 1: Read the Kira ticket

  1. In the Kira tab (log in with user-12345 / deeper-agents if prompted), find the open ticket for 10 Install Python 3.14

  2. Read the ticket carefully. Notice what it doesn’t contain:

    • No mention of Meridian Financial’s Satellite servers or the specific content view model used by different teams

    • No reference to the Content View Request SOP (Meridian Standard Operating Procedure v2.3)

    • No awareness that the AI/ML team needs CRB (CodeReady Builder) and EPEL to be enabled in their content view

    • No suggestion of the #platform-satellite fast-track escalation path

    sre_linux gave a generic "check your package manager and repositories" answer. Technically accurate — but useless to anyone at Meridian. It doesn’t know how your organization works.

    This is the gap a specialist agent closes. Let’s build one.

Exercise 2: Explore the Athena architecture

Before adding a new agent, let’s understand what’s already there.

  1. Click the Gitea tab — no login required, the Athena repo is public

    Navigate to the athena-aiops-deep-agent repository. This is the source code for the Deep Agent running in your namespace.

  2. Browse to subagents.yaml in the repository root

    This is the file that defines every specialist agent in the system:

    | Agent | Domain |
    | --- | --- |
    | sre_ansible | Playbook syntax, role/collection errors, credential issues, execution environments |
    | sre_linux | Systemd services, SELinux denials, filesystem/permissions |
    | sre_openshift | Pod lifecycle, image pull, RBAC, operators, namespace/quota |
    | sre_networking | DNS, SSH connectivity, proxy/TLS, routing, firewall |
    | reviewer | Quality validation — checks ticket coherence before submission |

    Each agent has the same structure: description, model, system_prompt, tools, and skills. The ops_manager reads the description field to decide which agent to route a failure to.
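    In outline, an entry looks like this (a skeleton, not a literal excerpt from the file):

    agent_name:
      description: >          # ops_manager reads this to decide routing
        One or two sentences describing when to delegate here.
      model: model-name       # which MaaS model this agent runs on
      system_prompt: |        # the agent's role and standing instructions
        You are a senior SRE specializing in ...
      tools:                  # callable functions, e.g. web_search
        - web_search
      skills:                 # paths to SKILL.md directories
        - ./skills/some-skill/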

  3. Browse to the skills/ directory

    Each subdirectory contains a SKILL.md file — a markdown document that gets loaded into the agent’s context as reference knowledge. For example, skills/analyze-linux-failure/SKILL.md contains a structured diagnostic workflow.

    Skills are not code — they’re instructions written in plain language. Anyone who understands the domain can write or improve a skill without touching Python.

  4. Optionally, browse the Helm chart at deploy/helm/athena/templates/

    The key files are configmap.yaml (agent config), pvc.yaml (skills volume), and deployment.yaml (which mounts both and includes an initContainer that pre-populates the skills PVC from the image on first boot).
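    If you're curious how these pieces connect, here is a simplified sketch of the relevant deployment.yaml stanzas (names, image references, and mount paths are illustrative; the actual template in the repo is authoritative):

    spec:
      template:
        spec:
          initContainers:
            - name: seed-skills                 # illustrative: copies bundled skills onto the PVC on first boot
              image: athena:latest              # illustrative image reference
              command: ["sh", "-c", "cp -rn /opt/skills-seed/. /app/skills/"]   # assumed copy logic
              volumeMounts:
                - name: skills
                  mountPath: /app/skills
          containers:
            - name: athena
              volumeMounts:
                - name: agent-config            # ConfigMap keys appear as files here
                  mountPath: /app/config        # assumed mount path
                - name: skills                  # PVC-backed, writable at runtime
                  mountPath: /app/skills
          volumes:
            - name: agent-config
              configMap:
                name: athena-agent-config
            - name: skills
              persistentVolumeClaim:
                claimName: athena-skills        # illustrative claim name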

Exercise 3: Inspect the extensibility layer

Here’s where it gets interesting. In your namespace, Athena’s agent configuration and skills are not baked into the container image. They’re mounted from Kubernetes resources that you can modify at runtime.

  1. Click the Terminal tab and log in to your OpenShift namespace:

    oc login --insecure-skip-tls-verify \
      -u user-12345 \
      -p deeper-agents \
      https://openshift.example.com:6443 \
      --namespace user-12345-agentic

    Sample output:

    Login successful.
    Using project "user-12345-agentic".
  2. Inspect the ConfigMap that holds the agent definitions:

    oc get configmap athena-agent-config -o yaml | bat --language yaml

    Sample output (your results may vary):

    apiVersion: v1
    data:
      AGENTS.md: |
        # Athena Ops Manager
        ...
      subagents.yaml: |
        sre_ansible:
          description: >
            Ansible/AAP2 specialist...
        ...
    kind: ConfigMap
    metadata:
      name: athena-agent-config
      ...

    The ConfigMap stores two files as keys in data:. Each key becomes a file when mounted into the pod. This means you can add a new agent by updating this ConfigMap — no image rebuild needed.

    Scroll through the full output — you'll see the complete AGENTS.md and subagents.yaml contents.
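    To extract a single key instead of the whole object, use a jsonpath query, the same pattern the verification steps at the end of Exercise 4 use (note the escaped dot in the key name):

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.subagents\.yaml}' | head -20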
  3. Inspect the PVC that holds the skills:

    oc exec deployment/athena --container athena \
      -- ls /app/skills/

    Sample output:

    analyze-ansible-failure
    analyze-linux-failure
    analyze-networking-failure
    analyze-openshift-failure
    common
    create-ticket
    error-classifier
    lost+found
    review-ticket

    The skills/ directory is mounted from a PersistentVolumeClaim. You can add new skills by copying files onto the PVC — again, no image rebuild needed.

Notice the lost+found directory? Unlike the ConfigMap, the PVC appears to the container as a real mounted filesystem, which is what makes it simple to add new skills, or iterate on existing ones, without rebuilding the image. You can confirm the mount with df, as shown below.
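oc exec deployment/athena --container athena -- df -h /app/skills

The output should show a dedicated filesystem mounted at /app/skills; the device names and sizes in your namespace will differ.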
  4. Understand the implications:

    • Add a new specialist agent — update the ConfigMap to add a new entry to subagents.yaml

    • Add domain knowledge — copy a new skill directory onto the PVC

    • Improve existing agents — edit the system prompt or skill content

    • Change models — switch an agent to a cheaper or more capable model

    • Roll out changes — restart the pod to pick up the new configuration

    All without writing a single line of Python. Let’s do exactly that.

Exercise 4: Add the sre_package_management agent

Currently, package management failures go to sre_linux — which knows the Linux ecosystem but nothing about Meridian Financial’s Satellite infrastructure, content view model, or escalation paths. You’re going to add a dedicated sre_package_management agent with a skill that encodes this institutional knowledge.

Step 4a: Create the skill directory on the PVC

  1. Create the skill directory:

    oc exec deployment/athena --container athena \
      -- mkdir -p /app/skills/analyze-package-management

Step 4b: Download the skill and copy it to the PVC

  1. Download the package management analysis skill from your Gitea repository and copy it directly onto the PVC:

    curl -sL \
      https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/analyze-package-management-SKILL.md \
      | oc exec --stdin deployment/athena --container athena \
      -- sh -c 'cat > /app/skills/analyze-package-management/SKILL.md'
  2. Verify the skill was created:

    oc exec deployment/athena --container athena \
      -- cat /app/skills/analyze-package-management/SKILL.md \
      | bat --language markdown

    Sample output:

    ---
    name: analyze-package-management-failure
    description: Package management failure analysis with Meridian Financial institutional knowledge
    ---
    
    # Analyze Package Management Failure
    
    Deep analysis of AAP2 job failures caused by package installation, repository, or
    content view issues on RHEL hosts.
    
    ## Institutional Knowledge — Meridian Financial Satellite Infrastructure
    
    **Satellite Infrastructure:**
    - Primary: satellite-primary.meridian.internal (RHEL 9 content)
    - Sync schedule: Tuesdays 02:00 UTC
    
    **Content View Model (per-team):**
    - base-rhel9: All RHEL servers — core OS packages only
    - ops-tooling: SRE team — monitoring, diagnostic tools
    - aiml-workloads: AI/ML team — Python 3.11+, CUDA, data science libs (CRB + EPEL enabled)
    ...

    This is the tribal knowledge that makes this agent smarter than the generic sre_linux: it knows about Meridian’s Satellite servers, which content view belongs to which team, the SOP for requesting content view changes, and the fast-track escalation channel.

Step 4c: Update the ConfigMap with the new agent definition

Here’s what the new sre_package_management agent entry looks like:

sre_package_management:
  description: > (1)
    Package management specialist. Delegate all package installation failures:
    dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
    requirements, repository sync issues, and content view request escalation.
  model: claude-sonnet-4-6 (2)
  system_prompt: | (3)
    You are a senior SRE specializing in Red Hat package management and Satellite
    content delivery. You receive incident data from failed AAP2 jobs and perform
    root-cause analysis on package-related failures.

    Always:
    - Read the incident context (incident.json) first
    - Identify the exact package(s) that failed to install
    - Determine which team's content view is involved (check host group)
    - Reference the analyze-package-management skill for Meridian-specific knowledge
    - Check whether CRB or EPEL is needed and whether the team's content view includes it
    - Provide the SOP reference and escalation path for content view requests

    Use the create-ticket skill to structure your analysis as a TicketPayload.
    Set area to "linux" for all package management issues.
  tools: (4)
    - web_search
  skills: (5)
    - ./skills/analyze-package-management/
    - ./skills/create-ticket/
    - ./skills/common/
1 Description — the ops_manager reads this to decide when to route a failure here. Mentioning "package installation", "dnf/yum", "Satellite", "content view", and "EPEL/CRB" ensures package failures get routed here instead of to sre_linux
2 Model — using claude-sonnet-4-6 for strong reasoning. For a cost-sensitive deployment, you could use a smaller model
3 System prompt — instructs the agent to identify team content view gaps and reference the skill for institutional knowledge
4 Tools — web_search lets the agent look up package availability or upstream documentation if needed
5 Skills — points to the skill you just created, plus the shared create-ticket and common skills

Rather than manually editing YAML in a terminal editor (where a single indentation error would break the configuration), download the complete pre-built subagents.yaml that includes sre_package_management:

  1. Download the updated subagents configuration:

    curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/subagents-with-sre-package-mgmt.yaml \
      -o /tmp/subagents-new.yaml
  2. Verify the download — confirm sre_package_management is present:

    grep "^sre_\|^reviewer" /tmp/subagents-new.yaml

    Expected output:

    sre_ansible:
    sre_linux:
    sre_openshift:
    sre_networking:
    sre_package_management:
    reviewer:

    Notice the new sre_package_management: entry between sre_networking: and reviewer:.

Step 4d: Patch the ConfigMap

You need to update two keys in the ConfigMap: subagents.yaml (to add the new agent) and AGENTS.md (to update ops_manager’s routing rules so it knows to delegate package failures to sre_package_management instead of sre_linux).

  1. Download the updated ops_manager routing rules:

    curl -sL https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/agents-with-sre-package-mgmt.md \
      -o /tmp/agents-new.md
  2. Verify the routing update — confirm sre_package_management appears in Domain Awareness:

    grep "sre_package_management\|sre_linux" /tmp/agents-new.md

    Expected output:

    - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
    - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)
  3. Apply both files to the ConfigMap in one command:

    oc create configmap athena-agent-config \
      --from-file=subagents.yaml=/tmp/subagents-new.yaml \
      --from-file=AGENTS.md=/tmp/agents-new.md \
      --dry-run=client -o yaml | oc apply -f -

    Expected output:

    configmap/athena-agent-config configured

    You may see a warning about a missing kubectl.kubernetes.io/last-applied-configuration annotation — this is safe to ignore. It only appears the first time, because the ConfigMap was originally created by Helm, not oc apply.

    This command rebuilds the ConfigMap from both downloaded files and applies it in place. Both subagents.yaml (new agent definition) and AGENTS.md (updated routing rules) are replaced in one atomic apply.

Step 4e: Update the error classifier skill on the PVC

The ops_manager uses an error-classifier skill to identify the failure domain before routing. The version on your PVC was initialized from the image at first boot and doesn’t know about the package_management domain yet — so even with the new agent in place, the classifier would still emit ansible or linux for a dnf failure.

Adding a new domain to the classifier is the same as teaching a team lead a new category of problem. You’re telling it: "when you see 'No package X available', that’s a package management issue, not an Ansible issue."

  1. Download the updated error-classifier skill:

    curl -sL \
      https://gitea.apps.cluster-GUID.opentlc.com/user-12345/agentic-devops-plays/raw/branch/main/configs/error-classifier-with-package-mgmt-SKILL.md \
      | oc exec --stdin deployment/athena --container athena \
      -- sh -c 'cat > /app/skills/error-classifier/SKILL.md'
  2. Verify the classifier now includes the package_management domain:

    oc exec deployment/athena --container athena \
      -- grep "Package Management\|package_management" \
      /app/skills/error-classifier/SKILL.md

    Expected output:

       - **Package Management**: dnf/yum errors, "No match for argument", "No package X available", missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements, subscription-manager errors
       - `domain`: one of ansible, linux, package_management, openshift, networking

Step 4f: Roll out the change

  1. Restart the Athena pod to pick up the new ConfigMap and skill:

    oc rollout restart deployment/athena

    Expected output:

    deployment.apps/athena restarted
  2. Wait for the new pod to come up:

    oc rollout status deployment/athena --timeout=120s

    Expected output:

    Waiting for deployment "athena" rollout to finish: 1 old replicas are pending termination...
    deployment "athena" successfully rolled out

    A rolling restart is one of the most powerful primitives OpenShift and Kubernetes give you. It spins up a new pod with the updated configuration, waits for it to pass health checks, and only then terminates the old one — zero downtime. You just reconfigured an AI operations team in production with two commands and no service interruption.
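    To see the handover yourself, run this in a second terminal during the rollout (the pod names and ages below are hypothetical):

    oc get pods -w

    NAME                      READY   STATUS              RESTARTS   AGE
    athena-7c9f5d8b6-x2k4p    1/1     Running             0          5d
    athena-6b8e4c7a5-q9j3m    0/1     ContainerCreating   0          2s
    athena-6b8e4c7a5-q9j3m    1/1     Running             0          9s
    athena-7c9f5d8b6-x2k4p    1/1     Terminating         0          5d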

Verify the new agent

  1. Confirm the new skill is on the PVC:

    oc exec deployment/athena --container athena \
      -- ls /app/skills/analyze-package-management/

    Expected output:

    SKILL.md
  2. Confirm sre_package_management is in the ConfigMap:

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.subagents\.yaml}' \
      | grep -A3 "sre_package_management"

    Expected output:

    sre_package_management:
      description: >
        Package management specialist. Delegate all package installation failures:
        dnf/yum errors, missing packages, Satellite content view gaps, EPEL/CRB
  3. Confirm ops_manager routing rules include sre_package_management:

    oc get configmap athena-agent-config \
      -o jsonpath='{.data.AGENTS\.md}' \
      | grep "sre_package_management\|sre_linux"

    Expected output:

    - **sre_linux**: Systemd services, SELinux, filesystem/permissions (NOT package manager or Satellite — those go to sre_package_management if available)
    - **sre_package_management**: DNF/YUM errors, missing or disabled repositories, Satellite content gaps, CRB/EPEL requirements (when available)

    Both the agent definition (subagents.yaml) and the routing rules (AGENTS.md) are now updated. The ops_manager will route package management failures to sre_package_management instead of sre_linux.
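    Optionally, start tailing the Athena logs now so you can watch the routing decision happen live during the next exercise (log content varies by framework version, so treat what you see as informative rather than guaranteed):

    oc logs deployment/athena --container athena -f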

Exercise 5: Same job, better specialist

Now let’s trigger the same failure that exposed the gap in Exercise 1.

  1. In the AAP2 tab (log in with user-12345 / deeper-agents if prompted), navigate to Automation Execution → Templates

  2. Launch 10 Install Python 3.14 again by clicking the rocket icon

    Same playbook. Same RHEL host. Same error — No match for argument: python3.14.

  3. Watch the job fail, then wait 1-3 minutes for the pipeline to process it

  4. In the Kira tab (log in with user-12345 / deeper-agents if prompted), look for the new ticket. Pay attention to:

    • Does the analysis identify the AI/ML team’s content view (aiml-workloads) as the source of the gap?

    • Does it reference Satellite (satellite-primary.meridian.internal) and the weekly sync schedule?

    • Does it mention that the aiml-workloads content view requires CRB and EPEL to be enabled for Python 3.14?

    • Does the recommended action cite the Content View Request SOP v2.3 and the #platform-satellite fast-track escalation channel?

    This is the power of Skills — you encoded Meridian Financial’s institutional knowledge into a markdown file, and the agent used it to provide context-aware analysis that goes far beyond generic package troubleshooting.

    Use the AI Chatbot to ask follow-up questions about the analysis
  5. In the Rocket.Chat tab (log in with user-12345 / deeper-agents if prompted), check #support for the notification

Exercise 6: Compare the two tickets side by side

You now have two Kira tickets for the exact same job failure — 10 Install Python 3.14 — processed by two different agents.

  1. In Kira, open both tickets and compare them:

    | First ticket — sre_linux | Second ticket — sre_package_management |
    | --- | --- |
    | Generic package manager advice | Identifies aiml-workloads content view gap |
    | No Satellite context | References satellite-primary.meridian.internal |
    | No SOP reference | Cites Content View Request SOP v2.3 |
    | No team routing | Escalates to #platform-satellite fast-track |
    | No EPEL/CRB awareness | Specifies CRB + EPEL requirements for Python 3.14 |

  2. The infrastructure was identical. The error was identical. The difference was entirely in what the agent knew about how Meridian Financial operates.

That institutional knowledge lives in a markdown file. You added it — and extended an AI operations team — without writing any Python.

Takeaways

  • Athena’s agents and skills are defined in configuration, not code

  • OpenShift provides the extensibility mechanism: ConfigMap for agents, PVC for skills

  • Adding a new specialist agent is a four-step process: create the skill on the PVC, update the ConfigMap, update the error classifier on the PVC, and restart the pod

  • Skills encode institutional knowledge in plain language — anyone who understands the domain can write them

  • The ops_manager automatically routes failures to the right specialist based on the agent’s description field

  • Same failure, same infrastructure, completely different analysis — the difference is domain expertise encoded in a skill

  • You extended an AI operations team without writing any Python