Bonus: Deep Agents Deep Dive

This module is for developers who want to understand what’s inside the AI assistant you’ve been working with. We’ll walk through the architecture of Athena — the deep agent powering your AIOps pipeline — and look at the actual code that makes it work.

No exercises here — just architecture diagrams, real code from the Athena GitHub repository, and the design decisions behind them.

What is a deep agent?

A simple agent is a single LLM invocation with tools. It receives a prompt, reasons about what to do, calls tools, and returns a result — all in one conversation. This works well for focused tasks, but breaks down when the problem requires different types of expertise.

A deep agent separates concerns:

  • A manager agent (ops_manager) runs for the duration of the analysis. It holds the overall context, classifies the problem, and decides who should investigate.

  • Specialist subagents (sre_linux, sre_ansible, sre_openshift, sre_networking) are created on demand, each with their own skills and tools. They do the actual root-cause analysis and return a structured result.

  • Each subagent starts with fresh context — it sees only the incident data and its own domain skills. This prevents cross-contamination: package manager errors don’t confuse the SSH failure analysis.

Deep agent architecture — manager with ephemeral specialists

The ReAct pattern

The manager agent uses a ReAct loop — Reason + Act:

  1. Read the incident data

  2. Reason about the failure domain (using the error-classifier skill)

  3. Act by delegating to the right specialist (using the task() tool)

  4. Review the specialist’s analysis (via the reviewer subagent)

  5. Return the final ticket

This is not a rigid pipeline — the manager reasons at each step and can adjust. If the error classifier is uncertain, the manager notes the low confidence. If the reviewer escalates, the manager upgrades the risk level.
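The loop above can be sketched in a few lines of Python. This is illustrative only (Athena's actual loop lives inside the Deep Agents framework), and the `llm` and `tools` arguments are stand-ins for the real model client and tool registry:

```python
def react_loop(llm, tools, incident, max_steps=5):
    """Minimal ReAct sketch: alternate reasoning (LLM) and acting (tools).

    Illustrative only; not Athena's actual implementation.
    """
    history = [f"Incident: {incident}"]
    for _ in range(max_steps):
        decision = llm(history)           # Reason: the model picks the next move
        if decision["action"] == "finish":
            return decision["answer"]     # Done: return the final ticket
        tool = tools[decision["action"]]  # Act: resolve the chosen tool
        observation = tool(decision["input"])
        history.append(f"Observation: {observation}")
    raise RuntimeError("ReAct loop did not converge")
```

In Athena, the "Act" step is almost always a call to the task() tool, which spins up the chosen specialist subagent rather than invoking a local function.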

Three extension points

Athena is designed to be extended without changing Python code:

  • Skills — Markdown documents loaded as agent context: domain knowledge, diagnostic workflows, SOPs. To extend, add a new SKILL.md file to the skills directory.

  • Tools — Python functions the agent can call at runtime (@tool-decorated functions). To extend, add a new function to tools.py and register it in subagents.yaml.

  • Subagents — Specialist agents defined in YAML, each with its own model, skills, and tools. To extend, add a new entry to subagents.yaml.

Skills are the most accessible of the three — anyone who can write a diagnostic runbook can create a skill. We’ll look at each of these in detail.

The orchestrator — ops_manager

The ops_manager is created by a single function call, and the entire pipeline runs through run_pipeline(). Here’s how the pieces fit together — deterministic Python handles the API plumbing, while the LLM handles reasoning.

Pipeline data flow — blue is deterministic Python and orange is LLM reasoning

Creating the agent

The create_ops_manager() function in pipeline.py wires together the deep agent from its component parts:

def create_ops_manager(settings: Settings):
    """Create the ops_manager Deep Agent configured by filesystem files."""
    return create_deep_agent(
        model=_make_maas_model(                                    (1)
            os.environ.get("OPS_MANAGER_MODEL", "claude-sonnet-4-6")
        ),
        memory=["./AGENTS.md"],                                    (2)
        tools=[],                                                  (3)
        subagents=load_subagents(PROJECT_DIR / "subagents.yaml"),  (4)
        backend=FilesystemBackend(root_dir=PROJECT_DIR),           (5)
    )
1 The LLM model — routed through the MaaS gateway (Red Hat OpenShift AI)
2 The agent’s persona and triage protocol — loaded from AGENTS.md as persistent memory
3 No direct tools — the manager delegates rather than acting directly
4 Specialist subagents loaded from YAML configuration
5 Filesystem backend — agents read incident.json and write artifacts to disk

Notice what’s not in the code: no hardcoded prompts, no agent routing logic, no domain knowledge. The manager’s entire persona comes from AGENTS.md, and the specialists come from subagents.yaml. Change either file and the agent changes — no Python modification required.

Loading subagents from YAML

The load_subagents() function reads subagents.yaml and resolves tool references:

def load_subagents(config_path: Path) -> list[dict]:
    """Load subagent definitions from YAML and resolve tool references."""
    available_tools = {
        "web_search": web_search,                               (1)
    }

    with open(config_path) as f:
        config = yaml.safe_load(f)

    subagents = []
    for name, spec in config.items():
        subagent = {
            "name": name,
            "description": spec["description"],                 (2)
            "system_prompt": spec["system_prompt"],
        }
        if "model" in spec:
            model_name = (
                spec["model"].split(":")[-1]
                if ":" in spec["model"]
                else spec["model"]
            )
            subagent["model"] = _make_maas_model(model_name)    (3)
        if "tools" in spec:
            subagent["tools"] = [                                (4)
                available_tools[t] for t in spec["tools"]
            ]
        if "skills" in spec:
            subagent["skills"] = spec["skills"]                 (5)
        subagents.append(subagent)

    return subagents
1 Tool registry — maps string names in YAML to actual Python functions
2 The description is what the manager reads to decide which specialist to delegate to
3 Each subagent can use a different model — the reviewer uses a smaller, cheaper model
4 Tools are resolved from string names to function references
5 Skills are paths — the Deep Agents framework loads the SKILL.md files at runtime
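An entry that load_subagents() would accept looks like this. The field names match what the loader reads; the model name, prompt text, and prefix convention are illustrative:

```yaml
sre_networking:
  description: >
    Network connectivity specialist. Delegate DNS, SSH, TLS,
    and connection timeout failures here.
  system_prompt: |
    You are a senior network SRE. Read incident.json, diagnose the
    connectivity failure, and produce a TicketPayload.
  model: "maas:claude-sonnet-4-6"    # optional; anything before ":" is stripped
  tools:
    - web_search                     # must exist in available_tools
  skills:
    - ./skills/analyze-networking-failure/
    - ./skills/create-ticket/
    - ./skills/common/
```

Only description and system_prompt are required; model, tools, and skills are each optional, which is why the loader guards every one with an `if ... in spec` check.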

Running the pipeline

The run_pipeline() function ties it all together — write the incident to disk, invoke the agent, parse the result:

async def run_pipeline(
    envelope: IncidentEnvelope, settings: Settings
) -> TicketPayload:
    # Write incident context for agents to read
    incident_path = PROJECT_DIR / "incident.json"
    incident_path.write_text(
        envelope.model_dump_json(indent=2)                      (1)
    )

    agent = create_ops_manager(settings)

    incident_summary = (                                        (2)
        f"A failed AAP2 job requires analysis.\n\n"
        f"Job: {envelope.job.name} (ID: {envelope.job.id})\n"
        f"Template: {envelope.job.template_name}\n"
        f"Error excerpt:\n{envelope.artifacts.error_excerpt}\n\n"
        f"Read incident.json for full context. "
        f"Classify the failure, delegate to the right specialist, "
        f"have the reviewer validate, and return a TicketPayload JSON."
    )

    final_message = None
    async for chunk in agent.astream(                           (3)
        {"messages": [("user", incident_summary)]},
        config={"configurable": {
            "thread_id": f"incident-{envelope.event_id}"
        }},
        stream_mode="values",
    ):
        if "messages" in chunk:
            messages = chunk["messages"]
            if messages:
                last = messages[-1]
                if isinstance(last, AIMessage) and last.content:
                    final_message = last

    ticket_data = _extract_json(final_message.content)          (4)
    ticket = TicketPayload(**ticket_data)                       (5)
    return ticket
1 Pydantic model serialized to JSON on disk — the agent reads this file during analysis
2 The prompt summarizes the incident — but the full context is in incident.json
3 Streaming execution — the agent runs its full ReAct loop (classify → delegate → review)
4 The agent returns natural language with embedded JSON — this extracts the JSON
5 Pydantic validation ensures the output matches the TicketPayload schema

The key design decision here is the hybrid architecture: the pipeline function is deterministic Python. It doesn’t classify, analyze, or make decisions — it just wires things up and validates the output. All reasoning happens inside the agent’s ReAct loop.

Skills — domain knowledge as context

Skills are the most powerful and accessible extension point in the deep agent architecture. A skill is a markdown file that gets loaded into an agent’s context — it becomes part of what the agent "knows."

No code. No SDK. No compilation. If you can write a diagnostic runbook, you can write a skill.

How skills flow from filesystem into agent context

The skills directory

Each skill is a directory containing a SKILL.md file. Here’s how they’re organized in the Athena repository:

skills/
├── error-classifier/SKILL.md          # Routes failures to the right specialist
├── analyze-ansible-failure/SKILL.md   # Ansible/AAP2 diagnostic workflow
├── analyze-linux-failure/SKILL.md     # Linux host-level diagnostics
├── analyze-openshift-failure/SKILL.md # Kubernetes/OpenShift diagnostics
├── analyze-networking-failure/SKILL.md  # Network connectivity diagnostics
├── analyze-package-management/SKILL.md  # Package manager + Satellite SOPs
├── create-ticket/SKILL.md            # TicketPayload output schema
├── review-ticket/SKILL.md            # Quality validation checklist
└── common/
    └── log-analysis/SKILL.md         # Shared log-parsing workflows

Skills are referenced by path in subagents.yaml — each specialist loads only the skills it needs:

sre_linux:
  skills:
    - ./skills/analyze-linux-failure/    # Domain-specific diagnostics
    - ./skills/create-ticket/            # Output schema guidance
    - ./skills/common/                   # Shared log-parsing

The error classifier — the routing brain

The error-classifier skill is what the ops_manager uses to decide which specialist should investigate a failure. It maps error signals to domains:

# Error Classifier

Classify the failure domain from an AAP2 job failure incident.

## Workflow

1. **Read** the error excerpt and stdout from the incident envelope
2. **Scan** for domain-specific signals:
   - **Ansible**: task/role/play references, module errors, collection not found,
     credential failures, jinja2 template errors, variable undefined.
     NOT package manager failures — "No package X available" or dnf/yum
     errors are package_management even when surfaced by an Ansible task
   - **Linux**: systemd unit failures, SELinux denials (avc:),
     permission denied on files, filesystem full/mount errors
   - **Package Management**: dnf/yum errors, "No match for argument",
     "No package X available", missing or disabled repositories,
     Satellite content gaps, CRB/EPEL requirements
   - **OpenShift/Kubernetes**: pod/container/image references,
     CrashLoopBackOff, ImagePullBackOff, RBAC denied, namespace/quota errors
   - **Networking**: DNS resolution failed, connection refused/timeout,
     SSH errors, TLS/SSL certificate errors, proxy errors
3. **Resolve** ambiguity: if signals span multiple domains, identify
   the root cause domain. Example: "Ansible task failed because DNS
   lookup timed out" → networking (not ansible)
4. **Emit** classification:
   - `domain`: one of ansible, linux, package_management, openshift, networking
   - `delegate_to`: the exact subagent name (e.g. sre_linux)
   - `confidence`: 0-100 based on signal strength
   - `rationale`: one sentence explaining why this domain was chosen

This is the full skill — 25 lines of markdown that drive the entire routing decision. The ambiguity resolution rule in step 3 is critical: when an Ansible task fails because a package is missing, the root cause is package management, not Ansible. This is the kind of institutional knowledge that would otherwise live in a senior engineer’s head.
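Applied to the DNS-timeout example in step 3, the emitted classification might look like the following (the exact output format and values are illustrative; the fields are the ones the skill defines):

```json
{
  "domain": "networking",
  "delegate_to": "sre_networking",
  "confidence": 85,
  "rationale": "The Ansible task failed only because the DNS lookup timed out, so the root cause is networking."
}
```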

A domain skill — institutional knowledge as markdown

The richest example is the analyze-package-management skill. It encodes Meridian Financial’s Satellite infrastructure, content view architecture, and SOPs — the kind of knowledge a new hire would take weeks to absorb:

# analyze-package-management

You are analyzing a package management failure on a Red Hat Enterprise
Linux host managed by Meridian Financial's Ansible Automation Platform.

## Diagnostic Workflow

1. Read `incident.json` — identify the exact package name and the DNF
   error string
2. Classify the error:
   - `No match for argument: <package>` → package absent from all
     enabled repos/content views
   - `Repository 'X' is disabled` → repo exists in Satellite but not
     enabled in this host's activation key
   - `Failed to download metadata for repo` → Satellite reachability
     or content sync issue
3. Determine whether the package requires CRB, EPEL, or a custom
   Satellite content view
4. Select the appropriate SOP path

## Meridian Financial Satellite Infrastructure

- **Primary:** `satellite-primary.meridian.internal` (London DC)
- **Replica:** `satellite-replica.meridian.internal` (Dublin DC)

### Content View Architecture

Content views are **per-team, per-project** — not global:

- `cv-platform-rhel9` — Platform team (includes CRB, internal tooling)
- `cv-security-rhel9` — Security team (includes CRB, SCAP tools)
- `cv-base-rhel9` — Default for most RHEL VMs (no CRB, no EPEL)

**CRB note:** CodeReady Builder is included in `cv-platform-rhel9`
and `cv-security-rhel9` but **not** in `cv-base-rhel9`. Most RHEL VMs
are registered to `cv-base-rhel9`, which is why packages requiring CRB
(like `python3.14`) fail.

## Common Failure Patterns

| DNF Error | Root Cause | Required Action |
|-----------|------------|-----------------|
| `No match for argument: python3.14` | Not in cv-base-rhel9; requires CRB + EPEL | Request new content view (SOP v2.3) |
| `Repository 'epel' is disabled` | EPEL not in activation key | Content view update request |
| Package in Dev but not Prod | Content view promoted to Dev only | Promote to target lifecycle environment |

## SOP v2.3 — New Content View Request

1. Raise ticket to Platform team: "New Content View Request"
2. Include: project name, target lifecycle, package list, CRB/EPEL
   requirements, business justification
3. Standard SLA: 2 business days

This skill gives the agent the same knowledge a Meridian Financial SRE would have after months on the team: which Satellite servers exist, how content views are structured, why CRB packages fail on default VMs, and exactly which SOP to follow.

Why skills matter

Skills are "just markdown" — but that simplicity is the point:

  • Accessible: Anyone who can write a diagnostic runbook can create a skill. No Python, no SDK, no build step.

  • Institutional knowledge: Skills encode the expertise that otherwise lives in someone’s head or scattered wiki pages — infrastructure-specific details, SOPs, failure pattern tables.

  • Runtime extensible: Skills live on a PersistentVolumeClaim in OpenShift. You can add, modify, or remove skills without rebuilding the container image or redeploying the application.

  • Composable: Each subagent loads a combination of skills — a domain-specific analysis skill, the shared create-ticket schema, and common log-parsing workflows. Different specialists get different knowledge.

In Module 3, you added a new specialist by creating a skill and a subagent entry. That’s the pattern: write the knowledge, wire the configuration, and the agent learns a new domain.

Tools — giving agents hands

Skills give agents knowledge. Tools give agents capabilities — the ability to take actions at runtime that go beyond what’s in their context.

In Athena, the tool surface is deliberately tiny. There is exactly one tool: web_search.

import os
from typing import Literal

from langchain_core.tools import tool


@tool                                                          (1)
def web_search(
    query: str,                                                (2)
    max_results: int = 5,
    topic: Literal["general", "news"] = "general",
) -> dict:
    """Search the web for documentation, CVEs, or                (3)
    troubleshooting guides.

    Args:
        query: Specific search query (be detailed)
        max_results: Number of results to return (default: 5)
        topic: "general" for docs/guides,
               "news" for recent incidents/CVEs

    Returns:
        Search results with titles, URLs, and content excerpts.
    """
    try:
        from tavily import TavilyClient

        api_key = os.environ.get("TAVILY_API_KEY")
        if not api_key:
            return {
                "error": "TAVILY_API_KEY not set — "
                         "web search unavailable"
            }

        client = TavilyClient(api_key=api_key)
        return client.search(
            query, max_results=max_results, topic=topic
        )
    except Exception as e:
        return {"error": f"Search failed: {e}"}
1 The @tool decorator from LangChain — it extracts the function signature and docstring to create a tool schema that the LLM can call
2 Typed parameters become the tool’s input schema — the LLM sees these types and descriptions
3 The docstring becomes the tool’s description — this is what the agent reads to decide when to use the tool
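To make the extraction concrete, here is a toy stand-in built on the standard library. This is not LangChain's actual implementation, and describe_tool is a hypothetical helper; it only illustrates the principle that the signature becomes the input schema and the docstring becomes the description:

```python
import inspect


def describe_tool(fn):
    """Toy version of what a @tool-style decorator derives from a function."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        # First docstring line becomes the tool description
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        # Parameter annotations become the input schema
        "args": {
            name: getattr(p.annotation, "__name__", str(p.annotation))
            for name, p in sig.parameters.items()
        },
    }


def web_search(query: str, max_results: int = 5) -> dict:
    """Search the web for documentation, CVEs, or troubleshooting guides."""
    return {}


schema = describe_tool(web_search)
```

The real decorator does much more (Pydantic argument schemas, async support, error handling), but the principle is the same: signature plus docstring in, callable tool schema out.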

How tools are wired

Tools are registered in pipeline.py by name and referenced in subagents.yaml:

# In pipeline.py
available_tools = {
    "web_search": web_search,    # String name → function reference
}

# In subagents.yaml
sre_linux:
  tools:
    - web_search                 # References the string name above

The design decision: one tool. Domain knowledge lives in skills, not tools. A tool is for capabilities the agent can’t get from its static context — searching the web for CVEs, checking documentation for a specific error code. Most of the analysis work uses the knowledge already loaded via skills.
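One sharp edge in this wiring: the available_tools[t] lookup raises a bare KeyError when subagents.yaml names a tool that was never registered. A hardened lookup (resolve_tools is a hypothetical helper, not in the Athena codebase) would fail with a clearer message:

```python
def resolve_tools(names, available_tools):
    """Map tool names from YAML to functions, failing loudly on typos."""
    missing = [n for n in names if n not in available_tools]
    if missing:
        raise ValueError(
            f"Unknown tools in subagents.yaml: {missing}; "
            f"registered tools are {sorted(available_tools)}"
        )
    return [available_tools[n] for n in names]
```

Failing at load time with the list of registered names turns a cryptic traceback into a one-line config fix.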

Beyond tools — MCP

The @tool pattern works well for tools that are part of your application — you write the function, decorate it, and the agent can call it. But what about tools that live outside your application?

The Model Context Protocol (MCP) is a standard for agents to discover and use tools across system boundaries. Think of it like USB for AI agents — a universal connector that lets an agent talk to any MCP-compatible tool server.

With @tool, the tool function is bundled into the application at build time. With MCP, an agent connects to a tool server at runtime and discovers what capabilities are available — a database query server, a monitoring API, a ticketing system. The agent doesn’t need to know about these tools in advance; it discovers them dynamically.

Athena doesn’t use MCP today — its single web_search tool is bundled in. But the pattern is compatible: the agent’s tool interface is the same whether the tool comes from a local function or an MCP server. As the agent’s capabilities grow beyond web search, MCP provides the architecture to connect to external services without bundling everything into one application.

Explore further

To dig deeper into the code and concepts covered here:

The Athena repository

Start with these files:

  • athena/agents/pipeline.py — the orchestration core: create_ops_manager(), run_pipeline(), load_subagents()

  • AGENTS.md — the ops_manager’s persona and triage protocol; this is the agent’s "memory"

  • subagents.yaml — all specialist definitions: descriptions, models, skills, tools

  • skills/ — every domain skill; start with error-classifier/ to understand routing, then explore the analysis skills

  • athena/models.py — Pydantic V2 data models: IncidentEnvelope (input) and TicketPayload (output)

Reference reading