Bonus: Deep Agents Deep Dive
This module is for developers who want to understand what’s inside the AI assistant you’ve been working with. We’ll walk through the architecture of Athena — the deep agent powering your AIOps pipeline — and look at the actual code that makes it work.
No exercises here — just architecture diagrams, real code from the Athena GitHub repository, and the design decisions behind them.
What is a deep agent?
A simple agent is a single LLM invocation with tools. It receives a prompt, reasons about what to do, calls tools, and returns a result — all in one conversation. This works well for focused tasks, but breaks down when the problem requires different types of expertise.
A deep agent separates concerns:
- A manager agent (ops_manager) runs for the duration of the analysis. It holds the overall context, classifies the problem, and decides who should investigate.
- Specialist subagents (sre_linux, sre_ansible, sre_openshift, sre_networking) are created on demand, each with their own skills and tools. They do the actual root-cause analysis and return a structured result.
- Each subagent starts with fresh context — it sees only the incident data and its own domain skills. This prevents cross-contamination: package manager errors don’t confuse the SSH failure analysis.
The ReAct pattern
The manager agent uses a ReAct loop — Reason + Act:
1. Read the incident data
2. Reason about the failure domain (using the error-classifier skill)
3. Act by delegating to the right specialist (using the task() tool)
4. Review the specialist’s analysis (via the reviewer subagent)
5. Return the final ticket
This is not a rigid pipeline — the manager reasons at each step and can adjust. If the error classifier is uncertain, the manager notes the low confidence. If the reviewer escalates, the manager upgrades the risk level.
Three extension points
Athena is designed to be extended without changing Python code:
| Extension Point | What It Is | How to Extend |
|---|---|---|
| Skills | Markdown documents loaded as agent context — domain knowledge, diagnostic workflows, SOPs | Add a new SKILL.md under skills/ and reference its path in subagents.yaml |
| Tools | Python functions the agent can call at runtime, such as web_search | Add a new @tool function and register it by name in pipeline.py |
| Subagents | Specialist agents defined in YAML — each with its own model, skills, and tools | Add a new entry to subagents.yaml |
Skills are the most accessible of the three — anyone who can write a diagnostic runbook can create a skill. We’ll look at each of these in detail.
The orchestrator — ops_manager
The ops_manager is created by a single function call, and the entire pipeline runs through run_pipeline().
Here’s how the pieces fit together — deterministic Python handles the API plumbing, while the LLM handles reasoning.
Creating the agent
The create_ops_manager() function in pipeline.py wires together the deep agent from its component parts:
def create_ops_manager(settings: Settings):
"""Create the ops_manager Deep Agent configured by filesystem files."""
return create_deep_agent(
model=_make_maas_model( (1)
os.environ.get("OPS_MANAGER_MODEL", "claude-sonnet-4-6")
),
memory=["./AGENTS.md"], (2)
tools=[], (3)
subagents=load_subagents(PROJECT_DIR / "subagents.yaml"), (4)
backend=FilesystemBackend(root_dir=PROJECT_DIR), (5)
)
| 1 | The LLM model — routed through the MaaS gateway (Red Hat OpenShift AI) |
| 2 | The agent’s persona and triage protocol — loaded from AGENTS.md as persistent memory |
| 3 | No direct tools — the manager delegates rather than acting directly |
| 4 | Specialist subagents loaded from YAML configuration |
| 5 | Filesystem backend — agents read incident.json and write artifacts to disk |
Notice what’s not in the code: no hardcoded prompts, no agent routing logic, no domain knowledge.
The manager’s entire persona comes from AGENTS.md, and the specialists come from subagents.yaml.
Change either file and the agent changes — no Python modification required.
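One helper that isn’t shown above is _make_maas_model(). A minimal sketch of what such a helper might look like, assuming the MaaS gateway exposes an OpenAI-compatible endpoint; the environment variable names below are illustrative, not taken from the repository:

import os
from langchain_openai import ChatOpenAI

def _make_maas_model(model_name: str) -> ChatOpenAI:
    """Build a chat model routed through the MaaS gateway (illustrative sketch).

    Assumes an OpenAI-compatible endpoint; MAAS_BASE_URL and MAAS_API_KEY
    are hypothetical variable names used only for this example.
    """
    return ChatOpenAI(
        model=model_name,
        base_url=os.environ["MAAS_BASE_URL"],
        api_key=os.environ["MAAS_API_KEY"],
        temperature=0,
    )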
Loading subagents from YAML
The load_subagents() function reads subagents.yaml and resolves tool references:
def load_subagents(config_path: Path) -> list[dict]:
"""Load subagent definitions from YAML and resolve tool references."""
available_tools = {
"web_search": web_search, (1)
}
with open(config_path) as f:
config = yaml.safe_load(f)
subagents = []
for name, spec in config.items():
subagent = {
"name": name,
"description": spec["description"], (2)
"system_prompt": spec["system_prompt"],
}
if "model" in spec:
model_name = (
spec["model"].split(":")[-1]
if ":" in spec["model"]
else spec["model"]
)
subagent["model"] = _make_maas_model(model_name) (3)
if "tools" in spec:
subagent["tools"] = [ (4)
available_tools[t] for t in spec["tools"]
]
if "skills" in spec:
subagent["skills"] = spec["skills"] (5)
subagents.append(subagent)
return subagents
| 1 | Tool registry — maps string names in YAML to actual Python functions |
| 2 | The description is what the manager reads to decide which specialist to delegate to |
| 3 | Each subagent can use a different model — the reviewer uses a smaller, cheaper model |
| 4 | Tools are resolved from string names to function references |
| 5 | Skills are paths — the Deep Agents framework loads the SKILL.md files at runtime |
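To make the loader’s output concrete, here is what a hypothetical sre_storage entry would look like after load_subagents() resolves it. The entry is illustrative and not one of Athena’s shipped specialists:

{
    "name": "sre_storage",
    "description": "Storage specialist: filesystem, LVM, and NFS failures",
    "system_prompt": "You are a storage-focused SRE at Meridian Financial...",
    "model": _make_maas_model("claude-sonnet-4-6"),  # present only if model: was set in YAML
    "tools": [web_search],                           # string names resolved to functions
    "skills": [                                      # paths the framework loads at runtime
        "./skills/analyze-storage-failure/",
        "./skills/create-ticket/",
        "./skills/common/",
    ],
}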
Running the pipeline
The run_pipeline() function ties it all together — write the incident to disk, invoke the agent, parse the result:
async def run_pipeline(
envelope: IncidentEnvelope, settings: Settings
) -> TicketPayload:
# Write incident context for agents to read
incident_path = PROJECT_DIR / "incident.json"
incident_path.write_text(
envelope.model_dump_json(indent=2) (1)
)
agent = create_ops_manager(settings)
incident_summary = ( (2)
f"A failed AAP2 job requires analysis.\n\n"
f"Job: {envelope.job.name} (ID: {envelope.job.id})\n"
f"Template: {envelope.job.template_name}\n"
f"Error excerpt:\n{envelope.artifacts.error_excerpt}\n\n"
f"Read incident.json for full context. "
f"Classify the failure, delegate to the right specialist, "
f"have the reviewer validate, and return a TicketPayload JSON."
)
final_message = None
async for chunk in agent.astream( (3)
{"messages": [("user", incident_summary)]},
config={"configurable": {
"thread_id": f"incident-{envelope.event_id}"
}},
stream_mode="values",
):
if "messages" in chunk:
messages = chunk["messages"]
if messages:
last = messages[-1]
if isinstance(last, AIMessage) and last.content:
final_message = last
ticket_data = _extract_json(final_message.content) (4)
ticket = TicketPayload(**ticket_data) (5)
return ticket
| 1 | Pydantic model serialized to JSON on disk — the agent reads this file during analysis |
| 2 | The prompt summarizes the incident — but the full context is in incident.json |
| 3 | Streaming execution — the agent runs its full ReAct loop (classify → delegate → review) |
| 4 | The agent returns natural language with embedded JSON — this extracts the JSON |
| 5 | Pydantic validation ensures the output matches the TicketPayload schema |
The key design decision here is the hybrid architecture: the pipeline function is deterministic Python. It doesn’t classify, analyze, or make decisions — it just wires things up and validates the output. All reasoning happens inside the agent’s ReAct loop.
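The _extract_json() helper from callout 4 isn’t reproduced here. A minimal sketch of that kind of helper, assuming the agent may wrap its JSON in prose or a Markdown code fence:

import json
import re

def _extract_json(text: str) -> dict:
    """Pull the first JSON object out of an LLM reply (illustrative sketch)."""
    # Prefer an explicit ```json ... ``` fence if the model produced one
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Otherwise fall back to the outermost {...} span in the text
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in agent output")
    return json.loads(text[start:end + 1])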
Skills — domain knowledge as context
Skills are the most powerful and accessible extension point in the deep agent architecture. A skill is a markdown file that gets loaded into an agent’s context — it becomes part of what the agent "knows."
No code. No SDK. No compilation. If you can write a diagnostic runbook, you can write a skill.
The skills directory
Each skill is a directory containing a SKILL.md file. Here’s how they’re organized in the
Athena repository:
skills/
├── error-classifier/SKILL.md # Routes failures to the right specialist
├── analyze-ansible-failure/SKILL.md # Ansible/AAP2 diagnostic workflow
├── analyze-linux-failure/SKILL.md # Linux host-level diagnostics
├── analyze-openshift-failure/SKILL.md # Kubernetes/OpenShift diagnostics
├── analyze-networking-failure/SKILL.md # Network connectivity diagnostics
├── analyze-package-management/SKILL.md # Package manager + Satellite SOPs
├── create-ticket/SKILL.md # TicketPayload output schema
├── review-ticket/SKILL.md # Quality validation checklist
└── common/
└── log-analysis/SKILL.md # Shared log-parsing workflows
Skills are referenced by path in subagents.yaml — each specialist loads only the skills it needs:
sre_linux:
skills:
- ./skills/analyze-linux-failure/ # Domain-specific diagnostics
- ./skills/create-ticket/ # Output schema guidance
- ./skills/common/ # Shared log-parsing
The error classifier — the routing brain
The error-classifier skill is what the ops_manager uses to decide which specialist should investigate a failure. It maps error signals to domains:
# Error Classifier
Classify the failure domain from an AAP2 job failure incident.
## Workflow
1. **Read** the error excerpt and stdout from the incident envelope
2. **Scan** for domain-specific signals:
- **Ansible**: task/role/play references, module errors, collection not found,
credential failures, jinja2 template errors, variable undefined.
NOT package manager failures — "No package X available" or dnf/yum
errors are package_management even when surfaced by an Ansible task
- **Linux**: systemd unit failures, SELinux denials (avc:),
permission denied on files, filesystem full/mount errors
- **Package Management**: dnf/yum errors, "No match for argument",
"No package X available", missing or disabled repositories,
Satellite content gaps, CRB/EPEL requirements
- **OpenShift/Kubernetes**: pod/container/image references,
CrashLoopBackOff, ImagePullBackOff, RBAC denied, namespace/quota errors
- **Networking**: DNS resolution failed, connection refused/timeout,
SSH errors, TLS/SSL certificate errors, proxy errors
3. **Resolve** ambiguity: if signals span multiple domains, identify
the root cause domain. Example: "Ansible task failed because DNS
lookup timed out" → networking (not ansible)
4. **Emit** classification:
- `domain`: one of ansible, linux, package_management, openshift, networking
- `delegate_to`: the exact subagent name (e.g. sre_linux)
- `confidence`: 0-100 based on signal strength
- `rationale`: one sentence explaining why this domain was chosen
This is the full skill — 25 lines of markdown that drive the entire routing decision. The ambiguity resolution rule in step 3 is critical: when an Ansible task fails because a package is missing, the root cause is package management, not Ansible. This is the kind of institutional knowledge that would otherwise live in a senior engineer’s head.
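Applied to the skill’s own ambiguity example (an Ansible task that failed because a DNS lookup timed out), the classifier would emit something like the following; the exact confidence value and wording are illustrative:

{
  "domain": "networking",
  "delegate_to": "sre_networking",
  "confidence": 85,
  "rationale": "The task failed only because the DNS lookup timed out, so the root cause is name resolution, not Ansible."
}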
A domain skill — institutional knowledge as markdown
The richest example is the analyze-package-management skill. It encodes Meridian Financial’s Satellite infrastructure, content view architecture, and SOPs — the kind of knowledge a new hire would take weeks to absorb:
# analyze-package-management
You are analyzing a package management failure on a Red Hat Enterprise
Linux host managed by Meridian Financial's Ansible Automation Platform.
## Diagnostic Workflow
1. Read `incident.json` — identify the exact package name and the DNF
error string
2. Classify the error:
- `No match for argument: <package>` → package absent from all
enabled repos/content views
- `Repository 'X' is disabled` → repo exists in Satellite but not
enabled in this host's activation key
- `Failed to download metadata for repo` → Satellite reachability
or content sync issue
3. Determine whether the package requires CRB, EPEL, or a custom
Satellite content view
4. Select the appropriate SOP path
## Meridian Financial Satellite Infrastructure
- **Primary:** `satellite-primary.meridian.internal` (London DC)
- **Replica:** `satellite-replica.meridian.internal` (Dublin DC)
### Content View Architecture
Content views are **per-team, per-project** — not global:
- `cv-platform-rhel9` — Platform team (includes CRB, internal tooling)
- `cv-security-rhel9` — Security team (includes CRB, SCAP tools)
- `cv-base-rhel9` — Default for most RHEL VMs (no CRB, no EPEL)
**CRB note:** CodeReady Builder is included in `cv-platform-rhel9`
and `cv-security-rhel9` but **not** in `cv-base-rhel9`. Most RHEL VMs
are registered to `cv-base-rhel9`, which is why packages requiring CRB
(like `python3.14`) fail.
## Common Failure Patterns
| DNF Error | Root Cause | Required Action |
|-----------|-----------|----------------|
| `No match for argument: python3.14` | Not in cv-base-rhel9; requires CRB + EPEL | Request new content view (SOP v2.3) |
| `Repository 'epel' is disabled` | EPEL not in activation key | Content view update request |
| Package in Dev but not Prod | Content view promoted to Dev only | Promote to target lifecycle environment |
## SOP v2.3 — New Content View Request
1. Raise ticket to Platform team: "New Content View Request"
2. Include: project name, target lifecycle, package list, CRB/EPEL
requirements, business justification
3. Standard SLA: 2 business days
This skill gives the agent the same knowledge a Meridian Financial SRE would have after months on the team: which Satellite servers exist, how content views are structured, why CRB packages fail on default VMs, and exactly which SOP to follow.
Why skills matter
Skills are "just markdown" — but that simplicity is the point:
- Accessible: Anyone who can write a diagnostic runbook can create a skill. No Python, no SDK, no build step.
- Institutional knowledge: Skills encode the expertise that otherwise lives in someone’s head or scattered wiki pages — infrastructure-specific details, SOPs, failure pattern tables.
- Runtime extensible: Skills live on a PersistentVolumeClaim in OpenShift. You can add, modify, or remove skills without rebuilding the container image or redeploying the application.
- Composable: Each subagent loads a combination of skills — a domain-specific analysis skill, the shared create-ticket schema, and common log-parsing workflows. Different specialists get different knowledge.
In Module 3, you added a new specialist by creating a skill and a subagent entry. That’s the pattern: write the knowledge, wire the configuration, and the agent learns a new domain.
Tools — giving agents hands
Skills give agents knowledge. Tools give agents capabilities — the ability to take actions at runtime that go beyond what’s in their context.
In Athena, the tool surface is deliberately tiny. There is exactly one tool: web_search.
import os
from typing import Literal

from langchain_core.tools import tool
@tool (1)
def web_search(
query: str, (2)
max_results: int = 5,
topic: Literal["general", "news"] = "general",
) -> dict:
"""Search the web for documentation, CVEs, or (3)
troubleshooting guides.
Args:
query: Specific search query (be detailed)
max_results: Number of results to return (default: 5)
topic: "general" for docs/guides,
"news" for recent incidents/CVEs
Returns:
Search results with titles, URLs, and content excerpts.
"""
try:
from tavily import TavilyClient
api_key = os.environ.get("TAVILY_API_KEY")
if not api_key:
return {
"error": "TAVILY_API_KEY not set — "
"web search unavailable"
}
client = TavilyClient(api_key=api_key)
return client.search(
query, max_results=max_results, topic=topic
)
except Exception as e:
return {"error": f"Search failed: {e}"}
| 1 | The @tool decorator from LangChain — it extracts the function signature and docstring to create a tool schema that the LLM can call |
| 2 | Typed parameters become the tool’s input schema — the LLM sees these types and descriptions |
| 3 | The docstring becomes the tool’s description — this is what the agent reads to decide when to use the tool |
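Because the decorator turns the function into a LangChain tool object, you can exercise it directly while developing, outside any agent loop. A quick check, assuming TAVILY_API_KEY is set; the query string is just an example:

result = web_search.invoke({
    "query": "RHEL 9 dnf 'No match for argument' CodeReady Builder repository",
    "max_results": 3,
})
# Tavily returns a dict with a "results" list; our wrapper returns {"error": ...} on failure
print(result.get("error") or [r["title"] for r in result["results"]])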
How tools are wired
Tools are registered in pipeline.py by name and referenced in subagents.yaml:
# In pipeline.py
available_tools = {
"web_search": web_search, # String name → function reference
}
# In subagents.yaml
sre_linux:
tools:
- web_search # References the string name above
The design decision: one tool. Domain knowledge lives in skills, not tools. A tool is for capabilities the agent can’t get from its static context — searching the web for CVEs, checking documentation for a specific error code. Most of the analysis work uses the knowledge already loaded via skills.
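If a second capability were ever needed, it would follow the same pattern. Here is a hypothetical resolve_hostname tool (not part of Athena) and how it would be registered — the sort of runtime check an sre_networking specialist might use:

import socket

from langchain_core.tools import tool

@tool
def resolve_hostname(hostname: str) -> dict:
    """Resolve a hostname to its IP addresses.

    Args:
        hostname: Fully qualified hostname to resolve.

    Returns:
        The resolved addresses, or an error message.
    """
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return {"hostname": hostname, "addresses": addresses}
    except OSError as e:
        return {"error": f"Resolution failed: {e}"}

# In pipeline.py, add it to the registry next to web_search ...
available_tools = {
    "web_search": web_search,
    "resolve_hostname": resolve_hostname,
}
# ... and reference "resolve_hostname" under a subagent's tools: list in subagents.yaml.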
Beyond tools — MCP
The @tool pattern works well for tools that are part of your application — you write the function, decorate it, and the agent can call it.
But what about tools that live outside your application?
The Model Context Protocol (MCP) is a standard for agents to discover and use tools across system boundaries. Think of it like USB for AI agents — a universal connector that lets an agent talk to any MCP-compatible tool server.
With @tool, the tool function is bundled into the application at build time; it ships inside the container image.
With MCP, an agent connects to a tool server at runtime and discovers what capabilities are available — a database query server, a monitoring API, a ticketing system.
The agent doesn’t need to know about these tools in advance; it discovers them dynamically.
Athena doesn’t use MCP today — its single web_search tool is bundled into the application.
But the pattern is compatible: the agent’s tool interface is the same whether the tool comes from a local function or an MCP server.
As the agent’s capabilities grow beyond web search, MCP provides the architecture to connect to external services without bundling everything into one application.
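For a sense of what that would look like in practice, here is a rough sketch using the langchain-mcp-adapters package. The server name, URL, and transport are purely illustrative, and the client API may differ between package versions, so treat this as a shape rather than a recipe:

from langchain_mcp_adapters.client import MultiServerMCPClient

async def discover_mcp_tools() -> list:
    # Connect to a hypothetical MCP server that fronts a monitoring API
    client = MultiServerMCPClient({
        "monitoring": {
            "url": "https://monitoring.example.internal/mcp",
            "transport": "streamable_http",
        },
    })
    # Ask the server what tools it advertises; nothing is known in advance
    return await client.get_tools()

# The discovered tools could then be passed to create_deep_agent(tools=...)
# or merged into the available_tools registry alongside web_search.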
Explore further
To dig deeper into the code and concepts covered here:
The Athena repository
Start with these files:
| File | What to Look At |
|---|---|
| pipeline.py | The orchestration core — create_ops_manager(), load_subagents(), and run_pipeline() |
| AGENTS.md | The ops_manager’s persona and triage protocol — this is the agent’s "memory" |
| subagents.yaml | All specialist definitions — descriptions, models, skills, tools |
| skills/ | Every domain skill — start with error-classifier/SKILL.md |
| The data models | Pydantic V2 data models — IncidentEnvelope and TicketPayload |
Reference reading
- Building Effective Agents — Anthropic’s guide to agent architecture patterns, including the manager-specialist delegation pattern used in Athena
- LangChain Python Documentation — the framework underlying the Deep Agents library
- Model Context Protocol — the specification for runtime tool discovery mentioned in the MCP section