Module 2: Your First Failure

In Module 1, you explored a ticket that was already waiting for you. Now you’re going to trigger a failure yourself, watch the Deep Agents pipeline process it end-to-end, and then break down exactly how it works.

Launch a failing job

Step 1: Find the job template

In the AAP2 tab (log in with user-12345 / deeper-agents if prompted), navigate to Automation Execution → Templates in the left sidebar.

Find 03 Install Web Server in the list. This playbook attempts to install a web server package on the RHEL VM — but it has a deliberate error that will cause it to fail.

AAP2 templates list with 03 Install Web Server highlighted

Step 2: Launch the job

Each job template row has three action icons on the right:

AAP2 action icons — rocket launch and edit pencil and kebab menu

Click the rocket icon rocket icon on the 03 Install Web Server row to launch the job immediately.

You’ll be taken to the job output screen. Watch the job run — it should fail within 15-20 seconds.

Step 3: Watch it fail

The job output will show a failure in the package installation task. The error is real — the playbook references a package name that doesn’t match what’s available on the system.

Take note of the error message — you can see "No package apache available" in the output. In a moment, you’ll see the AI agent’s analysis of this same error — and compare it to what you’d have done manually.

AAP2 job output showing the Install Web Server failure

What happens next — inside the Deep Agents pipeline

The job has failed. AAP2 has already fired a webhook. But the AI analysis takes 1-3 minutes to complete — so while you wait, let’s walk through exactly what’s happening behind the scenes.

Deep Agents pipeline sequence diagram

Here’s what each step does:

  1. Webhook fires — AAP2 sends a POST to Athena’s webhook endpoint with the job ID

  2. Retrieve job data — Athena’s ops_manager agent calls the AAP2 API to fetch the full job output, error details, and metadata

  3. Classify the failure — the error-classifier skill analyzes the error and determines the domain: is this an Ansible issue, a Linux issue, an OpenShift issue, or a Networking issue?

  4. Delegate to a specialist — based on the classification, ops_manager instantiates the right subagent on demand. For a package installation failure, it spins up sre_linux — a fresh agent with its own model, system prompt, and skills. The subagent doesn’t exist until it’s needed

  5. Root cause analysissre_linux reads the incident context, analyzes the error against its domain knowledge, and produces a structured ticket with root cause, confidence, risk, and recommended action. When it’s done, it returns its findings and is discarded — keeping the ops_manager context clean

  6. Quality review — a lightweight reviewer agent validates the ticket for coherence and actionability

  7. Create ticket — the structured analysis is submitted to Kira as a ticket with sub-issues

  8. Notify the team — a summary is posted to the Rocket.Chat #support channel

Check the results

By now, the pipeline should have completed. Let’s verify.

Check Kira

In the Kira tab (log in with user-12345 / deeper-agents if prompted), you should see a new ticket on the dashboard — look for the increased open ticket count and a new entry in the Recent Tickets table.

Kira dashboard showing the new web server failure ticket

Click into the new ticket. Compare the AI agent’s analysis to the error you saw in the AAP2 job output:

  • Did it correctly identify the root cause?

  • Is the recommended action specific and actionable?

  • Does the confidence score seem reasonable?

Kira ticket detail showing the DNF package analysis
Use the AI Chatbot (bottom right of the ticket screen) if you want to confer with the AI about the analysis — ask it to explain the root cause in more detail, suggest alternative fixes, or clarify any technical terms. Notice you can choose Open Source models of different sizes — these are provided by MaaS and run across various Nvidia and Intel GPUs utilizing Red Hat OpenShift AI and vLLM
Kira AI chatbot explaining the httpd vs apache package naming

Check Rocket.Chat

In the Rocket.Chat tab (log in with user-12345 / deeper-agents if prompted), navigate to #support. You should see a new notification from the aiops bot for this failure — with the job name, classification, and a link to the Kira ticket.

Rocket.Chat #support channel showing both AI-generated notifications

Breaking it down — the subagent definition

The specialist agent that analyzed your failure is defined in a YAML configuration file. Here’s the actual sre_linux definition:

sre_linux:
  description: > (1)
    Linux specialist. Delegate host-level failures: package manager (dnf/yum),
    systemd services, SELinux, filesystem/permissions, Satellite content issues.
  model: claude-sonnet-4-6 (2)
  system_prompt: | (3)
    You are a senior Linux SRE. You receive incident data from failed AAP2 jobs
    and perform root-cause analysis on host-level Linux issues.

    Always:
    - Read the incident context (incident.json) first
    - Check for package manager errors, repo access issues
    - Look for systemd unit failures, SELinux denials, permission issues
    - Consider Satellite content view configuration for package availability
    - Provide specific commands and config changes as recommendations

    Use the create-ticket skill to structure your analysis as a TicketPayload.
    Set area to "linux" for all Linux-domain issues.
  tools: (4)
    - web_search
  skills: (5)
    - ./skills/analyze-linux-failure/
    - ./skills/create-ticket/
    - ./skills/common/
1 Description — tells the ops_manager when to use this subagent. The ops_manager reads all subagent descriptions and picks the best match based on the error classification. Be specific here — vague descriptions lead to misrouting.
2 Model — which LLM to use from MaaS. claude-sonnet-4-6 balances capability with cost. For simpler tasks, you could use a smaller model. For complex reasoning, a larger one. This is a key cost/quality tradeoff.
3 System prompt — the agent’s persona and instructions. This is where domain expertise is encoded. Notice it’s structured as a role ("You are a senior Linux SRE"), followed by a process ("Always: read context first, check for X, Y, Z"), and an output format instruction ("Use the create-ticket skill"). The quality of this prompt directly determines the quality of the analysis.
4 Tools — functions the agent can call. web_search lets the agent look up documentation or known issues. Tools are the agent’s hands — they let it take actions beyond just reasoning.
5 Skills — markdown documents loaded into the agent’s context that provide domain knowledge and output formatting guidance. analyze-linux-failure contains Linux-specific diagnostic patterns. create-ticket defines the TicketPayload schema. common has shared log-parsing techniques. Skills are the agent’s reference library — they don’t require code, just well-written instructions.

The pattern

This is the Deep Agents pattern: a manager agent that classifies and delegates, backed by specialist subagents with domain-specific skills. The key insight is that each component is independently configurable:

  • Add a new domain — create a new subagent entry in the YAML with its own prompt and skills

  • Improve analysis quality — refine the system prompt or add more detailed skills

  • Control costs — switch individual subagents to cheaper models for simpler failure types

  • Add capabilities — give a subagent new tools (file reading, API calls, database queries)

You don’t need to modify any code to do any of this. The YAML configuration and skill markdown files are all you need to change — and in a later module, you’ll do exactly that.

Takeaways

  • You triggered a real failure and watched the AI pipeline analyze it end-to-end

  • The pipeline takes 1-3 minutes: webhook → classification → specialist analysis → ticket + notification

  • Each subagent is defined in YAML: description, model, system prompt, tools, and skills

  • The system prompt encodes domain expertise — it’s the most important part of the subagent definition

  • Skills are markdown documents that provide reference knowledge — no code required

  • The pattern is extensible: new domains, better prompts, different models, additional tools