/aiops-skill:root-cause-analysis

🔎 Root Cause Analysis

Investigate failed Ansible/AAP jobs by correlating AAP job logs with Splunk-indexed OCP pod logs and GitHub configuration, then generate a structured root-cause summary with evidence and recommendations.


When to Use

💡
Invoke this skill when you want to:
  • Investigate why a specific job failed
  • Analyze Ansible/AAP logs for errors and failure patterns
  • Correlate infrastructure events across AAP logs and Splunk pod logs
  • Find root causes of failed deployments or provisioning runs
  • Troubleshoot Kubernetes/OpenShift problems surfaced during job execution
  • Get specific, evidence-backed recommendations for configuration fixes

Example invocations:

"Analyze job 1234567 for root cause"
"Why did job 1234567 fail?"
"Investigate the failure in job 1234567"
"Debug the deployment failure in job 1234567"

Prerequisites

📁

JOB_LOGS_DIR Required

Local directory where job log files are stored. The skill searches here first before attempting a remote fetch.

🔗

JUMPBOX_URI Required

SSH jumpbox connection string for uploading analysis results and feedback after the investigation.

🖥️

SSH / REMOTE_HOST

SSH access to the remote log server. Without this, the --fetch flag won't work; logs must be placed in JOB_LOGS_DIR manually.

📊

Splunk Credentials

SPLUNK_HOST, SPLUNK_USERNAME, SPLUNK_PASSWORD, SPLUNK_INDEX, SPLUNK_OCP_APP_INDEX, SPLUNK_OCP_INFRA_INDEX. Without these, Steps 2–3 (Splunk query and log correlation) are skipped.

๐Ÿ™

GITHUB_TOKEN

Personal access token for GitHub API. Without this, Step 4 (config fetching) is skipped and AgnosticD/AgnosticV context won't be available.

ℹ️
Interactive setup: On first run, the skill checks all prerequisites via scripts/cli.py setup --json and offers to walk you through configuring any missing items. Secrets are written to .claude/settings.json; ensure this file is in .gitignore.
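
Conceptually, the prerequisite check is a scan over the settings listed in this section. A minimal sketch in Python (the required/optional split below is illustrative; the real classification lives in scripts/cli.py):

```python
import json

# Illustrative split of the documented settings; the skill's actual
# required/optional classification may differ.
REQUIRED = ["JOB_LOGS_DIR", "JUMPBOX_URI"]
OPTIONAL = ["REMOTE_HOST", "SPLUNK_HOST", "GITHUB_TOKEN"]

def check_settings(env):
    """Report missing settings, similar in spirit to `setup --json`."""
    return {
        "missing_required": [k for k in REQUIRED if not env.get(k)],
        "missing_optional": [k for k in OPTIONAL if not env.get(k)],
    }

# With only JOB_LOGS_DIR configured, JUMPBOX_URI is flagged as required.
print(json.dumps(check_settings({"JOB_LOGS_DIR": "/tmp/logs"})))
```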

5-Step Pipeline

  1. Step 1 – Parse Job Log [Python]

    Reads the local job log file and extracts: job ID, GUID, namespace, failed task names, task durations, and the job time window. Output: .analysis/<job-id>/step1_job_context.json
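
The extraction in Step 1 can be sketched against a simplified event list (the event field names below are assumptions for illustration, not the actual AAP log schema):

```python
from datetime import datetime

# Hypothetical AAP job events; the real log format may differ.
events = [
    {"task": "Provision instance", "failed": False, "created": "2024-05-01T10:00:00"},
    {"task": "Deploy workload", "failed": True, "created": "2024-05-01T10:05:30"},
]

def job_context(events):
    """Extract failed task names and the job time window, as Step 1 does."""
    times = [datetime.fromisoformat(e["created"]) for e in events]
    return {
        "failed_tasks": [e["task"] for e in events if e["failed"]],
        "window": (min(times).isoformat(), max(times).isoformat()),
    }

print(job_context(events))
```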

  2. Step 2 – Query Splunk [Python]

    Fetches OCP pod logs from the job's namespace within the job time window using the Splunk REST API. Searches both app and infra indexes. Output: step2_splunk_logs.json
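
Step 2's search can be pictured as one SPL query per index, scoped to the namespace and time window. A sketch, assuming a common OpenShift log field name (`kubernetes.namespace_name`) that may differ from the skill's real query:

```python
def build_spl(index, namespace, earliest, latest):
    """Build an SPL search scoped to the job namespace and time window.

    `kubernetes.namespace_name` is a common OpenShift log field but an
    assumption here; the skill's actual query may differ.
    """
    return (
        f'search index={index} kubernetes.namespace_name="{namespace}" '
        f'earliest="{earliest}" latest="{latest}"'
    )

# One query per index, mirroring the app + infra search.
for idx in ("ocp_app", "ocp_infra"):
    print(build_spl(idx, "job-1234567-ns", "2024-05-01T10:00:00", "2024-05-01T10:10:00"))
```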

  3. Step 3 – Correlate [Python]

    Merges AAP and Splunk events into a unified timeline using namespace, GUID, timestamps, and pod names. Identifies which pod errors preceded, coincided with, or followed each failed task. Output: step3_correlation.json
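
The preceded/coincided/followed classification in Step 3 can be sketched as a timestamp comparison against each failed task (the two-minute window below is an arbitrary illustration, not the skill's tuning):

```python
from datetime import datetime, timedelta

def classify(pod_events, task_time, window=timedelta(minutes=2)):
    """Tag each pod error as preceding, coinciding with, or following a failed task."""
    out = []
    for e in pod_events:
        t = datetime.fromisoformat(e["time"])
        if t < task_time - window:
            rel = "preceded"
        elif t > task_time + window:
            rel = "followed"
        else:
            rel = "coincided"
        out.append({**e, "relation": rel})
    return out

failed_at = datetime.fromisoformat("2024-05-01T10:05:30")
pods = [
    {"pod": "db-0", "time": "2024-05-01T10:01:00"},
    {"pod": "app-1", "time": "2024-05-01T10:05:00"},
]
print(classify(pods, failed_at))
```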

  4. Step 4 – Fetch GitHub Files [Python]

    Parses job metadata, then retrieves AgnosticV configuration files and AgnosticD workload role code from GitHub. Uses the GitHub token to access private repositories. Output: step4_github_fetch_history.json
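
Fetching a single config file can be sketched with the GitHub contents API (the owner, repo, and path below are placeholders; the skill derives the real AgnosticV/AgnosticD locations from job metadata):

```python
def contents_request(owner, repo, path, token, ref="main"):
    """Build a GitHub contents-API request for one file.

    Owner/repo/path here are placeholders, not the skill's real locations.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}"
    headers = {
        "Authorization": f"Bearer {token}",
        # Ask for the raw file body instead of the base64 JSON envelope.
        "Accept": "application/vnd.github.raw+json",
    }
    return url, headers

url, headers = contents_request("example-org", "agnosticv", "common.yaml", "ghp_example")
print(url)
```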

  5. Step 5 – Analyze & Generate Summary [Claude]

    Claude reads the step 1, 3, and 4 outputs (plus step 2 if deeper investigation is needed) and produces a structured JSON summary with root cause category, confidence, evidence, correlation, and file-level recommendations. Output: step5_analysis_summary.json

    After saving, the skill uploads results to the jumpbox: python scripts/cli.py upload --job-id <job-id>


Workflow

  1. Preflight Check

    The skill creates a virtual environment and installs dependencies, then runs scripts/cli.py setup --json to verify all required settings. It offers interactive configuration for any missing items.

  2. Run Analysis CLI

    ```bash
    # By job ID: auto-fetches the log from remote if not found locally
    .venv/bin/python scripts/cli.py analyze --job-id 1234567 --fetch

    # By explicit path: when you already have the log file
    .venv/bin/python scripts/cli.py analyze --job-log /path/to/job_123.json.gz
    ```

  3. Claude Analyzes Step Outputs

    After Steps 1–4 complete, Claude reads all output files and generates step5_analysis_summary.json following the evidence and recommendation schema.

  4. Upload Results

    The skill uploads the full .analysis/<job-id>/ directory to the configured jumpbox for team sharing.


Configuration

| Variable | Purpose | Example |
|---|---|---|
| JOB_LOGS_DIR | Local directory for job log files | /home/user/aiops_extracted_logs |
| JUMPBOX_URI | SSH jumpbox for uploading results | user@jumpbox.example.com -p 22 |
| REMOTE_HOST | SSH alias for the remote log server | log-server |
| REMOTE_DIR | Directory on the remote log server | /var/log/aap/jobs |
| SPLUNK_HOST | Splunk server hostname | splunk.example.com |
| SPLUNK_USERNAME | Splunk username | analyst |
| SPLUNK_PASSWORD | Splunk password | (secret) |
| SPLUNK_INDEX | Primary Splunk index | aap_jobs |
| SPLUNK_OCP_APP_INDEX | OCP application logs index | ocp_app |
| SPLUNK_OCP_INFRA_INDEX | OCP infrastructure logs index | ocp_infra |
| SPLUNK_VERIFY_SSL | Whether to verify SSL certificates | false |
| GITHUB_TOKEN | GitHub personal access token | ghp_xxxxxxxxxxxx |
| MLFLOW_PORT | MLflow tracing server port (optional) | 5000 |
| MLFLOW_EXPERIMENT_NAME | MLflow experiment name (optional) | rca-analysis |

Output Files

| Step | File | Author |
|---|---|---|
| 1 | step1_job_context.json | Python |
| 2 | step2_splunk_logs.json | Python |
| 3 | step3_correlation.json | Python |
| 4 | step4_github_fetch_history.json | Python |
| 5 | step5_analysis_summary.json | Claude |

All files are written to .analysis/<job-id>/ relative to the skill directory.