/aiops-skill:root-cause-analysis

🔎 Root Cause Analysis

Investigate failed Ansible/AAP jobs by correlating AAP job logs with Splunk-indexed OCP pod logs and GitHub configuration, then generate a structured root-cause summary with evidence and recommendations.


When to Use

💡
Invoke this skill when you want to:
  • Investigate why a specific job failed
  • Analyze Ansible/AAP logs for errors and failure patterns
  • Correlate infrastructure events across AAP logs and Splunk pod logs
  • Find root causes of failed deployments or provisioning runs
  • Troubleshoot Kubernetes/OpenShift problems surfaced during job execution
  • Get specific, evidence-backed recommendations for configuration fixes

Example invocations:

"Analyze job 1234567 for root cause"
"Why did job 1234567 fail?"
"Investigate the failure in job 1234567"
"Debug the deployment failure in job 1234567"

Prerequisites

📁

JOB_LOGS_DIR Required

Local directory where job log files are stored. The skill searches here first before attempting a remote fetch.

🔗

JUMPBOX_URI Required

SSH jumpbox connection string for uploading analysis results and feedback after the investigation.

🖥️

SSH / REMOTE_HOST

SSH access to the remote log server. Without this, the --fetch flag won't work; logs must be placed in JOB_LOGS_DIR manually.

📊

Splunk Credentials

SPLUNK_HOST, SPLUNK_USERNAME, SPLUNK_PASSWORD, SPLUNK_INDEX, SPLUNK_OCP_APP_INDEX, SPLUNK_OCP_INFRA_INDEX. Without these, Steps 2–3 (Splunk query and log correlation) are skipped.

๐Ÿ™

GITHUB_TOKEN

Personal access token for GitHub API. Without this, Step 4 (config fetching) is skipped and AgnosticD/AgnosticV context won't be available.

ℹ️
Interactive setup: On first run, the skill checks all prerequisites via scripts/cli.py setup --json and offers to walk you through configuring any missing items. Secrets are written to .claude/settings.json; ensure this file is in .gitignore.
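
Conceptually, the prerequisite check is a scan over the settings listed in this section. A minimal sketch in Python (the required/optional split below is illustrative; the real classification lives in scripts/cli.py):

```python
import json

# Illustrative split of the documented settings; the skill's actual
# required/optional classification may differ.
REQUIRED = ["JOB_LOGS_DIR", "JUMPBOX_URI"]
OPTIONAL = ["REMOTE_HOST", "SPLUNK_HOST", "GITHUB_TOKEN"]

def check_settings(env):
    """Report missing settings, similar in spirit to `setup --json`."""
    return {
        "missing_required": [k for k in REQUIRED if not env.get(k)],
        "missing_optional": [k for k in OPTIONAL if not env.get(k)],
    }

# With only JOB_LOGS_DIR configured, JUMPBOX_URI is flagged as required.
print(json.dumps(check_settings({"JOB_LOGS_DIR": "/tmp/logs"})))
```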

5-Step Pipeline

  1. Step 1 – Parse Job Log [Python]

    Reads the local job log file and extracts: job ID, GUID, namespace, failed task names, task durations, and the job time window. Output: .analysis/<job-id>/step1_job_context.json
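
The extraction in Step 1 can be sketched against a simplified event list (the event field names below are assumptions for illustration, not the actual AAP log schema):

```python
from datetime import datetime

# Hypothetical AAP job events; the real log format may differ.
events = [
    {"task": "Provision instance", "failed": False, "created": "2024-05-01T10:00:00"},
    {"task": "Deploy workload", "failed": True, "created": "2024-05-01T10:05:30"},
]

def job_context(events):
    """Extract failed task names and the job time window, as Step 1 does."""
    times = [datetime.fromisoformat(e["created"]) for e in events]
    return {
        "failed_tasks": [e["task"] for e in events if e["failed"]],
        "window": (min(times).isoformat(), max(times).isoformat()),
    }

print(job_context(events))
```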

  2. Step 2 – Query Splunk [Python]

    Fetches OCP pod logs from the job's namespace within the job time window using the Splunk REST API. Searches both app and infra indexes. Output: step2_splunk_logs.json
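
Step 2's search can be pictured as one SPL query per index, scoped to the namespace and time window. A sketch, assuming a common OpenShift log field name (`kubernetes.namespace_name`) that may differ from the skill's real query:

```python
def build_spl(index, namespace, earliest, latest):
    """Build an SPL search scoped to the job namespace and time window.

    `kubernetes.namespace_name` is a common OpenShift log field but an
    assumption here; the skill's actual query may differ.
    """
    return (
        f'search index={index} kubernetes.namespace_name="{namespace}" '
        f'earliest="{earliest}" latest="{latest}"'
    )

# One query per index, mirroring the app + infra search.
for idx in ("ocp_app", "ocp_infra"):
    print(build_spl(idx, "job-1234567-ns", "2024-05-01T10:00:00", "2024-05-01T10:10:00"))
```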

  3. Step 3 – Correlate [Python]

    Merges AAP and Splunk events into a unified timeline using namespace, GUID, timestamps, and pod names. Identifies which pod errors preceded, coincided with, or followed each failed task. Output: step3_correlation.json
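
The preceded/coincided/followed classification in Step 3 can be sketched as a timestamp comparison against each failed task (the two-minute window below is an arbitrary illustration, not the skill's tuning):

```python
from datetime import datetime, timedelta

def classify(pod_events, task_time, window=timedelta(minutes=2)):
    """Tag each pod error as preceding, coinciding with, or following a failed task."""
    out = []
    for e in pod_events:
        t = datetime.fromisoformat(e["time"])
        if t < task_time - window:
            rel = "preceded"
        elif t > task_time + window:
            rel = "followed"
        else:
            rel = "coincided"
        out.append({**e, "relation": rel})
    return out

failed_at = datetime.fromisoformat("2024-05-01T10:05:30")
pods = [
    {"pod": "db-0", "time": "2024-05-01T10:01:00"},
    {"pod": "app-1", "time": "2024-05-01T10:05:00"},
]
print(classify(pods, failed_at))
```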

  4. Step 4 – Fetch GitHub Files [Python]

    Parses job metadata, then retrieves AgnosticV configuration files and AgnosticD workload role code from GitHub. Uses the GitHub token to access private repositories. Output: step4_github_fetch_history.json
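
Fetching a single config file can be sketched with the GitHub contents API (the owner, repo, and path below are placeholders; the skill derives the real AgnosticV/AgnosticD locations from job metadata):

```python
def contents_request(owner, repo, path, token, ref="main"):
    """Build a GitHub contents-API request for one file.

    Owner/repo/path here are placeholders, not the skill's real locations.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}"
    headers = {
        "Authorization": f"Bearer {token}",
        # Ask for the raw file body instead of the base64 JSON envelope.
        "Accept": "application/vnd.github.raw+json",
    }
    return url, headers

url, headers = contents_request("example-org", "agnosticv", "common.yaml", "ghp_example")
print(url)
```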

  5. Step 5 – Analyze & Generate Summary [Claude]

    Claude reads the step 1, 3, and 4 outputs (plus step 2 if deeper investigation is needed) and produces a structured JSON summary with root cause category, confidence, evidence, correlation, and file-level recommendations. Output: step5_analysis_summary.json

    After saving, the skill uploads results to the jumpbox: python scripts/cli.py upload --job-id <job-id>


Workflow

  1. Preflight Check

    The skill creates a virtual environment and installs dependencies, then runs scripts/cli.py setup --json to verify all required settings. It offers interactive configuration for any missing items.

  2. Run Analysis CLI

    ```bash
    # By job ID: auto-fetches the log from remote if not found locally
    .venv/bin/python scripts/cli.py analyze --job-id 1234567 --fetch

    # By explicit path: when you already have the log file
    .venv/bin/python scripts/cli.py analyze --job-log /path/to/job_123.json.gz
    ```

  3. Claude Analyzes Step Outputs

    After Steps 1–4 complete, Claude reads all output files and generates step5_analysis_summary.json following the evidence and recommendation schema.

  4. Upload Results

    The skill uploads the full .analysis/<job-id>/ directory to the configured jumpbox for team sharing.


Configuration

| Variable | Purpose | Example |
|---|---|---|
| JOB_LOGS_DIR | Local directory for job log files | /home/user/aiops_extracted_logs |
| JUMPBOX_URI | SSH jumpbox for uploading results | user@jumpbox.example.com -p 22 |
| REMOTE_HOST | SSH alias for the remote log server | log-server |
| REMOTE_DIR | Directory on the remote log server | /var/log/aap/jobs |
| SPLUNK_HOST | Splunk server hostname | splunk.example.com |
| SPLUNK_USERNAME | Splunk username | analyst |
| SPLUNK_PASSWORD | Splunk password | (secret) |
| SPLUNK_INDEX | Primary Splunk index | aap_jobs |
| SPLUNK_OCP_APP_INDEX | OCP application logs index | ocp_app |
| SPLUNK_OCP_INFRA_INDEX | OCP infrastructure logs index | ocp_infra |
| SPLUNK_VERIFY_SSL | Whether to verify SSL certificates | false |
| GITHUB_TOKEN | GitHub personal access token | ghp_xxxxxxxxxxxx |
| MLFLOW_PORT | MLflow tracing server port (optional) | 5000 |
| MLFLOW_EXPERIMENT_NAME | MLflow experiment name (optional) | rca-analysis |

Output Files

| Step | File | Author |
|---|---|---|
| 1 | step1_job_context.json | Python |
| 2 | step2_splunk_logs.json | Python |
| 3 | step3_correlation.json | Python |
| 4 | step4_github_fetch_history.json | Python |
| 5 | step5_analysis_summary.json | Claude |

All files are written to .analysis/<job-id>/ relative to the skill directory.