# /aiops-skill:root-cause-analysis
Investigate failed Ansible/AAP jobs by correlating logs with Splunk OCP pod logs and GitHub configuration, then generate a structured root cause summary with evidence and recommendations.
## When to Use
- Investigate why a specific job failed
- Analyze Ansible/AAP logs for errors and failure patterns
- Correlate infrastructure events across AAP logs and Splunk pod logs
- Find root causes of failed deployments or provisioning runs
- Troubleshoot Kubernetes/OpenShift problems surfaced during job execution
- Get specific, evidence-backed recommendations for configuration fixes
Example invocations:

- "Analyze job 1234567 for root cause"
- "Why did job 1234567 fail?"
- "Investigate the failure in job 1234567"
- "Debug the deployment failure in job 1234567"
## Prerequisites
**JOB_LOGS_DIR** (required)
Local directory where job log files are stored. The skill searches here first before attempting a remote fetch.

**JUMPBOX_URI** (required)
SSH jumpbox connection string for uploading analysis results and feedback after the investigation.

**SSH / REMOTE_HOST**
SSH access to the remote log server. Without this, the `--fetch` flag won't work and logs must be placed in `JOB_LOGS_DIR` manually.

**Splunk credentials**
`SPLUNK_HOST`, `SPLUNK_USERNAME`, `SPLUNK_PASSWORD`, `SPLUNK_INDEX`, `SPLUNK_OCP_APP_INDEX`, `SPLUNK_OCP_INFRA_INDEX`. Without these, Steps 2 and 3 (log correlation) are skipped.

**GITHUB_TOKEN**
Personal access token for the GitHub API. Without this, Step 4 (config fetching) is skipped and AgnosticD/AgnosticV context won't be available.

Running `scripts/cli.py setup --json` checks these settings, and the skill offers to walk you through configuring any missing items. Secrets are written to `.claude/settings.json`; ensure this file is in `.gitignore`.
## 5-Step Pipeline
### Step 1: Parse Job Log (Python)

Reads the local job log file and extracts the job ID, GUID, namespace, failed task names, task durations, and the job time window.

Output: `.analysis/<job-id>/step1_job_context.json`
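As an illustration, the extraction in Step 1 can be sketched as below. The line-delimited event layout (`job_id`, `event`, `event_data`, `created` keys) is an assumption for this example, not the skill's actual log schema:

```python
import gzip
import json

def parse_job_log(path):
    """Extract a minimal job context from a gzipped, line-delimited job log.

    Assumed event keys: job_id, event, event_data, created. The real
    parser in scripts/ may read a different layout.
    """
    with gzip.open(path, "rt") as fh:
        events = [json.loads(line) for line in fh if line.strip()]

    # Ansible runner-style failure events carry the failed task name.
    failed_tasks = [
        e.get("event_data", {}).get("task")
        for e in events
        if e.get("event") == "runner_on_failed"
    ]
    # The job time window is simply the earliest and latest event timestamps.
    timestamps = sorted(e["created"] for e in events if "created" in e)
    return {
        "job_id": events[0].get("job_id"),
        "failed_tasks": failed_tasks,
        "window": {"start": timestamps[0], "end": timestamps[-1]},
    }
```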
### Step 2: Query Splunk (Python)

Fetches OCP pod logs from the job's namespace within the job time window using the Splunk REST API. Searches both the app and infra indexes.

Output: `step2_splunk_logs.json`
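A minimal, stdlib-only sketch of such a query against Splunk's `search/jobs/export` REST endpoint. The `kubernetes.namespace_name` field name is an assumption about how the OCP log forwarder tags pod events; the skill's real search may differ:

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request

def build_search(index, namespace, earliest, latest):
    # SPL query scoped to one namespace and the job's time window.
    return (f'search index={index} kubernetes.namespace_name="{namespace}" '
            f'earliest="{earliest}" latest="{latest}"')

def fetch_pod_logs(host, username, password, index, namespace,
                   earliest, latest, verify_ssl=False):
    """Run a blocking export search against Splunk's REST API (port 8089)."""
    data = urllib.parse.urlencode({
        "search": build_search(index, namespace, earliest, latest),
        "output_mode": "json",
    }).encode()
    req = urllib.request.Request(
        f"https://{host}:8089/services/search/jobs/export", data=data)
    # Splunk's REST API accepts HTTP basic auth.
    cred = base64.b64encode(f"{username}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {cred}")
    ctx = ssl.create_default_context()
    if not verify_ssl:  # mirrors SPLUNK_VERIFY_SSL=false
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        # The export endpoint streams one JSON result per line.
        return [json.loads(line) for line in resp if line.strip()]
```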
### Step 3: Correlate (Python)

Merges AAP and Splunk events into a unified timeline using namespace, GUID, timestamps, and pod names. Identifies which pod errors preceded, coincided with, or followed each failed task.

Output: `step3_correlation.json`
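The correlation logic can be sketched as a simple timestamp join. The event shapes and the 60-second window below are illustrative choices, not the skill's actual parameters:

```python
from datetime import datetime, timedelta

def correlate(failed_tasks, pod_events, window=timedelta(seconds=60)):
    """Relate pod events to each failed AAP task by timestamp.

    failed_tasks: [{"task": str, "ts": ISO-8601 str}]
    pod_events:   [{"pod": str, "message": str, "ts": ISO-8601 str}]
    """
    def ts(s):
        return datetime.fromisoformat(s)

    timeline = []
    for task in failed_tasks:
        t = ts(task["ts"])
        related = []
        for ev in pod_events:
            delta = ts(ev["ts"]) - t
            if abs(delta) <= window:
                # Classify each nearby pod event relative to the failure.
                phase = ("preceded" if delta < timedelta(0)
                         else "followed" if delta > timedelta(0)
                         else "coincided")
                related.append({**ev, "relation": phase})
        timeline.append({"task": task["task"], "related_pod_events": related})
    return timeline
```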
### Step 4: Fetch GitHub Files (Python)

Parses job metadata, then retrieves AgnosticV configuration files and AgnosticD workload role code from GitHub. Uses the GitHub token to access private repositories.

Output: `step4_github_fetch_history.json`
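The GitHub calls in this step boil down to authenticated contents-API requests. Owner, repo, and path come from the job metadata; this stdlib-only helper is a sketch rather than the skill's actual fetcher:

```python
import base64
import json
import urllib.request

def decode_contents(payload):
    """The GitHub contents API returns file bodies base64-encoded."""
    return base64.b64decode(payload["content"]).decode()

def fetch_file(owner, repo, path, token, ref="main"):
    """Fetch a single file from a (possibly private) repo using a PAT."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return decode_contents(json.load(resp))
```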
### Step 5: Analyze & Generate Summary (Claude)

Claude reads the step 1, 3, and 4 outputs (plus step 2 if deeper investigation is needed) and produces a structured JSON summary with root cause category, confidence, evidence, correlation, and file-level recommendations.

Output: `step5_analysis_summary.json`

After saving, the skill uploads the results to the jumpbox:

```bash
python scripts/cli.py upload --job-id <job-id>
```
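For orientation, a step 5 summary might look like the fragment below. The values are invented and the exact field names may differ from the skill's real schema; only the field categories (root cause, confidence, evidence, correlation, recommendations) come from the description above:

```json
{
  "root_cause_category": "configuration",
  "confidence": "high",
  "evidence": [
    {"source": "aap_log", "detail": "Task 'Deploy workload' failed with ImagePullBackOff"},
    {"source": "splunk_ocp_app", "detail": "Registry returned 'manifest unknown' for the pinned tag"}
  ],
  "correlation": "Registry errors in pod logs preceded the failed task",
  "recommendations": [
    {"file": "common.yaml", "change": "Point the image tag at a published release"}
  ]
}
```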
## Workflow
### Preflight Check

The skill creates a virtual environment and installs dependencies, then runs `scripts/cli.py setup --json` to verify all required settings, offering interactive configuration for any missing items.

### Run Analysis CLI
```bash
# By job ID: auto-fetches the log from the remote server if not found locally
.venv/bin/python scripts/cli.py analyze --job-id 1234567 --fetch

# By explicit path: when you already have the log file
.venv/bin/python scripts/cli.py analyze --job-log /path/to/job_123.json.gz
```

### Claude Analyzes Step Outputs

After Steps 1-4 complete, Claude reads all output files and generates `step5_analysis_summary.json`, following the evidence and recommendation schema.

### Upload Results

The skill uploads the full `.analysis/<job-id>/` directory to the configured jumpbox for team sharing.
## Configuration
| Variable | Purpose | Example |
|---|---|---|
| `JOB_LOGS_DIR` | Local directory for job log files | `/home/user/aiops_extracted_logs` |
| `JUMPBOX_URI` | SSH jumpbox for uploading results | `user@jumpbox.example.com -p 22` |
| `REMOTE_HOST` | SSH alias for remote log server | `log-server` |
| `REMOTE_DIR` | Directory on remote log server | `/var/log/aap/jobs` |
| `SPLUNK_HOST` | Splunk server hostname | `splunk.example.com` |
| `SPLUNK_USERNAME` | Splunk username | `analyst` |
| `SPLUNK_PASSWORD` | Splunk password | (secret) |
| `SPLUNK_INDEX` | Primary Splunk index | `aap_jobs` |
| `SPLUNK_OCP_APP_INDEX` | OCP application logs index | `ocp_app` |
| `SPLUNK_OCP_INFRA_INDEX` | OCP infrastructure logs index | `ocp_infra` |
| `SPLUNK_VERIFY_SSL` | Whether to verify SSL certificates | `false` |
| `GITHUB_TOKEN` | GitHub personal access token | `ghp_xxxxxxxxxxxx` |
| `MLFLOW_PORT` | MLflow tracing server port (optional) | `5000` |
| `MLFLOW_EXPERIMENT_NAME` | MLflow experiment name (optional) | `rca-analysis` |
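As a sketch, if the skill stores these as environment variables in `.claude/settings.json` (an assumption; the interactive setup writes the real layout), the file might look like:

```json
{
  "env": {
    "JOB_LOGS_DIR": "/home/user/aiops_extracted_logs",
    "JUMPBOX_URI": "user@jumpbox.example.com -p 22",
    "SPLUNK_HOST": "splunk.example.com",
    "SPLUNK_USERNAME": "analyst",
    "SPLUNK_PASSWORD": "REPLACE_ME",
    "GITHUB_TOKEN": "REPLACE_ME"
  }
}
```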
## Output Files
| Step | File | Author |
|---|---|---|
| 1 | `step1_job_context.json` | Python |
| 2 | `step2_splunk_logs.json` | Python |
| 3 | `step3_correlation.json` | Python |
| 4 | `step4_github_fetch_history.json` | Python |
| 5 | `step5_analysis_summary.json` | Claude |
All files are written to `.analysis/<job-id>/`, relative to the skill directory.