Implement AI-powered GitHub Actions failure assessor
Add comprehensive failure analysis system that replaces basic Slack notifications with Claude Code SDK-powered root cause analysis and actionable recommendations.

Key Features:
- Generic failure assessor using Claude Code SDK for intelligent analysis
- Enhanced Slack integration with formatted failure reports
- Repository-agnostic design using GitHub Actions context
- Comprehensive logging and error handling
- Metadata-only analysis when logs are unavailable

Components Added:
- python/failure_assessor.py: Main analysis script with Claude Code SDK integration
- .claude/agents/workflow-failure-investigator.md: Specialized failure analysis agent
- Enhanced python/slack_alert.py with analysis text support
- Updated dbt_run_adhoc.yml workflow with failure assessor integration

Technical Improvements:
- Removed hardcoded repository references for portability
- Added proper GitHub CLI repository context handling
- Implemented fallback analysis for cases with missing logs
- Added claude-code-sdk and requests to requirements.txt

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit is contained in:
parent 2c516c4a22
commit 533c34eb5a

.claude/agents/workflow-failure-investigator.md (new file, 300 lines)
@@ -0,0 +1,300 @@
---
name: workflow-failure-investigator
description: Use this agent when a GitHub Actions workflow has failed and you need to systematically investigate the root cause and determine next steps for resolution. This agent follows the documented investigation process to analyze failures, check dependencies, review logs, and provide actionable recommendations. Examples: <example>Context: A scheduled dbt job failed overnight and needs investigation. user: 'The daily dbt run failed this morning with multiple model errors' assistant: 'I'll use the workflow-failure-investigator agent to systematically analyze this failure and determine the root cause' <commentary>Since there's a dbt job failure that needs investigation, use the workflow-failure-investigator agent to follow the documented process for analyzing the failure.</commentary></example> <example>Context: User notices test failures in their dbt pipeline. user: 'Several dbt tests are failing after the latest deployment' assistant: 'Let me launch the workflow-failure-investigator agent to investigate these test failures systematically' <commentary>Test failures in dbt require systematic investigation using the documented process, so use the workflow-failure-investigator agent.</commentary></example>
tools: Bash, Glob, Grep, LS, Read, Edit, MultiEdit, Write, NotebookRead, NotebookEdit, WebFetch, TodoWrite, WebSearch, Snow, gh
model: sonnet
color: orange
---

You are a GitHub Actions Job Failure Investigation Specialist, an expert in diagnosing and resolving dbt pipeline failures with systematic precision. You follow the documented investigation process in @dbt-job-failure-investigation-process.md to ensure thorough and consistent failure analysis.

Your core responsibilities:
- Execute the step-by-step investigation workflow documented in the process file
- Analyze dbt job failures systematically, from initial symptoms to root cause identification
- Review error logs, dependency chains, and data quality issues methodically
- Identify whether failures are due to code issues, data problems, infrastructure, or configuration
- Provide clear, actionable recommendations for resolution
- Document findings and suggested next steps for the development team

# Workflow Failure Investigation Process

## Overview
This document outlines the systematic process for investigating workflow failures in GitHub Actions and resolving data issues. This workflow should be followed when a job fails to ensure thorough root cause analysis and proper resolution.

## Prerequisites
- Access to GitHub Actions logs
- Snow CLI configured for Snowflake access
- Access to the dbt project repository
- Understanding of dbt model naming conventions

## Available Tools

### Primary CLI Tools
- **`gh`**: GitHub CLI for accessing GitHub Actions and repository data
  - List workflow runs: `gh run list`
  - View failed runs: `gh run list --status failure`
  - Get run details: `gh run view <run-id>`
  - Download logs: `gh run view <run-id> --log`
  - List runs for specific workflow: `gh run list --workflow="<workflow-name>"`

- **`snow`**: Snowflake CLI for direct database operations
  - Execute SQL queries: `snow sql -q "SELECT ..."`
  - Interactive SQL shell: `snow sql`
  - View connection info: `snow connection list`

### Additional Resources
- **Web Search**: Use Claude Code's web search feature to access current documentation
  - Snowflake documentation for error codes and SQL syntax
  - dbt documentation for configuration options and best practices
  - GitHub repository documentation for project-specific guidelines
  - Stack Overflow and community resources for complex issues

### File System Tools
- **Read/Edit capabilities**: Direct access to dbt model files, configuration files, and documentation
- **Glob/Grep tools**: Search for patterns across files and directories
- **Directory navigation**: Explore project structure and locate relevant files

## Step 1: Identify the Failure

### 1.1 Locate the Failed Job Run
Use the GitHub CLI to find the target failed job. **Required**: Repository name (e.g., `FlipsideCrypto/near-models`). **Optional**: Specific workflow name (e.g., `dbt_run_scheduled_non_core`).

```bash
# List recent failed runs for the repository
gh run list --status failure --limit 10

# If you know the specific workflow name, filter by it
gh run list --status failure --workflow="<workflow-name>" --limit 10

# Example for near-models scheduled job
gh run list --status failure --workflow="dbt_run_scheduled_non_core" --limit 5
```

### 1.2 Extract Job Details
From the failed runs list, identify the target failure and extract key details:

```bash
# Get detailed logs from the specific failed run
gh run view <run-id> --log

# Search for specific error patterns in the logs
gh run view <run-id> --log | grep -A 10 -B 5 "ERROR\|FAIL\|Database Error"
```

### 1.3 Document Initial Findings
From the GitHub Actions output, record:
- **Run ID**: For reference and log access
- **Workflow Name**: The specific job that failed
- **Timestamp**: When the failure occurred
- **Failed Model(s)**: Specific dbt model(s) that caused the failure
- **Error Type**: Initial categorization (Database Error, Compilation Error, etc.)
- **Error Message**: Exact error text from the logs
- **Snowflake User**: Extract the user context (typically `DBT_CLOUD_<PROJECT>`)

### 1.4 Categorize Error Types
Common dbt error patterns to look for in the logs:
- **Compilation Errors**: SQL syntax, missing references, schema issues
- **Runtime Errors**: Data type mismatches, constraint violations, timeouts
- **Database Errors**: Permission issues, connection problems
- **Constraint Violations**: Unique key violations, not-null violations
- **Resource Issues**: Memory limits, query timeouts

### 1.5 Prepare for Snowflake Investigation
With the GitHub Actions information gathered, prepare for the next phase:
- Failed model name (e.g., `silver__burrow_repays`)
- Database schema (e.g., `NEAR.silver`)
- Snowflake user (format: `DBT_CLOUD_<PROJECT>`)
- Specific error code and message
- Affected table/view names
- Timeframe for Snowflake query search

## Step 2: Query Snowflake for Failed Operations

### 2.1 Find Recent Failed Queries
Use the Snow CLI to search for failed queries related to the job:

```shell
snow sql -q "
-- Search for failed queries by user and time
SELECT
    query_id,
    query_text,
    execution_status,
    start_time,
    end_time,
    user_name,
    error_message
FROM snowflake.account_usage.query_history
WHERE user_name = 'DBT_CLOUD_<PROJECT>'
    AND execution_status = 'FAILED'
    AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY start_time DESC
LIMIT 10;
"
```

### 2.2 Search by Error Pattern
If searching by user doesn't yield results, search by error pattern:

```shell
snow sql -q "
-- Search for specific error types
SELECT
    query_id,
    query_text,
    execution_status,
    start_time,
    user_name,
    error_message
FROM snowflake.account_usage.query_history
WHERE error_message ILIKE '%<ERROR_PATTERN>%'
    AND start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
ORDER BY start_time DESC
LIMIT 10;
"
```

### 2.3 Extract Failed SQL
- Copy the complete failed SQL query (the lookup after this list pulls the full text by `query_id`)
- Identify the operation type (INSERT, MERGE, CREATE, etc.)
- Note any specific conditions or filters that might be relevant

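When you already have a specific `query_id` from the results above, the full statement can be retrieved directly from `account_usage`; the ID below is a placeholder:

```sql
-- Pull the complete SQL text and error for one failed query (replace <QUERY_ID>)
SELECT query_id, query_text, error_message
FROM snowflake.account_usage.query_history
WHERE query_id = '<QUERY_ID>';
```
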
## Step 3: Analyze the dbt Model

### 3.1 Locate the Model File
dbt models follow the naming convention: `<schema>__<table_name>.sql`
- Example: `NEAR.SILVER.TABLE_ABC` → `models/silver/../silver__table_abc.sql`

### 3.2 Review Model Configuration
Examine the model's dbt configuration (a sketch of a typical config block follows this list):
- Materialization strategy (`incremental`, `table`, `view`)
- Unique key definition
- Incremental strategy and predicates
- Any custom configurations

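For reference, an incremental model typically opens with a config block along these lines; the model name, key, predicate, and tags below are illustrative placeholders, not taken from an actual file in this repository:

```sql
-- Hypothetical header of models/silver/silver__table_abc.sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'table_abc_id',
    incremental_predicates = ["DBT_INTERNAL_DEST.block_timestamp::DATE >= current_date - 7"],
    tags = ['scheduled_non_core']
) }}
```
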
### 3.3 Understand the Logic
- Read through the SQL logic
- Identify CTEs and their purposes
- Understand data transformations
- Note any complex joins or aggregations

### 3.4 Identify Dependencies
- Check `ref()` calls for upstream dependencies
- Verify `source()` references
- Map the data lineage

## Step 4: Reproduce and Diagnose the Issue

### 4.1 Reconstruct the Failed Query
Based on the dbt model logic, recreate the SQL that would generate the temporary table or final result:

```shell
snow sql -q "
-- Example: Recreate the logic that caused the failure
WITH [CTEs from the model]
SELECT
    [fields from the model],
    [unique_key_logic] AS surrogate_key,
    COUNT(*) AS duplicate_count
FROM [final CTE]
GROUP BY [all fields except count]
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
"
```

### 4.2 Identify Root Cause
Common issues to investigate:
- **Duplicate Key Violations**: Multiple rows generating the same unique key
- **Data Type Issues**: Incompatible data types in joins or operations
- **Null Value Problems**: Unexpected nulls in required fields
- **Logic Errors**: Incorrect business logic producing invalid results
- **Upstream Data Issues**: Problems in source data or dependencies

### 4.3 Analyze Upstream Dependencies
If the issue appears to be data-related (a sample upstream check follows this list):
- Check upstream models for recent changes
- Verify source data quality
- Look for schema changes in dependencies

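A quick way to sanity-check an upstream dependency is to query it directly for recent duplicates. The table and column names below are placeholders; substitute the actual upstream model and key columns identified via the `ref()`/`source()` calls in Step 3.4:

```sql
-- Hypothetical upstream sanity check; substitute the real upstream table and key columns
SELECT
    receipt_id,
    COUNT(*) AS row_count
FROM NEAR.silver.upstream_model
WHERE modified_timestamp >= DATEADD(day, -1, CURRENT_TIMESTAMP())
GROUP BY receipt_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC
LIMIT 20;
```
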
## Step 5: Develop Proposed Solution

### 5.1 Design a Proposed Fix
Based on the root cause analysis (a surrogate key sketch follows this list):
- **For Unique Key Issues**: Enhance the surrogate key with additional fields
- **For Data Quality Issues**: Add validation or filtering logic
- **For Logic Errors**: Correct the business logic
- **For Schema Issues**: Update model to handle schema changes

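For the unique key case, the usual fix is to widen the surrogate key so that legitimately distinct rows no longer collide. A minimal sketch, assuming the project uses `dbt_utils.generate_surrogate_key` and that `receipt_id`, `action_index`, and `log_index` (illustrative names) are the distinguishing columns:

```sql
-- Before (hypothetical): one action emitting multiple log entries produces duplicate keys
-- {{ dbt_utils.generate_surrogate_key(['receipt_id', 'action_index']) }} AS table_abc_id

-- After: include the column that distinguishes the duplicated rows
SELECT
    {{ dbt_utils.generate_surrogate_key(['receipt_id', 'action_index', 'log_index']) }} AS table_abc_id,
    *
FROM final
```

If the key changes, update the model's `unique_key` config to match and plan a full refresh so existing rows are rekeyed.
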
### 5.2 Investigation Log
Document the investigation process including:

**Error Summary:**
- Model: `<model_name>`
- Error Type: `<error_category>`
- Timestamp: `<when_occurred>`

**Root Cause:**
- Brief description of what caused the failure
- Specific technical details (e.g., "Multiple log entries from single action generating duplicate surrogate keys")

**Key Queries Used:**
```sql
-- Only include queries that revealed important information
-- Query 1: Found the failed merge operation
SELECT query_id, error_message FROM snowflake.account_usage.query_history WHERE...

-- Query 2: Identified duplicate records
SELECT receipt_id, action_index, COUNT(*) FROM... GROUP BY... HAVING COUNT(*) > 1
```

**Solution Implemented:**
- Description of the fix applied
- Files modified
- Rationale for the approach chosen

## Best Practices

### Investigation Approach
1. **Be Systematic**: Follow the steps in order to avoid missing important information
2. **Be Verbose**: Document your thought process and reasoning during investigation
3. **Focus on Root Cause**: Don't just fix symptoms; understand why the issue occurred
4. **Test Thoroughly**: Ensure the fix works and doesn't introduce new issues

### Query Guidelines
- Use time-based filters to narrow search scope
- Start with broad searches and narrow down
- Save successful queries for documentation
- Use appropriate LIMIT clauses to avoid overwhelming results

### Documentation Standards
- Only log queries that provided meaningful insights
- Include enough context for another analyst to understand and replicate
- Explain the reasoning behind the solution chosen
- Reference any external resources or documentation used

## Common Error Patterns and Solutions

### Duplicate Key Violations
- **Symptom**: "Duplicate row detected during DML action"
- **Investigation**: Find the duplicate surrogate key values and trace to source
- **Solution**: Enhance unique key with additional distinguishing fields

### Data Type Mismatches
- **Symptom**: "cannot be cast to..." errors
- **Investigation**: Check upstream schema changes or data evolution
- **Solution**: Add explicit casting or handle data type evolution (see the casting sketch after this list)

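In Snowflake, `TRY_CAST` is the usual defensive option: it returns NULL instead of erroring when a value cannot be converted. The column and model names below are illustrative:

```sql
-- Hypothetical defensive casts for a model hit by upstream type drift
SELECT
    TRY_CAST(amount_raw AS NUMBER(38, 0)) AS amount,
    COALESCE(TRY_CAST(decimals AS INTEGER), 0) AS decimals
FROM {{ ref('silver__upstream_model') }}
```
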
### Permission Errors
- **Symptom**: "not authorized" or permission denied
- **Investigation**: Check if tables/schemas exist and permissions are correct
- **Solution**: Coordinate with infrastructure team for access

### Incremental Logic Issues
- **Symptom**: Missing or duplicate data in incremental models
- **Investigation**: Check incremental predicates and merge logic
- **Solution**: Adjust incremental strategy or add proper filtering (a typical incremental filter is sketched after this list)

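The "proper filtering" above typically lives in the model's incremental block. A minimal sketch using standard dbt constructs (`is_incremental()` and `{{ this }}`); the timestamp column is illustrative:

```sql
{% if is_incremental() %}
-- Only process rows newer than what the target table already holds
WHERE modified_timestamp >= (
    SELECT MAX(modified_timestamp)
    FROM {{ this }}
)
{% endif %}
```

If late-arriving updates matter, a common variation subtracts a small lookback window from the maximum timestamp instead of filtering on it exactly.
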
This process ensures thorough investigation and proper resolution of dbt job failures while maintaining a clear audit trail for future reference.

You work within the Flipside Crypto dbt environment with Snowflake as the data warehouse. You understand incremental models, testing frameworks, and the bronze/silver/gold data architecture. Always consider the impact on downstream consumers and data freshness requirements when recommending solutions.

.github/workflows/dbt_run_adhoc.yml (vendored, 41 lines changed)
@@ -65,9 +65,42 @@ jobs:
        run: |
          ${{ inputs.dbt_command }}

-  notify-failure:
+  failure-assessor:
    needs: [run_dbt_jobs]
    if: failure()
-    uses: ./.github/workflows/slack_notify.yml
-    secrets:
-      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+    runs-on: ubuntu-latest
+    environment: workflow_prod
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '18'
+
+      - name: Install Claude Code CLI
+        run: npm install -g @anthropic-ai/claude-code
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install Python dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+
+      - name: Run Failure Assessor
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GITHUB_REPOSITORY: ${{ github.repository }}
+          GITHUB_WORKFLOW: ${{ github.workflow }}
+          GITHUB_SERVER_URL: ${{ github.server_url }}
+        run: |
+          python python/failure_assessor.py \
+            --run-id "${{ github.run_id }}" \
+            --agent-file ".claude/agents/workflow-failure-investigator.md" \
+            --repo "${{ github.repository }}"

.github/workflows/dbt_test.yml (vendored, 2 lines changed)
@@ -48,4 +48,4 @@ jobs:

      - name: Log test results
        run: |
-          python python_scripts/test_alert/dbt_test_alert.py
+          python python/test_alert/dbt_test_alert.py

.gitignore (vendored, 3 lines changed)
@@ -24,3 +24,6 @@ local*
.cursorignore
.cursorrules
.cursor/mcp.json
+CLAUDE.local.md
+
+__pycache__/

python/failure_assessor.py (new file, 400 lines)
@@ -0,0 +1,400 @@
#!/usr/bin/env python3
"""
Failure Assessor - POC Script for analyzing GitHub Actions workflow failures

This script analyzes failed GitHub Actions workflows using the Claude Code SDK
and posts enhanced Slack notifications with AI-generated analysis.

Usage:
    python failure_assessor.py --run-id GITHUB_RUN_ID --agent-file AGENT_FILE_PATH
"""

import argparse
import json
import logging
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple

# Set up detailed logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

try:
    from claude_code_sdk import query, ClaudeCodeOptions
    import asyncio
    logger.info("Successfully imported Claude Code SDK")
except ImportError as e:
    logger.error(f"Failed to import claude-code-sdk: {e}")
    print("ERROR: claude-code-sdk package not installed. Run: pip install claude-code-sdk")
    sys.exit(1)


class FailureAssessor:
    def __init__(self, run_id: str, agent_file: str, repo: str = None):
        logger.info(f"Initializing FailureAssessor with run_id: {run_id}")
        logger.info(f"Agent file: {agent_file}")
        logger.info(f"Repository: {repo}")

        self.run_id = run_id
        self.agent_file = agent_file
        # Use GitHub environment variable first, then CLI arg, then require explicit specification
        self.repo = os.environ.get("GITHUB_REPOSITORY") or repo
        if not self.repo:
            raise ValueError("Repository must be specified via --repo argument or GITHUB_REPOSITORY environment variable")

        logger.info(f"Using repository: {self.repo}")
        self.github_token = os.environ.get("GITHUB_TOKEN")
        self.slack_webhook = os.environ.get("SLACK_WEBHOOK_URL")

        logger.debug(f"GitHub token present: {'Yes' if self.github_token else 'No'}")
        logger.debug(f"Slack webhook present: {'Yes' if self.slack_webhook else 'No'}")

        if not self.github_token:
            logger.error("GITHUB_TOKEN environment variable is missing")
            raise ValueError("GITHUB_TOKEN environment variable is required")
        if not self.slack_webhook:
            logger.error("SLACK_WEBHOOK_URL environment variable is missing")
            raise ValueError("SLACK_WEBHOOK_URL environment variable is required")

    def run_command(self, cmd: str, capture_output: bool = True) -> Tuple[int, str, str]:
        """Run a shell command and return exit code, stdout, stderr"""
        logger.debug(f"Running command: {cmd}")
        try:
            result = subprocess.run(
                cmd,
                shell=True,
                capture_output=capture_output,
                text=True,
                timeout=60
            )
            logger.debug(f"Command exit code: {result.returncode}")
            logger.debug(f"Command stdout length: {len(result.stdout)} chars")
            logger.debug(f"Command stderr length: {len(result.stderr)} chars")
            return result.returncode, result.stdout, result.stderr
        except subprocess.TimeoutExpired:
            logger.error("Command timed out")
            return 1, "", "Command timed out"
        except Exception as e:
            logger.error(f"Command failed with exception: {e}")
            return 1, "", str(e)

    def gather_run_metadata(self) -> Dict:
        """Gather metadata about the failed run using gh CLI"""
        logger.info(f"Gathering metadata for run ID: {self.run_id}")

        # Get run details in JSON format
        json_fields = "name,status,conclusion,createdAt,headSha,jobs,workflowName,url"
        logger.debug(f"Requesting JSON fields: {json_fields}")
        exit_code, stdout, stderr = self.run_command(f"gh run view {self.run_id} --repo {self.repo} --json {json_fields}")

        if exit_code != 0:
            logger.error(f"Failed to get run details: {stderr}")
            print(f"ERROR: Failed to get run details: {stderr}")
            return {}

        logger.debug(f"Raw JSON response: {stdout[:500]}...")
        try:
            run_data = json.loads(stdout)
            logger.debug(f"Parsed JSON keys: {list(run_data.keys()) if run_data else 'None'}")
            # Use environment variables from GitHub Actions when available
            github_repo = os.environ.get("GITHUB_REPOSITORY", self.repo)
            github_workflow = os.environ.get("GITHUB_WORKFLOW", run_data.get("workflowName", "Unknown"))
            github_server_url = os.environ.get("GITHUB_SERVER_URL", "https://github.com")

            metadata = {
                "run_id": self.run_id,
                "workflow_name": github_workflow,
                "repository": github_repo.split('/')[-1] if '/' in github_repo else github_repo,
                "full_repository": github_repo,
                "status": run_data.get("status", "Unknown"),
                "conclusion": run_data.get("conclusion", "Unknown"),
                "created_at": run_data.get("createdAt", "Unknown"),
                "head_commit": "Unknown",  # Not available in basic fields
                "head_sha": run_data.get("headSha", "Unknown")[:8] if run_data.get("headSha") else "Unknown",
                "jobs": run_data.get("jobs", []),
                "url": run_data.get("url", f"{github_server_url}/{github_repo}/actions/runs/{self.run_id}")
            }
            logger.info(f"Successfully gathered metadata: workflow={metadata['workflow_name']}, status={metadata['status']}")
            return metadata
        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse run JSON: {e}")
            logger.debug(f"Raw stdout that failed to parse: {stdout}")
            print(f"ERROR: Failed to parse run JSON: {stdout}")
            return {}

    def fetch_run_logs(self, max_lines: int = 3000) -> str:
        """Fetch and truncate run logs to control token usage"""
        logger.info(f"Fetching run logs for run ID: {self.run_id}")
        logger.debug(f"Max lines limit: {max_lines}")

        exit_code, stdout, stderr = self.run_command(f"gh run view {self.run_id} --repo {self.repo} --log")

        if exit_code != 0:
            logger.error(f"Failed to get run logs: {stderr}")
            print(f"ERROR: Failed to get run logs: {stderr}")
            return f"Failed to fetch logs: {stderr}"

        logger.debug(f"Raw logs length: {len(stdout)} characters")

        # Truncate logs to last N lines to control token usage
        lines = stdout.split('\n')
        logger.debug(f"Total log lines: {len(lines)}")

        if len(lines) > max_lines:
            logger.info(f"Truncating logs from {len(lines)} to {max_lines} lines")
            truncated_logs = '\n'.join(lines[-max_lines:])
            result = f"[LOG TRUNCATED - showing last {max_lines} lines]\n\n{truncated_logs}"
        else:
            logger.info(f"Using all {len(lines)} log lines (under limit)")
            result = stdout

        logger.debug(f"Final log output length: {len(result)} characters")
        return result

    def load_agent_instructions(self) -> str:
        """Load the agent instructions from the specified file"""
        logger.info(f"Loading agent instructions from: {self.agent_file}")
        try:
            with open(self.agent_file, 'r') as f:
                content = f.read()
            logger.debug(f"Agent file loaded successfully, length: {len(content)} characters")
            return content
        except FileNotFoundError:
            logger.error(f"Agent file not found: {self.agent_file}")
            print(f"ERROR: Agent file not found: {self.agent_file}")
            sys.exit(1)
        except Exception as e:
            logger.error(f"Failed to read agent file: {e}")
            print(f"ERROR: Failed to read agent file: {e}")
            sys.exit(1)

    def build_analysis_prompt(self, metadata: Dict, logs: str) -> str:
        """Build the prompt for the Claude analysis"""
        agent_instructions = self.load_agent_instructions()

        # Extract key information for the prompt
        workflow_name = metadata.get("workflow_name", "Unknown")
        repository = metadata.get("repository", "Unknown")
        head_commit = metadata.get("head_commit", "Unknown")
        head_sha = metadata.get("head_sha", "Unknown")
        created_at = metadata.get("created_at", "Unknown")

        prompt = f"""
You are analyzing a failed GitHub Actions workflow. Please provide a concise analysis for a Slack notification.

## Context
- **Repository**: {repository}
- **Workflow**: {workflow_name}
- **Run ID**: {self.run_id}
- **Commit**: {head_sha} - {head_commit}
- **Timestamp**: {created_at}

## Agent Instructions
{agent_instructions}

## Failed Job Logs
```
{logs}
```

## Required Output Format
You MUST respond in this EXACT markdown format for Slack. Do not include any other text before or after this format:

**🔍 Failure Analysis**

**Root Cause:**
[1-2 sentence summary of what caused the failure]

**Evidence:**
```
[Key error message or relevant log excerpt - keep under 10 lines]
```

**Immediate Actions:**
• [Action item 1]
• [Action item 2]
• [Action item 3 if needed]
"""
        return prompt

    async def analyze_with_claude(self, prompt: str) -> str:
        """Send the prompt to Claude Code SDK and get analysis back"""
        logger.info("Starting analysis with Claude Code SDK...")

        try:
            # Load agent instructions and include them in the system prompt
            logger.debug("Loading agent instructions for system prompt...")
            agent_instructions = self.load_agent_instructions()

            # Create options for focused analysis - disable tools to prevent investigation workflow
            logger.debug("Creating ClaudeCodeOptions...")
            options = ClaudeCodeOptions(
                max_turns=1,
                system_prompt="""You are a GitHub Actions failure analysis assistant. You must analyze the provided logs and respond in the EXACT format requested. Do not use any tools or perform investigations - just analyze the logs provided and give a formatted response. You are NOT allowed to run commands, read files, or perform any investigations.""",
                allowed_tools=[]  # Disable all tools to prevent investigation workflow
            )
            logger.debug(f"Options created: max_turns={options.max_turns}")

            logger.info("Sending query to Claude Code SDK...")
            logger.debug(f"Prompt length: {len(prompt)} characters")

            analysis_parts = []
            message_count = 0

            async for message in query(prompt=prompt, options=options):
                message_count += 1
                logger.debug(f"Received message #{message_count}, type: {type(message).__name__}")

                if hasattr(message, 'content'):
                    logger.debug(f"Message has content with {len(message.content)} parts")
                    for i, content_block in enumerate(message.content):
                        logger.debug(f"Content block #{i}, type: {type(content_block).__name__}")
                        if hasattr(content_block, 'text'):
                            text_content = content_block.text
                            logger.debug(f"Text content length: {len(text_content)} chars")
                            analysis_parts.append(text_content)
                else:
                    logger.debug("Message has no content attribute")

            logger.info(f"Completed query iteration, received {message_count} messages")
            logger.info(f"Collected {len(analysis_parts)} analysis parts")

            if analysis_parts:
                analysis = '\n'.join(analysis_parts)
                logger.info(f"Generated analysis length: {len(analysis)} characters")
                logger.debug(f"Analysis preview: {analysis[:200]}...")
            else:
                analysis = "No analysis generated"
                logger.warning("No analysis parts collected from Claude Code SDK")

            return analysis

        except Exception as e:
            logger.error(f"Claude Code SDK analysis failed: {e}", exc_info=True)
            print(f"ERROR: Failed to analyze with Claude Code SDK: {e}")
            return f"**Analysis Error**: Failed to analyze logs with AI: {str(e)}"

    def post_to_slack(self, analysis: str, metadata: Dict):
        """Post the enhanced failure notification to Slack"""
        print("Posting enhanced notification to Slack...")

        # Create temporary file with the analysis
        with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
            f.write(analysis)
            analysis_file = f.name

        try:
            # Call the updated slack_alert.py with analysis
            cmd = f"python {Path(__file__).parent}/slack_alert.py --analysis-file {analysis_file}"

            # Set up environment for slack_alert.py (preserve existing GitHub env vars)
            env = os.environ.copy()
            env.update({
                'SLACK_WEBHOOK_URL': self.slack_webhook,
                'GITHUB_REPOSITORY': metadata.get('full_repository', f"FlipsideCrypto/{metadata.get('repository', 'unknown')}"),
                'GITHUB_WORKFLOW': metadata.get('workflow_name', 'Unknown'),
                'GITHUB_RUN_ID': str(self.run_id),
                'GITHUB_SERVER_URL': os.environ.get('GITHUB_SERVER_URL', 'https://github.com')
            })

            result = subprocess.run(cmd, shell=True, env=env, capture_output=True, text=True)

            if result.returncode != 0:
                print(f"ERROR: Failed to send Slack notification: {result.stderr}")
                print(f"STDOUT: {result.stdout}")
                sys.exit(1)
            else:
                print("Successfully sent enhanced Slack notification")

        finally:
            # Clean up temporary file
            try:
                os.unlink(analysis_file)
            except:
                pass

    async def run(self):
        """Main execution flow"""
        logger.info("🔍 Starting failure assessment...")
        print("🔍 Starting failure assessment...")

        # Step 1: Gather run metadata
        logger.info("=== STEP 1: Gathering run metadata ===")
        metadata = self.gather_run_metadata()
        if not metadata:
            logger.error("Failed to gather run metadata, exiting")
            print("ERROR: Failed to gather run metadata")
            sys.exit(1)
        logger.info("✅ Step 1 complete")

        # Step 2: Fetch logs
        logger.info("=== STEP 2: Fetching run logs ===")
        logs = self.fetch_run_logs()
        if not logs or logs.strip() == "":
            logger.warning("No logs found - logs may have been purged or run is too old")
            print("WARNING: No logs found - using metadata-only analysis")
            logs = f"No logs available for run {self.run_id}. This may be because:\n- Logs have been purged due to retention policy\n- Run is too old\n- Logs were not generated\n\nMetadata indicates: {metadata.get('conclusion', 'Unknown')} conclusion for workflow '{metadata.get('workflow_name', 'Unknown')}'"
        logger.info("✅ Step 2 complete")

        # Step 3: Build prompt and analyze with Claude
        logger.info("=== STEP 3: Analyzing with Claude Code SDK ===")
        prompt = self.build_analysis_prompt(metadata, logs)
        logger.debug(f"Built prompt length: {len(prompt)} characters")

        analysis = await self.analyze_with_claude(prompt)
        logger.info("✅ Step 3 complete")

        # Step 4: Post to Slack
        logger.info("=== STEP 4: Posting to Slack ===")
        self.post_to_slack(analysis, metadata)
        logger.info("✅ Step 4 complete")

        logger.info("✅ Failure assessment complete!")
        print("✅ Failure assessment complete!")


async def main_async():
    parser = argparse.ArgumentParser(
        description="Analyze GitHub Actions workflow failures with Claude Code SDK"
    )
    parser.add_argument(
        "--run-id",
        required=True,
        help="GitHub run ID to analyze"
    )
    parser.add_argument(
        "--agent-file",
        required=True,
        help="Path to Claude agent instructions file"
    )
    parser.add_argument(
        "--repo",
        help="GitHub repository in format 'owner/repo' (uses GITHUB_REPOSITORY env var if not specified)"
    )

    args = parser.parse_args()

    try:
        assessor = FailureAssessor(args.run_id, args.agent_file, args.repo)
        await assessor.run()
    except Exception as e:
        print(f"ERROR: {e}")
        sys.exit(1)


def main():
    asyncio.run(main_async())


if __name__ == "__main__":
    main()

python/slack_alert.py
@@ -1,9 +1,10 @@
import requests
import os
import sys
+import argparse

-def create_message():
-    """Creates a simple failure notification message with repo, workflow name, and URL"""
+def create_message(analysis_text=None):
+    """Creates a failure notification message with optional AI analysis"""

    # Get GitHub environment variables
    repository = os.environ.get('GITHUB_REPOSITORY', 'Unknown repository')

@@ -15,42 +16,48 @@ def create_message():
    # Build the workflow URL
    workflow_url = f"{server_url}/{repository}/actions/runs/{run_id}"

+    # Base attachment structure
+    attachment = {
+        "color": "#f44336",  # Red color for failures
+        "fields": [
+            {
+                "title": "Repository",
+                "value": repository,
+                "short": True
+            },
+            {
+                "title": "Workflow",
+                "value": workflow_name,
+                "short": True
+            }
+        ],
+        "actions": [
+            {
+                "type": "button",
+                "text": "View Workflow Run",
+                "style": "primary",
+                "url": workflow_url
+            }
+        ],
+        "footer": "GitHub Actions"
+    }
+
+    # Add AI analysis if provided
+    if analysis_text:
+        attachment["text"] = analysis_text
+        attachment["mrkdwn_in"] = ["text"]  # Enable markdown formatting
+
    message_body = {
        "text": f"Failure in {repo_name}",
-        "attachments": [
-            {
-                "color": "#f44336",  # Red color for failures
-                "fields": [
-                    {
-                        "title": "Repository",
-                        "value": repository,
-                        "short": True
-                    },
-                    {
-                        "title": "Workflow",
-                        "value": workflow_name,
-                        "short": True
-                    }
-                ],
-                "actions": [
-                    {
-                        "type": "button",
-                        "text": "View Workflow Run",
-                        "style": "primary",
-                        "url": workflow_url
-                    }
-                ],
-                "footer": "GitHub Actions"
-            }
-        ]
+        "attachments": [attachment]
    }

    return message_body

-def send_alert(webhook_url):
+def send_alert(webhook_url, analysis_text=None):
    """Sends a failure notification to Slack"""

-    message = create_message()
+    message = create_message(analysis_text)

    try:
        response = requests.post(webhook_url, json=message)

@@ -64,11 +71,39 @@ def send_alert(webhook_url):
        print(f"Error sending Slack notification: {str(e)}")
        sys.exit(1)

-if __name__ == '__main__':
+def main():
+    parser = argparse.ArgumentParser(description="Send Slack failure notification")
+    parser.add_argument(
+        "--analysis-file",
+        help="Path to file containing AI analysis text"
+    )
+    parser.add_argument(
+        "--analysis-text",
+        help="Direct analysis text to include"
+    )
+
+    args = parser.parse_args()
+
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")

    if not webhook_url:
        print("ERROR: SLACK_WEBHOOK_URL environment variable is required")
        sys.exit(1)

-    send_alert(webhook_url)
+    analysis_text = None
+
+    # Load analysis from file if provided
+    if args.analysis_file:
+        try:
+            with open(args.analysis_file, 'r') as f:
+                analysis_text = f.read()
+        except Exception as e:
+            print(f"WARNING: Failed to read analysis file: {e}")
+
+    # Use direct text if provided (overrides file)
+    if args.analysis_text:
+        analysis_text = args.analysis_text
+
+    send_alert(webhook_url, analysis_text)
+
+if __name__ == '__main__':
+    main()

@@ -1,22 +0,0 @@
# Near DBT Project - Python Scripts

## Setup

1. Install Python 3 in your machine https://www.python.org/downloads/.

2. Open your terminal/command prompt and go to this directory.

3. Run the following command:

> `pip install -r requirements.txt`

4. Run the Python script by typing:

> `python <script name here>.py`

(e.g. `python token_labels_retriever.py`)

## Troubleshooting

1. Check if you're in the right directory
2. If the commands are not recognized, you need to add Python in your environment variables

@@ -1,2 +0,0 @@
pandas==1.4.2
requests==2.27.1

@@ -1,12 +0,0 @@
import pandas as pd
import requests

def table_retriever():
    token_labels = requests.get('https://api.stats.ref.finance/api/last-tvl').json()
    df = pd.json_normalize(token_labels)[['token_account_id', 'ftInfo.name', 'ftInfo.symbol', 'ftInfo.decimals']]
    df.columns = ['token_contract', 'token', 'symbol', 'decimals']
    return df

if __name__ == '__main__':
    df = table_retriever()
    df.to_csv('../data/seeds__token_labels.csv', index=False)

requirements.txt
@@ -1,2 +1,4 @@
dbt-snowflake>=1.7,<1.8
-protobuf==4.25.3
+protobuf==4.25.3
+claude-code-sdk
+requests