Skip to content

godaralokesh29/PulseGuard-AI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PulseGuard - Project Overview

🎯 What This Project Does

RAG of Fire is an AI-powered Incident Response System that automatically analyzes production incidents and recommends mitigation actions using historical data and machine learning.

Real-World Problem It Solves

When a production system experiences an incident (like a database timeout, memory leak, or API rate limiting), your on-call engineer needs to:

  1. Understand what's happening (diagnosis)
  2. Find similar past incidents (search historical knowledge)
  3. Know what worked before (recommendations)
  4. Act quickly to minimize downtime

RAG of Fire automates all of this.


🏗️ Architecture Overview

Three Main Components

1. Backend (Python + FastAPI)

  • REST API that handles incident analysis
  • Connected to a vector database (ChromaDB) for semantic search
  • Has 10 pre-loaded historical incidents with real metrics
  • Can match new incidents to past ones and recommend solutions

2. Frontend (Next.js + React)

  • Dashboard to visualize incidents
  • Forms to search historical data
  • Displays recommendations and confidence scores

3. Database

  • PostgreSQL: Stores structured incident data
  • ChromaDB: Vector database for semantic search (finds similar incidents)

📊 What "RAG" Means

RAG = Retrieval-Augmented Generation

  • Retrieval: Find similar past incidents from vector database
  • Augmented: Combine with current incident data
  • Generation: Use LLM to create personalized recommendations

Example:

Current Incident: Database timeout, 450% spike
                    ↓
Search vector DB for similar incidents
                    ↓
Found: INC-2025-001 (database timeout, 450% spike, fixed by throttling 30%)
                    ↓
LLM generates: "Recommend: Throttle database connections to 30%"
                    ↓
Confidence: 95% (because we have exact match in history)

🧪 What The Test Cases Do

The test_system.py file runs 7 tests simulating real production incidents:

Test 1: Database Timeout

error_type: "database_timeout"
spike_percentage: 450%  # Database response time increased 4.5x
Expected: "Throttle connections to 30%"

What it tests: Can the system detect a database timeout and recommend connection throttling?

Test 2: Kafka Consumer Lag

error_type: "kafka_consumer_lag"
spike_percentage: 320%  # Message queue backed up
Expected: "Throttle consumers to 45%"

What it tests: Can the system detect message queue buildup and recommend consumer throttling?

Test 3: Memory Leak

error_type: "memory_leak"
spike_percentage: 280%  # Memory usage increased 2.8x
Expected: "Throttle traffic to 20%"

What it tests: Can the system detect memory pressure and recommend traffic reduction?

Test 4: Stream Anomaly

Reports streaming anomalies in real-time

What it tests: Can the system accept live anomaly data?

Test 5-6: Search & Statistics

Search historical RCA documents
Get statistics about stored incidents

What it tests: Can the system retrieve and summarize historical data?


🔗 How Everything Integrates

┌─────────────────┐
│  Frontend       │
│  (Next.js)      │
└────────┬────────┘
         │ HTTP requests
         ↓
┌─────────────────────────────────────┐
│  Backend (FastAPI)                  │
│  - Decision Engine                  │
│  - Document Search                  │
│  - Streaming Anomalies              │
└────────┬────────────────────────────┘
         │
    ┌────┴─────┬────────────────┐
    ↓          ↓                ↓
┌────────┐ ┌──────────┐ ┌────────────┐
│PostgreSQL│ ChromaDB  │ LLM Service │
│(DB Data) │(Vector   │ (OpenAI or  │
│          │Search)   │  Mock)      │
└──────────┴──────────┴─────────────┘

📝 Integration Points

What needs to be integrated:

  1. LLM Service

    • Currently using a mock LLM (deterministic responses)
    • Can integrate real OpenAI/Anthropic API
    • Location: backend/services/llm_engine.py
  2. Vector Database

    • Currently using ChromaDB (in-memory, great for demo)
    • Can integrate Pinecone, Weaviate, etc.
    • Location: backend/services/vector_db.py
  3. Real Database

    • PostgreSQL models already defined
    • Need to connect to real PostgreSQL instance
    • Location: backend/database/db.py
  4. Streaming Pipeline

    • Currently using asyncio.Queue (mock Kafka/Flink)
    • Can integrate real Apache Kafka
    • Location: backend/services/stream_processor.py
  5. Frontend Features

    • Basic Next.js UI created
    • Can add real-time dashboards, advanced filtering
    • Location: app/ and components/

✅ Current Status

  • ✅ Core logic working
  • ✅ All 7 tests passing
  • ✅ API endpoints functioning
  • ✅ Backend starting successfully

🔌 WebSocket Real-Time Streaming

What is WebSocket?

A persistent connection between browser and server that allows LIVE, TWO-WAY communication.

How It Works in This Project

Browser                                Backend
   |                                     |
   |-------- Connect via WebSocket ------|
   |                                     |
   |<---- Real-time Decision Updates ----|
   |<---- Anomaly Alerts ------------------|
   |<---- System Events ------------------|

Example Flow: Real-Time Incident Notification

// Frontend (JavaScript)
const ws = new WebSocket('ws://localhost:8000/ws/incidents');

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'decision') {
    // Update dashboard in real-time
    console.log('New Decision:', message.data);
    // Display: "Database Timeout - Recommend: Throttle to 30%"
  }
  
  if (message.type === 'anomaly') {
    // Show alert in UI
    showAlert('Anomaly Detected: ' + message.data.error_type);
  }
};

// Send ping to keep connection alive
setInterval(() => {
  ws.send(JSON.stringify({ type: 'ping' }));
}, 30000);

What Gets Sent Over WebSocket

# Backend generates a decision
decision = {
    "id": "dec_12345",
    "matched_incident": "INC-2025-001",
    "symptom": "Database response time increased 450%",
    "recommended_action": "Throttle connections to 30%",
    "confidence_score": 0.95,
    "latency_ms": 245
}

# Broadcast to ALL connected browsers
await notification_service.ws_manager.broadcast({
    "type": "decision",
    "data": decision
})

Result: 100+ web browsers get the same notification instantly! ⚡


💬 Slack Integration

image

What is Slack?

A team messaging platform. Slack integration means the system sends alerts to your Slack channel automatically.

How It Works

Backend detects incident
         ↓
Generates decision/recommendation
         ↓
Sends formatted message to Slack webhook
         ↓
Slack channel receives alert
         ↓
Team members see: "🚨 Database Timeout - Throttle to 30%"

Setup Slack (Step-by-Step)

Step 1: Go to https://api.slack.com/apps and create an app

  • Name: "Incident Response Bot"

Step 2: Enable "Incoming Webhooks"

  • Get the webhook URL (looks like):
https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX

Step 3: Add to .env file

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX

Step 4: Restart backend

uvicorn backend.main:app --reload

Slack Message Example

When an incident is detected, Slack receives:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🚨 Incident Decision: INC-2025-001
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Symptom:
Database response time increased 450%

Confidence:
95%

Recommended Action:
Throttle connections to 30%

⏱️ 245ms | ID: dec_12345

Code: How Slack Alert Gets Sent

# backend/services/notification.py

class SlackNotifier:
    async def send_decision_alert(self, decision):
        # Build a nice formatted message
        payload = {
            "text": "🚨 Incident Mitigation Decision Generated",
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"Incident Decision: {decision.matched_incident}"
                    }
                },
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": f"*Recommended Action:*\n```{decision.recommended_action}```"
                    }
                }
            ]
        }
        
        # Send HTTP POST to Slack webhook
        async with aiohttp.ClientSession() as session:
            await session.post(self.webhook_url, json=payload)

🔄 Complete Workflow Example

Real Production Incident Flow

Scenario: Production database connection pool exhausted

1. MONITORING SYSTEM
   └─ Detects: Database response time 450% spike
   └─ Sends to: POST /api/v1/decisions/stream-anomaly
   
2. BACKEND (Decision Engine)
   ├─ Searches vector DB for similar incidents
   ├─ Finds: INC-2025-001 (exact match!)
   ├─ Generates: "Throttle connections to 30%"
   ├─ Calculates: Confidence 95% (based on exact match)
   └─ Creates Decision object
   
3. NOTIFICATION SERVICE
   ├─ WebSocket Broadcast:
   │  └─ Sends to 42 browsers viewing dashboard
   │  └─ Each browser shows alert immediately
   │
   └─ Slack Notification:
      └─ Sends formatted alert to #incidents channel
      └─ On-call engineer sees alert in Slack
   
4. TEAM RESPONSE
   ├─ Engineer reads: "Throttle to 30%"
   ├─ Executes: kubectl patch deployment...
   ├─ Database recovers: Response time back to normal
   └─ Incident resolved ✅

The Code Path

# Step 1: Incoming anomaly
POST /api/v1/decisions/stream-anomaly {
  "error_type": "database_timeout",
  "spike_percentage": 450
}

# Step 2: Decision generated (backend/routes/decisions.py)
decision = await engine.generate_decision(...)

# Step 3: Notifications sent (backend/services/notification.py)
await notification_service.notify_decision(
    decision=decision,
    channels=["websocket", "slack"]  # Send to BOTH
)

# Step 4: Results
- 42 browsers: See live alert (WebSocket)
- Slack #incidents channel: Formatted message with recommendation

🎯 Real Connections vs Mock

Currently you have:

Component Status What It Does
WebSocket ✅ Working Streams live incidents to dashboards
Slack ✅ Ready to connect When configured, sends alerts to Slack
LLM ✅ Mock (can be real) Generates recommendations
Vector DB ✅ Working Stores & searches 10 historical incidents
PostgreSQL ⏳ Optional Can store permanent incident records

🚀 Next Steps

  1. Verify backend is running: uvicorn backend.main:app --reload
  2. Run tests: python test_system.py
  3. Check API: Visit http://localhost:8000/docs
  4. Optional - Connect Slack:
    • Get webhook URL from Slack API
    • Add to .env
    • Restart backend
    • Try incident analysis - alert will appear in Slack!
  5. Develop frontend or Integrate with real services

About

AI-powered Incident Response System that automatically analyzes production incidents and recommends mitigation actions using historical data and machine learning.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors