PulseGuard - Project Overview

🎯 What This Project Does

PulseGuard is an AI-powered incident response system that automatically analyzes production incidents and recommends mitigation actions using historical data and machine learning.

Real-World Problem It Solves

When a production system experiences an incident (like a database timeout, memory leak, or API rate limiting), your on-call engineer needs to:

Understand what's happening (diagnosis)
Find similar past incidents (search historical knowledge)
Know what worked before (recommendations)
Act quickly to minimize downtime

PulseGuard automates all of this.

🏗️ Architecture Overview

Three Main Components

1. Backend (Python + FastAPI)

REST API that handles incident analysis
Connected to a vector database (ChromaDB) for semantic search
Has 10 pre-loaded historical incidents with real metrics
Can match new incidents to past ones and recommend solutions

2. Frontend (Next.js + React)

Dashboard to visualize incidents
Forms to search historical data
Displays recommendations and confidence scores

3. Database

PostgreSQL: Stores structured incident data
ChromaDB: Vector database for semantic search (finds similar incidents)

📊 What "RAG" Means

RAG = Retrieval-Augmented Generation

Retrieval: Find similar past incidents in the vector database
Augmented: Combine them with the current incident data
Generation: Use an LLM to create a tailored recommendation

Example:

Current Incident: Database timeout, 450% spike
                    ↓
Search vector DB for similar incidents
                    ↓
Found: INC-2025-001 (database timeout, 450% spike, fixed by throttling 30%)
                    ↓
LLM generates: "Recommend: Throttle database connections to 30%"
                    ↓
Confidence: 95% (because there is an exact match in history)
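
The flow above can be sketched in a few lines of Python. This is a toy illustration, not the project's code: the `Incident` type, the word-overlap score (standing in for embedding similarity), the `INC-DEMO-002` record and `mock_llm` are all assumptions made for the example. The real system uses ChromaDB for retrieval and an LLM service for generation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    description: str
    fix: str


# Hypothetical history; the real system keeps its incidents in ChromaDB.
HISTORY = [
    Incident("INC-2025-001", "database timeout spike", "Throttle database connections to 30%"),
    Incident("INC-DEMO-002", "kafka consumer lag spike", "Throttle consumers to 45%"),
]


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(query: str) -> tuple[Incident, float]:
    """Retrieval: return the most similar past incident and its score."""
    best = max(HISTORY, key=lambda inc: similarity(query, inc.description))
    return best, similarity(query, best.description)


def augment(query: str, match: Incident) -> str:
    """Augmentation: combine the current incident with the retrieved one."""
    return f"Current: {query}\nSimilar: {match.incident_id}, fixed by: {match.fix}"


def mock_llm(prompt: str) -> str:
    """Generation: a deterministic stand-in that echoes the retrieved fix."""
    fix = prompt.rsplit("fixed by: ", 1)[1]
    return f"Recommend: {fix}"


def recommend(query: str) -> tuple[str, float]:
    """Run retrieval, augmentation and generation for one incident description."""
    match, score = retrieve(query)
    return mock_llm(augment(query, match)), score
```

Calling `recommend("database timeout spike")` returns the stored fix for INC-2025-001 together with the similarity score. In this sketch the score plays the role of the confidence value, which is a simplification.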

🧪 What The Test Cases Do

The test_system.py file runs 7 tests that simulate real production incidents. Six of them are described below.

Test 1: Database Timeout

error_type: "database_timeout"
spike_percentage: 450%  # Database response time increased 4.5x
Expected: "Throttle connections to 30%"

What it tests: Can the system detect a database timeout and recommend connection throttling?

Test 2: Kafka Consumer Lag

error_type: "kafka_consumer_lag"
spike_percentage: 320%  # Message queue backed up
Expected: "Throttle consumers to 45%"

What it tests: Can the system detect message queue buildup and recommend consumer throttling?

Test 3: Memory Leak

error_type: "memory_leak"
spike_percentage: 280%  # Memory usage increased 2.8x
Expected: "Throttle traffic to 20%"

What it tests: Can the system detect memory pressure and recommend traffic reduction?
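
Tests 1 to 3 share one shape: an error type and a spike percentage go in, a throttle recommendation comes out. A table-driven sketch makes that explicit. The `Case` type, `run_cases` and `stub_analyze` are hypothetical names for this example. The actual contents of test_system.py are not shown in this overview, and the real tests presumably call the backend rather than a stub.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    error_type: str
    spike_percentage: int
    expected: str


# The three scenarios described above, expressed as data.
CASES = [
    Case("database_timeout", 450, "Throttle connections to 30%"),
    Case("kafka_consumer_lag", 320, "Throttle consumers to 45%"),
    Case("memory_leak", 280, "Throttle traffic to 20%"),
]


def run_cases(analyze: Callable[[str, int], str]) -> List[str]:
    """Return the error types whose recommendation did not match."""
    return [
        case.error_type
        for case in CASES
        if analyze(case.error_type, case.spike_percentage) != case.expected
    ]


def stub_analyze(error_type: str, spike_percentage: int) -> str:
    """Hypothetical stand-in for a backend call; it only looks up a fixed table."""
    table = {case.error_type: case.expected for case in CASES}
    return table.get(error_type, "No recommendation")
```

Passing a function that calls the real API into `run_cases` would exercise the backend with the same three scenarios; an empty result list means every recommendation matched.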

Test 4: Stream Anomaly

Reports streaming anomalies in real-time

What it tests: Can the system accept live anomaly data?

Tests 5-6: Search & Statistics

Search historical RCA documents
Get statistics about stored incidents

What it tests: Can the system retrieve and summarize historical data?

🔗 How Everything Integrates

┌─────────────────┐
│  Frontend       │
│  (Next.js)      │
└────────┬────────┘
         │ HTTP requests
         ↓
┌─────────────────────────────────────┐
│  Backend (FastAPI)                  │
│  - Decision Engine                  │
│  - Document Search                  │
│  - Streaming Anomalies              │
└────────┬────────────────────────────┘
         │
    ┌────┴─────┬────────────────┐
    ↓          ↓                ↓
┌──────────┐ ┌──────────┐ ┌─────────────┐
│PostgreSQL│ │ChromaDB  │ │LLM Service  │
│(DB Data) │ │(Vector   │ │(OpenAI or   │
│          │ │Search)   │ │ Mock)       │
└──────────┘ └──────────┘ └─────────────┘

📝 Integration Points

What needs to be integrated:

LLM Service
- Currently using a mock LLM (deterministic responses)
- Can integrate real OpenAI/Anthropic API
- Location: backend/services/llm_engine.py

Vector Database
- Currently using ChromaDB (in-memory, great for demo)
- Can integrate Pinecone, Weaviate, etc.
- Location: backend/services/vector_db.py

Real Database
- PostgreSQL models already defined
- Need to connect to real PostgreSQL instance
- Location: backend/database/db.py

Streaming Pipeline
- Currently using asyncio.Queue (mock Kafka/Flink)
- Can integrate real Apache Kafka
- Location: backend/services/stream_processor.py

Frontend Features
- Basic Next.js UI created
- Can add real-time dashboards, advanced filtering
- Location: app/ and components/
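
To make the streaming mock concrete, here is a minimal sketch of how an asyncio.Queue can stand in for a Kafka topic. The function names and the `None` end-of-stream sentinel are assumptions for illustration, not the contents of stream_processor.py.

```python
import asyncio
from typing import List


async def produce(queue: asyncio.Queue, anomalies: List[dict]) -> None:
    """Publish anomalies, then a None sentinel to signal end of stream."""
    for anomaly in anomalies:
        await queue.put(anomaly)
    await queue.put(None)


async def consume(queue: asyncio.Queue) -> List[str]:
    """Read until the sentinel and collect the error types seen."""
    seen = []
    while True:
        item = await queue.get()
        if item is None:
            break
        seen.append(item["error_type"])
    return seen


async def run_demo() -> List[str]:
    # A bounded queue gives simple backpressure: put() waits when it is full.
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    anomalies = [
        {"error_type": "database_timeout", "spike_percentage": 450},
        {"error_type": "memory_leak", "spike_percentage": 280},
    ]
    _, seen = await asyncio.gather(produce(queue, anomalies), consume(queue))
    return seen
```

Running `asyncio.run(run_demo())` pushes two anomalies through the queue in order. Swapping in real Kafka would replace `produce` and `consume` with a Kafka client while the processing logic in between could stay the same.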

✅ Current Status

✅ Core logic working
✅ All 7 tests passing
✅ API endpoints functioning
✅ Backend starting successfully

🔌 WebSocket Real-Time Streaming

What is WebSocket?

A persistent connection between browser and server that allows live, two-way communication without repeated HTTP requests.

How It Works in This Project

Browser                                Backend
   |                                     |
   |-------- Connect via WebSocket ----->|
   |                                     |
   |<---- Real-time Decision Updates ----|
   |<---- Anomaly Alerts ----------------|
   |<---- System Events -----------------|

Example Flow: Real-Time Incident Notification

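// Frontend (JavaScript)
const ws = new WebSocket('ws://localhost:8000/ws/incidents');

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);

  if (message.type === 'decision') {
    // Update dashboard in real-time
    console.log('New Decision:', message.data);
    // Display: "Database Timeout - Recommend: Throttle to 30%"
  }

  if (message.type === 'anomaly') {
    // Show alert in UI (showAlert is an app-specific helper, not defined here)
    showAlert('Anomaly Detected: ' + message.data.error_type);
  }
};

// Send a ping every 30 s to keep the connection alive,
// but only while the socket is open (send() throws while still connecting)
const pingTimer = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: 'ping' }));
  }
}, 30000);

// Stop pinging once the connection closes
ws.onclose = () => clearInterval(pingTimer);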

What Gets Sent Over WebSocket

# Backend generates a decision
decision = {
    "id": "dec_12345",
    "matched_incident": "INC-2025-001",
    "symptom": "Database response time increased 450%",
    "recommended_action": "Throttle connections to 30%",
    "confidence_score": 0.95,
    "latency_ms": 245
}

# Broadcast to ALL connected browsers
await notification_service.ws_manager.broadcast({
    "type": "decision",
    "data": decision
})

Result: every connected browser receives the same notification from a single broadcast call. ⚡
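
A broadcast like the one above boils down to looping over the open connections. The sketch below assumes a hypothetical `ConnectionManager` whose connections expose an async `send_json` method (as FastAPI/Starlette WebSocket objects do). `FakeSocket` exists only so the example runs without a server, and the project's actual ws_manager may differ.

```python
import asyncio
from typing import List


class FakeSocket:
    """Test double that records messages; a closed socket raises on send."""

    def __init__(self, closed: bool = False) -> None:
        self.closed = closed
        self.received: List[dict] = []

    async def send_json(self, message: dict) -> None:
        if self.closed:
            raise ConnectionError("socket closed")
        self.received.append(message)


class ConnectionManager:
    """Tracks open connections and fans one message out to all of them."""

    def __init__(self) -> None:
        self.active: list = []

    def connect(self, socket) -> None:
        self.active.append(socket)

    async def broadcast(self, message: dict) -> int:
        """Send to every connection, drop the ones that fail, return the delivered count."""
        results = await asyncio.gather(
            *(socket.send_json(message) for socket in self.active),
            return_exceptions=True,
        )
        # Keep only connections whose send did not raise.
        self.active = [
            socket
            for socket, result in zip(self.active, results)
            if not isinstance(result, Exception)
        ]
        return len(self.active)
```

Dropping failed connections during the broadcast is one possible design choice; it keeps a single dead browser tab from breaking delivery to the rest.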

💬 Slack Integration

What is Slack?

A team messaging platform. Slack integration means the system sends alerts to your Slack channel automatically.

How It Works

Backend detects incident
         ↓
Generates decision/recommendation
         ↓
Sends formatted message to Slack webhook
         ↓
Slack channel receives alert
         ↓
Team members see: "🚨 Database Timeout - Throttle to 30%"

Setup Slack (Step-by-Step)

Step 1: Go to https://api.slack.com/apps and create an app

Name: "Incident Response Bot"

Step 2: Enable "Incoming Webhooks"

Get the webhook URL (looks like):

https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX

Step 3: Add to .env file

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX

Step 4: Restart backend

uvicorn backend.main:app --reload
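
Reading the webhook configured in step 3 might look like the sketch below. The `load_slack_webhook` helper and its validation rule are assumptions, not the project's code; only the `SLACK_WEBHOOK_URL` name and the `https://hooks.slack.com/` prefix come from the steps above. It reads the process environment, so the .env file still has to be loaded by whatever mechanism the backend uses, which this overview does not specify.

```python
import os
from typing import Mapping, Optional

SLACK_PREFIX = "https://hooks.slack.com/"


def load_slack_webhook(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the webhook URL, or None when Slack is not configured.

    Raises ValueError if the value is set but does not look like a Slack webhook.
    """
    env = os.environ if env is None else env
    url = env.get("SLACK_WEBHOOK_URL", "").strip()
    if not url:
        return None  # Slack is optional, so a missing value is not an error
    if not url.startswith(SLACK_PREFIX):
        raise ValueError("SLACK_WEBHOOK_URL does not look like a Slack webhook URL")
    return url
```

Returning `None` for a missing value matches the "ready to connect" status of Slack in this project: the backend can start without it and simply skip Slack alerts.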

Slack Message Example

When an incident is detected, Slack receives:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🚨 Incident Decision: INC-2025-001
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Symptom:
Database response time increased 450%

Confidence:
95%

Recommended Action:
Throttle connections to 30%

⏱️ 245ms | ID: dec_12345

Code: How Slack Alert Gets Sent

# backend/services/notification.py
import aiohttp


class SlackNotifier:
    def __init__(self, webhook_url: str):
        # Incoming-webhook URL (assumed to come from SLACK_WEBHOOK_URL)
        self.webhook_url = webhook_url

    async def send_decision_alert(self, decision):
        # Build a Block Kit message: a header plus the recommended action
        payload = {
            "text": "🚨 Incident Mitigation Decision Generated",
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"Incident Decision: {decision.matched_incident}"
                    }
                },
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": f"*Recommended Action:*\n```{decision.recommended_action}```"
                    }
                }
            ]
        }

        # POST the payload to the Slack webhook and raise on a non-2xx reply
        async with aiohttp.ClientSession() as session:
            async with session.post(self.webhook_url, json=payload) as response:
                response.raise_for_status()
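
The notifier above sends only the header and the recommended action, while the example message also shows the symptom, confidence, latency and ID. One way to build the fuller payload with Slack Block Kit is sketched here. `build_decision_blocks` is a hypothetical helper that takes the decision as a plain dict, and the block layout is an assumption rather than the project's code.

```python
def build_decision_blocks(decision: dict) -> dict:
    """Build a Slack Block Kit payload from a decision dict."""
    # 0.95 -> "95%"
    confidence = f"{decision['confidence_score']:.0%}"
    title = f"Incident Decision: {decision['matched_incident']}"
    return {
        # Plain-text fallback used in notifications
        "text": title,
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"🚨 {title}"},
            },
            {
                # Two short fields rendered side by side
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Symptom:*\n{decision['symptom']}"},
                    {"type": "mrkdwn", "text": f"*Confidence:*\n{confidence}"},
                ],
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Recommended Action:*\n{decision['recommended_action']}",
                },
            },
            {
                # Small footer line with latency and decision ID
                "type": "context",
                "elements": [
                    {
                        "type": "mrkdwn",
                        "text": f"⏱️ {decision['latency_ms']}ms | ID: {decision['id']}",
                    }
                ],
            },
        ],
    }
```

The returned dict can be passed as the `json=` argument of the webhook POST in place of the shorter payload.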

🔄 Complete Workflow Example

Real Production Incident Flow

Scenario: Production database connection pool exhausted

1. MONITORING SYSTEM
   └─ Detects: Database response time 450% spike
   └─ Sends to: POST /api/v1/decisions/stream-anomaly
   
2. BACKEND (Decision Engine)
   ├─ Searches vector DB for similar incidents
   ├─ Finds: INC-2025-001 (exact match!)
   ├─ Generates: "Throttle connections to 30%"
   ├─ Calculates: Confidence 95% (based on exact match)
   └─ Creates Decision object
   
3. NOTIFICATION SERVICE
   ├─ WebSocket Broadcast:
   │  └─ Sends to 42 browsers viewing dashboard
   │  └─ Each browser shows alert immediately
   │
   └─ Slack Notification:
      └─ Sends formatted alert to #incidents channel
      └─ On-call engineer sees alert in Slack
   
4. TEAM RESPONSE
   ├─ Engineer reads: "Throttle to 30%"
   ├─ Executes: kubectl patch deployment...
   ├─ Database recovers: Response time back to normal
   └─ Incident resolved ✅

The Code Path

# Step 1: Incoming anomaly
POST /api/v1/decisions/stream-anomaly {
  "error_type": "database_timeout",
  "spike_percentage": 450
}

# Step 2: Decision generated (backend/routes/decisions.py)
decision = await engine.generate_decision(...)

# Step 3: Notifications sent (backend/services/notification.py)
await notification_service.notify_decision(
    decision=decision,
    channels=["websocket", "slack"]  # Send to BOTH
)

# Step 4: Results
- 42 browsers: See live alert (WebSocket)
- Slack #incidents channel: Formatted message with recommendation
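
The `channels=[...]` argument in step 3 suggests a dispatch by channel name. A minimal sketch of that fan-out follows. It assumes hypothetical sender coroutines passed in as a dict and a design where one failing channel does not block the others; the real notify_decision may behave differently.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Dict, List

Sender = Callable[[dict], Awaitable[None]]


async def notify_decision(
    decision: dict, channels: List[str], senders: Dict[str, Sender]
) -> Dict[str, bool]:
    """Send the decision to each requested channel concurrently.

    Returns a map of channel name to success flag; unknown channels count as failed.
    """

    async def attempt(name: str) -> bool:
        sender = senders.get(name)
        if sender is None:
            return False
        try:
            await sender(decision)
            return True
        except Exception:
            # A failing channel is reported, not raised, so the others still run.
            return False

    outcomes = await asyncio.gather(*(attempt(name) for name in channels))
    return dict(zip(channels, outcomes))
```

With this shape, a Slack outage shows up as `{"websocket": True, "slack": False}` rather than an exception, and the dashboard alert still goes out.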

🎯 Real Connections vs Mock

Currently you have:

Component    Status                  What It Does
WebSocket    ✅ Working              Streams live incidents to dashboards
Slack        ✅ Ready to connect     When configured, sends alerts to Slack
LLM          ✅ Mock (can be real)   Generates recommendations
Vector DB    ✅ Working              Stores & searches 10 historical incidents
PostgreSQL   ⏳ Optional             Can store permanent incident records

🚀 Next Steps

1. Verify the backend is running: uvicorn backend.main:app --reload
2. Run the tests: python test_system.py
3. Check the API: visit http://localhost:8000/docs
4. Optional: connect Slack
   - Get a webhook URL from the Slack API site
   - Add it to .env
   - Restart the backend
   - Trigger an incident analysis; the alert should appear in Slack
5. Develop the frontend or integrate with real services
