🕷️ CrawlMind

AI-Powered Web Crawling & Research Platform

Paste a URL. Describe your research. Let AI do the rest.

CrawlMind combines Cloudflare's crawl infrastructure with AI-powered URL discovery and multi-hop research synthesis — turning any query into structured, crawled knowledge.

Getting Started · Features · Architecture · Deploy

✨ Features

Core Crawling

Smart Input — Auto-detects URLs vs. natural language; just paste or type
Cloudflare-Powered — Fast, reliable crawling via Cloudflare's Browser Rendering API
Multi-Format Output — Markdown, HTML, plaintext, or cleaned readable HTML
JS Rendering — Crawl JavaScript-heavy SPAs with headless rendering
Advanced Controls — Depth, page limits, subdomain inclusion, URL patterns, date filters
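
The Smart Input behavior can be sketched as a small classifier. This is a minimal illustration, not CrawlMind's actual detection logic: the `classifyInput` name, the `ClassifiedInput` type, and the heuristics (whitespace means a query, a hostname needs a dot) are all assumptions.

```typescript
// Hypothetical helper: decides whether user input is a URL to crawl
// or a natural-language research query. The heuristics are assumptions.
type ClassifiedInput =
  | { kind: "url"; url: string }
  | { kind: "query"; text: string };

function classifyInput(raw: string): ClassifiedInput {
  const text = raw.trim();
  // Anything empty or containing whitespace is treated as a query.
  if (text === "" || /\s/.test(text)) {
    return { kind: "query", text };
  }
  // Accept bare domains such as "example.com/docs" by assuming https.
  const candidate = /^https?:\/\//i.test(text) ? text : `https://${text}`;
  try {
    const parsed = new URL(candidate);
    // Require a dot in the hostname so single words ("python") stay queries.
    if (parsed.hostname.includes(".")) {
      return { kind: "url", url: parsed.href };
    }
  } catch {
    // Not parseable as a URL; fall through to the query branch.
  }
  return { kind: "query", text };
}
```

A heuristic like this misreads dotted single words such as `node.js` as domains, so a real implementation would likely need a stricter check or a way for the user to override the guess.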

🧠 AI Discovery (New)

AI URL Discovery — Describe what you need; Groq finds the best sources to crawl
Depth Tiers — Quick (~30s), Deep Dive (~2min), or Multi-hop Research (~5min)
Multi-Hop Research — Crawl → analyze gaps → discover follow-up sources → repeat (up to 3 rounds)
AI Synthesis — NVIDIA NIM generates a comprehensive research report from all crawled data
Parent-Child Jobs — Research jobs manage multiple sub-crawls independently, no interference with normal crawls
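
The multi-hop loop described above (crawl, analyze gaps, discover follow-up sources, repeat) can be sketched roughly as follows. The `ResearchDeps` interface and its three functions are hypothetical stand-ins for the Groq, Cloudflare, and NIM calls, and the sketch assumes the three rounds are follow-up rounds after an initial crawl, which the feature list does not spell out.

```typescript
// Hypothetical dependency signatures; the real pipeline is not shown here.
interface ResearchDeps {
  discover(query: string): Promise<string[]>; // returns candidate URLs
  crawl(url: string): Promise<string>; // returns crawled page content
  findGaps(query: string, pages: string[]): Promise<string[]>; // follow-up queries
}

// Assumption: an initial crawl plus up to 3 follow-up rounds.
const MAX_FOLLOW_UPS = 3;

async function multiHopResearch(
  query: string,
  deps: ResearchDeps
): Promise<{ pages: string[]; followUps: number }> {
  const seen = new Set<string>();
  const pages: string[] = [];
  let queries: string[] = [query];
  let followUps = 0;

  for (let round = 0; round <= MAX_FOLLOW_UPS && queries.length > 0; round++) {
    // Discover sources for every open query, skipping URLs already crawled.
    const lists = await Promise.all(queries.map((q) => deps.discover(q)));
    const fresh: string[] = [];
    for (const list of lists) {
      for (const url of list) {
        if (!seen.has(url)) {
          seen.add(url);
          fresh.push(url);
        }
      }
    }
    if (fresh.length === 0) break;

    // Sub-crawls run in parallel; a failed sub-crawl is dropped, not fatal.
    const results = await Promise.all(
      fresh.map((url) => deps.crawl(url).catch(() => null))
    );
    for (const content of results) {
      if (content !== null) pages.push(content);
    }
    followUps = round;

    if (round === MAX_FOLLOW_UPS) break;
    // An empty gap list ends the loop early.
    queries = await deps.findGaps(query, pages);
  }
  return { pages, followUps };
}
```

Passing the three operations in as dependencies keeps the loop testable with fakes and mirrors the parent-child idea: the parent job only coordinates, while each sub-crawl succeeds or fails on its own.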

Platform

AI Chat — Ask questions about crawl results with full context awareness
Soft-Delete Library — Archive, restore, and manage past crawls
Analytics Dashboard — Track crawl usage, search patterns, and AI queries
Plan-Based Limits — Tiered pricing with Stripe integration
Auth — GitHub, Google, and email sign-in via Better Auth

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        USER INPUT                               │
│         URL / Natural Language / AI Discovery Toggle             │
└─────────────┬──────────────────────────┬────────────────────────┘
              │                          │
        URL detected              AI Discovery ON
              │                          │
              ▼                          ▼
    ┌─────────────────┐     ┌──────────────────────────┐
    │  POST /api/crawl │     │   POST /api/research     │
    │  Normal Pipeline │     │   AI Research Pipeline   │
    └────────┬────────┘     └────────────┬─────────────┘
             │                           │
             ▼                           ▼
    ┌─────────────────┐     ┌──────────────────────────┐
    │ Cloudflare Crawl │     │ Groq: Discover URLs      │
    │ Single Job       │     │ (llama-3.3-70b-versatile)│
    └────────┬────────┘     └────────────┬─────────────┘
             │                           │
             │                           ▼
             │              ┌──────────────────────────┐
             │              │ Spawn Parallel Sub-Crawls │
             │              │ via Cloudflare Crawl API  │
             │              └────────────┬─────────────┘
             │                           │
             │              ┌────────────▼─────────────┐
             │              │ RESEARCH tier only:       │
             │              │ NIM Gap Analysis →        │
             │              │ Follow-up Crawls (×3)     │
             │              └────────────┬─────────────┘
             │                           │
             │                           ▼
             │              ┌──────────────────────────┐
             │              │ NIM: Synthesis Report     │
             │              │ (nemotron-super-49b)      │
             ▼              └────────────┬─────────────┘
    ┌─────────────────┐                  │
    │  Neon PostgreSQL │◄────────────────┘
    │  (Prisma ORM)    │
    └─────────────────┘
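
The top of the diagram, where input is routed to one of the two pipelines, might look like this on the client. Only the two endpoint paths come from the diagram; the `routeInput` function and the request body fields (`url`, `query`, `tier`) are assumptions made for illustration.

```typescript
type Tier = "quick" | "deep" | "research";

interface RoutedRequest {
  endpoint: "/api/crawl" | "/api/research";
  body: Record<string, string>; // hypothetical payload shape
}

// Sends input to the AI research pipeline when the toggle is on,
// otherwise to the normal single-job crawl pipeline.
function routeInput(
  input: string,
  aiDiscovery: boolean,
  tier: Tier = "quick"
): RoutedRequest {
  if (aiDiscovery) {
    return { endpoint: "/api/research", body: { query: input, tier } };
  }
  return { endpoint: "/api/crawl", body: { url: input } };
}
```

The result could then be sent with `fetch(endpoint, { method: "POST", body: JSON.stringify(body) })`, assuming the routes accept JSON.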

🛠️ Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Framework | Next.js 15 (App Router) | Full-stack React with server components |
| Database | Neon PostgreSQL + Prisma | Serverless Postgres with type-safe ORM |
| Auth | Better Auth | GitHub, Google, email authentication |
| Crawling | Cloudflare Crawl API | Browser rendering + web crawling at scale |
| AI (Fast) | Groq (`llama-3.3-70b-versatile`) | URL discovery (~200ms responses) |
| AI (Deep) | NVIDIA NIM (`nemotron-super-49b-v1.5`) | Gap analysis + synthesis reports |
| AI Chat | Vercel AI SDK | Streaming chat over crawl results |
| Payments | Stripe | Subscription billing + webhooks |
| Styling | Tailwind CSS + shadcn/ui | Utility-first CSS + accessible components |
| Deployment | Vercel | Edge-optimized serverless hosting |

🚀 Getting Started

Prerequisites

Bun v1.0+
Neon PostgreSQL database
Cloudflare account with Crawl API access
Groq API key (for AI URL discovery)
NVIDIA NIM API key (for synthesis)

Quick Start

# Clone
git clone https://github.com/pantha704/CrawlMind.git
cd CrawlMind

# Install
bun install

# Configure
cp .env.example .env.local
# Edit .env.local with your keys (see below)

# Database
bunx prisma db push
bunx prisma generate

# Run
bun run dev

Environment Variables

# Database (Neon)
DATABASE_URL=postgresql://...

# Auth
BETTER_AUTH_SECRET=your-secret
BETTER_AUTH_URL=http://localhost:3001
GITHUB_CLIENT_ID=...
GITHUB_CLIENT_SECRET=...
GOOGLE_CLIENT_ID=...
GOOGLE_CLIENT_SECRET=...

# Cloudflare
CLOUDFLARE_API_TOKEN=...
CLOUDFLARE_ACCOUNT_ID=...

# AI
GROQ_API_KEY=...          # For URL discovery (Groq)
NVIDIA_NIM_API_KEY=...    # For synthesis (NVIDIA NIM)

# Stripe
STRIPE_SECRET_KEY=...
STRIPE_WEBHOOK_SECRET=...
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY=...

# App
NEXT_PUBLIC_APP_URL=http://localhost:3001
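
A startup check can catch a missing or placeholder value before the app fails at request time. The sketch below is illustrative only: the variable names come from the list above, but `missingEnv` is a hypothetical helper, and treating the OAuth client variables as optional is an assumption.

```typescript
// Variables assumed to be required; OAuth client IDs and secrets are left
// out on the assumption that a provider can be disabled.
const REQUIRED_ENV = [
  "DATABASE_URL",
  "BETTER_AUTH_SECRET",
  "BETTER_AUTH_URL",
  "CLOUDFLARE_API_TOKEN",
  "CLOUDFLARE_ACCOUNT_ID",
  "GROQ_API_KEY",
  "NVIDIA_NIM_API_KEY",
  "STRIPE_SECRET_KEY",
  "STRIPE_WEBHOOK_SECRET",
  "NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY",
  "NEXT_PUBLIC_APP_URL",
] as const;

// Returns the names that are unset, empty, or still the "..." placeholder.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => {
    const value = env[name];
    return !value || value === "...";
  });
}
```

Calling `missingEnv(process.env)` at boot and throwing when the result is non-empty would surface configuration mistakes early.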

📁 Project Structure

src/
├── app/
│   ├── api/
│   │   ├── crawl/              # Crawl CRUD, results proxy, cancel
│   │   ├── research/           # AI Discovery — create, poll, active
│   │   ├── chat/               # AI chat endpoint
│   │   ├── stripe/             # Payment webhooks
│   │   └── user/               # Usage tracking & settings
│   ├── dashboard/
│   │   ├── page.tsx            # Main dashboard
│   │   ├── jobs/               # Crawl job list + detail
│   │   ├── research/           # AI research detail page
│   │   ├── chat/               # AI chat interface
│   │   ├── library/            # Archived results
│   │   └── analytics/          # Usage analytics
│   ├── pricing/                # Pricing page
│   └── (auth)/                 # Sign in / sign up
├── components/
│   ├── dashboard/              # Dashboard UI (crawl-input, active-jobs, etc.)
│   ├── landing/                # Landing page components
│   └── ui/                     # shadcn/ui primitives
└── lib/
    ├── auth.ts                 # Better Auth config
    ├── cloudflare.ts           # Cloudflare Crawl API client
    ├── research.ts             # AI Discovery — Groq + NIM integration
    ├── ai.ts                   # AI model configuration
    ├── prisma.ts               # Prisma client
    └── stripe.ts               # Stripe client

🧠 AI Discovery — How It Works

| Tier | What Happens | Sources | Time |
|---|---|---|---|
| ⚡ Quick | AI finds 3-5 relevant sources, crawls them | 3-5 | ~30s |
| 🔍 Deep Dive | AI discovers 10-15 categorized sources | 10-15 | ~2min |
| 🧠 Research | Multi-hop: crawl → gap analysis → follow-up crawls (×3 rounds) → synthesis | 15-30+ | ~5min |
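
One way to express the tiers in code is a lookup table keyed by tier. This is a sketch under stated assumptions: the numbers are copied from the table, but the `TierConfig` shape, the field names, and the `selectSources` helper are hypothetical, and the Research tier's "30+" is treated here as a soft cap of 30.

```typescript
type Tier = "quick" | "deep" | "research";

interface TierConfig {
  minSources: number;
  maxSources: number;
  followUpRounds: number; // gap-analysis rounds; only Research uses them
  approxSeconds: number; // rough duration from the table
}

const TIERS: Record<Tier, TierConfig> = {
  quick: { minSources: 3, maxSources: 5, followUpRounds: 0, approxSeconds: 30 },
  deep: { minSources: 10, maxSources: 15, followUpRounds: 0, approxSeconds: 120 },
  // "15-30+" in the table; 30 is used as an assumed soft cap.
  research: { minSources: 15, maxSources: 30, followUpRounds: 3, approxSeconds: 300 },
};

// Trims a list of discovered URLs to the tier's source budget,
// removing duplicates while preserving order.
function selectSources(tier: Tier, discovered: string[]): string[] {
  const unique = Array.from(new Set(discovered));
  return unique.slice(0, TIERS[tier].maxSources);
}
```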

Models used:

Groq (llama-3.3-70b-versatile) — Fast URL discovery (~200ms)
NVIDIA NIM (nemotron-super-49b-v1.5) — Deep analysis & comprehensive synthesis
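
For illustration, a URL discovery call to Groq could be built and parsed like this. Groq exposes an OpenAI-compatible chat completions endpoint, and the model ID comes from the list above; the prompt wording, the `temperature` value, and both function names are assumptions rather than CrawlMind's actual code.

```typescript
const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";
const DISCOVERY_MODEL = "llama-3.3-70b-versatile";

// Builds the fetch arguments for a discovery request (prompt is hypothetical).
function buildDiscoveryRequest(query: string, maxSources: number, apiKey: string) {
  return {
    url: GROQ_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: DISCOVERY_MODEL,
        temperature: 0.2,
        messages: [
          {
            role: "system",
            content: `Reply with a JSON array of at most ${maxSources} source URLs and nothing else.`,
          },
          { role: "user", content: query },
        ],
      }),
    },
  };
}

// Extracts http(s) URLs from the model's reply, tolerating surrounding prose.
function parseDiscoveredUrls(content: string, maxSources: number): string[] {
  // Grab the span from the first "[" to the last "]".
  const match = content.match(/\[[\s\S]*\]/);
  if (!match) return [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  const urls = parsed.filter(
    (v): v is string => typeof v === "string" && /^https?:\/\//.test(v)
  );
  return Array.from(new Set(urls)).slice(0, maxSources);
}
```

In an OpenAI-compatible response the reply text is found at `choices[0].message.content`, which is what would be passed to `parseDiscoveredUrls`. Parsing defensively matters because a model may add prose or return malformed JSON despite the instruction.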

💳 Pricing Tiers

| Plan | Price | Crawls/day | Pages/crawl | AI Chat | JS Render |
|---|---|---|---|---|---|
| Spark | Free | 2 | 30 | 3 queries | ❌ |
| Pro | $12/mo | 25 | 500 | Unlimited | ✅ |
| Pro+ | $24/mo | 75 | 1,000 | Unlimited | ✅ |
| Scale | $39/mo | 150 | 5,000 | Unlimited | ✅ |
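
Plan limits like these are typically enforced with a lookup before a crawl is accepted. The sketch below copies the numbers from the table; the `PlanLimits` shape, the plan keys, and `checkCrawlAllowed` are hypothetical, and the table does not say over what period the Spark plan's 3 chat queries apply.

```typescript
type Plan = "spark" | "pro" | "proPlus" | "scale";

interface PlanLimits {
  crawlsPerDay: number;
  pagesPerCrawl: number;
  chatQueries: number; // Infinity means unlimited; period not stated in the table
  jsRender: boolean;
}

const PLAN_LIMITS: Record<Plan, PlanLimits> = {
  spark: { crawlsPerDay: 2, pagesPerCrawl: 30, chatQueries: 3, jsRender: false },
  pro: { crawlsPerDay: 25, pagesPerCrawl: 500, chatQueries: Infinity, jsRender: true },
  proPlus: { crawlsPerDay: 75, pagesPerCrawl: 1000, chatQueries: Infinity, jsRender: true },
  scale: { crawlsPerDay: 150, pagesPerCrawl: 5000, chatQueries: Infinity, jsRender: true },
};

interface CrawlRequest {
  pages: number;
  jsRender: boolean;
}

type CheckResult = { ok: true } | { ok: false; reason: string };

// Rejects a crawl that exceeds the plan's daily count, page cap, or JS rendering access.
function checkCrawlAllowed(plan: Plan, crawlsToday: number, req: CrawlRequest): CheckResult {
  const limits = PLAN_LIMITS[plan];
  if (crawlsToday >= limits.crawlsPerDay) {
    return { ok: false, reason: `Daily limit of ${limits.crawlsPerDay} crawls reached` };
  }
  if (req.pages > limits.pagesPerCrawl) {
    return { ok: false, reason: `Plan allows at most ${limits.pagesPerCrawl} pages per crawl` };
  }
  if (req.jsRender && !limits.jsRender) {
    return { ok: false, reason: "JS rendering is not included in this plan" };
  }
  return { ok: true };
}
```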

🚢 Deploy

Vercel (Recommended)

1. Push to GitHub
2. Import in Vercel
3. Add all environment variables
4. Set NEXT_PUBLIC_APP_URL to your Vercel domain
5. Deploy

Note: Ensure NEXT_PUBLIC_APP_URL points to your deployed domain (not localhost) for webhooks and auth callbacks.

📄 License

MIT — see LICENSE for details.

Built with ☕ and curiosity
