This project implements an end-to-end pipeline to scrape job postings from Wellfound, classify them using OpenAI GPT-4, store structured data into MongoDB, and index it into Elasticsearch for advanced search and prospecting.
- Go to Wellfound (formerly AngelList Talent)
- Scrape job listing URLs based on selected
location
and/orrole
- Store the scraped job URLs into the database (
job_scraping
->job_urls
collection) - Each document includes URL, location, role, and a "scraped: false" flag
- Fetch each saved job URL
- Scrape all available data:
- Title, Description
- Salary Range, Location, Experience Type
- Company Info: Industries, Size, Funding, Founder, etc.
- Save full job post data to MongoDB (
job_scraping
->jobs
collection)
- Store the entire structured job data (JSON format) including additional fields
- Fetch raw job post data from
jobs
collection - Send complete job data (excluding MongoDB
_id
) to OpenAI GPT-4 - Classify into structured JSON:
- Categories
- Focus Areas
- Company Info
- Job Info
- Investment Signals
- Summary
- Store the LLM-classified structured JSON into a new collection (
job_scraping
->classified_jobs
)
- Index classified job post into Elasticsearch (
job_classifications
index) - Upsert or update company documents with new job postings
- Score classified jobs with "signal_strength"
- Tag hot leads for outreach based on investment signals
- Set up triggers to notify prospecting teams or marketing automation
- Python
- MongoDB (job data storage)
- OpenAI GPT-4 (classification)
- Elasticsearch (searchable index)
- Asyncio + Tenacity (retries and batch processing)
Built for high-scale prospecting and lead generation workflows. ✨