Intelligent Deep Crawl #1675
Hi Ashoolak,

This is a classic challenge in web scraping known as adaptive crawling or goal-oriented crawling. You are right to be worried about costs: treating every click as an LLM inference task will indeed get expensive quickly. Here are a few directions you can take, ranging from existing tools to architectural optimizations:

1. Existing tools

You don't necessarily need to build this from scratch. There are modern tools designed specifically for this:
2. Architecture: "The Human Approach"

To reduce costs, don't use the LLM for everything. Mimic how a human actually surfs: Scan -> Filter -> Deep Dive.
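The Scan -> Filter step can be sketched without any LLM at all: score each outgoing link by cheap keyword overlap with the crawl goal, and only hand the top candidates to the expensive Deep Dive. The keyword set and `top_k` cutoff below are illustrative assumptions, not a fixed recipe.

```python
# Hypothetical goal keywords for a "find brain surgeons" crawl.
GOAL_KEYWORDS = {"staff", "directory", "people", "faculty", "team",
                 "physicians", "researchers", "neurosurgery", "surgeons"}

def score_link(anchor_text: str, url: str) -> int:
    """Count how many goal keywords appear in the anchor text or URL."""
    haystack = f"{anchor_text} {url}".lower()
    return sum(1 for kw in GOAL_KEYWORDS if kw in haystack)

def filter_links(links, top_k=5):
    """Keep only promising links (score > 0), best first."""
    scored = [(score_link(text, url), text, url) for text, url in links]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(reverse=True)
    return [(text, url) for _, text, url in scored[:top_k]]

links = [
    ("About us", "/about"),
    ("Our Neurosurgery Team", "/departments/neurosurgery/staff"),
    ("Contact", "/contact"),
    ("Faculty Directory", "/faculty"),
]
print(filter_links(links))
# → [('Our Neurosurgery Team', '/departments/neurosurgery/staff'),
#    ('Faculty Directory', '/faculty')]
```

This is O(links × keywords) string matching, so you can run it on every page for free; the LLM only ever sees the handful of pages that survive the filter.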
3. Handling an unknown structure

Since you can't rely on CSS selectors, you have two main options for the extraction step:
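One selector-free approach, sketched below with only the standard library: flatten the page to plain text, pull out emails with a regex (they have a rigid format regardless of site structure), and reserve the LLM for the harder name/title pairing. The `TextExtractor` class and the sample HTML are illustrative; the actual LLM call is deliberately left as a comment.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text nodes, ignoring all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_candidates(html: str):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    emails = EMAIL_RE.findall(text)
    # In a real pipeline, you would now send `text` (or a window of it
    # around each email) to the LLM and ask for {name, title, email}.
    return text, emails

html = """<div class="card"><h3>Dr. Ada Lovelace</h3>
<p>Chief of Neurosurgery</p><a href="mailto:ada@example.org">ada@example.org</a></div>"""
text, emails = extract_candidates(html)
print(emails)  # → ['ada@example.org']
```

Because the regex does the easy part deterministically, the LLM prompt shrinks to "pair these emails with the names and titles near them", which is both cheaper and less error-prone than free-form extraction.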
Summary

I'd suggest a hybrid approach: use a lightweight script to follow promising links based on keywords, and only invoke the heavy LLM logic when you land on a page that actually contains a list of people. Good luck with the build!
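Putting the pieces together, the hybrid loop might look like the sketch below: a BFS frontier, a cheap heuristic gate, and an LLM budget that caps spending. The `fake_site` dict and `looks_like_people_page` heuristic are stand-ins I made up for fetching and page classification; swap in real HTTP and your own signals.

```python
from collections import deque

# Hypothetical site graph: url -> (outgoing links, page text).
fake_site = {
    "/": (["/about", "/staff"], "Welcome to the hospital"),
    "/about": ([], "Our history since 1902"),
    "/staff": ([], "Dr. Ada Lovelace, Chief of Neurosurgery, ada@example.org"),
}

def looks_like_people_page(text: str) -> bool:
    # Cheap gate: an email plus a title-ish word. Tune per domain.
    return "@" in text and any(w in text for w in ("Dr.", "Chief", "Professor"))

def crawl(start: str, llm_budget: int = 3):
    seen, frontier, llm_pages = {start}, deque([start]), []
    while frontier and len(llm_pages) < llm_budget:
        url = frontier.popleft()
        links, text = fake_site[url]      # real code: fetch + parse here
        if looks_like_people_page(text):
            llm_pages.append(url)         # real code: run LLM extraction here
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return llm_pages

print(crawl("/"))  # → ['/staff']
```

Only `/staff` passes the gate, so of three pages crawled you pay for exactly one LLM call; that ratio is where the cost savings come from on real sites with hundreds of pages.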
I'm trying to build a crawler that takes a public website's URL (say, a hospital's researcher directory) plus a goal describing the contacts to find (e.g. brain surgeons) as input, and outputs the name, title, and email of everyone who matches the crawl's goal. This has to work on any website, which makes extraction difficult since I can't just use CSS selectors (the structure is unknown). To keep the cost of the AI extraction from becoming absurd, I'm trying to implement intelligent crawling. Right now I'm leaning heavily on LLM calls in an effort to replicate how a human assigned such a task on a random website would behave. But I was wondering if there's a better way of doing this, or if such a thing already exists and I just haven't been able to find it?