Intelligent Deep Crawl #1675
Hi Ashoolak,

This is a classic challenge in web scraping known as adaptive crawling or goal-oriented crawling. You are right to be worried about costs: treating every click as an LLM inference task will indeed get expensive quickly. Here are a few directions you can take, ranging from existing tools to architectural optimizations:

1. Existing tools

You don't necessarily need to build this from scratch. There are modern tools designed specifically for this:
2. Architecture: "The Human Approach"

To reduce costs, don't use the LLM for everything. Mimic how a human actually surfs: Scan -> Filter -> Deep Dive.
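The Scan -> Filter step can be sketched without any LLM at all: score each outgoing link by cheap keyword overlap with the crawl goal, and only hand the top candidates to the expensive Deep Dive. The keyword set and `top_k` cutoff below are illustrative assumptions, not a fixed recipe.

```python
# Hypothetical goal keywords for a "find brain surgeons" crawl.
GOAL_KEYWORDS = {"staff", "directory", "people", "faculty", "team",
                 "physicians", "researchers", "neurosurgery", "surgeons"}

def score_link(anchor_text: str, url: str) -> int:
    """Count how many goal keywords appear in the anchor text or URL."""
    haystack = f"{anchor_text} {url}".lower()
    return sum(1 for kw in GOAL_KEYWORDS if kw in haystack)

def filter_links(links, top_k=5):
    """Keep only promising links (score > 0), best first."""
    scored = [(score_link(text, url), text, url) for text, url in links]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(reverse=True)
    return [(text, url) for _, text, url in scored[:top_k]]

links = [
    ("About us", "/about"),
    ("Our Neurosurgery Team", "/departments/neurosurgery/staff"),
    ("Contact", "/contact"),
    ("Faculty Directory", "/faculty"),
]
print(filter_links(links))
# → [('Our Neurosurgery Team', '/departments/neurosurgery/staff'),
#    ('Faculty Directory', '/faculty')]
```

This is O(links × keywords) string matching, so you can run it on every page for free; the LLM only ever sees the handful of pages that survive the filter.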
3. Handling an unknown structure

Since you can't rely on CSS selectors, you have two main options for the extraction step:
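One selector-free approach, sketched below with only the standard library: flatten the page to plain text, pull out emails with a regex (they have a rigid format regardless of site structure), and reserve the LLM for the harder name/title pairing. The `TextExtractor` class and the sample HTML are illustrative; the actual LLM call is deliberately left as a comment.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text nodes, ignoring all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_candidates(html: str):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    emails = EMAIL_RE.findall(text)
    # In a real pipeline, you would now send `text` (or a window of it
    # around each email) to the LLM and ask for {name, title, email}.
    return text, emails

html = """<div class="card"><h3>Dr. Ada Lovelace</h3>
<p>Chief of Neurosurgery</p><a href="mailto:ada@example.org">ada@example.org</a></div>"""
text, emails = extract_candidates(html)
print(emails)  # → ['ada@example.org']
```

Because the regex does the easy part deterministically, the LLM prompt shrinks to "pair these emails with the names and titles near them", which is both cheaper and less error-prone than free-form extraction.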
Summary

I'd suggest a hybrid approach: use a lightweight script to follow promising links based on keywords, and only invoke the heavy LLM logic when you land on a page that actually contains a list of people. Good luck with the build!
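Putting the pieces together, the hybrid loop might look like the sketch below: a BFS frontier, a cheap heuristic gate, and an LLM budget that caps spending. The `fake_site` dict and `looks_like_people_page` heuristic are stand-ins I made up for fetching and page classification; swap in real HTTP and your own signals.

```python
from collections import deque

# Hypothetical site graph: url -> (outgoing links, page text).
fake_site = {
    "/": (["/about", "/staff"], "Welcome to the hospital"),
    "/about": ([], "Our history since 1902"),
    "/staff": ([], "Dr. Ada Lovelace, Chief of Neurosurgery, ada@example.org"),
}

def looks_like_people_page(text: str) -> bool:
    # Cheap gate: an email plus a title-ish word. Tune per domain.
    return "@" in text and any(w in text for w in ("Dr.", "Chief", "Professor"))

def crawl(start: str, llm_budget: int = 3):
    seen, frontier, llm_pages = {start}, deque([start]), []
    while frontier and len(llm_pages) < llm_budget:
        url = frontier.popleft()
        links, text = fake_site[url]      # real code: fetch + parse here
        if looks_like_people_page(text):
            llm_pages.append(url)         # real code: run LLM extraction here
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return llm_pages

print(crawl("/"))  # → ['/staff']
```

Only `/staff` passes the gate, so of three pages crawled you pay for exactly one LLM call; that ratio is where the cost savings come from on real sites with hundreds of pages.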
I'm trying to build a crawler that takes a public website's URL (say, a hospital's researcher directory) plus a goal describing the contacts to find (e.g. brain surgeons) as input, and outputs the name, title, and email of everyone who matches the crawl's goal. This has to work on any website, which makes extraction difficult since I can't just use CSS selectors (the structure is unknown). To keep the cost of the AI extraction from becoming absurd, I'm trying to implement intelligent crawling. Right now I'm leaning heavily on LLM calls in an effort to replicate how a human assigned such a task on a random website would behave. But I was wondering if there's a better way of doing this, or if such a thing already exists and I just haven't been able to find it?