Current fetch logic is bad for parallelism #5

@tokahuke

Despite all my efforts in postgres-lopez, the logic for fetching new pages to crawl does not play nicely with parallelism.

To explain: the query fetches a batch of URLs, but the batch covers too few distinct domains. Keep in mind that the crawl rate within a single domain is a heavily limiting factor, so it is better to crawl many different domains in parallel.

Idea: drop the fetch from the MasterBackend interface and move it to the WorkerBackend. Also, add a parameter origin: url::Origin so that the worker can control which domain to fetch from. Then, move the fancy URL-choosing logic to the Worker.
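
A rough sketch of what the idea above could look like, under loud assumptions: the trait shape, the `max` parameter, `MemoryBackend`, and `next_batch` are all hypothetical and not taken from the lopez codebase. `Origin` here is a plain stand-in so the sketch compiles without external crates; in lopez it would be `url::Origin`.

```rust
use std::collections::{HashMap, VecDeque};

/// Stand-in for `url::Origin` (assumption: keeps the sketch dependency-free).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Origin(pub String);

/// Hypothetical worker-side fetch: the worker names the origin it wants.
pub trait WorkerBackend {
    /// Returns up to `max` uncrawled URLs belonging to `origin`.
    fn fetch(&mut self, origin: &Origin, max: usize) -> Vec<String>;
}

/// In-memory backend, used only to exercise the sketch.
#[derive(Default)]
pub struct MemoryBackend {
    queues: HashMap<Origin, VecDeque<String>>,
}

impl MemoryBackend {
    /// Enqueues a URL under its origin.
    pub fn push(&mut self, origin: Origin, url: &str) {
        self.queues.entry(origin).or_default().push_back(url.to_owned());
    }
}

impl WorkerBackend for MemoryBackend {
    fn fetch(&mut self, origin: &Origin, max: usize) -> Vec<String> {
        match self.queues.get_mut(origin) {
            Some(queue) => {
                // Take from the front of the queue, never more than is available.
                let n = max.min(queue.len());
                queue.drain(..n).collect()
            }
            None => Vec::new(),
        }
    }
}

/// Worker-side choosing logic: take at most `per_origin` URLs from each
/// origin, so a single batch is spread across domains.
pub fn next_batch<B: WorkerBackend>(
    backend: &mut B,
    origins: &[Origin],
    per_origin: usize,
) -> Vec<String> {
    let mut batch = Vec::new();
    for origin in origins {
        batch.extend(backend.fetch(origin, per_origin));
    }
    batch
}
```

With this shape, the per-domain cap lives in the worker rather than in the backend query, so each worker can decide how to spread its requests across origins.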

This is also a step towards making lopez distributed. I suspect (?) that the fetch in MasterBackend is a major bottleneck.

Labels: Big Crawl (Performance while crawling reasonably large areas of the Web, i.e., more than one domain)