Despite all my efforts in postgres-lopez, the logic behind fetching new pages to be crawled does not play nicely with parallelism.
To explain: the query fetches a batch of URLs, but that batch covers only a few distinct domains. Keep in mind that crawl speed within a single domain is a heavily limiting factor, so it is better to crawl many different domains in parallel.
Idea: remove fetch from the MasterBackend interface and move it to the WorkerBackend. Also, add a parameter origin: url::Origin so that the worker can control which domain to fetch from. Then, move the fancy URL-choosing logic to Worker.
This is also a step towards making lopez distributed. I suspect, without being sure, that the fetch in MasterBackend is a major bottleneck.
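
To make the proposal above more concrete, here is a minimal sketch of what a worker-side fetch with an origin parameter could look like. Everything in it is an assumption for illustration: the trait shape, the batch_size parameter, the InMemoryBackend type, and the next_batch helper are hypothetical, and Origin is a plain String standing in for url::Origin so that the sketch has no dependencies. It is not lopez's actual interface.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Stand-in for `url::Origin` (assumption: a plain string keeps the sketch dependency-free).
type Origin = String;

/// Hypothetical worker-side backend: the worker names the origin it wants pages from.
trait WorkerBackend {
    /// Returns up to `batch_size` queued URLs belonging to `origin`.
    fn fetch(&mut self, origin: &Origin, batch_size: usize) -> Vec<String>;
}

/// Hypothetical in-memory stand-in for the Postgres-backed queue.
struct InMemoryBackend {
    queue: BTreeMap<Origin, VecDeque<String>>,
}

impl WorkerBackend for InMemoryBackend {
    fn fetch(&mut self, origin: &Origin, batch_size: usize) -> Vec<String> {
        match self.queue.get_mut(origin) {
            Some(urls) => {
                // Take at most `batch_size` URLs from the front of this origin's queue.
                let n = batch_size.min(urls.len());
                urls.drain(..n).collect()
            }
            None => Vec::new(),
        }
    }
}

/// URL-choosing logic living in the worker: take one URL per origin per round,
/// so that a batch of `total` URLs spans as many domains as possible.
fn next_batch<B: WorkerBackend>(backend: &mut B, origins: &[Origin], total: usize) -> Vec<String> {
    let mut batch = Vec::new();
    loop {
        let mut progressed = false;
        for origin in origins {
            if batch.len() >= total {
                return batch;
            }
            let mut got = backend.fetch(origin, 1);
            if !got.is_empty() {
                progressed = true;
                batch.append(&mut got);
            }
        }
        // Stop once a full round over all origins yields nothing.
        if !progressed {
            return batch;
        }
    }
}
```

With this shape, the backend only answers "give me pages for this origin", and the policy for spreading work across domains sits entirely in the worker. Round-robin is just one possible policy; it is used here because it is the simplest one that guarantees domain diversity in each batch.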