Does your Scrapy spider get identified and blocked by servers because you use the default user-agent or a generic one?
Use this random_useragent module to set a random user-agent for
every request. You are limited only by the number of different
user-agents you put in a text file.
Installing it is pretty simple.
pip install scrapy-random-useragent

In your settings.py file, update the DOWNLOADER_MIDDLEWARES
variable like this.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400
}

This disables the default UserAgentMiddleware and enables the
RandomUserAgentMiddleware. (On Scrapy 1.0 and later, the built-in
middleware lives at 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
so use that path instead.)
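If you are curious what the middleware does under the hood, the whole
idea fits in a few lines. The sketch below is my own illustration of the
technique, not the package's actual source; the class layout and file
handling are assumptions.

import random

class RandomUserAgentMiddleware(object):
    """Downloader middleware that sends a different user-agent on each request."""

    def __init__(self, path):
        # Load one user-agent per line, skipping blank lines.
        with open(path) as f:
            self.user_agents = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from project settings.
        return cls(crawler.settings.get('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Called for every outgoing request; overwrite the header each time.
        request.headers['User-Agent'] = random.choice(self.user_agents)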
Then, create a new variable USER_AGENT_LIST with the path to a
text file that lists your user-agents (one per line).
USER_AGENT_LIST = "/path/to/useragents.txt"Now all the requests from your crawler will have a random user-agent picked from the text file.