parameterizing dateinc field, change the default date inc #3
rush00121 wants to merge 1 commit into bpb27:master from
In my experience, you actually get fewer results when you use an interval than when you go day by day. Have you tried running this on a user with a lot of tweets (20K+) and comparing the total against the interval method?
I was trying to get tweets from realdonaldtrump. He has more than 30K tweets. I refactored the code and ran it with a date interval of 50 days. It was much faster than fetching 1 day at a time. I did not record metrics to prove this, but it definitely sped up the scraping process for me.
By fewer results I mean you only collect 28K total tweets (with the interval method) instead of 30K total tweets (with the day-by-day method).
I did not test the number of tweets scraped. Let me test it and see whether both methods give the same results.
I tested dates from 2010-01-01 to 2017-03-01. For the previous code, I got results: With my modifications, I got results: I then took a smaller sample: dates from 2017-01-01 to 2017-02-01. Both runs gave me a total tweet count of 204. I am not sure why, in the previous run, my code gave me more results. Is it a timeout issue with the Twitter page, or something else? In both cases, though, I got a significant speed improvement.
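Comparing totals alone does not show which tweets one method misses. A minimal sketch of comparing the collected IDs themselves, assuming each run yields a list of tweet ID strings; the function name `compare_id_sets` and the toy IDs are hypothetical and not taken from this repository:

```python
def compare_id_sets(day_by_day_ids, interval_ids):
    """Summarize how two scraping runs over the same date range differ."""
    # Deduplicate first: the same tweet may be collected more than once.
    day_by_day = set(day_by_day_ids)
    interval = set(interval_ids)
    return {
        "day_by_day_total": len(day_by_day),
        "interval_total": len(interval),
        # IDs only the day-by-day run found.
        "missing_from_interval": sorted(day_by_day - interval),
        # IDs only the interval run found.
        "missing_from_day_by_day": sorted(interval - day_by_day),
    }


# Toy IDs for illustration only, not real scraping output.
report = compare_id_sets(["101", "102", "103"], ["101", "103", "104"])
print(report)
```

If both "missing" lists are non-empty, neither run is a superset of the other, which would point at page loading or timing rather than at the interval size alone.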
I also ran this PR branch and got 14479 for 2010-01-01 to 2017-06-30. This may well be an artifact of how the pages load on my machine, a factor which it seems would be an issue regardless, but it is definitely worth considering.
Increasing the page load time to 2 netted 15687 IDs for that same time period.
Instead of pulling data for every single day, this extends the search filter parameter to cover a bigger date range. This should make the scraping process faster.
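The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `date_windows`, `search_query`, and `date_inc` are hypothetical, and it assumes the scraper builds its search from the `from:`, `since:`, and `until:` search operators with `until` treated as the exclusive end of each window.

```python
from datetime import date, timedelta


def date_windows(start, end, date_inc=1):
    """Yield (since, until) pairs covering [start, end) in steps of date_inc days."""
    if date_inc < 1:
        raise ValueError("date_inc must be at least 1")
    step = timedelta(days=date_inc)
    since = start
    while since < end:
        # Clamp the last window so it never runs past the requested end date.
        until = min(since + step, end)
        yield since, until
        since = until


def search_query(user, since, until):
    """Build a search string for one window (assumed operator syntax)."""
    return f"from:{user} since:{since.isoformat()} until:{until.isoformat()}"


# One search per day versus one search per 50-day window for the same month.
windows_1 = list(date_windows(date(2017, 1, 1), date(2017, 2, 1), date_inc=1))
windows_50 = list(date_windows(date(2017, 1, 1), date(2017, 2, 1), date_inc=50))
print(len(windows_1), len(windows_50))
print(search_query("realdonaldtrump", *windows_50[0]))
```

A larger `date_inc` means fewer page loads, which is where the speedup comes from, but each page then has more tweets to load, so the page load wait discussed in the conversation matters more.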