parameterizing dateinc field, change the default date inc #3

Open
rush00121 wants to merge 1 commit into bpb27:master from rush00121:master

@rush00121

Instead of pulling data one day at a time, this extends the search filter parameter to cover a larger date range. This should make the scraping process faster.
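To make the idea concrete, here is a minimal sketch of splitting a date range into search windows of a configurable size. The function name `date_windows` and the parameter `step_days` are hypothetical and are not taken from the PR; the sketch only assumes that each search covers a half-open window from a `since` date up to an `until` date.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, step_days: int = 1) -> Iterator[Tuple[date, date]]:
    """Yield (since, until) pairs that cover [start, end) in steps of step_days.

    step_days is a hypothetical stand-in for the date increment discussed here.
    """
    if step_days < 1:
        raise ValueError("step_days must be at least 1")
    since = start
    while since < end:
        # Clamp the last window so it never runs past the requested end date.
        until = min(since + timedelta(days=step_days), end)
        yield since, until
        since = until


# Day by day needs one search per day; a 10-day increment needs far fewer.
daily = list(date_windows(date(2017, 1, 1), date(2017, 2, 1), step_days=1))
windows = list(date_windows(date(2017, 1, 1), date(2017, 2, 1), step_days=10))
print(len(daily), len(windows))
```

With a larger increment the scraper issues fewer searches, which is where the speedup would come from; whether each search still returns every tweet in its window is a separate question.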

@bpb27
Owner

bpb27 commented Mar 29, 2017

In my experience, you actually get fewer results when you use an interval than when you go day by day. Have you tried running this on a user with a lot of tweets (20K+) and comparing the total from the interval method against the day-by-day total?

@rush00121
Author

I was trying to get tweets from realdonaldtrump. He has > 30k tweets. I refactored the code and ran it with a date interval of 50 days. It was much faster than fetching 1 day at a time. I did not record metrics to prove this, but it definitely sped up the scraping process for me.

@bpb27
Owner

bpb27 commented Mar 29, 2017

By fewer results I mean you only collect 28K total tweets (w/ interval method) instead of the 30K total tweets (w/ day-by-day method).

@rush00121
Author

I did not test the number of tweets scraped. Let me test it and see whether both methods give the same results.

@rush00121
Author

I tested for dates from 2010-01-01 to 2017-03-01.

With the previous code, I got:
total tweet count: 26141

With my modifications, I got:
total tweet count: 27357

I then took a smaller sample:

dates from 2017-01-01 to 2017-02-01.

Both runs gave me total tweet count: 204

I am not sure why my code gave more results in the first, longer run. Is it a timeout issue with the Twitter page, or something else?

But in both cases, I got a significant speed improvement.
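Total counts alone do not show which tweets each method misses. A rough way to check would be to compare the collected IDs as sets. The sketch below uses made-up IDs and a hypothetical helper, `compare_id_sets`; it assumes each run produces a list of tweet ID strings.

```python
from typing import Dict, Iterable, Set


def compare_id_sets(day_by_day: Iterable[str], interval: Iterable[str]) -> Dict[str, Set[str]]:
    """Report which IDs each run found that the other did not."""
    a, b = set(day_by_day), set(interval)  # sets also drop duplicate IDs
    return {
        "only_day_by_day": a - b,
        "only_interval": b - a,
        "both": a & b,
    }


# Made-up IDs for illustration only.
result = compare_id_sets(["1", "2", "3", "3"], ["2", "3", "4"])
for name, ids in result.items():
    print(name, len(ids))
```

If both "only" sets are non-empty, neither method is a superset of the other, and taking the union of the two runs would recover more tweets than either one alone.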

@ryanbateman

I also ran this PR branch and got 14479 for 2010-01-01 to 2017-06-30. This may well be an artifact of how the pages load on my machine (a factor that would seem to be an issue regardless), but it is definitely worth considering.

@ryanbateman

Increasing the page load time to 2 netted 15687 ids for that same time period.
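The dependence on page load time suggests that a fixed wait can cut collection short. One possible approach, sketched below, is to keep scrolling until several rounds in a row yield no new IDs. Everything here is an assumption for illustration: `collect_until_stable`, `read_ids`, `scroll`, and `FakePage` are hypothetical names, the delay is assumed to be in seconds, and a real scraper would pass in browser-driven callables instead of the fake page.

```python
import time
from typing import Callable, Iterable, List, Set


def collect_until_stable(
    read_ids: Callable[[], Iterable[str]],
    scroll: Callable[[], None],
    delay_seconds: float = 1.0,
    max_idle_rounds: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Scroll and re-read IDs until max_idle_rounds rounds add nothing new."""
    seen: Set[str] = set()
    idle = 0
    while idle < max_idle_rounds:
        before = len(seen)
        seen.update(read_ids())
        # Reset the idle counter whenever a round finds at least one new ID.
        idle = idle + 1 if len(seen) == before else 0
        if idle < max_idle_rounds:
            scroll()
            sleep(delay_seconds)  # give the page time to load more results
    return seen


class FakePage:
    """Stand-in for a results page that reveals one batch per scroll."""

    def __init__(self, batches: List[List[str]]) -> None:
        self.batches = batches
        self.loaded = 1

    def read_ids(self) -> List[str]:
        return [i for batch in self.batches[: self.loaded] for i in batch]

    def scroll(self) -> None:
        self.loaded = min(self.loaded + 1, len(self.batches))


page = FakePage([["1", "2"], ["3"], ["4", "5"]])
ids = collect_until_stable(
    page.read_ids, page.scroll, delay_seconds=0, max_idle_rounds=2, sleep=lambda s: None
)
print(sorted(ids))
```

A longer delay or a higher idle limit trades speed for completeness, which matches the pattern reported in this thread.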

4 participants