@Ayushk4
Member

@Ayushk4 Ayushk4 commented Jun 16, 2019

An attempt at the approach mentioned in #143.
As of now, it is about as fast as the existing implementation.
Still a work in progress; some functions remain to be done.

Fixes #74 as well (refer to #76).

  • strip_articles
  • strip_pronouns
  • strip_prepositions
  • strip_stopwords
  • Whitespace
  • Corrupt_utf8
  • Punctuation
  • Numbers
  • Strip_case
  • Strip_frequent and strip_sparse
  • Fixes "Replacement function for list of stuff" (#23)
  • Tests
  • Docstrings
  • Documentation

@Ayushk4
Member Author

Ayushk4 commented Jun 23, 2019

This currently supports four operations: strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords. On those four operations it is about 4 times faster for documents of 100,000 characters and 2-3 times faster for documents of 10,000 characters. The speedup grows with document size, but for smaller documents it converges to the same speed as the existing implementation.

julia> @time fastpreprocess(StringDocument(s))
  0.006278 seconds (3.78 k allocations: 693.500 KiB)

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.024585 seconds (1.65 k allocations: 207.063 KiB)

julia> length(s)
100000

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)

julia> @time fastpreprocess(StringDocument(s))
  0.007384 seconds (3.78 k allocations: 693.500 KiB)

@aviks
Member

aviks commented Nov 2, 2020

Hey @Ayushk4, can we finish this up?

@Ayushk4
Member Author

Ayushk4 commented Nov 2, 2020

I was only able to get this to work faster for the first few operations I added. When I later used the same token buffer approach for more operations, overall performance was much slower than the existing implementation.

For the time being, I am closing it. If I find some other way to speed it up, then I will re-open this or send another PR.
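
For readers unfamiliar with the term, the following is a minimal sketch of what a single-pass token buffer filter can look like. It is not the code from this PR: the names `strip_words_single_pass` and `flush_token!` are hypothetical, and the sketch assumes that a word is a run of letters and that unwanted words are matched case-insensitively against a `Set`.

```julia
# Hypothetical sketch, not the PR's implementation: scan the text once,
# collect each word in a reusable buffer, and copy it to the output only
# if it is not in the set of unwanted words.

# Emit the buffered word (if any) unless it is unwanted, then reset the buffer.
function flush_token!(out::IOBuffer, token::IOBuffer, unwanted::Set{String})
    word = String(take!(token))   # take! also empties the buffer for reuse
    if !isempty(word) && !(lowercase(word) in unwanted)
        write(out, word)
    end
    return nothing
end

function strip_words_single_pass(text::AbstractString, unwanted::Set{String})
    out = IOBuffer()     # accumulates the filtered text
    token = IOBuffer()   # holds the word currently being scanned
    for c in text
        if isletter(c)
            write(token, c)                     # still inside a word
        else
            flush_token!(out, token, unwanted)  # word ended: keep or drop it
            write(out, c)                       # non-letters are copied as-is
        end
    end
    flush_token!(out, token, unwanted)          # handle a word at end of text
    return String(take!(out))
end

# Example call; the words in `unwanted` must be lowercase.
filtered = strip_words_single_pass("The cat sat on a mat", Set(["the", "a", "on"]))
```

Under these assumptions the text is traversed once regardless of how many words are in the set, which is the property that makes the approach attractive for long documents. Whitespace around removed words is left in place, so a separate whitespace pass would still be needed.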

@Ayushk4 Ayushk4 closed this Nov 2, 2020

Development

Successfully merging this pull request may close these issues.

  • remove_words! fails for long terms & terms with punctuation
  • Replacement function for list of stuff.

2 participants