remove_words! fails for long terms & terms with punctuation #74

@enkiv2

Because remove_words! uses regex matching even for string input, it fails on terms that are actually present in the text when a term is longer than the maximum pattern size accepted by PCRE. Terms that are present also fail if they contain punctuation that the regex engine treats as metacharacters. In both cases the error message doesn't specify the pattern that failed, and the error aborts remove_words! entirely.

The same problem occurs in remove_sparse_terms! and remove_frequent_terms!, since these also boil down to a call to remove_pattern.

Would it be possible to force only string-literal substitution in the case where an array of type String is passed (and only use regex if the items passed are actually typed as regular expressions)?
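
A rough sketch of what that type-based split could look like, assuming nothing about the package internals: strip_terms is a hypothetical helper name, not part of TextAnalysis.jl. It relies only on Base's replace, which treats a String pattern literally and a Regex pattern as a regular expression.

```julia
# Hypothetical helpers for illustration; not part of the TextAnalysis.jl API.

# Plain strings: replace(text, term => "") treats `term` literally,
# so no regex is compiled and metacharacters need no escaping.
function strip_terms(text::AbstractString, terms::AbstractVector{<:AbstractString})
    for term in terms
        text = replace(text, term => "")
    end
    return text
end

# Regex objects: the caller has explicitly opted in to pattern matching.
function strip_terms(text::AbstractString, patterns::AbstractVector{Regex})
    for pattern in patterns
        text = replace(text, pattern => "")
    end
    return text
end

# Terms containing regex metacharacters are removed as literal text.
strip_terms("what is f(x) [sic]?", ["f(x)", "[sic]"])

# Regex input still behaves as a pattern.
strip_terms("a1b22c", [r"\d+"])
```

Note that this sketch removes literal substrings and makes no attempt at word-boundary matching, so it is not a drop-in replacement for whatever remove_pattern does today. It only illustrates dispatching on the element type.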
