Fix Repeating tokens in WordEmbeddings #14693

ahmedlone127 · 2025-11-08T18:55:01Z

In the splitter-based pipeline, the WordEmbeddings annotator generates an excessive number of repeated embeddings.
With a splitter-based pipeline, some token overlap is expected: tokens that fall inside the overlap between two adjacent windows should appear twice.

However, instead of appearing twice, tokens are being repeated four times, causing a mismatch between the token and embedding counts.

This is fixed by updating TokenizedWithSentence to filter tokens so that each token falls within a sentence at most once, meaning a token can no longer come up in two different sentences.
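The filtering idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not the actual TokenizedWithSentence code (which is Scala and is not shown in this conversation). The `Span` class, the `assign_tokens_once` function, and the sample offsets are all hypothetical, and offsets are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: inclusive character offsets plus text."""
    begin: int
    end: int
    text: str


def assign_tokens_once(sentences: List[Span], tokens: List[Span]) -> Dict[int, List[Span]]:
    """Map each sentence index to the tokens it contains, using every token at most once."""
    assigned: Set[Tuple[int, int]] = set()
    result: Dict[int, List[Span]] = {i: [] for i in range(len(sentences))}
    for i, sentence in enumerate(sentences):
        for token in tokens:
            key = (token.begin, token.end)
            inside = sentence.begin <= token.begin and token.end <= sentence.end
            # A token already claimed by an earlier (overlapping) sentence is skipped.
            if inside and key not in assigned:
                assigned.add(key)
                result[i].append(token)
    return result


# Two overlapping windows over the text "a b c d e"; the token "c" sits in the overlap.
sentences = [Span(0, 4, "a b c"), Span(4, 8, "c d e")]
tokens = [Span(0, 0, "a"), Span(2, 2, "b"), Span(4, 4, "c"), Span(6, 6, "d"), Span(8, 8, "e")]

grouped = assign_tokens_once(sentences, tokens)
first_window = [t.text for t in grouped[0]]
second_window = [t.text for t in grouped[1]]
```

In this sketch the overlapping token is kept by the first window that contains it, so the total number of assigned tokens equals the number of distinct input tokens. Which sentence keeps the token in the real fix is not stated here.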

Added a new FastTest to verify the correct behaviour (no repeated tokens).
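The kind of check such a test might perform can be sketched as follows. This is an assumption about the shape of the assertion, not the FastTest added in this PR. The `repeated_spans` helper and the sample spans are hypothetical, with each token or embedding identified by its `(begin, end)` offsets.

```python
from collections import Counter
from typing import List, Tuple


def repeated_spans(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the (begin, end) spans that occur more than once, in sorted order."""
    counts = Counter(spans)
    return sorted(span for span, n in counts.items() if n > 1)


# Hypothetical output spans: one list per annotator column.
token_spans = [(0, 0), (2, 2), (4, 4), (6, 6), (8, 8)]
embedding_spans_ok = [(0, 0), (2, 2), (4, 4), (6, 6), (8, 8)]
embedding_spans_bad = [(0, 0), (2, 2), (4, 4), (4, 4), (6, 6), (8, 8)]

clean = repeated_spans(embedding_spans_ok)          # no repeats expected
duplicated = repeated_spans(embedding_spans_bad)    # the span (4, 4) repeats
counts_match = len(token_spans) == len(embedding_spans_ok)
```

Checking both the absence of repeated spans and the equality of token and embedding counts covers the two symptoms described in the PR description.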

ahmedlone127 added 2 commits November 8, 2025 23:54

implementing fix

51cb048

Introducing new test case

92eef0c

DevinTDHa changed the title ~~implementing fix~~ Fix Repeating tokens in WordEmbeddings Nov 13, 2025

DevinTDHa added the bug-fix label Nov 13, 2025

DevinTDHa changed the base branch from master to release/622-release-candidate November 13, 2025 10:27

scalafmt

38b8b13

DevinTDHa approved these changes Nov 13, 2025

DevinTDHa merged commit 5c7d0d7 into release/622-release-candidate Nov 13, 2025
1 of 4 checks passed

DevinTDHa mentioned this pull request Nov 13, 2025

Spark NLP 6.2.2 Release #14696

Merged

10 tasks

3 participants
