Contributor

@ahmedlone127 ahmedlone127 commented Nov 8, 2025

In the splitter-based pipeline, the WordEmbeddings annotator generates an excessive number of repeated embeddings.
Some token overlap is expected with the splitter: because consecutive windows overlap, a certain number of tokens should appear twice.

However, instead of appearing twice, these tokens are repeated four times, which causes a mismatch between the token count and the embedding count.

This PR fixes the issue by updating TokenizedWithSentence to filter tokens so that each token is assigned to a given sentence at most once and cannot appear in two different sentences.

Added a new FastTest to verify the correct behaviour (no repeated tokens).
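
To illustrate the filtering idea described above, here is a minimal Python sketch. It is not the Spark NLP implementation: the `Span` type, the `assign_tokens_once` helper, and the inclusive character offsets are all assumptions made for illustration. It only shows how claiming each token for the first sentence that contains it keeps the token count free of duplicates when windows overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation: inclusive character offsets plus text."""
    begin: int
    end: int
    text: str


def assign_tokens_once(sentences, tokens):
    """Group tokens by sentence, giving each token to at most one sentence.

    A token goes to the first sentence whose offsets fully contain it;
    later overlapping sentences skip tokens that were already claimed.
    """
    claimed = set()
    grouped = []
    for sentence in sentences:
        owned = []
        for token in tokens:
            key = (token.begin, token.end)
            inside = sentence.begin <= token.begin and token.end <= sentence.end
            if inside and key not in claimed:
                claimed.add(key)
                owned.append(token)
        grouped.append(owned)
    return grouped


# Text "a b c d e" split into two windows that overlap on the token "c".
sentences = [Span(0, 4, "a b c"), Span(4, 8, "c d e")]
tokens = [Span(0, 0, "a"), Span(2, 2, "b"), Span(4, 4, "c"),
          Span(6, 6, "d"), Span(8, 8, "e")]

assigned = assign_tokens_once(sentences, tokens)
for index, group in enumerate(assigned):
    print(index, [token.text for token in group])
```

In this sketch the overlapping token `c` is kept only in the first window, so the number of assigned tokens equals the number of input tokens, which is the property the new test checks for.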

@DevinTDHa DevinTDHa changed the title implementing fix Fix Repeating tokens in WordEmbeddings Nov 13, 2025
@DevinTDHa DevinTDHa changed the base branch from master to release/622-release-candidate November 13, 2025 10:27
@DevinTDHa DevinTDHa merged commit 5c7d0d7 into release/622-release-candidate Nov 13, 2025
1 of 4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Nov 13, 2025
10 tasks
3 participants