Add punctuation removal modifier #63

Merged
ZJaume merged 4 commits into hplt-project:main from mozilla:aug_punkt on Jun 25, 2025

Conversation

@evgenyrp (Contributor) commented Jun 20, 2025

We have a robustness issue: translation quality drops depending on whether or not a sentence ends with punctuation (see mozilla/translations#177).

The new RemoveEndPunctuationModifier should help with some of that. It randomly removes the punctuation mark at the end of the source and target sentences if their punctuation types match (for example, a period . and a Chinese full stop 。).

The exact logic here is probably up for discussion, but I think OpusTrainer should not behave as a data cleaning tool, so if the punctuation doesn't match, we assume there's a reason for that (maybe a paragraph includes multiple sentences and the translators decided to swap them for some reason). In that case, removing punctuation could damage the data: for example, a question mark might be preserved in the middle of the paragraph on one side but removed from the end of the paragraph on the other side. So we play it safe here and leave mismatched pairs untouched.
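
The matching rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code merged in this PR: the `PUNCT_KIND` table, the `end_kind` and `remove_end_punct` names, and the `probability` parameter are all hypothetical, and the real constants are likely larger.

```python
import random

# Hypothetical, deliberately small table mapping an end-of-sentence mark to a
# punctuation "kind". The constants in the actual PR may differ.
PUNCT_KIND = {
    ".": "period", "。": "period", "।": "period",
    "?": "question", "？": "question", "؟": "question",
    "!": "exclamation", "！": "exclamation",
}


def end_kind(sentence: str):
    """Return the kind of the final punctuation mark, or None if there is none."""
    stripped = sentence.rstrip()
    return PUNCT_KIND.get(stripped[-1]) if stripped else None


def remove_end_punct(src: str, trg: str, probability: float, rng: random.Random):
    """Drop the final mark from both sides, but only when the kinds agree."""
    src_kind, trg_kind = end_kind(src), end_kind(trg)
    if src_kind is None or src_kind != trg_kind:
        return src, trg  # no end mark, or a mismatch: leave the pair untouched
    if rng.random() >= probability:
        return src, trg  # this pair was not selected for augmentation
    return src.rstrip()[:-1].rstrip(), trg.rstrip()[:-1].rstrip()
```

With `probability=1.0`, a pair such as `("Hello world.", "你好世界。")` loses both end marks, while `("Is it?", "It is.")` is returned unchanged because a question mark and a period are different kinds.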

Also, I currently don't consider any kinds of parentheses, quotes etc., because they come in pairs and removing them would be slightly harder. Also, I think we have an opposite problem for them: adding quotes negatively impacts translation quality.

A note on the punctuation constants, which were generated by ChatGPT: I think they should be good enough and cover the majority of cases. Let me know if I should add something else. It's probably not worth investing in a comprehensive solution that would include regexes, external libs, ICU, etc. to identify all possible punctuation by kind for each language. It would also likely be slower to run.

@evgenyrp (Contributor, Author) commented

Actually, we might need another modifier that adds punctuation (or improve this one). This would be trickier for CJK and other non-European languages.

@ZJaume (Contributor) commented Jun 24, 2025

Maybe the add punctuation modifier is not needed and this one is enough.

@ZJaume (Contributor) left a comment

This looks perfect to me. Can you please add a short explanation to the modifiers section of the README before merging this?

@evgenyrp (Contributor, Author) commented

> Maybe the add punctuation modifier is not needed and this one is enough.

Maybe. My thinking is that if we have enough punctuation examples in the data, then since augmentation is random and will likely apply differently across multiple training epochs, the model will see the same sentence both with and without the end punctuation, and will hopefully become more robust to it. I looked at a random de-en sample and full stops are abundant there. Everything else is quite scarce, so maybe addition would make sense for those other types of punctuation. We can merge this one for now.
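
As a rough back-of-the-envelope check of that intuition, the chance that a given sentence is seen in both forms can be computed directly. The sketch below assumes the removal decision is drawn independently in each epoch with the same probability `p`; both that assumption and the function name are illustrative, not something taken from OpusTrainer.

```python
def prob_seen_both_forms(p: float, epochs: int) -> float:
    """Chance that a sentence appears at least once with and at least once
    without its end mark, assuming an independent draw per epoch with
    removal probability p (an assumption made for this illustration)."""
    always_removed = p ** epochs          # mark removed in every epoch
    never_removed = (1.0 - p) ** epochs   # mark kept in every epoch
    return 1.0 - always_removed - never_removed
```

Under these assumptions, a single epoch can never show both forms, and the probability grows with the number of epochs for any `p` strictly between 0 and 1.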

@evgenyrp requested a review from ZJaume on June 24, 2025 22:21

@evgenyrp (Contributor, Author) commented

I updated the README and also handled an edge case with ellipses and multiple punctuation marks (basically, such lines are not processed).
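
One way to express that edge case is a small guard that only accepts lines ending in exactly one end mark. This is a hedged sketch rather than the merged code: `END_MARKS`, `ELLIPSIS`, and `is_safe_to_strip` are hypothetical names, and the set of marks is a small illustrative subset.

```python
# Hypothetical subset of end-of-sentence marks, for illustration only.
END_MARKS = set(".。?？!！")
ELLIPSIS = "…"


def is_safe_to_strip(sentence: str) -> bool:
    """True only if the line ends in a single end mark.

    Lines ending in an ellipsis ("..." or "…") or in several marks
    ("?!", "!!") are rejected so they are left unprocessed.
    """
    s = sentence.rstrip()
    if not s or s[-1] not in END_MARKS:
        return False  # empty line, no end mark, or a trailing "…"
    if len(s) >= 2 and (s[-2] in END_MARKS or s[-2] == ELLIPSIS):
        return False  # the mark is part of a longer punctuation run
    return True
```

A modifier could call such a guard on both the source and the target side and skip the pair if either side fails it.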

@ZJaume merged commit ffd62b1 into hplt-project:main on Jun 25, 2025
5 checks passed