
Chinese text is not being tokenized properly #280

@Enkidu93

Description


We currently have pretranslations generated that look like:

...
"translation": "声音已经说出来了,耶稣就独自一人了.他们在那些日子里什么都没告诉任何人.",
"translationTokens": [
  "声音已经说出来了,耶稣就独自一人了.他们在那些日子里什么都没告诉任何人",
  "."
],
...

I assume that this is because we are using the LatinWordTokenizer for translation alignment. This is likely happening for some other scripts as well. We should evaluate how many projects this affects and consider using another tokenizer (or dynamically choosing a tokenizer). Unless I'm missing something, this would make the marker placement feature basically unavailable for those translating into scripts that the LatinWordTokenizer does not tokenize properly.
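To illustrate the failure mode, here is a minimal sketch (illustrative stand-ins only, not the actual LatinWordTokenizer from the Machine library): a Latin-style tokenizer splits on whitespace and peels punctuation off token edges, so an unspaced Chinese sentence comes back as a single token with only the trailing period detached, exactly as in the pretranslation above. A per-character fallback for CJK code points is one possible alternative:

```python
PUNCT = ",.;:!?\"'()，。；：！？"

def latin_style_tokenize(text):
    # Illustrative stand-in for a Latin-style word tokenizer: split on
    # whitespace, then peel leading/trailing punctuation into separate
    # tokens. Internal punctuation is never split because there is no
    # whitespace boundary next to it.
    tokens = []
    for chunk in text.split():
        start = 0
        while start < len(chunk) and chunk[start] in PUNCT:
            tokens.append(chunk[start])
            start += 1
        end = len(chunk)
        trailing = []
        while end > start and chunk[end - 1] in PUNCT:
            trailing.append(chunk[end - 1])
            end -= 1
        if end > start:
            tokens.append(chunk[start:end])
        tokens.extend(reversed(trailing))
    return tokens

def is_cjk(ch):
    # Common CJK Unified Ideograph ranges (not exhaustive).
    return any(lo <= ord(ch) <= hi
               for lo, hi in [(0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0xF900, 0xFAFF)])

def char_tokenize_cjk(text):
    # Fallback: emit each CJK ideograph as its own token; keep
    # Latin-script runs together; punctuation is its own token.
    tokens = []
    buf = ""
    for ch in text:
        if ch.isspace() or is_cjk(ch) or ch in PUNCT:
            if buf:
                tokens.append(buf)
                buf = ""
            if not ch.isspace():
                tokens.append(ch)
        else:
            buf += ch
    if buf:
        tokens.append(buf)
    return tokens
```

Running the Latin-style stand-in on the sentence from the pretranslation reproduces the two-token output shown above, while the character-level fallback yields one token per ideograph, which is what the alignment step would need.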

Metadata

Assignees: No one assigned
Labels: No labels
Status: 🛑 Blocked
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
