add arabic vocabs and some modification for the detection model so the errors are more clear #1957
base: main
Conversation
Hi @sridi-Rania 👋,
Thanks a lot for the PR 👍
| "arabic_diacritics": "ًٌٍَُِّْ", | ||
| "arabic_digits": "٠١٢٣٤٥٦٧٨٩", | ||
| "arabic_letters": "ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي", | ||
| "arabic_letters": "- ء آ أ ؤ إ ئ ا ٪ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ٰیٕ٪ ل م ن ه ة و ي پ چ ڢ ڤ گ ﻻ ﻷ ﻹ ﻵ ﺀ ﺁ ﺃ ﺅ ﺇ ﺉ ﺍ ﺏ ﺕ ﺙ ﺝ ﺡ ﺥ ﺩ ﺫ ﺭ ﺯ ﺱ ﺵ ﺹ ﺽ ﻁ ﻅ ﻉ ﻍ ﻑ ﻕ ﻙ ﻝ ﻡ ﻥ ﻩ ﻩ ﻭ ﻱ ﺑ ﺗ ﺛ ﺟ ﺣ ﺧ ﺳ ﺷ ﺻ ﺿ ﻃ ﻇ ﻋ ﻏ ﻓ ﻗ ﻛ ﻟ ﻣ ﻧ ﻫ ﻳ ﺒ ﺘ ﺜ ﺠ ﺤ ﺨ ﺴ ﺸ ﺼ ﺾ ﻄ ﻈ ﻌ ﻐ ﻔ ﻘ ﻜ ﻠ ﻤ ﻨ ﻬ ﻴ ﺎ ﺐ ﺖ ﺚ ﺞ ﺢ ﺦ ﺪ ﺬ ﺮ ﺰ ﺲ ﺶ ﺺ ﺾ ﻂ ﻆ ﻊ ﻎ ﻒ ﻖ ﻚ ﻞ ﻢ ﻦ ﻪ ﺔ ﺓﺋ ﺓﺋ ى ﻼوفرّٕ ﺊ ﻯ ﻀ ﻯ ﻼ ﺋ ﺊﺓى ﻀال ص ح x ـ ـوx ﻰ ﻮ ﻲ ً ٌ ؟ ؛ « » — ! # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ { | } ~", |
I would suggest only extending the existing arabic_letters with any characters that are missing, and adding Arabic-specific punctuation to arabic_punctuation, because western punctuation is already included in the arabic entry :)
Additionally, it should not include whitespace: our models can't work well with whitespace, so please remove it. If we want to make the entry more readable, we can use:
"arabic_letters": "".join(["د", "غ" ...])
Hi,
Thanks for your feedback!
Just to clarify: in Arabic, letters change shape depending on their position in the word (beginning, middle, or end).
The characters I included cover all these contextual forms, which makes them more suitable for training the model accurately.
Also, the whitespaces between characters are not meant for natural spacing but are used intentionally to differentiate between the different forms of each letter during training.
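For illustration, these contextual variants are separate Unicode "presentation form" code points that are compatibility-equivalent to the base letters; a small standalone check (not code from this PR) shows the relationship:

import unicodedata

# U+FE91 is the initial (word-beginning) shape of the letter BEH
print(unicodedata.name("\ufe91"))               # ARABIC LETTER BEH INITIAL FORM
# NFKC normalization maps the presentation form back to the base letter ب (U+0628)
print(unicodedata.normalize("NFKC", "\ufe91"))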
Mh.. Understood
Could we split this into vowels, consonants, and diacritics?
In the end, each char needs to be unique, and whitespace is not allowed, as mentioned. To avoid anything being merged visually, we can use
"".join(["A", "B", ...])
The punctuation should be removed because it's added to the arabic entry later on :)
If I merge both I get this:
['ء', 'آ', 'أ', 'ؤ', 'إ', 'ئ', 'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ـ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي', 'ﻰ', 'ﻚ', 'ﻟ', 'ﺱ', 'ﻦ', 'ٰ', 'ﺞ', 'ﻛ', 'ﺩ', 'ﺀ', 'ﺨ', 'ﻋ', 'x', 'ﺺ', 'ﻫ', 'ﻱ', 'ﺲ', 'ﻝ', 'ﺕ', 'ڢ', 'ﻳ', 'ڤ', 'ﺬ', '؛', 'ﺶ', 'ﺟ', 'ﺔ', 'گ', 'ﻙ', 'ﺦ', 'ﺁ', 'ﺋ', 'ﻞ', 'ﺷ', 'ﺚ', 'ﺃ', 'ﻈ', 'ﻨ', 'ﺴ', 'ﻹ', 'ﺉ', 'ﻊ', 'ﺪ', 'ﻉ', 'ﺝ', 'ﺳ', 'ﻷ', 'ﻓ', 'ﺍ', 'ﺊ', 'ﻖ', 'ﻠ', 'ً', 'ﻍ', 'ﻣ', 'ﻇ', 'ﺾ', 'ٌ', 'چ', 'ﺿ', 'ﻧ', 'ﺡ', 'ﻗ', 'ﺙ', 'ﺼ', 'ﺑ', 'ﻅ', 'ﺓ', 'ﻯ', 'ﻭ', 'ﺒ', 'ﻤ', 'ﻔ', 'پ', 'ﺯ', 'ﻩ', 'ﻑ', 'ﻜ', 'ﺖ', 'ﺛ', 'ﺧ', 'ﺫ', 'ﺠ', 'ﻡ', 'ﻵ', 'ﻌ', 'ﺰ', 'ﻴ', 'ﻘ', 'ﻄ', 'ﻒ', '٪', 'ﺮ', 'ﺇ', 'ﺘ', 'ﺽ', 'ﻢ', 'ﻐ', 'ﻏ', 'ﻃ', 'ی', 'ﺵ', 'ﺸ', 'ﻲ', 'ﻮ', 'ﺻ', 'ﻆ', 'ﻁ', 'ﺏ', 'ﺎ', 'ﻕ', 'ﺹ', 'ﻻ', 'ﻂ', 'ﺣ', 'ﻼ', 'ﺭ', 'ﻪ', '؟', 'ﺐ', 'ﺤ', 'ﻬ', 'ٕ', 'ّ', 'ﻀ', 'ﺗ', 'ﻥ', 'ﻎ', 'ﺥ', 'ﺅ', 'ﺜ', 'ﺢ']
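(For reference, such a duplicate-free merge could be reproduced with something like the line below; old_letters and new_letters are placeholders for the two "arabic_letters" strings from the diff above, not names from the codebase.)

# Strip the separator spaces from the new entry, concatenate with the old one,
# and keep only the first occurrence of each character (dicts preserve insertion order)
merged = list(dict.fromkeys(old_letters + new_letters.replace(" ", "")))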
dict.fromkeys(
    # latin_based
    VOCABS["english"]
    VOCABS["arabic"]
Let's revert this for the moment; we will add it once we have a multilingual dataset that includes Arabic 👍
In general a really good idea to add a sanity check 👍
But we need to rethink the implementation a bit: your current code only fits the db_ models, whereas such a check should be more generic and controllable, so I would suggest the following:
- our built-in datasets don't need such a check, so the full focus is on DetectionDataset
- let's move the checking logic into DetectionDataset itself:
https://github.com/mindee/doctr/blob/main/doctr/datasets/detection.py
Here we can add a boolean argument sanity_check (or something like that) which defaults to False. If True, it should do the following before formatting and appending the data:
- Check that the coordinates are in the image ranges
- Check that the coordinates are absolute so not in range 0-1
This logic can be added as a private method to the class and called before polygon formatting
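A rough sketch of what such a private check could look like (the function name, signature, and the assumption that the image size is already known at that point are illustrative, not existing doctr code):

import numpy as np

def _validate_polygons(polygons: np.ndarray, img_width: int, img_height: int, img_name: str) -> None:
    if polygons.size == 0:
        return
    # Relative coordinates: every value lying in [0, 1] suggests the boxes were normalized
    if float(polygons.max()) <= 1.0:
        raise ValueError(f"{img_name}: coordinates look relative (all values <= 1), absolute pixel values are expected")
    xs, ys = polygons[..., 0], polygons[..., 1]
    # Out-of-range coordinates: everything must lie inside the image
    if xs.min() < 0 or ys.min() < 0 or xs.max() > img_width or ys.max() > img_height:
        raise ValueError(f"{img_name}: some coordinates fall outside the image ({img_width}x{img_height})")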
Afterwards a test needs to be added here:
doctr/tests/pytorch/test_datasets_pt.py (line 135 in b547085):
def test_detection_dataset(mock_image_folder, mock_detection_label):
and
doctr/tests/tensorflow/test_datasets_tf.py (line 108 in b547085):
def test_detection_dataset(mock_image_folder, mock_detection_label):
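Something along these lines might work for the new test case (the sanity_check argument follows the suggestion above and is not an existing API; DetectionDataset is assumed to be importable the same way the existing test uses it):

def test_detection_dataset_sanity_check(mock_image_folder, mock_detection_label):
    # Well-formed mock labels should pass the check and load normally
    ds = DetectionDataset(
        img_folder=mock_image_folder,
        label_path=mock_detection_label,
        sanity_check=True,
    )
    assert len(ds) > 0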
If these parts are done, we can add an extra arg to the detection training scripts:
parser.add_argument("--check-dataset", dest="check_dataset", action="store_true", help="Check the dataset for possible issues")
and correspondingly update the DetectionDataset instances:
val_set = DetectionDataset(
img_folder=os.path.join(args.val_path, "images"),
label_path=os.path.join(args.val_path, "labels.json"),
sanity_check=args.check_dataset,
....