@sridi-Rania
No description provided.

@felixdittrich92 felixdittrich92 self-requested a review June 18, 2025 06:05
Collaborator

@felixdittrich92 felixdittrich92 left a comment

Hi @sridi-Rania 👋,

Thanks a lot for the PR 👍

"arabic_diacritics": "ًٌٍَُِّْ",
"arabic_digits": "٠١٢٣٤٥٦٧٨٩",
"arabic_letters": "ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي",
"arabic_letters": "- ء آ أ ؤ إ ئ ا ٪ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ٰیٕ٪ ل م ن ه ة و ي پ چ ڢ ڤ گ ﻻ ﻷ ﻹ ﻵ ﺀ ﺁ ﺃ ﺅ ﺇ ﺉ ﺍ ﺏ ﺕ ﺙ ﺝ ﺡ ﺥ ﺩ ﺫ ﺭ ﺯ ﺱ ﺵ ﺹ ﺽ ﻁ ﻅ ﻉ ﻍ ﻑ ﻕ ﻙ ﻝ ﻡ ﻥ ﻩ ﻩ ﻭ ﻱ ﺑ ﺗ ﺛ ﺟ ﺣ ﺧ ﺳ ﺷ ﺻ ﺿ ﻃ ﻇ ﻋ ﻏ ﻓ ﻗ ﻛ ﻟ ﻣ ﻧ ﻫ ﻳ ﺒ ﺘ ﺜ ﺠ ﺤ ﺨ ﺴ ﺸ ﺼ ﺾ ﻄ ﻈ ﻌ ﻐ ﻔ ﻘ ﻜ ﻠ ﻤ ﻨ ﻬ ﻴ ﺎ ﺐ ﺖ ﺚ ﺞ ﺢ ﺦ ﺪ ﺬ ﺮ ﺰ ﺲ ﺶ ﺺ ﺾ ﻂ ﻆ ﻊ ﻎ ﻒ ﻖ ﻚ ﻞ ﻢ ﻦ ﻪ ﺔ ﺓﺋ ﺓﺋ ى ﻼوفرّٕ ﺊ ﻯ ﻀ ﻯ ﻼ ﺋ ﺊﺓى ﻀال ص ح x ـ ـوx ﻰ ﻮ ﻲ ً ٌ ؟ ؛ « » — ! # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ { | } ~",
Collaborator

I would suggest only extending the existing arabic_letters with any characters that are missing, and adding any additional Arabic-specific punctuation to arabic_punctuation, because western punctuation is already included in the arabic entry :)

Collaborator

Additionally, it should not include whitespace: our models can't work well with whitespace, so please remove it. If we want to make it more readable, then:

"arabic_letters": "".join(["د", "غ" ...])

Author

Hi,

Thanks for your feedback!

Just to clarify: in Arabic, letters change shape depending on their position in the word (beginning, middle, or end).
The characters I included cover all these contextual forms, which makes them more suitable for training the model accurately.

Also, the whitespaces between characters are not meant for natural spacing but are used intentionally to differentiate between the different forms of each letter during training.
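
To make the point about contextual forms concrete, here is a minimal sketch (not part of the PR) using only Python's standard unicodedata module. It takes the letter BEH as an example and shows that its positional forms are separate code points in the Arabic Presentation Forms-B block, and that NFKC compatibility normalization maps each of them back to the base letter. Whether a vocab should list the presentation forms or only the base letters depends on how the training labels are encoded, which this sketch does not settle.

```python
import unicodedata

# Base letter BEH and its four presentation forms
# (isolated, final, initial, medial).
base = "\u0628"
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for ch in forms:
    # Each positional form is its own code point with its own Unicode name ...
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # ... and NFKC normalization folds it back to the base letter.
    assert unicodedata.normalize("NFKC", ch) == base
```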

Collaborator

Mh... Understood.
Could we split this into vowels, consonants, and diacritics?

In the end, each char needs to be unique, and whitespace is not allowed, as mentioned. To avoid characters being visually merged, we can use

"".join(["A", "B", ...])

Punctuation should be removed because it is added later on to the arabic entry :)

If I merge both I get this:

['ء', 'آ', 'أ', 'ؤ', 'إ', 'ئ', 'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ـ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي', 'ﻰ', 'ﻚ', 'ﻟ', 'ﺱ', 'ﻦ', 'ٰ', 'ﺞ', 'ﻛ', 'ﺩ', 'ﺀ', 'ﺨ', 'ﻋ', 'x', 'ﺺ', 'ﻫ', 'ﻱ', 'ﺲ', 'ﻝ', 'ﺕ', 'ڢ', 'ﻳ', 'ڤ', 'ﺬ', '؛', 'ﺶ', 'ﺟ', 'ﺔ', 'گ', 'ﻙ', 'ﺦ', 'ﺁ', 'ﺋ', 'ﻞ', 'ﺷ', 'ﺚ', 'ﺃ', 'ﻈ', 'ﻨ', 'ﺴ', 'ﻹ', 'ﺉ', 'ﻊ', 'ﺪ', 'ﻉ', 'ﺝ', 'ﺳ', 'ﻷ', 'ﻓ', 'ﺍ', 'ﺊ', 'ﻖ', 'ﻠ', 'ً', 'ﻍ', 'ﻣ', 'ﻇ', 'ﺾ', 'ٌ', 'چ', 'ﺿ', 'ﻧ', 'ﺡ', 'ﻗ', 'ﺙ', 'ﺼ', 'ﺑ', 'ﻅ', 'ﺓ', 'ﻯ', 'ﻭ', 'ﺒ', 'ﻤ', 'ﻔ', 'پ', 'ﺯ', 'ﻩ', 'ﻑ', 'ﻜ', 'ﺖ', 'ﺛ', 'ﺧ', 'ﺫ', 'ﺠ', 'ﻡ', 'ﻵ', 'ﻌ', 'ﺰ', 'ﻴ', 'ﻘ', 'ﻄ', 'ﻒ', '٪', 'ﺮ', 'ﺇ', 'ﺘ', 'ﺽ', 'ﻢ', 'ﻐ', 'ﻏ', 'ﻃ', 'ی', 'ﺵ', 'ﺸ', 'ﻲ', 'ﻮ', 'ﺻ', 'ﻆ', 'ﻁ', 'ﺏ', 'ﺎ', 'ﻕ', 'ﺹ', 'ﻻ', 'ﻂ', 'ﺣ', 'ﻼ', 'ﺭ', 'ﻪ', '؟', 'ﺐ', 'ﺤ', 'ﻬ', 'ٕ', 'ّ', 'ﻀ', 'ﺗ', 'ﻥ', 'ﻎ', 'ﺥ', 'ﺅ', 'ﺜ', 'ﺢ']
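
A merge like the one above could be sketched as follows. This is only an illustration of the constraints mentioned in this thread (unique characters, no whitespace, punctuation left out); the helper name build_vocab and the short sample strings are hypothetical and are not the actual vocab entries.

```python
import string


def build_vocab(*parts: str, exclude: str = "") -> str:
    """Merge character strings into one vocab string.

    Hypothetical helper: keeps first-seen order, drops duplicates,
    whitespace, and any character listed in `exclude`.
    """
    merged = "".join(parts)
    kept = (ch for ch in merged if not ch.isspace() and ch not in exclude)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return "".join(dict.fromkeys(kept))


# Short sample inputs (not the real entries): an existing string and a
# space-separated proposal that also contains western punctuation.
existing = "ءآأب"
proposed = "- ء آ ب ﺏ ﺑ ﺒ ﺐ ! ?"

vocab = build_vocab(existing, proposed, exclude=string.punctuation)
print(list(vocab))
```

Building the string with "".join over a list, as suggested above, keeps each character visually separate in the source while the resulting vocab contains no whitespace.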

dict.fromkeys(
# latin_based
VOCABS["english"]
VOCABS["arabic"]
Collaborator

Let's revert this for the moment, we will add this if we have a multilingual dataset including arabic 👍

Collaborator

In general a really good idea to add a sanity check 👍

But we need to rethink the implementation a bit: your current code fits only the db_ models, but such a check should be more generic and controllable, so I would suggest the following:

Here we can add a boolean argument, sanity_check or something like that, which defaults to False. If True, it should do the following before formatting and appending the data:

- Check that the coordinates are in the image ranges
- Check that the coordinates are absolute so not in range 0-1

This logic can be added as a private method to the class and called before polygon formatting
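
The two checks listed above could look roughly like the sketch below. It is written as a standalone function so that it runs on its own; the name check_polygons, its signature, and the plain list-of-points input are assumptions made for illustration, not the dataset class's actual API. The relative-coordinate test is a heuristic: it flags annotations whose coordinates all lie within 0-1.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def check_polygons(
    polygons: Sequence[Sequence[Point]], img_width: int, img_height: int
) -> None:
    """Hypothetical sanity check: raise ValueError on suspicious coordinates."""
    points = [pt for poly in polygons for pt in poly]
    if not points:
        return
    # Heuristic: if every coordinate lies in [0, 1], the annotations are
    # most likely relative rather than absolute pixel values.
    if all(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in points):
        raise ValueError("coordinates look relative (0-1); absolute values expected")
    # Every point must fall inside the image.
    if any(x < 0 or y < 0 or x > img_width or y > img_height for x, y in points):
        raise ValueError("coordinates fall outside the image bounds")


# A box fully inside a 100x80 image passes silently.
check_polygons([[(10, 10), (50, 10), (50, 30), (10, 30)]], img_width=100, img_height=80)
```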

Afterwards a test needs to be added here:

def test_detection_dataset(mock_image_folder, mock_detection_label):

and
def test_detection_dataset(mock_image_folder, mock_detection_label):

If these parts are done we can add an extra arg to the detection training scripts

parser.add_argument("--check-dataset", dest="check_dataset", action="store_true", help="Check the dataset for possible issues")

and correspondingly update the DetectionDataset instances:

        val_set = DetectionDataset(
            img_folder=os.path.join(args.val_path, "images"),
            label_path=os.path.join(args.val_path, "labels.json"),
            sanity_check=args.check_dataset,
            ....
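
As a rough illustration of the flag wiring described above (a sketch, not the actual training script): with argparse's store_true action the flag defaults to False and becomes True only when --check-dataset is passed. The dataset_kwargs helper is a hypothetical stand-in for the place where the value would be forwarded, and sanity_check is the argument name proposed in this review, not an existing parameter.

```python
import argparse

parser = argparse.ArgumentParser(description="Stand-in for a detection training script")
parser.add_argument(
    "--check-dataset",
    dest="check_dataset",
    action="store_true",
    help="Check the dataset for possible issues",
)


def dataset_kwargs(argv):
    """Hypothetical helper: collect the keyword arguments a dataset would receive."""
    args = parser.parse_args(argv)
    return {"sanity_check": args.check_dataset}


print(dataset_kwargs([]))
print(dataset_kwargs(["--check-dataset"]))
```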

@felixdittrich92 felixdittrich92 marked this pull request as draft July 9, 2025 12:39