add arabic vocabs and some modification for the detection model so the errors are more clear #1957
base: main
Conversation
Hi @sridi-Rania 👋,
Thanks a lot for the PR 👍
| "arabic_diacritics": "ًٌٍَُِّْ", | ||
| "arabic_digits": "٠١٢٣٤٥٦٧٨٩", | ||
| "arabic_letters": "ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي", | ||
| "arabic_letters": "- ء آ أ ؤ إ ئ ا ٪ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ٰیٕ٪ ل م ن ه ة و ي پ چ ڢ ڤ گ ﻻ ﻷ ﻹ ﻵ ﺀ ﺁ ﺃ ﺅ ﺇ ﺉ ﺍ ﺏ ﺕ ﺙ ﺝ ﺡ ﺥ ﺩ ﺫ ﺭ ﺯ ﺱ ﺵ ﺹ ﺽ ﻁ ﻅ ﻉ ﻍ ﻑ ﻕ ﻙ ﻝ ﻡ ﻥ ﻩ ﻩ ﻭ ﻱ ﺑ ﺗ ﺛ ﺟ ﺣ ﺧ ﺳ ﺷ ﺻ ﺿ ﻃ ﻇ ﻋ ﻏ ﻓ ﻗ ﻛ ﻟ ﻣ ﻧ ﻫ ﻳ ﺒ ﺘ ﺜ ﺠ ﺤ ﺨ ﺴ ﺸ ﺼ ﺾ ﻄ ﻈ ﻌ ﻐ ﻔ ﻘ ﻜ ﻠ ﻤ ﻨ ﻬ ﻴ ﺎ ﺐ ﺖ ﺚ ﺞ ﺢ ﺦ ﺪ ﺬ ﺮ ﺰ ﺲ ﺶ ﺺ ﺾ ﻂ ﻆ ﻊ ﻎ ﻒ ﻖ ﻚ ﻞ ﻢ ﻦ ﻪ ﺔ ﺓﺋ ﺓﺋ ى ﻼوفرّٕ ﺊ ﻯ ﻀ ﻯ ﻼ ﺋ ﺊﺓى ﻀال ص ح x ـ ـوx ﻰ ﻮ ﻲ ً ٌ ؟ ؛ « » — ! # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ { | } ~", |
I would suggest only extending the existing arabic_letters with any characters that are missing, and adding Arabic-specific punctuation to arabic_punctuation, because western punctuation is already included in the arabic entry :)
Additionally, it should not include whitespace: our models can't work well with whitespace, so please remove it. If we want to make the entry more readable, we can use:
"arabic_letters": "".join(["د", "غ" ...])
Hi,
Thanks for your feedback!
Just to clarify: in Arabic, letters change shape depending on their position in the word (beginning, middle, or end).
The characters I included cover all these contextual forms, which makes them more suitable for training the model accurately.
Also, the whitespaces between characters are not meant for natural spacing but are used intentionally to differentiate between the different forms of each letter during training.
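For illustration, these contextual variants are separate Unicode "presentation form" code points that are compatibility-equivalent to the base letters; a small standalone check (not code from this PR) shows the relationship:

import unicodedata

# U+FE91 is the initial (word-beginning) shape of the letter BEH
print(unicodedata.name("\ufe91"))               # ARABIC LETTER BEH INITIAL FORM
# NFKC normalization maps the presentation form back to the base letter ب (U+0628)
print(unicodedata.normalize("NFKC", "\ufe91"))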
Mh.. Understood
Could we split this into vowels, consonants, and diacritics?
In the end, each char needs to be unique, and whitespace is not allowed, as mentioned. To avoid anything being merged visually, we can use
"".join(["A", "B", ...])
The punctuation should be removed because it's added to the arabic entry later on :)
If I merge both I get this:
['ء', 'آ', 'أ', 'ؤ', 'إ', 'ئ', 'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ـ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي', 'ﻰ', 'ﻚ', 'ﻟ', 'ﺱ', 'ﻦ', 'ٰ', 'ﺞ', 'ﻛ', 'ﺩ', 'ﺀ', 'ﺨ', 'ﻋ', 'x', 'ﺺ', 'ﻫ', 'ﻱ', 'ﺲ', 'ﻝ', 'ﺕ', 'ڢ', 'ﻳ', 'ڤ', 'ﺬ', '؛', 'ﺶ', 'ﺟ', 'ﺔ', 'گ', 'ﻙ', 'ﺦ', 'ﺁ', 'ﺋ', 'ﻞ', 'ﺷ', 'ﺚ', 'ﺃ', 'ﻈ', 'ﻨ', 'ﺴ', 'ﻹ', 'ﺉ', 'ﻊ', 'ﺪ', 'ﻉ', 'ﺝ', 'ﺳ', 'ﻷ', 'ﻓ', 'ﺍ', 'ﺊ', 'ﻖ', 'ﻠ', 'ً', 'ﻍ', 'ﻣ', 'ﻇ', 'ﺾ', 'ٌ', 'چ', 'ﺿ', 'ﻧ', 'ﺡ', 'ﻗ', 'ﺙ', 'ﺼ', 'ﺑ', 'ﻅ', 'ﺓ', 'ﻯ', 'ﻭ', 'ﺒ', 'ﻤ', 'ﻔ', 'پ', 'ﺯ', 'ﻩ', 'ﻑ', 'ﻜ', 'ﺖ', 'ﺛ', 'ﺧ', 'ﺫ', 'ﺠ', 'ﻡ', 'ﻵ', 'ﻌ', 'ﺰ', 'ﻴ', 'ﻘ', 'ﻄ', 'ﻒ', '٪', 'ﺮ', 'ﺇ', 'ﺘ', 'ﺽ', 'ﻢ', 'ﻐ', 'ﻏ', 'ﻃ', 'ی', 'ﺵ', 'ﺸ', 'ﻲ', 'ﻮ', 'ﺻ', 'ﻆ', 'ﻁ', 'ﺏ', 'ﺎ', 'ﻕ', 'ﺹ', 'ﻻ', 'ﻂ', 'ﺣ', 'ﻼ', 'ﺭ', 'ﻪ', '؟', 'ﺐ', 'ﺤ', 'ﻬ', 'ٕ', 'ّ', 'ﻀ', 'ﺗ', 'ﻥ', 'ﻎ', 'ﺥ', 'ﺅ', 'ﺜ', 'ﺢ']
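(For reference, such a duplicate-free merge could be reproduced with something like the line below; old_letters and new_letters are placeholders for the two "arabic_letters" strings from the diff above, not names from the codebase.)

# Strip the separator spaces from the new entry, concatenate with the old one,
# and keep only the first occurrence of each character (dicts preserve insertion order)
merged = list(dict.fromkeys(old_letters + new_letters.replace(" ", "")))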
dict.fromkeys(
    # latin_based
    VOCABS["english"]
    VOCABS["arabic"]
Let's revert this for the moment; we will add it once we have a multilingual dataset that includes Arabic 👍
In general a really good idea to add a sanity check 👍
But we need to rethink the implementation a bit: your current code only fits the db_ models, whereas such a check should be more generic and controllable, so I would suggest the following:
- our built-in datasets don't need such a check, so the full focus is on DetectionDataset
- let's move the checking logic into DetectionDataset itself:
https://github.com/mindee/doctr/blob/main/doctr/datasets/detection.py
Here we can add a boolean argument sanity_check (or something like that) which defaults to False. If True, it should do the following before formatting and appending the data:
- Check that the coordinates are in the image ranges
- Check that the coordinates are absolute so not in range 0-1
This logic can be added as a private method to the class and called before polygon formatting
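A rough sketch of what such a private check could look like (the function name, signature, and the assumption that the image size is already known at that point are illustrative, not existing doctr code):

import numpy as np

def _validate_polygons(polygons: np.ndarray, img_width: int, img_height: int, img_name: str) -> None:
    if polygons.size == 0:
        return
    # Relative coordinates: every value lying in [0, 1] suggests the boxes were normalized
    if float(polygons.max()) <= 1.0:
        raise ValueError(f"{img_name}: coordinates look relative (all values <= 1), absolute pixel values are expected")
    xs, ys = polygons[..., 0], polygons[..., 1]
    # Out-of-range coordinates: everything must lie inside the image
    if xs.min() < 0 or ys.min() < 0 or xs.max() > img_width or ys.max() > img_height:
        raise ValueError(f"{img_name}: some coordinates fall outside the image ({img_width}x{img_height})")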
Afterwards a test needs to be added here:
doctr/tests/pytorch/test_datasets_pt.py (line 135 in b547085):
def test_detection_dataset(mock_image_folder, mock_detection_label):
and
doctr/tests/tensorflow/test_datasets_tf.py (line 108 in b547085):
def test_detection_dataset(mock_image_folder, mock_detection_label):
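Something along these lines might work for the new test case (the sanity_check argument follows the suggestion above and is not an existing API; DetectionDataset is assumed to be importable the same way the existing test uses it):

def test_detection_dataset_sanity_check(mock_image_folder, mock_detection_label):
    # Well-formed mock labels should pass the check and load normally
    ds = DetectionDataset(
        img_folder=mock_image_folder,
        label_path=mock_detection_label,
        sanity_check=True,
    )
    assert len(ds) > 0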
If these parts are done, we can add an extra arg to the detection training scripts:
parser.add_argument("--check-dataset", dest="check_dataset", action="store_true", help="Check the dataset for possible issues")
and correspondingly update the DetectionDataset instances:
val_set = DetectionDataset(
img_folder=os.path.join(args.val_path, "images"),
label_path=os.path.join(args.val_path, "labels.json"),
sanity_check=args.check_dataset,
....