This project is a Telegram bot that protects groups from spam using a classification model based on multilingual cased DistilBERT from the Hugging Face transformers package. The model is trained to identify spam messages in English, Russian, and other languages, so detection remains effective across different chat contexts.
Users can try out this project here: Try the Project
- Spam Detection: Identifies and handles spam messages in Telegram groups.
- Customizable: Adjust spam detection parameters and manage blocked users.
- Efficient: Leverages a pre-trained DistilBERT model for fast and accurate classification.
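As a rough illustration of how an adjustable detection parameter could work, the sketch below turns the two raw scores (logits) of a binary classifier into a spam decision. The threshold value, the function names, and the assumption that index 1 is the spam class are all hypothetical and not taken from the project; in practice the logits would come from the fine-tuned DistilBERT sequence classifier.

```python
import math

SPAM_THRESHOLD = 0.5  # hypothetical default; not taken from the project


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_spam(logits, threshold=SPAM_THRESHOLD, spam_index=1):
    """Return True when the spam-class probability reaches the threshold.

    `spam_index` assumes label 1 means "spam"; the real label order
    depends on how the model was fine-tuned.
    """
    return softmax(logits)[spam_index] >= threshold
```

Raising the threshold makes the bot more conservative: `is_spam([0.0, 2.0])` is true with the default, while `is_spam([0.0, 2.0], threshold=0.95)` is false because the spam probability is only about 0.88.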
The model was trained on the following datasets:
- Russian Spam Dataset:
- English Spam Datasets:
These datasets contain a variety of spam messages, including SMS spam, and feature numerous emojis and special characters, aligning with the context of Telegram chats.
- Sequence Length: Training took into account the hard limits on sequence length, in particular the maximum message length in Telegram.
- Special Characters: The training data includes a large number of emojis and special characters to ensure robustness in the Telegram environment.
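To make the sequence-length constraint concrete: a Telegram text message can be up to 4096 characters, while DistilBERT accepts at most 512 tokens per input, so a long message can tokenize to more than the model can read at once. The sketch below shows one common way to handle this, splitting the token sequence into overlapping windows. This is an assumption for illustration, not necessarily what the project does (simple truncation is the other usual option), and the function name and overlap value are hypothetical.

```python
DISTILBERT_MAX_TOKENS = 512  # positional limit of DistilBERT
SPECIAL_TOKENS = 2           # room reserved for [CLS] and [SEP]


def split_into_windows(token_ids,
                       max_len=DISTILBERT_MAX_TOKENS - SPECIAL_TOKENS,
                       overlap=64):
    """Split a token sequence into overlapping windows of at most max_len."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the message
        start += step
    return windows
```

Each window can then be classified separately, and the message treated as spam if any window is flagged. Short messages, which are the typical case in chats, pass through as a single window.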
The project includes a database model with the following objects:
- Anti-Spam Chats: Represents chats where spam protection is active.
- Anti-Spam Chat Admins: Information about users managing spam protection settings in chats.
- Banned Users: List of users banned for spam.
- Muted Users: List of users temporarily muted from sending messages.
The project was initially tested with PostgreSQL, with the schema built through SQLAlchemy.
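The four objects can be pictured with the sketch below. It uses plain dataclasses as a stand-in for the project's SQLAlchemy models, and every class and field name here is hypothetical; the actual schema may differ. The only behaviour shown is how a temporary mute could be checked against an expiry time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AntiSpamChat:
    chat_id: int          # Telegram chat identifier
    enabled: bool = True  # whether spam protection is active


@dataclass
class AntiSpamChatAdmin:
    chat_id: int
    user_id: int          # user allowed to manage protection settings


@dataclass
class BannedUser:
    chat_id: int
    user_id: int
    banned_at: datetime


@dataclass
class MutedUser:
    chat_id: int
    user_id: int
    muted_until: datetime  # the mute is temporary and ends at this time

    def is_active(self, now):
        """Return True while the mute has not yet expired."""
        return now < self.muted_until
```

For example, a user muted for ten minutes is still muted at the start of that period and no longer muted once `muted_until` is reached:

```python
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
mute = MutedUser(chat_id=-100, user_id=42,
                 muted_until=now + timedelta(minutes=10))
mute.is_active(now)                          # True
mute.is_active(now + timedelta(minutes=10))  # False
```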
- Telegram Bot Framework:
  - aiogram==3.12.0
  - aioschedule==0.5.2
  - aiosignal==1.3.1
  - async-timeout==4.0.3
- Machine Learning:
  - transformers==4.44.0
  - torch==2.4.0
- Configuration and Database:
  - pyyaml==6.0.2
  - python-dotenv==1.0.1
  - sqlalchemy==2.0.32
  - aiosqlite==0.20.0
  - asyncpg==0.29.0
  - dogpile.cache==1.3.3
- Timezone handling:
  - pytz==2024.1
This project is licensed under the terms specified in the LICENSE.MD file.