Mail_Classifier/About.docx at master · adityamodi/Mail_Classifier · GitHub



Peeyush Agarwal, Hardik Bansal, Aditya Modi, Lucky Sahani

Aim:
Content-aware Email Multiclass Classification
Categorize emails according to their senders, delivered as a Gmail add-on.

Ideas for implementation:
We plan to use Python, JavaScript, and the Google APIs.
We will use machine learning algorithms such as SVM, Naive Bayes, and random forest to implement the classifier.

Utility:
People receive large volumes of email every day, at work and in their personal lives, and the sheer quantity makes an inbox hard to manage. Spam is only part of the problem: unwanted but legitimate mail, such as advertisements, also wastes readers' time. Reliable automatic categorization of email is therefore needed. This project aims to develop an add-on that performs supervised and unsupervised classification of emails based on their content, in particular sorting emails into folders according to the role of the sender. Automatically filing mails under various labels will reduce the time required to check them. We also intend to implement some semantic analysis of the email content.

Features:
1. Automatically classifying mails under various labels according to their content.
2. Adding the deadlines mentioned in a mail to Google Calendar.
3. Highlighting the important parts of a mail using semantic analysis.
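
As an illustration of feature 2, the sketch below shows one simple way deadlines could be pulled out of a mail body. It is only a minimal example under stated assumptions: it recognises a single date format ("15 March 2016"), the helper names `find_deadlines` and `to_all_day_event` are hypothetical, and the returned dictionary is modelled on an all-day Google Calendar event (with `summary`, `start.date` and an exclusive `end.date`), which should be checked against the Calendar API documentation before use. No API call is made here.

```python
import re
from datetime import date, timedelta

# Month names mapped to their numbers (1 = January).
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumption: deadlines are written as "<day> <month name> <year>".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def find_deadlines(text):
    """Return every valid calendar date mentioned in the text."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            continue  # skip impossible dates such as "31 February 2016"
    return found


def to_all_day_event(summary, day):
    """Build a dictionary shaped like an all-day calendar event (assumed shape)."""
    return {
        "summary": summary,
        "start": {"date": day.isoformat()},
        # The end date of an all-day event is treated as exclusive here.
        "end": {"date": (day + timedelta(days=1)).isoformat()},
    }


mail_body = "Please submit the project report by 15 March 2016."
for deadline in find_deadlines(mail_body):
    print(to_all_day_event("Deadline from mail", deadline))
```

A real add-on would need to handle many more date formats and relative expressions such as "next Friday"; this sketch only shows the overall flow from mail text to a calendar entry.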


Estimated Deadlines:

1. 2 Days: Obtaining and modifying the dataset required for training.
2. 2 Days: Stemming the words in the dataset.
3. 2 Days: Tokenizing all the words in the training set, removing words that do not contribute to classification, and adding a few extra tokens.
4. 2 Days: Constructing a feature vector for each training example and assembling the vectors into a matrix.
5. 1 Day: Converting the matrix to sparse form and shuffling the vectors.
6. 1 Day: Dividing the dataset into 10 parts and running 10-fold cross-validation, each time training on 9 parts and testing on the remaining one.
7. 4 Days: Training Naïve Bayes (with Laplace smoothing, so that no word has a zero count) to obtain a model that can make predictions on new data.
8. 4 Days: Training an SVM to obtain a model that can make predictions on new data.
9. 1 Day: Comparing the results of SVM and Naïve Bayes.
10. 5 Days: Reading the Google API documentation to use features such as modifying mails, applying labels, and adding events to Google Calendar.
11. 4 Days: Integrating the JavaScript and Python components and synchronising them with Gmail and Google Calendar.
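
Steps 3 to 7 of the plan can be sketched in a few dozen lines of plain Python. The example below is a minimal illustration, not the project's implementation: the stop-word list, the toy mails and all function names are hypothetical, stemming and the sparse-matrix conversion are left out for brevity, and word counts are kept in dictionaries rather than in an explicit matrix. It shows tokenization with stop-word removal, multinomial Naive Bayes with Laplace (add-one) smoothing, and k-fold cross-validation.

```python
import math
import random
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Collect the counts that multinomial Naive Bayes needs."""
    class_counts = Counter(labels)       # number of mails per label
    word_counts = defaultdict(Counter)   # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # ignore unseen words
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Laplace smoothing: add 1 to every count so none is zero.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10, seed=0):
    """Shuffle, split into k folds, and return the mean held-out accuracy.

    Assumes there are at least k mails, so that no fold is empty.
    """
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = train_naive_bayes([docs[i] for i in train],
                                  [labels[i] for i in train])
        correct = sum(predict(model, docs[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


# Toy data, invented purely for illustration.
docs = ["Project meeting agenda for Monday",
        "Please review the project report",
        "Huge discount sale this weekend",
        "Buy now and get a free discount coupon"]
labels = ["work", "work", "promo", "promo"]

model = train_naive_bayes(docs, labels)
print(predict(model, "Team meeting about the project"))
print(cross_validate(docs, labels, k=2))
```

With a real dataset, `cross_validate` would be called with `k=10` as in step 6, and the same fold structure could be reused to evaluate the SVM so that the comparison in step 9 is made on identical splits.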