Spark MLlib and Stream Data Prediction

This project trains a simple supervised classification model offline, then uses it to predict the label of each Wikipedia edit as it arrives in an online stream.

Overview of Steps

Gathering historical data. I streamed and saved historical data from April 15, 2020 through April 21, 2020, using the provided script spark_streaming_example_saving.py.ipynb. The saved data files were then combined using Linux commands.
Data exploration. I explored the historical dataset and found it to be unbalanced. Many unsafe and vandal edits are made by unregistered users (identified by IP address, not logged in), whose names look like "name_user": "190.215.27.232". Less commonly, vandal edits are made by registered users with silly names such as "Tylertoney Dude perfect".
Data preprocessing using MLlib. For model training, I focus on three columns: label, text_new, and text_old. The training set and test set are preprocessed separately.
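The gathering and exploration steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: it assumes the saved stream is stored as JSON lines in `part-*` files under one directory, and the helper names `iter_edits`, `is_ip_username`, and `summarize` are hypothetical. Only the `label` and `name_user` fields come from the description above.

```python
import ipaddress
import json
from collections import Counter
from pathlib import Path


def iter_edits(root):
    """Yield one dict per non-empty JSON line in any part-* file under root.

    Assumption: each saved batch is a folder of part-* text files.
    """
    for part in sorted(Path(root).rglob("part-*")):
        if not part.is_file():
            continue
        with part.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)


def is_ip_username(name):
    """Return True when the user name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name.strip())
        return True
    except ValueError:
        return False


def summarize(edits):
    """Count edits per label and the share of each label made by IP users."""
    label_counts = Counter()
    ip_counts = Counter()
    for edit in edits:
        label_counts[edit["label"]] += 1
        if is_ip_username(edit["name_user"]):
            ip_counts[edit["label"]] += 1
    ip_share = {label: ip_counts[label] / n for label, n in label_counts.items()}
    return label_counts, ip_share
```

Calling `summarize(iter_edits("saved_batches"))`, where `saved_batches` is a placeholder path, returns the label counts (which show the class imbalance) and, for each label, the fraction of edits that came from IP-address user names.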

Pipeline: feature engineering, model training, tuning, and selection using MLlib. These steps use the training set only.
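One way the pipeline step could look in PySpark is sketched below. The stage choices (tokenizer, hashed term frequencies, IDF, logistic regression), the grid values, the F1 metric, and the function name `build_cross_validator` are assumptions for illustration, not a record of what the notebook does. Only the `label` and `text_new` column names come from the description above. PySpark is imported inside the function because the stages need an active Spark session.

```python
LABEL_COL = "label"
TEXT_COL = "text_new"


def build_cross_validator(num_folds=3):
    """Build a CrossValidator around a text-classification pipeline.

    Requires pyspark and an active SparkSession when called.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Feature engineering: index the string label, then turn text into TF-IDF.
    indexer = StringIndexer(inputCol=LABEL_COL, outputCol="label_idx")
    tokenizer = Tokenizer(inputCol=TEXT_COL, outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")

    # Model training: multinomial logistic regression on the TF-IDF vector.
    lr = LogisticRegression(
        featuresCol="features", labelCol="label_idx", maxIter=20
    )
    pipeline = Pipeline(stages=[indexer, tokenizer, tf, idf, lr])

    # Tuning and selection: small illustrative grid, scored by F1.
    grid = (
        ParamGridBuilder()
        .addGrid(tf.numFeatures, [1 << 12, 1 << 14])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build()
    )
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label_idx", predictionCol="prediction", metricName="f1"
    )
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

With a training DataFrame `train_df` (hypothetical name), `build_cross_validator().fit(train_df).bestModel` gives the selected pipeline model, which can be saved with `.write().overwrite().save(path)` for the streaming step. F1 is used here instead of accuracy because the dataset is described as unbalanced.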

Streaming and Prediction

Output: printed stream predictions.
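The streaming step can be sketched as a per-batch function that parses incoming lines, applies the saved pipeline model, and prints the predictions. This is a hedged illustration: it assumes each streamed record is one JSON object per line containing the three columns named above, and the names `parse_edit` and `predict_batch` are hypothetical.

```python
import json

COLUMNS = ("label", "text_new", "text_old")


def parse_edit(line):
    """Parse one JSON line into the three model columns, or None if unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or any(c not in record for c in COLUMNS):
        return None
    return {c: record[c] for c in COLUMNS}


def predict_batch(rdd, spark, model):
    """Score one micro-batch of raw lines and print label next to prediction.

    `spark` is an active SparkSession and `model` a fitted PipelineModel.
    """
    from pyspark.sql import Row

    rows = (
        rdd.map(parse_edit)
        .filter(lambda r: r is not None)
        .map(lambda r: Row(**r))
    )
    if rows.isEmpty():
        return  # nothing arrived in this batch
    df = spark.createDataFrame(rows)
    model.transform(df).select("label", "prediction").show(truncate=False)
```

With a DStream of text lines, this could be attached as `lines.foreachRDD(lambda rdd: predict_batch(rdd, spark, model))`, where `model` is loaded beforehand with `PipelineModel.load(path)`. Malformed lines are dropped rather than raising, so one bad record does not stop the stream.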


Finterly/Wiki-Edit-Prediction-PySpark

Folders and files

Latest commit

History

Repository files navigation

Spark MLlib and Stream Data Prediction

Overview of Steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages