Employee attrition can reveal hidden issues in a workplace, from poor job satisfaction to work-life balance challenges. This project uncovers key factors that influence employee turnover using PySpark and a linear regression model. Built as part of my graduate coursework in Big Data Analytics, it demonstrates my ability to set up a dual-node Spark environment, perform data preprocessing, and extract meaningful insights from large datasets.
- Apache Spark (with dual-VM cluster: Hadoop1 as NameNode, Hadoop2 as DataNode)
- PySpark for distributed data processing
- Pandas & Matplotlib for visualization
- Scikit-learn for regression modeling
- IBM HR Analytics Employee Attrition Dataset
📥 View Dataset on Kaggle
Records: 1,470+ employee profiles, including fields such as:
- Age
- Job Role
- Monthly Income
- Work-Life Balance
- Job Satisfaction
- Attrition Status
- ✅ Load the dataset into Spark DataFrames using PySpark
- ✅ Preprocess and clean the data (remove missing values, label-encode categories; see the sketch below)
- ✅ Analyze key variables using linear regression
- ✅ Visualize attrition counts using Pandas & Matplotlib
- ✅ Compare single-VM vs. dual-VM Spark performance
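As a quick illustration of the loading and preprocessing steps, here is a minimal PySpark sketch. The file path and column names are assumptions based on the IBM HR Analytics dataset, not the exact values from the project scripts.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("EmployeeAttrition").getOrCreate()

# Load the CSV into a Spark DataFrame (path is illustrative)
df = spark.read.csv("employee_attrition.csv", header=True, inferSchema=True)

# Remove rows with missing values
df = df.dropna()

# Label-encode a categorical column, e.g. Attrition: Yes/No -> 1.0/0.0
indexer = StringIndexer(inputCol="Attrition", outputCol="AttritionIndex")
df = indexer.fit(df).transform(df)

df.select("Age", "MonthlyIncome", "JobSatisfaction", "AttritionIndex").show(5)
```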
- Identified correlations between Monthly Income, Job Satisfaction, and Attrition (see the sketch after this list)
- Regression modeling revealed key predictive features of employee turnover
- Data visualization clearly illustrated attrition patterns
- Running Spark in a dual-VM setup improved processing efficiency and memory usage
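A hypothetical sketch of how such correlations can be checked in PySpark, assuming the `df` DataFrame (with its encoded `AttritionIndex` column) from the preprocessing sketch above:

```python
# Pearson correlation of each numeric field with the encoded attrition label
for col in ["MonthlyIncome", "JobSatisfaction"]:
    print(col, df.stat.corr(col, "AttritionIndex"))
```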
- 💻 Dual-node Spark Setup
- 📊 Employee Attrition Bar Chart (a plotting sketch follows this list)
- 🧹 Data Cleaning in PySpark
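For reference, a minimal sketch of how the attrition bar chart can be produced with Pandas & Matplotlib. It assumes the cleaned `df` from the preprocessing sketch; the output path is illustrative.

```python
import matplotlib.pyplot as plt

# Count employees per attrition status; the result is tiny, so it is
# safe to pull into Pandas for plotting
counts = df.groupBy("Attrition").count().toPandas()

counts.plot(kind="bar", x="Attrition", y="count", legend=False)
plt.ylabel("Number of Employees")
plt.title("Employee Attrition Counts")
plt.tight_layout()
plt.savefig("visuals/attrition_bar_chart.png")  # illustrative path
```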
This project helped me:
- Set up and configure Spark in a virtual cluster
- Apply machine learning techniques to real-world HR data
- Use PySpark to process and analyze large datasets
- Translate insights into meaningful business recommendations
- `project_proposal.docx`: Initial planning and project scope
- `employee_analysis_report.docx`: Final analysis write-up
- `code/`: Spark and Python scripts
- `visuals/`: Screenshot folder (Spark setup, bar chart)
- `README.md`: Project summary and references
The single-VM script runs the machine learning pipeline on a single virtual machine using Apache Spark. It performs data preprocessing, model training, and evaluation on a standalone node to establish a performance baseline.
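A minimal sketch of what such a single-VM run can look like; the local master, feature columns, and file path are assumptions for illustration, not the exact configuration of the project script.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Standalone node: all Spark work stays on the local machine
spark = SparkSession.builder.master("local[*]").appName("AttritionSingleVM").getOrCreate()

df = spark.read.csv("employee_attrition.csv", header=True, inferSchema=True).dropna()
df = StringIndexer(inputCol="Attrition", outputCol="AttritionIndex").fit(df).transform(df)

# Assemble illustrative numeric features into a single vector column
features = VectorAssembler(
    inputCols=["Age", "MonthlyIncome", "JobSatisfaction", "WorkLifeBalance"],
    outputCol="features",
).transform(df)

train, test = features.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="AttritionIndex").fit(train)

rmse = RegressionEvaluator(
    labelCol="AttritionIndex", predictionCol="prediction", metricName="rmse"
).evaluate(model.transform(test))
print(f"Baseline RMSE: {rmse:.3f}")
```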
The dual-VM script runs the same ML pipeline on a dual-VM Spark cluster. It distributes computation to evaluate performance scalability and speed improvements compared to the single-node setup.
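Connecting the same pipeline to the cluster mainly changes the master URL. A sketch, assuming a Spark standalone cluster with Hadoop1 as the master (the hostname and default port 7077 are assumptions):

```python
from pyspark.sql import SparkSession

# Point the driver at the dual-VM cluster instead of local mode
spark = (
    SparkSession.builder
    .master("spark://hadoop1:7077")  # assumed master URL on Hadoop1
    .appName("AttritionDualVM")
    .getOrCreate()
)
# The preprocessing, training, and evaluation steps are identical to the
# single-VM sketch above; Spark distributes the work across both nodes.
```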
Screenshots in the `visuals/` folder highlight runtime performance metrics captured while executing Spark jobs on the single- and dual-VM configurations (a timing sketch follows the list below).
- **Start:End Time Data Loading**: Displays the Python terminal session showing the time measurement for loading the data from CSV into a Spark DataFrame using PySpark.
- **Data Loading Time (Single VM)**: Shows the time taken to load the dataset into Spark on a single VM node, used to benchmark loading performance.
- **Data Cleaning Time (Single VM)**: Illustrates the duration required to clean the data (dropping nulls) in a single-VM environment.
- **Data Analysis Time (Single VM)**: Displays the execution time for analyzing employee attrition counts on a single VM using Spark.
- **Data Loading Time (Dual VM)**: Shows the time taken to load the dataset in a dual-VM Spark environment, for comparison against single-VM performance.
- **Data Cleaning Time (Dual VM)**: Displays the duration of the data cleaning operation across the dual-VM cluster.
- **Data Analysis Time (Dual VM)**: Shows the time required to perform the employee attrition group analysis on the dual-VM Spark setup.
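The timings in these screenshots can be captured with simple wall-clock measurements around each stage. A minimal sketch, assuming the same illustrative CSV path as above; note that an action such as `count()` is needed to force Spark's lazy execution so each stage is actually measured:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AttritionTiming").getOrCreate()

# Data loading time
start = time.perf_counter()
df = spark.read.csv("employee_attrition.csv", header=True, inferSchema=True)
df.count()  # force evaluation so loading is actually timed
print(f"Data loading time: {time.perf_counter() - start:.2f}s")

# Data cleaning time
start = time.perf_counter()
clean = df.dropna()
clean.count()
print(f"Data cleaning time: {time.perf_counter() - start:.2f}s")

# Data analysis time (attrition group counts)
start = time.perf_counter()
clean.groupBy("Attrition").count().show()
print(f"Data analysis time: {time.perf_counter() - start:.2f}s")
```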
Author
Asiana Holloway
Graduate Student – Big Data Analytics
GitHub: AsianaHolloway