Data Science Jobs Salary Estimator: Overview

Created a tool that estimates data science salaries (MAE ~ $27.6K) from the details of a job posting.
Scraped over 1,000 job descriptions from Glassdoor using Python and Selenium.
Engineered features from the text of each job description to quantify the value companies put on Python, Excel, SQL, AWS, Spark, Tableau, and machine learning.
Tuned Linear, Lasso, and Ridge Regression, Decision Tree, Random Forest, and Gradient Boosting regressors with GridSearchCV to find the best model.
Built a client-facing web application with Flask featuring an interactive UI for salary predictions.

Code and Resources Used

Python Version: 3.12 (also compatible with 3.9+)

Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle

Scikit-learn Version: 1.7.2

Web Framework Requirements: use a virtual environment (venv) and install the packages as needed

Project Reference

Scraper Github

EDA

Flask Productionization

Web Scraping

Scraped 1,080 job postings from glassdoor.com. For each job, the following fields were collected:

Job title
Salary Estimate
Job Description
Rating
Company
Location
Company Size
Company Founded Date
Type of Ownership
Industry
Sector
Revenue

Data Cleaning

After scraping the data, I cleaned it up so that it was usable for the model. I made the following changes and created the following variables:

Removed duplicates and rows without salary
Parsed numeric data out of salary
Made columns for employer provided salary and hourly wages
Transformed founded date into age of company
Replaced -1 with Unknown in the following columns:
- Size
- Type of ownership
- Industry
- Sector
- Revenue
Parsed rating out of company text
Made a new column for company state
Added columns for simplified job title and seniority
Made columns for if different skills were listed in the job description:
- Python
- Excel
- SQL
- Tableau
- AWS
- Spark
- Machine Learning
Column for description length
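
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the code from data_cleaning.ipynb: the raw salary formats ("$53K-$91K (Glassdoor est.)", "Employer Provided Salary:...", "... Per Hour"), the conversion of hourly wages to annual figures, and the names parse_salary, company_age, and skill_flags are all assumptions made for the example.

```python
import re

# Hypothetical skill patterns; word boundaries keep "excel" from matching "excellent".
SKILL_PATTERNS = {
    "python": r"\bpython\b",
    "excel": r"\bexcel\b",
    "sql": r"\bsql\b",
    "tableau": r"\btableau\b",
    "aws": r"\baws\b",
    "spark": r"\bspark\b",
    "machine_learning": r"\bmachine learning\b",
}


def parse_salary(text):
    """Parse a raw salary string (format assumed) into numeric columns in $K."""
    lowered = text.lower()
    hourly = int("per hour" in lowered)
    employer_provided = int("employer provided salary" in lowered)
    numbers = [int(n) for n in re.findall(r"\$(\d+)", text)]
    if not numbers:
        return None  # row without a usable salary
    low, high = min(numbers), max(numbers)
    if hourly:
        # Assumption: 2,000 working hours per year, so $/hour * 2 gives $K/year.
        low, high = low * 2, high * 2
    return {
        "min_salary": low,
        "max_salary": high,
        "avg_salary": (low + high) / 2,
        "hourly": hourly,
        "employer_provided": employer_provided,
    }


def company_age(founded_year, current_year):
    """Company age in years; -1 (the missing-value marker) is passed through."""
    return current_year - founded_year if founded_year > 0 else -1


def skill_flags(description):
    """Return 0/1 flags for whether each skill appears in the description."""
    return {
        name: int(bool(re.search(pattern, description, re.IGNORECASE)))
        for name, pattern in SKILL_PATTERNS.items()
    }
```

With pandas, functions like these would typically be applied column-wise, for example through Series.apply on the salary and description columns.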

EDA

I analyzed how salary varies with different variables. Below are a few highlights.

Model Building

Transformed the categorical variables into dummy variables, and split the data into train and test sets with a test size of 20%.
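
As a rough sketch of that step, the snippet below one-hot encodes a toy frame with pandas and holds out 20% of the rows. The column names (rating, job_state, python, avg_salary), the data values, and the random_state are hypothetical stand-ins, not the contents of data_cleaned.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "rating": [3.5, 4.1, 3.9, 4.4, 3.2, 4.0, 3.7, 4.6, 3.8, 4.2],
    "job_state": ["CA", "NY", "CA", "WA", "TX", "NY", "CA", "WA", "TX", "CA"],
    "python": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1],
    "avg_salary": [120, 110, 85, 135, 70, 115, 105, 140, 75, 125],
})

# get_dummies expands each categorical column into one indicator column per level.
X = pd.get_dummies(df.drop(columns="avg_salary"))
y = df["avg_salary"]

# Hold out 20% of the rows for testing; the seed only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```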

Tried different models and evaluated them using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
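
For reference, the two metrics can be written out in a few lines of plain Python. The sample values are made up for illustration; they show how a single large miss pulls RMSE well above MAE, which is one plausible reading of the gap between the two numbers reported later.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large misses."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up salaries in $K; the last prediction is off by 50.
actual = [100, 120, 80, 150]
predicted = [110, 115, 90, 100]

mae = mean_absolute_error(actual, predicted)       # 18.75
rmse = root_mean_squared_error(actual, predicted)  # sqrt(681.25), about 26.1
```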

Models Evaluated

Linear Regression
Lasso Regression
Ridge Regression
Decision Tree
Random Forest
Gradient Boosting
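
The overview mentions tuning these models with GridSearchCV. A minimal sketch of what that can look like for the Random Forest is shown below, assuming MAE as the selection metric. The synthetic data, the parameter grid, and the 3-fold setting are illustrative choices, not the values used in model_building.ipynb.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the encoded job features.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Hypothetical grid; the real search space is not stated in this README.
param_grid = {
    "n_estimators": [20, 50],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a positive MAE
```

The same pattern applies to the other regressors by swapping the estimator and its grid.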

Model Performance

Of all the models, Random Forest and Gradient Boosting performed best, with Random Forest slightly ahead.

Random Forest: MAE 27.59, RMSE 54.03 (in thousands of USD)
Gradient Boosting: MAE 29.77, RMSE 56.91 (in thousands of USD)

Productionization

Built a Flask web application with both API endpoints and an interactive web interface:

Web UI

Modern, responsive web interface at http://127.0.0.1:5000/
Interactive form for inputting job details (company rating, age, skills, etc.)
Real-time salary predictions displayed in the browser
Gradient-styled design aimed at a user-friendly experience

API Endpoints

/predict - POST endpoint for raw 165-feature array predictions
/predict_simple - POST endpoint for simplified form data (automatically encodes features)
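
A client call to /predict might be assembled as sketched below, using only the standard library. The JSON body shape ({"input": [...]}) and the helper name build_predict_request are assumptions for illustration; check app.py and request.py for the payload the app actually expects.

```python
import json
from urllib.request import Request

N_FEATURES = 165  # feature count stated for the /predict endpoint


def build_predict_request(features, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    # Assumption: the endpoint reads the feature array from an "input" key.
    body = json.dumps({"input": list(features)}).encode("utf-8")
    return Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder all-zero feature vector, only to show the call shape.
req = build_predict_request([0.0] * N_FEATURES)
```

With the Flask app running, passing req to urllib.request.urlopen would send it; the response format is not documented here, so it is left unparsed.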

Running the Application

Set up virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install flask numpy pandas scikit-learn

Retrain the model (optional):
```
python retrain_model.py
```
Run the Flask app:
```
cd FlaskAPI
python app.py
```
Access the web interface: Open your browser and navigate to http://127.0.0.1:5000/

Project Structure

DataScience_Job/
├── FlaskAPI/
│   ├── app.py                 # Flask application with web UI and API
│   ├── models/
│   │   └── model_file.p      # Trained Random Forest model
│   ├── templates/
│   │   └── index.html        # Web interface
│   ├── data_input.py         # Sample input data
│   └── request.py            # API testing script
├── retrain_model.py          # Script to retrain model with current sklearn version
├── data_cleaning.ipynb       # Data cleaning pipeline
├── eda.ipynb                 # Exploratory data analysis
├── model_building.ipynb      # Model training and evaluation
├── data_cleaned.csv          # Cleaned dataset
└── README.md

Conclusion

Both the Random Forest and Gradient Boosting models performed reasonably well at predicting average salaries (MAE of 27.59 and 29.77, respectively). The Random Forest model was chosen for deployment because of its slightly lower error. The web application gives users an accessible way to get salary estimates from job characteristics, making the model's predictions available to non-technical users.

