Data Cleaning is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset. Cleaning ensures that data is structured, reliable, and ready for analysis or machine learning models.
🔹 Improves Data Accuracy: Errors and inconsistencies can lead to incorrect conclusions.
🔹 Enhances Model Performance: Clean data leads to better machine learning results.
🔹 Eliminates Redundancy: Duplicate records waste storage and computing resources.
🔹 Ensures Better Decision-Making: Reliable data improves business intelligence.
Let's consider a dataset containing student scores with common issues such as missing values, duplicate records, and outliers.
| Student_ID | Name | Age | Score | Subject | Gender |
|---|---|---|---|---|---|
| 101 | Prince | 26 | 85 | Math | F |
| 102 | Lovnish | | 90 | Science | M |
| 103 | Ravi | 18 | -1 | Math | |
| 104 | Pranav | 19 | 88 | Science | M |
| 105 | Chandan | 17 | 92 | Math | F |
| 106 | Rajat | 20 | 89 | | M |
✅ Missing values in Age, Gender, and Subject columns.
✅ Invalid values (negative score).
✅ Inconsistent formatting and duplicate records.
✅ Outliers in the data.
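Before fixing anything, it can help to count how often each issue occurs. The sketch below rebuilds the sample table in memory and reports the problems listed above. It assumes the blank cells are the Age of student 102, the Gender of student 103, and the Subject of student 106; the `audit` helper is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows transcribed from the table above (assumed cell alignment);
# blank cells are represented as NaN/None.
df = pd.DataFrame({
    "Student_ID": [101, 102, 103, 104, 105, 106],
    "Name": ["Prince", "Lovnish", "Ravi", "Pranav", "Chandan", "Rajat"],
    "Age": [26, np.nan, 18, 19, 17, 20],
    "Score": [85, 90, -1, 88, 92, 89],
    "Subject": ["Math", "Science", "Math", "Science", "Math", None],
    "Gender": ["F", "M", None, "M", "F", "M"],
})


def audit(frame: pd.DataFrame) -> dict:
    """Count the issues listed above without modifying the data."""
    return {
        "missing": frame.isnull().sum().to_dict(),              # per-column NaN counts
        "negative_scores": int((frame["Score"] < 0).sum()),     # invalid scores
        "duplicate_rows": int(frame.duplicated().sum()),        # exact duplicate rows
    }


report = audit(df)
print(report)
```

For these six rows the report shows one missing value each in Age, Subject, and Gender, one negative score, and no exact duplicate rows.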
    import pandas as pd

    # Load the dataset
    df = pd.read_csv('student_scores.csv')

    # Check missing values in each column
    print(df.isnull().sum())

For numerical data (Age), fill missing values with the median:

    # Assign the result back; calling fillna(inplace=True) on a selected
    # column is deprecated in recent pandas versions
    df['Age'] = df['Age'].fillna(df['Age'].median())

For categorical data, fill Gender with the mode and Subject with an 'Unknown' placeholder:

    df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
    df['Subject'] = df['Subject'].fillna('Unknown')

Remove invalid and duplicate records, then standardize text formatting:

    df = df[df['Score'] >= 0].copy()  # Remove rows where Score is negative (missing scores are dropped too)
    df = df.drop_duplicates()
    df['Name'] = df['Name'].str.title()  # Capitalize names
    df['Subject'] = df['Subject'].str.capitalize()  # Standardize subjects

Using IQR (Interquartile Range) to detect outliers:

    Q1 = df['Score'].quantile(0.25)
    Q3 = df['Score'].quantile(0.75)
    IQR = Q3 - Q1
    # Keep only scores within 1.5 * IQR of the quartiles
    df = df[(df['Score'] >= Q1 - 1.5 * IQR) & (df['Score'] <= Q3 + 1.5 * IQR)].copy()

Convert data types and save the cleaned data:

    df['Student_ID'] = df['Student_ID'].astype(str)  # Convert Student_ID to string
    df['Age'] = df['Age'].astype(int)  # Convert Age to integer (safe once missing ages are filled)
    df.to_csv('cleaned_student_scores.csv', index=False)  # Save to a new file

- Handled missing values (numerical → median, categorical → mode or an 'Unknown' placeholder)
- Removed invalid and duplicate records
- Standardized text formatting
- Detected and removed outliers using IQR
- Converted data types correctly
- Saved the cleaned data for further analysis
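
The steps summarized above can also be collected into a single function, which makes the cleaning repeatable on new files. This is a minimal sketch, not a definitive implementation: `clean_scores` is a hypothetical name, and the demo rows are an illustrative variant of the sample table (one lowercase name and subject, plus one duplicated row) rather than real data.

```python
import numpy as np
import pandas as pd


def clean_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()

    # 1. Missing values: median for Age, mode for Gender, placeholder for Subject
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Gender"] = out["Gender"].fillna(out["Gender"].mode()[0])
    out["Subject"] = out["Subject"].fillna("Unknown")

    # 2. Invalid and duplicate records
    out = out[out["Score"] >= 0].drop_duplicates().copy()

    # 3. Text formatting
    out["Name"] = out["Name"].str.title()
    out["Subject"] = out["Subject"].str.capitalize()

    # 4. Outliers: keep scores within 1.5 * IQR of the quartiles
    q1, q3 = out["Score"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = out[out["Score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # 5. Data types
    out = out.astype({"Student_ID": str, "Age": int})
    return out.reset_index(drop=True)


# Illustrative input (hypothetical variant of the sample table)
raw = pd.DataFrame({
    "Student_ID": [101, 102, 103, 104, 105, 106, 104],
    "Name": ["Prince", "Lovnish", "Ravi", "Pranav", "chandan", "Rajat", "Pranav"],
    "Age": [26, np.nan, 18, 19, 17, 20, 19],
    "Score": [85, 90, -1, 88, 92, 89, 88],
    "Subject": ["Math", "Science", "Math", "Science", "math", None, "Science"],
    "Gender": ["F", "M", None, "M", "F", "M", "M"],
})

cleaned = clean_scores(raw)
print(cleaned)
```

Because the function works on a copy and returns a new DataFrame, the raw data stays available for comparison. Writing the result out with `cleaned.to_csv(..., index=False)` is left to the caller.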
With this guide, you can efficiently clean and preprocess datasets, ensuring higher accuracy in data analysis and machine learning models.