Data Cleaning in Data Science

Data Cleaning is a crucial step in the Data Science pipeline. Raw data often contains missing values, duplicate entries, outliers, and inconsistencies, which can lead to incorrect conclusions and poor model performance. Cleaning the data ensures better accuracy and reliability.

1. Why is Data Cleaning Important?

  • Improves Data Quality: Removes errors, inconsistencies, and duplicate records.
  • Enhances Model Performance: Clean data leads to better predictions in machine learning.
  • Prevents Biased Insights: Ensures the data accurately represents real-world patterns.

2. Common Issues in Data Cleaning & Solutions

2.1 Handling Missing Data

Missing data can occur due to errors in data collection, system failures, or human mistakes.

a) Checking for Missing Values

import pandas as pd

# Load dataset
data = pd.read_csv("data.csv")

# Check for missing values
print(data.isnull().sum())

b) Removing Missing Values

# Remove rows with missing values
data_cleaned = data.dropna()

🚨 Use this method only when just a few rows have missing values; otherwise, valuable data might be lost. A threshold-based alternative is sketched below.
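
If many rows contain gaps, a middle ground is to keep rows that still have most of their values. A minimal sketch, assuming you want rows with at least 80% of columns filled (the threshold is an illustrative choice, not a rule):

# Keep only rows with at least 80% of their values present
# (the 0.8 threshold is an example; tune it to your dataset)
min_values = int(data.shape[1] * 0.8)
data_cleaned = data.dropna(thresh=min_values)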

c) Filling Missing Values (Imputation)

# Assigning the result back avoids pandas' deprecated inplace-on-a-slice pattern

# Fill missing values with the column mean
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Fill missing values with the column median
data['Salary'] = data['Salary'].fillna(data['Salary'].median())

# Fill missing categorical values with the most frequent value (mode)
data['City'] = data['City'].fillna(data['City'].mode()[0])

💡 Best Practice: Use the mean for roughly symmetric numerical data, the median for skewed numerical data, and the mode for categorical data.


2.2 Removing Duplicate Data

Duplicate data can lead to biased analysis and incorrect predictions.

a) Checking for Duplicates

# Find duplicate rows
print(data.duplicated().sum())

b) Removing Duplicates

# Remove duplicate rows
data = data.drop_duplicates()
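
Duplicates are sometimes defined by key columns rather than entire rows. A minimal sketch, assuming a hypothetical 'CustomerID' key column:

# Treat rows with the same CustomerID as duplicates, keeping the first occurrence
# ('CustomerID' is a hypothetical key column used for illustration)
data = data.drop_duplicates(subset=['CustomerID'], keep='first')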

2.3 Handling Inconsistent Data

Inconsistent data occurs when the same information is stored in different formats.

Example:

Name    City        Country
John    New York    USA
John    NY          United States
John    New York    USA

Fixing Inconsistencies

# Standardizing city names
data['City'] = data['City'].replace({'NY': 'New York'})

# Standardizing country names
data['Country'] = data['Country'].replace({'United States': 'USA'})

🚀 Use regular expressions or mapping dictionaries for large-scale cleaning.
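
For example, a mapping dictionary can normalize several known variants in one pass, and a regular expression can catch spelling variations. A minimal sketch; the variants and patterns below are illustrative, not exhaustive:

# Map known variants to a canonical city name
city_map = {'NY': 'New York', 'NYC': 'New York', 'new york': 'New York'}
data['City'] = data['City'].replace(city_map)

# Use a regular expression to collapse variants like "U.S.", "US", or "U.S.A."
data['Country'] = data['Country'].str.replace(r'^U\.?S\.?A?\.?$', 'USA', regex=True)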

2.4 Handling Outliers

Outliers are extreme values that can distort analysis.

a) Detecting Outliers using Box Plot

import matplotlib.pyplot as plt

# Drop NaNs first: matplotlib's boxplot does not ignore missing values
plt.boxplot(data['Salary'].dropna())
plt.show()

b) Removing Outliers using the IQR Method

Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within Tukey's fences (1.5 * IQR beyond the quartiles)
data = data[(data['Salary'] >= Q1 - 1.5 * IQR) & (data['Salary'] <= Q3 + 1.5 * IQR)]

🚨 Inspect outliers before removing them: extreme values caused by data-entry errors can be dropped safely, but legitimate extremes may carry useful information.
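
If you would rather not lose rows, one common alternative (a sketch, not part of the original method) is to cap extreme values at the IQR fences instead of removing them:

# Cap (winsorize) values at the IQR fences instead of dropping rows
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
data['Salary'] = data['Salary'].clip(lower=lower, upper=upper)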


2.5 Standardizing Data Formats

Different formats in date/time, text, and numerical values can cause errors.

a) Standardizing Date Formats

# Convert date column to standard format
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
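
A strict format string raises an error if the column mixes formats. A hedged alternative is to let pandas infer the format and convert unparseable entries to NaT so they can be inspected:

# Parse mixed date formats; invalid entries become NaT instead of raising
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')

# Inspect rows whose dates are missing or failed to parse
print(data[data['Date'].isna()])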

b) Converting Text to Lowercase

# Convert text to lowercase
data['City'] = data['City'].str.lower()

c) Trimming Extra Spaces

# Remove leading and trailing spaces
data['Name'] = data['Name'].str.strip()

3. Automating Data Cleaning Using a Function

def clean_data(df):
    # Fill missing numeric values with each column's median
    # (numeric_only=True avoids errors on text columns)
    df = df.fillna(df.median(numeric_only=True))

    # Remove duplicate rows
    df = df.drop_duplicates()

    # Standardize text values: lowercase and trim whitespace
    # (DataFrame.map replaces the deprecated applymap in pandas >= 2.1)
    df = df.map(lambda x: x.lower().strip() if isinstance(x, str) else x)

    return df

# Apply the function
data = clean_data(data)

🔥 This function automates the most common cleaning steps. Note its limits: it imputes only numeric columns, so missing categorical values still need the mode-based handling shown earlier.


Data Cleaning is a critical step in Data Science. It ensures accurate analysis, reduces errors, and improves the performance of machine learning models.