Data Cleaning in Data Science

Data Cleaning is a crucial step in the Data Science pipeline. Raw data often contains missing values, duplicate entries, outliers, and inconsistencies, which can lead to incorrect conclusions and poor model performance. Cleaning the data ensures better accuracy and reliability.

1. Why is Data Cleaning Important?

  • Improves Data Quality: Removes errors, inconsistencies, and duplicate records.
  • Enhances Model Performance: Clean data leads to better predictions in machine learning.
  • Prevents Biased Insights: Ensures the data accurately represents real-world patterns.

2. Common Issues in Data Cleaning & Solutions

2.1 Handling Missing Data

Missing data can occur due to errors in data collection, system failures, or human mistakes.

a) Checking for Missing Values

import pandas as pd

# Load dataset
data = pd.read_csv("data.csv")

# Check for missing values
print(data.isnull().sum())

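On large datasets, raw counts can be hard to judge at a glance. A quick sketch showing each column's share of missing values instead, using the same data frame as above:

# Percentage of missing values per column
print(data.isnull().mean() * 100)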

b) Removing Missing Values

# Remove rows with missing values
data_cleaned = data.dropna()


🚨 Use this method only if a few rows are missing; otherwise, valuable data might be lost.
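
If dropping every incomplete row is too aggressive, dropna can be narrowed. A minimal sketch, assuming Age and Salary are the columns that matter for the analysis (adjust to your dataset):

# Drop rows only when key columns are missing
data_cleaned = data.dropna(subset=['Age', 'Salary'])

# Or keep only rows that have at least 3 non-missing values
data_cleaned = data.dropna(thresh=3)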

c) Filling Missing Values (Imputation)

# Fill missing values with the column mean
# (assignment is used because inplace=True on a column slice is deprecated in recent pandas)
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Fill missing values with the column median
data['Salary'] = data['Salary'].fillna(data['Salary'].median())

# Fill missing categorical values with the most frequent value
data['City'] = data['City'].fillna(data['City'].mode()[0])


💡 Best Practice: Use mean/median for numerical data and mode for categorical data.
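
One way to choose between mean and median is to check how skewed the column is; strongly skewed columns (such as salaries) are usually better imputed with the median. A quick check:

# Skewness far from 0 suggests imputing with the median rather than the mean
print(data['Salary'].skew())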

 

2.2 Removing Duplicate Data

Duplicate data can lead to biased analysis and incorrect predictions.

a) Checking for Duplicates

# Find duplicate rows
print(data.duplicated().sum())


b) Removing Duplicates

# Remove duplicate rows
data = data.drop_duplicates()

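Sometimes rows count as duplicates only on a few key columns. A hedged sketch, assuming Name and City together identify a record in this dataset:

# Treat rows with the same Name and City as duplicates, keeping the first occurrence
data = data.drop_duplicates(subset=['Name', 'City'], keep='first')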

2.3 Handling Inconsistent Data

Inconsistent data occurs when the same information is stored in different formats.

Example:

Name    City        Country
John    New York    USA
John    NY          United States
John    New York    USA

Fixing Inconsistencies

# Standardizing city names
data['City'] = data['City'].replace({'NY': 'New York'})

# Standardizing country names
data['Country'] = data['Country'].replace({'United States': 'USA'})


🚀 Use regular expressions or mapping dictionaries for large-scale cleaning.
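
For example, one mapping dictionary can standardize many variants in a single pass, and str.replace with regex=True handles pattern-based fixes. A sketch using hypothetical variants:

# One dictionary covering all known city variants
city_map = {'NY': 'New York', 'NYC': 'New York', 'SF': 'San Francisco'}
data['City'] = data['City'].replace(city_map)

# Regex: collapse repeated whitespace inside names
data['Name'] = data['Name'].str.replace(r'\s+', ' ', regex=True)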

2.4 Handling Outliers

Outliers are extreme values that can distort analysis.

a) Detecting Outliers using Box Plot

import matplotlib.pyplot as plt

# Drop missing values first; boxplot does not handle NaN
plt.boxplot(data['Salary'].dropna())
plt.show()


b) Removing Outliers using the IQR Method

Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
data = data[(data['Salary'] >= Q1 - 1.5 * IQR) & (data['Salary'] <= Q3 + 1.5 * IQR)]


🚨 Removing outliers can sharpen an analysis, but it may also discard genuine extreme values; check whether an outlier is an error before dropping it.
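
When removal would throw away too many rows, an alternative is to cap (winsorize) extreme values at the IQR fences instead of dropping them. A sketch reusing Q1, Q3, and IQR from above:

# Cap outliers at the IQR fences instead of removing rows
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
data['Salary'] = data['Salary'].clip(lower=lower, upper=upper)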

 

2.5 Standardizing Data Formats

Different formats in date/time, text, and numerical values can cause errors.

a) Standardizing Date Formats

# Parse the date column into proper datetime values (assuming a YYYY-MM-DD layout)
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')

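If some values do not match the expected layout, pd.to_datetime raises an error by default; one option is to coerce bad values to NaT and inspect them afterwards. A minimal sketch:

# Turn unparseable dates into NaT instead of raising an error
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')

# Count how many dates failed to parse
print(data['Date'].isnull().sum())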

b) Converting Text to Lowercase

# Convert text to lowercase
data['City'] = data['City'].str.lower()


c) Trimming Extra Spaces

# Remove leading and trailing spaces
data['Name'] = data['Name'].str.strip()


3. Automating Data Cleaning Using a Function

def clean_data(df):
    # Fill missing numeric values with each column's median
    # (categorical gaps are left untouched here)
    df = df.fillna(df.median(numeric_only=True))

    # Remove duplicate rows
    df = df.drop_duplicates()

    # Standardize text cells: strip extra spaces and lowercase
    # (DataFrame.map requires pandas >= 2.1; use applymap on older versions)
    df = df.map(lambda x: x.strip().lower() if isinstance(x, str) else x)

    return df

# Apply the function
data = clean_data(data)


🔥 This function automates the basic cleaning steps and is a solid starting point for most tabular datasets.

 

Data Cleaning is a critical step in Data Science. It ensures accurate analysis, reduces errors, and improves the performance of machine learning models.