Data Science Data Wrangling

Introduction to Data Wrangling

Data Wrangling (also known as Data Munging) is the process of cleaning, transforming, and structuring raw data into a format suitable for analysis. It is a crucial step in the Data Science workflow before applying machine learning models or performing data visualization.

Why is Data Wrangling Important?

✅ Helps in handling messy, inconsistent, and incomplete data.
✅ Improves data quality for better analysis and decision-making.
✅ Makes data suitable for machine learning models.

1. Steps in Data Wrangling

1.1 Importing Data

Data can be loaded from multiple sources like CSV, Excel, databases, or APIs.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Display the first 5 rows
print(data.head())

1.2 Handling Missing Data

Missing data can cause errors in analysis. It can be either removed or imputed.

a) Checking for Missing Values

print(data.isnull().sum())  # Count missing values in each column

b) Removing Missing Values

data_cleaned = data.dropna()  # Remove rows with missing values

🚨 Note: Removing data may result in loss of valuable information.

c) Filling Missing Values (Imputation)

data['Age'].fillna(data['Age'].mean(), inplace=True)  # Fill with mean value
data['City'].fillna("Unknown", inplace=True)  # Fill with a constant value

💡 Use mean/median for numerical data and mode for categorical data.

1.3 Handling Duplicates

Duplicate records can lead to incorrect analysis.

print(data.duplicated().sum())  # Count duplicate rows

data = data.drop_duplicates()  # Remove duplicate rows

1.4 Transforming Data (Data Formatting & Standardization)

a) Standardizing Column Names

# Convert all column names to lowercase and remove spaces
data.columns = data.columns.str.lower().str.replace(" ", "_")

b) Converting Data Types

# Convert date column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')

c) Trimming Extra Spaces from Text Columns

data['name'] = data['name'].str.strip()

1.5 Handling Outliers

Outliers are extreme values that can skew data analysis.

a) Detecting Outliers using Box Plot

import matplotlib.pyplot as plt

plt.boxplot(data['salary'])
plt.show()

b) Removing Outliers using the IQR Method

Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
data = data[(data['salary'] >= Q1 - 1.5 * IQR) & (data['salary'] <= Q3 + 1.5 * IQR)]

1.6 Merging and Joining Data

Often, data is spread across multiple files and needs to be combined.

a) Concatenating DataFrames (Appending Rows)

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['D', 'E', 'F']})

combined_df = pd.concat([df1, df2])

b) Merging DataFrames (Joining on Common Columns)

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Department': ['HR', 'IT', 'Finance']})

merged_df = pd.merge(df1, df2, on='ID')

1.7 Pivoting and Reshaping Data

a) Pivot Table (Reshaping Data for Analysis)

pivot_table = data.pivot_table(index='department', values='salary', aggfunc='mean')

b) Melting Data (Converting Wide Data to Long Format)

melted_data = pd.melt(data, id_vars=['ID'], var_name='Attribute', value_name='Value')

2. Automating Data Wrangling

Instead of repeating steps manually, we can create a function.

def clean_and_prepare_data(df):
    # Handle missing values
    df.fillna(df.mean(), inplace=True)

    # Remove duplicates
    df.drop_duplicates(inplace=True)

    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(" ", "_")

    # Convert date columns
    df['date'] = pd.to_datetime(df['date'], errors='coerce')

    return df

# Apply the function
data = clean_and_prepare_data(data)

🚀 This function helps in automatically preparing raw data for analysis!

Data Wrangling is a vital step that ensures clean, structured, and meaningful data for analysis and machine learning.