Data Science Data Wrangling

Introduction to Data Wrangling

Data Wrangling (also known as Data Munging) is the process of cleaning, transforming, and structuring raw data into a format suitable for analysis. It is a crucial step in the Data Science workflow before applying machine learning models or performing data visualization.

Why is Data Wrangling Important?

✅ Helps in handling messy, inconsistent, and incomplete data.
✅ Improves data quality for better analysis and decision-making.
✅ Makes data suitable for machine learning models.

1. Steps in Data Wrangling

1.1 Importing Data

Data can be loaded from multiple sources like CSV, Excel, databases, or APIs.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Display the first 5 rows
print(data.head())

Try It Now

1.2 Handling Missing Data

Missing data can cause errors in analysis. It can be either removed or imputed.

a) Checking for Missing Values

print(data.isnull().sum())  # Count missing values in each column

Try It Now

b) Removing Missing Values

data_cleaned = data.dropna()  # Remove rows with missing values

Try It Now

🚨 Note: Removing data may result in loss of valuable information.

c) Filling Missing Values (Imputation)

data['Age'].fillna(data['Age'].mean(), inplace=True)  # Fill with mean value
data['City'].fillna("Unknown", inplace=True)  # Fill with a constant value

Try It Now

💡 Use mean/median for numerical data and mode for categorical data.

 

1.3 Handling Duplicates

Duplicate records can lead to incorrect analysis.

print(data.duplicated().sum())  # Count duplicate rows

data = data.drop_duplicates()  # Remove duplicate rows

Try It Now

1.4 Transforming Data (Data Formatting & Standardization)

a) Standardizing Column Names

# Convert all column names to lowercase and remove spaces
data.columns = data.columns.str.lower().str.replace(" ", "_")

Try It Now

b) Converting Data Types

# Convert date column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')

Try It Now

c) Trimming Extra Spaces from Text Columns

data['name'] = data['name'].str.strip()

Try It Now

1.5 Handling Outliers

Outliers are extreme values that can skew data analysis.

a) Detecting Outliers using Box Plot

import matplotlib.pyplot as plt

plt.boxplot(data['salary'])
plt.show()

Try It Now

b) Removing Outliers using the IQR Method

Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
data = data[(data['salary'] >= Q1 - 1.5 * IQR) & (data['salary'] <= Q3 + 1.5 * IQR)]

Try It Now

1.6 Merging and Joining Data

Often, data is spread across multiple files and needs to be combined.

a) Concatenating DataFrames (Appending Rows)

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['D', 'E', 'F']})

combined_df = pd.concat([df1, df2])

Try It Now

b) Merging DataFrames (Joining on Common Columns)

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Department': ['HR', 'IT', 'Finance']})

merged_df = pd.merge(df1, df2, on='ID')

Try It Now

1.7 Pivoting and Reshaping Data

a) Pivot Table (Reshaping Data for Analysis)

pivot_table = data.pivot_table(index='department', values='salary', aggfunc='mean')

Try It Now

b) Melting Data (Converting Wide Data to Long Format)

melted_data = pd.melt(data, id_vars=['ID'], var_name='Attribute', value_name='Value')

Try It Now

2. Automating Data Wrangling

Instead of repeating steps manually, we can create a function.

def clean_and_prepare_data(df):
    # Handle missing values
    df.fillna(df.mean(), inplace=True)

    # Remove duplicates
    df.drop_duplicates(inplace=True)

    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(" ", "_")

    # Convert date columns
    df['date'] = pd.to_datetime(df['date'], errors='coerce')

    return df

# Apply the function
data = clean_and_prepare_data(data)

Try It Now

🚀 This function helps in automatically preparing raw data for analysis!

 

Data Wrangling is a vital step that ensures clean, structured, and meaningful data for analysis and machine learning.