Introduction to Data Wrangling
Data wrangling (also known as data munging) is the process of cleaning, transforming, and structuring raw data into a format suitable for analysis. It is a crucial step in the data science workflow and comes before training machine learning models or building visualizations.
Why is Data Wrangling Important?
✅ Helps in handling messy, inconsistent, and incomplete data.
✅ Improves data quality for better analysis and decision-making.
✅ Makes data suitable for machine learning models.
1. Steps in Data Wrangling
1.1 Importing Data
Data can be loaded from multiple sources like CSV, Excel, databases, or APIs.
import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Display the first 5 rows
print(data.head())
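The same pattern applies to other sources. Here is a minimal sketch, assuming an in-memory CSV string and a throwaway SQLite database (both made up for illustration, as are the table and column names). `pd.read_excel` and `pd.read_json` follow the same shape for Excel files and JSON responses.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "id,name\n1,Ana\n2,Ben\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for a real one
conn = sqlite3.connect(":memory:")
from_csv.to_sql("people", conn, index=False)

# Load the result of a SQL query straight into a DataFrame
from_db = pd.read_sql_query("SELECT id, name FROM people", conn)
conn.close()

print(from_db)
```

Whatever the source, the result is an ordinary DataFrame, so the steps that follow work the same way.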
1.2 Handling Missing Data
Missing values can raise errors or silently bias results. Affected rows can be removed, or the gaps can be filled in (imputed).
a) Checking for Missing Values
print(data.isnull().sum()) # Count missing values in each column
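Raw counts are easier to judge next to the share of rows they represent. A small sketch on a made-up DataFrame (the `sample` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
sample = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "City": ["Pune", "Delhi", None, "Mumbai"],
})

# Number of missing values per column
missing_counts = sample.isnull().sum()

# Fraction of rows missing per column (mean of True/False flags)
missing_share = sample.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column that is mostly empty may be a candidate for dropping entirely, while a column with a few gaps is usually better imputed.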
b) Removing Missing Values
data_cleaned = data.dropna() # Remove rows with missing values
🚨 Note: Removing data may result in loss of valuable information.
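To limit that loss, `dropna` can be restricted to the columns that matter or to rows that are mostly empty. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last row is empty in every column
sample = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "City": ["Pune", "Delhi", None, None],
    "Salary": [1000, 2000, 3000, np.nan],
})

# Default: drop every row that has any missing value
strict = sample.dropna()

# Drop a row only when the column we depend on is missing
age_required = sample.dropna(subset=["Age"])

# Keep rows that have at least 2 non-missing values
mostly_complete = sample.dropna(thresh=2)

print(len(strict), len(age_required), len(mostly_complete))
```

Here the default call keeps a single row, while the two targeted variants keep two and three rows respectively.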
c) Filling Missing Values (Imputation)
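# Assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not update the DataFrame
data['Age'] = data['Age'].fillna(data['Age'].mean())  # Fill with mean value
data['City'] = data['City'].fillna("Unknown")  # Fill with a constant value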
💡 Use the mean or median for numerical data (the median is less sensitive to outliers) and the mode or a placeholder such as "Unknown" for categorical data.
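A sketch of median and mode imputation, assuming a small made-up DataFrame with one extreme age:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 is an extreme value that would pull the mean up
sample = pd.DataFrame({
    "Age": [20, 30, np.nan, 100],
    "City": ["Pune", "Pune", None, "Delhi"],
})

# Median of the observed ages (20, 30, 100) is 30; the mean would be 50
sample["Age"] = sample["Age"].fillna(sample["Age"].median())

# mode() returns a Series because ties are possible, so take the first entry
sample["City"] = sample["City"].fillna(sample["City"].mode()[0])

print(sample)
```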
1.3 Handling Duplicates
Duplicate records can lead to incorrect analysis.
print(data.duplicated().sum())  # Count duplicate rows
data = data.drop_duplicates()  # Remove duplicate rows
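By default only rows that match in every column count as duplicates. When records repeat on a key but differ elsewhere, `subset` and `keep` control which row survives. A sketch on made-up data (the `Updated` column is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: ID 1 appears twice with different update dates
sample = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Name": ["A", "A", "B", "C"],
    "Updated": ["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"],
})

# No row is an exact copy of another, so this count is zero
exact_duplicates = sample.duplicated().sum()

# Treat rows with the same ID as duplicates and keep the last one seen
by_id = sample.drop_duplicates(subset=["ID"], keep="last")

print(exact_duplicates)
print(by_id)
```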
1.4 Transforming Data (Data Formatting & Standardization)
a) Standardizing Column Names
# Convert all column names to lowercase and replace spaces with underscores
data.columns = data.columns.str.lower().str.replace(" ", "_")
b) Converting Data Types
# Convert date column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
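Real columns often contain a few values that do not parse. A sketch, on made-up data, of using `errors="coerce"` so that bad entries become missing values instead of raising an error:

```python
import pandas as pd

# Hypothetical data with one bad date and one non-numeric salary
sample = pd.DataFrame({
    "date": ["2024-01-05", "not a date", "2024-03-10"],
    "salary": ["1000", "2,000", "n/a"],
})

# Unparseable dates become NaT instead of raising an error
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d", errors="coerce")

# Remove thousands separators, then convert; unparseable values become NaN
sample["salary"] = pd.to_numeric(sample["salary"].str.replace(",", ""), errors="coerce")

print(sample)
print(sample.dtypes)
```

The coerced values can then be handled with the missing-data techniques from section 1.2.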
c) Trimming Extra Spaces from Text Columns
data['name'] = data['name'].str.strip()
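The column-name and whitespace steps can be combined. A sketch on made-up data, with one assumed refinement: the column names are stripped first so that leading or trailing spaces do not turn into underscores.

```python
import pandas as pd

# Hypothetical data with stray spaces in both headers and values
sample = pd.DataFrame({
    " Full Name ": ["  Ana ", "Ben  "],
    "Home City": [" Pune", "Delhi "],
})

# Strip, lowercase, then replace the remaining inner spaces
sample.columns = sample.columns.str.strip().str.lower().str.replace(" ", "_")

# Trim whitespace in each text column
for col in ["full_name", "home_city"]:
    sample[col] = sample[col].str.strip()

print(sample)
```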
1.5 Handling Outliers
Outliers are extreme values that can skew data analysis.
a) Detecting Outliers using Box Plot
import matplotlib.pyplot as plt

# Drop missing values first; the box cannot be computed from NaN
plt.boxplot(data['salary'].dropna())
plt.show()
b) Removing Outliers using the IQR Method
Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers: keep rows within 1.5 * IQR of the quartiles
data = data[(data['salary'] >= Q1 - 1.5 * IQR) & (data['salary'] <= Q3 + 1.5 * IQR)]
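A worked example of the IQR rule on a small made-up salary series. It flags outliers rather than dropping them, so they can be inspected first:

```python
import pandas as pd

# Hypothetical salaries; 200 is far from the rest
salary = pd.Series([30, 32, 35, 36, 38, 40, 200], name="salary")

q1 = salary.quantile(0.25)   # 33.5
q3 = salary.quantile(0.75)   # 39.0
iqr = q3 - q1                # 5.5

lower = q1 - 1.5 * iqr       # 25.25
upper = q3 + 1.5 * iqr       # 47.25

# True for values outside the [lower, upper] range
is_outlier = (salary < lower) | (salary > upper)

print(salary[is_outlier])
```

Only the value 200 falls outside the range. Whether to drop it, cap it, or keep it depends on whether it is a data error or a genuine observation.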
1.6 Merging and Joining Data
Often, data is spread across multiple files and needs to be combined.
a) Concatenating DataFrames (Appending Rows)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['D', 'E', 'F']})

# ignore_index=True renumbers the rows 0 to 5 instead of repeating 0 to 2
combined_df = pd.concat([df1, df2], ignore_index=True)
b) Merging DataFrames (Joining on Common Columns)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Department': ['HR', 'IT', 'Finance']})

merged_df = pd.merge(df1, df2, on='ID')
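When the two tables do not share exactly the same keys, the `how` argument decides which rows survive. A sketch on made-up data where ID 1 has no department and ID 4 has no salary:

```python
import pandas as pd

# Hypothetical tables with only partly overlapping IDs
salaries = pd.DataFrame({"ID": [1, 2, 3], "Salary": [1000, 2000, 3000]})
departments = pd.DataFrame({"ID": [2, 3, 4], "Department": ["IT", "Finance", "HR"]})

# Default inner join: only IDs present in both tables
inner = pd.merge(salaries, departments, on="ID")

# Left join: keep every salary row; indicator=True adds a _merge column
# showing whether each row found a match
left = pd.merge(salaries, departments, on="ID", how="left", indicator=True)

print(inner)
print(left)
```

Checking the `_merge` column after a join is a quick way to spot keys that failed to match.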
1.7 Pivoting and Reshaping Data
a) Pivot Table (Reshaping Data for Analysis)
pivot_table = data.pivot_table(index='department', values='salary', aggfunc='mean')
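A small worked example of the same call on made-up staff data:

```python
import pandas as pd

# Hypothetical staff records
staff = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT"],
    "salary": [1000, 3000, 2000, 4000],
})

# One row per department, holding the mean salary
summary = staff.pivot_table(index="department", values="salary", aggfunc="mean")

print(summary)
```

The result is indexed by department, so a single value can be read with `summary.loc["HR", "salary"]`.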
b) Melting Data (Converting Wide Data to Long Format)
# Column names were lowercased in step 1.4, so the identifier column is 'id'
melted_data = pd.melt(data, id_vars=['id'], var_name='Attribute', value_name='Value')
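A worked example of melting on a made-up table of exam scores, followed by a pivot back to the wide layout to show that the two operations are inverses:

```python
import pandas as pd

# Hypothetical wide table: one column per subject
wide = pd.DataFrame({
    "ID": [1, 2],
    "math": [90, 70],
    "physics": [80, 60],
})

# Wide to long: one row per (ID, subject) pair
long = pd.melt(wide, id_vars=["ID"], var_name="Attribute", value_name="Value")

# Long back to wide: one row per ID, one column per subject
back = long.pivot(index="ID", columns="Attribute", values="Value").reset_index()

print(long)
print(back)
```

The long format has four rows (two IDs times two subjects), which is the shape most plotting and grouping functions expect.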
2. Automating Data Wrangling
Instead of repeating steps manually, we can create a function.
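def clean_and_prepare_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Handle missing values: fill numeric columns with their mean
    # (numeric_only=True avoids an error on text columns)
    df = df.fillna(df.mean(numeric_only=True))
    # Remove duplicates
    df = df.drop_duplicates()
    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    # Convert the date column; unparseable values become NaT
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    return df

# Apply the function
data = clean_and_prepare_data(data)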
🚀 This function helps in automatically preparing raw data for analysis!
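As the number of steps grows, one option is to split them into small functions and chain them with `DataFrame.pipe`. This is a sketch of that idea, not the only way to structure it; the helper names and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def standardize_columns(df):
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out


def fill_numeric_with_mean(df):
    """Return a copy with numeric NaNs replaced by each column's mean."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


# Hypothetical raw data: a repeated row and a missing salary
raw = pd.DataFrame({
    "Emp Name": ["Ana", "Ana", "Ben"],
    "Salary": [1000.0, 1000.0, np.nan],
})

# Each step takes a DataFrame and returns a new one, so they chain cleanly
clean = (
    raw.pipe(standardize_columns)
       .pipe(fill_numeric_with_mean)
       .drop_duplicates()
)

print(clean)
```

Because every step returns a new DataFrame, `raw` stays untouched and each helper can be tested on its own.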
Data Wrangling is a vital step that ensures clean, structured, and meaningful data for analysis and machine learning.