Data preprocessing is a crucial step in Data Science that involves cleaning, transforming, and preparing raw data for analysis and machine learning models. Because poor data quality leads to inaccurate predictions, preprocessing is an essential part of any workflow.
1. Why is Data Preprocessing Important?
- Real-world data is often incomplete, inconsistent, and contains errors.
- Machine learning models perform better with clean and structured data.
- Helps in handling missing values, outliers, and redundant information.
2. Steps in Data Preprocessing
2.1 Importing Data
Before preprocessing, we need to load the dataset.
Example using Pandas (Python):
import pandas as pd

# Load dataset from a CSV file
data = pd.read_csv("data.csv")

# Display the first 5 rows
print(data.head())
2.2 Handling Missing Data
Missing values can bias analysis, and many machine learning models cannot handle them at all.
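Before picking a strategy, it helps to quantify how much is missing. A minimal sketch, using a small hypothetical DataFrame in place of data.csv:

```python
import pandas as pd
import numpy as np

# Hypothetical sample frame standing in for data.csv
data = pd.DataFrame({
    'Age': [25, np.nan, 35],
    'Salary': [50000, 60000, np.nan],
})

# Count missing values per column
missing_counts = data.isnull().sum()
print(missing_counts)

# Fraction of missing values per column
print(data.isnull().mean())
```

Columns with only a handful of gaps are candidates for imputation; columns that are mostly empty may be better dropped entirely.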
a) Removing Missing Values
# Remove rows with missing values
data_cleaned = data.dropna()
- Useful when missing values are few; dropping many rows discards information.
b) Filling Missing Values (Imputation)
# Fill missing values with the column mean
# (assignment avoids pandas' chained-assignment warning with inplace=True)
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Fill missing values with a constant
data['Salary'] = data['Salary'].fillna(0)
- Can also use median or mode for filling missing values.
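For example, the median is more robust to outliers than the mean, and the mode (most frequent value) suits categorical columns. A sketch with hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data with numeric and categorical gaps
data = pd.DataFrame({
    'Age': [22, np.nan, 30, 28],
    'City': ['Paris', 'Paris', None, 'Lyon'],
})

# Median imputation for a numeric column
data['Age'] = data['Age'].fillna(data['Age'].median())

# Mode imputation for a categorical column
data['City'] = data['City'].fillna(data['City'].mode()[0])
```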
2.3 Handling Outliers
Outliers are extreme values that can distort analysis.
Detecting Outliers using Box Plot
import matplotlib.pyplot as plt

# Drop NaNs first: boxplot does not handle missing values
plt.boxplot(data['Salary'].dropna())
plt.show()
Removing Outliers using IQR Method
Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within 1.5 * IQR of the quartiles
data = data[(data['Salary'] >= Q1 - 1.5 * IQR) & (data['Salary'] <= Q3 + 1.5 * IQR)]
2.4 Encoding Categorical Data
Most machine learning models require numerical input, so categorical data must be converted into numbers.
a) Label Encoding (For Binary Categories)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Classes are sorted alphabetically: Female → 0, Male → 1
data['Gender'] = label_encoder.fit_transform(data['Gender'])
b) One-Hot Encoding (For Multiple Categories)
data = pd.get_dummies(data, columns=['Country'])
- Converts categorical columns into multiple binary (0/1) columns.
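A quick sketch of what get_dummies produces, using a small hypothetical Country column with three categories:

```python
import pandas as pd

# Hypothetical frame with a three-category column
data = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

encoded = pd.get_dummies(data, columns=['Country'])
print(encoded.columns.tolist())
# → ['Country_France', 'Country_Germany', 'Country_Spain']
```

Each row has exactly one 1 across the new columns, marking its original category.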
2.5 Feature Scaling
Scaling puts all numerical features on comparable ranges, which is especially important for distance-based models and gradient-based optimization.
a) Standardization (Mean = 0, Std Dev = 1)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])
b) Normalization (Values between 0 and 1)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])
3. Data Preprocessing Pipeline
Combining multiple steps into a pipeline automates preprocessing.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define numerical and categorical features
num_features = ['Age', 'Salary']
cat_features = ['Country']

# Numerical pipeline: impute missing values, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: one-hot encode, ignoring unseen categories
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

# Apply preprocessing
processed_data = preprocessor.fit_transform(data)
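One practical note: with a train/test split, the pipeline should be fitted on the training data only and then reused on new data, so test-set statistics never leak into preprocessing. A sketch with small hypothetical train/test frames:

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical train/test frames
train = pd.DataFrame({'Age': [25, 30, np.nan],
                      'Salary': [40000, 50000, 60000],
                      'Country': ['France', 'Spain', 'France']})
test = pd.DataFrame({'Age': [28], 'Salary': [45000], 'Country': ['Germany']})

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('scaler', StandardScaler())])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country']),
])

# Learn imputation means, scaling statistics, and categories from train only
X_train = preprocessor.fit_transform(train)

# Reuse them on new data; the unseen 'Germany' encodes as all zeros
X_test = preprocessor.transform(test)
```

Because handle_unknown='ignore' was set, a category never seen during fitting is encoded as an all-zero row instead of raising an error.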
Data preprocessing is a mandatory step before building machine learning models. Properly cleaned and transformed data improves accuracy and reliability.