Data preprocessing is a crucial step in Data Science that involves cleaning, transforming, and preparing raw data for analysis and machine learning models. Because poor data quality leads to inaccurate predictions, preprocessing is an essential part of any workflow.
1. Why is Data Preprocessing Important?
- Real-world data is often incomplete, inconsistent, and contains errors.
- Machine learning models perform better with clean and structured data.
- Helps in handling missing values, outliers, and redundant information.
2. Steps in Data Preprocessing
2.1 Importing Data
Before preprocessing, we need to load the dataset.
Example using Pandas (Python):
import pandas as pd

# Load dataset from a CSV file
data = pd.read_csv("data.csv")

# Display the first 5 rows
print(data.head())
2.2 Handling Missing Data
Missing values can bias analysis, and many machine learning models cannot handle them at all.
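Before picking a strategy, it helps to quantify how much is missing. A minimal sketch, using a small hypothetical DataFrame in place of data.csv:

```python
import pandas as pd
import numpy as np

# Hypothetical sample frame standing in for data.csv
data = pd.DataFrame({
    'Age': [25, np.nan, 35],
    'Salary': [50000, 60000, np.nan],
})

# Count missing values per column
missing_counts = data.isnull().sum()
print(missing_counts)

# Fraction of missing values per column
print(data.isnull().mean())
```

Columns with only a handful of gaps are candidates for imputation; columns that are mostly empty may be better dropped entirely.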
a) Removing Missing Values
# Remove rows with missing values
data_cleaned = data.dropna()
- Useful when missing values are few; dropping many rows discards information.
b) Filling Missing Values (Imputation)
# Fill missing values with the column mean
# (assignment avoids pandas' chained-assignment warning with inplace=True)
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Fill missing values with a constant
data['Salary'] = data['Salary'].fillna(0)
- Can also use median or mode for filling missing values.
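For example, the median is more robust to outliers than the mean, and the mode (most frequent value) suits categorical columns. A sketch with hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data with numeric and categorical gaps
data = pd.DataFrame({
    'Age': [22, np.nan, 30, 28],
    'City': ['Paris', 'Paris', None, 'Lyon'],
})

# Median imputation for a numeric column
data['Age'] = data['Age'].fillna(data['Age'].median())

# Mode imputation for a categorical column
data['City'] = data['City'].fillna(data['City'].mode()[0])
```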
2.3 Handling Outliers
Outliers are extreme values that can distort analysis.
Detecting Outliers using Box Plot
import matplotlib.pyplot as plt

# Drop NaNs first: boxplot does not handle missing values
plt.boxplot(data['Salary'].dropna())
plt.show()
Removing Outliers using IQR Method
Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within 1.5 * IQR of the quartiles
data = data[(data['Salary'] >= Q1 - 1.5 * IQR) & (data['Salary'] <= Q3 + 1.5 * IQR)]
2.4 Encoding Categorical Data
Most machine learning models require numerical input, so categorical data must be converted into numbers.
a) Label Encoding (For Binary Categories)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Classes are sorted alphabetically: Female → 0, Male → 1
data['Gender'] = label_encoder.fit_transform(data['Gender'])
b) One-Hot Encoding (For Multiple Categories)
data = pd.get_dummies(data, columns=['Country'])
- Converts categorical columns into multiple binary (0/1) columns.
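A quick sketch of what get_dummies produces, using a small hypothetical Country column with three categories:

```python
import pandas as pd

# Hypothetical frame with a three-category column
data = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

encoded = pd.get_dummies(data, columns=['Country'])
print(encoded.columns.tolist())
# → ['Country_France', 'Country_Germany', 'Country_Spain']
```

Each row has exactly one 1 across the new columns, marking its original category.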
2.5 Feature Scaling
Scaling puts all numerical features on comparable ranges, which is especially important for distance-based models and gradient-based optimization.
a) Standardization (Mean = 0, Std Dev = 1)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])
b) Normalization (Values between 0 and 1)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])
3. Data Preprocessing Pipeline
Combining multiple steps into a pipeline automates preprocessing.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define numerical and categorical features
num_features = ['Age', 'Salary']
cat_features = ['Country']

# Numerical pipeline: impute missing values, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: one-hot encode, ignoring unseen categories
cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

# Apply preprocessing
processed_data = preprocessor.fit_transform(data)
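One practical note: with a train/test split, the pipeline should be fitted on the training data only and then reused on new data, so test-set statistics never leak into preprocessing. A sketch with small hypothetical train/test frames:

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical train/test frames
train = pd.DataFrame({'Age': [25, 30, np.nan],
                      'Salary': [40000, 50000, 60000],
                      'Country': ['France', 'Spain', 'France']})
test = pd.DataFrame({'Age': [28], 'Salary': [45000], 'Country': ['Germany']})

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('scaler', StandardScaler())])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['Age', 'Salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country']),
])

# Learn imputation means, scaling statistics, and categories from train only
X_train = preprocessor.fit_transform(train)

# Reuse them on new data; the unseen 'Germany' encodes as all zeros
X_test = preprocessor.transform(test)
```

Because handle_unknown='ignore' was set, a category never seen during fitting is encoded as an all-zero row instead of raising an error.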
Data preprocessing is a mandatory step before building machine learning models. Properly cleaned and transformed data improves accuracy and reliability.