Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It helps you understand the structure, patterns, and relationships within a dataset before you apply machine learning models.
1. What is EDA?
EDA involves summarizing, visualizing, and interpreting datasets to uncover insights. The key goals are:
- Identifying missing values and outliers
- Understanding data distributions
- Detecting patterns, trends, and correlations
- Preparing data for further analysis
2. Steps in EDA
| Step | Description |
|---|---|
| Load the Data | Import data from CSV, Excel, SQL, etc. |
| Understand the Structure | Check data types, dimensions, and basic info |
| Handle Missing Data | Identify and deal with null values |
| Detect Outliers | Use box plots, IQR, or Z-score |
| Visualize Distributions | Histograms, density plots, box plots |
| Find Relationships | Correlation matrix, scatter plots |
| Feature Engineering | Transform or create new features |
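The last row of the table, feature engineering, can be illustrated with a minimal sketch. The column names (`order_date`, `price`, `quantity`) and the toy values are hypothetical and only stand in for whatever the real dataset contains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these columns are assumptions for illustration only
df = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-20", "2024-03-05"],
    "price": [100.0, 2500.0, 40.0],
    "quantity": [2, 1, 5],
})

# Create a new feature by combining two existing columns
df["revenue"] = df["price"] * df["quantity"]

# Transform a feature: log1p compresses a long right tail
df["log_price"] = np.log1p(df["price"])

# Extract a date part after parsing the text column as datetimes
df["order_date"] = pd.to_datetime(df["order_date"])
df["order_month"] = df["order_date"].dt.month

print(df)
```

Which transformations are useful depends on the dataset and the model, so this is one possible starting point rather than a fixed recipe.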
3. Performing EDA with Python
3.1. Load and Inspect the Data
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Display first few rows
print(df.head())
# Get dataset shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# Summary of dataset (info() prints directly and returns None, so no print() is needed)
df.info()
# Check for missing values
print(df.isnull().sum())
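Descriptive statistics are a natural companion to the inspection calls above. The sketch below uses a small hypothetical DataFrame in place of `data.csv` (the columns `age`, `income`, and `city` are assumptions) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "age": [23, 35, 41, 29, None],
    "income": [42000.0, 58000.0, 61000.0, 39000.0, 75000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = df.describe()
print(numeric_summary)

# Frequency of each category in a text column
city_counts = df["city"].value_counts()
print(city_counts)

# Share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

Expressing missing values as a percentage is often easier to read than the raw counts from `isnull().sum()` when columns differ greatly in size or importance.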
3.2. Handling Missing Data
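# Drop rows that contain any missing value
df_cleaned = df.dropna()
# Fill missing numeric values with each column's mean
# (numeric_only=True skips text columns, which have no mean)
df_filled = df.fillna(df.mean(numeric_only=True))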
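Mean imputation only applies to numeric columns and is sensitive to extreme values. As an alternative sketch, the example below fills numeric columns with the median and text columns with the most frequent value. The toy data and column names are hypothetical, and this is one possible strategy, not the only reasonable one.

```python
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a text column
df = pd.DataFrame({
    "age": [23.0, None, 41.0, 29.0, 35.0],
    "city": ["Pune", "Delhi", None, "Pune", "Mumbai"],
})

df_filled = df.copy()
for col in df_filled.columns:
    if pd.api.types.is_numeric_dtype(df_filled[col]):
        # The median is less affected by outliers than the mean
        df_filled[col] = df_filled[col].fillna(df_filled[col].median())
    else:
        # mode() returns a Series of the most frequent values; take the first
        df_filled[col] = df_filled[col].fillna(df_filled[col].mode().iloc[0])

print(df_filled)
```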
3.3. Detecting Outliers
import seaborn as sns
import matplotlib.pyplot as plt
# Box plot to detect outliers
plt.figure(figsize=(8, 5))
sns.boxplot(data=df)
plt.show()
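The steps table also lists IQR and Z-score as outlier detection methods. A minimal numeric sketch of both follows, using a hypothetical series with one obviously extreme value. The 1.5 × IQR fences and the |z| > 3 cutoff are common conventions, not fixed rules.

```python
import pandas as pd

# Hypothetical measurements: 19 typical values plus one extreme value (95)
values = pd.Series(
    [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
     12, 11, 10, 12, 11, 13, 12, 11, 10, 95],
    name="score",
)

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

The extreme value itself inflates the mean and standard deviation, so the Z-score method can miss outliers in small samples. The IQR method is based on quartiles and is usually more robust in that situation.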
3.4. Visualizing Data Distributions
# Histogram to check distribution
df.hist(figsize=(10, 8), bins=30)
plt.show()
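A histogram can be complemented with a few numbers that describe the same shape. The sketch below computes skewness and selected quantiles on hypothetical data. The column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one symmetric column and one with a long right tail
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "right_skewed": [1, 1, 1, 2, 2, 3, 4, 8, 30],
})

# Skewness near 0 suggests symmetry; positive values indicate a right tail
skewness = df.skew()
print(skewness)

# Quantiles show where most of the values are concentrated
quantiles = df.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
print(quantiles)
```

A strongly skewed column is often a candidate for a transformation such as a logarithm before modeling, although whether that helps depends on the model.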
3.5. Checking Correlations
# Correlation matrix (numeric columns only; text columns cannot be correlated)
corr_matrix = df.corr(numeric_only=True)
# Heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()
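With many columns a heatmap becomes hard to read, so it can help to list the strongest pairs directly. The following sketch ranks column pairs by absolute correlation. The toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; random_id is deliberately unrelated to the others
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 77, 90],
    "shoe_size": [36, 38, 41, 43, 45],
    "random_id": [3, 1, 5, 2, 4],
})

corr_matrix = df.corr(numeric_only=True)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack().dropna()

# Sort by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

A high correlation only indicates a linear association between two variables. It does not show that one causes the other.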
Summary
- EDA helps in understanding data before applying models.
- Use descriptive statistics to summarize data.
- Visualizations (histograms, box plots, scatter plots, heatmaps) provide insights.
- Handling missing values and outliers improves data quality.
- Correlation analysis identifies relationships between variables.