Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It reveals the structure, patterns, and relationships within a dataset before any machine learning model is applied.
1. What is EDA?
EDA involves summarizing, visualizing, and interpreting datasets to uncover insights. The key goals are:
- Identifying missing values and outliers
- Understanding data distributions
- Detecting patterns, trends, and correlations
- Preparing data for further analysis
2. Steps in EDA
Step | Description |
---|---|
Load the Data | Import data from CSV, Excel, SQL, etc. |
Understand the Structure | Check data types, dimensions, and basic info |
Handle Missing Data | Identify null values, then drop or impute them |
Detect Outliers | Use box plots, IQR, or Z-score |
Visualize Distributions | Histograms, density plots, box plots |
Find Relationships | Correlation matrix, scatter plots |
Feature Engineering | Transform or create new features |
3. Performing EDA with Python
3.1. Load and Inspect the Data
    import pandas as pd

    # Load the dataset
    df = pd.read_csv("data.csv")

    # Display the first few rows
    print(df.head())

    # Get the dataset shape
    print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

    # Summary of the dataset (info() prints directly and returns None)
    df.info()

    # Count missing values per column
    print(df.isnull().sum())
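The structure checks above can be collected into one per-column overview, followed by descriptive statistics. This is a minimal sketch: the `sample` frame is made-up data standing in for the contents of data.csv, and `summarize_columns` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Made-up data standing in for the DataFrame loaded from data.csv (assumption).
sample = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Lima", "Lima", None],
    "income": [52000.0, 61000.0, 58000.0, 75000.0],
})

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),  # nunique() ignores missing values by default
    })

overview = summarize_columns(sample)
print(overview)

# Descriptive statistics (count, mean, std, quartiles) for the numeric columns.
print(sample.describe())
```

Reading missing counts and distinct values side by side makes it easier to spot columns that are mostly empty or that hold a single constant value.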
3.2. Handling Missing Data
    # Drop rows that contain any missing value
    df_cleaned = df.dropna()

    # Fill missing values in numeric columns with the column mean
    df_filled = df.fillna(df.mean(numeric_only=True))
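Mean imputation only applies to numeric columns and is sensitive to extreme values. One possible alternative, sketched here under assumptions, fills numeric columns with the median and all other columns with the most frequent value. The `sample` data and the helper name `impute_by_type` are illustrative, not part of the original workflow.

```python
import pandas as pd

# Made-up data with one missing value per column (assumption).
sample = pd.DataFrame({
    "age": [25.0, None, 40.0, 100.0],
    "city": ["Lima", "Lima", None, "Oslo"],
})

def impute_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-null column has no mode to fill with
                out[col] = out[col].fillna(modes.iloc[0])
    return out

filled = impute_by_type(sample)
print(filled)
```

Whether dropping or imputing is appropriate depends on how much data is missing and why, so treat this as a starting point rather than a default.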
3.3. Detecting Outliers
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Box plot to detect outliers: points beyond the whiskers are candidates
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df)
    plt.show()
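The steps table also lists the IQR and Z-score methods, which flag outliers numerically rather than visually. The sketch below shows both on a made-up series. The function names, the 1.5 multiplier, and the threshold of 3 are common conventions used here as assumptions, not values taken from the text above.

```python
import pandas as pd

# Made-up data: 19 values between 10 and 13 plus one extreme value (assumption).
values = pd.Series([10, 12, 11, 13, 12, 11] * 3 + [10, 95], name="score")

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute Z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)  # population standard deviation
    return z.abs() > threshold

print(values[iqr_outliers(values)])
print(values[zscore_outliers(values)])
```

The Z-score rule needs enough rows to be useful: with ten or fewer values, no point can exceed |z| = 3 under the population standard deviation, so the IQR rule is often the safer choice for small samples.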
3.4. Visualizing Data Distributions
    # Histogram of each numeric column to check its distribution
    df.hist(figsize=(10, 8), bins=30)
    plt.show()
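Histograms often reveal right-skewed columns, and the steps table mentions transforming features. As a rough sketch, skewness can be quantified with `Series.skew()` and reduced with a log transform. The `income` values are made up for illustration, and `np.log1p` is just one possible transform; it assumes non-negative values.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for a real column (assumption).
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 120, 400], name="income")

raw_skew = income.skew()        # positive skew means a longer right tail
log_income = np.log1p(income)   # log(1 + x), defined for x >= 0
log_skew = log_income.skew()

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

Comparing the two skew values before and after gives a quick check on whether the transform actually made the distribution more symmetric.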
3.5. Checking Correlations
    # Correlation matrix of the numeric columns
    corr_matrix = df.corr(numeric_only=True)

    # Heatmap of the pairwise correlations
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
    plt.show()
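With many columns a heatmap becomes hard to read, so it can help to list the strongest pairs directly. The following is a minimal sketch: the `sample` data is made up and `top_correlations` is a hypothetical helper, not a pandas or seaborn function.

```python
from itertools import combinations

import pandas as pd

# Made-up data; the non-numeric "name" column is excluded from the correlation.
sample = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 80, 88],
    "shoe_size": [41, 37, 44, 39, 42],
    "name": ["a", "b", "c", "d", "e"],
})

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # Visit each unordered pair once, which skips the diagonal and mirrored entries.
    pairs = pd.Series(
        {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    )
    return pairs.sort_values(key=lambda v: v.abs(), ascending=False).head(n)

strongest = top_correlations(sample, n=2)
print(strongest)
```

Pairs that rank highly here are good candidates for a scatter plot, since a correlation coefficient alone can hide non-linear patterns or the influence of a few extreme points.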
Summary
- EDA helps in understanding data before applying models.
- Use descriptive statistics such as df.describe() to summarize data.
- Visualizations (histograms, box plots, scatter plots, heatmaps) provide insights.
- Handling missing values and outliers improves data quality.
- Correlation analysis identifies relationships between variables.