Data Science Decision Trees

1. What Is a Decision Tree?

🔹 A Decision Tree is a supervised learning algorithm used for both classification and regression.
🔹 It mimics human decision-making by splitting data into branches based on conditions.
🔹 The structure consists of Nodes, Edges, and Leaves:

Root Node → The starting point (entire dataset).
Decision Nodes → Intermediate nodes where a decision is made.
Leaf Nodes → The final output (class or value).

Examples of Decision Tree Applications

✅ Spam Detection (Spam / Not Spam)
✅ Loan Approval (Approve / Reject)
✅ Disease Prediction (Yes / No)

2. How Does a Decision Tree Work?

1️⃣ Find the Best Feature to Split

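The algorithm scores candidate splits with an impurity measure (Gini Impurity or Entropy) and picks the feature and threshold that reduce impurity the most. When Entropy is used, this reduction is called Information Gain.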

2️⃣ Split Data Based on Feature

Each split creates branches leading to more splits or final classifications.

3️⃣ Stop When Conditions Are Met

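The tree stops splitting when a stopping condition is met, such as reaching the maximum depth, having too few samples in a node, or reaching a pure node (all samples belong to one class).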
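
Steps 1 and 2 can be sketched in plain Python for a single numeric feature. This is a minimal illustration, not how scikit-learn implements it: the helper names `gini` and `best_threshold` and the toy data are hypothetical, and a real tree would repeat this search over every feature and then recurse into each branch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    Returns (None, parent_gini) if no split lowers the impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two equal values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the split = size-weighted average of the child impurities
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data (hypothetical): one feature, two classes
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_threshold(values, labels)
print("Best threshold:", threshold)
print("Weighted Gini after split:", impurity)
```

On this toy data the best threshold falls midway between 3.0 and 10.0, where both branches become pure.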

Mathematical Formulas

1️⃣ Gini Impurity (Used in Classification Trees)

\( \text{Gini} = 1 - \sum_{i=1}^{K} p_i^2 \)

\( p_i \) → Probability of a class in a node.
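
As a worked example, assume a hypothetical node holding 10 samples, 8 of class A and 2 of class B, so that \( p_A = 0.8 \) and \( p_B = 0.2 \):

```latex
\text{Gini} = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32
```

A pure node (all samples in one class) has a Gini of 0, and an evenly mixed two-class node has the two-class maximum of 0.5.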

2️⃣ Entropy (Used to Compute Information Gain; Alternative to Gini)

\( \text{Entropy} = -\sum_{i=1}^{K} p_i \log_2(p_i) \)
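
The sketch below computes Entropy for a node and the Information Gain of one candidate split, defined as the parent's entropy minus the size-weighted entropy of the children. The function names and the spam/ham counts are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Hypothetical node: 5 spam and 5 ham emails, split into two branches
parent = ["spam"] * 5 + ["ham"] * 5
left = ["spam"] * 4 + ["ham"] * 1
right = ["spam"] * 1 + ["ham"] * 4

parent_entropy = entropy(parent)
gain = information_gain(parent, left, right)
print("Parent entropy:", parent_entropy)
print("Information gain:", round(gain, 4))
```

The split with the highest Information Gain is the one the tree would choose at that node.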

3️⃣ Decision Tree for Regression (Mean Squared Error – MSE)

\( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

Explanation of the Formula

\( MSE \):
- The Mean Squared Error, a measure of the average squared difference between the actual and predicted values.
\( n \):
- The number of data points (observations).
\( y_i \):
- The actual value of the \( i \)-th observation.
\( \hat{y}_i \):
- The predicted value of the \( i \)-th observation. In a regression tree, this is the mean target value of the leaf the observation falls into.
\( (y_i - \hat{y}_i)^2 \):
- The squared difference between the actual and predicted values.
\( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \):
- The average of the squared differences over all observations.
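
To see how a regression tree uses this formula, the sketch below compares the MSE of an unsplit node with the size-weighted MSE of its two children, assuming each node predicts the mean of its targets. The helper `node_mse` and the target values are hypothetical.

```python
def node_mse(targets):
    """MSE of a node that predicts the mean of its own target values."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

# Hypothetical targets: two clearly separated groups
targets = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
left, right = targets[:3], targets[3:]

parent_mse = node_mse(targets)
# Size-weighted average of the child MSEs after the split
child_mse = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(targets)

print("MSE before split:", round(parent_mse, 4))
print("MSE after split:", round(child_mse, 4))
```

A regression tree prefers the split that lowers this weighted MSE the most, in the same way a classification tree prefers the split that lowers impurity the most.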

3. Implementing Decision Trees in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris

Step 2: Load Dataset (Iris Dataset – Classification Example)

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 3: Split Data into Training and Testing Sets

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Decision Tree Classifier

# Create Decision Tree model
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

Step 5: Make Predictions

# Predict on test data
y_pred = model.predict(X_test)

Step 6: Evaluate Model Performance

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 7: Visualizing the Decision Tree

plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()
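
If a plot is not convenient (for example, in a terminal session), the same tree can be printed as text rules with `sklearn.tree.export_text`. The sketch below retrains the same model so that it runs on its own; the variable name `rules` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the classifier from the steps above so this block is self-contained
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Print the learned splits as indented if/else style rules
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line shows a feature and threshold test, and each leaf line shows the predicted class.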

4. Decision Tree Regression Example

Step 1: Load Sample Dataset (California Housing Data)

from sklearn.datasets import fetch_california_housing

# Load dataset (downloaded and cached locally on first use)
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 2: Split Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the Decision Tree Regressor

# Create Decision Tree Regressor model
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)

# Train the model
regressor.fit(X_train, y_train)

Step 4: Make Predictions and Evaluate Model

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

5. Understanding the Output

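🔹 Accuracy Score → The fraction of test samples the classifier labeled correctly (1.0 means every prediction was right).
🔹 Classification Report → Precision, Recall, and F1-score for each class.
🔹 Decision Tree Visualization → Shows the feature and threshold tested at each node, along with the impurity, sample count, and predicted class.
🔹 Mean Squared Error (Regression) → The average squared difference between actual and predicted values; lower is better, and 0 means a perfect fit.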

6. Advantages & Disadvantages of Decision Trees

✅ Advantages

✔ Easy to understand and visualize.
✔ No need for feature scaling (like normalization).
✔ Handles both classification and regression problems.
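✔ Can handle missing values and categorical data in principle. In scikit-learn, categorical features must be encoded as numbers first (for example, with one-hot encoding), and native missing-value support depends on the library version.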

❌ Disadvantages

❌ Prone to overfitting (use max_depth or pruning).
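❌ Predictions are piecewise constant: continuous features are split at thresholds, so a tree approximates smooth relationships with steps and cannot extrapolate beyond the range of the training data.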
❌ Sensitive to noisy data (can lead to unnecessary splits).
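
One way to apply pruning in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter. The sketch below grows a full tree on Iris, asks it for candidate alpha values, and refits with one of them. Picking the middle alpha is an arbitrary choice for illustration; in practice the value would be chosen with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grow an unrestricted tree, then compute the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary illustrative choice: the middle candidate (clamped to be non-negative)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
pruned_tree.fit(X_train, y_train)

print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning:", pruned_tree.get_n_leaves())
print("Test accuracy (full):", full_tree.score(X_test, y_test))
print("Test accuracy (pruned):", pruned_tree.score(X_test, y_test))
```

A larger `ccp_alpha` removes more branches, trading some training fit for a simpler tree.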

Summary

✔ Decision Trees are used for both classification and regression problems.
✔ They work by splitting data into branches using Gini Impurity or Entropy for classification, and MSE for regression.
✔ sklearn.tree.DecisionTreeClassifier and DecisionTreeRegressor make it easy to implement in Python.
✔ Overfitting can be reduced using max_depth, min_samples_split, or pruning (ccp_alpha in scikit-learn).