Data Science Decision Trees

🔹 A Decision Tree is a supervised learning algorithm used for both classification and regression.
🔹 It mimics human decision-making by splitting data into branches based on conditions.
🔹 The structure consists of nodes, edges (branches), and leaves:

  • Root Node → The starting point (entire dataset).
  • Decision Nodes → Intermediate nodes where a decision is made.
  • Leaf Nodes → The final output (class or value).

Examples of Decision Tree Applications

✅ Spam Detection (Spam / Not Spam)
✅ Loan Approval (Approve / Reject)
✅ Disease Prediction (Yes / No)

2. How Does a Decision Tree Work?

1️⃣ Find the Best Feature to Split

  • Uses Gini Impurity or Entropy (Information Gain) to choose the best feature.

2️⃣ Split Data Based on Feature

  • Each split creates branches leading to more splits or final classifications.

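3️⃣ Stop When Conditions Are Met

  • The tree stops splitting when a stopping condition is met, such as reaching the maximum depth, a node becoming pure (all samples in one class), or a node holding too few samples to split.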
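
The three steps above can be sketched in plain Python for a single numeric feature. This is only an illustration of the idea, not how scikit-learn implements it; the helper names `gini` and `best_split` and the toy data are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the split is the size-weighted average of the child impurities.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (hypothetical): petal lengths and their class labels.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, score = best_split(petal_length, species)
print("Best threshold:", threshold)
print("Weighted Gini after split:", round(score, 3))
```

A full tree would repeat this search over every feature, split the data at the winning threshold, and recurse into each side until a stopping condition applies.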

Mathematical Formulas

1️⃣ Gini Impurity (Used in Classification Trees)

\( \text{Gini} = 1 - \sum_{i=1}^{K} p_i^2 \)

\( p_i \) → Proportion of samples in the node that belong to class \( i \); \( K \) is the number of classes.
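
As a worked example with assumed numbers, take a node holding 10 samples, 8 of class A and 2 of class B (hypothetical counts), so \( p_A = 0.8 \) and \( p_B = 0.2 \):

\( \text{Gini} = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32 \)

A pure node (all samples in one class) gives \( \text{Gini} = 0 \), and an even two-class split gives \( \text{Gini} = 1 - (0.5^2 + 0.5^2) = 0.5 \), the largest value possible with two classes.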

2️⃣ Entropy (Used to Compute Information Gain; Alternative to Gini)

\( \text{Entropy} = -\sum_{i=1}^{K} p_i \log_2(p_i) \)
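
Information gain is the entropy of a node minus the size-weighted entropy of its children after a split. The sketch below computes both for a made-up spam example; the function names `entropy` and `information_gain` and the label counts are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node: 4 spam and 4 ham emails, split into two groups of 4.
parent = ["spam"] * 4 + ["ham"] * 4
left = ["spam"] * 3 + ["ham"] * 1
right = ["spam"] * 1 + ["ham"] * 3

print("Parent entropy:", entropy(parent))
print("Information gain:", round(information_gain(parent, left, right), 3))
```

The split with the highest information gain is the one an entropy-based tree would prefer.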

3️⃣ Decision Tree for Regression (Mean Squared Error, MSE)

\( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

Explanation of the Formula

  1. \( MSE \):
    • The Mean Squared Error, a measure of the average squared difference between the actual and predicted values.
  2. \( n \):
    • The number of data points (observations).
  3. \( y_i \):
    • The actual value of the \( i \)-th observation.
  4. \( \hat{y}_i \):
    • The predicted value of the \( i \)-th observation.
  5. \( (y_i - \hat{y}_i)^2 \):
    • The squared difference between the actual and predicted values.
  6. \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \):
    • The average of the squared differences over all observations.
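
To make the formula concrete, the following sketch computes the MSE term by term. The function name `mse_manual` and the four value pairs are made up for illustration.

```python
def mse_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 6.0]

# Squared differences are 1, 0, 4, 1, so the mean is 6 / 4.
mse = mse_manual(y_true, y_pred)
print("MSE:", mse)
```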

 

3. Implementing Decision Trees in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris

Step 2: Load Dataset (Iris Dataset – Classification Example)

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 3: Split Data into Training and Testing Sets

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Decision Tree Classifier

# Create Decision Tree model
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

Step 5: Make Predictions

# Predict on test data
y_pred = model.predict(X_test)

Step 6: Evaluate Model Performance

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
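
A confusion matrix complements the accuracy score and classification report by showing which classes get mistaken for which. The sketch below is self-contained and repeats the data split and model settings used in the steps above; `confusion_matrix` is part of `sklearn.metrics`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and model settings as in the earlier steps.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(cm)
```

Counts on the diagonal are correct predictions; any off-diagonal count is a misclassification.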

Step 7: Visualizing the Decision Tree

plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()
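
When a plot is inconvenient (for example in a terminal or a log file), the same tree can be printed as indented text rules with `export_text` from `sklearn.tree`. This sketch fits the model on the full Iris dataset for brevity, which is an assumption that differs from the train/test workflow above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(iris.data, iris.target)  # fitted on all rows only to keep the example short

# Each indented line is one split condition; leaf lines show the predicted class.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```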

4. Decision Tree Regression Example

Step 1: Load Sample Dataset (California Housing Data)

from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 2: Split Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the Decision Tree Regressor

# Create Decision Tree Regressor model
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)

# Train the model
regressor.fit(X_train, y_train)

Step 4: Make Predictions and Evaluate Model

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
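
Because MSE is expressed in squared units of the target, it can be easier to read alongside the root mean squared error (RMSE) and the R² score. The sketch below uses small made-up arrays as stand-ins for `y_test` and `regressor.predict(X_test)`, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical stand-ins for y_test and regressor.predict(X_test).
y_test = np.array([2.0, 1.5, 3.0, 4.5])
y_pred = np.array([2.5, 1.0, 3.0, 4.0])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_test, y_pred)  # 1.0 is a perfect fit; 0.0 matches always predicting the mean

print("MSE:", mse)
print("RMSE:", round(float(rmse), 4))
print("R^2:", round(float(r2), 4))
```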

5. Understanding the Output

🔹 Accuracy Score → Fraction of test samples the classifier labeled correctly.
🔹 Classification Report → Precision, Recall, and F1-score for each class.
🔹 Decision Tree Visualization → Displays the split conditions the tree uses to make decisions.
🔹 Mean Squared Error (Regression) → Average squared difference between actual and predicted values; lower is better.

6. Advantages & Disadvantages of Decision Trees

✅ Advantages

✔ Easy to understand and visualize.
✔ No need for feature scaling (such as normalization).
✔ Handles both classification and regression problems.
✔ Can work with missing values and categorical data, depending on the implementation (scikit-learn expects categorical features to be numerically encoded).

❌ Disadvantages

❌ Prone to overfitting (limit with max_depth or pruning).
❌ Approximates continuous relationships only coarsely, because each split cuts a feature into discrete ranges.
❌ Sensitive to noisy data (can lead to unnecessary splits).
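
The overfitting risk can be seen by comparing an unrestricted tree with a depth-limited one on noisy data. The sketch below uses a synthetic dataset from `make_classification`; the dataset sizes and noise level are assumptions chosen for illustration, and results on real data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y assigns random labels to about 10% of samples.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[name] = {
        "depth": tree.get_depth(),
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }
    print(name, results[name])
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign that it has memorized noise in the training set.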

Summary

✔ Decision Trees are used for both classification and regression problems.
✔ They work by splitting data into branches, using Gini Impurity or Entropy for classification and MSE for regression.
✔ sklearn.tree.DecisionTreeClassifier and DecisionTreeRegressor make them easy to implement in Python.
✔ Overfitting can be reduced using max_depth, min_samples_split, or pruning.
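
One way to prune in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter. The sketch below picks a mid-range alpha from `cost_complexity_pruning_path` purely for illustration; that choice is an assumption, and in practice alpha would normally be selected with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Grow an unrestricted tree first.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Candidate pruning strengths derived from the training data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Mid-range alpha chosen only for illustration; clamp to 0 to guard against rounding.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))

pruned_tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
pruned_tree.fit(X_train, y_train)

print("Nodes before pruning:", full_tree.tree_.node_count)
print("Nodes after pruning:", pruned_tree.tree_.node_count)
print("Test accuracy (pruned):", pruned_tree.score(X_test, y_test))
```

Larger values of `ccp_alpha` remove more of the tree, trading training accuracy for a simpler model.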