- A decision tree is a supervised learning algorithm used for both classification and regression.
- It mimics human decision-making by splitting data into branches based on conditions on the features.
- The structure consists of nodes connected by edges (branches):
  - Root Node: The starting point, representing the entire dataset.
  - Decision Nodes: Intermediate nodes where the data is split on a feature condition.
  - Leaf Nodes: The final output (a class label or a numeric value).
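To make the three node types concrete, here is a minimal sketch of a hand-built tree. The `Node` class, the `predict` helper, the features (`income`, `credit_score`), and the thresholds are all hypothetical and chosen only for illustration; nothing here is a trained model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical helper for illustration)."""
    feature: Optional[str] = None      # feature tested at a root or decision node
    threshold: Optional[float] = None  # go left if value <= threshold, else right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[str] = None   # set only on leaf nodes


def predict(node: Node, sample: dict) -> str:
    """Follow edges from the root until a leaf is reached."""
    while node.prediction is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction


# Made-up thresholds: the root tests income, a decision node tests credit score.
tree = Node(
    feature="income", threshold=30000,
    left=Node(prediction="Reject"),
    right=Node(
        feature="credit_score", threshold=650,
        left=Node(prediction="Reject"),
        right=Node(prediction="Approve"),
    ),
)

print(predict(tree, {"income": 50000, "credit_score": 700}))  # Approve
print(predict(tree, {"income": 20000, "credit_score": 800}))  # Reject
```

Each call walks one path from the root to a leaf, which is why a prediction from a shallow tree is easy to explain.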
Examples of Decision Tree Applications
- Spam Detection (Spam / Not Spam)
- Loan Approval (Approve / Reject)
- Disease Prediction (Yes / No)
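2. How Does a Decision Tree Work?
Step 1: Find the Best Feature to Split
- The algorithm evaluates candidate splits and picks the one that reduces impurity the most, measured with Gini impurity or entropy (information gain) for classification and with mean squared error for regression.
Step 2: Split Data Based on Feature
- Each split creates branches, which lead either to further splits or to a leaf holding the final prediction.
Step 3: Stop When Conditions Are Met
- Splitting stops when a stopping condition is met, such as reaching the maximum depth, having too few samples in a node, or reaching a pure node.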
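The split search in the first step can be sketched in plain Python. This is a simplified illustration, not how any particular library implements it: the `gini` and `best_split` helpers and the toy data are assumptions, and every observed feature value is tried as a threshold (production implementations typically use midpoints between sorted values and much faster bookkeeping).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = (None, None, gini(labels))  # start from the unsplit node's impurity
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if weighted < best[2]:
                best = (f, threshold, weighted)
    return best


# Toy data (made up): columns are [word_count, link_count].
rows = [[120, 0], [80, 5], [200, 1], [60, 7], [150, 6]]
labels = ["ham", "spam", "ham", "spam", "spam"]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)  # 1 1 0.0
```

On this toy data the split `link_count <= 1` separates the two classes perfectly, so the weighted impurity drops to 0. A full tree builder would apply `best_split` recursively to each side until a stopping condition is met.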
Mathematical Formulas
1. Gini Impurity (Used in Classification Trees)
\( \text{Gini} = 1 - \sum_{i=1}^{K} p_i^2 \)
\( p_i \): The proportion of samples in the node that belong to class \( i \); \( K \) is the number of classes.
2. Entropy (Alternative to Gini)
\( \text{Entropy} = -\sum_{i=1}^{K} p_i \log_2(p_i) \)
Information gain is the decrease in entropy from a parent node to the size-weighted average of its child nodes.
3. Decision Tree for Regression (Mean Squared Error, MSE)
\( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
Explanation of the Formula
- \( MSE \): The Mean Squared Error, a measure of the average squared difference between the actual and predicted values.
- \( n \): The number of data points (observations).
- \( y_i \): The actual value of the \( i \)-th observation.
- \( \hat{y}_i \): The predicted value of the \( i \)-th observation.
- \( (y_i - \hat{y}_i)^2 \): The squared difference between the actual and predicted values.
- \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \): The average of the squared differences over all observations.
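The three formulas can be checked numerically with a few lines of plain Python. The helper names (`gini`, `entropy`, `information_gain`, `mse`) and the example numbers are illustrative assumptions; the regression part assumes a leaf predicts the mean of the targets that reach it, which is the usual choice under an MSE criterion.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Entropy in bits; classes with p = 0 are skipped (they contribute 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    def proportions(counts):
        total = sum(counts)
        return [c / total for c in counts]

    n = sum(parent_counts)
    weighted = sum(
        sum(child) / n * entropy(proportions(child)) for child in children_counts
    )
    return entropy(proportions(parent_counts)) - weighted


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Classification: a maximally mixed two-class node versus a pure node.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))        # 0.5 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))        # 0.0 0.0
print(information_gain([5, 5], [[5, 0], [0, 5]]))   # 1.0 (a perfect split)

# Regression: the leaf predicts the mean of its targets.
targets = [3.0, 5.0, 7.0]
leaf_value = sum(targets) / len(targets)            # 5.0
leaf_mse = mse(targets, [leaf_value] * len(targets))
print(leaf_mse)                                     # 8/3, about 2.667
```

Both impurity measures are 0 for a pure node and largest when the classes are evenly mixed, which is why either can be used to rank candidate splits.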
3. Implementing Decision Trees in Python
Step 1: Import Required Libraries

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
    from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
    from sklearn.datasets import load_iris
Step 2: Load Dataset (Iris Dataset – Classification Example)

    # Load dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Convert to DataFrame
    df = pd.DataFrame(X, columns=iris.feature_names)
    df['Target'] = y

    # Display first 5 rows
    print(df.head())
Step 3: Split Data into Training and Testing Sets

    # Split into training (80%) and testing (20%) sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train the Decision Tree Classifier

    # Create Decision Tree model
    model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)

    # Train the model
    model.fit(X_train, y_train)
Step 5: Make Predictions

    # Predict on test data
    y_pred = model.predict(X_test)
Step 6: Evaluate Model Performance

    # Accuracy Score
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    # Classification Report
    print("Classification Report:\n", classification_report(y_test, y_pred))
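Accuracy is a single number, so it can help to see which classes get confused with which. The sketch below is one possible complement to the report above, using scikit-learn's `confusion_matrix`; it repeats the loading, splitting, and training steps so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(iris.target_names)
print(cm)
```

Values on the diagonal count correct predictions; any off-diagonal value shows a true class (row) that was predicted as another class (column).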
Step 7: Visualizing the Decision Tree

    plt.figure(figsize=(12, 8))
    plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
    plt.show()
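When a plot window is not available (for example on a remote server), the same tree can be inspected as text. The following is a minimal sketch using scikit-learn's `export_text`; it retrains the same classifier so the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# One line per node: the feature test at decision nodes, the class at leaves.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
print("Depth:", model.get_depth(), "Leaves:", model.get_n_leaves())
```

Each indentation level in the printed rules corresponds to one level of depth in the tree.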
4. Decision Tree Regression Example
Step 1: Load Sample Dataset (California Housing Data)

    from sklearn.datasets import fetch_california_housing

    # Load dataset
    data = fetch_california_housing()
    X = data.data
    y = data.target

    # Convert to DataFrame
    df = pd.DataFrame(X, columns=data.feature_names)
    df['Target'] = y

    # Display first 5 rows
    print(df.head())
Step 2: Split Data into Training and Testing Sets

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train the Decision Tree Regressor

    # Create Decision Tree Regressor model
    regressor = DecisionTreeRegressor(max_depth=3, random_state=42)

    # Train the model
    regressor.fit(X_train, y_train)
Step 4: Make Predictions and Evaluate Model

    # Predict on test data
    y_pred = regressor.predict(X_test)

    # Calculate Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print("Mean Squared Error:", mse)
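A raw MSE value is hard to judge because it is in squared units of the target. The sketch below shows one way to put it in context with RMSE, R², and a mean-prediction baseline. As an assumption made for this example only, it uses synthetic data from `make_regression` instead of the housing data so that it runs without a download; the numbers it prints say nothing about the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption) so the sketch runs offline.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = DecisionTreeRegressor(max_depth=3, random_state=42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_test, y_pred)  # 1.0 is perfect; 0.0 is no better than the test mean

# Baseline: always predict the mean of the training targets.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}")
print(f"Baseline MSE (predict the mean): {baseline_mse:.3f}")
```

If the tree's MSE is not clearly below the baseline, the model has learned little beyond the average target value.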
5. Understanding the Output
- Accuracy Score: The fraction of test samples the classifier labeled correctly.
- Classification Report: Precision, recall, and F1-score for each class.
- Decision Tree Visualization: Shows the feature and threshold tested at each node and the prediction at each leaf.
- Mean Squared Error (Regression): The average squared difference between predicted and actual values; lower is better.
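6. Advantages & Disadvantages of Decision Trees
Advantages
- Easy to understand and visualize.
- No need for feature scaling (such as normalization).
- Handles both classification and regression problems.
- Can work with missing values and categorical data, although support depends on the implementation (scikit-learn's trees expect categorical features to be encoded as numbers).
Disadvantages
- Prone to overfitting (limit with max_depth or pruning).
- Predictions are piecewise constant: continuous features are split into discrete ranges, so smooth relationships are only approximated and the model cannot extrapolate beyond the training range.
- Sensitive to noisy data (can lead to unnecessary splits).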
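The pruning remedy for overfitting can be sketched with scikit-learn's minimal cost-complexity pruning. This is one possible approach, not the only one: it grows an unconstrained tree on the Iris data, asks it for candidate `ccp_alpha` values, and refits one pruned tree per value. The variable names and the choice to print train and test accuracy side by side are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Unconstrained tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Candidate pruning strengths, in increasing order.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(
        f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
        f"train={pruned.score(X_train, y_train):.3f}  "
        f"test={pruned.score(X_test, y_test):.3f}"
    )
```

Larger `ccp_alpha` values remove more branches. In practice the value is usually chosen with cross-validation rather than by reading test accuracy off a single split.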
Summary
- Decision Trees are used for both classification and regression problems.
- They work by splitting data into branches using Gini Impurity or Entropy (or Mean Squared Error for regression).
- sklearn.tree.DecisionTreeClassifier and DecisionTreeRegressor make it easy to implement in Python.
- Overfitting can be reduced using max_depth, min_samples_split, or pruning.