1. What is a Decision Tree?
🔹 A Decision Tree is a supervised learning algorithm used for both classification and regression.
🔹 It mimics human decision-making by splitting data into branches based on conditions.
🔹 The structure consists of nodes, edges, and leaves:
- Root Node → The starting point (the entire dataset).
- Decision Nodes → Intermediate nodes where a condition is tested.
- Leaf Nodes → The final output (a class or a value).
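For example, a loan-approval tree might test "income > $50k?" at the root node, then "credit score > 700?" at a decision node, and end in leaf nodes labeled Approve or Reject.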
Examples of Decision Tree Applications
✅ Spam Detection (Spam / Not Spam)
✅ Loan Approval (Approve / Reject)
✅ Disease Prediction (Yes / No)
2. How Does a Decision Tree Work?
1️⃣ Find the Best Feature to Split
- Uses Gini Impurity or Entropy (Information Gain) to choose the feature and threshold that best separate the data (see the sketch after this list).
2️⃣ Split the Data on That Feature
- Each split creates branches that lead to further splits or to final classifications.
3️⃣ Stop When a Condition Is Met
- The tree stops splitting when a stopping condition is reached, such as a maximum depth or a minimum number of samples per node.
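As a rough sketch of step 1️⃣ (scikit-learn's real splitter is optimized compiled code; the function and variable names here are illustrative only), the helper below tries midpoints between consecutive values of one numeric feature and keeps the threshold with the lowest weighted Gini impurity:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    # Candidate thresholds: midpoints between consecutive unique values
    values = np.sort(np.unique(feature))
    best_t, best_score = None, float("inf")
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        # Weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: feature values below ~6.5 are class 0, above are class 1
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))  # (6.5, 0.0): a perfect split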
Mathematical Formulas
1️⃣ Gini Impurity (Used in Classification Trees)
\( \text{Gini} = 1 - \sum_{i=1}^{K} p_i^2 \)
\( p_i \) → Probability of class \( i \) in a node.
2️⃣ Entropy (Information Gain) (Alternative to Gini)
\( \text{Entropy} = -\sum_{i=1}^{K} p_i \log_2(p_i) \)
3️⃣ Decision Tree for Regression (Mean Squared Error, MSE)
\( \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
Explanation of the Formula
- \( \text{MSE} \): the mean squared error, a measure of the average squared difference between the actual and predicted values.
- \( n \): the number of data points (observations).
- \( y_i \): the actual value of the \( i \)-th observation.
- \( \hat{y}_i \): the predicted value of the \( i \)-th observation.
- \( (y_i - \hat{y}_i)^2 \): the squared difference between the actual and predicted values.
- \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \): the average of the squared differences over all observations.
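A small worked example makes the formulas concrete (the numbers are invented purely for illustration): a classification node holding 6 samples of class A and 2 of class B, and a regression node with three actual/predicted pairs.

import numpy as np

# Classification node: p_A = 6/8 = 0.75, p_B = 2/8 = 0.25
p = np.array([0.75, 0.25])
gini = 1 - np.sum(p ** 2)          # 1 - (0.5625 + 0.0625) = 0.375
entropy = -np.sum(p * np.log2(p))  # ~0.811 bits

# Regression node: actual vs. predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 5.0, 6.0])
mse = np.mean((y_true - y_pred) ** 2)  # (1 + 0 + 1) / 3 ~ 0.667

print(gini, entropy, mse)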
3. Implementing Decision Trees in Python
Step 1: Install Required Libraries
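If the libraries are not already installed, install them first from the command line (this is a shell command, not Python code):

pip install numpy pandas matplotlib scikit-learn

Then import everything the examples below need: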
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris
Step 2: Load Dataset (Iris Dataset, Classification Example)
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())
Step 3: Split Data into Training and Testing Sets
# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
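For classification tasks with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions the same in both splits; this is an optional refinement, not required for the Iris example:

# Stratified variant: each split keeps the original class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)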
Step 4: Train the Decision Tree Classifier
# Create Decision Tree model
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)
Step 5: Make Predictions
# Predict on test data
y_pred = model.predict(X_test)
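Besides hard labels, the classifier can also report class probabilities: predict_proba returns, for each sample, the fraction of training samples of each class in the leaf that the sample falls into.

# Class probabilities for the first 5 test samples (one column per class)
print(model.predict_proba(X_test[:5]))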
Step 6: Evaluate Model Performance
# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 7: Visualizing the Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()
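If a plot window is not available (for example on a headless server), the same tree can be printed as indented text with export_text:

from sklearn.tree import export_text
print(export_text(model, feature_names=list(iris.feature_names)))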
4. Decision Tree Regression Example
Step 1: Load Sample Dataset (California Housing Data)
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())
Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train the Decision Tree Regressor
# Create Decision Tree Regressor model
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)

# Train the model
regressor.fit(X_train, y_train)
Step 4: Make Predictions and Evaluate Model
# Predict on test data
y_pred = regressor.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
5. Understanding the Output
🔹 Accuracy Score → Shows how well the classifier performed on the test set.
🔹 Classification Report → Precision, recall, and F1-score for each class.
🔹 Decision Tree Visualization → Displays the conditions the tree uses to make decisions.
🔹 Mean Squared Error (Regression) → Measures the model's performance on the regression task (lower is better).
6. Advantages & Disadvantages of Decision Trees
✅ Advantages
✅ Easy to understand and visualize.
✅ No need for feature scaling (such as normalization or standardization).
✅ Handles both classification and regression problems.
✅ Can handle categorical data and missing values in principle (note that scikit-learn's trees require categorical features to be encoded as numbers first).
❌ Disadvantages
❌ Prone to overfitting (mitigate with max_depth or pruning; see the pruning sketch after this list).
❌ Coarse with continuous relationships, since each split cuts a feature into discrete ranges.
❌ Sensitive to noisy data (noise can lead to unnecessary splits).
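As a minimal sketch of cost-complexity pruning in scikit-learn (note that section 4 reused the X_train/y_train names for housing data, so the snippet recreates the Iris split first): cost_complexity_pruning_path returns a series of candidate ccp_alpha values, and refitting with a larger ccp_alpha removes the weakest branches. Choosing alpha by cross-validation is omitted here for brevity.

# Recreate the Iris split from section 3
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Grow an unrestricted tree, then compute candidate pruning strengths
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit with a nonzero ccp_alpha; larger values prune more aggressively
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=path.ccp_alphas[-2]).fit(X_train, y_train)
print("Leaves before/after pruning:", full.get_n_leaves(), pruned.get_n_leaves())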
Summary
✅ Decision Trees are used for both classification and regression problems.
✅ They work by recursively splitting the data into branches, choosing splits with Gini Impurity or Entropy (classification) or MSE (regression).
✅ sklearn.tree.DecisionTreeClassifier and DecisionTreeRegressor make them easy to implement in Python.
✅ Overfitting can be reduced using max_depth, min_samples_split, or pruning (ccp_alpha).