- Random Forest is an ensemble learning method that combines multiple Decision Trees to improve accuracy and reduce overfitting.
- It works by training multiple Decision Trees on random subsets of data and then averaging their predictions (for regression) or using majority voting (for classification).
- It is used for both classification and regression tasks.
Examples of Random Forest Applications
- Spam Detection (Spam / Not Spam)
- Credit Card Fraud Detection
- Medical Diagnosis (Disease Prediction)
- Stock Price Prediction
2. How Does Random Forest Work?
1. Bootstrapping: Draws a random sample of the training data, with replacement, for each tree.
2. Feature Selection: Considers a random subset of features at each split to reduce correlation between trees.
3. Multiple Decision Trees: Each tree is trained independently on its own bootstrap sample.
4. Aggregation: The final prediction is made by:
- Classification: Majority voting (most common class).
- Regression: Averaging predictions from all trees.
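The four steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is only an illustration under stated assumptions, not how RandomForestClassifier is implemented internally: the number of trees, the seed, and the variable names are arbitrary choices, and the forest is evaluated on its own training data purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n_trees = 25                     # illustrative choice
n_classes = len(np.unique(y))

trees = []
for _ in range(n_trees):
    # Step 1 (bootstrapping): draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2 (feature selection): max_features="sqrt" makes the tree
    # consider a random subset of features at every split
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    # Step 3: each tree is fitted independently on its own sample
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4 (aggregation): count the votes per class and take the most common
all_preds = np.stack([t.predict(X) for t in trees])   # (n_trees, n_samples)
votes = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])
forest_pred = votes.argmax(axis=0)                    # (n_samples,)

train_accuracy = (forest_pred == y).mean()
print("Training accuracy of the hand-built forest:", train_accuracy)
```

In practice RandomForestClassifier does all of this in one call, and adds parallel training and out-of-bag estimates.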
3. Mathematical Formulae
1. Majority Voting (Classification)
\( \hat{Y} = \arg\max_{c} \sum_{i=1}^{n} I(\hat{y}_i = c) \)
- \( n \): Number of trees
- \( \hat{y}_i \): Class predicted by tree \( i \)
- \( I(\hat{y}_i = c) \): Indicator function (1 if tree \( i \) predicts class \( c \), else 0)
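As a small worked example of the voting formula, suppose five trees vote on one sample. The votes below are made up for illustration.

```python
import numpy as np

# Hypothetical class predictions from n = 5 trees for a single sample
tree_votes = np.array([2, 0, 2, 1, 2])

# Sum of the indicator function for each class c = 0, 1, 2
counts = np.bincount(tree_votes, minlength=3)

# arg max over classes gives the forest's prediction
y_hat = int(counts.argmax())
print(counts, y_hat)  # [1 1 3] 2
```

When two classes tie, `argmax` returns the lowest class index. Note also that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes, so its result can differ from a strict majority vote in close cases.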
2. Average Prediction (Regression)
\( \hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \)
- \( \hat{y}_i \): Prediction from tree \( i \)
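The averaging formula can be checked against scikit-learn directly. The sketch below assumes a small synthetic dataset from `make_regression`; the sizes and seeds are arbitrary. It averages the individual trees in `estimators_` by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to have something to fit
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Prediction of every tree for the first five samples: shape (20, 5)
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# (1/n) * sum of the tree predictions
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(X[:5]))
print("Manual average equals forest.predict:", matches)
```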
4. Implementing Random Forest in Python
Step 1: Install and Import Required Libraries
If the packages are not installed yet, install them first:
pip install numpy pandas matplotlib scikit-learn
Then import everything the examples use:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris, fetch_california_housing
5. Random Forest for Classification (Iris Dataset Example)
Step 1: Load Dataset
# Load dataset
iris = load_iris()
X = iris.data    # Features
y = iris.target  # Target classes

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())
Step 2: Split Data into Training and Testing Sets
# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train the Random Forest Classifier
# Create Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)
Step 4: Make Predictions
# Predict on test data
y_pred = rf_model.predict(X_test)
Step 5: Evaluate Model Performance
# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 6: Feature Importance
# Get feature importances
importances = rf_model.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 6))
plt.barh(feature_names, importances, color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance in Random Forest")
plt.show()
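The same importances can also be read as a ranked text table, which is handy when no plotting backend is available. The sketch below is self-contained and, as a simplifying assumption, fits on the full Iris dataset instead of the training split. The scores in `feature_importances_` are impurity-based and normalized, so they sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(iris.data, iris.target)  # full dataset, for brevity only

importances = model.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

for i in order:
    print(f"{iris.feature_names[i]:<20} {importances[i]:.3f}")

total = importances.sum()
print("Sum of importances:", round(float(total), 6))
```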
6. Random Forest for Regression (California Housing Dataset Example)
Step 1: Load Dataset
# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())
Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train the Random Forest Regressor
# Create Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)
Step 4: Make Predictions
# Predict on test data
y_pred = rf_regressor.predict(X_test)
Step 5: Evaluate Model Performance
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
7. Understanding the Output
- Accuracy Score: Fraction of correctly classified instances, from 0 to 1 (multiply by 100 for a percentage).
- Classification Report: Precision, Recall, and F1-score for each class.
- Feature Importance: Shows which features contribute the most to predictions.
- Mean Squared Error (Regression): Average squared difference between predicted and actual values; lower is better.
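To make the two headline metrics concrete, the sketch below recomputes them by hand on tiny made-up arrays and compares the results with scikit-learn's functions. The numbers are illustrative only and do not come from the Iris or California Housing runs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: 4 of 5 toy labels are predicted correctly
y_true_cls = np.array([0, 1, 2, 2, 1])
y_pred_cls = np.array([0, 1, 2, 1, 1])
manual_accuracy = (y_true_cls == y_pred_cls).mean()
sk_accuracy = accuracy_score(y_true_cls, y_pred_cls)

# Regression: squared errors are 0.25, 0.0 and 1.0
y_true_reg = np.array([3.0, 1.5, 2.0])
y_pred_reg = np.array([2.5, 1.5, 3.0])
manual_mse = ((y_true_reg - y_pred_reg) ** 2).mean()
sk_mse = mean_squared_error(y_true_reg, y_pred_reg)

# The square root puts the error back in the target's own units
rmse = np.sqrt(manual_mse)

print("Accuracy:", manual_accuracy, sk_accuracy)
print("MSE:", manual_mse, sk_mse)
print("RMSE:", rmse)
```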
8. Advantages & Disadvantages of Random Forest
Advantages
- Usually more accurate than a single Decision Tree
- Reduces overfitting (by averaging multiple trees)
- Can tolerate missing data, although some implementations require missing values to be imputed first
- Scales to large datasets, since the trees can be trained in parallel
Disadvantages
- Slower than a single Decision Tree (many trees must be trained and queried)
- Harder to interpret (not as simple as a single tree)
- More computationally expensive
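The accuracy claim can be checked empirically by cross-validating a single tree and a forest on the same data. This is a rough sketch, assuming scikit-learn's bundled breast cancer dataset and default hyperparameters. The outcome depends on the dataset and settings, so it shows how to compare the two models, not a guaranteed result.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print(f"Single tree:   {tree_scores.mean():.3f} (+/- {tree_scores.std():.3f})")
print(f"Random forest: {forest_scores.mean():.3f} (+/- {forest_scores.std():.3f})")
```

Passing `n_jobs=-1` to RandomForestClassifier trains the trees in parallel, which offsets part of the extra cost.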
Summary
- Random Forest is an ensemble method combining multiple Decision Trees.
- Uses majority voting (classification) or averaging (regression) for predictions.
- Reduces overfitting and improves accuracy.
- RandomForestClassifier and RandomForestRegressor make implementation easy in Python.