Data Science Random Forest

🔹 Random Forest is an ensemble learning method that combines multiple Decision Trees to improve accuracy and reduce overfitting.
🔹 It works by training multiple Decision Trees on random subsets of data and then averaging their predictions (for regression) or using majority voting (for classification).
🔹 It is used for both classification and regression tasks.

Examples of Random Forest Applications

✅ Spam Detection (Spam / Not Spam)
✅ Credit Card Fraud Detection
✅ Medical Diagnosis (Disease Prediction)
✅ Stock Price Prediction

2. How Does Random Forest Work?

1๏ธโƒฃ Bootstrapping โ†’ Randomly selects samples from the dataset for training each tree.
2๏ธโƒฃ Feature Selection โ†’ Chooses a random subset of features at each split to reduce correlation between trees.
3๏ธโƒฃ Multiple Decision Trees โ†’ Each tree is trained independently on different subsets.
4๏ธโƒฃ Aggregation โ†’ Final prediction is made by:

  • Classification โ†’ Majority voting (most common class).
  • Regression โ†’ Averaging predictions from all trees.
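
The four steps above can be sketched by hand with plain Decision Trees. This is a minimal illustration rather than how any library implements it internally; the names `n_trees`, `trees`, `all_preds`, and `manual_pred`, and the choice of 25 trees, are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice, not a recommendation
trees = []

for i in range(n_trees):
    # 1. Bootstrapping: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Feature selection: consider a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    # 3. Each tree is fit independently on its own bootstrap sample
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# 4. Aggregation: majority vote across trees for every test sample
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
manual_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)

print("Manual forest accuracy:", (manual_pred == y_test).mean())
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.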

3. Mathematical Formulae

1. Majority Voting (Classification)

\( \hat{Y} = \arg\max_{c} \sum_{i=1}^{n} I(\hat{y}_i = c) \)

  • \( n \) → Number of trees
  • \( \hat{y}_i \) → Class predicted by tree \( i \)
  • \( I(\hat{y}_i = c) \) → Indicator function (1 if tree \( i \) predicts class \( c \), else 0)
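
As a small worked example of the voting formula, the snippet below counts votes for one sample. The five tree predictions in `tree_votes` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical class predictions from n = 5 trees for a single sample
tree_votes = np.array([0, 2, 1, 2, 2])
n_classes = 3

# For each class c, sum the indicator "tree i predicts c" over all trees
vote_counts = np.array([(tree_votes == c).sum() for c in range(n_classes)])

# arg max over classes gives the forest's prediction
y_hat = vote_counts.argmax()

print(vote_counts)  # [1 1 3]
print(y_hat)        # 2
```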

2. Average Prediction (Regression)

\( \hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \)

  • \( \hat{y}_i \) → Prediction from tree \( i \)
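
The averaging formula can be checked against a fitted model. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, chosen so the example needs no download) and compares the mean of the individual trees' predictions with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# Prediction of every individual tree for the first three rows
per_tree = np.array([tree.predict(X[:3]) for tree in forest.estimators_])

# (1/n) * sum of the tree predictions
manual_average = per_tree.mean(axis=0)

print("Mean of tree predictions:", manual_average)
print("Forest prediction:       ", forest.predict(X[:3]))
```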

4. Implementing Random Forest in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris, fetch_california_housing

5. Random Forest for Classification (Iris Dataset Example)

Step 1: Load Dataset

# Load dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target classes

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 2: Split Data into Training and Testing Sets

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the Random Forest Classifier

# Create Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)
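
Because each tree is trained on a bootstrap sample, the rows a tree never saw ("out-of-bag" rows) can act as a built-in validation set. The sketch below is an optional extra, not part of the steps in this tutorial; `X_demo`, `y_demo`, and `oob_model` are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# oob_score=True scores each row using only the trees that did not train on it
oob_model = RandomForestClassifier(
    n_estimators=100, max_depth=3, oob_score=True, random_state=42
)
oob_model.fit(X_demo, y_demo)

print("Out-of-bag accuracy estimate:", oob_model.oob_score_)
```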

Step 4: Make Predictions

# Predict on test data
y_pred = rf_model.predict(X_test)

Step 5: Evaluate Model Performance

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
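
A confusion matrix is a common companion to the classification report, since it shows which classes get mixed up with which. The sketch below rebuilds the same split and model settings so it runs on its own; `cm_model` and `cm` are names assumed for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
Xtr, Xte, ytr, yte = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=42
)

cm_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
cm_model.fit(Xtr, ytr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yte, cm_model.predict(Xte))
print(cm)
```

Values on the diagonal are correct predictions; anything off the diagonal is a misclassification.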

Step 6: Feature Importance

# Get feature importances
importances = rf_model.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 6))
plt.barh(feature_names, importances, color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance in Random Forest")
plt.show()
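
The `feature_importances_` attribute is based on impurity reduction measured on the training data. As a cross-check, permutation importance measures how much the test score drops when one feature's values are shuffled. This is a sketch of one possible alternative, with `pi_model` and `result` as assumed names and `n_repeats=10` as an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
Xtr, Xte, ytr, yte = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=42
)

pi_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
pi_model.fit(Xtr, ytr)

# Shuffle each feature 10 times on the test set and record the drop in accuracy
result = permutation_importance(pi_model, Xte, yte, n_repeats=10, random_state=42)

for name, mean, std in zip(
    iris_demo.feature_names, result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```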

6. Random Forest for Regression (California Housing Dataset Example)

Step 1: Load Dataset

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())

Step 2: Split Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the Random Forest Regressor

# Create Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)

Step 4: Make Predictions

# Predict on test data
y_pred = rf_regressor.predict(X_test)

Step 5: Evaluate Model Performance

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
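
MSE is expressed in squared target units, which can be hard to read. Two related metrics are often reported next to it: RMSE (same units as the target) and R² (share of variance explained). The sketch below uses synthetic data from `make_regression` as a stand-in so it runs without a download; `X_syn`, `y_syn`, and `reg_demo` are assumed names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real regression dataset
X_syn, y_syn = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

reg_demo = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
reg_demo.fit(Xtr, ytr)
pred_demo = reg_demo.predict(Xte)

mse_demo = mean_squared_error(yte, pred_demo)
rmse_demo = np.sqrt(mse_demo)        # back in the units of the target
r2_demo = r2_score(yte, pred_demo)   # 1.0 would be a perfect fit

print("MSE: ", mse_demo)
print("RMSE:", rmse_demo)
print("R^2: ", r2_demo)
```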

7. Understanding the Output

🔹 Accuracy Score → Fraction of test instances classified correctly (between 0 and 1; multiply by 100 for a percentage).
🔹 Classification Report → Precision, Recall, and F1-score for each class.
🔹 Feature Importance → Shows which features contribute the most to predictions.
🔹 Mean Squared Error (Regression) → Average squared difference between predicted and actual values; lower is better.

8. Advantages & Disadvantages of Random Forest

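✅ Advantages

✔ Usually more accurate than a single Decision Tree
✔ Reduces overfitting (by averaging multiple trees)
✔ Can cope with missing data, though support depends on the implementation and version, so imputing missing values first is the safer default
✔ Scales to large datasets, since trees are trained independently and can be built in parallel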
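
One way to check the accuracy claim on a given dataset is to cross-validate a single tree and a forest side by side. The sketch below uses the breast cancer dataset bundled with scikit-learn as an example; that dataset choice and the names `tree_scores` and `forest_scores` are assumptions, and results will vary with the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
# n_jobs=-1 builds the trees in parallel on all available cores
forest_bc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(single_tree, X_bc, y_bc, cv=5)
forest_scores = cross_val_score(forest_bc, X_bc, y_bc, cv=5)

print("Single tree mean accuracy:  ", tree_scores.mean())
print("Random forest mean accuracy:", forest_scores.mean())
```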

โŒ Disadvantages

โŒ Slower than Decision Trees (due to multiple trees)
โŒ Harder to interpret (not as simple as a single tree)
โŒ More computationally expensive

Summary

✔ Random Forest is an ensemble method combining multiple Decision Trees.
✔ Uses majority voting (classification) or averaging (regression) for predictions.
✔ Reduces overfitting and improves accuracy.
✔ RandomForestClassifier and RandomForestRegressor make implementation easy in Python.