Data Science Random Forest

🔹 Random Forest is an ensemble learning method that combines multiple Decision Trees to improve accuracy and reduce overfitting.
🔹 It works by training multiple Decision Trees on random subsets of data and then averaging their predictions (for regression) or using majority voting (for classification).
🔹 It is used for both classification and regression tasks.

Examples of Random Forest Applications

✅ Spam Detection (Spam / Not Spam)
✅ Credit Card Fraud Detection
✅ Medical Diagnosis (Disease Prediction)
✅ Stock Price Prediction

2. How Does Random Forest Work?

1️⃣ Bootstrapping → Randomly samples rows from the dataset, with replacement, to build the training set for each tree.
2️⃣ Feature Selection → Chooses a random subset of features at each split to reduce correlation between trees.
3️⃣ Multiple Decision Trees → Each tree is trained independently on different subsets.
4️⃣ Aggregation → Final prediction is made by:

Classification → Majority voting (most common class).
Regression → Averaging predictions from all trees.
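The four steps above can be sketched by hand with plain Decision Trees. This is a minimal illustration, not how any library implements it internally: the Iris data, the 25 trees, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value, not a recommendation
trees = []

for i in range(n_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: max_features="sqrt" makes the tree consider a random
    # subset of features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    # Step 3: each tree is fitted independently on its own sample
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: aggregate by majority vote across trees
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

manual_accuracy = (forest_pred == y_test).mean()
print("Manual forest accuracy:", manual_accuracy)
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes, so its output can differ slightly from this sketch.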

3. Mathematical Formulae

1. Majority Voting (Classification)

\( \hat{Y} = \arg\max_{c} \sum_{i=1}^{n} I(\hat{y}_i = c) \)

\( n \) → Number of trees
\( \hat{y}_i \) → Class predicted by tree \( i \)
\( I(\hat{y}_i = c) \) → Indicator function (1 if tree \( i \) predicts class \( c \), else 0)
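As a small worked example of the voting formula, suppose five trees vote on a single sample. The votes below are made up for illustration.

```python
import numpy as np

# Hypothetical class predictions from n = 5 trees for one sample (classes 0, 1, 2)
tree_votes = np.array([2, 0, 2, 1, 2])

# Count the votes per class (the indicator sum in the formula)
vote_counts = np.bincount(tree_votes, minlength=3)

# The predicted class is the one with the most votes (the arg max)
y_hat = vote_counts.argmax()

print("Votes per class:", vote_counts)  # class 0: 1 vote, class 1: 1 vote, class 2: 3 votes
print("Predicted class:", y_hat)        # class 2 wins
```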

2. Average Prediction (Regression)

\( \hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \)

\( \hat{y}_i \) → Prediction from tree \( i \)
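The averaging formula can be checked both by hand and against scikit-learn. The sketch below uses made-up tree outputs first, then a small synthetic dataset from make_regression (an assumption for illustration) to confirm that the forest's prediction equals the mean of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# By hand: hypothetical predictions from n = 4 trees for one sample
tree_outputs = np.array([2.0, 2.5, 3.0, 2.5])
manual_average = tree_outputs.mean()  # (2.0 + 2.5 + 3.0 + 2.5) / 4 = 2.5
print("Manual average:", manual_average)

# With scikit-learn: compare the forest's output to the mean over its trees
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
matches = np.allclose(per_tree.mean(axis=0), forest.predict(X))
print("Forest prediction equals mean of trees:", matches)
```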

4. Implementing Random Forest in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.datasets import load_iris, fetch_california_housing


5. Random Forest for Classification (Iris Dataset Example)

Step 1: Load Dataset

# Load dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target classes

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())


Step 2: Split Data into Training and Testing Sets

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Step 3: Train the Random Forest Classifier

# Create Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)


Step 4: Make Predictions

# Predict on test data
y_pred = rf_model.predict(X_test)


Step 5: Evaluate Model Performance

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))


Step 6: Feature Importance

# Get feature importances
importances = rf_model.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 6))
plt.barh(feature_names, importances, color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance in Random Forest")
plt.show()
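The same importances can also be read as a ranked list. The sketch below is a standalone example: it refits a model on the full Iris data (an assumption made only to keep the snippet self-contained) and prints the features from most to least important. The scores are normalized, so they sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort in descending order
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.3f}")

total_importance = model.feature_importances_.sum()
print("Sum of importances:", total_importance)
```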


6. Random Forest for Regression (California Housing Dataset Example)

Step 1: Load Dataset

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y

# Display first 5 rows
print(df.head())


Step 2: Split Data into Training and Testing Sets

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Step 3: Train the Random Forest Regressor

# Create Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)


Step 4: Make Predictions

# Predict on test data
y_pred = rf_regressor.predict(X_test)


Step 5: Evaluate Model Performance

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


7. Understanding the Output

🔹 Accuracy Score → Fraction of correctly classified instances (a value between 0 and 1; multiply by 100 for a percentage).
🔹 Classification Report → Precision, Recall, F1-score for each class.
🔹 Feature Importance → Shows which features contribute the most to predictions.
🔹 Mean Squared Error (Regression) → Average of the squared differences between predicted and actual values; lower is better.
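To make the two headline metrics concrete, the sketch below computes accuracy and Mean Squared Error by hand and compares them with scikit-learn's functions. All label and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical classification results: 4 of 5 predictions are correct
y_true_cls = np.array([0, 1, 2, 2, 1])
y_pred_cls = np.array([0, 1, 2, 1, 1])
manual_acc = (y_true_cls == y_pred_cls).mean()  # 4 / 5 = 0.8
sklearn_acc = accuracy_score(y_true_cls, y_pred_cls)
print("Accuracy:", manual_acc, sklearn_acc)

# Hypothetical regression results: squared errors are 1, 0, 4, 1
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 4.0, 6.0])
manual_mse = ((y_true_reg - y_pred_reg) ** 2).mean()  # (1 + 0 + 4 + 1) / 4 = 1.5
sklearn_mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MSE:", manual_mse, sklearn_mse)
```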

8. Advantages & Disadvantages of Random Forest

✅ Advantages

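✔ Usually more accurate than a single Decision Tree
✔ Reduces overfitting (by averaging multiple trees)
✔ Can cope with some missing data, although many implementations require missing values to be imputed first
✔ Scales to large datasets, since the trees can be trained in parallel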
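One way to check the accuracy claim on your own data is to cross-validate a single Decision Tree against a Random Forest. This is a rough sketch, assuming scikit-learn's bundled breast cancer dataset and 5-fold cross-validation; the gap between the two models varies by dataset and is not guaranteed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Dataset chosen only for illustration
X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(single_tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("Single tree mean accuracy:", tree_scores.mean())
print("Random forest mean accuracy:", forest_scores.mean())
```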

❌ Disadvantages

❌ Slower to train and predict than a single Decision Tree (due to multiple trees)
❌ Harder to interpret (not as simple as a single tree)
❌ Uses more memory, since every tree has to be stored

Summary

✔ Random Forest is an ensemble method combining multiple Decision Trees.
✔ Uses majority voting (classification) or averaging (regression) for predictions.
✔ Reduces overfitting and improves accuracy.
✔ RandomForestClassifier and RandomForestRegressor make implementation easy in Python.