Data Science Logistic Regression

Logistic Regression is a Supervised Learning Algorithm used for classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts binary (0 or 1) or multi-class outputs.

1. What is Logistic Regression?

🔹 Logistic Regression is used when the output is categorical (e.g., Spam or Not Spam, Pass or Fail).
🔹 It models the probability that Y belongs to a particular class.
🔹 Instead of a straight line, it produces an S-shaped curve (Sigmoid function).

Examples of Logistic Regression Applications:

✅ Email Spam Detection (Spam or Not Spam)
✅ Predicting Customer Churn (Will leave or not)
✅ Disease Prediction (Diabetes: Yes or No)

2. Mathematical Formula of Logistic Regression

Unlike Linear Regression, Logistic Regression uses the Sigmoid function to keep outputs between 0 and 1:

\( P(Y=1 \mid X) = \frac{1}{1 + e^{-(mX + c)}} \)

Where:

  • \( P(Y=1 \mid X) \) → Probability that \( Y = 1 \) given the input \( X \)
  • \( m \) → Slope (Coefficient)
  • \( c \) → Intercept
  • \( e \) → Euler's number (~2.718)
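
One way to read this formula: writing \( P \) for \( P(Y=1 \mid X) \) and rearranging shows that the log-odds of the positive class are a linear function of \( X \). A short derivation using the same symbols \( m \) and \( c \):

\( P = \frac{1}{1 + e^{-(mX + c)}} \;\Longrightarrow\; \frac{P}{1 - P} = e^{mX + c} \;\Longrightarrow\; \ln\left(\frac{P}{1 - P}\right) = mX + c \)

So increasing \( X \) by one unit changes the log-odds by \( m \).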

If \( P(Y=1 \mid X) \geq 0.5 \), the model predicts class 1 (Yes, Positive).
If \( P(Y=1 \mid X) \lt 0.5 \), the model predicts class 0 (No, Negative).
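
The formula and the 0.5 threshold can be sketched in plain Python. This is a minimal illustration, not a trained model: the values of m and c and the helper names sigmoid and predict_class are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Map a real number z to a value strictly between 0 and 1.

    Note: math.exp can overflow for very large negative z; that is
    acceptable for this small sketch.
    """
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, m, c, threshold=0.5):
    """Return (probability, label) for a single feature value x."""
    p = sigmoid(m * x + c)
    return p, int(p >= threshold)

# Hypothetical coefficients, chosen only for illustration
m, c = 0.125, -15.0

for x in (104, 120, 136):
    p, label = predict_class(x, m, c)
    print(f"x={x}: P(Y=1|X)={p:.3f} -> class {label}")
```

With these assumed values, mX + c equals 0 at x = 120, so that input sits exactly on the 0.5 boundary and is assigned to class 1.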

3. Logistic Regression Implementation in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load Dataset

For this example, we will use the Pima Indians Diabetes dataset, where the goal is to predict whether a patient has diabetes (1 = Yes, 0 = No). The CSV file has no header row, so the column names are supplied manually.

# Load dataset (replace with actual file if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

# Display first 5 rows
print(df.head())

Step 3: Split Data into Features & Labels

# Features (X) - Independent variables
X = df.drop(columns=['Outcome'])  # Drop target variable

# Target (y) - Dependent variable
y = df['Outcome']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Logistic Regression Model

# Create Logistic Regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)

Step 5: Make Predictions

# Predict on test data
y_pred = model.predict(X_test)
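
predict() returns class labels only. To see the underlying probabilities, scikit-learn's predict_proba() can be used. The sketch below runs on synthetic data from make_classification (a stand-in, not the diabetes dataset) so that it works offline; the names X_demo, y_demo, demo_model and the 0.7 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the diabetes dataset), so the sketch runs offline
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# Column 1 holds P(Y=1 | X); column 0 holds P(Y=0 | X)
proba_pos = demo_model.predict_proba(X_demo)[:, 1]

# Applying the 0.5 cut-off by hand should reproduce predict()
manual_pred = (proba_pos >= 0.5).astype(int)
print("Matches predict():", np.array_equal(manual_pred, demo_model.predict(X_demo)))

# A stricter, hypothetical threshold never flags more samples as positive
strict_pred = (proba_pos >= 0.7).astype(int)
print("Positives at 0.5:", manual_pred.sum(), "| at 0.7:", strict_pred.sum())
```

Raising the threshold trades recall for precision, which may be worth considering when false positives are costly.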

Step 6: Evaluate Model Performance

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report (Precision, Recall, F1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 7: Visualizing Decision Boundary (For Simple 2D Cases)

For simplicity, let's use only Glucose and BMI as features, so the boundary can be drawn in two dimensions.

# mlxtend is a separate package (install with: pip install mlxtend)
from mlxtend.plotting import plot_decision_regions

# Select only two features for visualization
X_vis = X_train[['Glucose', 'BMI']].values
y_vis = y_train.values

# Train model with selected features
model_vis = LogisticRegression()
model_vis.fit(X_vis, y_vis)

# Plot decision boundary
plot_decision_regions(X_vis, y_vis, clf=model_vis, legend=2)
plt.xlabel("Glucose Level")
plt.ylabel("BMI")
plt.title("Decision Boundary for Diabetes Prediction")
plt.show()

4. Understanding the Output

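🔹 Accuracy → Fraction of all predictions that are correct: (TP + TN) / (TP + FP + FN + TN).
🔹 Confusion Matrix → Counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). For binary labels, sklearn arranges it as [[TN, FP], [FN, TP]].
🔹 Precision, Recall, F1-score → Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 is the harmonic mean of precision and recall.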
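
To make these metrics concrete, here is a small sketch that derives them from the four confusion-matrix counts. The counts are made up for illustration and are not results from the diabetes model; classification_metrics is a hypothetical helper, not part of sklearn.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration only
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.700, precision 0.750, recall 0.600 and F1 about 0.667, which shows how a model can look reasonable on accuracy while still missing 40% of the positive cases.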

Summary

✔ Logistic Regression is for classification problems (0 or 1, Yes or No).
✔ Uses the Sigmoid function to predict probabilities.
✔ sklearn makes implementation easy in Python.
✔ Confusion Matrix & Accuracy help evaluate model performance.