Data Science Logistic Regression

Logistic Regression is a Supervised Learning Algorithm used for classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts binary (0 or 1) or multi-class outputs.

1. What is Logistic Regression?

🔹 Logistic Regression is used when the output is categorical (e.g., Spam or Not Spam, Pass or Fail).
🔹 It models the probability that Y belongs to a particular class.
🔹 Instead of a straight line, it produces an S-shaped curve (Sigmoid function).

Examples of Logistic Regression Applications:

✅ Email Spam Detection (Spam or Not Spam)
✅ Predicting Customer Churn (Will leave or not)
✅ Disease Prediction (Diabetes: Yes or No)

2. Mathematical Formula of Logistic Regression

Unlike Linear Regression, Logistic Regression uses the Sigmoid function to keep outputs between 0 and 1:

\( P(Y=1 \mid X) = \frac{1}{1 + e^{-(mX + c)}} \)

Where:

  • \( P(Y=1 \mid X) \) → Probability that \( Y = 1 \) given the input \( X \)
  • \( m \) → Slope (Coefficient)
  • \( c \) → Intercept
  • \( e \) → Euler's number (~2.718)

If \( P(Y=1 \mid X) \geq 0.5 \), the model predicts class 1 (Yes, Positive).
If \( P(Y=1 \mid X) \lt 0.5 \), the model predicts class 0 (No, Negative).
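
To make the formula and the 0.5 cut-off concrete, here is a minimal sketch in plain Python. The values of m and c are hypothetical, picked only so that the three inputs land below, on, and above the threshold; they are not fitted to any dataset.

```python
import math

def sigmoid(z):
    """Map any real number z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, m, c, threshold=0.5):
    """Return (probability, label) for one input x using P = sigmoid(m*x + c)."""
    p = sigmoid(m * x + c)
    return p, int(p >= threshold)

# Hypothetical slope and intercept, chosen only for illustration
m, c = 0.8, -4.0

for x in (2.0, 5.0, 8.0):
    p, label = predict_class(x, m, c)
    print(f"x={x}: P(Y=1|X)={p:.3f} -> class {label}")
```

With these assumed values, x = 5.0 gives mX + c = 0, so the probability is exactly 0.5 and the input is assigned to class 1; smaller inputs fall to class 0 and larger ones to class 1.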

3. Logistic Regression Implementation in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load Dataset

For this example, we will use a simple diabetes dataset where the goal is to predict whether a patient has diabetes (1 = Yes, 0 = No).

# Load dataset (replace with actual file if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

# Display first 5 rows
print(df.head())

Step 3: Split Data into Features & Labels

# Features (X) - Independent variables
X = df.drop(columns=['Outcome'])  # Drop target variable

# Target (y) - Dependent variable
y = df['Outcome']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
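
When one class is rarer than the other, a purely random split can leave the test set with a different class balance than the full data. One optional variation, sketched here on made-up labels (X_demo and y_demo are placeholders, not the diabetes data), is to pass the labels to the stratify parameter of train_test_split so both parts keep the same proportion of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder single feature column

# stratify=y_demo keeps the 70/30 class ratio in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share overall:", y_demo.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```

The same idea would apply to the tutorial's split by adding stratify=y to the existing call, if a balanced test set matters for the evaluation.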

Step 4: Train the Logistic Regression Model

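# Create Logistic Regression model
# (max_iter is raised above scikit-learn's default of 100 to give the solver more iterations to converge)
model = LogisticRegression(max_iter=200)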

# Train the model
model.fit(X_train, y_train)
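
After fitting, the learned parameters are exposed as coef_ (one weight per feature, the multi-feature counterpart of m) and intercept_ (the counterpart of c). The sketch below, which uses synthetic data from make_classification as a stand-in rather than the diabetes dataset, recomputes the probabilities by hand with the sigmoid formula and compares them with what scikit-learn returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

demo_model = LogisticRegression(max_iter=200)
demo_model.fit(X_demo, y_demo)

# One weight per feature and a single intercept for a binary problem
weights = demo_model.coef_[0]
bias = demo_model.intercept_[0]

# Recompute P(Y=1 | X) by hand: sigmoid of the weighted sum plus intercept
z = X_demo @ weights + bias
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's own probabilities for class 1
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print("Max difference:", np.max(np.abs(manual_proba - sklearn_proba)))
```

The difference should be at the level of floating-point noise, which suggests that for a binary model with default settings the fitted object is applying the same formula shown in Section 2.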

Step 5: Make Predictions

# Predict on test data
y_pred = model.predict(X_test)
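
predict returns hard 0/1 labels. When the probabilities themselves are needed, for instance to apply a stricter cut-off than 0.5, predict_proba can be used instead. This is a sketch on synthetic stand-in data; the 0.7 threshold is a hypothetical choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(Y=1 | X); column 0 holds P(Y=0 | X)
proba_pos = clf.predict_proba(X_demo)[:, 1]

# Default labels from predict versus labels from a stricter, hypothetical cut-off
default_labels = clf.predict(X_demo)
strict_labels = (proba_pos >= 0.7).astype(int)

print("Positives with default predict:", default_labels.sum())
print("Positives with 0.7 threshold:  ", strict_labels.sum())
```

Raising the threshold can only reduce the number of predicted positives, which typically trades recall for precision.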

Step 6: Evaluate Model Performance

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report (Precision, Recall, F1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 7: Visualizing Decision Boundary (For Simple 2D Cases)

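For simplicity, let's use only Glucose and BMI as features. The plot relies on the third-party mlxtend package, which must be installed separately (for example, pip install mlxtend).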

from mlxtend.plotting import plot_decision_regions

# Select only two features for visualization
X_vis = X_train[['Glucose', 'BMI']].values
y_vis = y_train.values

# Train model with selected features
model_vis = LogisticRegression()
model_vis.fit(X_vis, y_vis)

# Plot decision boundary
plot_decision_regions(X_vis, y_vis, clf=model_vis, legend=2)
plt.xlabel("Glucose Level")
plt.ylabel("BMI")
plt.title("Decision Boundary for Diabetes Prediction")
plt.show()
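
With two features, the decision boundary is the straight line where the predicted probability equals 0.5, that is, where w1*x1 + w2*x2 + b = 0. As a rough sketch on synthetic two-feature data (a stand-in, not the Glucose/BMI columns), the line can be derived directly from the fitted coefficients without any plotting helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature stand-in data (an assumption for illustration)
X2, y2 = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=3
)
clf2 = LogisticRegression().fit(X2, y2)

w1, w2 = clf2.coef_[0]
b = clf2.intercept_[0]

# Boundary: w1*x1 + w2*x2 + b = 0, solved for x2 (assumes w2 is not zero)
x1_line = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 5)
x2_line = -(w1 * x1_line + b) / w2

# Points on the boundary should receive a probability very close to 0.5
boundary_points = np.column_stack([x1_line, x2_line])
print(clf2.predict_proba(boundary_points)[:, 1])
```

The arrays x1_line and x2_line could be passed to plt.plot on top of a scatter plot of the data to draw the same boundary by hand.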

4. Understanding the Output

🔹 Accuracy → Fraction of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN).
🔹 Confusion Matrix → Counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
🔹 Precision, Recall, F1-score → Precision is TP / (TP + FP), Recall is TP / (TP + FN), and F1-score is the harmonic mean of the two.
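
To show how these numbers relate to one another, the following sketch computes them by hand in plain Python. The ten labels and predictions are invented for illustration and are not output from the diabetes model. For reference, scikit-learn's confusion_matrix arranges binary results as [[TN, FP], [FN, TP]].

```python
# Hypothetical true labels and predictions for ten patients
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly flagged negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives that were missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this made-up example the model is right 7 times out of 10, yet it misses 2 of the 5 actual positives, which is why recall (0.60) is lower than accuracy (0.70).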

Summary

✔ Logistic Regression is for classification problems (0 or 1, Yes or No).
✔ Uses the Sigmoid function to predict probabilities.
✔ sklearn makes implementation easy in Python.
✔ Confusion Matrix & Accuracy help evaluate model performance.