Data Science Logistic Regression

Logistic Regression is a supervised learning algorithm used for classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression estimates the probability of a class and maps it to a binary (0 or 1) or multi-class label.

1. What is Logistic Regression?

🔹 Logistic Regression is used when the output is categorical (e.g., Spam or Not Spam, Pass or Fail).
🔹 It models the probability that Y belongs to a particular class.
🔹 Instead of a straight line, it produces an S-shaped curve (Sigmoid function).

Examples of Logistic Regression Applications:

✅ Email Spam Detection (Spam or Not Spam)
✅ Predicting Customer Churn (Will leave or not)
✅ Disease Prediction (Diabetes: Yes or No)

2. Mathematical Formula of Logistic Regression

Unlike Linear Regression, Logistic Regression uses the Sigmoid function to keep outputs between 0 and 1:

\( P(Y=1 \mid X) = \frac{1}{1 + e^{-(mX + c)}} \)

Where:

\( P(Y=1 \mid X) \) → Probability that \( Y = 1 \) given the input \( X \)
\( m \) → Slope (Coefficient)
\( c \) → Intercept
\( e \) → Euler’s number (~2.718)

If \( P(Y=1 \mid X) \geq 0.5 \), the model predicts class 1 (Yes, Positive).
If \( P(Y=1 \mid X) \lt 0.5 \), the model predicts class 0 (No, Negative).
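
To make the formula and the 0.5 cut-off concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `predict_class`, and the values of `m` and `c`, are hypothetical and chosen only for illustration; they are not fitted to any dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x: float, m: float, c: float, threshold: float = 0.5) -> int:
    """Return 1 if P(Y=1 | X=x) >= threshold, otherwise 0."""
    return 1 if sigmoid(m * x + c) >= threshold else 0

# Hypothetical slope and intercept, for illustration only
m, c = 0.5, -6.0

for x in (8, 12, 16):
    p = sigmoid(m * x + c)
    print(f"x={x}: P(Y=1|X)={p:.3f} -> class {predict_class(x, m, c)}")
# x=8:  P(Y=1|X)=0.119 -> class 0
# x=12: P(Y=1|X)=0.500 -> class 1
# x=16: P(Y=1|X)=0.881 -> class 1
```

With these values the input x = 12 sits exactly on the decision boundary, because mX + c = 0 there and the sigmoid of 0 is 0.5.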

3. Logistic Regression Implementation in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load Dataset

For this example, we will use a simple diabetes dataset where the goal is to predict whether a patient has diabetes (1 = Yes, 0 = No).

# Load dataset (replace with actual file if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

# Display first 5 rows
print(df.head())

Step 3: Split Data into Features & Labels

# Features (X) - Independent variables
X = df.drop(columns=['Outcome'])  # Drop target variable

# Target (y) - Dependent variable
y = df['Outcome']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Logistic Regression Model

# Create Logistic Regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)
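
To connect the fitted model back to the formula in Section 2, the sketch below reads the learned coefficients and intercept and applies the sigmoid by hand. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the snippet runs without downloading the diabetes file); the names `X_demo`, `y_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

demo_model = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,)
weights = demo_model.coef_[0]      # one "m" per feature
bias = demo_model.intercept_[0]    # the "c" term

# Apply the sigmoid formula by hand
z = X_demo @ weights + bias
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's own probability for class 1
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print("Manual formula matches predict_proba:", np.allclose(manual_proba, sklearn_proba))
```

With more than one feature, the single slope m becomes one coefficient per feature, and mX is replaced by the weighted sum of all features.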

Step 5: Make Predictions

# Predict on test data
y_pred = model.predict(X_test)
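
`predict` returns only the class labels. When the underlying probabilities are needed, for example to apply a threshold other than 0.5, `predict_proba` can be used instead. The following is a minimal sketch on synthetic stand-in data (an assumption, not the diabetes dataset); the 0.7 threshold and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# Column 1 of predict_proba holds P(Y=1 | X) for each test row
proba_pos = clf.predict_proba(X_te)[:, 1]

# Thresholding at 0.5 reproduces predict(); a stricter threshold flags fewer positives
default_pred = (proba_pos >= 0.5).astype(int)
strict_pred = (proba_pos >= 0.7).astype(int)

print("Matches predict():", (default_pred == clf.predict(X_te)).all())
print("Positives at 0.5:", default_pred.sum())
print("Positives at 0.7:", strict_pred.sum())
```

Raising the threshold trades recall for precision: fewer cases are flagged as positive, but those that are flagged carry a higher predicted probability.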

Step 6: Evaluate Model Performance

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report (Precision, Recall, F1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 7: Visualizing Decision Boundary (For Simple 2D Cases)

For simplicity, let’s assume we only use Glucose and BMI as features.

# mlxtend is a separate package (install with: pip install mlxtend)
from mlxtend.plotting import plot_decision_regions

# Select only two features for visualization
X_vis = X_train[['Glucose', 'BMI']].values
y_vis = y_train.values

# Train model with selected features
model_vis = LogisticRegression()
model_vis.fit(X_vis, y_vis)

# Plot decision boundary
plot_decision_regions(X_vis, y_vis, clf=model_vis, legend=2)
plt.xlabel("Glucose Level")
plt.ylabel("BMI")
plt.title("Decision Boundary for Diabetes Prediction")
plt.show()

4. Understanding the Output

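🔹 Accuracy → Fraction of predictions that are correct, (TP + TN) / total. accuracy_score returns a value between 0 and 1.
🔹 Confusion Matrix → Counts of True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP). For binary labels, sklearn lays it out as [[TN, FP], [FN, TP]].
🔹 Precision → Of all predicted positives, the share that are truly positive: TP / (TP + FP).
🔹 Recall → Of all actual positives, the share the model found: TP / (TP + FN).
🔹 F1-score → Harmonic mean of Precision and Recall.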
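
The metrics above can all be derived from the four cells of the confusion matrix. The sketch below does this by hand on a small, made-up set of labels (`y_true` and `y_hat` are hypothetical, not model output) and checks the results against sklearn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)   # 4 1 1 4
print("Accuracy:", accuracy)               # 0.8
print("Precision:", precision, "| sklearn:", precision_score(y_true, y_hat))
print("Recall:", recall, "| sklearn:", recall_score(y_true, y_hat))
print("F1:", f1, "| sklearn:", f1_score(y_true, y_hat))
```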

Summary

✔ Logistic Regression is for classification problems (0 or 1, Yes or No).
✔ Uses the Sigmoid function to predict probabilities.
✔ sklearn makes implementation easy in Python.
✔ Confusion Matrix & Accuracy help evaluate model performance.