Logistic Regression is a Supervised Learning Algorithm used for classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts binary (0 or 1) or multi-class outputs.
1. What is Logistic Regression?
🔹 Logistic Regression is used when the output is categorical (e.g., Spam or Not Spam, Pass or Fail).
🔹 It models the probability that Y belongs to a particular class.
🔹 Instead of a straight line, it produces an S-shaped curve (Sigmoid function).
Examples of Logistic Regression Applications:
✅ Email Spam Detection (Spam or Not Spam)
✅ Predicting Customer Churn (Will leave or not)
✅ Disease Prediction (Diabetes: Yes or No)
2. Mathematical Formula of Logistic Regression
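Linear Regression outputs the linear combination \( mX + c \) directly, which can be any real number. Logistic Regression passes that same linear combination through the Sigmoid function, which squeezes the result into the range between 0 and 1 so it can be read as a probability: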
\( P(Y=1 \mid X) = \frac{1}{1 + e^{-(mX + c)}} \)
Where:
- \( P(Y=1 \mid X) \) → Probability that \( Y = 1 \) given the input \( X \)
- \( m \) → Slope (Coefficient); with several features there is one coefficient per feature
- \( c \) → Intercept
- \( e \) → Euler’s number (~2.718)
If \( P(Y=1 \mid X) \geq 0.5 \), the model predicts class 1 (Yes, Positive).
If \( P(Y=1 \mid X) \lt 0.5 \), the model predicts class 0 (No, Negative).
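To make the formula and the 0.5 threshold concrete, here is a minimal sketch in NumPy. The values of m and c are hypothetical, picked only so that one input lands exactly on the threshold; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical slope and intercept, chosen only for illustration
m, c = 0.5, -2.0

X = np.array([1.0, 4.0, 7.0])

# P(Y=1 | X) for each input: sigmoid of the linear combination mX + c
probabilities = sigmoid(m * X + c)

# Apply the decision rule: class 1 if the probability is at least 0.5
predictions = (probabilities >= 0.5).astype(int)

print("Probabilities:", probabilities)
print("Predictions:  ", predictions)
```

The middle input gives \( mX + c = 0 \), so its probability is exactly 0.5 and it is assigned to class 1 under the rule above.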
3. Logistic Regression Implementation in Python
Step 1: Install and Import Required Libraries
Install the packages first if they are missing (pip install numpy pandas matplotlib scikit-learn), then import them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load Dataset
For this example, we will use the Pima Indians Diabetes dataset, where the goal is to predict whether a patient has diabetes (1 = Yes, 0 = No). The CSV file has no header row, so the column names are passed in explicitly.
# Load dataset (replace with actual file if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)
# Display first 5 rows
print(df.head())
Step 3: Split Data into Features & Labels
# Features (X) - Independent variables
X = df.drop(columns=['Outcome'])  # Drop target variable
# Target (y) - Dependent variable
y = df['Outcome']
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train the Logistic Regression Model
# Create Logistic Regression model
model = LogisticRegression(max_iter=200)
# Train the model
model.fit(X_train, y_train)
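To connect the trained model back to the formula in Section 2, the sketch below rebuilds the predicted probabilities by hand from the fitted coefficients and intercept. It uses synthetic data from make_classification as a stand-in (an assumption, so the example runs without downloading anything), and the names X_demo, y_demo, and demo_model are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

demo_model = LogisticRegression(max_iter=200)
demo_model.fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,)
weights = demo_model.coef_[0]    # plays the role of m, one value per feature
bias = demo_model.intercept_[0]  # plays the role of c

# Apply the sigmoid to the linear combination by hand
z = X_demo @ weights + bias
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with the probability of class 1 reported by sklearn
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print("Manual and sklearn probabilities agree:", np.allclose(manual_proba, sklearn_proba))
```

With several features, \( mX \) becomes a weighted sum over all features, which is what the matrix product computes here.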
Step 5: Make Predictions
# Predict on test data
y_pred = model.predict(X_test)
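predict() returns class labels only. If the underlying probabilities are needed, predict_proba() exposes them, and the 0.5 rule from Section 2 can be applied manually. The following sketch again uses synthetic stand-in data (an assumption), and the 0.3 threshold is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(Y=1 | X)
proba_positive = demo_model.predict_proba(X_demo)[:, 1]

default_labels = demo_model.predict(X_demo)
labels_at_half = (proba_positive >= 0.5).astype(int)

# Arbitrary lower threshold, for illustration only
labels_at_03 = (proba_positive >= 0.3).astype(int)

print("Matches predict():", np.array_equal(default_labels, labels_at_half))
print("Positives at 0.5:", labels_at_half.sum())
print("Positives at 0.3:", labels_at_03.sum())
```

Lowering the threshold can only keep or increase the number of positive predictions, which is one way to trade precision for recall in a setting such as disease screening.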
Step 6: Evaluate Model Performance
# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report (Precision, Recall, F1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 7: Visualizing Decision Boundary (For Simple 2D Cases)
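A decision boundary can only be drawn directly in two dimensions, so this step retrains the model on just two features, Glucose and BMI. It relies on the third-party mlxtend package (pip install mlxtend).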
from mlxtend.plotting import plot_decision_regions
# Select only two features for visualization
X_vis = X_train[['Glucose', 'BMI']].values
y_vis = y_train.values
# Train model with selected features
model_vis = LogisticRegression()
model_vis.fit(X_vis, y_vis)
# Plot decision boundary
plot_decision_regions(X_vis, y_vis, clf=model_vis, legend=2)
plt.xlabel("Glucose Level")
plt.ylabel("BMI")
plt.title("Decision Boundary for Diabetes Prediction")
plt.show()
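For two features, the boundary in such a plot is the straight line where the predicted probability equals 0.5, that is, where \( w_1 x_1 + w_2 x_2 + b = 0 \). The sketch below checks this numerically on synthetic two-feature data (an assumption used in place of Glucose and BMI); the names w1, w2, and b are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature stand-in data (an assumption; not Glucose and BMI)
X_2d, y_2d = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=0
)
model_2d = LogisticRegression().fit(X_2d, y_2d)

w1, w2 = model_2d.coef_[0]
b = model_2d.intercept_[0]

# Pick a few x1 values and solve w1*x1 + w2*x2 + b = 0 for x2
x1 = np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2

# Points on that line should all receive a probability of about 0.5
boundary_points = np.column_stack([x1, x2])
boundary_proba = model_2d.predict_proba(boundary_points)[:, 1]
print(boundary_proba)
```

Points on one side of this line get a probability above 0.5 and are colored as class 1; points on the other side are colored as class 0.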
4. Understanding the Output
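🔹 Accuracy → Fraction of predictions that are correct, (TP + TN) / (TP + TN + FP + FN). accuracy_score returns it as a value between 0 and 1.
🔹 Confusion Matrix → Counts of True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP). For labels 0 and 1, sklearn lays it out as [[TN, FP], [FN, TP]], with rows as actual classes and columns as predicted classes.
🔹 Precision, Recall, F1-score → Precision = TP / (TP + FP) is the share of predicted positives that are truly positive. Recall = TP / (TP + FN) is the share of actual positives the model found. F1-score is the harmonic mean of the two.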
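The metrics above can all be derived from the four confusion matrix counts. Here is a small sketch that does the arithmetic by hand and compares it with sklearn's own functions. The labels in y_true and y_hat are made up for illustration and do not come from the diabetes model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Precision:", precision, "| sklearn:", precision_score(y_true, y_hat))
print("Recall:", recall, "| sklearn:", recall_score(y_true, y_hat))
print("F1-score:", f1, "| sklearn:", f1_score(y_true, y_hat))
```

In this toy example there are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so all four metrics come out to 0.8.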
Summary
✔ Logistic Regression is for classification problems (0 or 1, Yes or No).
✔ Uses the Sigmoid function to predict probabilities.
✔ sklearn makes implementation easy in Python.
✔ Confusion Matrix & Accuracy help evaluate model performance.