Logistic Regression is a Supervised Learning Algorithm used for classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts binary (0 or 1) or multi-class outputs.
1. What is Logistic Regression?
- Logistic Regression is used when the output is categorical (e.g., Spam or Not Spam, Pass or Fail).
- It models the probability that Y belongs to a particular class.
- Instead of a straight line, it produces an S-shaped curve (Sigmoid function).
Examples of Logistic Regression Applications:
- Email Spam Detection (Spam or Not Spam)
- Predicting Customer Churn (Will leave or not)
- Disease Prediction (Diabetes: Yes or No)
2. Mathematical Formula of Logistic Regression
Unlike Linear Regression, Logistic Regression uses the Sigmoid function to keep outputs between 0 and 1:
\( P(Y=1 \mid X) = \frac{1}{1 + e^{-(mX + c)}} \)
Where:
- \( P(Y=1 \mid X) \) → Probability that \( Y = 1 \) given the input \( X \)
- \( m \) → Slope (Coefficient)
- \( c \) → Intercept
- \( e \) → Euler's number (~2.718)
If \( P(Y=1 \mid X) \geq 0.5 \), the model predicts class 1 (Yes, Positive).
If \( P(Y=1 \mid X) < 0.5 \), the model predicts class 0 (No, Negative).
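To make the formula concrete, here is a minimal sketch in plain Python that evaluates the sigmoid for a single feature and applies the 0.5 threshold. The slope m = 0.05, the intercept c = -6.0, and the input values are made up for illustration; they are not fitted coefficients.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, m, c):
    """P(Y=1 | X=x) for a single feature x with slope m and intercept c."""
    return sigmoid(m * x + c)

def predict_class(x, m, c, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_probability(x, m, c) >= threshold else 0

# Hypothetical coefficients chosen only for illustration
m, c = 0.05, -6.0

for x in (90, 130, 170):
    p = predict_probability(x, m, c)
    print(f"x={x}: P(Y=1|X)={p:.3f} -> class {predict_class(x, m, c)}")
```

Because the sigmoid is monotonic, larger values of mX + c always give a larger probability, so the 0.5 threshold corresponds to the point where mX + c crosses zero.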
3. Logistic Regression Implementation in Python
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load Dataset
For this example, we will use a simple diabetes dataset where the goal is to predict whether a patient has diabetes (1 = Yes, 0 = No).
# Load dataset (replace with actual file if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

# Display first 5 rows
print(df.head())
Step 3: Split Data into Features & Labels
# Features (X) - Independent variables
X = df.drop(columns=['Outcome'])  # Drop target variable

# Target (y) - Dependent variable
y = df['Outcome']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train the Logistic Regression Model
# Create Logistic Regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)
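With more than one feature, the single slope m becomes one coefficient per feature, and mX + c becomes a weighted sum of the features plus an intercept. The sketch below reads the fitted coef_ and intercept_ attributes and reproduces predict_proba by hand. It assumes a small synthetic dataset from make_classification rather than the diabetes data, so the names X_demo, y_demo, and demo_model are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration, not the diabetes set)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

demo_model = LogisticRegression(max_iter=200)
demo_model.fit(X_demo, y_demo)

# One coefficient per feature (the "m" values) and a single intercept (the "c")
weights = demo_model.coef_[0]
intercept = demo_model.intercept_[0]

# Apply the sigmoid to the weighted sum for the first five rows
z = X_demo[:5] @ weights + intercept
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba holds P(Y=1 | X)
sklearn_proba = demo_model.predict_proba(X_demo[:5])[:, 1]

print("Manual :", np.round(manual_proba, 4))
print("sklearn:", np.round(sklearn_proba, 4))
```

For a binary model the two rows should agree, which shows that the fitted model is the sigmoid formula applied to a learned weighted sum.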
Step 5: Make Predictions
# Predict on test data
y_pred = model.predict(X_test)
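The predict method returns hard class labels, which for a binary model matches thresholding the predicted probability at 0.5. When the probabilities themselves are needed, predict_proba exposes them. Here is a minimal sketch on synthetic data (an assumption, so the block runs without the diabetes file); the 0.3 threshold is an arbitrary illustration value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba_pos = clf.predict_proba(X_te)[:, 1]

# Default labels, and the same labels rebuilt from the 0.5 threshold
labels_default = clf.predict(X_te)
labels_at_half = (proba_pos >= 0.5).astype(int)

# A lower threshold flags at least as many rows as positive
labels_low = (proba_pos >= 0.3).astype(int)

print("Positives at 0.5:", int(labels_at_half.sum()))
print("Positives at 0.3:", int(labels_low.sum()))
```

Lowering the threshold trades more false positives for fewer false negatives, which can matter in settings such as disease screening.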
Step 6: Evaluate Model Performance
# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report (Precision, Recall, F1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 7: Visualizing Decision Boundary (For Simple 2D Cases)
For simplicity, let’s assume we only use Glucose and BMI as features.
# Requires the mlxtend package (pip install mlxtend)
from mlxtend.plotting import plot_decision_regions

# Select only two features for visualization
X_vis = X_train[['Glucose', 'BMI']].values
y_vis = y_train.values

# Train model with selected features
model_vis = LogisticRegression()
model_vis.fit(X_vis, y_vis)

# Plot decision boundary
plot_decision_regions(X_vis, y_vis, clf=model_vis, legend=2)
plt.xlabel("Glucose Level")
plt.ylabel("BMI")
plt.title("Decision Boundary for Diabetes Prediction")
plt.show()
4. Understanding the Output
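- Accuracy → Fraction of predictions that are correct (accuracy_score returns a value between 0 and 1).
- Confusion Matrix → Counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
- Precision, Recall, F1-score → Precision is the share of predicted positives that are actually positive, Recall is the share of actual positives the model finds, and F1-score is the harmonic mean of the two.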
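To show how these metrics relate to the confusion matrix, here is a small sketch that derives them from the four counts. The counts are invented for illustration and are not results from the diabetes model. For a binary problem, scikit-learn's confusion_matrix lays out its result as [[TN, FP], [FN, TP]], which is the layout assumed here.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion matrix in the [[TN, FP], [FN, TP]] layout
conf_matrix_demo = [[50, 10],
                    [5, 35]]
(tn, fp), (fn, tp) = conf_matrix_demo

metrics = classification_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can look good on imbalanced data, which is why precision and recall are reported alongside it.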
Summary
- Logistic Regression is for classification problems (0 or 1, Yes or No).
- Uses the Sigmoid function to predict probabilities.
- sklearn makes implementation easy in Python.
- Confusion Matrix & Accuracy help evaluate model performance.