Data Science Linear Regression

Linear Regression is one of the most fundamental and widely used algorithms in Machine Learning and Data Science. It models the relationship between variables and uses that relationship to make predictions.

In this tutorial, we will cover:
✅ What is Linear Regression?
✅ Understanding the Mathematical Formula
✅ Python Implementation (Using sklearn)
✅ Evaluating Model Performance

1. What is Linear Regression?

Linear Regression is a Supervised Learning Algorithm used for predicting continuous values. It assumes a linear relationship between the input (X) and output (Y).

Examples of Linear Regression Applications:

  • Predicting house prices based on size.
  • Forecasting sales revenue based on past data.
  • Estimating salary based on years of experience.

2. Mathematical Formula of Linear Regression

The equation of a straight line: \( Y = mX + c \)

Where:

  • \( Y \) → Dependent variable (Prediction)
  • \( X \) → Independent variable (Feature)
  • \( m \) → Slope (Coefficient)
  • \( c \) → Intercept (Constant)
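
As a quick worked example, suppose the fitted line has a slope of 0.2 and an intercept of 50 (these values are assumed purely for illustration, not taken from the model trained later in this tutorial):

# Illustrative only: assumed slope and intercept values
m = 0.2       # price increase ($1000s) per extra sq ft
c = 50        # baseline price ($1000s)
X = 1500      # house size in sq ft
Y = m * X + c
print(Y)      # 350.0 -> a predicted price of $350,000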

 

Cost Function (Mean Squared Error – MSE)

We minimize the error between predicted and actual values using:

\( \text{MSE} = \frac{1}{n} \sum (Y_{\text{actual}} - Y_{\text{predicted}})^2 \)

  • \( \text{MSE} \): Mean Squared Error
  • \( n \): Number of observations
  • \( Y_{\text{actual}} \): Actual observed values
  • \( Y_{\text{predicted}} \): Predicted values from the model
  • \( \sum \): Summation over all observations
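
To make the formula concrete, here is a small sketch that computes the MSE by hand with NumPy; the actual and predicted values below are made up purely to illustrate the calculation:

import numpy as np

# Made-up actual and predicted values, for illustration only
y_actual = np.array([150, 200, 250, 300])
y_predicted = np.array([160, 190, 240, 320])

# MSE = mean of the squared differences
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)  # 175.0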

3. Linear Regression Implementation in Python

Step 1: Install and Import Required Libraries

If the libraries are not installed yet, install them first with pip install numpy pandas matplotlib scikit-learn. Then import them:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Step 2: Create a Sample Dataset

# Sample dataset: House sizes (sq ft) and corresponding prices ($1000s)
X = np.array([500, 800, 1000, 1500, 1800, 2000, 2500, 3000, 3500, 4000]).reshape(-1, 1)
y = np.array([150, 200, 250, 300, 350, 400, 450, 500, 550, 600])

Step 3: Split Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  • 80% data for training
  • 20% data for testing
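
With only 10 samples and test_size=0.2, this should leave 8 samples for training and 2 for testing, which you can confirm directly:

# Check how many samples ended up in each set
print("Training samples:", X_train.shape[0])  # 8
print("Test samples:", X_test.shape[0])       # 2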

Step 4: Train the Linear Regression Model

# Create a Linear Regression model
model = LinearRegression()

# Train the model using training data
model.fit(X_train, y_train)

Step 5: Model Predictions

y_pred = model.predict(X_test)

  • y_pred contains the predicted house prices.
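
A quick way to sanity-check the output is to print each predicted price next to the actual price from the test set:

# Compare actual vs. predicted prices for the test samples
for actual, predicted in zip(y_test, y_pred):
    print(f"Actual: {actual}, Predicted: {predicted:.1f}")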

 

Step 6: Evaluate Model Performance

# Print model parameters
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate R-Squared Score
r2 = r2_score(y_test, y_pred)
print("R-Squared Score:", r2)

  • R² Score close to 1 means a good fit.
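
If you want to double-check the R² value, it can be reconstructed from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares, using the arrays already defined above:

# Manual R-Squared check: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print("Manual R-Squared:", 1 - ss_res / ss_tot)  # should match r2_score(y_test, y_pred)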

 

Step 7: Visualizing the Regression Line

plt.scatter(X, y, color="blue", label="Actual Prices")
plt.plot(X, model.predict(X), color="red", label="Regression Line")
plt.xlabel("House Size (sq ft)")
plt.ylabel("Price ($1000s)")
plt.title("House Price Prediction using Linear Regression")
plt.legend()
plt.show()

4. Understanding the Output

  • Regression Line: Shows the best-fit line for the data.
  • MSE (Mean Squared Error): Measures how well the model predicts (lower is better).
  • R² Score: Measures how much variance is explained by the model (higher is better).
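
Once the metrics look reasonable, the fitted model can also be used to predict prices for house sizes that were not in the dataset; the 2,200 sq ft value below is just an example input:

# Predict the price of a 2,200 sq ft house (example input)
new_size = np.array([[2200]])   # sklearn expects a 2D array of features
predicted_price = model.predict(new_size)
print("Predicted price ($1000s):", predicted_price[0])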

Summary

  • Linear Regression predicts continuous values using a straight-line equation.
  • Training involves minimizing the error between actual and predicted values.
  • Python’s sklearn library makes implementation easy.
  • Visualizing the results helps in understanding the model’s accuracy.