Correlation & Covariance in Data Science

Correlation and covariance are statistical measures that describe the relationship between two variables. They are widely used in exploratory data analysis (EDA) to understand dependencies between features in a dataset.

1. What is Covariance?

✔ Covariance measures how two variables vary together.
✔ If one variable increases when the other increases, the covariance is positive.
✔ If one variable increases while the other decreases, the covariance is negative.

Formula for Covariance:

\( \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \)

Where:

  • Xᵢ and Yᵢ are the data points.
  • X̄ and Ȳ are the mean values.
  • n is the number of observations.
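As a quick check of the formula, the sample covariance can be computed by hand before using a library (the data values below are illustrative):

```python
# Manual sample covariance following the formula above (n - 1 denominator).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sum of products of deviations from the means, divided by n - 1
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
print(cov_xy)  # 5.0
```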

Covariance in Python:

import numpy as np

# Sample data
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Compute covariance matrix
cov_matrix = np.cov(X, Y)
print("Covariance Matrix:\n", cov_matrix)

# Extract covariance value
cov_value = cov_matrix[0, 1]
print("Covariance between X and Y:", cov_value)


📌 Limitation: Covariance values depend on units, making interpretation difficult.

 

2. What is Correlation?

✔ Correlation measures the strength and direction of the relationship between two variables.
✔ Unlike covariance, correlation is standardized, making it easier to interpret.
✔ The Pearson correlation coefficient (r) is the most common measure.

Formula for Pearson Correlation:

\( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \)

Where:

Cov(X,Y) = Covariance of X and Y

\( \sigma_X \) and \( \sigma_Y \) = Standard deviations of X and Y
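The formula can be verified directly: dividing the sample covariance by the product of the sample standard deviations (both with `ddof=1`, so the n − 1 factors cancel) reproduces r. The data below is illustrative:

```python
import numpy as np

# Pearson r from the definition: Cov(X, Y) / (sigma_X * sigma_Y)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])

# Sample covariance and sample standard deviations (ddof=1)
cov_xy = np.cov(X, Y, ddof=1)[0, 1]
r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))
print(r)  # 1.0 for perfectly linear data
```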

 

Interpreting Correlation Values:

| Value of r | Interpretation |
| --- | --- |
| r = 1 | Perfect positive correlation |
| 0.7 ≤ r < 1 | Strong positive correlation |
| 0.3 ≤ r < 0.7 | Moderate positive correlation |
| 0 < r < 0.3 | Weak positive correlation |
| r = 0 | No correlation |
| −0.3 ≤ r < 0 | Weak negative correlation |
| −0.7 ≤ r < −0.3 | Moderate negative correlation |
| −1 < r < −0.7 | Strong negative correlation |
| r = −1 | Perfect negative correlation |
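The interpretation rules above can be expressed as a small helper function (the function name and thresholds simply mirror the table; this is an illustrative sketch, not a standard API):

```python
def interpret_r(r: float) -> str:
    """Map a Pearson r value to the descriptive labels in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    strength = abs(r)
    if strength == 1.0:
        label = "perfect"
    elif strength >= 0.7:
        label = "strong"
    elif strength >= 0.3:
        label = "moderate"
    elif strength > 0.0:
        label = "weak"
    else:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"

print(interpret_r(0.85))  # strong positive correlation
print(interpret_r(-0.2))  # weak negative correlation
```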

Pearson Correlation in Python:

import scipy.stats as stats

# Sample data
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Compute Pearson correlation (the second return value is the p-value)
corr_value, p_value = stats.pearsonr(X, Y)
print("Pearson Correlation:", corr_value)


📌 Advantage: Correlation is independent of the scale of the variables.
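This scale independence can be demonstrated directly: multiplying one variable by a constant (e.g. converting units) leaves r unchanged. The data values here are illustrative:

```python
import scipy.stats as stats

# Correlation is invariant to linear rescaling of either variable.
X = [1, 2, 3, 5, 8]
Y = [3, 5, 4, 9, 12]

r_original, _ = stats.pearsonr(X, Y)
r_scaled, _ = stats.pearsonr([x * 1000 for x in X], Y)  # e.g. km -> m

print(r_original, r_scaled)  # identical values
```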

3. Correlation Matrix (Multiple Variables)

To analyze relationships among multiple variables, we use a correlation matrix.

Correlation Matrix in Python (Using Pandas & Seaborn)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset
data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12]
}

df = pd.DataFrame(data)

# Compute correlation matrix
corr_matrix = df.corr()

# Display correlation matrix
print(corr_matrix)

# Heatmap visualization
plt.figure(figsize=(6,4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()


📌 Use case: Helps in feature selection by identifying highly correlated variables.
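A simple sketch of this screening step: scan the correlation matrix for feature pairs whose absolute correlation exceeds a chosen cutoff (the 0.9 threshold below is an arbitrary illustrative choice, and the dataset is the same toy data as above):

```python
import pandas as pd

# Toy dataset from the example above
data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12],
}
df = pd.DataFrame(data)

# Flag feature pairs with |r| above an (arbitrary) threshold
corr = df.corr().abs()
threshold = 0.9
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(pairs)
```

In practice, one feature from each flagged pair is often dropped before model training to reduce redundancy.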

 

4. Key Differences Between Covariance and Correlation

| Aspect | Covariance | Correlation |
| --- | --- | --- |
| Definition | Measures the direction of the relationship | Measures both direction and strength |
| Scale | Dependent on units | Standardized |
| Interpretation | Difficult to interpret due to unit dependency | Easier to interpret |
| Formula | \( \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \) | \( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \) |
| Range | No fixed range | −1 to +1 |

5. Summary

✔ Covariance shows how two variables change together but is hard to interpret.
✔ Correlation standardizes relationships and is easier to interpret.
✔ A correlation matrix helps analyze multiple variables at once.
✔ Used in feature selection for machine learning models.