Correlation and covariance are statistical measures that describe the relationship between two variables. They are widely used in exploratory data analysis (EDA) to understand dependencies between features in a dataset.
1. What is Covariance?
✔ Covariance measures how two variables vary together.
✔ If one variable increases when the other increases, the covariance is positive.
✔ If one variable increases while the other decreases, the covariance is negative.
Formula for Covariance:
\( \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \)
Where:
- Xᵢ and Yᵢ are the data points.
- X̄ and Ȳ are the mean values.
- n is the number of observations.
Covariance in Python:
```python
import numpy as np

# Sample data
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Compute the 2x2 covariance matrix (np.cov uses the n - 1 denominator by default)
cov_matrix = np.cov(X, Y)
print("Covariance Matrix:\n", cov_matrix)

# The off-diagonal entry is Cov(X, Y)
cov_value = cov_matrix[0, 1]
print("Covariance between X and Y:", cov_value)
```
📌 Limitation: Covariance values depend on units, making interpretation difficult.
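To connect the formula to the NumPy result, the sample covariance can also be computed by hand. A minimal sketch using the same X and Y as above:

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sample covariance: sum of cross-deviations divided by (n - 1)
manual_cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
print(manual_cov)  # 5.0

# Matches the off-diagonal entry of np.cov, which also divides by (n - 1) by default
assert np.isclose(manual_cov, np.cov(X, Y)[0, 1])
```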
2. What is Correlation?
✔ Correlation measures the strength and direction of the relationship between two variables.
✔ Unlike covariance, correlation is standardized, making it easier to interpret.
✔ The Pearson correlation coefficient (r) is the most common measure.
Formula for Pearson Correlation:
\( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \)
Where:
- Cov(X,Y) = covariance of X and Y
- \( \sigma_X \) and \( \sigma_Y \) = standard deviations of X and Y
Interpreting Correlation Values:
| Value of r | Interpretation |
|---|---|
| r = 1 | Perfect positive correlation |
| 0.7 ≤ r < 1 | Strong positive correlation |
| 0.3 ≤ r < 0.7 | Moderate positive correlation |
| 0 ≤ r < 0.3 | Weak positive correlation |
| r = 0 | No linear correlation |
| −0.3 ≤ r < 0 | Weak negative correlation |
| −0.7 ≤ r < −0.3 | Moderate negative correlation |
| −1 ≤ r < −0.7 | Strong negative correlation |
| r = −1 | Perfect negative correlation |
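The table rows can be illustrated with synthetic data: the noisier the relationship, the smaller |r|. A hedged sketch (the seed, sample size, and noise levels are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(200)

# Same underlying signal, different amounts of added noise
strong = x + 0.1 * rng.standard_normal(200)    # little noise -> r close to 1
moderate = x + 2.0 * rng.standard_normal(200)  # heavy noise -> much smaller r

r_strong = np.corrcoef(x, strong)[0, 1]
r_moderate = np.corrcoef(x, moderate)[0, 1]
print("strong:", round(r_strong, 2), "moderate:", round(r_moderate, 2))
```

As the noise grows, the points scatter further from the trend line and r moves toward 0.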
Pearson Correlation in Python:
```python
import scipy.stats as stats

# Reuse X and Y from the covariance example above
corr_value, _ = stats.pearsonr(X, Y)  # the second return value is the p-value
print("Pearson Correlation:", corr_value)
```
📌 Advantage: Correlation is independent of the scale of the variables.
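The scale-invariance claim can be checked directly: rescaling a variable changes the covariance but leaves Pearson's r untouched. A small sketch using the same X and Y (the ×100 rescaling, e.g. metres to centimetres, is just an example):

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# r = Cov(X, Y) / (sigma_X * sigma_Y), using sample statistics (ddof=1)
cov_xy = np.cov(X, Y)[0, 1]
r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))
print(r)  # ~1.0, since Y is exactly 2 * X

# Rescale Y: the covariance is 100x larger, but r is unchanged
Y_scaled = [y * 100 for y in Y]
print(np.cov(X, Y_scaled)[0, 1])
print(np.corrcoef(X, Y_scaled)[0, 1])
```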
3. Correlation Matrix (Multiple Variables)
To analyze relationships among multiple variables, we use a correlation matrix.
Correlation Matrix in Python (Using Pandas & Seaborn)
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset
data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12]
}
df = pd.DataFrame(data)

# Compute the pairwise Pearson correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

# Heatmap visualization
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
```
📌 Use case: Helps in feature selection by identifying highly correlated variables.
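The heatmap idea extends to a simple feature-selection filter: flag one feature out of each highly correlated pair. A sketch on the same DataFrame (the 0.9 cutoff is an arbitrary choice for illustration; in this toy data all three features track each other closely):

```python
import numpy as np
import pandas as pd

data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12]
}
df = pd.DataFrame(data)

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.9  # illustrative cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Highly correlated features to drop:", to_drop)
```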
4. Key Differences Between Covariance and Correlation
| Aspect | Covariance | Correlation |
|---|---|---|
| Definition | Measures the direction of the relationship | Measures both direction and strength |
| Scale | Dependent on units | Standardized (-1 to +1) |
| Interpretation | Difficult to interpret due to unit dependency | Easier to interpret |
| Formula | \( \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \) | \( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \) |
| Range | No fixed range | -1 to +1 |
5. Summary
✔ Covariance shows how two variables change together but is hard to interpret.
✔ Correlation standardizes relationships and is easier to interpret.
✔ A correlation matrix helps analyze multiple variables at once.
✔ Both are used in feature selection for machine learning models.