Correlation and covariance are statistical measures that describe the relationship between two variables. They are widely used in exploratory data analysis (EDA) to understand dependencies between features in a dataset.
1. What is Covariance?
✔ Covariance measures how two variables vary together.
✔ If one variable increases when the other increases, the covariance is positive.
✔ If one variable increases while the other decreases, the covariance is negative.
Formula for Covariance:
\( \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \)
Where:
- Xᵢ and Yᵢ are the data points.
- X̄ and Ȳ are the mean values.
- n is the number of observations.
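As a sanity check, the formula can be applied term by term and compared against NumPy's built-in result (a minimal sketch using the same sample data as the NumPy example):

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Apply the covariance formula directly
x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)
cov_manual = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / (len(X) - 1)

# np.cov uses the same sample (n - 1) denominator by default
cov_numpy = np.cov(X, Y)[0, 1]

print(cov_manual)  # 5.0
print(np.isclose(cov_manual, cov_numpy))  # True
```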
Covariance in Python:
```python
import numpy as np

# Sample data
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Compute the 2x2 covariance matrix
cov_matrix = np.cov(X, Y)
print("Covariance Matrix:\n", cov_matrix)

# Extract the covariance of X and Y
cov_value = cov_matrix[0, 1]
print("Covariance between X and Y:", cov_value)
```
📌 Limitation: Covariance values depend on units, making interpretation difficult.
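To see this unit dependence concretely, the sketch below (with made-up height/weight data) rescales one variable from metres to centimetres; the covariance scales by the same factor of 100, even though the underlying relationship is unchanged:

```python
import numpy as np

# Illustrative data: heights in metres, weights in kg
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight = [50, 60, 65, 72, 80]

cov_m = np.cov(height_m, weight)[0, 1]

# Same data with heights expressed in centimetres
height_cm = [h * 100 for h in height_m]
cov_cm = np.cov(height_cm, weight)[0, 1]

print(cov_m, cov_cm)  # the second value is 100x the first
```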
2. What is Correlation?
✔ Correlation measures the strength and direction of the relationship between two variables.
✔ Unlike covariance, correlation is standardized, making it easier to interpret.
✔ The Pearson correlation coefficient (r) is the most common measure.
Formula for Pearson Correlation:
\( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \)
Where:
Cov(X,Y) = Covariance of X and Y
\( \sigma_X \) and \( \sigma_Y \) = Standard deviations of X and Y
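Since r is just the covariance divided by the product of the standard deviations, the formula can be verified directly on the earlier sample data (note that `ddof=1` matches the sample n − 1 convention used by `np.cov`):

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

cov_xy = np.cov(X, Y)[0, 1]
# ddof=1 gives the sample standard deviation (n - 1 denominator)
r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))

print(r)  # 1.0, since Y is an exact linear function of X
```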
Interpreting Correlation Values:
| Value of r | Interpretation |
|---|---|
| r = 1 | Perfect positive correlation |
| 0.7 ≤ r < 1 | Strong positive correlation |
| 0.3 ≤ r < 0.7 | Moderate positive correlation |
| 0 < r < 0.3 | Weak positive correlation |
| r = 0 | No correlation |
| −0.3 < r < 0 | Weak negative correlation |
| −0.7 < r ≤ −0.3 | Moderate negative correlation |
| −1 < r ≤ −0.7 | Strong negative correlation |
| r = −1 | Perfect negative correlation |
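For convenience, the table can be turned into a small helper function (an illustrative sketch; the cut-offs are the conventional rules of thumb used above, not a formal standard):

```python
def interpret_r(r: float) -> str:
    """Map a Pearson r to the qualitative labels in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    strength = abs(r)
    if strength == 1.0:
        label = "perfect"
    elif strength >= 0.7:
        label = "strong"
    elif strength >= 0.3:
        label = "moderate"
    elif strength > 0.0:
        label = "weak"
    else:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"

print(interpret_r(0.85))   # strong positive correlation
print(interpret_r(-0.4))   # moderate negative correlation
```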
Pearson Correlation in Python:
```python
import scipy.stats as stats

# Sample data
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# pearsonr returns the coefficient and a p-value
corr_value, p_value = stats.pearsonr(X, Y)
print("Pearson Correlation:", corr_value)
```
📌 Advantage: Correlation is independent of the scale of the variables.
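A quick way to confirm this scale independence: rescaling one variable (e.g. metres to centimetres) leaves r unchanged (illustrative data):

```python
import scipy.stats as stats

X = [1, 2, 3, 4, 5]
Y = [2, 5, 7, 8, 12]

r_original, _ = stats.pearsonr(X, Y)

# Rescale X by a factor of 100: the correlation is unaffected
X_scaled = [x * 100 for x in X]
r_scaled, _ = stats.pearsonr(X_scaled, Y)

print(abs(r_original - r_scaled) < 1e-9)  # True
```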
3. Correlation Matrix (Multiple Variables)
To analyze relationships among multiple variables, we use a correlation matrix.
Correlation Matrix in Python (Using Pandas & Seaborn)
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset
data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12],
}
df = pd.DataFrame(data)

# Compute and display the correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

# Heatmap visualization
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
```
📌 Use case: Helps in feature selection by identifying highly correlated variables.
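As a sketch of that use case, the column pairs whose absolute correlation exceeds a chosen threshold can be listed directly from the matrix (the 0.9 cut-off here is arbitrary, and the dataset is the same illustrative one as above):

```python
import pandas as pd

data = {
    "Age": [22, 25, 47, 52, 46, 56, 28, 30, 40, 50],
    "Salary": [25000, 27000, 60000, 80000, 75000, 90000, 32000, 35000, 60000, 72000],
    "Experience": [1, 2, 10, 15, 12, 20, 3, 4, 8, 12],
}
corr = pd.DataFrame(data).corr()

# Collect upper-triangle pairs above the threshold
threshold = 0.9
cols = corr.columns
pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

In practice, one of each highly correlated pair is often dropped to reduce redundancy among features.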
4. Key Differences Between Covariance and Correlation
| Aspect | Covariance | Correlation |
|---|---|---|
| Definition | Measures the direction of the relationship | Measures both direction and strength |
| Scale | Dependent on the units of the variables | Standardized (−1 to +1) |
| Interpretation | Difficult to interpret due to unit dependency | Easier to interpret |
| Formula | \( \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \) | \( r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \) |
| Range | No fixed range | −1 to +1 |
5. Summary
✔ Covariance shows how two variables change together but is hard to interpret.
✔ Correlation standardizes relationships and is easier to interpret.
✔ A correlation matrix helps analyze multiple variables at once.
✔ Both are used in feature selection for machine learning models.