Descriptive statistics summarize a dataset so that its main characteristics are easy to see: its distribution, central tendency, and variability. Reviewing them is a standard first step before applying any machine learning techniques.
1. Why Use Descriptive Statistics?
- Provides a quick summary of the data.
- Helps in detecting outliers and missing values.
- Aids in data visualization and feature selection.
- Essential for Exploratory Data Analysis (EDA).
2. Types of Descriptive Statistics
Category | Description | Examples |
---|---|---|
Measures of Central Tendency | Describes the center of data | Mean, Median, Mode |
Measures of Dispersion | Describes data spread | Range, Variance, Standard Deviation |
Measures of Shape | Describes the distribution shape | Skewness, Kurtosis |
3. Measures of Central Tendency
3.1. Mean (Average)
The mean is the sum of all values divided by the number of values.
```python
import numpy as np

data = [10, 20, 30, 40, 50]
mean_value = np.mean(data)
print(f"Mean: {mean_value}")
```
Use case: Works well when data is roughly symmetric (for example, normally distributed), but it is sensitive to outliers.
3.2. Median (Middle Value)
The median is the middle value when data is sorted.
```python
median_value = np.median(data)
print(f"Median: {median_value}")
```
Use case: Better than the mean for skewed data or data with outliers.
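To make the contrast between mean and median concrete, the sketch below appends one extreme value to the sample list and recomputes both. The outlier value of 1000 is a hypothetical number chosen only for illustration.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
# Hypothetical outlier appended for illustration only
data_with_outlier = data + [1000]

mean_clean = np.mean(data)
median_clean = np.median(data)
mean_outlier = np.mean(data_with_outlier)
median_outlier = np.median(data_with_outlier)

print(f"Without outlier: mean={mean_clean}, median={median_clean}")
print(f"With outlier:    mean={mean_outlier:.2f}, median={median_outlier}")
```

The single extreme value pulls the mean from 30 to roughly 191.67, while the median only moves from 30 to 35.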
3.3. Mode (Most Frequent Value)
The mode is the most frequently occurring value.
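```python
from scipy import stats

mode_result = stats.mode(data)
# Newer SciPy versions return a scalar here and older ones return an array,
# so np.atleast_1d makes the indexing work in both cases.
mode_value = np.atleast_1d(mode_result.mode)[0]
print(f"Mode: {mode_value}")
```
Every value in this sample occurs exactly once, so stats.mode() returns the smallest value (10).
Use case: Useful for categorical data (e.g., most common customer complaint).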
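For categorical data such as the complaint example, a pandas Series is often more convenient than stats.mode(), which is designed for numeric arrays. The sketch below is one possible approach; the complaint labels are invented for illustration.

```python
import pandas as pd

# Hypothetical complaint categories, invented for illustration
complaints = pd.Series([
    "late delivery", "damaged item", "late delivery",
    "wrong item", "late delivery", "damaged item",
])

# Series.mode() returns a Series because several values can tie for first place
most_common = complaints.mode()
counts = complaints.value_counts()

print(f"Most common complaint: {most_common.iloc[0]}")
print(counts)
```

Checking value_counts() alongside the mode shows how dominant the top category actually is.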
4. Measures of Dispersion (Spread of Data)
4.1. Range (Max – Min)
The range is the difference between the highest and lowest values.
```python
range_value = np.max(data) - np.min(data)
print(f"Range: {range_value}")
```
Use case: Provides a quick idea of spread but is sensitive to outliers.
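NumPy also offers np.ptp() ("peak to peak"), which computes the same max minus min in one call. The sketch below compares the two and shows how a single extreme value (a hypothetical 1000, for illustration) changes the range.

```python
import numpy as np

data = [10, 20, 30, 40, 50]

range_value = np.max(data) - np.min(data)
ptp_value = np.ptp(data)  # same result in a single call

# Hypothetical outlier appended for illustration only
range_with_outlier = np.ptp(data + [1000])

print(f"Range: {range_value}, np.ptp: {ptp_value}")
print(f"Range with outlier: {range_with_outlier}")
```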
4.2. Variance (How Data Deviates from Mean)
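Variance is the average squared deviation of the data points from the mean.
```python
# np.var() computes the population variance by default (ddof=0);
# pass ddof=1 for the sample variance.
variance_value = np.var(data)
print(f"Variance: {variance_value}")
```
Use case: A larger variance means the data is more spread out around the mean. Because variance is expressed in squared units, it is usually read alongside the standard deviation.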
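The calculation can be spelled out by hand to show what np.var() does and how the ddof argument changes the divisor. This is a minimal sketch using the same sample list; the variable names are illustrative.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
mean_value = np.mean(data)

# Squared distance of each point from the mean
squared_deviations = [(x - mean_value) ** 2 for x in data]

# Population variance divides by n; sample variance divides by n - 1
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(f"Population variance: {population_variance}  (np.var: {np.var(data)})")
print(f"Sample variance:     {sample_variance}  (np.var, ddof=1: {np.var(data, ddof=1)})")
```

For this sample the population variance is 200 and the sample variance is 250. The gap shrinks as the dataset grows, but it matters when comparing NumPy results with tools that default to the sample version.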
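4.3. Standard Deviation (√Variance)
The standard deviation (σ) is the square root of the variance, so it is expressed in the same units as the data.
```python
# Like np.var(), np.std() uses the population formula (ddof=0) by default
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")
```
Use case: Indicates how consistent the data is; a small standard deviation means the values sit close to the mean.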
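As a rough illustration of "consistency", the sketch below compares two lists that share the same mean but differ in spread. The steady list is hypothetical data made up for this comparison.

```python
import numpy as np

# Two samples with the same mean (30) but different spread.
# "steady" is hypothetical data invented for this comparison.
steady = [28, 29, 30, 31, 32]
volatile = [10, 20, 30, 40, 50]

steady_std = np.std(steady)
volatile_std = np.std(volatile)

print(f"Means: {np.mean(steady)} vs {np.mean(volatile)}")
print(f"Standard deviations: {steady_std:.2f} vs {volatile_std:.2f}")

# The standard deviation is the square root of the variance
print(np.isclose(volatile_std, np.sqrt(np.var(volatile))))
```

The means are identical, yet the standard deviation of the volatile list is ten times larger, which the mean alone would not reveal.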
5. Measures of Shape
5.1. Skewness (Symmetry of Data)
- Positive skew → Most values cluster on the left, with a long tail to the right.
- Negative skew → Most values cluster on the right, with a long tail to the left.
- Zero skew → Symmetric distribution.
```python
skew_value = stats.skew(data)
print(f"Skewness: {skew_value}")
```
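The sample list used so far is perfectly symmetric, so its skewness is zero. To show the sign convention, the sketch below adds two hypothetical lists, one with a long right tail and its mirror image.

```python
import numpy as np
from scipy import stats

symmetric = [10, 20, 30, 40, 50]
# Hypothetical samples invented for illustration
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]         # long tail to the right
left_skewed = [-50, -4, -3, -3, -3, -2, -2, -1]  # mirror image, tail to the left

skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)
skew_left = stats.skew(left_skewed)

print(f"Symmetric:    {skew_symmetric:.3f}")
print(f"Right-skewed: {skew_right:.3f}")
print(f"Left-skewed:  {skew_left:.3f}")

# In the right-skewed sample the tail drags the mean above the median
print(f"Mean: {np.mean(right_skewed)}, Median: {np.median(right_skewed)}")
```

The right-skewed sample produces a positive value and its mirror image produces the same magnitude with a negative sign.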
5.2. Kurtosis (Peakedness of Distribution)
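- High kurtosis → Heavier tails, so more extreme values (outliers).
- Low kurtosis → Lighter tails and fewer extreme values; often described as a flatter distribution.
```python
# stats.kurtosis() returns excess kurtosis by default,
# so a normal distribution scores 0 rather than 3.
kurtosis_value = stats.kurtosis(data)
print(f"Kurtosis: {kurtosis_value}")
```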
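SciPy supports two conventions for kurtosis, selected with the fisher argument. The sketch below computes both for the sample list so the 3-point offset between them is visible.

```python
import numpy as np
from scipy import stats

data = [10, 20, 30, 40, 50]

# Fisher definition (default): excess kurtosis, a normal distribution gives 0
excess_kurtosis = stats.kurtosis(data)
# Pearson definition: a normal distribution gives 3
pearson_kurtosis = stats.kurtosis(data, fisher=False)

print(f"Excess (Fisher) kurtosis: {excess_kurtosis:.2f}")
print(f"Pearson kurtosis:         {pearson_kurtosis:.2f}")
```

For this evenly spaced sample the excess kurtosis is negative (about -1.3), which is consistent with a flat distribution that has no heavy tails.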
6. Descriptive Statistics Using Pandas
For large datasets, Pandas provides an easy way to summarize data.
```python
import pandas as pd

# Create sample data
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Summary statistics
print(df.describe())
```
Output includes: Count, Mean, Standard Deviation, Min, Max, and Quartiles (25%, 50%, 75%). Unlike np.std(), Pandas reports the sample standard deviation (ddof=1).
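Individual entries can be read out of the describe() result by label, which is handy for follow-up calculations. The sketch below pulls out the quartiles, derives the interquartile range (IQR), and checks which standard deviation formula describe() uses. The IQR step is an illustrative addition, not part of the describe() output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# describe() on a single column returns a Series indexed by statistic name
summary = df['Values'].describe()

q1 = summary['25%']
q3 = summary['75%']
iqr = q3 - q1  # interquartile range: spread of the middle 50% of values

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")

# describe() reports the sample standard deviation (ddof=1)
print(np.isclose(summary['std'], np.std(df['Values'], ddof=1)))
```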
7. Summary Table
Metric | Definition | Python Function |
---|---|---|
Mean | Average of values | np.mean() |
Median | Middle value | np.median() |
Mode | Most common value | stats.mode() |
Range | Max – Min | np.max() - np.min() |
Variance | Spread of values | np.var() |
Standard Deviation | Dispersion from the mean | np.std() |
Skewness | Symmetry of distribution | stats.skew() |
Kurtosis | Peakedness and tail weight of distribution | stats.kurtosis() |
Conclusion
- Descriptive statistics provide important insights before applying machine learning.
- They help in detecting outliers, skewness, and the spread of data.
- Use Pandas, NumPy, and SciPy to quickly analyze datasets.