Descriptive Statistics in Data Science

Descriptive statistics is the practice of summarizing a dataset to understand its main characteristics. It describes the distribution, central tendency, and variability of the data, and is usually done before applying any machine learning technique.

1. Why Use Descriptive Statistics?

✔ Provides a quick summary of the data.
✔ Helps in detecting outliers and missing values.
✔ Aids in data visualization and feature selection.
✔ Essential for Exploratory Data Analysis (EDA).

2. Types of Descriptive Statistics

Category	Description	Examples
Measures of Central Tendency	Describes the center of data	Mean, Median, Mode
Measures of Dispersion	Describes data spread	Range, Variance, Standard Deviation
Measures of Shape	Describes the distribution shape	Skewness, Kurtosis

3. Measures of Central Tendency

3.1. Mean (Average)

The mean is the sum of all values divided by the number of values.

import numpy as np

data = [10, 20, 30, 40, 50]
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

📌 Use case: Works well when data is roughly symmetric (e.g., normally distributed), but the mean is sensitive to outliers.

3.2. Median (Middle Value)

The median is the middle value when data is sorted. With an even number of values, it is the average of the two middle values.

median_value = np.median(data)
print(f"Median: {median_value}")

📌 Use case: Better than the mean for skewed data or data with outliers.
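
To illustrate the point about outliers, here is a minimal sketch that compares the mean and the median before and after adding one extreme value. The value 1000 is a made-up outlier chosen only for illustration.

```python
import numpy as np

clean = [10, 20, 30, 40, 50]
with_outlier = clean + [1000]  # hypothetical extreme value, for illustration only

mean_clean = np.mean(clean)                # 30.0
median_clean = np.median(clean)            # 30.0
mean_outlier = np.mean(with_outlier)       # 1150 / 6, about 191.67
median_outlier = np.median(with_outlier)   # (30 + 40) / 2 = 35.0

print(f"Mean:   {mean_clean} -> {mean_outlier:.2f}")
print(f"Median: {median_clean} -> {median_outlier}")
```

A single extreme value pulls the mean far away from the bulk of the data, while the median barely moves.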

3.3. Mode (Most Frequent Value)

The mode is the most frequently occurring value.

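from scipy import stats

# Every value in `data` is unique, so all values tie; SciPy then returns the smallest one.
mode_result = stats.mode(data)
# .mode is a scalar in newer SciPy versions and a length-1 array in older ones.
mode_value = np.atleast_1d(mode_result.mode)[0]
print(f"Mode: {mode_value}")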

📌 Use case: Useful for categorical data (e.g., most common customer complaint).
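
For categorical data, one possible approach is to count labels with the standard library. This is only a sketch: the complaint categories below are invented for illustration.

```python
from collections import Counter

# Hypothetical complaint categories (not from a real dataset)
complaints = ["billing", "delivery", "billing", "support", "billing", "delivery"]

counts = Counter(complaints)
# most_common(1) returns a list with one (label, count) pair
most_common_label, frequency = counts.most_common(1)[0]

print(f"Mode: {most_common_label} (appears {frequency} times)")
```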

4. Measures of Dispersion (Spread of Data)

4.1. Range (Max – Min)

The range is the difference between the highest and lowest values.

range_value = np.max(data) - np.min(data)
print(f"Range: {range_value}")

📌 Use case: Provides a quick idea of spread but is sensitive to outliers.
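
The sensitivity to outliers can be seen in a short sketch. It uses np.ptp ("peak to peak"), which returns the same result as subtracting the minimum from the maximum; the value 1000 is a made-up outlier.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
data_with_outlier = data + [1000]  # hypothetical extreme value, for illustration only

range_clean = np.ptp(data)                  # same as np.max(data) - np.min(data)
range_outlier = np.ptp(data_with_outlier)

print(f"Range without outlier: {range_clean}")
print(f"Range with outlier:    {range_outlier}")
```

Because the range depends only on the two most extreme values, one unusual observation changes it from 40 to 990.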

4.2. Variance (How Data Deviates from Mean)

Variance is the average of the squared deviations from the mean.

# np.var returns the population variance by default (ddof=0)
variance_value = np.var(data)
print(f"Variance: {variance_value}")

📌 Use case: Larger variance means more spread-out data.
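
NumPy can compute either the population variance (divide by n) or the sample variance (divide by n - 1), controlled by the ddof parameter. The sketch below checks both against a manual calculation on the same five values used earlier.

```python
import numpy as np

data = [10, 20, 30, 40, 50]

population_var = np.var(data)          # ddof=0: divides by n
sample_var = np.var(data, ddof=1)      # ddof=1: divides by n - 1

# Manual calculation for comparison
mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]
manual_population = sum(squared_devs) / len(data)        # 1000 / 5 = 200.0
manual_sample = sum(squared_devs) / (len(data) - 1)      # 1000 / 4 = 250.0

print(f"Population variance: {population_var}")
print(f"Sample variance:     {sample_var}")
```

The sample version is usually the one to use when the data is a sample drawn from a larger population.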

4.3. Standard Deviation (√Variance)

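The standard deviation (σ) is the square root of the variance, so it is expressed in the same units as the data.

# np.std returns the population standard deviation by default (ddof=0)
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")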

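📌 Use case: Helps in judging data consistency. A small standard deviation means values cluster tightly around the mean; a large one means they vary widely.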
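
As a rough illustration of consistency, the sketch below compares two made-up series that share the same mean but differ in spread. Both lists are hypothetical.

```python
import numpy as np

# Two hypothetical series with the same mean (50) but different spread
steady = [48, 49, 50, 51, 52]
erratic = [10, 30, 50, 70, 90]

mean_steady, mean_erratic = np.mean(steady), np.mean(erratic)
std_steady = np.std(steady)      # sqrt(2), about 1.41
std_erratic = np.std(erratic)    # sqrt(800), about 28.28

print(f"steady:  mean={mean_steady}, std={std_steady:.2f}")
print(f"erratic: mean={mean_erratic}, std={std_erratic:.2f}")
```

The mean alone cannot tell these two series apart; the standard deviation can.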

5. Measures of Shape

5.1. Skewness (Symmetry of Data)

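Skewness measures how asymmetric a distribution is around its mean.

Positive skew → Longer tail on the right; most values cluster on the left.
Negative skew → Longer tail on the left; most values cluster on the right.
Zero skew → Symmetric distribution.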

skew_value = stats.skew(data)
print(f"Skewness: {skew_value}")
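
The sample used so far is perfectly symmetric, so its skewness is zero. The sketch below adds two made-up samples, one with a long right tail and its mirror image, to show how the sign changes.

```python
from scipy import stats

symmetric = [10, 20, 30, 40, 50]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # hypothetical sample with a long right tail
left_skewed = [-x for x in right_skewed]      # mirror image: long left tail

skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)
skew_left = stats.skew(left_skewed)

print(f"Symmetric:    {skew_symmetric:.3f}")
print(f"Right-skewed: {skew_right:.3f}")   # positive
print(f"Left-skewed:  {skew_left:.3f}")    # negative, same magnitude
```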

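5.2. Kurtosis (Tail Heaviness of Distribution)

Kurtosis describes how heavy the tails of a distribution are. By default, stats.kurtosis() returns excess kurtosis, for which a normal distribution scores 0.

High kurtosis → Heavier tails, more extreme values (outliers).
Low kurtosis → Lighter tails, fewer extreme values.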

kurtosis_value = stats.kurtosis(data)
print(f"Kurtosis: {kurtosis_value}")
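
SciPy supports two conventions for kurtosis, selected with the fisher parameter. The sketch below shows both on the same five values; the two results always differ by exactly 3.

```python
from scipy import stats

data = [10, 20, 30, 40, 50]

# Fisher definition (default): excess kurtosis, normal distribution = 0
excess_kurtosis = stats.kurtosis(data)
# Pearson definition: normal distribution = 3
pearson_kurtosis = stats.kurtosis(data, fisher=False)

print(f"Excess (Fisher) kurtosis: {excess_kurtosis:.2f}")
print(f"Pearson kurtosis:         {pearson_kurtosis:.2f}")
```

When comparing kurtosis values across tools, it helps to check which convention each one uses.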

6. Descriptive Statistics Using Pandas

For large datasets, Pandas provides an easy way to summarize data.

import pandas as pd

# Create sample data
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Summary statistics
print(df.describe())

📌 Output includes: Count, Mean, Standard Deviation (sample, ddof=1), Min, Quartiles (25%, 50% = Median, 75%), and Max.
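
Individual statistics can be read from the describe() result by label. The sketch below also shows why the pandas standard deviation differs slightly from the NumPy default: pandas divides by n - 1, NumPy by n.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# describe() on a single column returns a Series indexed by statistic name
summary = df['Values'].describe()

count = summary['count']
mean = summary['mean']          # 55.0
median = summary['50%']         # the 50th percentile is the median
pandas_std = summary['std']     # sample standard deviation (ddof=1)

# NumPy default is the population standard deviation (ddof=0)
numpy_std = np.std(df['Values'].to_numpy())

print(f"Count: {count}, Mean: {mean}, Median: {median}")
print(f"pandas std (ddof=1): {pandas_std:.4f}")
print(f"NumPy std (ddof=0):  {numpy_std:.4f}")
```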

7. Summary Table

Metric	Definition	Python Function
Mean	Average of values	`np.mean()`
Median	Middle value	`np.median()`
Mode	Most common value	`stats.mode()`
Range	Max – Min	`np.max() - np.min()`
Variance	Average squared deviation from the mean	`np.var()`
Standard Deviation	Square root of the variance	`np.std()`
Skewness	Asymmetry of distribution	`stats.skew()`
Kurtosis	Tail heaviness of distribution	`stats.kurtosis()`

Conclusion

✔ Descriptive statistics provide important insights before applying machine learning.
✔ They help in detecting outliers, skewness, and the spread of data.
✔ Use Pandas, NumPy, and SciPy to quickly analyze datasets.