Descriptive Statistics in Data Science

Descriptive statistics summarize a dataset's main characteristics: its distribution, central tendency, and variability. Computing them is a standard first step before applying any machine learning technique.

1. Why Use Descriptive Statistics?

✔ Provides a quick summary of the data.
✔ Helps in detecting outliers and missing values.
✔ Aids in data visualization and feature selection.
✔ Essential for Exploratory Data Analysis (EDA).

2. Types of Descriptive Statistics

| Category | Description | Examples |
|---|---|---|
| Measures of Central Tendency | Describes the center of data | Mean, Median, Mode |
| Measures of Dispersion | Describes data spread | Range, Variance, Standard Deviation |
| Measures of Shape | Describes the distribution shape | Skewness, Kurtosis |

3. Measures of Central Tendency

3.1. Mean (Average)

The mean is the sum of all values divided by the number of values.

import numpy as np

data = [10, 20, 30, 40, 50]
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

📌 Use case: Works well for roughly symmetric data (such as normally distributed data), but it is sensitive to outliers.

3.2. Median (Middle Value)

The median is the middle value when data is sorted.

median_value = np.median(data)
print(f"Median: {median_value}")

📌 Use case: Better than the mean for skewed data or data with outliers.
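
To see why the median is more robust, here is a minimal sketch. It reuses the sample data from above and appends one extreme value (1000, a made-up outlier chosen purely for illustration). The variable names are assumptions, not part of the original example.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
# Hypothetical outlier added purely for illustration
data_with_outlier = data + [1000]

mean_clean = np.mean(data)
median_clean = np.median(data)
mean_outlier = np.mean(data_with_outlier)
median_outlier = np.median(data_with_outlier)

print(f"Without outlier: mean={mean_clean}, median={median_clean}")
print(f"With outlier:    mean={mean_outlier:.2f}, median={median_outlier}")
```

The single outlier pulls the mean from 30 up to about 191.67, while the median only moves from 30 to 35.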

3.3. Mode (Most Frequent Value)

The mode is the most frequently occurring value.

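from scipy import stats

# Every value in data is unique, so stats.mode returns the smallest one (10)
mode_value = stats.mode(data)
# np.atleast_1d handles both the array (older SciPy) and scalar (newer SciPy) results
print(f"Mode: {np.atleast_1d(mode_value.mode)[0]}")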

📌 Use case: Useful for categorical data (e.g., most common customer complaint).
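
For categorical data like this, one possible approach is to count labels directly instead of calling stats.mode. The sketch below uses a made-up list of complaint labels (an assumption for illustration) and shows two options: collections.Counter from the standard library and pandas Series.mode().

```python
from collections import Counter

import pandas as pd

# Hypothetical complaint labels, invented for this example
complaints = [
    "late delivery", "damaged item", "late delivery",
    "wrong item", "late delivery", "damaged item",
]

# Option 1: Counter returns (label, count) pairs, most frequent first
most_common_label, most_common_count = Counter(complaints).most_common(1)[0]
print(f"Most common complaint: {most_common_label} ({most_common_count} times)")

# Option 2: Series.mode() returns every value tied for most frequent
modes = pd.Series(complaints).mode()
print(f"Mode(s): {modes.tolist()}")
```

Series.mode() returns all tied values, which is handy when the data has more than one mode.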

4. Measures of Dispersion (Spread of Data)

4.1. Range (Max – Min)

The range is the difference between the highest and lowest values.

range_value = np.max(data) - np.min(data)
print(f"Range: {range_value}")

📌 Use case: Gives a quick sense of spread, but it depends only on the two extreme values, so a single outlier can distort it.
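
The sketch below illustrates that sensitivity. It compares the range with the interquartile range (Q3 minus Q1), a swapped-in measure used here only as a point of comparison. The helper functions and the outlier value are assumptions made for this example.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
# Hypothetical outlier added purely for illustration
data_with_outlier = data + [1000]

def value_range(values):
    """Difference between the largest and smallest value."""
    return np.max(values) - np.min(values)

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

range_clean = value_range(data)
range_outlier = value_range(data_with_outlier)
iqr_clean = iqr(data)
iqr_outlier = iqr(data_with_outlier)

print(f"Range: {range_clean} -> {range_outlier}")
print(f"IQR:   {iqr_clean} -> {iqr_outlier}")
```

With the outlier, the range jumps from 40 to 990, while the interquartile range only grows from 20 to 25.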

4.2. Variance (How Data Deviates from Mean)

Variance is the average squared deviation of the data points from the mean.

variance_value = np.var(data)
print(f"Variance: {variance_value}")

📌 Use case: A larger variance means the data is more spread out around the mean.
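
One detail worth knowing: np.var divides by n by default (population variance, ddof=0). Passing ddof=1 divides by n - 1 instead (sample variance), which is also the default in pandas. The sketch below shows both, plus a manual calculation for comparison. The variable names are assumptions for illustration.

```python
import numpy as np

data = [10, 20, 30, 40, 50]

# Population variance: sum of squared deviations divided by n
population_var = np.var(data)
# Sample variance: sum of squared deviations divided by n - 1
sample_var = np.var(data, ddof=1)

# Manual calculation of the population variance
mean_value = sum(data) / len(data)
manual_var = sum((x - mean_value) ** 2 for x in data) / len(data)

print(f"Population variance (ddof=0): {population_var}")
print(f"Sample variance (ddof=1):     {sample_var}")
print(f"Manual population variance:   {manual_var}")
```

For this data the population variance is 200 and the sample variance is 250.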

4.3. Standard Deviation (√Variance)

The standard deviation (σ) is the square root of the variance, so it is expressed in the same units as the data.

std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")

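📌 Use case: Helps in understanding data consistency. A small standard deviation means values cluster tightly around the mean; a large one means they vary widely.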
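
As a rough illustration of consistency, the sketch below compares two made-up datasets (both invented for this example) that share the same mean but differ in spread.

```python
import numpy as np

# Two hypothetical datasets with the same mean (50)
steady = [48, 49, 50, 51, 52]
volatile = [10, 30, 50, 70, 90]

steady_std = np.std(steady)
volatile_std = np.std(volatile)

print(f"Steady:   mean={np.mean(steady)}, std={steady_std:.2f}")
print(f"Volatile: mean={np.mean(volatile)}, std={volatile_std:.2f}")

# The standard deviation is the square root of the variance
print(np.isclose(steady_std, np.sqrt(np.var(steady))))
```

Both means are 50, but the standard deviation is about 1.41 for the steady data and about 28.28 for the volatile data.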

5. Measures of Shape

5.1. Skewness (Symmetry of Data)

  • Positive skew → Long tail on the right; most values cluster on the left.
  • Negative skew → Long tail on the left; most values cluster on the right.
  • Zero skew → Symmetric distribution.

skew_value = stats.skew(data)
print(f"Skewness: {skew_value}")
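
The sample data above is perfectly symmetric, so its skewness is zero. To see the sign change, the sketch below uses two made-up samples (assumptions for illustration): one with a long right tail and its mirror image with a long left tail.

```python
from scipy import stats

symmetric = [10, 20, 30, 40, 50]
# Hypothetical sample with one large value stretching the right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image of the sample above, so the long tail is on the left
left_skewed = [-x for x in right_skewed]

skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)
skew_left = stats.skew(left_skewed)

print(f"Symmetric:    {skew_symmetric:.3f}")
print(f"Right-skewed: {skew_right:.3f}")
print(f"Left-skewed:  {skew_left:.3f}")
```

The right-skewed sample yields a positive value, and its mirror image yields the same magnitude with a negative sign.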

5.2. Kurtosis (Tail Heaviness of Distribution)

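  • High kurtosis → Heavier tails and more extreme values (outliers).
  • Low kurtosis → Lighter tails and fewer extreme values.

# stats.kurtosis returns excess kurtosis by default (a normal distribution scores 0)
kurtosis_value = stats.kurtosis(data)
print(f"Kurtosis: {kurtosis_value}")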
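
Note that stats.kurtosis uses Fisher's definition by default, which subtracts 3 so that a normal distribution scores 0. Passing fisher=False gives Pearson's definition, where a normal distribution scores 3. A minimal sketch of the difference, with variable names chosen for illustration:

```python
from scipy import stats

data = [10, 20, 30, 40, 50]

# Fisher (excess) kurtosis: a normal distribution scores 0
excess_kurtosis = stats.kurtosis(data)
# Pearson kurtosis: a normal distribution scores 3
pearson_kurtosis = stats.kurtosis(data, fisher=False)

print(f"Excess kurtosis (Fisher): {excess_kurtosis:.2f}")
print(f"Pearson kurtosis:         {pearson_kurtosis:.2f}")
```

For this evenly spaced sample the excess kurtosis is -1.3 (Pearson: 1.7), reflecting lighter tails than a normal distribution.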

6. Descriptive Statistics Using Pandas

For large datasets, Pandas provides an easy way to summarize data.

import pandas as pd

# Create sample data
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Summary statistics
print(df.describe())

📌 Output includes: Count, Mean, Standard Deviation, Min, Max, and Quartiles (25%, 50% or median, 75%).
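
Individual entries can be read from the describe() result by label. The sketch below also shows that the standard deviation reported by Pandas is the sample version (ddof=1), so it will not match np.std with its default ddof=0. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# describe() on a single column returns a Series indexed by statistic name
summary = df['Values'].describe()
print(f"Mean: {summary['mean']}, Q1: {summary['25%']}, Q3: {summary['75%']}")

# Pandas uses the sample standard deviation (ddof=1) by default
pandas_std = df['Values'].std()
# NumPy uses the population standard deviation (ddof=0) by default
numpy_std = np.std(df['Values'].to_numpy())

print(f"Pandas std (ddof=1): {pandas_std:.4f}")
print(f"NumPy std (ddof=0):  {numpy_std:.4f}")
```

Here the mean is 55, Q1 is 32.5, and Q3 is 77.5. The two standard deviations differ (about 30.28 versus 28.72) only because of the ddof setting.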

7. Summary Table

| Metric | Definition | Python Function |
|---|---|---|
| Mean | Average of values | np.mean() |
| Median | Middle value | np.median() |
| Mode | Most common value | stats.mode() |
| Range | Max – Min | np.max() - np.min() |
| Variance | Average squared deviation from the mean | np.var() |
| Standard Deviation | Square root of the variance | np.std() |
| Skewness | Symmetry of distribution | stats.skew() |
| Kurtosis | Tail heaviness of distribution | stats.kurtosis() |

Conclusion

✔ Descriptive statistics provide important insights before applying machine learning.
✔ They help in detecting outliers, skewness, and the spread of data.
✔ Use Pandas, NumPy, and SciPy to quickly analyze datasets.