Data Science Confidence Intervals

Confidence intervals (CIs) are a key concept in data science. A CI is a range of values, computed from sample data, within which we expect a population parameter (such as a mean or proportion) to lie at a stated level of confidence. Here is an overview.

1. What Is a Confidence Interval?

Definition:
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The interval is computed from sample data and is associated with a confidence level (e.g., 95%).
Confidence Level:
This is the long-run proportion (expressed as a percentage) of confidence intervals that would contain the true population parameter if you repeated the experiment many times. For example, a 95% confidence level means that if you repeated the study 100 times, about 95 of the resulting intervals would capture the true parameter. It describes the reliability of the procedure, not the probability that one particular interval is correct.
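The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the function name `coverage`, the true mean of 50, the standard deviation of 10, and the sample size of 30 are all assumptions chosen for the demonstration, and the population is assumed to be normal with a known standard deviation.

```python
import math
import random
import statistics

def coverage(true_mean=50.0, sigma=10.0, n=30, trials=2000,
             confidence=0.95, seed=0):
    """Fraction of simulated z-intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Two-sided critical value, e.g. about 1.96 for 95%
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sigma / math.sqrt(n)  # standard error with known sigma
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - z * se <= true_mean <= x_bar + z * se:
            hits += 1
    return hits / trials

print(coverage())  # should land close to 0.95
```

Each simulated study produces a different interval; the confidence level describes how often the procedure succeeds across all of them.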

2. Constructing a Confidence Interval

The basic formula for a confidence interval for a population mean is:

CI = $ \bar{x} \pm z^* \times \left( \frac{\sigma}{\sqrt{n}} \right) $

$ \bar{x} $ is the sample mean.
$z^*$ (or $t^*$ for small sample sizes) is the critical value from the standard normal (or t-) distribution corresponding to the desired confidence level.
$ \sigma $ is the population standard deviation (or the sample standard deviation $ s $ when $ \sigma $ is unknown).
$n$ is the sample size.
The term $ \frac{\sigma}{\sqrt{n}} $ is known as the standard error.
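As a minimal sketch of the formula above, the following function computes a z-based interval using only the Python standard library. The function name `z_interval` and the input values are hypothetical, and the sketch assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper) for x_bar +/- z* * sigma / sqrt(n)."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sigma / math.sqrt(n)
    margin = z_star * standard_error
    return x_bar - margin, x_bar + margin

lower, upper = z_interval(x_bar=100, sigma=15, n=40)
print(f"[{lower:.2f}, {upper:.2f}]")
```

Raising the confidence level widens the interval because $z^*$ grows; increasing $n$ narrows it because the standard error shrinks.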

For smaller sample sizes or when the population standard deviation is unknown, we use the t-distribution:

CI = $ \bar{x} \pm t^* \times \left( \frac{s}{\sqrt{n}} \right) $

Where $ t^* $ is the critical value from the t-distribution with $ n - 1 $ degrees of freedom, and $ s $ is the sample standard deviation, which replaces the unknown $ \sigma $.
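To show where a value of $ t^* $ comes from, here is a rough standard-library sketch that integrates the t-distribution density numerically and searches for the critical value by bisection. The helper names `t_pdf`, `t_cdf`, and `t_critical` are made up for this illustration; in practice you would more likely read the value from a t-table or a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3  # density is symmetric around 0

def t_critical(confidence, df):
    """Two-sided critical value t* found by bisection."""
    target = 1 - (1 - confidence) / 2
    # Assumption: t* lies below 100, which holds for common
    # confidence levels when df >= 2.
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.95, 39), 3))
```

The t critical value is always larger than the matching z value and approaches it as the degrees of freedom grow.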

3. Steps to Compute a Confidence Interval

Collect the Data:
Gather your sample and compute the necessary statistics (mean, standard deviation, and sample size).
Determine the Confidence Level:
Choose your confidence level (commonly 90%, 95%, or 99%).
Select the Appropriate Distribution:
- Use the z-distribution if the population standard deviation is known or if $ n $ is large (usually $ n \gt 30 $ ).
- Use the t-distribution if the population standard deviation is unknown and $ n $ is small.
Calculate the Standard Error:
Compute $ \frac{\sigma}{\sqrt{n}} $ or $ \frac{s}{\sqrt{n}} $.
Find the Critical Value:
- For a 95% confidence level, the z-critical value is approximately 1.96.
- For the t-distribution, look up the value in a t-table based on your chosen confidence level and degrees of freedom ($ n - 1 $).
Compute the Margin of Error:
Multiply the critical value by the standard error.
Construct the Interval:
Add and subtract the margin of error from the sample mean.
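The steps above can be sketched as one small function. This is an illustration under stated assumptions, not a definitive implementation: the name `mean_confidence_interval` and the sample data are hypothetical, and because the standard library has no t-distribution, a t critical value must be passed in from a table through the assumed `t_star` parameter.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95, t_star=None):
    """Follow the steps: statistics, standard error, critical value, margin."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1)
    standard_error = s / math.sqrt(n)
    if t_star is None:
        # Large-sample case: use the z critical value
        critical = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    else:
        # Small-sample case: caller supplies t* for n - 1 degrees of freedom
        critical = t_star
    margin = critical * standard_error
    return x_bar - margin, x_bar + margin

# Hypothetical small sample: n = 8, so df = 7 and a t-table gives about 2.365
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_confidence_interval(data, t_star=2.365))
```

Passing the critical value explicitly keeps the choice between the z- and t-distribution visible at the call site.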

4. Example

Imagine you have a sample of 40 observations with:

Sample mean $ \bar{x} = 100 $
Sample standard deviation $ {s} = 15 $
Desired confidence level: 95%

Step-by-Step:

Standard Error: $\frac{s}{\sqrt{n}} = \frac{15}{\sqrt{40}} \approx \frac{15}{6.32} \approx 2.37$
Critical Value:
Because the population standard deviation is unknown, we use the t-distribution. For a 95% confidence level with $ n - 1 = 39 $ degrees of freedom, $t^* \approx 2.02$ (this value can vary slightly depending on the table).
Margin of Error: $\text{Margin of Error} = 2.02 \times 2.37 \approx 4.79$
Confidence Interval: $100 \pm 4.79 \text{ or } [95.21, 104.79]$

So, you can be 95% confident that the true population mean lies between 95.21 and 104.79.
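The arithmetic in this example can be reproduced in a few lines of Python. The variable names are illustrative, and the critical value 2.02 is taken directly from the example rather than computed.

```python
import math

n = 40          # sample size
x_bar = 100     # sample mean
s = 15          # sample standard deviation
t_star = 2.02   # t critical value from the example (95%, 39 df)

standard_error = s / math.sqrt(n)
margin = t_star * standard_error
lower, upper = x_bar - margin, x_bar + margin

print(f"Standard error:  {standard_error:.2f}")   # 2.37
print(f"Margin of error: {margin:.2f}")           # 4.79
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")      # [95.21, 104.79]
```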

5. Why Confidence Intervals Matter in Data Science

Uncertainty Quantification:
They provide a measure of uncertainty around estimates, giving more context than a simple point estimate.
Decision Making:
By understanding the range within which a parameter lies, data scientists can make more informed decisions and recommendations.
Comparisons:
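Confidence intervals can be used to compare groups or treatments. If the intervals for two groups do not overlap, this suggests a statistically significant difference. Overlapping intervals are less conclusive: they do not by themselves show that there is no difference.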