Inference for a single mean

STA35B: Statistical Data Science 2

Akira Horiguchi

Based on Ch 19 of IMS

library(tidyverse)
library(openintro)
library(infer)

library(knitr)
library(ggpubr)
library(kableExtra)
library(gghighlight)

library(scales) # label_dollar

options(pillar.print_min = 9)  # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw() + theme(axis.text = element_text(size = 14), 
                             axis.title = element_text(size = 16), 
                             ))

Mean…

We want to know the average price of cars on a lot by taking a random sample

Our sample has five cars with prices 18,300; 20,100; 9,600; 10,700; 27,000
- Sample mean: 17,140

How much can we expect the sample mean to change from sample to sample?

…vs proportion

Previously, we used the Central Limit Theorem to show that the variability of a sample proportion is well-described by normal distribution.

For proportions, the single parameter $p\in [0,1]$ determines the mean $p$ and the variance $p(1-p)$ (and thus s.d. $\sqrt{p(1-p)}$).
- Thus the only source of uncertainty is in measuring $p$.

This slide deck will deal more generally with numerical variables.

For numerical variables, there are two sources of uncertainty:
- The average/mean $\mu$ — a “typical” value.
- The standard deviation $\sigma$ — typical variability in the parameter.
Thus we may be interested in estimating both.
For both, we will employ bootstrap CI and mathematical models.

Bootstrap CI for a mean: ideal

If we can sample more from the population, we can repeatedly:

sample five cars from the population, then
compute the sample mean of the five cars.

This gives a distribution of the sample mean of random samples of size 5, which gives us the sample mean’s sample-to-sample variability.

Bootstrap CI for a mean: in practice

If we cannot sample more from the population, we can estimate this sample-to-sample variability by bootstrapping: take $B$ bootstrap samples of size $n$ from the original sample of size $n$.

Each bootstrap sample has its own sample mean.
There is variability across these sample means.
This “bootstrap” variability can be an estimate for the variability of the sample mean induced by sampling repeatedly from the population.

Figure 1: Cartoon illustrating using the bootstrap procedure to estimate the distribution of the sample mean of random samples of size 5.

Suppose we took 1,000 bootstrap samples, and for each bootstrap sample we compute its sample mean.

Figure 2: Histogram of 1,000 bootstrapped means from the original sample of 5 car prices.

To develop bootstrap CI for the population mean at e.g. 90% confidence level, we can calculate the 5% and 95% percentile of the bootstrapped statistics.

Let’s build a function which does the following:

takes in a sample (a vector of numerics), and then computes a desired number of bootstrap samples and returns the means for each bootstrap sample.

boot_samp_means <- function(obs, n_resamp=10000) {
  means <- numeric(n_resamp)
  num_obs <- length(obs)
  for(i in 1:n_resamp) {
    resampled <- sample(obs, size=num_obs, replace=TRUE)
    means[i] <- mean(resampled, na.rm=TRUE)
  }
  return(means)
}

boot_samp_means <- function(obs, n_resamp=10000) {
  num_obs <- length(obs)
  replicate(
      n_resamp, 
      sample(obs, size=num_obs, replace=TRUE) |> mean(na.rm=TRUE)
      )
}

For the car example, here’s how we could do this:

obs <- c(18300, 20100, 9600, 10700, 27000)
mean(obs)

[1] 17140

bs_means <- boot_samp_means(obs, 10000)
str(bs_means)

 num [1:10000] 13300 17360 16560 20620 22360 ...

To construct a 90% CI, find 5th percentile and 95th percentile.

(qs <- quantile(bs_means, probs=c(0.05, 0.95)))

   5%   95% 
12140 22140

We are 90% confident that the true average is between $12140 and $22140.

Bootstrap CI for a standard deviation

We have so far focused on trying to understand the mean of a population.

Often interested in the standard deviation $\sigma$ (Greek “sigma”) of the variable.
This represents the natural variability of the population mean $\mu$.
Variance $\sigma^2$ is just the squared value of $\sigma$.
We can construct a point estimate for the population variance $\sigma^2$ via

\[ s^2 \;=\; \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2. \]

We can construct a confidence interval for $\sigma^2$ using bootstrapping. Same idea:

Take bootstrap re-samples of dataset.
Compute sample variance $s^2$ for each bootstrap sample.
Create histogram of sample variances across each bootstrap sample.
Use the percentiles of the histogram to get confidence intervals.

Confidence interval for population standard deviation $\sigma$:

Create histogram of sample standard deviation $s = \sqrt{s^2}$.

Bootstrap CI for a standard deviation: car

Results of bootstrap standard deviations:

Figure 3: Histogram of 1,000 bootstrapped standard deviations from the original sample of 5 car prices.

Very high variability. This is due to the tiny sample size - only 5 observations.
As we increase the original dataset sample size, the bootstrap improves.
A precise characterization of how sample size / number bootstrap trials affect the accuracy of the bootstrap is beyond this course. (It is still the subject of current research.)

Mathematical model for a mean

Central Limit Theorem for the sample proportion:

If [conditions], then the sampling distribution of $\hat p$ is \[\approx N\Big(p,\; p(1-p)\Big).\]
Only one source of uncertainty: $p$.

Central Limit Theorem for the sample mean:

If sufficiently large number of $n$ independent samples $x_1, x_2, \ldots, x_n$ from a population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of $\bar x_n$ is \[\approx N\Big(\mu,\; \frac{\sigma^2}{n}\Big).\]
Two sources of uncertainty: $\mu$ and $\sigma$.

Generally do not know population-level $\sigma$, so we have to estimate it.

Having two parameters to estimate can increase uncertainty in inference for sample mean $\bar x_n$ (compared to for sample proportion $\hat p$).
Hence, when you have to estimate $\sigma$, the sampling distribution of $\bar x_n$ is not approximately normal, but is what’s called a “t distribution”, which accounts for the additional uncertainty.

$t$ distribution

The $t$ distribution is defined in terms of degrees of freedom df.
Has a similar bell-shaped curve to normal, but has “thicker tails”, which allows for more extreme events to occur than in a normal distribution.

Data from a normal distribution has very little data beyond 2.5;
Data from a $t$ distribution has relatively more, particularly when df is small.

Effect of df on $t$ distribution:

Given data with $n$ samples, we will typically use $t$ distribution with $n-1$ df.
- Few samples: more uncertainty when estimating population’s variance.
- Many samples: less uncertainty when estimating population’s variance.
As df increases, the $t$ distribution gets thinner tails and increasingly resembles a standard normal.

Figure 5: Probability density curve for standard normal distribution and for $t$ distribution with various df.

When df $>30$, almost indistinguishable from standard normal.
Intuition: height/thickness represents how likely values are; when df ($n-1$) is small, we are more uncertain and so more extreme values are more likely.

$t$ distribution: how to calculate

Recall in R, to calculate probabilities under normal,

pnorm(val, mean, sd) to calculate probability that $N(\mu, \sigma)$ is $<= \text{val}$
qnorm(quantile, mean, sd) to calculate value corresponding to quantile

Similarly, for t-distribution, we use

pt(val, df) to calculate probability that $t$ distribution with df degrees of freedom is $<= \text{val}$
qt(quantile, df) to calculate value corresponding to quantile for $t$ distribution with df degrees of freedom

pnorms <- tibble(val = seq(-4, 4, 1)) |>
  mutate(pt_2 = pt(val, df = 2),
         pt_5 = pt(val, df = 5),
         pt_30 = pt(val, df = 30), 
         pnorm = pnorm(val))
pnorms

# A tibble: 9 × 5
    val   pt_2    pt_5    pt_30     pnorm
  <dbl>  <dbl>   <dbl>    <dbl>     <dbl>
1    -4 0.0286 0.00516 0.000191 0.0000317
2    -3 0.0477 0.0150  0.00269  0.00135  
3    -2 0.0918 0.0510  0.0273   0.0228   
4    -1 0.211  0.182   0.163    0.159    
5     0 0.5    0.5     0.5      0.5      
6     1 0.789  0.818   0.837    0.841    
7     2 0.908  0.949   0.973    0.977    
8     3 0.952  0.985   0.997    0.999    
9     4 0.971  0.995   1.000    1.000

qnorms <- tibble(quantiles = c(0.05, 0.25, 0.5, 0.75, 0.95)) |>
  mutate(qnorm = qnorm(quantiles),
         qt_2 = qt(quantiles, df = 2),
         qt_5 = qt(quantiles, df = 5),
         qt_30 = qt(quantiles, df = 30))
qnorms

# A tibble: 5 × 5
  quantiles  qnorm   qt_2   qt_5  qt_30
      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1      0.05 -1.64  -2.92  -2.02  -1.70 
2      0.25 -0.674 -0.816 -0.727 -0.683
3      0.5   0      0      0      0    
4      0.75  0.674  0.816  0.727  0.683
5      0.95  1.64   2.92   2.02   1.70

$t$ distribution: example calculations I

Probability that $t$ distribution with 20 degrees of freedom is less than -1.5?

# use pt() to find probability 
# under the $t$-distribution
pt(-1.5, df = 20)

[1] 0.07461789

Figure 6: Area under the density curve of a t distribution with 20 degrees of freedom for values less than -1.5.

Probability that $t$ distribution with 11 degrees of freedom is bigger than 2.5?

1 - pt(2.5, df = 11)

[1] 0.01475319

Area under the density curve of a t distribution with 11 degrees of freedom for values larger than 2.5.

$t$ distribution: example calculations II

Probability that $t$ distr. with 2 df is more than 3 units away from the mean?

# use pt() to find probability under the $t$-distribution
pt(-3, df = 2) + (1 - pt(3, df = 2))

[1] 0.09546597

Figure 7: Area under the density curve of a t distribution with 2 degrees of freedom for values smaller than -3 or larger than 3.

Any $t$ distribution is symmetric around 0, so…

2*pt(-3, df = 2)  # ...could also do this

[1] 0.09546597

Compare with what happens with standard normal: 68-95-99.7 rule says that only 0.3% (=0.003) would be more than 3 units from the mean.
Since $t$ distribution has fatter tails, it assigns greater probability to extreme values, so we get significantly more area for $t$ distribution.
As degrees of freedom increase, this becomes less and less the case.

One-sample $t$-distribution CI: conditions

Previously: if we have independent samples and sufficiently large dataset where population has mean $\mu$ and s.d. $\sigma$, then $\bar x_n$ is approximately normal, with mean $\mu$ and s.d. $\sigma/\sqrt n$.

However, we don’t know $\sigma$, and if we want to use the sample estimate $s$ for $\sigma$ then we can’t say that $\bar x_n$ is approximately normal, but will instead be $t$ distribution with $n-1$ df under certain conditions
Conditions needed:
- Independent sample observations (satisfied w/ random sample)
- Normality of samples - each $x_i$ is from a normal distribution (or approximately). How to check?
  - If $n<30$ and no clear outliers, then OK.
  - If $n\geq 30$ and no particularly extreme outliers, then OK
If these assumptions hold, then the confidence interval for the mean is

\[\text{point estimate} \pm t_{df}^* \times SE\]

Here, the point estimate is $\bar{x}$, and the SE is $s/\sqrt{n}$.
Critical value $t^*_{df}$ found same way as for $z^*$ (next slide)

One-sample $t$-distribution CI: calculation

Same idea holds for finding $t^*_{df}$: to get confidence level of $1-\alpha$, we use
- qt(1 - alpha/2, df = df)
E.g. if df = 5 and we want 95% confidence level,

qt(1 - 0.05/2, df = 5)

[1] 2.570582

The tails of this distribution are heavier than the tails of a standard normal distribution.

E.g. if df = 10 and we want 95% confidence level,

qt(1 - 0.05/2, df = 10)

[1] 2.228139

E.g. if df = 50 and we want 95% confidence level,

qt(1 - 0.05/2, df = 50)

[1] 2.008559

E.g. if df = 500 and we want 95% confidence level,

qt(1 - 0.05/2, df = 500)

[1] 1.96472

E.g. if df = 5000 and we want 95% confidence level,

qt(1 - 0.05/2, df = 5000)

[1] 1.960439

As df increases, the t-distribution’s critical value seems to be approaching

qnorm(1 - 0.05/2)

[1] 1.959964

One-sample $t$-distribution CI: mercury in tuna

High mercury concentrations can be dangerous for tuna and humans that eat them.

Let’s consider problem of measuring the amount of mercury in tuna.
Suppose we have a random sample of 19 tunas, with the following summary statistics (measurements in micrograms mercury / gram tuna).

Table 1

n	Mean	SD	Min	Max
19	4.4	2.3	1.7	9.2

Are conditions for applying $t$ distribution satisfied?

Independent since random sample
$n<30$ and summary stats suggest no clear outliers.

Let’s calculate 95% confidence interval

Calculate standard error: \[ SE = \frac{s}{\sqrt n} = \frac{2.3}{\sqrt{19}} = 0.528\]
Since df is $n-1=19-1$, calculate $t^*_{19-1}$ for $1-0.05$ confidence level:

qt(1-0.05/2, df = 18)

[1] 2.100922

Thus 95% confidence interval is

\[ \begin{aligned} \bar{x} \ &\pm\ t^{\star}_{18} \times SE \newline \\ 4.4 \ &\pm\ 2.10 \times 0.528 \newline \\ (3.29 \ &, \ 5.51) \end{aligned} \]

We are 95% confident that average mercury content in tunas is between 3.29 and 5.51 micrograms per gram of tuna - very high.

One sample $t$-test

We now know how to calculate a $t$-distribution based CI.

Let’s look now at a hypothesis test for the mean.
T score is a ratio of how the sample mean differs from the hypothesized mean as compared to how the observations vary.

\[ T = \frac{\bar{x} - \mbox{null value}}{s/\sqrt{n}} \]

When the null hypothesis is true and the conditions are met, T has a t-distribution with $df = n - 1.$ Conditions:
- Independent observations.
- Large samples and no extreme outliers.

One sample $t$-test: Cherry Blossom

Are runners in Washington, DC races are getting slower or faster over time?

In 2006, DC Cherry Blossom Race (10 miles) had average of 93.29 minutes
Will use data from 100 participants from 2017 to determine whether runners are getting faster or slower (vs. possiblity of no change)
- $H_0$: average 10 mile time is same in 2006 as in 2017, so $\mu = 93.29$
- $H_A$: average 10 mile run time is different in 2017; $\mu \neq 93.29$
Data looks like:

Large enough samples, no extreme outliers, can proceed
To do the hypothesis test, same procedure as before
- Normal setting: find Z-score using observed value, null value, standard error, then use normal distribution to calculate tail area / p value
- This setting: T-score using observed value, null value, standard error, then use t distribution to calculate tail area / p value
For sample of 100 runners, 2017 data had average of 98.78 and s.d. of 16.59; average run time in 2006 was 93.29
First calculate standard error:

\[ SE = \frac{s}{\sqrt{n}} = \frac{16.59}{\sqrt{100}} = 1.66\]

T score:

\[ T = \frac{\text{observed}-\text{null}}{SE} = \frac{98.8 - 93.29}{1.66} = 3.32\]

We have $n=100$ observations, so $df=100-1=99$.

We can use pt to find this; by symmetry, area to right of 3.32 is same as area to left of -3.32, so

pt(-3.32, df = 99)

[1] 0.0006305322

Double this to get the p-value: 0.00126.
$p$-value is $<0.05$, so we can reject the null hypothesis at 95% confidence level.
- Can reject at even 99% confidence level.
- Thus average run time in 2017 is different than 2006.

Inference for a single mean

Mean…

…vs proportion

Bootstrap CI for a mean: ideal

Bootstrap CI for a mean: in practice

Bootstrap CI for a standard deviation

Bootstrap CI for a standard deviation: car

Mathematical model for a mean

\(t\) distribution

Effect of df on \(t\) distribution:

\(t\) distribution: how to calculate

\(t\) distribution: example calculations I

\(t\) distribution: example calculations II

One-sample \(t\)-distribution CI: conditions

One-sample \(t\)-distribution CI: calculation

One-sample \(t\)-distribution CI: mercury in tuna

One sample \(t\)-test

One sample \(t\)-test: Cherry Blossom