class: center, middle, inverse, title-slide

.title[
# Inference for a single mean
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi
Figures taken from [IMS], the Çetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Based on Ch. 19-20 of IMS

We will see how to use the bootstrap to develop confidence intervals for a single mean

* We want to estimate the average price of cars on a lot by taking a random sample
* Our sample has five cars with prices
  - $18,300; $20,100; $9,600; $10,700; $27,000
  - Sample mean: $17,140
  - Sample variance and standard deviation:

\begin{align}
s^2 &= \frac{1}{5-1} \Big( (18300-17140)^2 + (20100-17140)^2 \\
&\qquad\qquad + (9600-17140)^2 + (10700-17140)^2 \\
&\qquad\qquad + (27000-17140)^2 \Big) \\
&\quad\implies s = \$7{,}170.29
\end{align}
]
.pull-right[
How much can we expect the sample mean to change from sample to sample (of five cars)?

* If we **can** sample more from the population,<br> we can repeatedly:<br> (1) sample five cars from the population, then<br> (2) compute the sample mean of the five cars.<br> This gives us a distribution of sample means, which captures the **sample-to-sample variability**.
* If we **cannot** sample more from the population, we can estimate this **sample-to-sample variability** using the bootstrap.

<img src="bootpop1mean.png" width="100%" />
]

---

.pull-left[
Bootstrapping: take `\(B\)` bootstrap samples of size `\(n\)` from the original sample of size `\(n\)`.

* Each bootstrap sample has its own sample mean.
* There is variability across these sample means.
* This "bootstrap" variability can serve as an estimate<br> for the variability of the sample mean induced by sampling repeatedly from the population.

<img src="bootmeans1mean.png" width="210%" />
]

--

.pull-right[
Suppose we took 1,000 bootstrap samples, and for each bootstrap sample we computed its sample mean

<img src="lec20_files/figure-html/fig-carsbsmean-1.png" width="100%" />

* To develop a bootstrap confidence interval for the population mean at e.g. the 90% confidence level,<br> we can simply calculate the 5th and 95th percentiles of the bootstrapped statistics
]

---

#### Computing bootstrap means and confidence intervals

Let's build a function that takes in a sample (a numeric vector), computes a desired number of bootstrap re-samples, and returns the sample mean of each bootstrap re-sample.

``` r
boot_samp_means <- function(obs, n_resamp=10000) {
  means <- numeric(n_resamp)  # pre-allocate one slot per re-sample
  num_obs <- length(obs)
  for (i in 1:n_resamp) {
    # resample num_obs observations from the original sample, with replacement
    resampled <- sample(obs, size=num_obs, replace=TRUE)
    means[i] <- mean(resampled, na.rm=TRUE)
  }
  return(means)
}
```

or

``` r
boot_samp_means <- function(obs, n_resamp=10000) {
  num_obs <- length(obs)
  replicate(
    n_resamp,
    sample(obs, size=num_obs, replace=TRUE) |> mean(na.rm=TRUE)
  )
}
```

---

.pull-left[
* For the car example, here's how we could do this:

``` r
obs <- c(18300, 20100, 9600, 10700, 27000)
mean(obs)
#> [1] 17140
bs_means <- boot_samp_means(obs, 10000)
str(bs_means)
#> num [1:10000] 13300 17360 16560 20620 22360 ...
```
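The spread of these bootstrap means is itself informative. A quick side check (not in the original text; the value will vary run to run since resampling is random):

``` r
# the standard deviation of the 10,000 bootstrap means estimates the
# sample-to-sample variability (standard error) of the sample mean
sd(bs_means)
```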
* To construct a 90% confidence interval, find the 5th and 95th percentiles

``` r
(qs <- quantile(bs_means, probs=c(0.05, 0.95)))
#>    5%   95% 
#> 12140 22140
```

* We are 90% confident that<br> the true average price is between $12,140 and $22,140
]
.pull-right[
]

---

.pull-left[
#### Bootstrap percentile confidence interval for a standard deviation

We have thus far focused on trying to understand the **mean** or average of a population

* We are also often interested in the **standard deviation** `\(\sigma\)` (Greek letter "sigma") of the variable
* This represents the natural variability of the variable around the population mean `\(\mu\)`
* Variance `\(\sigma^2\)` is just the squared value of `\(\sigma\)`
* We have constructed a **point estimate** for `\(\sigma^2\)` using samples via
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 $$
* If we want to construct a *confidence interval* for `\(\sigma\)`, then we can use the bootstrap idea as before.
]
.pull-right[
]

---

.pull-left[
Recall for constructing a confidence interval for the **population mean** using the bootstrap:

- Take bootstrap re-samples of the dataset
- Compute the **sample mean** for each bootstrap sample
- Create a histogram of the **sample means** across<br> the bootstrap samples
- Use the percentiles of the histogram to get confidence intervals
]
.pull-right[
Same idea to construct a confidence interval for the<br> **population variance** `\(\sigma^2\)`:

- Take bootstrap re-samples of the dataset
- Compute the **sample variance `\(s^2\)`** for each bootstrap sample
- Create a histogram of the **sample variances** across<br> the bootstrap samples
- Use the percentiles of the histogram to get confidence intervals

Confidence interval for the **population standard deviation** `\(\sigma\)`:

- Create a histogram of the sample standard deviations `\(s = \sqrt{s^2}\)` instead (see the code sketch on the next slide).
]

---

.pull-left[
* Results of bootstrap *standard deviations*:

<img src="lec20_files/figure-html/fig-carsbssd-1.png" width="100%" />

* Very high variability
* This is due to the very small sample size - only 5 observations
* As we increase the original dataset's sample size, the bootstrap gets better
* A precise characterization of how the sample size and the number of bootstrap trials affect the accuracy of the bootstrap is beyond this course -- subject of current research
]
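.pull-right[
A minimal sketch (not in the original text) of how these bootstrap standard deviations could be computed, reusing `obs` from the car example; `boot_samp_sds` is a hypothetical analogue of `boot_samp_means`:

``` r
boot_samp_sds <- function(obs, n_resamp=10000) {
  num_obs <- length(obs)
  replicate(
    n_resamp,
    # same resampling as before, but record each re-sample's sd
    sample(obs, size=num_obs, replace=TRUE) |> sd()
  )
}
bs_sds <- boot_samp_sds(obs)
# 90% bootstrap percentile interval for the population sd
quantile(bs_sds, probs=c(0.05, 0.95))
```
]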
---

.pull-left[
#### Mathematical model for a mean

Recall: we used the **Central Limit Theorem** to show that the variability of the sample proportion is well-described by a normal distribution

* For proportions, the single parameter `\(p\in [0,1]\)` determines the mean `\(p\)` and the variance `\(p(1-p)\)` (and thus the s.d. `\(\sqrt{p(1-p)}\)`)
* Thus the only source of uncertainty is in measuring `\(p\)`
* For numerical variables, there are two sources of uncertainty:
  - The average/mean `\(\mu\)` - the typical value
  - The standard deviation `\(\sigma\)` - the typical variability in the variable
* Thus we need a slightly different approach
]

--

.pull-right[
Central Limit Theorem for the sample mean:

- If we take a sufficiently large number `\(n\)` of independent samples `\(x_1, x_2, \ldots, x_n\)` from a population with mean `\(\mu\)` and s.d. `\(\sigma\)`, then:
- The sampling distribution of `\(\bar x_n\)` is approximately **normal** with mean `\(\mu\)` and s.d. `\(\sigma/\sqrt n\)`.

However, we generally do not know the population-level `\(\sigma\)` - we have to estimate it

- When you have to estimate `\(\sigma\)` with `\(s\)`, the standardized sample mean `\((\bar x_n - \mu)/(s/\sqrt{n})\)` is no longer approximately normal; it follows what's called a "t distribution"
]

---

.pull-left[
#### `\(t\)` distribution

* The `\(t\)` distribution is defined in terms of its **degrees of freedom** `df`.
* It has a bell-shaped curve similar to the normal, but with "thicker tails" -- it allows more extreme events to occur than the normal does

<img src="lec20_files/figure-html/fig-tDistCompareToNormalDist-1.png" width="90%" />

* This shows the `\(t\)` distribution with 1 df
* The normal puts very little probability beyond 2.5, while the `\(t\)` distribution puts relatively more there
]

--

.pull-right[
* When looking at a dataset with `\(n\)` samples, we will typically use the `\(t\)` distribution with `\(n-1\)` df
* As df increases, the `\(t\)` distribution gets thinner tails and looks more like a standard normal
* When df `\(>30\)`, it is almost indistinguishable from a standard normal
* Intuition: height/thickness represents how likely values are; when df (`\(n-1\)`) is small, we are more uncertain and so more extreme values are more likely

<img src="lec20_files/figure-html/fig-tDistConvergeToNormalDist-1.png" width="432" />
]

---

.pull-left[
### `\(t\)` distribution calculations

Recall in R, to calculate probabilities under the **normal** distribution, we use

- `pnorm(val, mean, sd)` to calculate the probability that `\(N(\mu, \sigma)\)` is `\(\leq \text{val}\)`
- `qnorm(quantile, mean, sd)` to calculate the value corresponding to `quantile`
]
.pull-right[
Similarly, for the **t-distribution**, we use

- `pt(val, df)` to calculate the probability that the `\(t\)` distr. with `df` degrees of freedom is `\(\leq \text{val}\)`
- `qt(quantile, df)` to calculate the value corresponding to `quantile` for the `\(t\)` distr. with `df` degrees of freedom

``` r
pnorms
#> # A tibble: 7 × 5
#>     val   pnorm   pt_2   pt_5   pt_30
#>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
#> 1    -3 0.00135 0.0477 0.0150 0.00269
#> 2    -2 0.0228  0.0918 0.0510 0.0273
#> 3    -1 0.159   0.211  0.182  0.163
#> 4     0 0.5     0.5    0.5    0.5
#> 5     1 0.841   0.789  0.818  0.837
#> 6     2 0.977   0.908  0.949  0.973
#> # ℹ 1 more row
```
]

---

.pull-left[
### `\(t\)` distribution calculations

* Probability that the `\(t\)` distribution with 18 degrees of freedom is less than -1.5?

``` r
# use pt() to find probability
# under the t-distribution
pt(-1.5, df = 18)
#> [1] 0.07547523
```

<img src="lec20_files/figure-html/fig-tDistDF18LeftTail2Point10-1.png" width="100%" />
]
.pull-right[
* Probability that the `\(t\)` distribution with 11 degrees of freedom is bigger than 2.5?

``` r
1 - pt(2.5, df = 11)
#> [1] 0.01475319
```

<img src="lec20_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

---

.pull-left[
* Probability that the `\(t\)` distribution with 2 degrees of freedom is more than 3 units away from the mean?

``` r
# use pt() to find probability under the t-distribution
pt(-3, df = 2) + (1 - pt(3, df = 2))
#> [1] 0.09546597
```

<img src="lec20_files/figure-html/fig-tDistDF23UnitsFromMean-1.png" width="80%" />

* The `\(t\)` distribution is symmetric around 0, so we could also do

``` r
2*pt(-3, df = 2)
#> [1] 0.09546597
```
]

--

.pull-right[
* Compare with what happens with the standard normal: the 68-95-99.7 rule says that only 0.3% (=0.003) of observations would be more than 3 units from the mean, as checked below.
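  A quick side check in R (not in the original slides):

  ``` r
  # probability that a standard normal is more than 3 sds from its mean
  2 * pnorm(-3)
  #> [1] 0.002699796
  ```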
* Since the `\(t\)` distribution has fatter tails, it assigns greater probability to extreme values, so we get significantly more tail area for the `\(t\)` distribution.
* As the degrees of freedom increase, this becomes less and less the case.
]

---

### Conditions needed for using `\(t\)`-distr. confidence intervals

Previously: *if* we have independent samples and a sufficiently large dataset, where the population has mean `\(\mu\)` and s.d. `\(\sigma\)`, then `\(\bar x_n\)` is approximately normal, with mean `\(\mu\)` and s.d. `\(\sigma/\sqrt n\)`.

* However, we don't know `\(\sigma\)`; if we use the sample estimate `\(s\)` in its place, we can't say that `\(\bar x_n\)` is approximately normal -- instead we get a `\(t\)` distribution with `\(n-1\)` df, under certain conditions
* Conditions needed:
  - Independent sample observations (satisfied w/ a random sample)
  - Normality of samples - each `\(x_i\)` is from a normal distribution (or approximately). How to check?
      * If `\(n<30\)` and no clear outliers, then OK.
      * If `\(n\geq 30\)` and no particularly *extreme* outliers, then OK
* If these assumptions hold, then the confidence interval for the mean is

$$
\begin{aligned}
\text{point estimate} &\pm t^{\star}_{df} \times SE \\
\bar{x} &\pm t^{\star}_{df} \times \frac{s}{\sqrt{n}}
\end{aligned}
$$

* `\(t^{\star}_{df}\)` is found the same way as `\(z^{\star}\)`:
  - `\(t^{\star}_{df}\)` is such that the proportion of the `\(t\)` distr. with `\(df\)` df lying within a distance `\(t^{\star}_{df}\)` of 0 matches the confidence level of interest

---

.pull-left[
### One sample t-intervals

* Recall that if we wanted to find the `\(z^{\star}\)` corresponding to a 95% confidence level, we could calculate

``` r
qnorm(1 - 0.05/2)
#> [1] 1.959964
```

<img src="lec20_files/figure-html/unnamed-chunk-22-1.png" width="65%" />
]

--

.pull-right[
### 

* The same idea holds for finding `\(t^{\star}_{df}\)`: to get a confidence level of `\(1-\alpha\)`, we use
  - `qt(1 - alpha/2, df = df)`
* E.g. if df = 5 and we want a 95% confidence level,

``` r
qt(1 - 0.05/2, df = 5)
#> [1] 2.570582
```

<img src="lec20_files/figure-html/unnamed-chunk-24-1.png" width="75%" />
]

---

.pull-left[
#### Example: mercury in tuna

Let's consider the problem of measuring the amount of mercury in tuna

* High mercury concentrations can be dangerous for the tuna and for the humans that eat them
* Suppose we have a random sample of 19 tunas, with measurements (in micrograms of mercury per gram of tuna) summarized as follows

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;"> n </th>
<th style="text-align:center;"> Mean </th>
<th style="text-align:center;"> SD </th>
<th style="text-align:center;"> Min </th>
<th style="text-align:center;"> Max </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;width: 6em; "> 19 </td>
<td style="text-align:center;width: 6em; "> 4.4 </td>
<td style="text-align:center;width: 6em; "> 2.3 </td>
<td style="text-align:center;width: 6em; "> 1.7 </td>
<td style="text-align:center;width: 6em; "> 9.2 </td>
</tr>
</tbody>
</table>

* Conditions for applying the `\(t\)` distr. satisfied?
  - Independent since random sample
  - `\(n<30\)` and summary stats suggest no clear outliers.
  - All good
]

--

.pull-right[
* Let's calculate a 95% confidence interval
* Calculate the standard error:
$$ SE = \frac{s}{\sqrt n} = \frac{2.3}{\sqrt{19}} = 0.528$$
* Since df is `\(n-1=19-1=18\)`, calculate `\(t^{\star}_{18}\)` for the `\(1-0.05\)` confidence level:

``` r
qt(1-0.05/2, df = 18)
#> [1] 2.100922
```

* Thus the 95% confidence interval is
$$
\begin{aligned}
\bar{x} \ &\pm\ t^{\star}_{18} \times SE \\
4.4 \ &\pm\ 2.10 \times 0.528 \\
(3.29 \ &, \ 5.51)
\end{aligned}
$$
* We are 95% confident that the average mercury content in tunas is between 3.29 and 5.51 micrograms per gram of tuna - very high.
]

---

.pull-left[
#### One sample `\(t\)`-tests

We now know how to calculate a `\(t\)`-distribution-based confidence interval

* Let's now look at hypothesis tests for the mean
* The **T score** is a ratio of how much the sample mean differs from the hypothesized mean to how much the observations vary:
$$ T = \frac{\bar{x} - \mbox{null value}}{s/\sqrt{n}} $$
* When the null hypothesis is true and the conditions are met, T has a t-distribution with `\(df = n - 1\)`. Conditions:
  - Independent observations.
  - Large samples and no extreme outliers.
]

--

.pull-right[
Example: testing whether runners in Washington, DC races are getting slower or faster over time

* In 2006, the DC Cherry Blossom Race (10 miles) had an average time of 93.29 minutes
* We will use data from 100 participants from 2017 to determine whether runners are getting faster or slower (vs. the possibility of no change)
  - Null hypothesis `\(H_0\)`: the average 10-mile time is the same in 2017 as in 2006, so `\(\mu = 93.29\)`
  - Alternative hypothesis `\(H_A\)`: the average 10-mile run time is different in 2017; `\(\mu \neq 93.29\)`
* The data looks like:

<img src="lec20_files/figure-html/unnamed-chunk-27-1.png" width="60%" />

* Large enough sample, no extreme outliers, so we can proceed
]

---

.pull-left[
* To do the hypothesis test, we use the same procedure as before
  - Normal setting: find the Z score using the observed value, null value, and standard error, then use the normal distribution to calculate the tail area / p-value
  - This setting: find the T score using the observed value, null value, and standard error, then use the **t distribution** to calculate the tail area / p-value
* For the sample of 100 runners, the 2017 data had an average of 98.78 and s.d. of 16.59; the average run time in 2006 was 93.29
* First calculate the standard error:
$$ SE = \frac{s}{\sqrt{n}} = \frac{16.59}{\sqrt{100}} = 1.66$$
* T score:
$$ T = \frac{\text{observed}-\text{null}}{SE} = \frac{98.8 - 93.29}{1.66} = 3.32$$
]

--

.pull-right[
* We have `\(n=100\)` observations, so `\(df=100-1=99\)`.

<img src="lec20_files/figure-html/unnamed-chunk-28-1.png" width="75%" />

* We can use `pt` to find this tail area; by symmetry, the area to the right of 3.32 is the same as the area to the left of -3.32, so

``` r
pt(-3.32, df = 99)
#> [1] 0.0006305322
```

* Double this to get the p-value: 0.00126.
* Since the `\(p\)`-value is `\(<0.05\)`, we can reject the null hypothesis at the 95% confidence level.
  - We can reject at even the 99% confidence level.
  - Thus the average run time in 2017 is different from 2006.
]
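---

#### One sample `\(t\)`-tests in R

In practice, this entire test can be carried out with R's built-in `t.test()`. A minimal sketch (not from the text), assuming the 100 run times from 2017 are stored in a hypothetical numeric vector `time17`:

``` r
# one-sample t-test of H0: mu = 93.29 against a two-sided alternative;
# time17 is a hypothetical vector holding the 100 observed 2017 run times
t.test(time17, mu = 93.29)
```

The output reports the T score, the degrees of freedom, the two-sided p-value, and a 95% confidence interval for `\(\mu\)`, matching the by-hand calculations above.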