class: center, middle, inverse, title-slide

.title[
# Inference for a single mean
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi
Figures taken from [IMS], the Çetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Based on Ch. 19-20 of IMS

We will see how to use the bootstrap to develop confidence intervals for a single mean

* We want to estimate the average price of cars on a lot by taking a random sample
* Our sample has five cars with prices
  - $18,300; $20,100; $9,600; $10,700; $27,000
  - Sample mean: $17,140
  - Sample variance and standard deviation:

\begin{align}
s^2 &= \frac{1}{5-1} \Big( (18300-17140)^2 + (20100-17140)^2 \\
&\qquad\qquad + (9600-17140)^2 + (10700-17140)^2 \\
&\qquad\qquad + (27000-17140)^2 \Big) \\
&\quad\implies s = \$7{,}170.29
\end{align}
]
.pull-right[
How much can we expect the sample mean to change from sample to sample (of five cars)?

* If we **can** sample more from the population,<br> we can repeatedly:<br> (1) sample five cars from the population, then<br> (2) compute the sample mean of the five cars.<br> This gives us a distribution of sample means, which captures the **sample-to-sample variability**.
* If we **cannot** sample more from the population, we can estimate this **sample-to-sample variability** using the bootstrap.

<img src="bootpop1mean.png" width="100%" />
]

---

.pull-left[
Bootstrapping: take `\(B\)` bootstrap samples of size `\(n\)` from the original sample of size `\(n\)`.

* Each bootstrap sample has its own sample mean.
* There is variability across these sample means.
* This "bootstrap" variability can serve as an estimate<br> for the variability of the sample mean induced by sampling repeatedly from the population.

<img src="bootmeans1mean.png" width="210%" />
]

--

.pull-right[
Suppose we took 1,000 bootstrap samples, and for each bootstrap sample we computed its sample mean

<img src="lec20_files/figure-html/fig-carsbsmean-1.png" width="100%" />

* To develop a bootstrap confidence interval for the population mean at e.g. the 90% confidence level,<br> we can simply calculate the 5th and 95th percentiles of the bootstrapped statistics
]

---

#### Computing bootstrap means and confidence intervals

Let's build a function that takes in a sample (a numeric vector), computes a desired number of bootstrap re-samples, and returns the sample mean of each bootstrap re-sample.

``` r
boot_samp_means <- function(obs, n_resamp=10000) {
  means <- numeric(n_resamp)  # pre-allocate one slot per re-sample
  num_obs <- length(obs)
  for (i in 1:n_resamp) {
    # resample num_obs observations from the original sample, with replacement
    resampled <- sample(obs, size=num_obs, replace=TRUE)
    means[i] <- mean(resampled, na.rm=TRUE)
  }
  return(means)
}
```

or

``` r
boot_samp_means <- function(obs, n_resamp=10000) {
  num_obs <- length(obs)
  replicate(
    n_resamp,
    sample(obs, size=num_obs, replace=TRUE) |> mean(na.rm=TRUE)
  )
}
```

---

.pull-left[
* For the car example, here's how we could do this:

``` r
obs <- c(18300, 20100, 9600, 10700, 27000)
mean(obs)
#> [1] 17140
bs_means <- boot_samp_means(obs, 10000)
str(bs_means)
#> num [1:10000] 13300 17360 16560 20620 22360 ...
```
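The spread of these bootstrap means is itself informative. A quick side check (not in the original text; the value will vary run to run since resampling is random):

``` r
# the standard deviation of the 10,000 bootstrap means estimates the
# sample-to-sample variability (standard error) of the sample mean
sd(bs_means)
```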
* To construct a 90% confidence interval, find the 5th and 95th percentiles

``` r
(qs <- quantile(bs_means, probs=c(0.05, 0.95)))
#>    5%   95% 
#> 12140 22140
```

* We are 90% confident that<br> the true average price is between $12,140 and $22,140
]
.pull-right[
]

---

.pull-left[
#### Bootstrap percentile confidence interval for a standard deviation

We have thus far focused on trying to understand the **mean** or average of a population

* We are also often interested in the **standard deviation** `\(\sigma\)` (Greek letter "sigma") of the variable
* This represents the natural variability of the variable around the population mean `\(\mu\)`
* Variance `\(\sigma^2\)` is just the squared value of `\(\sigma\)`
* We have constructed a **point estimate** for `\(\sigma^2\)` using samples via
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 $$
* If we want to construct a *confidence interval* for `\(\sigma\)`, then we can use the bootstrap idea as before.
]
.pull-right[
]

---

.pull-left[
Recall for constructing a confidence interval for the **population mean** using the bootstrap:

- Take bootstrap re-samples of the dataset
- Compute the **sample mean** for each bootstrap sample
- Create a histogram of the **sample means** across<br> the bootstrap samples
- Use the percentiles of the histogram to get confidence intervals
]
.pull-right[
Same idea to construct a confidence interval for the<br> **population variance** `\(\sigma^2\)`:

- Take bootstrap re-samples of the dataset
- Compute the **sample variance `\(s^2\)`** for each bootstrap sample
- Create a histogram of the **sample variances** across<br> the bootstrap samples
- Use the percentiles of the histogram to get confidence intervals

Confidence interval for the **population standard deviation** `\(\sigma\)`:

- Create a histogram of the sample standard deviations `\(s = \sqrt{s^2}\)` instead (see the code sketch on the next slide).
]

---

.pull-left[
* Results of bootstrap *standard deviations*:

<img src="lec20_files/figure-html/fig-carsbssd-1.png" width="100%" />

* Very high variability
* This is due to the very small sample size - only 5 observations
* As we increase the original dataset's sample size, the bootstrap gets better
* A precise characterization of how the sample size and the number of bootstrap trials affect the accuracy of the bootstrap is beyond this course -- subject of current research
]
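.pull-right[
A minimal sketch (not in the original text) of how these bootstrap standard deviations could be computed, reusing `obs` from the car example; `boot_samp_sds` is a hypothetical analogue of `boot_samp_means`:

``` r
boot_samp_sds <- function(obs, n_resamp=10000) {
  num_obs <- length(obs)
  replicate(
    n_resamp,
    # same resampling as before, but record each re-sample's sd
    sample(obs, size=num_obs, replace=TRUE) |> sd()
  )
}
bs_sds <- boot_samp_sds(obs)
# 90% bootstrap percentile interval for the population sd
quantile(bs_sds, probs=c(0.05, 0.95))
```
]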
---

.pull-left[
#### Mathematical model for a mean

Recall: we used the **Central Limit Theorem** to show that the variability of the sample proportion is well-described by a normal distribution

* For proportions, the single parameter `\(p\in [0,1]\)` determines the mean `\(p\)` and the variance `\(p(1-p)\)` (and thus the s.d. `\(\sqrt{p(1-p)}\)`)
* Thus the only source of uncertainty is in measuring `\(p\)`
* For numerical variables, there are two sources of uncertainty:
  - The average/mean `\(\mu\)` - the typical value
  - The standard deviation `\(\sigma\)` - the typical variability in the variable
* Thus we need a slightly different approach
]

--

.pull-right[
Central Limit Theorem for the sample mean:

- If we take a sufficiently large number `\(n\)` of independent samples `\(x_1, x_2, \ldots, x_n\)` from a population with mean `\(\mu\)` and s.d. `\(\sigma\)`, then:
- The sampling distribution of `\(\bar x_n\)` is approximately **normal** with mean `\(\mu\)` and s.d. `\(\sigma/\sqrt n\)`.

However, we generally do not know the population-level `\(\sigma\)` - we have to estimate it

- When you have to estimate `\(\sigma\)` with `\(s\)`, the standardized sample mean `\((\bar x_n - \mu)/(s/\sqrt{n})\)` is no longer approximately normal; it follows what's called a "t distribution"
]

---

.pull-left[
#### `\(t\)` distribution

* The `\(t\)` distribution is defined in terms of its **degrees of freedom** `df`.
* It has a bell-shaped curve similar to the normal, but with "thicker tails" -- it allows more extreme events to occur than the normal does

<img src="lec20_files/figure-html/fig-tDistCompareToNormalDist-1.png" width="90%" />

* This shows the `\(t\)` distribution with 1 df
* The normal puts very little probability beyond 2.5, while the `\(t\)` distribution puts relatively more there
]

--

.pull-right[
* When looking at a dataset with `\(n\)` samples, we will typically use the `\(t\)` distribution with `\(n-1\)` df
* As df increases, the `\(t\)` distribution gets thinner tails and looks more like a standard normal
* When df `\(>30\)`, it is almost indistinguishable from a standard normal
* Intuition: height/thickness represents how likely values are; when df (`\(n-1\)`) is small, we are more uncertain and so more extreme values are more likely

<img src="lec20_files/figure-html/fig-tDistConvergeToNormalDist-1.png" width="432" />
]

---

.pull-left[
### `\(t\)` distribution calculations

Recall in R, to calculate probabilities under the **normal** distribution, we use

- `pnorm(val, mean, sd)` to calculate the probability that `\(N(\mu, \sigma)\)` is `\(\leq \text{val}\)`
- `qnorm(quantile, mean, sd)` to calculate the value corresponding to `quantile`
]
.pull-right[
Similarly, for the **t-distribution**, we use

- `pt(val, df)` to calculate the probability that the `\(t\)` distr. with `df` degrees of freedom is `\(\leq \text{val}\)`
- `qt(quantile, df)` to calculate the value corresponding to `quantile` for the `\(t\)` distr. with `df` degrees of freedom

``` r
pnorms
#> # A tibble: 7 × 5
#>     val   pnorm   pt_2   pt_5   pt_30
#>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
#> 1    -3 0.00135 0.0477 0.0150 0.00269
#> 2    -2 0.0228  0.0918 0.0510 0.0273
#> 3    -1 0.159   0.211  0.182  0.163
#> 4     0 0.5     0.5    0.5    0.5
#> 5     1 0.841   0.789  0.818  0.837
#> 6     2 0.977   0.908  0.949  0.973
#> # ℹ 1 more row
```
]

---

.pull-left[
### `\(t\)` distribution calculations

* Probability that the `\(t\)` distribution with 18 degrees of freedom is less than -1.5?

``` r
# use pt() to find probability
# under the t-distribution
pt(-1.5, df = 18)
#> [1] 0.07547523
```

<img src="lec20_files/figure-html/fig-tDistDF18LeftTail2Point10-1.png" width="100%" />
]
.pull-right[
* Probability that the `\(t\)` distribution with 11 degrees of freedom is bigger than 2.5?

``` r
1 - pt(2.5, df = 11)
#> [1] 0.01475319
```

<img src="lec20_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

---

.pull-left[
* Probability that the `\(t\)` distribution with 2 degrees of freedom is more than 3 units away from the mean?

``` r
# use pt() to find probability under the t-distribution
pt(-3, df = 2) + (1 - pt(3, df = 2))
#> [1] 0.09546597
```

<img src="lec20_files/figure-html/fig-tDistDF23UnitsFromMean-1.png" width="80%" />

* The `\(t\)` distribution is symmetric around 0, so we could also do

``` r
2*pt(-3, df = 2)
#> [1] 0.09546597
```
]

--

.pull-right[
* Compare with what happens with the standard normal: the 68-95-99.7 rule says that only 0.3% (=0.003) of observations would be more than 3 units from the mean, as checked below.
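  A quick side check in R (not in the original slides):

  ``` r
  # probability that a standard normal is more than 3 sds from its mean
  2 * pnorm(-3)
  #> [1] 0.002699796
  ```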
* Since the `\(t\)` distribution has fatter tails, it assigns greater probability to extreme values, so we get significantly more tail area for the `\(t\)` distribution.
* As the degrees of freedom increase, this becomes less and less the case.
]

---

### Conditions needed for using `\(t\)`-distr. confidence intervals

Previously: *if* we have independent samples and a sufficiently large dataset, where the population has mean `\(\mu\)` and s.d. `\(\sigma\)`, then `\(\bar x_n\)` is approximately normal, with mean `\(\mu\)` and s.d. `\(\sigma/\sqrt n\)`.

* However, we don't know `\(\sigma\)`; if we use the sample estimate `\(s\)` in its place, we can't say that `\(\bar x_n\)` is approximately normal -- instead we get a `\(t\)` distribution with `\(n-1\)` df, under certain conditions
* Conditions needed:
  - Independent sample observations (satisfied w/ a random sample)
  - Normality of samples - each `\(x_i\)` is from a normal distribution (or approximately). How to check?
      * If `\(n<30\)` and no clear outliers, then OK.
      * If `\(n\geq 30\)` and no particularly *extreme* outliers, then OK
* If these assumptions hold, then the confidence interval for the mean is

$$
\begin{aligned}
\text{point estimate} &\pm t^{\star}_{df} \times SE \\
\bar{x} &\pm t^{\star}_{df} \times \frac{s}{\sqrt{n}}
\end{aligned}
$$

* `\(t^{\star}_{df}\)` is found the same way as `\(z^{\star}\)`:
  - `\(t^{\star}_{df}\)` is such that the proportion of the `\(t\)` distr. with `\(df\)` df lying within a distance `\(t^{\star}_{df}\)` of 0 matches the confidence level of interest

---

.pull-left[
### One sample t-intervals

* Recall that if we wanted to find the `\(z^{\star}\)` corresponding to a 95% confidence level, we could calculate

``` r
qnorm(1 - 0.05/2)
#> [1] 1.959964
```

<img src="lec20_files/figure-html/unnamed-chunk-22-1.png" width="65%" />
]

--

.pull-right[
### 

* The same idea holds for finding `\(t^{\star}_{df}\)`: to get a confidence level of `\(1-\alpha\)`, we use
  - `qt(1 - alpha/2, df = df)`
* E.g. if df = 5 and we want a 95% confidence level,

``` r
qt(1 - 0.05/2, df = 5)
#> [1] 2.570582
```

<img src="lec20_files/figure-html/unnamed-chunk-24-1.png" width="75%" />
]

---

.pull-left[
#### Example: mercury in tuna

Let's consider the problem of measuring the amount of mercury in tuna

* High mercury concentrations can be dangerous for the tuna and for the humans that eat them
* Suppose we have a random sample of 19 tunas, with measurements (in micrograms of mercury per gram of tuna) summarized as follows

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;"> n </th>
<th style="text-align:center;"> Mean </th>
<th style="text-align:center;"> SD </th>
<th style="text-align:center;"> Min </th>
<th style="text-align:center;"> Max </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;width: 6em; "> 19 </td>
<td style="text-align:center;width: 6em; "> 4.4 </td>
<td style="text-align:center;width: 6em; "> 2.3 </td>
<td style="text-align:center;width: 6em; "> 1.7 </td>
<td style="text-align:center;width: 6em; "> 9.2 </td>
</tr>
</tbody>
</table>

* Conditions for applying the `\(t\)` distr. satisfied?
  - Independent since random sample
  - `\(n<30\)` and summary stats suggest no clear outliers.
  - All good
]

--

.pull-right[
* Let's calculate a 95% confidence interval
* Calculate the standard error:
$$ SE = \frac{s}{\sqrt n} = \frac{2.3}{\sqrt{19}} = 0.528$$
* Since df is `\(n-1=19-1=18\)`, calculate `\(t^{\star}_{18}\)` for the `\(1-0.05\)` confidence level:

``` r
qt(1-0.05/2, df = 18)
#> [1] 2.100922
```

* Thus the 95% confidence interval is
$$
\begin{aligned}
\bar{x} \ &\pm\ t^{\star}_{18} \times SE \\
4.4 \ &\pm\ 2.10 \times 0.528 \\
(3.29 \ &, \ 5.51)
\end{aligned}
$$
* We are 95% confident that the average mercury content in tunas is between 3.29 and 5.51 micrograms per gram of tuna - very high.
]

---

.pull-left[
#### One sample `\(t\)`-tests

We now know how to calculate a `\(t\)`-distribution-based confidence interval

* Let's now look at hypothesis tests for the mean
* The **T score** is a ratio of how much the sample mean differs from the hypothesized mean to how much the observations vary:
$$ T = \frac{\bar{x} - \mbox{null value}}{s/\sqrt{n}} $$
* When the null hypothesis is true and the conditions are met, T has a t-distribution with `\(df = n - 1\)`. Conditions:
  - Independent observations.
  - Large samples and no extreme outliers.
]

--

.pull-right[
Example: testing whether runners in Washington, DC races are getting slower or faster over time

* In 2006, the DC Cherry Blossom Race (10 miles) had an average time of 93.29 minutes
* We will use data from 100 participants from 2017 to determine whether runners are getting faster or slower (vs. the possibility of no change)
  - Null hypothesis `\(H_0\)`: the average 10-mile time is the same in 2017 as in 2006, so `\(\mu = 93.29\)`
  - Alternative hypothesis `\(H_A\)`: the average 10-mile run time is different in 2017; `\(\mu \neq 93.29\)`
* The data looks like:

<img src="lec20_files/figure-html/unnamed-chunk-27-1.png" width="60%" />

* Large enough sample, no extreme outliers, so we can proceed
]

---

.pull-left[
* To do the hypothesis test, we use the same procedure as before
  - Normal setting: find the Z score using the observed value, null value, and standard error, then use the normal distribution to calculate the tail area / p-value
  - This setting: find the T score using the observed value, null value, and standard error, then use the **t distribution** to calculate the tail area / p-value
* For the sample of 100 runners, the 2017 data had an average of 98.78 and s.d. of 16.59; the average run time in 2006 was 93.29
* First calculate the standard error:
$$ SE = \frac{s}{\sqrt{n}} = \frac{16.59}{\sqrt{100}} = 1.66$$
* T score:
$$ T = \frac{\text{observed}-\text{null}}{SE} = \frac{98.8 - 93.29}{1.66} = 3.32$$
]

--

.pull-right[
* We have `\(n=100\)` observations, so `\(df=100-1=99\)`.

<img src="lec20_files/figure-html/unnamed-chunk-28-1.png" width="75%" />

* We can use `pt` to find this tail area; by symmetry, the area to the right of 3.32 is the same as the area to the left of -3.32, so

``` r
pt(-3.32, df = 99)
#> [1] 0.0006305322
```

* Double this to get the p-value: 0.00126.
* Since the `\(p\)`-value is `\(<0.05\)`, we can reject the null hypothesis at the 95% confidence level.
  - We can reject at even the 99% confidence level.
  - Thus the average run time in 2017 is different from 2006.
]
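---

#### One sample `\(t\)`-tests in R

In practice, this entire test can be carried out with R's built-in `t.test()`. A minimal sketch (not from the text), assuming the 100 run times from 2017 are stored in a hypothetical numeric vector `time17`:

``` r
# one-sample t-test of H0: mu = 93.29 against a two-sided alternative;
# time17 is a hypothetical vector holding the 100 observed 2017 run times
t.test(time17, mu = 93.29)
```

The output reports the T score, the degrees of freedom, the two-sided p-value, and a 95% confidence interval for `\(\mu\)`, matching the by-hand calculations above.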