Inference for comparing two means

class: center, middle, inverse, title-slide

.title[
# Inference for comparing two means
]
.subtitle[
## <br><br> STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi <br> Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[

### Based on Ch 20 of IMS

Previously we used the bootstrap to create confidence intervals and hypothesis tests for a **single mean** `$\mu$`

- e.g. we want to understand a single numeric value about a population (height, speed, etc)

Now: confidence intervals and hypothesis tests for the **difference of two means** `$\mu_1 - \mu_2$`;<br> mean of two different populations. Examples:
- Do pregnant women who are smokers vs. non-smokers have differences in baby weight?
- Was one exam more difficult than another?
- Are Americans taller or shorter than Canadians?

]

.pull-right[

For `$\mu_1 - \mu_2$`:

- Point estimate `$\bar x_1 - \bar x_2$`,<br> where `$\bar x_i$` is sample mean from population `$i$`

Many previous ideas will carry over
- Randomization tests
- Bootstrap for difference of means
- Mathematical approach (Central Limit Theorem)
]

---

.pull-left[
### Randomization test for difference in means

Example: Two slight variations of an exam.

* Prior to passing out the exams, the instructor shuffled them so that each student received a random version (A or B).
* Anticipating complaints, the instructor wants to see if the difference observed between the groups is large enough to provide convincing evidence that one version was more difficult (on average) than the other version.

Summary statistics for how students performed on these two exams are shown below.

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Group </th>
   <th style="text-align:center;"> n </th>
   <th style="text-align:center;"> Mean </th>
   <th style="text-align:center;"> SD </th>
   <th style="text-align:center;"> Min </th>
   <th style="text-align:center;"> Max </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;width: 5em; "> A </td>
   <td style="text-align:center;width: 5em; "> 58 </td>
   <td style="text-align:center;width: 5em; "> 75.10 </td>
   <td style="text-align:center;width: 5em; "> 13.87 </td>
   <td style="text-align:center;width: 5em; "> 44 </td>
   <td style="text-align:center;"> 100 </td>
  </tr>
  <tr>
   <td style="text-align:center;width: 5em; "> B </td>
   <td style="text-align:center;width: 5em; "> 55 </td>
   <td style="text-align:center;width: 5em; "> 71.96 </td>
   <td style="text-align:center;width: 5em; "> 13.77 </td>
   <td style="text-align:center;width: 5em; "> 38 </td>
   <td style="text-align:center;"> 100 </td>
  </tr>
</tbody>
</table>

]

.pull-right[

<img src="lec21_files/figure-html/fig-boxplotTwoVersionsOfExams-1.png" width="432" style="display: block; margin: auto;" />
* Hypotheses to evaluate whether observed difference in sample means is likely to have happened due to chance:
  - `$H_0$`: exams are equally difficult; `$\mu_A = \mu_B$`
  - `$H_A$`: one exam is more difficult; `$\mu_A \neq \mu_B$`
* Observations regarding setup:
  - Independence *within* each group and *between* groups since exams shuffled and randomly passed out
  - min/max values suggest no outliers 
* We'll use an `$\alpha = 0.05$` significance threshold
]

---
.pull-left[
#### Variability of the statistic
* We previously estimated the variability of difference in proportions `$\hat p_1 - \hat p_2$` by randomly assigning treatment to each observation
* Here we do something similar. 
* To simulate the null hypothesis, we 
    * randomly assign 58 of the observed exam scores to group A (the remaining 55 scores then get assigned to group B), then
    * examine the difference `$\bar x_{A,sim1} - \bar x_{B,sim1}$`

<img src="rand2means.png" width="95%" />
]
--
.pull-right[
* Repeating this 10,000 times, we estimate the natural variability in `$\bar x_A - \bar x_B$` when there is no dependence between group and exam score.

<img src="lec21_files/figure-html/fig-randexamspval-1.png" width="396" style="display: block; margin: auto;" />
* In our actual data, the observed difference (highlighted above) was 75.1 - 71.96 = 3.14. 
* 1195 out of 10,000 randomization trials produce difference greater than 3.14, 
* 1173 produce difference less than -3.14
* p-value is then `$\approx$` (1195 + 1173) / 10000 = 0.2368
* Larger than `$\alpha = 0.05$` threshold: fail to reject `$H_0$`
* **Conclude:** the data do not provide convincing evidence that one exam version was more difficult; a difference as large as 3.14 is plausible under random chance alone.

]
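
The shuffling procedure above can be sketched in base R. This is a minimal illustration, not the code behind the slide's figure: the individual exam scores are not given here, so `scores` is simulated stand-in data and the names `diff_means`, `obs_diff`, and `null_diffs` are hypothetical.

```r
# Hypothetical exam scores: simulated stand-ins, NOT the instructor's real data
set.seed(35)
scores  <- c(rnorm(58, mean = 75, sd = 14), rnorm(55, mean = 72, sd = 14))
version <- rep(c("A", "B"), times = c(58, 55))

# Difference in sample means, version A minus version B
diff_means <- function(x, g) mean(x[g == "A"]) - mean(x[g == "B"])
obs_diff <- diff_means(scores, version)

# Shuffle the version labels to simulate the null hypothesis;
# each shuffle still assigns 58 scores to A and 55 to B
n_sims <- 10000
null_diffs <- replicate(n_sims, diff_means(scores, sample(version)))

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value <- mean(abs(null_diffs) >= abs(obs_diff))
p_value
```

Counting shuffles with `abs(null_diffs) >= abs(obs_diff)` is the same two-tail count as on the slide (differences above 3.14 plus differences below -3.14).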

---

.pull-left[
#### Bootstrap confidence interval for difference in means


* E.g. assessing 2 car lots, seeing which one has a cheaper average price, using a sample from each

]

.pull-right[
* We take *bootstrap samples* from **each** group
* then calculate sample means in each bootstrap sample, `$\bar x_{1, boot1}$` and `$\bar x_{2, boot1}$`,
* then use these to calculate estimated difference in means `$\bar x_{1,boot1} - \bar x_{2, boot1}$`
]
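
A minimal base R sketch of these steps, assuming a percentile interval is wanted. The car prices are simulated for illustration only, and `lot1`, `lot2`, and `boot_diffs` are hypothetical names.

```r
# Hypothetical car prices (in dollars): simulated, not real lot data
set.seed(35)
lot1 <- rnorm(20, mean = 21000, sd = 3000)
lot2 <- rnorm(25, mean = 19000, sd = 3500)

n_boot <- 10000
boot_diffs <- replicate(n_boot, {
  # Resample each lot separately, with replacement, keeping its sample size
  b1 <- sample(lot1, replace = TRUE)
  b2 <- sample(lot2, replace = TRUE)
  mean(b1) - mean(b2)
})

# 95% bootstrap percentile interval for mu_1 - mu_2
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci
```

Resampling within each group, rather than pooling the two lots, preserves the group sizes and lets each group keep its own spread.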

---

.pull-left[
#### Case study
Consider the following experiment, which examines whether using embryonic stem cells (ESC) helps improve heart function following a heart attack.

* In the experiment, people were randomly assigned to treatment (ESC) and control groups, and then had their heart pumping capacity measured
* Want to compute 95% confidence interval (CI) for effect of ESC on heart pumping capacity
* Summary statistics from experiment:

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> n </th>
   <th style="text-align:center;"> Mean </th>
   <th style="text-align:center;"> SD </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 6em; "> ESC </td>
   <td style="text-align:center;width: 6em; "> 9 </td>
   <td style="text-align:center;width: 6em; "> 3.50 </td>
   <td style="text-align:center;width: 6em; "> 5.17 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 6em; "> Control </td>
   <td style="text-align:center;width: 6em; "> 9 </td>
   <td style="text-align:center;width: 6em; "> -4.33 </td>
   <td style="text-align:center;width: 6em; "> 2.76 </td>
  </tr>
</tbody>
</table>

* Point estimate of the difference in heart pumping capacity:

`$$\bar{x}_{esc} - \bar{x}_{control}\   =\ 3.50 - (-4.33)\   =\ 7.83$$`
]

--
.pull-right[

Use bootstrap to estimate the distribution of difference in sample means when repeatedly sampling

- Bootstrapped CI does not include 0
- Conclude: ESC increases heart pumping capacity

If the CI did include 0, then we would not have enough evidence to conclude that ESC increases heart pumping capacity
- We would **not** say that "we have evidence that ESC does not change heart pumping capacity"
]

---

.pull-left[
### Mathematical model for testing difference in means
We'll now describe a mathematical approach for testing difference in means

* We'll use: `openintro::births14` dataset.<br>  Consists of randomly sampled survey of mothers in the US.  First few rows below.
* Is there evidence that newborns from smokers have different birth weight than non-smokers?
]

.pull-right[
* Setting up hypotheses:
  - Let `$\mu_n$`: average birth weight of newborns of non-smoking mothers,<br> `$\mu_s$`: same for smoking mothers. 
  - `$H_0$`: no difference in average birthweight for newborns from smoking vs. non-smoking mothers: `$\mu_n - \mu_s =0$`; 
  - `$H_A$`: There is some difference: `$\mu_n -\mu_s \neq 0$`. 
* Summary statistics from the data:

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Habit </th>
   <th style="text-align:center;"> n </th>
   <th style="text-align:center;"> Mean </th>
   <th style="text-align:center;"> SD </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> nonsmoker </td>
   <td style="text-align:center;width: 7em; "> 867 </td>
   <td style="text-align:center;width: 7em; "> 7.27 </td>
   <td style="text-align:center;width: 7em; "> 1.23 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> smoker </td>
   <td style="text-align:center;width: 7em; "> 114 </td>
   <td style="text-align:center;width: 7em; "> 6.68 </td>
   <td style="text-align:center;width: 7em; "> 1.60 </td>
  </tr>
</tbody>
</table>

]

<table class="table table-striped table-condensed" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> fage </th>
   <th style="text-align:center;"> mage </th>
   <th style="text-align:center;"> weeks </th>
   <th style="text-align:center;"> visits </th>
   <th style="text-align:left;"> gained </th>
   <th style="text-align:center;"> weight </th>
   <th style="text-align:center;"> sex </th>
   <th style="text-align:left;"> habit </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 34 </td>
   <td style="text-align:center;"> 34 </td>
   <td style="text-align:center;"> 37 </td>
   <td style="text-align:center;"> 14 </td>
   <td style="text-align:left;"> 28 </td>
   <td style="text-align:center;"> 6.96 </td>
   <td style="text-align:center;"> male </td>
   <td style="text-align:left;"> nonsmoker </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 36 </td>
   <td style="text-align:center;"> 31 </td>
   <td style="text-align:center;"> 41 </td>
   <td style="text-align:center;"> 12 </td>
   <td style="text-align:left;"> 41 </td>
   <td style="text-align:center;"> 8.86 </td>
   <td style="text-align:center;"> female </td>
   <td style="text-align:left;"> nonsmoker </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 37 </td>
   <td style="text-align:center;"> 36 </td>
   <td style="text-align:center;"> 37 </td>
   <td style="text-align:center;"> 10 </td>
   <td style="text-align:left;"> 28 </td>
   <td style="text-align:center;"> 7.51 </td>
   <td style="text-align:center;"> female </td>
   <td style="text-align:left;"> nonsmoker </td>
  </tr>
  <tr>
   <td style="text-align:center;"> NA </td>
   <td style="text-align:center;"> 16 </td>
   <td style="text-align:center;"> 38 </td>
   <td style="text-align:center;"> NA </td>
   <td style="text-align:left;"> 29 </td>
   <td style="text-align:center;"> 6.19 </td>
   <td style="text-align:center;"> male </td>
   <td style="text-align:left;"> nonsmoker </td>
  </tr>
</tbody>
</table>

---
.pull-left[
#### Variability of the statistic
The **test statistic for comparing two means** is a<br> **T score**

`$$T = \frac{\text{point est.} - \text{null}}{SE} = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$`
- Ratio of how the groups differ compared to how the observations within a group vary.
- If observed difference is much larger than observed within-group variability, then the absolute value of the T score will be large

When the null hypothesis is true and the conditions below are met, T score has a t-distribution with `$df = \min(n_1, n_2) - 1.$`

Conditions:

-   Independent observations within and between groups.
-   Large samples and no extreme outliers.

]

.pull-right[

Back to smoking example: we want to model the difference in sample means using `$t$`-distribution;<br> check required assumptions
  1. **Independence**: since randomly sampled, samples are independent
  2. **Nearly-normal data**: both groups have `$>30$` obsns; does data show any extreme outliers?

* No apparent extreme outliers, so all conditions needed to satisfy `$t$` distribution assumptions hold
* So we can proceed with the analysis
]

---

.pull-left[
Let's now complete the hypothesis test

* Let's use `$\alpha=0.05$` (5% significance level)
* Summary statistics from before:

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Habit </th>
   <th style="text-align:center;"> n </th>
   <th style="text-align:center;"> Mean </th>
   <th style="text-align:center;"> SD </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> nonsmoker </td>
   <td style="text-align:center;width: 7em; "> 867 </td>
   <td style="text-align:center;width: 7em; "> 7.27 </td>
   <td style="text-align:center;width: 7em; "> 1.23 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> smoker </td>
   <td style="text-align:center;width: 7em; "> 114 </td>
   <td style="text-align:center;width: 7em; "> 6.68 </td>
   <td style="text-align:center;width: 7em; "> 1.60 </td>
  </tr>
</tbody>
</table>

$$ SE = \sqrt{\frac{s_n^2}{n_n} + \frac{s_s^2}{n_s}} = \sqrt{\frac{1.23^2}{867} + \frac{1.60^2}{114}} \approx 0.16$$
$$ T = \frac{(\bar x_s - \bar x_n) - 0}{SE} = \frac{6.68-7.27}{0.16} = -3.69$$
<img src="lec21_files/figure-html/unnamed-chunk-12-1.png" width="100%" />

]

.pull-right[

* degrees of freedom = `$\min(n_1, n_2)-1 = 113$`
* Compute the one-sided tail area:

``` r
pt(-3.69, df = 113)
#> [1] 0.0001733097
```
* Doubling this gives a p-value of about 0.00035.
* The p-value is much smaller than the significance level, 0.05, so we reject the null hypothesis `$H_0$`.
* Conclude: The data provide convincing evidence of a difference in the average weights of babies born to mothers who smoked during pregnancy and those who did not.
]
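
The whole calculation can be reproduced from the summary statistics alone. The sketch below uses only the numbers in the table; the variable names are hypothetical. Because it keeps the unrounded standard error, its T score and p-value will differ slightly from the hand calculation.

```r
# Summary statistics from the table (birth weight by mother's smoking habit)
n_n <- 867; xbar_n <- 7.27; s_n <- 1.23   # nonsmoker
n_s <- 114; xbar_s <- 6.68; s_s <- 1.60   # smoker

se      <- sqrt(s_n^2 / n_n + s_s^2 / n_s)   # standard error of the difference
t_score <- (xbar_s - xbar_n - 0) / se        # null value is 0
df      <- min(n_n, n_s) - 1                 # conservative degrees of freedom

# Two-sided p-value: twice the tail area beyond |T|
p_value <- 2 * pt(-abs(t_score), df = df)
c(SE = se, T = t_score, df = df, p_value = p_value)
```

Note that base R's `t.test()` on raw data uses the Welch approximation for the degrees of freedom by default, not `$\min(n_1, n_2) - 1$`, so its p-value would not match this one exactly.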

---

.pull-left[
#### Mathematical model for estimating difference in means (confidence interval)
**Using the** `$t$`**-distribution for a difference in means.**

The `$t$`-distribution can be used for inference when working with the standardized difference of two means if
-   *Independence* (extended). The data are independent within and between the two groups, e.g., the data come from independent random samples or from a randomized experiment.
-   *Normality*. We check the outliers for each group separately.

]

.pull-right[

The standard error may be computed as

`$$SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \approx \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$`

Degrees of freedom: `$\min(n_1, n_2)-1$`

The margin of error for `$\bar{x}_1 - \bar{x}_2$` can be directly obtained from `$SE(\bar{x}_1 - \bar{x}_2).$`

$$ \text{Margin of error} = t^\star_{df} \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$

`$t^\star_{df}$`: calculated from percentile of t-distr w/ *df* d.o.f.
]

---

.pull-left[

Let's compute 95% confidence interval for effect of ESC on change in heart pump capacity:

`$$\begin{aligned}
\bar{x}_{esc} - \bar{x}_{control} &= 7.83 \\
SE &= \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95
\end{aligned}$$`

Degrees of freedom is `$\min\{9, 9\} - 1 = 8$`.

Critical value of `$t^{\star}_{8} = 2.31$` for a 95% conf. interval: `qt(0.975, 8)` returns 2.31 (equivalently, `qt(0.025, 8)` returns -2.31)

]

.pull-right[

95% conf. interval is then

$$
\begin{aligned}
\text{point estimate} \ \pm\ t^{\star}_8 \times SE \\ 
\end{aligned} 
$$

$$
\begin{aligned}
\implies 7.83 \ \pm\ 2.31\times 1.95 \\ 
\end{aligned} 
$$

$$
\begin{aligned}
\implies (3.32, 12.34)
\end{aligned} 
$$

Conclude: we are 95% confident that heart pumping function in those that received ESC treatment is between 3.32% and 12.34% higher than for those that did not receive ESC

]
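
As a check on the arithmetic, here is a minimal sketch that rebuilds the interval from the ESC summary statistics. Variable names are hypothetical; since no intermediate values are rounded, the endpoints should agree with the interval above only up to rounding.

```r
# Summary statistics from the ESC experiment table
n_esc <- 9; xbar_esc <- 3.50;  s_esc <- 5.17
n_ctl <- 9; xbar_ctl <- -4.33; s_ctl <- 2.76

point_est <- xbar_esc - xbar_ctl
se        <- sqrt(s_esc^2 / n_esc + s_ctl^2 / n_ctl)
df        <- min(n_esc, n_ctl) - 1
t_star    <- qt(0.975, df = df)   # upper 2.5% cutoff for a 95% interval

# point estimate +/- margin of error
ci <- point_est + c(-1, 1) * t_star * se
ci
```

Changing `0.975` to `0.95` would give the critical value for a 90% interval instead.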