class: center, middle, inverse, title-slide .title[ # Inference for two proportions ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Akira Horiguchi
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Based on Ch 17 of IMS

We will now use ideas similar to those from previous lectures to derive confidence intervals and hypothesis tests for a difference in population proportions between two groups: `\(p_1 - p_2\)`.

* We will estimate `\(p_1-p_2\)` by `\(\hat p_1 - \hat p_2\)`
  - `\(\hat p_1\)` = proportion of successes in group 1
  - `\(\hat p_2\)` = proportion of successes in group 2
* To estimate variability / perform inference, we'll use three methods:
  - Randomization test
  - Bootstrap for confidence intervals
  - Normal approximation
]

.pull-right[
Experiment: two treatments on patients who underwent CPR for a heart attack and were then admitted to a hospital

<!-- * Example dataset: `openintro::cpr` -->

* Patients are randomly assigned to either receive a blood thinner (treatment) or not (control)
* Outcome: whether the patient survived for at least 24 hours after the heart attack


``` r
str(openintro::cpr)
#> tibble [90 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ group  : Factor w/ 2 levels "control","treatment": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ outcome: Factor w/ 2 levels "died","survived": 2 2 2 2 2 2 2 2 2 2 ...
```

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>
]

---

.pull-left[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>

* Proportion of treatment patients who survived: `\(\hat p_T = \frac{14}{40} = 0.35\)`
* Proportion of control patients who survived: `\(\hat p_C = \frac{11}{50} = 0.22\)`
* Difference `\(\hat p_T - \hat p_C = 0.13\)` -- how likely would this occur due to random chance?
* Formulate a hypothesis test and a randomization procedure to examine it.
  - `\(H_0\)`: blood thinner use after CPR<br> is independent of survival: `\(p_1-p_2=0\)`.
  - `\(H_A\)`: blood thinner use after CPR<br> increases the chance of survival: `\(p_1-p_2>0\)`
* Since the assignment to treatment/control was random initially, if we reject the null, we can say the difference was due to the use of blood thinners (a causal statement)
]

--

.pull-right[
#### Randomization test

* Keep the same number of people who died (65) and survived (25), then see what happens if we randomly assign treatment or control to each patient
* For each simulated treatment/control split,<br> calculate the difference in proportions `\(\hat p_{1,sim} - \hat p_{2,sim}\)`

<!-- * We plot the difference in 100 randomizations below. -->

<img src="lec19_files/figure-html/fig-cpr-rand-dot-plot-1.png" width="432" />

* 12 of 100 simulations had a difference of at least 13%
* If `\(H_0\)` were true, there would be about a 12% chance of seeing a difference at least as large as the one we observed, so we cannot reject `\(H_0\)` at significance level `\(\alpha = 0.05\)`: not enough evidence to claim that blood thinners improve the survival rate
]

---

.pull-left[
#### Bootstrap confidence interval for difference in proportions

Last slide: a randomization distribution to understand the behavior of `\(\hat p_1 - \hat p_2\)` under the null hypothesis `\(H_0: p_1-p_2=0\)`.

* We previously applied bootstrap sampling to a single population
]

--

.pull-right[
* We now have samples from two different groups
* Resample from group 1 to calculate `\(\hat p_{1,boot1}\)`
* Resample from group 2 to calculate `\(\hat p_{2,boot1}\)`
* Each pair of bootstrap samples gives us an estimate of the difference `\(\hat p_{1,boot1} - \hat p_{2,boot1}\)`.
* We can use this to create a distribution of the difference in proportions (see the sketch below).

<!-- * We use this to tally up different estimates of the proportions in each bootstrap sample, which gives us an estimate of how much variability there is in the sample proportions (and thus the difference in sample proportions) -->
<!-- and we will re-sample from each group to calculate bootstrap proportions `\(\hat p_{1,boot1}, \hat p_{2,boot2}\)` from each group. -->
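For concreteness, here is a minimal base-R sketch of this two-group bootstrap. The replicate count (1000) and object names (`treat`, `control`, `boot_diffs`) are illustrative choices; the `cpr_boot_dist` object used on the following slides may have been generated with a different workflow (e.g. the infer package).


``` r
library(openintro)  # cpr data: one row per patient, with `group` and `outcome`

set.seed(1)
treat   <- cpr$outcome[cpr$group == "treatment"]
control <- cpr$outcome[cpr$group == "control"]

# Resample each group with replacement, keeping the original group sizes
boot_diffs <- replicate(1000, {
  p_t <- mean(sample(treat,   replace = TRUE) == "survived")
  p_c <- mean(sample(control, replace = TRUE) == "survived")
  p_t - p_c  # one bootstrap estimate of p_T - p_C
})
```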
]

<img src="boot2propresamps.png" width="110%" />

---

<img src="boot2prop1.png" width="100%" />

---

.pull-left[
<!-- #### Confidence intervals from bootstrap samples -->

<img src="lec19_files/figure-html/fig-bootCPR1000-1.png" width="432" />

* We have already discussed one method for constructing a confidence interval: suppose we saved the bootstrap data as


``` r
cpr_boot_dist |> as_tibble() |> head(n=4)
#> # A tibble: 4 × 2
#>   replicate   stat
#>       <int>  <dbl>
#> 1         1 0.0847
#> 2         2 0.102 
#> 3         3 0.167 
#> 4         4 0.0979
```
]

--

.pull-right[
* For a 95% confidence interval, we want to find the 2.5th and 97.5th percentiles:


``` r
quantile(cpr_boot_dist$stat, c(0.025, 0.975))
#>        2.5%       97.5% 
#> -0.06631288  0.32429733
```

* These two values are the end points of the bootstrap 95% confidence interval:
  - We are 95% confident that, in the population, the survival rate for people receiving blood thinners post CPR is between 0.066 lower and 0.324 higher than the survival rate for non-receivers
* A similar calculation gives a 90% confidence interval:


``` r
quantile(cpr_boot_dist$stat, c(0.05, 0.95))
#>          5%         95% 
#> -0.03240614  0.28406692
```

<!-- * 90% conf interval: (-0.029, 0.280) -->

* These two values are the end points of the bootstrap 90% confidence interval:
  - Even at 90% confidence, the interval still includes 0
]

---

.pull-left[
#### Mathematical model for difference in proportions

As with `\(\hat p\)` for a single proportion, the difference in proportions `\(\hat p_1 - \hat p_2\)` can be modeled with a normal distribution under certain assumptions.

* Assumptions are similar to those from before, but more stringent:
  - **Independence**: data *within each group* and *between groups* are independent; generally satisfied in a randomized experiment
  - **Success-failure condition**: at least 10 successes and 10 failures *within each group*
* When these are satisfied, the standard error of `\(\hat p_1 - \hat p_2\)` is
`$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$`
* `\(p_1\)` and `\(p_2\)` represent the population proportions,
* `\(n_1\)` and `\(n_2\)` represent the sample sizes.
]

--

.pull-right[
* In most cases, the standard error is approximated using the observed data:
<!-- SE(\hat{p}_1 - \hat{p}_2) \approx -->
`$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
where `\(\hat{p}_1\)` and `\(\hat{p}_2\)` represent the observed sample proportions
* NOTE: `\(\sqrt{a+b} \neq \sqrt{a} + \sqrt{b}\)`! Think of `\(\sqrt{2 + 2} = \sqrt{4}=2\)`, but `\(\sqrt{2} + \sqrt{2} = 2 \sqrt{2} \approx 2.82\)`.

<!-- * **Margin of error** defined in terms of standard error: -->
<!-- `$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$` -->
<!-- where `\(z^\star\)` is calculated from a specified percentile on the normal distribution (e.g. 1.96 for 95%) -->
]

---

.pull-left[
* **Margin of error** defined in terms of standard error:
`$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
  - where `\(z^\star\)` is calculated from a specified percentile of the normal distribution (e.g. 1.96 for 95%)
  - this describes the variability of the difference in proportions (when the assumptions for the normal model hold)
* We can then use the standard approach for defining a confidence interval:
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
(\hat{p}_1 - \hat{p}_2) \ &\pm \ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\end{aligned}`
$$
* Let's return to the CPR dataset
]

--

.pull-right[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>

* Are conditions met?
  - Random assignment of treatment/control, so independence is satisfied
  - Success-failure condition in each group: the counts 11, 14, 39, 26 are all at least 10, so the condition is satisfied
  - all conditions satisfied
* Let's create a 90% confidence interval for the difference in survival rates.
* Let `\(p_T\)` be the survival rate in the treatment group and `\(p_C\)` the survival rate in the control group.
* We estimate these population rates by:
`$$\hat{p}_{T} = \frac{14}{40} = 0.35 \qquad \hat{p}_{C} = \frac{11}{50} = 0.22$$`

<!-- `$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$` -->
]

---

.pull-left[
`$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$`

* Formula for the SE:
`$$SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} + \frac{0.22 (1 - 0.22)}{50}} = 0.095$$`
* Need to find `\(z^*\)` for a 90% confidence interval:


``` r
qnorm(0.05)
#> [1] -1.644854
```

* `\(z^* = 1.64\)`, so
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
0.13 \ &\pm \ 1.64 \ \times \ 0.095 \\
(-0.027 \ &, \ 0.287)
\end{aligned}`
$$
]

--

.pull-right[
Conclusion: We are 90% confident that individuals receiving blood thinners have between a 2.7% smaller and a 28.7% greater chance of survival than those who do not

* Since 0% is in the interval, we cannot conclude that blood thinners affect survival after CPR
* Recall: our bootstrap estimate for a 90% confidence interval was `\((-0.032, 0.284)\)`
* Not exactly the same (the bootstrap uses simulation, the interval above uses a mathematical approximation), but the ideas are roughly similar.
* Similarly, using the randomization test, we found around a 12% chance that a difference as large as the one observed in the original dataset (13%) would occur due to random chance if blood thinner treatment were independent of survival.
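As a quick check, here is a short base-R sketch (object names are illustrative choices) that reproduces the normal-approximation 90% interval above.


``` r
p_t <- 14/40; n_t <- 40   # treatment: survived / total
p_c <- 11/50; n_c <- 50   # control:   survived / total

est <- p_t - p_c                                     # 0.13
se  <- sqrt(p_t*(1 - p_t)/n_t + p_c*(1 - p_c)/n_c)   # ~0.095
z   <- qnorm(0.95)                                   # ~1.645 for a 90% interval

est + c(-1, 1) * z * se                              # approximately (-0.027, 0.287)
```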
]

---

.pull-left[
### Hypothesis test

For a hypothesis test on the *difference in proportions* (null `\(H_0\)`: `\(p_1-p_2=0\)`), we use similar ideas as before: calculate the SE and check technical conditions like independence and success-failure

* One difference: success-failure is defined in terms of the **pooled proportion**,
$$ \hat p_{pool} = \frac{\text{number successes}}{\text{number cases}} = \frac{\hat p_1 n_1 + \hat p_2 n_2}{n_1 + n_2}. $$
Note that
$$ \hat p_1 = \frac{\text{number successes in sample 1}}{n_1}.$$
* From this we can formulate a test statistic for assessing the difference in two proportions: a Z score
]

--

.pull-right[
* Standard error: an approximation to the standard deviation of the sampling distribution of `\(\hat p_1 - \hat p_2\)`
`$$\sqrt{\hat p_{pool}(1-\hat p_{pool}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right)}$$`

<!-- ```{r} -->
<!-- #| echo: false -->
<!-- #| out.width: 80% -->
<!-- knitr::include_graphics("se_pool.png") -->
<!-- ``` -->

* General form of the Z score:
$$
`\begin{align*}
Z &= \frac{\text{point estimate} - \text{null value}}{SE} \\
&= \frac{\hat p_1 - \hat p_2 - 0}{ \sqrt{\hat p_{pool}(1-\hat p_{pool}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right) }}
\end{align*}`
$$
* When `\(H_0\)` is true and the following conditions hold, `\(Z\)` has an approximately standard normal distribution.
  - Independent observations
  - Sufficiently large samples: <br> `\(n_1\,\hat p_{pool} \geq 10, \quad n_1 (1-\hat p_{pool}) \geq 10\)`, `\(n_2\,\hat p_{pool} \geq 10, \quad n_2 (1-\hat p_{pool}) \geq 10\)`
]

---

.pull-left[
#### Example: mammogram study

We'll look at data on whether getting mammograms as part of regular breast cancer exams is effective at reducing deaths from breast cancer
- Control: non-mammogram breast cancer exam
- Treatment: mammograms included in breast cancer exam

Randomly assigned ~90,000 people in total to the control/treatment groups (about 45,000 each)
- `mammogram` dataset from the openintro package

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
   <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th>
  </tr>
  <tr>
   <th style="text-align:left;"> Treatment </th>
   <th style="text-align:right;"> Yes </th>
   <th style="text-align:right;"> No </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 505 </td>
   <td style="text-align:right;"> 44,405 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mammogram </td>
   <td style="text-align:right;"> 500 </td>
   <td style="text-align:right;"> 44,425 </td>
  </tr>
</tbody>
</table>
]

--

.pull-right[
We want to test whether mammograms are effective

* Formulate hypotheses:
  - `\(H_0\)`: mammograms do not affect survival likelihood (`\(p_M = p_C\)`)
  - `\(H_A\)`: mammograms improve likelihood of survival (`\(p_M > p_C\)`)
* Check conditions for being able to use the Z score:
  - Independence: random treatment assignment
  - Success-failure: need to calculate `\(\hat p_{pool}\)`,
`\begin{align}
\hat p_{pool} &= \frac{\text{# of patients who died from breast cancer}}{\text{# of patients in study}} \\
&= \frac{500 + 505}{500 + 44425 + 505 + 44405} = 0.0112
\end{align}`

<!-- ```{r} -->
<!-- #| echo: false -->
<!-- #| out.width: 80% -->
<!-- knitr::include_graphics("breast_cancer.png") -->
<!-- ``` -->
<!-- * Now to check success-failure conditions: -->
$$
`\begin{align*}
\hat{p}_{\textit{pool}} \times n_{M} &= 0.0112 \times \text{44,925} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{M} &= 0.9888 \times \text{44,925} = \text{44,422} \\
\hat{p}_{\textit{pool}} \times n_{C} &= 0.0112 \times \text{44,910} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{C} &= 0.9888 \times \text{44,910} = \text{44,407}
\end{align*}`
$$
  - All are at least 10
* Z score is approximately standard normal

<!-- , so this (plus independence) means we can model the difference of proportions using a normal distribution, and we know that the -->
]

---

.pull-left[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
   <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th>
  </tr>
  <tr>
   <th style="text-align:left;"> Treatment </th>
   <th style="text-align:right;"> Yes </th>
   <th style="text-align:right;"> No </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 505 </td>
   <td style="text-align:right;"> 44,405 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mammogram </td>
   <td style="text-align:right;"> 500 </td>
   <td style="text-align:right;"> 44,425 </td>
  </tr>
</tbody>
</table>

* To compute the test statistic, we need the point estimate as well as the standard error
* `\(\hat p_{pool} = 0.0112\)` is our best estimate of `\(p_M\)` and `\(p_C\)` *if the null hypothesis is true* that `\(p_M = p_C\)`; we will use this to compute the standard error
]

.pull-right[
<!-- 500 / (500 + 44425) - 505 / (505 + 44405) -->
The breast cancer death rate in the mammogram group was 0.012% less than in the control group:

`$$\hat{p}_{M} - \hat{p}_{C}= \frac{500}{500 + 44425} - \frac{505}{505 + 44405} = -0.00012$$`

Next, the standard error is calculated *using the pooled proportion* `\(\hat{p}_{\textit{pool}}\)`:

`$$SE = \sqrt{\frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{M}} + \frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{C}}}= 0.00070$$`

Just like in past tests, we first compute a test statistic and draw a picture:

`$$Z = \frac{\text{point est} - \text{null val}}{SE} = \frac{-0.00012 - 0}{0.00070} = -0.17$$`
]

---

<img src="lec19_files/figure-html/unnamed-chunk-15-1.png" width="75%" />

* Z score: -0.17
* Tail area below `\(Z = -0.17\)`: 0.4325, via `pnorm(-0.17)`
* Double it to get the two-sided p-value: 0.8650
* Can we conclude that mammograms have no benefit and cause no harm?
  - We do not reject the null given the evidence; if mammograms help or hurt, the data suggest the effect is not too large
  - We do not "accept" the null hypothesis; we just do not have enough evidence to reject it
* A short R sketch of this calculation appears below

<!-- --- -->
<!-- .pull-left[ -->
<!-- #### Midterm prep: `ggplot`, `dplyr` and functions -->
<!-- * Know and understand the core ggplot functions (and the dplyr basics: `mutate`, `group_by`, `summarize`, ... 
) --> <!-- - `ggplot`, `aes()`, `geom_point()`, `geom_smooth()`, `geom_bar()`, `geom_histogram()`, `geom_boxplot()`, `geom_density()`, `geom_freqpoly()` --> <!-- - `facet_wrap` / `facet_grid`, --> <!-- - `labs`, `guides`, `theme` --> <!-- * factors, how to relevel them, how to use these in conjunction with ggplot --> <!-- * functions: general syntax, how default values are used, how to use them, embracing (for use with dplyr), `across()`, --> <!-- ] --> <!-- .pull-right[ --> <!-- #### Linear regression --> <!-- * Slope-intercept / point-slope equations for lines --> <!-- * Residuals, residual plots, correlation, R^2 (coefficient of determination), SST/SSE --> <!-- * Understanding outputs of `lm()` in R, translating these into predictions given `\(x\)` values --> <!-- * Understanding how outliers affect least squares line --> <!-- * Plotting data from residuals and vice-versa --> <!-- * Multiple linear regression; categorical variables with 2+ categories; adjusted R^2; model selection with stepwise selection --> <!-- #### Inference --> <!-- * Permutation/randomization tests, `sample()` --> <!-- * Bootstrap samples, bootstrap statistics, bootstrap confidence intervals --> <!-- * Z score / normal distribution approximation: `qnorm()`, `pnorm()`, `quantile()`, Z table --> <!-- * Hypothesis tests, standard error, margin of error --> <!-- * This for proportions; difference of proportions --> <!-- ] -->
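For completeness, here is a hedged base-R sketch of the mammogram calculation above (pooled proportion, SE, Z score, two-sided p-value). The object names are illustrative choices, and `prop.test()` is included only as a cross-check: it reports an equivalent chi-squared statistic rather than a Z score.


``` r
# Counts from the table above: deaths ("successes" for the test) and group sizes
deaths <- c(mammogram = 500, control = 505)
n      <- c(mammogram = 500 + 44425, control = 505 + 44405)

p_hat  <- deaths / n
p_pool <- sum(deaths) / sum(n)                        # ~0.0112

se <- sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))
z  <- (p_hat[1] - p_hat[2]) / se                      # ~ -0.16 (the slides' -0.17 comes from rounding first)
2 * pnorm(-abs(z))                                    # two-sided p-value, ~0.87

# Cross-check with base R's two-sample proportion test (chi-squared form)
prop.test(deaths, n, correct = FALSE)
```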