class: center, middle, inverse, title-slide .title[ # Inference for two proportions ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Akira Horiguchi
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Based on Ch 17 of IMS

We will now use ideas similar to those from previous lectures to derive confidence intervals and hypothesis tests for a difference in population proportions between two groups: `\(p_1 - p_2\)`.

* We will estimate `\(p_1-p_2\)` by `\(\hat p_1 - \hat p_2\)`
  - `\(\hat p_1\)` = proportion of successes in group 1
  - `\(\hat p_2\)` = proportion of successes in group 2
* To estimate variability / perform inference, we'll use three methods:
  - Randomization test
  - Bootstrap for confidence intervals
  - Normal approximation
]

.pull-right[
Experiment: two treatments on patients who underwent CPR for a heart attack and were then admitted to a hospital

<!-- * Example dataset: `openintro::cpr` -->

* Patients are randomly assigned to either receive a blood thinner (treatment) or not (control)
* Outcome: whether the patient survived for at least 24 hours after the heart attack


``` r
str(openintro::cpr)
#> tibble [90 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ group  : Factor w/ 2 levels "control","treatment": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ outcome: Factor w/ 2 levels "died","survived": 2 2 2 2 2 2 2 2 2 2 ...
```

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>
]

---

.pull-left[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>

* Proportion of treatment patients who survived: `\(\hat p_T = \frac{14}{40} = 0.35\)`
* Proportion of control patients who survived: `\(\hat p_C = \frac{11}{50} = 0.22\)`
* Difference `\(\hat p_T - \hat p_C = 0.13\)` -- how likely would this occur due to random chance?
* Formulate a hypothesis test and a randomization procedure to examine it.
  - `\(H_0\)`: blood thinner use after CPR<br> is independent of survival: `\(p_1-p_2=0\)`.
  - `\(H_A\)`: blood thinner use after CPR<br> increases the chance of survival: `\(p_1-p_2>0\)`
* Since the assignment to treatment/control was random initially, if we reject the null, we can say the difference was due to the use of blood thinners (a causal statement)
]

--

.pull-right[
#### Randomization test

* Keep the same number of people who died (65) and survived (25), then see what happens if we randomly assign treatment or control to each patient
* For each simulated treatment/control split,<br> calculate the difference in proportions `\(\hat p_{1,sim} - \hat p_{2,sim}\)`

<!-- * We plot the difference in 100 randomizations below. -->

<img src="lec19_files/figure-html/fig-cpr-rand-dot-plot-1.png" width="432" />

* 12 of 100 simulations had a difference of at least 13%
* If `\(H_0\)` were true, there would be about a 12% chance of seeing a difference at least as large as the one we observed, so we cannot reject `\(H_0\)` at significance level `\(\alpha = 0.05\)`: not enough evidence to claim that blood thinners improve the survival rate
]

---

.pull-left[
#### Bootstrap confidence interval for difference in proportions

Last slide: a randomization distribution to understand the behavior of `\(\hat p_1 - \hat p_2\)` under the null hypothesis `\(H_0: p_1-p_2=0\)`.

* We previously applied bootstrap sampling to a single population
]

--

.pull-right[
* We now have samples from two different groups
* Resample from group 1 to calculate `\(\hat p_{1,boot1}\)`
* Resample from group 2 to calculate `\(\hat p_{2,boot1}\)`
* Each pair of bootstrap samples gives us an estimate of the difference `\(\hat p_{1,boot1} - \hat p_{2,boot1}\)`.
* We can use this to create a distribution of the difference in proportions (see the sketch below).

<!-- * We use this to tally up different estimates of the proportions in each bootstrap sample, which gives us an estimate of how much variability there is in the sample proportions (and thus the difference in sample proportions) -->
<!-- and we will re-sample from each group to calculate bootstrap proportions `\(\hat p_{1,boot1}, \hat p_{2,boot2}\)` from each group. -->
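For concreteness, here is a minimal base-R sketch of this two-group bootstrap. The replicate count (1000) and object names (`treat`, `control`, `boot_diffs`) are illustrative choices; the `cpr_boot_dist` object used on the following slides may have been generated with a different workflow (e.g. the infer package).


``` r
library(openintro)  # cpr data: one row per patient, with `group` and `outcome`

set.seed(1)
treat   <- cpr$outcome[cpr$group == "treatment"]
control <- cpr$outcome[cpr$group == "control"]

# Resample each group with replacement, keeping the original group sizes
boot_diffs <- replicate(1000, {
  p_t <- mean(sample(treat,   replace = TRUE) == "survived")
  p_c <- mean(sample(control, replace = TRUE) == "survived")
  p_t - p_c  # one bootstrap estimate of p_T - p_C
})
```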
]

<img src="boot2propresamps.png" width="110%" />

---

<img src="boot2prop1.png" width="100%" />

---

.pull-left[
<!-- #### Confidence intervals from bootstrap samples -->

<img src="lec19_files/figure-html/fig-bootCPR1000-1.png" width="432" />

* We have already discussed one method for constructing a confidence interval: suppose we saved the bootstrap data as


``` r
cpr_boot_dist |> as_tibble() |> head(n=4)
#> # A tibble: 4 × 2
#>   replicate   stat
#>       <int>  <dbl>
#> 1         1 0.0847
#> 2         2 0.102 
#> 3         3 0.167 
#> 4         4 0.0979
```
]

--

.pull-right[
* For a 95% confidence interval, we want to find the 2.5th and 97.5th percentiles:


``` r
quantile(cpr_boot_dist$stat, c(0.025, 0.975))
#>        2.5%       97.5% 
#> -0.06631288  0.32429733
```

* These two values are the end points of the bootstrap 95% confidence interval:
  - We are 95% confident that, in the population, the survival rate for people receiving blood thinners post CPR is between 0.066 lower and 0.324 higher than the survival rate for non-receivers
* A similar calculation gives a 90% confidence interval:


``` r
quantile(cpr_boot_dist$stat, c(0.05, 0.95))
#>          5%         95% 
#> -0.03240614  0.28406692
```

<!-- * 90% conf interval: (-0.029, 0.280) -->

* These two values are the end points of the bootstrap 90% confidence interval:
  - Even at 90% confidence, the interval still includes 0
]

---

.pull-left[
#### Mathematical model for difference in proportions

As with `\(\hat p\)` for a single proportion, the difference in proportions `\(\hat p_1 - \hat p_2\)` can be modeled with a normal distribution under certain assumptions.

* Assumptions are similar to those from before, but more stringent:
  - **Independence**: data *within each group* and *between groups* are independent; generally satisfied in a randomized experiment
  - **Success-failure condition**: at least 10 successes and 10 failures *within each group*
* When these are satisfied, the standard error of `\(\hat p_1 - \hat p_2\)` is
`$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$`
* `\(p_1\)` and `\(p_2\)` represent the population proportions,
* `\(n_1\)` and `\(n_2\)` represent the sample sizes.
]

--

.pull-right[
* In most cases, the standard error is approximated using the observed data:
<!-- SE(\hat{p}_1 - \hat{p}_2) \approx -->
`$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
where `\(\hat{p}_1\)` and `\(\hat{p}_2\)` represent the observed sample proportions
* NOTE: `\(\sqrt{a+b} \neq \sqrt{a} + \sqrt{b}\)`! Think of `\(\sqrt{2 + 2} = \sqrt{4}=2\)`, but `\(\sqrt{2} + \sqrt{2} = 2 \sqrt{2} \approx 2.82\)`.

<!-- * **Margin of error** defined in terms of standard error: -->
<!-- `$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$` -->
<!-- where `\(z^\star\)` is calculated from a specified percentile on the normal distribution (e.g. 1.96 for 95%) -->
]

---

.pull-left[
* **Margin of error** defined in terms of standard error:
`$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
  - where `\(z^\star\)` is calculated from a specified percentile of the normal distribution (e.g. 1.96 for 95%)
  - this describes the variability of the difference in proportions (when the assumptions for the normal model hold)
* We can then use the standard approach for defining a confidence interval:
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
(\hat{p}_1 - \hat{p}_2) \ &\pm \ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\end{aligned}`
$$
* Let's return to the CPR dataset
]

--

.pull-right[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Group </th>
   <th style="text-align:center;"> Died </th>
   <th style="text-align:center;"> Survived </th>
   <th style="text-align:center;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 7em; "> Control </td>
   <td style="text-align:center;width: 7em; "> 39 </td>
   <td style="text-align:center;width: 7em; "> 11 </td>
   <td style="text-align:center;width: 7em; "> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Treatment </td>
   <td style="text-align:center;width: 7em; "> 26 </td>
   <td style="text-align:center;width: 7em; "> 14 </td>
   <td style="text-align:center;width: 7em; "> 40 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 7em; "> Total </td>
   <td style="text-align:center;width: 7em; "> 65 </td>
   <td style="text-align:center;width: 7em; "> 25 </td>
   <td style="text-align:center;width: 7em; "> 90 </td>
  </tr>
</tbody>
</table>

* Are conditions met?
  - Random assignment of treatment/control, so independence is satisfied
  - Success-failure condition in each group: the counts 11, 14, 39, 26 are all at least 10, so the condition is satisfied
  - all conditions satisfied
* Let's create a 90% confidence interval for the difference in survival rates.
* Let `\(p_T\)` be the survival rate in the treatment group and `\(p_C\)` the survival rate in the control group.
* We estimate these population rates by:
`$$\hat{p}_{T} = \frac{14}{40} = 0.35 \qquad \hat{p}_{C} = \frac{11}{50} = 0.22$$`

<!-- `$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$` -->
]

---

.pull-left[
`$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$`

* Formula for the SE:
`$$SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} + \frac{0.22 (1 - 0.22)}{50}} = 0.095$$`
* Need to find `\(z^*\)` for a 90% confidence interval:


``` r
qnorm(0.05)
#> [1] -1.644854
```

* `\(z^* = 1.64\)`, so
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
0.13 \ &\pm \ 1.64 \ \times \ 0.095 \\
(-0.027 \ &, \ 0.287)
\end{aligned}`
$$
]

--

.pull-right[
Conclusion: We are 90% confident that individuals receiving blood thinners have between a 2.7% smaller and a 28.7% greater chance of survival than those who do not

* Since 0% is in the interval, we cannot conclude that blood thinners affect survival after CPR
* Recall: our bootstrap estimate for a 90% confidence interval was `\((-0.032, 0.284)\)`
* Not exactly the same (the bootstrap uses simulation, the interval above uses a mathematical approximation), but the ideas are roughly similar.
* Similarly, using the randomization test, we found around a 12% chance that a difference as large as the one observed in the original dataset (13%) would occur due to random chance if blood thinner treatment were independent of survival.
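As a quick check, here is a short base-R sketch (object names are illustrative choices) that reproduces the normal-approximation 90% interval above.


``` r
p_t <- 14/40; n_t <- 40   # treatment: survived / total
p_c <- 11/50; n_c <- 50   # control:   survived / total

est <- p_t - p_c                                     # 0.13
se  <- sqrt(p_t*(1 - p_t)/n_t + p_c*(1 - p_c)/n_c)   # ~0.095
z   <- qnorm(0.95)                                   # ~1.645 for a 90% interval

est + c(-1, 1) * z * se                              # approximately (-0.027, 0.287)
```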
]

---

.pull-left[
### Hypothesis test

For a hypothesis test on the *difference in proportions* (null `\(H_0\)`: `\(p_1-p_2=0\)`), we use similar ideas as before: calculate the SE and check technical conditions like independence and success-failure

* One difference: success-failure is defined in terms of the **pooled proportion**,
$$ \hat p_{pool} = \frac{\text{number successes}}{\text{number cases}} = \frac{\hat p_1 n_1 + \hat p_2 n_2}{n_1 + n_2}. $$
Note that
$$ \hat p_1 = \frac{\text{number successes in sample 1}}{n_1}.$$
* From this we can formulate a test statistic for assessing the difference in two proportions: a Z score
]

--

.pull-right[
* Standard error: an approximation to the standard deviation of the sampling distribution of `\(\hat p_1 - \hat p_2\)`
`$$\sqrt{\hat p_{pool}(1-\hat p_{pool}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right)}$$`

<!-- ```{r} -->
<!-- #| echo: false -->
<!-- #| out.width: 80% -->
<!-- knitr::include_graphics("se_pool.png") -->
<!-- ``` -->

* General form of the Z score:
$$
`\begin{align*}
Z &= \frac{\text{point estimate} - \text{null value}}{SE} \\
&= \frac{\hat p_1 - \hat p_2 - 0}{ \sqrt{\hat p_{pool}(1-\hat p_{pool}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right) }}
\end{align*}`
$$
* When `\(H_0\)` is true and the following conditions hold, `\(Z\)` has an approximately standard normal distribution.
  - Independent observations
  - Sufficiently large samples: <br> `\(n_1\,\hat p_{pool} \geq 10, \quad n_1 (1-\hat p_{pool}) \geq 10\)`, `\(n_2\,\hat p_{pool} \geq 10, \quad n_2 (1-\hat p_{pool}) \geq 10\)`
]

---

.pull-left[
#### Example: mammogram study

We'll look at data on whether getting mammograms as part of regular breast cancer exams is effective at reducing deaths from breast cancer
- Control: non-mammogram breast cancer exam
- Treatment: mammograms included in breast cancer exam

Randomly assigned ~90,000 people in total to the control/treatment groups (about 45,000 each)
- `mammogram` dataset from the openintro package

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
   <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th>
  </tr>
  <tr>
   <th style="text-align:left;"> Treatment </th>
   <th style="text-align:right;"> Yes </th>
   <th style="text-align:right;"> No </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 505 </td>
   <td style="text-align:right;"> 44,405 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mammogram </td>
   <td style="text-align:right;"> 500 </td>
   <td style="text-align:right;"> 44,425 </td>
  </tr>
</tbody>
</table>
]

--

.pull-right[
We want to test whether mammograms are effective

* Formulate hypotheses:
  - `\(H_0\)`: mammograms do not affect survival likelihood (`\(p_M = p_C\)`)
  - `\(H_A\)`: mammograms improve likelihood of survival (`\(p_M > p_C\)`)
* Check conditions for being able to use the Z score:
  - Independence: random treatment assignment
  - Success-failure: need to calculate `\(\hat p_{pool}\)`,
`\begin{align}
\hat p_{pool} &= \frac{\text{# of patients who died from breast cancer}}{\text{# of patients in study}} \\
&= \frac{500 + 505}{500 + 44425 + 505 + 44405} = 0.0112
\end{align}`

<!-- ```{r} -->
<!-- #| echo: false -->
<!-- #| out.width: 80% -->
<!-- knitr::include_graphics("breast_cancer.png") -->
<!-- ``` -->
<!-- * Now to check success-failure conditions: -->
$$
`\begin{align*}
\hat{p}_{\textit{pool}} \times n_{M} &= 0.0112 \times \text{44,925} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{M} &= 0.9888 \times \text{44,925} = \text{44,422} \\
\hat{p}_{\textit{pool}} \times n_{C} &= 0.0112 \times \text{44,910} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{C} &= 0.9888 \times \text{44,910} = \text{44,407}
\end{align*}`
$$
  - All are at least 10
* Z score is approximately standard normal

<!-- , so this (plus independence) means we can model the difference of proportions using a normal distribution, and we know that the -->
]

---

.pull-left[
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
   <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th>
  </tr>
  <tr>
   <th style="text-align:left;"> Treatment </th>
   <th style="text-align:right;"> Yes </th>
   <th style="text-align:right;"> No </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 505 </td>
   <td style="text-align:right;"> 44,405 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mammogram </td>
   <td style="text-align:right;"> 500 </td>
   <td style="text-align:right;"> 44,425 </td>
  </tr>
</tbody>
</table>

* To compute the test statistic, we need the point estimate as well as the standard error
* `\(\hat p_{pool} = 0.0112\)` is our best estimate of `\(p_M\)` and `\(p_C\)` *if the null hypothesis is true* that `\(p_M = p_C\)`; we will use this to compute the standard error
]

.pull-right[
<!-- 500 / (500 + 44425) - 505 / (505 + 44405) -->
The breast cancer death rate in the mammogram group was 0.012% less than in the control group:

`$$\hat{p}_{M} - \hat{p}_{C}= \frac{500}{500 + 44425} - \frac{505}{505 + 44405} = -0.00012$$`

Next, the standard error is calculated *using the pooled proportion* `\(\hat{p}_{\textit{pool}}\)`:

`$$SE = \sqrt{\frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{M}} + \frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{C}}}= 0.00070$$`

Just like in past tests, we first compute a test statistic and draw a picture:

`$$Z = \frac{\text{point est} - \text{null val}}{SE} = \frac{-0.00012 - 0}{0.00070} = -0.17$$`
]

---

<img src="lec19_files/figure-html/unnamed-chunk-15-1.png" width="75%" />

* Z score: -0.17
* Tail area below `\(Z = -0.17\)`: 0.4325, via `pnorm(-0.17)`
* Double it to get the two-sided p-value: 0.8650
* Can we conclude that mammograms have no benefit and cause no harm?
  - We do not reject the null given the evidence; if mammograms help or hurt, the data suggest the effect is not too large
  - We do not "accept" the null hypothesis; we just do not have enough evidence to reject it
* A short R sketch of this calculation appears below

<!-- --- -->
<!-- .pull-left[ -->
<!-- #### Midterm prep: `ggplot`, `dplyr` and functions -->
<!-- * Know and understand the core ggplot functions (and the dplyr basics: `mutate`, `group_by`, `summarize`, ... 
) --> <!-- - `ggplot`, `aes()`, `geom_point()`, `geom_smooth()`, `geom_bar()`, `geom_histogram()`, `geom_boxplot()`, `geom_density()`, `geom_freqpoly()` --> <!-- - `facet_wrap` / `facet_grid`, --> <!-- - `labs`, `guides`, `theme` --> <!-- * factors, how to relevel them, how to use these in conjunction with ggplot --> <!-- * functions: general syntax, how default values are used, how to use them, embracing (for use with dplyr), `across()`, --> <!-- ] --> <!-- .pull-right[ --> <!-- #### Linear regression --> <!-- * Slope-intercept / point-slope equations for lines --> <!-- * Residuals, residual plots, correlation, R^2 (coefficient of determination), SST/SSE --> <!-- * Understanding outputs of `lm()` in R, translating these into predictions given `\(x\)` values --> <!-- * Understanding how outliers affect least squares line --> <!-- * Plotting data from residuals and vice-versa --> <!-- * Multiple linear regression; categorical variables with 2+ categories; adjusted R^2; model selection with stepwise selection --> <!-- #### Inference --> <!-- * Permutation/randomization tests, `sample()` --> <!-- * Bootstrap samples, bootstrap statistics, bootstrap confidence intervals --> <!-- * Z score / normal distribution approximation: `qnorm()`, `pnorm()`, `quantile()`, Z table --> <!-- * Hypothesis tests, standard error, margin of error --> <!-- * This for proportions; difference of proportions --> <!-- ] -->
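For completeness, here is a hedged base-R sketch of the mammogram calculation above (pooled proportion, SE, Z score, two-sided p-value). The object names are illustrative choices, and `prop.test()` is included only as a cross-check: it reports an equivalent chi-squared statistic rather than a Z score.


``` r
# Counts from the table above: deaths ("successes" for the test) and group sizes
deaths <- c(mammogram = 500, control = 505)
n      <- c(mammogram = 500 + 44425, control = 505 + 44405)

p_hat  <- deaths / n
p_pool <- sum(deaths) / sum(n)                        # ~0.0112

se <- sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))
z  <- (p_hat[1] - p_hat[2]) / se                      # ~ -0.16 (the slides' -0.17 comes from rounding first)
2 * pnorm(-abs(z))                                    # two-sided p-value, ~0.87

# Cross-check with base R's two-sample proportion test (chi-squared form)
prop.test(deaths, n, correct = FALSE)
```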