class: center, middle, inverse, title-slide

.title[
# Foundations of inference: recap and application
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi

Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

Let's summarize what we've described so far.

<img src="summary.png" width="56%" />

<!-- | | Randomization | Bootstrapping | Mathematical models | -->
<!-- | --------: | -------- | -------- | -------- | -->
<!-- | What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population | Uses theory (primarily the CLT) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples | -->
<!-- | What is the random process described? | Randomized experiment | Random sampling from a population | Randomized experiment or random sampling | -->
<!-- | What other random process can be approximated? | Describe random sampling in an observational model | Describe random allocation in an experiment | Describe random sampling in an observational model or random allocation in an experiment | -->
<!-- | What is it best for? | Hypothesis testing | Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion) | Quick analyses through, for example, calculating a Z score | -->
<!-- | Analogous physical action? | Shuffling cards | Pulling marbles from a bag with replacement | Not applicable | -->

---

#### Clarifying definitions

A **distribution** always describes the shape, center, and variability of values, but what exactly is varying?

.pull-left[
- **Data distribution**: the **observed data**.
- **Population distribution**: the entire **population of data**. Typically not observed.
- **Sampling distribution**: all possible values of a **sample statistic** computed from samples of a given size from a given population. Since the population is never observed, it is never possible to observe the true sampling distribution. But when certain conditions hold, the Central Limit Theorem tells us approximately what the sampling distribution looks like. <!-- but Central Limit Theorem can help. -->
]
.pull-right[
- **Randomization distribution**: all possible values of a **sample statistic** under *random allocations of the treatment variable*. Random allocation forces the treatment variable to be independent of the response variable, so this distribution typically describes the null hypothesis.
<!-- Typically approximate this distribution by simulating many sample statistic values -->
<!-- Typically approximate this distribution by simulating many randomizations. Typically describes the null hypothesis. -->
<!-- do not know due to computational limitations; often we sample a large number and use this as estimate. Typically describes the null hypothesis. -->
- **Bootstrap distribution**: all possible values of a **sample statistic** computed from resamples of the observed data. Because bootstrap resamples are drawn from the observed data, this distribution is centered at the observed sample statistic. Often used for constructing confidence intervals.
<!-- Typically approximate this distribution by simulating many resamples. -->
<!-- Typically do not know due to computational limitations; often we sample a large number and use this as estimate. -->
]

---

.pull-left[
### Case study: Malaria vaccine

Volunteer patients were randomized into one of two groups:

- 14 patients receive the experimental vaccine PfSPZ
- 6 patients receive a placebo vaccine

4 months later, all 20 were exposed to a "drug-sensitive" (easily treatable) malaria strain. Which patients got an infection?
<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> treatment </th>
<th style="text-align:right;"> infection </th>
<th style="text-align:right;"> no infection </th>
<th style="text-align:right;"> Total </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> placebo </td>
<td style="text-align:right;"> 6 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 6 </td>
</tr>
<tr>
<td style="text-align:left;"> vaccine </td>
<td style="text-align:right;"> 5 </td>
<td style="text-align:right;"> 9 </td>
<td style="text-align:right;"> 14 </td>
</tr>
<tr>
<td style="text-align:left;"> Total </td>
<td style="text-align:right;"> 11 </td>
<td style="text-align:right;"> 9 </td>
<td style="text-align:right;"> 20 </td>
</tr>
</tbody>
</table>

* 6/6 = 100% of placebo patients got an infection
* 5/14 ≈ 35.7% of vaccinated patients got an infection
]
.pull-right[
<!-- Q: Is vaccination independent of infection rate? -->
Do the data provide **convincing enough evidence** that the vaccine is effective?

* The sample size is small (only 20 patients); perhaps the large difference in proportions is due to random chance
* (Since this is a *randomized experiment*, convincing evidence would let us conclude that the vaccine *caused* the lower infection rate.)

Let's use a randomization test to see how likely this outcome would be under the null hypothesis `\(H_0\)` <!-- *if vaccine were independent of infection* -->

- `\(H_0\)`: vaccination is **independent** of infection rate
- `\(H_A\)`: vaccination and infection rate are **dependent**; since the only difference between the groups is the random assignment to vaccine or placebo, we could then conclude that the vaccine caused the lower infection rate

<!-- If null hypothesis `\(H_0\)` is true, then the difference in proportions we saw was just due to random chance.
--> ] --- .pull-left[ #### Randomization test <!-- For randomization test, our starting place is the o --> Original observed data: <table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> treatment </th> <th style="text-align:right;"> infection </th> <th style="text-align:right;"> no infection </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> placebo </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> vaccine </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 20 </td> </tr> </tbody> </table> <!-- * We start by assuming that of these 20 patients, 11 were infected and 9 were not infected --> * Under `\(H_0\)`, 11 patients were going to develop an infection and 9 patients would not develop an infection, regardless of which treatment (vaccine vs placebo) they were given * We then repeatedly randomly assign treatment label to each patient - "Shuffle card" idea: 20 cards, 11 labeled "infection", 9 labeled "no infection", then shuffle and split into two decks: "vaccine" (14 cards) and "placebo" (6 cards) * Then observe how many in each vaccine / placebo group are infected vs not infected ] .pull-right[ Output of first randomization: <table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> treatment </th> <th style="text-align:right;"> infection </th> <th style="text-align:right;"> no infection </th> <th style="text-align:right;"> 
Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> placebo </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> vaccine </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 20 </td> </tr> </tbody> </table> <!-- * 4/6 = 66% of placebo infected, --> <!-- * 7/14 = 50% of vaccine infected --> <!-- * difference of 16.7% --> * This randomization has proportion difference of `\(4/6 - 7/14 = 1/6 \approx 0.167\)`. * Compare to original proportion difference `\(6/6 - 5/14 = 9/14 \approx 0.643\)` * This is just one randomization -- need to do many to see how it works out <!-- * Original data had proportion difference of --> <!-- 6/6 = 100% of placebo infected, 5/14 = 35.7% infected, difference of 64.3% --> ] --- .pull-left[ #### Randomization test <!-- Original observed data: --> <!-- ```{r} --> <!-- #| echo: false --> <!-- malaria |> --> <!-- count(treatment, outcome, .drop = FALSE) |> --> <!-- pivot_wider(names_from = outcome, values_from = n) |> --> <!-- adorn_totals(where = c("row", "col")) |> --> <!-- kbl(linesep = "", booktabs = TRUE) |> --> <!-- kable_styling( --> <!-- bootstrap_options = c("striped", "condensed"), --> <!-- latex_options = c("striped", "hold_position"), --> <!-- full_width = FALSE --> <!-- ) --> <!-- ``` --> <!-- - Proportion difference is `\(9/14 \approx 0.643\)`. --> Output of prop. diff. 
for 10,000 randomizations: <img src="lec17_files/figure-html/fig-malaria-rand-dot-plot-1.png" width="432" /> * 127 of the 10,000 randomizations produced a proportion difference at least as large as the observed proportion difference `\(9/14\)` * So about 1.27% chance of seeing something as extreme as the difference in the original data Based on this, we will either reject or not reject `\(H_0\)`. ] .pull-right[ In statistical inference, data scientists evaluate which model is most reasonable given the data. - Generally, errors occur, just like rare events, and we might choose the wrong model. - Statistical inference gives us tools to **control and evaluate how often decision errors occur**. ]
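
---

#### Randomization test in code

The card-shuffling procedure above takes only a few lines of code. The sketch below is illustrative, not the code used to produce the figure (it is written in Python rather than R, and the names `observed_diff`, `n_reps`, and the seed are our own choices):

```python
import random

# Observed data: 6/6 placebo patients infected, 5/14 vaccine patients infected
observed_diff = 6/6 - 5/14  # = 9/14, about 0.643

# Under H0, infection status is fixed: 11 "infection" cards, 9 "no infection" cards
outcomes = ["infection"] * 11 + ["no infection"] * 9

random.seed(1)   # seed chosen arbitrarily, for reproducibility
n_reps = 10_000
n_extreme = 0    # randomizations at least as extreme as the observed data
for _ in range(n_reps):
    random.shuffle(outcomes)                       # shuffle the 20 cards
    placebo, vaccine = outcomes[:6], outcomes[6:]  # deal 6 placebo, 14 vaccine
    diff = placebo.count("infection") / 6 - vaccine.count("infection") / 14
    if diff >= observed_diff:
        n_extreme += 1

p_value = n_extreme / n_reps
print(p_value)   # varies with the seed, but lands near the 127/10000 = 0.0127 above
```

The exact count of extreme randomizations depends on the seed, but with 10,000 repetitions the estimated p-value reliably comes out near 0.012, matching the dot plot.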