class: center, middle, inverse, title-slide .title[ # Hypothesis testing with randomization ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Akira Horiguchi
Figures taken from the [IMS] text by Cetinkaya-Rundel and Hardin ] --- .pull-left[ **Based on Ch 11 of IMS** *Statistical inference* is concerned with understanding and quantifying *uncertainty* in parameter estimates * Data is typically collected to answer a research question about a larger group called a *population* * Polls: voters who were polled vs entire population of voters * Clinical trials: a study's subjects vs entire population with a particular disease * Recall: in linear regression we estimated a slope and intercept from data. * Do these accurately represent the slope and intercept of the population? * This depends on whether the data is a representative subset of the population. ] .pull-right[ How representative of the population is the data? - Consider two datasets collected from the same population using the same (randomized) methods - These two datasets will typically not be identical; this is due to *randomness* - Quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the question “how different is one dataset from another?” is not trivial. - Studying randomness of this form is a key focus of statistics. ] --- .pull-left[ The goal of inference is to quantify how likely certain outcomes are due to random chance vs. due to real differences. * Think through what would happen if we repeatedly took different surveys of people's opinions on support for some policy. * There will be some natural sample-to-sample variation, but if there is a big difference in support for vs. against the policy, this randomness will be drowned out by the true difference in the population preference. * The coming lectures will discuss the *hypothesis testing* framework, which allows us to formally evaluate claims about populations. Let's look at two motivating examples. ] .pull-right[ <!-- First discuss the *hypothesis testing* framework: allows for formally evaluating claims about populations --> <!-- Notation: --> <!-- * `\(p\)` to denote population proportion (e.g. proportion of population supporting some policy), `\(\hat p\)` to denote *sample* proportion (e.g. taking a survey of 1,000 people on whether they support a policy) --> <!-- * `\(\mu\)` denotes population mean (e.g., average height of all US citizens) and `\(\bar x\)` to denote sample mean (e.g., average height of students in this classroom) --> ] --- .pull-left[ ### 1970s Discrimination Study This study investigated sex discrimination in the 1970s * Question we investigate: "Are female employees discriminated against in promotion decisions made by their male managers?" ``` r openintro::sex_discrimination |> str() #> tibble [48 × 2] (S3: tbl_df/tbl/data.frame) #> $ sex : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ... #> $ decision: Factor w/ 2 levels "promoted","not promoted": 1 1 1 1 1 1 1 1 1 1 ... ``` ``` r table(sex_discrimination) #> decision #> sex promoted not promoted #> male 21 3 #> female 14 10 ``` - Proportion of promoted males: `\(\hat{p}_M = 21 / 24\)` - Proportion of promoted females: `\(\hat{p}_F = 14 / 24\)` - Difference of proportions: `\(\hat{p}_M - \hat{p}_F = 0.292\)` ] .pull-right[ <!-- * Note that we only have a single variable, and this is *observational data*. Without a truly randomized study, with care taken to ensure that the two groups (here: male/female) are equal among other characteristics (e.g. seniority, number of years worked, etc.) our statistical inference may not be valid.
But let us --> * We can formulate two competing claims about the relationship between sex and promotions: - `\(H_0\)`, **Null hypothesis**: variables `sex` and `decision` are independent. Any observed differences in proportions promoted are due to natural variability. - E.g.: `\(p_M = p_F\)`. - `\(H_A\)`, **Alternative hypothesis**: variables `sex` and `decision` are *dependent*. Observed differences in proportions are due to dependence between the two variables. - E.g.: `\(p_M > p_F\)`. * Here we only have a single predictor variable, and this is *observational data*. We cannot say that sex *caused* the difference in promotion outcomes. * Correlation does not imply causation! * To infer causality, we would need a truly randomized study, with care taken to ensure that the two groups (here: male/female) are equal among other characteristics (e.g. seniority, number of years worked, etc.). ] --- .pull-left[ ### Variability of the statistic We can examine whether the null hypothesis `\(H_0\)` is true by using a **permutation test**. * If the two variables `sex` and `decision` were independent, then shuffling all of the "male" and "female" labels would leave the proportion promoted in each group roughly unchanged * If the proportions in the shuffled data are very different from those in the sample we actually observed, then there is evidence that `\(H_0\)` is *not* true Let's now shuffle the male / female labels among the 48 study subjects, keeping fixed the number of males and females and the number promoted / not promoted. - See figures on right: 35 red cards ("promoted") and 13 white cards ("not promoted") get shuffled; the resulting card deck is then split in half. ] .pull-right[ <img src="sex-rand-02-shuffle-1.png" width="100%" /> <table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">decision</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> </tr> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> promoted </th> <th style="text-align:right;"> not promoted </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 7em; "> male </td> <td style="text-align:right;width: 7em; "> 18 </td> <td style="text-align:right;width: 7em; "> 6 </td> <td style="text-align:right;width: 7em; "> 24 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> female </td> <td style="text-align:right;width: 7em; "> 17 </td> <td style="text-align:right;width: 7em; "> 7 </td> <td style="text-align:right;width: 7em; "> 24 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Total </td> <td style="text-align:right;width: 7em; "> 35 </td> <td style="text-align:right;width: 7em; "> 13 </td> <td style="text-align:right;width: 7em; "> 48 </td> </tr> </tbody> </table> ] --- .pull-left[ ### Observed statistic vs. null statistics <img src="sex-rand-03-shuffle-1-sort.png" width="100%" /> - Shuffle `\(\#1\)` results in proportion difference of `\(\hat{p}_{M,1} - \hat{p}_{F,1} = 0.042\)`.
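One such shuffle can be produced directly in R; the sketch below is illustrative only (the seed is arbitrary, so the resulting difference will generally not match the 0.042 from Shuffle #1):

``` r
library(openintro)

set.seed(35)                          # arbitrary seed, for reproducibility
shuffled <- sex_discrimination
shuffled$sex <- sample(shuffled$sex)  # randomly reassign the male/female labels

# difference in promotion proportions after this one shuffle
props <- prop.table(table(shuffled$sex, shuffled$decision), margin = 1)
props["male", "promoted"] - props["female", "promoted"]
```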
- We can perform multiple shuffles to see the distribution of the proportion differences under the null hypothesis `\(H_0\)` ] .pull-right[ <div class="figure"> <img src="lec14_files/figure-html/fig-sex-rand-dot-plot-1.png" alt="A stacked dot plot of the 100 simulated differences between the proportion of male and female files recommended for promotion. The differences were simulated under the null hypothesis that there was no discrimination. Two of the 100 simulations had a difference of 29.2% and are colored in blue to indicate that they are as or more extreme than the observed difference." width="100%" /> <p class="caption">A stacked dot plot of proportion differences from 100 simulations produced under the null hypothesis, `\(H_0,\)` where the simulated sex and decision are independent. Two of the 100 simulations had a difference of at least 29.2%, the difference observed in the study, and are shown as solid dots. </p> </div> * Under the null hypothesis, we expect the differences to average about 0. * The observed difference of 0.292 is unlikely, suggesting that promotion decisions and sex were *not* independent ] --- .pull-left[ ### Example: college student savings Let's consider a study where we ask whether telling a college student that they can save money for later purchases will make them spend less now - `\(H_0:\)` **Null hypothesis**. Reminding students that they can save money for later purchases will not have any impact on students' spending decisions. - `\(H_A:\)` **Alternative hypothesis**. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase. * Dataset: `opportunity_cost` in *openintro* ``` r opportunity_cost #> # A tibble: 150 × 2 #> group decision #> <fct> <fct> #> 1 control buy video #> 2 control buy video #> 3 control buy video #> 4 control buy video #> 5 control buy video #> 6 control buy video #> # ℹ 144 more rows ``` ] .pull-right[ *"Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below."* Half of the 150 students were randomized into a control group and given the following options: > (A) Buy this entertaining video. > (B) Not buy this entertaining video. The remaining 75 students were placed in the treatment group; they saw: > (A) Buy this entertaining video. > (B) Not buy this entertaining video. Keep the $14.99 for other purchases. ``` r table(opportunity_cost) #> decision #> group buy video not buy video #> control 56 19 #> treatment 41 34 ``` ] --- .pull-left[ |group |buy video |not buy video | |:---------|:-----------|:-------------| |control |74.67% (56) |25.33% (19) | |treatment |54.67% (41) |45.33% (34) | <img src="lec14_files/figure-html/unnamed-chunk-9-1.png" width="432" /> ] .pull-right[ * Under the treatment, the proportion choosing not to buy the video is about 20 percentage points higher * How much variability would one expect if the treatment had no effect?
* We can do the same type of analysis as in the previous setting - Assume we have 53 people labeled "not buy video", 97 labeled "buy video" - Imagine we have index cards with these labels; we shuffle them and divide them into two stacks of 75 cards each - We treat one stack as a new "control" group and the other as a new "treatment" group - Any difference between the proportions of "buy" and "not buy" cards will be due entirely to random chance - We should generally expect each stack to have `\(\approx 53/2 = 26.5\)` "not buy" cards ] --- .pull-left[ * Let's look at a single randomization <table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">decision</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> </tr> <tr> <th style="text-align:left;"> group </th> <th style="text-align:right;"> buy video </th> <th style="text-align:right;"> not buy video </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 7em; "> control </td> <td style="text-align:right;width: 7em; "> 46 </td> <td style="text-align:right;width: 7em; "> 29 </td> <td style="text-align:right;width: 7em; "> 75 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> treatment </td> <td style="text-align:right;width: 7em; "> 51 </td> <td style="text-align:right;width: 7em; "> 24 </td> <td style="text-align:right;width: 7em; "> 75 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Total </td> <td style="text-align:right;width: 7em; "> 97 </td> <td style="text-align:right;width: 7em; "> 53 </td> <td style="text-align:right;width: 7em; "> 150 </td> </tr> </tbody> </table> From this table, we can compute the difference in "not buy video" proportions that occurred in the first shuffle of the data (i.e., from chance alone): `$$\hat{p}_{T, shfl1} - \hat{p}_{C, shfl1} = \frac{24}{75} - \frac{29}{75} = - 0.067$$` * Compare this to the 20 percentage points (0.2) that we saw before. * Repeat 1000 times; plot results on the right. ] .pull-right[ <div class="figure"> <img src="lec14_files/figure-html/fig-opportunity-cost-rand-hist-1.png" alt="A histogram of 1,000 chance differences produced under the null hypothesis." width="432" /> <p class="caption">A histogram of 1,000 chance differences produced under the null hypothesis.</p> </div> * Thus a difference of 0.2 is very unlikely if the "buy" and "not buy" outcomes were independent of the treatment (being reminded that you can save money for later) * Only 6 of the 1,000 randomizations had as extreme a result ] --- .pull-left[ ### Hypothesis testing In previous problems, we described a *hypothesis test*: - **Null hypothesis**: the default, skeptical assumption that the observed pattern could have happened due to chance alone - **Alternative hypothesis**: the hypothesis that there is some relationship between variables Way to think about it: trial by jury - Null hypothesis: not guilty. - Alternative hypothesis: guilty. - We might reject the null in favor of the alternative *if* there is significant evidence in favor of this claim. - Failure to reject the null does NOT mean the null is true, just that we don't have enough evidence to reject the null. (Innocent until proven guilty.)
] .pull-right[ ### p-values and statistical significance Recall the sex discrimination example: only 2 of the 100 randomizations resulted in promotion rates as extreme as what we observed in the original data <img src="lec14_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] --- .pull-left[ ### p-values and statistical significance <!-- * `\(p\)`-value represents the probability that, if the null hypothesis is true and a linear model is the correct model of the data, that we would obtain data that is at least as extreme as the result actually observed --> More generally, a `\(p\)`-value represents the probability that, if the null hypothesis is true, we would obtain data at least as extreme as the result actually observed * When the `\(p\)`-value is smaller than a threshold (**significance level** `\(\alpha\)`), then we say the results are *statistically significant at level `\(\alpha\)`*, and we reject the null hypothesis in favor of the alternative. * Example: in the video experiment, we saw a 20 percentage point drop in the proportion purchasing among participants reminded that they could save the money. - Only 6 of the 1,000 randomizations had as extreme a result, so the `\(p\)`-value is 0.006. - Statistically significant at level `\(\alpha=0.05\)` and at level `\(\alpha=0.01\)`. ] -- .pull-right[ ### Randomization tests summary * Frame research question in terms of hypotheses - Null hypothesis `\(H_0\)`: skeptical of any relationship between variables - Alternative hypothesis `\(H_A\)`: posits a relationship between variables * Collect data * Model randomness that would occur if the null hypothesis were true - Randomize treatments * Analyze data and identify `\(p\)`-value * Form conclusion about hypotheses using `\(p\)`-value ] --- #### Examples Let's describe null and alternative hypotheses in words and symbols for the following: - Starting in 2008, chain restaurants in CA have displayed calorie counts for each menu item. - Before 2008, we randomly sampled restaurants and recorded whether a person consumed > 2000 calories at the chain restaurant. - After 2008, we returned and again randomly sampled restaurants and recorded whether a person consumed > 2000 calories at the chain restaurant. - We want to see if the data provide convincing evidence of a difference in average calorie intake. **Null hypothesis** `\(H_0\)`: calorie intake is independent of whether we display the number of calories per item; e.g., `\(p_{\text{pre}} = p_{\text{post}}\)`, where `\(p\)` is the proportion of diners consuming > 2000 calories. **Alternative hypothesis** `\(H_A\)`: calorie intake is affected by the display of the number of calories per item; e.g., `\(p_{\text{pre}} \neq p_{\text{post}}\)`. --- .pull-left[ Let's suppose we had data as follows: | | No cal. display | Cal. display | Total | | -------- | --------- | ------- | ----- | | >2000 cal | 825 | 350 | 1175 | | <2000 cal | 1605 | 1800 | 3405 | | Total | 2430 | 2150 | 4580 | Convert this into proportions: | | No calorie display | Calorie display | | -------- | --------- | ------- | | >2000 cal | 0.339 | 0.163 | | <2000 cal | 0.661 | 0.837 | * *Treatment* here is having the calorie display * We see a difference in the proportion eating >2000 calories of: $$ 0.163 - 0.339 = -0.176 $$ * How likely would this have happened if the calorie display did not affect how much people ate? (i.e., if `\(H_0\)` were true?) ] .pull-right[ * We simulate below 200 different randomizations of the data, keeping the total numbers of >2000 cal and <2000 cal responses fixed but shuffling the "no calorie display" / "calorie display" labels across the groups.
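One way such a simulation could be carried out in R is sketched below (a minimal sketch, not the exact code behind the figure: it rebuilds individual-level records from the counts above, and the seed is arbitrary):

``` r
# Rebuild individual-level data from the table of counts above
calories <- data.frame(
  display  = rep(c("no display", "display"), times = c(2430, 2150)),
  over2000 = c(rep(c(TRUE, FALSE), times = c(825, 1605)),
               rep(c(TRUE, FALSE), times = c(350, 1800))))

# difference in the proportion consuming >2000 cal: display minus no display
diff_in_props <- function(d) {
  p <- tapply(d$over2000, d$display, mean)
  unname(p["display"] - p["no display"])
}

set.seed(35)  # arbitrary seed
null_diffs <- replicate(200, {
  shuffled <- calories
  shuffled$display <- sample(shuffled$display)  # shuffling removes any real association
  diff_in_props(shuffled)
})

# estimated p-value: proportion of shuffles at least as extreme as the observed -0.176
mean(null_diffs <= -0.176)
```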
<img src="lec14_files/figure-html/unnamed-chunk-13-1.png" width="360" /> * We see that the simulated differences in proportions are never as extreme as what we observed (-0.176), so there is significant evidence to reject `\(H_0\)`. * None of the 200 simulated randomizations results in a difference `\(< -0.176\)`, so the estimated `\(p\)`-value is `\(< 1/200\)`. ]
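---

#### Randomization test in R (sketch)

For reference, a rough end-to-end sketch of the randomization test for the `opportunity_cost` study, following the summary steps above. This is illustrative rather than the exact code used to produce the earlier figures; the seed is arbitrary, so the count of extreme shuffles will vary around the 6 in 1,000 reported earlier.

``` r
library(openintro)

# observed difference in the proportion choosing "not buy video" (treatment - control)
obs <- tapply(opportunity_cost$decision == "not buy video",
              opportunity_cost$group, mean)
obs_diff <- unname(obs["treatment"] - obs["control"])  # = 0.2

set.seed(35)  # arbitrary seed
null_diffs <- replicate(1000, {
  shuffled_group <- sample(opportunity_cost$group)  # shuffle the treatment labels
  p <- tapply(opportunity_cost$decision == "not buy video", shuffled_group, mean)
  unname(p["treatment"] - p["control"])
})

# p-value: proportion of shuffled differences at least as large as the observed 0.2
mean(null_diffs >= obs_diff)
```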