class: center, middle, inverse, title-slide .title[ # Inference for comparing many means ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Akira Horiguchi
Figures taken from [IMS], the Cetinkaya-Rundel and Hardin text

]

---

.pull-left[

### Based on Ch 22 of IMS

* We saw last class how to compare population means of two different populations
* Suppose we want to compare means across many populations (e.g., SAT scores by city)
* We can imagine doing all pairwise comparisons:<br> LA vs SF, SF vs San Jose, Davis vs San Jose, ...
* However, as we do more and more comparisons, it becomes likely that we will see differences in the data that are due solely to random chance
* Guarding against this is the goal of the **analysis of variance** (ANOVA) technique, which we will talk about today

]

--

.pull-right[

ANOVA uses a *single* hypothesis test to check whether means across many groups are equal. If there are `\(k\)` groups,

- `\(H_0\)`: All `\(k\)` groups share the same mean outcome: `$$\mu_1 = \mu_2 = \cdots = \mu_k$$` where `\(\mu_j\)` represents the mean of the outcome for observations in group `\(j.\)`
- `\(H_A\)`: At least one mean is different.

We must check three conditions on the data before performing ANOVA:

- the observations are independent within and between groups,
- the responses within each group are nearly normal, and
- the variability across the groups is about equal.

]

---

.pull-left[

### Example

College departments often run multiple sections of the same intro course each semester due to high demand.

* Consider three sections of an intro stats course.
* Are there substantial differences in first exam scores in these three sections (A, B, and C)?
* Describe appropriate hypotheses to determine if there are any differences between the three sections.

]

--

.pull-right[

The hypotheses may be written in the following form:

- `\(H_0\)`: The average score is identical in all sections, `$$\mu_A = \mu_B = \mu_C.$$` Assuming each section is equally difficult, the observed difference in exam scores is due to chance.
- `\(H_A\)`: The average score varies by section. We would reject the null hypothesis in favor of the alternative hypothesis if there were larger differences among the class averages than what we might expect from chance alone.

In ANOVA, strong evidence favoring the alternative hypothesis typically comes from unusually large differences in the group means.

* Key to assessing this: look at how much the means differ relative to the variability of individual observations within each group.

<!-- * Within-group variability vs<br> between-group variability -->

]

---

.pull-left[

Look at the data below; does it look like the differences in group means could come from random chance?

* Compare I vs II vs III, then compare IV vs V vs VI

<!-- ```{r} -->
<!-- #| label: fig-toyANOVA -->
<!-- #| echo: false -->
<!-- #| fig.align: "center" -->
<!-- toy_anova |> -->
<!-- mutate(group2 = if_else(group %in% c("I", "II", "III"), "Dataset 1", "Dataset 2")) |> -->
<!-- ggplot(aes(x = group, y = outcome, color = group2, shape = group2)) + -->
<!-- geom_point(alpha = 0.7, size = 2, show.legend = FALSE) + -->
<!-- scale_color_openintro() + -->
<!-- scale_y_continuous(breaks=seq(-10, 10, by=2)) + -->
<!-- facet_wrap(~group2, nrow = 1, scales = "free_x") + -->
<!-- labs(x = "Group", y = "Outcome") + -->
<!-- theme(strip.text = element_text(size=20), axis.text = element_text(size=16), axis.title=element_text(size=20)) -->
<!-- ``` -->

<div class="figure" style="text-align: center"> <img src="lec22_files/figure-html/fig-toyANOVA-1.png" alt="Data shown as points. Crosses indicate group means." width="432" /> <p class="caption">Data shown as points.
Crosses indicate group means.</p> </div> - group I has same center as group IV;<br> group II has same center as group V;<br> group III has same center as group VI. ] -- .pull-right[ Consider the ratio `$$\frac{\text{between-group variability}}{\text{within-group variability}}$$` - What does this look like for Groups I/II/III?<br> For groups IV/V/VI? - What does a **large** ratio value suggest?<br> What does a **small** ratio value suggest? ---------------------- .pull-left[ Groups I/II/III:<br> it is plausible that difference in group centers is due to random chance. ] .pull-right[ Groups IV/V/VI: difference in group centers seem large relative to variability within each group, possibly due to true differences across groups. ] ] --- .pull-left[ #### F statistic **Goal**: assess whether the observed variability in sample group means is due to random chance For between-group variability, consider the<br> **sum of squares between groups (SSG)**: `$$SSG = \sum_{j=1}^k n_j (\bar x_j - \bar x)^2$$` - `\(k\)` is the number of means we are comparing; - `\(n_j\)` is the number of observations in group `\(j\)`; - `\(\bar x_j\)` is the sample mean of group `\(j\)`; - `\(\bar x\)` is the overall sample mean across *all* observations. For within-group variability, consider the<br> **sum of squares errors (SSE)**: `$$\begin{align*}SSE&= \sum_{j=1}^k \sum_{i=1}^{n_j} ( x_{j,i} - \bar x_j)^2\end{align*}$$` <!-- Will use for comparing *many* means --> <!-- * **Goal**: assess whether the observed variability in sample group means is large enough to be unlikely due to random chance --> <!-- * **Approach**: compare the **sum of squares between groups (SSG)** to the **sum of squared errors (SSE)**: --> <!-- `$$SSG = \sum_{j=1}^k n_j (\bar x_j - \bar x)^2$$` --> <!-- - `\(k\)` is the number of means we are comparing; --> <!-- - `\(n_j\)` is the number of observations in group `\(j\)`; --> <!-- - `\(\bar x_j\)` is the sample mean of group `\(j\)`; --> <!-- - `\(\bar x\)` is the overall sample mean across *all* observations. --> <!-- `$$\begin{align*}SSE&= \sum_{i=1}^n ( x_i - \bar x)^2\end{align*}$$` --> ] .pull-right[ SSG: variability of group means (weighted by group size) <img src="lec22_files/figure-html/unnamed-chunk-2-1.png" width="396" style="display: block; margin: auto;" /> SSE: variability within each group <div class="figure" style="text-align: center"> <img src="lec22_files/figure-html/unnamed-chunk-3-1.png" alt="Data shown as points. Crosses indicate group means." width="396" /> <p class="caption">Data shown as points. 
Crosses indicate group means.</p> </div> <!-- SSE: variability among all observations --> <!-- ```{r} --> <!-- #| echo: false --> <!-- #| warning: false --> <!-- #| fig.width: 6 --> <!-- #| fig.align: "center" --> <!-- toy_anova |> --> <!-- mutate(group2 = if_else(group %in% c("I", "II", "III"), "Dataset 1", "Dataset 2")) |> --> <!-- ggplot(aes(x = group2, y = outcome, color = group2, shape = group2)) + --> <!-- geom_point(alpha = 0.7, size = 2, show.legend = FALSE) + --> <!-- geom_hline(data=dmeans, aes(yintercept = dmean, color = group2), size = 1.5, linetype="dashed", show.legend = FALSE) + --> <!-- scale_color_openintro() + --> <!-- scale_y_continuous(breaks=seq(-10, 10, by=2)) + --> <!-- facet_wrap(~group2, nrow = 1, scales = "free_x") + --> <!-- labs(x = NULL, y = "Outcome") + --> <!-- theme(strip.text = element_text(size=20), axis.text = element_text(size=16), axis.title=element_text(size=20)) --> <!-- ``` --> ] --- .pull-left[ #### F statistic It turns out that the **sum of squares total (SST)** `$$SST = \sum_{i=1}^n (x_i - \bar x)^2$$` is the sum of SSG and SSE: `$$SST = SSG + SSE$$` which can be shown using a bunch of algebra. (Note: I am using the textbook's abbreviations for SSG, SSE, and SST. For this course, I would like you to do the same. But other texts may use different abbreviations, such as total sum of squares (TSS), and SST might refer to sum of squares for treatment, which is our SSG. Therefore, it is very important that you understand what each abbreviation means and the intuition behind them, and not just memorize the abbreviations.) <!-- For between-group variability, consider the<br> **sum of squares between groups (SSG)**: --> <!-- `$$SSG = \sum_{j=1}^k n_j (\bar x_j - \bar x)^2$$` --> <!-- - `\(k\)` is the number of means we are comparing; --> <!-- - `\(n_j\)` is the number of observations in group `\(j\)`; --> <!-- - `\(\bar x_j\)` is the sample mean of group `\(j\)`; --> <!-- - `\(\bar x\)` is the overall sample mean across *all* observations. --> <!-- For within-group variability, consider the<br> **sum of squares errors (SSE)**: --> <!-- `$$\begin{align*}SSE&= \sum_{j=1}^k \sum_{i=1}^{n_j} ( x_{j,i} - \bar x_j)^2\end{align*}$$` --> ] -- .pull-right[ <!-- * Compare SSG to the total variability of all samples to the total mean across all samples using **sum of squared errors (SSE)** --> <!-- `$$\begin{align*}SSE&= \sum_{i=1}^n ( x_i - \bar x)^2\end{align*}$$` --> For reasons we'll see later, we'll want to use *scaled* versions of SSG and SSE: * The **mean square between groups (MSG)** is a scaled version of SSG; <br> the **mean squared errors (MSE)** is a scaled version of SSE: `$$\begin{align*} MSG &= \frac{1}{\mathsf{df}_G} SSG = \frac{1}{k-1} SSG,\newline MSE&=\frac{1}{\mathsf{df}_E} SSE =\frac{1}{n-k} SSE.\end{align*}$$` The **F statistic** is defined to be ratio of these two: `$$F = \frac{MSG}{MSE}.$$` - Typically use software to compute. Not worth doing by hand. Let's now do a randomization test using this F statistic to see if the differences in means are likely due to random chance or not. <!-- * Under null hypothesis that all means are the same, and under the three conditions, the statistic `\(F\)` follows an "F distribution" (see next slide). --> ] --- .pull-left[ #### Example: class data <!-- Recall the exams data which demonstrated a two-sample randomization test for a comparison of means. --> A teacher gave three versions (A, B, C) of an exam. * Data is summarized in table/plot to the right. 
* Is the exam difficulty the same across the three versions? Hypothesis test: - `\(H_0\)`: All three versions are equally difficult. `$$\mu_A = \mu_B = \mu_C$$` - `\(H_A\)`: not `\(H_0.\)` At least one version is inherently more (or less) difficult than the others. ] .pull-right[ <table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:center;"> n </th> <th style="text-align:center;"> Mean </th> <th style="text-align:center;"> SD </th> <th style="text-align:center;"> Min </th> <th style="text-align:center;"> Max </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 5em; "> A </td> <td style="text-align:center;width: 5em; "> 58 </td> <td style="text-align:center;width: 5em; "> 75.1 </td> <td style="text-align:center;width: 5em; "> 13.9 </td> <td style="text-align:center;width: 5em; "> 44 </td> <td style="text-align:center;width: 5em; "> 100 </td> </tr> <tr> <td style="text-align:left;width: 5em; "> B </td> <td style="text-align:center;width: 5em; "> 55 </td> <td style="text-align:center;width: 5em; "> 72.0 </td> <td style="text-align:center;width: 5em; "> 13.8 </td> <td style="text-align:center;width: 5em; "> 38 </td> <td style="text-align:center;width: 5em; "> 100 </td> </tr> <tr> <td style="text-align:left;width: 5em; "> C </td> <td style="text-align:center;width: 5em; "> 51 </td> <td style="text-align:center;width: 5em; "> 78.9 </td> <td style="text-align:center;width: 5em; "> 13.1 </td> <td style="text-align:center;width: 5em; "> 45 </td> <td style="text-align:center;width: 5em; "> 100 </td> </tr> </tbody> </table> <img src="lec22_files/figure-html/fig-boxplotThreeVersionsOfExams-1.png" width="432" style="display: block; margin: auto;" /> ] --- .pull-left[ Randomization test with *many* means: - Idea is the same as in *two* means. - Null assumption: versions are equally difficult. - Exam version (A or B or C) is randomly allocated to the exam scores, under the null assumption. <img src="randANOVA.png" width="100%" /> - For each simulation, we calculate the F statistic. ] .pull-right[ <!-- Since the groups are randomized, the population means under the randomization are the same, so we can see the types of outcomes that result from pure random sampling --> <img src="lec22_files/figure-html/fig-rand3exams-1.png" width="90%" style="display: block; margin: auto;" /> - 297 of 10,000 simulations had an F statistic greater than or equal to the observed value 3.48. - p-value 0.0297 is below 0.05, so we reject `\(H_0\)`. - Conclusion: our observed value is unlikely due to chance if null assumption is true. <!-- - Randomization test p-value is then 0.0297 --> <!-- * Our data's observed test statistic had `\(F = 3.48\)` --> <!-- * We see this is quite unlikely due to chance - in this randomization test, fewer than 3% of instances where means are equal produce this (or more) extreme of a value. --> ] --- ### Test statistic for three or more means is an F .pull-left[ Recall: the **F statistic** is a ratio of how the groups differ (MSG) as compared to how the observations within a group vary (MSE). 
`$$F = \frac{MSG}{MSE}$$`

When the null hypothesis is true and the conditions below are met, the F statistic follows an **F distribution** with parameters `\(df_1 = k-1\)` and `\(df_2 = n-k.\)`

Conditions:

- the observations are independent within and between groups,
- the responses within each group are nearly normal, and
- the variability across the groups is about equal.

<!-- Conditions (XX double check): -->
<!-- - independent observations, both within and across groups, -->
<!-- - large samples and no extreme outliers -->
<!-- Under null hypothesis that all means are equal and under conditions of (1) independence, (2) approximately normal (large sample size, no extreme outliers), and (3) nearly the same variance across groups, we know that -->
<!-- $$ F = \frac{MSG}{MSE} \text{ has `\(F(df=n_1-1, df=n_2-1)\)` distribution}$$ -->

]

.pull-right[

* The F distribution is determined by two parameters, `\(df_1\)` and `\(df_2\)`, representing degrees of freedom
* The F distribution generally does NOT look like a t distribution or a normal distribution: it is a ratio of (non-negative) mean squares, so it can never take negative values

<img src="F.png" width="85%" />

]

---

.pull-left[

#### Mathematical model, test for comparing 3+ means

Why do we care about the F distribution?

* It allows us to use a mathematical approach to evaluate the null hypothesis
* The `\(p\)`-value for the problem is given by the area under the F-distribution curve to the right of the observed F-statistic value.
* Also known as the "right-tail area"
* Similar functions in R can be used to calculate percentiles and areas for F statistics (a short example follows on the next slide)
  - `pnorm()`, `pt()`, `pf()`: area of the distribution to the left of a given value
  - `qnorm()`, `qt()`, `qf()`: value corresponding to a given percentile

]

--

.pull-right[

### Examples: Exams

* So if we return to the data here

<table class="table table-striped table-condensed" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Version </th>
   <th style="text-align:center;"> n </th>
   <th style="text-align:center;"> Mean </th>
   <th style="text-align:center;"> SD </th>
   <th style="text-align:center;"> Min </th>
   <th style="text-align:center;"> Max </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 5em; "> A </td>
   <td style="text-align:center;width: 5em; "> 58 </td>
   <td style="text-align:center;width: 5em; "> 75.1 </td>
   <td style="text-align:center;width: 5em; "> 13.9 </td>
   <td style="text-align:center;width: 5em; "> 44 </td>
   <td style="text-align:center;width: 5em; "> 100 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 5em; "> B </td>
   <td style="text-align:center;width: 5em; "> 55 </td>
   <td style="text-align:center;width: 5em; "> 72.0 </td>
   <td style="text-align:center;width: 5em; "> 13.8 </td>
   <td style="text-align:center;width: 5em; "> 38 </td>
   <td style="text-align:center;width: 5em; "> 100 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 5em; "> C </td>
   <td style="text-align:center;width: 5em; "> 51 </td>
   <td style="text-align:center;width: 5em; "> 78.9 </td>
   <td style="text-align:center;width: 5em; "> 13.1 </td>
   <td style="text-align:center;width: 5em; "> 45 </td>
   <td style="text-align:center;width: 5em; "> 100 </td>
  </tr>
</tbody>
</table>

* The F statistic for this data was 3.48
* Mathematical approach for calculating the `\(p\)`-value:

``` r
df1 <- 3 - 1          # k-1
df2 <- 58+55+51 - 3   # n-k
1 - pf(3.48, df1 = df1, df2 = df2)
#> [1] 0.0331444
```

* Only 3.3% of the F distribution lies to the right of 3.48
* `\(p\)`-value of 0.033 < 0.05, so we can reject the null hypothesis

]
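---

#### Aside: `pf()` and `qf()` for the F distribution

A minimal sketch of the two helper functions mentioned on the previous slide, reusing the exam example's degrees of freedom (`\(df_1 = 2\)`, `\(df_2 = 161\)`) and observed statistic 3.48; the particular calls below are illustrative, not part of the original analysis.

``` r
df1 <- 2    # k - 1
df2 <- 161  # n - k

# pf() gives the area under the F(df1, df2) curve to the LEFT of a value;
# lower.tail = FALSE returns the right-tail area directly, i.e., the same
# p-value (about 0.033) as 1 - pf(3.48, df1, df2) on the previous slide
pf(3.48, df1 = df1, df2 = df2, lower.tail = FALSE)

# qf() gives the value at a given percentile of the F(df1, df2) distribution;
# here, the cutoff the observed F must exceed to reject H0 at the 5% level
qf(0.95, df1 = df1, df2 = df2)
```
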
---

.pull-left[

Alternatively, we can perform analysis of variance using `lm()` and `anova()`:

``` r
(lm_class <- lm(m1 ~ exam, data = classdata))
#> 
#> Call:
#> lm(formula = m1 ~ exam, data = classdata)
#> 
#> Coefficients:
#> (Intercept)        examB        examC
#>      75.103       -3.140        3.838
anova(lm_class) |> tidy() |> kable(digits=c(0, 0, 0, 0, 3, 3))
```

|term      |  df| sumsq| meansq| statistic| p.value|
|:---------|---:|-----:|------:|---------:|-------:|
|exam      |   2|  1290|    645|     3.484|   0.033|
|Residuals | 161| 29810|    185|        NA|      NA|

* `df`: degrees of freedom `\(df_1=k-1\)` and `\(df_2=n-k\)`
* `sumsq`: SSG (top) and SSE (bottom)
* `meansq`: MSG (top) and MSE (bottom)
* `statistic`: F statistic: `\(F = \frac{MSG}{MSE}\)`
* `p.value`: p-value for the F statistic

]

--

.pull-right[

Is this p-value valid? We need to check the conditions:

- Independence: if the data come from a simple random sample, this holds. For these students we aren't sure, so this might present problems, but let's assume it's OK
- Approximate normality: when the sample size is large and there are no extreme outliers, this is satisfied
- Approximately constant variance: each group has approximately the same variance

<img src="lec22_files/figure-html/unnamed-chunk-12-1.png" width="396" style="display: block; margin: auto;" />

* Thus, we can use this p-value to test `\(H_0\)`, and we reject `\(H_0\)` because the p-value is smaller than 0.05

<!-- - **Independence.** If the data are a simple random sample, this condition can be assumed to be -->
<!-- satisfied. For processes and experiments, carefully consider whether the data may be independent -->
<!-- (e.g., no pairing). For example, in the MLB data, the data were not sampled. However, there -->
<!-- are not obvious reasons why independence would not hold for most or all observations. -->
<!-- - **Approximately normal.** As with one- and two-sample testing for means, the normality as- -->
<!-- sumption is especially important when the sample size is quite small when it is ironically difficult -->
<!-- to check for non-normality. A histogram of the observations from each group is shown in Figure 22.8. -->
<!-- Since each of the groups we are considering have relatively large sample sizes, what -->
<!-- we are looking for are major outliers. None are apparent, so this conditions is reasonably met. -->
<!-- - **Constant variance.** The last assumption is that the variance in the groups is about equal from -->
<!-- one group to the next. This assumption can be checked by examining side-by-side box plots of -->
<!-- the outcomes across the groups, as in Figure 22.2. In this case, the variability is similar in the -->
<!-- four groups but not identical. We see in Table 22.3 that the standard deviation does not vary -->
<!-- much from one group to the next. -->
<!-- the observations are independent within and between groups, -->
<!-- the responses within each group are nearly normal, and -->
<!-- the variability across the groups is about equal. -->

]

---

.pull-left[

### Example: Batting in MLB

* Let's see whether batting performance of baseball players differs according to position
  - Outfielder (OF)
  - Infielder (IF)
  - Catcher (C)
* Dataset: 429 Major League Baseball (MLB) players from 2018, each with at least 100 at-bats
* Does the *on-base percentage* (OBP) differ across these 3 groups?
* Some variables and descriptions: ``` #> # A tibble: 4 × 2 #> variable col1 #> <chr> <chr> #> 1 name Player name #> 2 team abbreviated name of the player's team #> 3 position player's primary field position (OF, IF, C) #> 4 OBP On-base percentage ``` ] .pull-right[ * First few rows: <table class="table table-striped table-condensed" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> team </th> <th style="text-align:left;"> position </th> <th style="text-align:right;"> OBP </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Abreu, J </td> <td style="text-align:left;"> CWS </td> <td style="text-align:left;"> IF </td> <td style="text-align:right;"> 0.325 </td> </tr> <tr> <td style="text-align:left;"> Acuna Jr., R </td> <td style="text-align:left;"> ATL </td> <td style="text-align:left;"> OF </td> <td style="text-align:right;"> 0.366 </td> </tr> <tr> <td style="text-align:left;"> Adames, W </td> <td style="text-align:left;"> TB </td> <td style="text-align:left;"> IF </td> <td style="text-align:right;"> 0.348 </td> </tr> <tr> <td style="text-align:left;"> Adams, M </td> <td style="text-align:left;"> STL </td> <td style="text-align:left;"> IF </td> <td style="text-align:right;"> 0.309 </td> </tr> </tbody> </table> * `position` has three levels: ``` r unique(mlb_players_18$position) #> [1] OF IF C #> Levels: OF IF C ``` Null and alternative hypotheses: - `\(H_0\)`: the OBP for these three groups (outfielders, in-fielders, and catchers) is the same; `$$\mu_{OF} = \mu_{IF} = \mu_C$$` - `\(H_A\)`: there is some difference among these three groups: `\(\mu_{OF} \neq \mu_{IF}\)`, `\(\mu_{IF} \neq \mu_C\)`, or `\(\mu_{OF} \neq \mu_C\)` ] --- .pull-left[ #### ANOVA analysis First, we need to check the conditions for F statistic / ANOVA analysis to hold Conditions for F statistic/ANOVA analysis - Independence: if data comes from simple random sample, this holds. 
For MLB we aren't sure, so this might present problems, but let's assume it's OK - Approximate normality - when sample size is large and no extreme outliers, then this is ok - Approximately constant variance - variance within each group is approximately the same across groups ] .pull-right[ <img src="lec22_files/figure-html/unnamed-chunk-16-1.png" width="432" style="display: block; margin: auto;" /> * Does appear to be approximately normal, no extreme outliers * We see that variability across the groups is quite similar * So our F statistic analysis can proceed ] --- .pull-left[ To do analysis of variance, we use `lm()` and `anova()`: ``` r (mod <- lm(OBP~position, data=mlb_players_18)) #> #> Call: #> lm(formula = OBP ~ position, data = mlb_players_18) #> #> Coefficients: #> (Intercept) positionIF positionC #> 0.319819 -0.001433 -0.017881 anova(mod) |> tidy() |> kable(digits=c(0,0,3,4,2,3)) ``` |term | df| sumsq| meansq| statistic| p.value| |:---------|---:|-----:|------:|---------:|-------:| |position | 2| 0.016| 0.0080| 5.08| 0.007| |Residuals | 426| 0.674| 0.0016| NA| NA| <!-- * Let's again look at the MLB data (note we are using a modified version of the `mlb_players_18` dataset) --> <!-- ```{r} --> <!-- #| echo: false --> <!-- mlb_players_18 |> --> <!-- arrange(name) |> --> <!-- select(name, team, position, OBP) |> --> <!-- slice_head(n = 4) |> --> <!-- kbl(linesep = "", booktabs = TRUE, align = "lllrrrrrr") |> --> <!-- kable_styling(bootstrap_options = c("striped", "condensed"), --> <!-- latex_options = c("striped", "hold_position"), full_width = TRUE) --> <!-- ``` --> ] .pull-right[ How to read ANOVA output table? * `df`: degrees of freedom `\(df_1=k-1\)` and `\(df_2=n-k\)` * `sumsq`: SSG (top) and SSE (bottom) * `meansq`: MSG (top) and MSE (bottom) * `statistic`: F statistic: `\(F = \frac{MSG}{MSE}\)` * `p.value`: p-value for the F statistic * p-value is smaller than 0.05, so we can reject the null hypothesis Can derive this entire tibble by using just `df` and `sumsq` (and `pf()` to calc p-value from the statistic) * `meansq`: `sumsq` divided by `df` * `statistic`: quotient of the two `meansq` values * For `p.value`, can use `pf`: ``` r 1 - pf(5.077, df1 = 2, df2 = 426) #> [1] 0.006621471 ``` ] --- .pull-left[ Can derive this entire tibble by using just `df` and `sumsq` (and `pf()` to calc p-value from the statistic) * `meansq`: `sumsq` divided by `df` * `statistic`: quotient of the two `meansq` values * For `p.value`, can use `pf`: ``` r 1 - pf(statistic, df1 = df1, df2 = df2) ``` ] .pull-right[ Let's try it |term | df| sumsq|meansq |statistic |p.value | |:---------|---:|-----:|:------|:---------|:-------| |position | 2| 0.016|X |X |X | |Residuals | 426| 0.674|X |NA |NA | ] --- .pull-left[ ### Practice * IMS 22.9(a, b) * IMS 22.9c * IMS 22.12(a,b) * IMS 22.12c * IMS 22.13(a,b) * Take 3 minutes to think about the problem, discuss with your neighbors * I'll call on one of you and ask what your thoughts are for how to approach the problem * Not expecting you to have the right answer! Just want to hear how you approach the problem, then help guide you through thinking about the next steps ] .pull-right[ ]
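---

#### Appendix: computing the F statistic "by hand"

For reference, here is a minimal sketch of computing SSG, SSE, MSG, MSE, and the F statistic directly from the formulas earlier in the deck. The data frame `toy`, its column names (`group`, `outcome`), and its values are made up purely for illustration; with real data you would check that the results match `anova(lm(outcome ~ group, data = toy))`.

``` r
library(dplyr)

# made-up example data: one numeric outcome and a grouping variable
toy <- data.frame(
  group   = rep(c("A", "B", "C"), each = 4),
  outcome = c(5, 7, 6, 8,  4, 5, 5, 6,  9, 8, 10, 9)
)

k    <- n_distinct(toy$group)  # number of groups
n    <- nrow(toy)              # total number of observations
xbar <- mean(toy$outcome)      # overall sample mean

by_group <- toy |>
  group_by(group) |>
  summarize(nj    = n(),
            xbarj = mean(outcome),                      # group mean
            ssej  = sum((outcome - mean(outcome))^2))   # within-group squared deviations

SSG <- sum(by_group$nj * (by_group$xbarj - xbar)^2)  # between-group variability
SSE <- sum(by_group$ssej)                            # within-group variability

MSG   <- SSG / (k - 1)   # mean square between groups
MSG
MSE   <- SSE / (n - k)   # mean squared error
Fstat <- MSG / MSE

# p-value: right-tail area of the F(k-1, n-k) distribution
pf(Fstat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
```

These quantities line up with the `anova()` output table used in the slides: `sumsq` is (SSG, SSE), `meansq` is (MSG, MSE), and `statistic` is the F statistic.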