Multiple Hypothesis Testing

STA35B: Statistical Data Science 2

Akira Horiguchi
library(tidyverse)
options(pillar.print_min = 9)  # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw() + theme(axis.text = element_text(size = 14), 
                             axis.title = element_text(size = 16)))

Figure: xkcd comic 882, "Significant" (newspaper headline: "Green Jelly Beans Linked to Acne!")

Other common sayings

  • A broken clock is right twice a day.
  • Torture the data until it confesses.

Motivating Example: Suppose a researcher compares 20 outcomes between a treatment and control group:

  • anxiety score, sleep quality, blood pressure, cholesterol, concentration, and 15 more

For each outcome, they test at \(\alpha = 0.05\).

  • If nothing truly differs, how surprising is one “significant” result?

One Test vs. Many Tests

If all \(m \geq 1\) null hypotheses are true and the \(m\) tests are independent, each run at level \(\alpha\):

\[\begin{align} P(\text{no false positives}) &=\; (1 - \alpha)^m\\ \Longrightarrow\quad P(\text{at least one false positive}) &=\; 1 - (1 - \alpha)^m \end{align}\]

False Positive Risk With \(\alpha = 0.05\):

| Number of tests \(m\) | \(P(\text{at least one false positive})\) |
|---|---|
| 1 | 0.05 |
| 5 | 0.23 |
| 10 | 0.40 |
| 20 | 0.64 |
| 100 | 0.99 |

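a_vec <- c(0.05, 0.005)  # per-test significance levels to compare
m_vec <- 1:800           # number of tests
# P(at least one false positive) for m independent tests at level a
fpr_fun <- function(m, a) 1 - (1 - a)^m
# One data frame per significance level, stacked into a single data frame
my_df <- lapply(a_vec, \(a) {
    data.frame(α = a, num_test = m_vec, false_positive_rate = fpr_fun(m_vec, a))
}) |> purrr::reduce(rbind)
my_df |> 
    ggplot(aes(num_test, false_positive_rate)) + 
    geom_line() + 
    facet_wrap(~α, labeller = label_both) +
    labs(x = 'm (number of tests)', y = '1 - (1-α)^m', 
         title = 'Probability of at least one false positive')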

The problem is not a bad test. The problem is repeated opportunity for a false positive.
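
The formula \(1 - (1-\alpha)^m\) can also be checked by simulation. The sketch below is illustrative only: it assumes two-sample t-tests on standard normal data with 30 observations per group (choices not taken from these slides), runs 20 independent tests in which nothing truly differs, and estimates how often at least one of them comes out "significant".

```r
set.seed(1)  # arbitrary seed, for reproducibility

n_sims  <- 2000  # number of simulated studies (illustrative choice)
m       <- 20    # tests per study, as in the motivating example
alpha   <- 0.05
n_group <- 30    # observations per group (assumption for this sketch)

# One simulated study: m independent t-tests, with every null hypothesis true.
# Returns TRUE if at least one test is "significant" at level alpha.
any_false_positive <- function() {
  p <- replicate(m, t.test(rnorm(n_group), rnorm(n_group))$p.value)
  any(p <= alpha)
}

sim_rate <- mean(replicate(n_sims, any_false_positive()))
theory   <- 1 - (1 - alpha)^m

c(simulated = sim_rate, theory = theory)
```

The simulated proportion should land close to the theoretical value of about 0.64, up to Monte Carlo error.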

The Language of Multiple Testing

A family is the collection of tests we want to treat as one inferential unit. E.g.,

  • all primary outcomes in a clinical trial
  • all pairwise group comparisons in an experiment
  • all genes tested in an expression study
  • all A/B metrics used to decide whether to ship a feature

Three Error Rates

| Error rate | Question answered |
|---|---|
| Per-test error rate | How often does each individual test falsely reject? |
| Family-wise error rate | How often do we make at least one false rejection in the family? |
| False discovery rate | Among rejected hypotheses, what fraction are false discoveries on average? |

Family-Wise Error Rate

The family-wise error rate (FWER) is:

\[ P(\text{at least one Type I error in the family}) \]

When even one false positive is a serious problem, we want to ensure that

\[ \text{FWER} \le \alpha \]

Two approaches to ensure that \(\text{FWER} \le \alpha\):

  1. Bonferroni Correction
    • easier to understand
  2. Holm Correction
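    • never less powerful than the Bonferroni Correction: it rejects every hypothesis Bonferroni rejects, and sometimes more
    • slightly more complex, but software handles the details (e.g. p.adjust(p, method = "holm") in R)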

Bonferroni Correction

If testing \(m\) hypotheses \(H_1, \ldots, H_m\) and you want FWER \(\le \alpha\):

\[ \text{reject } H_i \text{ if } p_i \le \frac{\alpha}{m}, \qquad i=1,\ldots,m \]

Example: Five outcomes are tested with \(\alpha = 0.05\).

  • To ensure that FWER \(\le \alpha\), we will

\[ \text{reject } H_i \text{ if } p_i \le \frac{0.05}{5}, \qquad i=1,\ldots,5 \]

| Test | p-value \(p_i\) | Bonferroni decision |
|---|---|---|
| 1 | 0.001 | reject |
| 2 | 0.012 | fail to reject |
| 3 | 0.019 | fail to reject |
| 4 | 0.041 | fail to reject |
| 5 | 0.200 | fail to reject |
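
The five-outcome example can be reproduced in R. This is a minimal sketch: it applies the rule \(p_i \le \alpha/m\) directly and then cross-checks it with the equivalent rule on Bonferroni-adjusted p-values from `p.adjust()` in base R's stats package. The variable names are chosen here for illustration.

```r
alpha <- 0.05
p <- c(0.001, 0.012, 0.019, 0.041, 0.200)  # p-values from the example
m <- length(p)

# Direct rule: reject H_i when p_i <= alpha / m
reject_direct <- p <= alpha / m

# Equivalent rule: Bonferroni-adjusted p-values are min(1, m * p_i);
# reject when the adjusted p-value is <= alpha
p_adj <- p.adjust(p, method = "bonferroni")
reject_adjusted <- p_adj <= alpha

data.frame(p, p_adj, reject_direct, reject_adjusted)
```

Both rules reject only the first test, matching the decisions listed in the example.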

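Why does this ensure that FWER \(\le \alpha\)? Suppose \(m_0 \le m\) of the null hypotheses are true, and label them \(H_1, \ldots, H_{m_0}\). A Type I error can only happen among these. The first inequality below is the union bound, and the second holds because a valid p-value under a true null satisfies \(P(p_i \le t) \le t\):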

\[\begin{align} \text{FWER } &=\; P(\text{at least one Type I error in the family})\\ &=\; P\bigg(\bigcup_{i=1}^{m_0} \Big[p_i \leq \frac{\alpha}{m}\Big]\bigg)\\ &\leq\; \sum_{i=1}^{m_0} P\bigg( p_i \leq \frac{\alpha}{m}\bigg)\\ &\leq\; \sum_{i=1}^{m_0} \frac{\alpha}{m}\\ &=\; m_0 \frac{\alpha}{m}\\ &\leq\; \alpha. \end{align}\]

(We don’t even have to know what \(m_0\) is.)
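
The Holm correction, listed earlier as the second way to control FWER, is not spelled out in these slides. As a hedged sketch, the standard step-down rule is: sort the p-values, compare the \(k\)-th smallest with \(\alpha/(m-k+1)\), and stop at the first one that fails. The helper `holm_reject` below is a hypothetical name written for this illustration; it is cross-checked against `p.adjust(p, method = "holm")` on the same five p-values as the Bonferroni example.

```r
alpha <- 0.05
p <- c(0.001, 0.012, 0.019, 0.041, 0.200)  # same p-values as the Bonferroni example

# Hypothetical helper implementing Holm's step-down rule.
holm_reject <- function(p, alpha) {
  m <- length(p)
  ord <- order(p)
  # k-th smallest p-value is compared with alpha / (m - k + 1)
  passes <- p[ord] <= alpha / (m - seq_len(m) + 1)
  # Number of rejections: everything before the first failure
  n_reject <- if (all(passes)) m else which(!passes)[1] - 1
  reject <- logical(m)
  reject[ord[seq_len(n_reject)]] <- TRUE
  reject
}

reject_holm <- holm_reject(p, alpha)
reject_bonf <- p <= alpha / length(p)

# Cross-check against R's built-in Holm adjustment
reject_builtin <- p.adjust(p, method = "holm") <= alpha

data.frame(p, reject_bonf, reject_holm, reject_builtin)
```

On these values Holm rejects tests 1 and 2, while Bonferroni rejects only test 1, which illustrates the gain in power at the same FWER level.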

False Discovery Rate

The false discovery rate is the expected fraction of discoveries that are false:

\[ \text{FDR} = E\left(\frac{\text{false rejections}}{\text{all rejections}}\right) \]

(The fraction is taken to be 0 when nothing is rejected.)

  • FWER asks: Did we make any false discovery?
  • FDR asks: How noisy is our list of discoveries?

When FDR Makes Sense

FDR is common when:

  • many hypotheses are tested,
  • follow-up studies are possible,
  • a short list of candidates is useful,
  • missing real effects is also costly.

Examples:

  • genomics
  • brain imaging
  • large-scale survey analysis
  • exploratory product experiments

Benjamini-Hochberg Procedure

To control FDR at level \(q\):

  1. Sort p-values: \(p_{(1)} \le \cdots \le p_{(m)}\).
  2. Compute BH thresholds:

\[ \frac{1}{m}q,\ \frac{2}{m}q,\ \ldots,\ \frac{m}{m}q \]

  3. Find the largest rank \(k\) where:

\[ p_{(k)} \le \frac{k}{m}q \]

  4. Reject all hypotheses ranked \(1\) through \(k\). (If no rank qualifies, reject nothing.)

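Example: Ten p-values, controlling FDR at \(q = 0.10\), so the BH threshold at rank \(k\) is \(\frac{k}{10}(0.10) = 0.01k\). The six smallest p-values are shown:

| Rank | p-value | BH threshold | p-value \(\le\) threshold? |
|---|---|---|---|
| 1 | 0.002 | 0.01 | yes |
| 2 | 0.009 | 0.02 | yes |
| 3 | 0.015 | 0.03 | yes |
| 4 | 0.022 | 0.04 | yes |
| 5 | 0.048 | 0.05 | yes |
| 6 | 0.071 | 0.06 | no |

The largest passing rank among those shown is \(k = 5\). Provided none of the p-values at ranks 7 through 10 (not shown) falls at or below its threshold, we reject the first 5 hypotheses.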
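
The BH steps can be sketched in R on this example. Only the six listed p-values are used, with \(m = 10\) supplied explicitly; the p-values at ranks 7 through 10 are not given, so the result below is the largest passing rank among the ranks shown. Variable names are chosen for illustration.

```r
q <- 0.10
m <- 10  # total number of tests in the example

# The six smallest p-values, already sorted; ranks 7-10 are not shown
p_shown <- c(0.002, 0.009, 0.015, 0.022, 0.048, 0.071)

ranks <- seq_along(p_shown)
bh_threshold <- ranks / m * q   # (k / m) * q for each rank k
passes <- p_shown <= bh_threshold

# Largest passing rank among the ranks shown (0 if none pass)
k <- if (any(passes)) max(which(passes)) else 0L

data.frame(rank = ranks, p = p_shown, bh_threshold, passes)
k
```

With a complete vector of p-values `p`, the same rejections are obtained from `p.adjust(p, method = "BH") <= q`.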

Comparing the Methods

| Method | Controls | Typical use | Main tradeoff |
|---|---|---|---|
| No adjustment | per-test error | single planned test | false positives accumulate |
| Bonferroni | FWER | high-stakes confirmatory testing | conservative |
| Holm | FWER | confirmatory testing with more power | slightly more complex |
| BH | FDR | large-scale discovery | allows some false discoveries |
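
To see these tradeoffs on data, the four methods can be run side by side on simulated p-values where the truth is known. Everything about the setup below is an assumption made for illustration (1000 tests, 100 real effects, z-statistics with mean 3 under a real effect); none of it comes from these slides.

```r
set.seed(2)  # arbitrary seed, for reproducibility

m       <- 1000  # total tests (illustrative)
m_alt   <- 100   # tests with a real effect (illustrative)
is_null <- c(rep(FALSE, m_alt), rep(TRUE, m - m_alt))

# Simulated z-statistics: mean 3 under a real effect, mean 0 under the null
z <- rnorm(m, mean = ifelse(is_null, 0, 3))
p <- 2 * pnorm(-abs(z))  # two-sided p-values

level <- 0.05  # used as alpha for the FWER methods and as q for BH
adj_methods <- c("none", "bonferroni", "holm", "BH")

# For each method, count total, false, and true rejections
summary_df <- do.call(rbind, lapply(adj_methods, function(method) {
  reject <- p.adjust(p, method = method) <= level
  data.frame(
    method           = method,
    rejections       = sum(reject),
    false_rejections = sum(reject & is_null),
    true_rejections  = sum(reject & !is_null)
  )
}))
summary_df
```

With no adjustment, each of the 900 true nulls has a 5% chance of being rejected, so about 45 false rejections are expected on average. Bonferroni and Holm should show very few false rejections at the cost of fewer true ones, with BH in between.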

Practical Workflow

Before running the analysis:

  1. Define the family of tests.
  2. Decide whether the goal is FWER control or FDR control.
    • (depends on the scientific goal)
  3. Choose \(\alpha\) or \(q\).
  4. Run all planned tests.
  5. Report adjusted results and the method used.