Multiple Hypothesis Testing

library(tidyverse)
options(pillar.print_min = 9)  # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw() + theme(axis.text = element_text(size = 14),
                             axis.title = element_text(size = 16)))

Green Jelly Beans Linked to Acne! xkcd comic (882, Significant)

Other common sayings

A broken clock is right twice a day.
Torture the data until it confesses.

Motivating Example: Suppose a researcher compares 20 outcomes between a treatment and control group:

anxiety score, sleep quality, blood pressure, cholesterol, concentration, and 15 more

For each outcome, they test at \(\alpha = 0.05\).

If nothing truly differs, how surprising is one “significant” result?

One Test vs. Many Tests

If all \(m \geq 1\) null hypotheses are true and the \(m\) tests are independent:

\[ P(\text{no false positives}) = (1 - \alpha)^m\\ \Longrightarrow P(\text{at least one false positive}) = 1 - (1 - \alpha)^m \]

False Positive Risk With \(\alpha = 0.05\):

Number of tests	\(P(\text{at least one false positive})\)
1	0.05
5	0.23
10	0.40
20	0.64
100	0.99
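The table follows directly from the formula above. The short R sketch below reproduces it and, as a sanity check, compares the \(m = 20\) entry with a simulation. It assumes that p-values under a true null are independent Uniform(0, 1) draws; the names `fwer`, `m_values`, and `sim_fwer` are illustrative only.

```r
# Analytic probability of at least one false positive among m independent
# tests when every null hypothesis is true.
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

m_values <- c(1, 5, 10, 20, 100)
data.frame(num_tests = m_values,
           p_any_false_positive = round(fwer(m_values), 2))

# Simulation check for m = 20 (assumption: null p-values are independent
# Uniform(0, 1) draws).
set.seed(1)
n_rep <- 10000
p_mat <- matrix(runif(n_rep * 20), nrow = n_rep)  # one row per simulated study
sim_fwer <- mean(apply(p_mat, 1, min) <= 0.05)    # share of studies with >= 1 "significant" test
c(analytic = fwer(20), simulated = sim_fwer)
```

The simulated share should land close to the analytic value, up to Monte Carlo error.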

# Probability of at least one false positive among m independent tests at level a
a_vec <- c(0.05, 0.005)
m_vec <- 1:800
fpr_fun <- function(m, a) 1 - (1 - a)^m
# One row per (α, m) pair; fpr_fun is vectorized, so no sapply() is needed
my_df <- tidyr::expand_grid(α = a_vec, num_test = m_vec) |>
    mutate(false_positive_rate = fpr_fun(num_test, α))
my_df |>
    ggplot(aes(num_test, false_positive_rate)) +
    geom_line() +
    facet_wrap(~α, labeller = 'label_both') +
    labs(x = 'm (number of tests)', y = '1 - (1-α)^m',
         title = 'Probability of at least one false positive')

The problem is not a bad test. The problem is repeated opportunity for a false positive.

The Language of Multiple Testing

A family is the collection of tests we want to treat as one inferential unit. E.g.,

all primary outcomes in a clinical trial
all pairwise group comparisons in an experiment
all genes tested in an expression study
all A/B metrics used to decide whether to ship a feature

Three Error Rates

Error rate	Question answered
Per-test error rate	How often does each individual test falsely reject?
Family-wise error rate	How often do we make at least one false rejection in the family?
False discovery rate	Among rejected hypotheses, what fraction are false discoveries on average?

Family-Wise Error Rate

The family-wise error rate (FWER) is:

\[ P(\text{at least one Type I error in the family}) \]

When even one false positive is a serious problem, we want to ensure that

\[ \text{FWER} \le \alpha \]

Two approaches to ensure that \(\text{FWER} \le \alpha\):

Bonferroni Correction
- easier to understand
Holm Correction
- uniformly more powerful than the Bonferroni Correction
- slightly more complex, but just use code.

Bonferroni Correction

If testing \(m\) hypotheses \(H_1, \ldots, H_m\) and you want FWER \(\le \alpha\):

\[ \text{reject } H_i \text{ if } p_i \le \frac{\alpha}{m}, \qquad i=1,\ldots,m \]

Example: Five outcomes are tested with \(\alpha = 0.05\).

To ensure that FWER \(\le \alpha\), we will

\[ \text{reject } H_i \text{ if } p_i \le \frac{0.05}{5}, \qquad i=1,\ldots,5 \]

Test	p-value \(p_i\)	Bonferroni decision
1	0.001	reject
2	0.012	fail to reject
3	0.019	fail to reject
4	0.041	fail to reject
5	0.200	fail to reject
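The decisions in the table can be checked in R. The sketch below applies the rule \(p_i \le \alpha/m\) directly and also through base R's `p.adjust()`, which reports Bonferroni-adjusted p-values \(\min(1, m\,p_i)\) to compare against \(\alpha\). Variable names are illustrative.

```r
# p-values from the five-outcome example above
p <- c(0.001, 0.012, 0.019, 0.041, 0.200)
alpha <- 0.05
m <- length(p)

# Direct rule: reject H_i when p_i <= alpha / m
reject_direct <- p <= alpha / m

# Same decisions via adjusted p-values: min(1, m * p_i) <= alpha
p_bonf <- p.adjust(p, method = "bonferroni")
reject_adjusted <- p_bonf <= alpha

data.frame(test = seq_along(p), p_value = p, p_bonferroni = p_bonf,
           decision = ifelse(reject_direct, "reject", "fail to reject"))
```

Reporting adjusted p-values is often more convenient than reporting an adjusted threshold, because readers can compare them with the usual \(\alpha\).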

Why does this ensure that FWER \(\le \alpha\)? Label the hypotheses so that the true nulls are \(H_1, \ldots, H_{m_0}\), with \(m_0 \le m\). A Type I error can occur only among these, so

\[\begin{align} \text{FWER } &=\; P(\text{at least one Type I error in the family})\\ &=\; P\bigg(\bigcup_{i=1}^{m_0} \Big[p_i \leq \frac{\alpha}{m}\Big]\bigg)\\ &\leq\; \sum_{i=1}^{m_0} P\bigg( p_i \leq \frac{\alpha}{m}\bigg)\\ &\leq\; \sum_{i=1}^{m_0} \frac{\alpha}{m}\\ &=\; m_0 \frac{\alpha}{m}\\ &\leq\; \alpha. \end{align}\]

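(We don’t even have to know what \(m_0\) is. The first inequality is the union bound, and the second holds because a valid p-value under a true null satisfies \(P(p_i \le t) \le t\), so independence between tests is not required.)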
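Holm's correction, listed earlier as the more powerful alternative, is a step-down version of the same idea: sort the p-values, compare the \(i\)-th smallest with \(\alpha/(m - i + 1)\), and stop at the first one that fails. The sketch below writes this out for the five p-values from the Bonferroni example and cross-checks it against `p.adjust(method = "holm")`. Variable names are illustrative.

```r
p <- c(0.001, 0.012, 0.019, 0.041, 0.200)  # p-values from the example above
alpha <- 0.05
m <- length(p)

ord <- order(p)                                  # indices from smallest to largest p-value
holm_threshold <- alpha / (m - seq_len(m) + 1)   # alpha/m, alpha/(m-1), ..., alpha/1
passes <- p[ord] <= holm_threshold

# Step-down: keep rejecting until the first sorted p-value fails its threshold
n_reject <- if (all(passes)) m else which(!passes)[1] - 1
reject_holm <- logical(m)
reject_holm[ord[seq_len(n_reject)]] <- TRUE

# Cross-check against base R's Holm-adjusted p-values
reject_padjust <- p.adjust(p, method = "holm") <= alpha
```

Here Holm rejects tests 1 and 2, whereas Bonferroni rejects only test 1: the second p-value, 0.012, misses \(\alpha/5 = 0.01\) but passes \(\alpha/4 = 0.0125\).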

False Discovery Rate

The false discovery rate (FDR) is the expected fraction of discoveries that are false, with the fraction taken to be 0 when nothing is rejected:

\[ \text{FDR} = E\left(\frac{\text{false rejections}}{\text{all rejections}}\right) \]

FWER asks: Did we make any false discovery?
FDR asks: How noisy is our list of discoveries?
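A small simulation can make the two questions concrete. The sketch below is purely illustrative: it assumes 100 independent two-sided z-tests, of which 80 are true nulls and 20 have a hypothetical mean shift of 3, and it uses the Benjamini-Hochberg adjustment (described in a later section) through `p.adjust()`. All names and settings are assumptions chosen for the example.

```r
set.seed(42)
m <- 100
m0 <- 80        # hypothetical: 80 true nulls, 20 real effects
effect <- 3     # hypothetical mean shift (in z units) for the real effects
q <- 0.10
is_null <- rep(c(TRUE, FALSE), times = c(m0, m - m0))

one_study <- function() {
    z <- rnorm(m, mean = ifelse(is_null, 0, effect))
    p <- 2 * pnorm(-abs(z))                       # two-sided z-test p-values
    reject <- p.adjust(p, method = "BH") <= q
    false_rej <- sum(reject & is_null)
    c(fdp = false_rej / max(sum(reject), 1),      # fraction is 0 when nothing is rejected
      any_false = false_rej > 0)
}

sims <- replicate(2000, one_study())
est <- rowMeans(sims)   # est["fdp"] estimates FDR, est["any_false"] estimates FWER
est
```

Under this setup the average false discovery proportion should stay below \(q\), while the chance of at least one false rejection is much larger. That gap is the difference between controlling FDR and controlling FWER.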

When FDR Makes Sense

FDR is common when:

many hypotheses are tested,
follow-up studies are possible,
a short list of candidates is useful,
missing real effects is also costly.

Examples:

genomics
brain imaging
large-scale survey analysis
exploratory product experiments

Benjamini-Hochberg Procedure

To control FDR at level \(q\):

Sort p-values: \(p_{(1)} \le \cdots \le p_{(m)}\).
Compute BH thresholds:

\[ \frac{1}{m}q,\ \frac{2}{m}q,\ \ldots,\ \frac{m}{m}q \]

Find the largest rank \(k\) where:

\[ p_{(k)} \le \frac{k}{m}q \]

Reject all hypotheses ranked \(1\) through \(k\). If no rank satisfies the inequality, reject nothing.

Example: Ten p-values, controlling FDR at \(q = 0.10\) (only the six smallest are shown):

Rank	p-value	BH threshold
1	0.002	0.01
2	0.009	0.02
3	0.015	0.03
4	0.022	0.04
5	0.048	0.05
6	0.071	0.06

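Rank 5 passes and rank 6 fails. Provided the four p-values not shown also exceed their thresholds, the largest passing rank is \(k = 5\), so reject the first 5 hypotheses.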
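The procedure can be sketched in a few lines of R. The table lists only six of the ten p-values, so the last four values in the code are hypothetical placeholders chosen to exceed their thresholds; they are not from the original example. The result is cross-checked against `p.adjust(method = "BH")`.

```r
# Six smallest p-values from the table above; the last four are HYPOTHETICAL
# placeholders for the ranks the table does not show.
p <- c(0.002, 0.009, 0.015, 0.022, 0.048, 0.071,
       0.20, 0.35, 0.60, 0.90)
q <- 0.10
m <- length(p)

ord <- order(p)                               # indices from smallest to largest p-value
bh_threshold <- seq_len(m) / m * q            # (1/m)q, (2/m)q, ..., (m/m)q
passing <- which(p[ord] <= bh_threshold)
k <- if (length(passing) > 0) max(passing) else 0   # largest passing rank (0 = reject none)

reject_bh <- logical(m)
reject_bh[ord[seq_len(k)]] <- TRUE

# Cross-check against base R's BH-adjusted p-values
reject_padjust <- p.adjust(p, method = "BH") <= q
```

Note that the rule uses the largest passing rank, not the first failure: every hypothesis ranked at or below \(k\) is rejected, even if an earlier rank happened to miss its own threshold.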

Comparing the Methods

Method	Controls	Typical use	Main tradeoff
No adjustment	per-test error	single planned test	false positives accumulate
Bonferroni	FWER	high-stakes confirmatory testing	conservative
Holm	FWER	confirmatory testing with more power	slightly more complex
BH	FDR	large-scale discovery	allows some false discoveries
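To see the tradeoffs side by side, the sketch below applies all four rows of the table to the five p-values from the Bonferroni example, using base R's `p.adjust()`. The single cutoff `level` is an assumption for illustration: it plays the role of \(\alpha\) for the FWER methods and of \(q\) for BH.

```r
p <- c(0.001, 0.012, 0.019, 0.041, 0.200)  # five-outcome example from the Bonferroni section
level <- 0.05                              # alpha for FWER methods, q for BH

methods <- c(none = "none", Bonferroni = "bonferroni", Holm = "holm", BH = "BH")

# One column of adjusted p-values per method
adjusted <- sapply(methods, function(meth) p.adjust(p, method = meth))
n_rejected <- colSums(adjusted <= level)
n_rejected
```

With these p-values the number of rejections is 4 with no adjustment, 1 with Bonferroni, 2 with Holm, and 3 with BH, which matches the ordering from most to least conservative in the table.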

Practical Workflow

The first three steps should be settled before running the analysis:

Define the family of tests.
Decide whether the goal is FWER control or FDR control.
- (depends on the scientific goal)
Choose \(\alpha\) or \(q\).
Run all planned tests.
Report adjusted results and the method used.
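As a rough end-to-end sketch of this workflow, the code below revisits the motivating example of 20 outcomes. Everything about the data is an assumption made for illustration: outcomes are simulated as standard normal with no true treatment effect, with a hypothetical 50 participants per arm, and the object names are invented for the example.

```r
set.seed(2024)
n_per_group <- 50   # hypothetical sample size per arm
n_outcomes <- 20    # the family: all 20 outcomes

# Simulated data in which treatment truly changes nothing
group <- rep(c("control", "treatment"), each = n_per_group)
outcomes <- matrix(rnorm(2 * n_per_group * n_outcomes), ncol = n_outcomes)

# Run all planned tests: one two-sample t-test per outcome
p_raw <- apply(outcomes, 2, function(y) t.test(y ~ group)$p.value)

# Report raw and adjusted p-values together with the method used
results <- data.frame(
    outcome = seq_len(n_outcomes),
    p_raw = p_raw,
    p_holm = p.adjust(p_raw, method = "holm"),  # FWER control
    p_bh = p.adjust(p_raw, method = "BH")       # FDR control
)

# Number of "significant" outcomes at 0.05 under each approach
colSums(results[, c("p_raw", "p_holm", "p_bh")] <= 0.05)
```

In a real analysis only one of the two adjustments would be reported, chosen in advance according to the scientific goal.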