Inference for comparing two means

STA35B: Statistical Data Science 2

Akira Horiguchi

Based on Ch 20 of IMS

library(tidyverse)
library(openintro)
library(infer)

library(knitr)
library(ggpubr)
library(kableExtra)
library(gghighlight)

library(scales) # label_dollar

options(pillar.print_min = 9)  # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw() + theme(axis.text = element_text(size = 14),
                             axis.title = element_text(size = 16)))

Previously…

We used the bootstrap to create CIs and hypothesis tests for a single mean \(\mu\).

  • e.g. to understand a single numeric value about a population (e.g., height, weight, etc).

Inference for difference of two means

Now: CIs and hypothesis tests for the difference of two means \[\mu_1 - \mu_2,\] where \(\mu_1\) and \(\mu_2\) are the means of two different populations. Examples:

  • Do babies born to mothers who smoke differ in weight from babies born to non-smoking mothers?
  • Was one exam more difficult than another?
  • Are Americans taller or shorter than Canadians?

For \(\mu_1 - \mu_2\), the point estimate is \[\bar x_1 - \bar x_2,\] where \(\bar x_i\) is the sample mean from population \(i\). Many previous ideas will carry over:

  • Randomization tests
  • Bootstrap for difference of means
  • Mathematical approach (Central Limit Theorem)

Randomization test

Example: Two slight variations of an exam.

  • Each student received a random version (A or B).
  • Anticipating complaints, the instructor wants to see if the difference observed between the groups is large enough to provide convincing evidence that one version was more difficult (on average) than the other version.

Summary statistics for how students performed on these two exams:

Group   n    Mean    SD      Min   Max
A       58   75.10   13.87   44    100
B       55   71.96   13.77   38    100
Figure 1: Boxplot and points of exam scores separated by exam version.
  • Hypotheses to evaluate whether observed difference in sample means is likely to have happened due to chance:
    • \(H_0\): exams are equally difficult; \(\mu_A = \mu_B\).
    • \(H_A\): one exam is more difficult; \(\mu_A \neq \mu_B\).
  • Observations regarding setup:
    • Independence within each group and between groups, since the exams were shuffled and randomly passed out.
    • Min/max values suggest no outliers.
  • We’ll use an \(\alpha = 0.05\) discernibility threshold.

Randomization test: variability of the statistic

Previously, we estimated the variability of the proportion difference \(\hat p_1 - \hat p_2\) by randomly assigning treatment to each observation. Here we do something similar.

  • To simulate the null hypothesis, we
    • randomly assign 58 of the observed exam scores to group A (the remaining 55 scores then get assigned to group B), then
    • examine the difference \(\bar x_{A,sim1} - \bar x_{B,sim1}\)
Figure 2: Cartoon of randomization/shuffling procedure.
  • Repeating this 10,000 times, we estimate the natural variability in \(\bar x_A - \bar x_B\) when there is no dependence between group and exam score.
Figure 3: Histogram of difference in randomized means under null hypothesis. The red vertical line indicates the observed difference in sample means.
  • The observed difference (highlighted above) was 75.1 - 71.96 = 3.14.
  • 1195 out of 10,000 randomization trials produce a difference \(\geq\) 3.14;
  • 1173 produce difference \(\leq\) -3.14.
  • p-value is then \(\approx\) (1195 + 1173) / 10000 = 0.2368.
  • Larger than \(\alpha = 0.05\) threshold: fail to reject \(H_0\).
  • Conclude: the data do not provide enough evidence that one exam version was more difficult than the other.
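The shuffling procedure above can be sketched in a few lines of base R. This is only an illustration: the individual exam scores are not listed here, so the `scores` vector is simulated (hypothetical data chosen to resemble the summary table), and `diff_means` is a helper name introduced for this sketch. Its p-value will therefore not reproduce 0.2368.

```r
set.seed(35)  # arbitrary seed so the sketch is reproducible

# Hypothetical scores, simulated to roughly match the summary table;
# these are NOT the real exam data.
scores  <- c(rnorm(58, mean = 75.1, sd = 13.9),
             rnorm(55, mean = 72.0, sd = 13.8))
version <- rep(c("A", "B"), times = c(58, 55))

# Difference in sample means, A minus B
diff_means <- function(x, g) mean(x[g == "A"]) - mean(x[g == "B"])

obs_diff <- diff_means(scores, version)

# Shuffle the version labels: 58 scores land in A, the other 55 in B
n_reps <- 10000
null_diffs <- replicate(n_reps, diff_means(scores, sample(version)))

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value <- mean(abs(null_diffs) >= abs(obs_diff))
p_value
```

Counting shuffles with `abs(null_diffs) >= abs(obs_diff)` is the same as adding the two tail counts, as was done with 1195 and 1173 above.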

Bootstrap CI for difference in means

Example: assess 2 car lots; which one has a cheaper average price?

  • We have a sample of 5 cars from each lot.
  • We take bootstrap samples from each group,
  • then calculate sample means in each bootstrap sample, \(\bar x_{1}^{(i)}\) and \(\bar x_{2}^{(i)}\),
  • then build a distribution of the bootstrapped differences \(\bar x_{1}^{(i)} - \bar x_{2}^{(i)}\),
  • then create a CI for the difference in means \(\mu_1 - \mu_2\).
Figure 4: Cartoon of bootstrap procedure for car example.
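A minimal base-R sketch of these steps, assuming made-up prices for the two lots (the vectors `lot1` and `lot2` are hypothetical and only there to make the code runnable). It resamples each lot separately and builds a percentile interval from the bootstrapped differences.

```r
set.seed(35)  # arbitrary seed so the sketch is reproducible

# Hypothetical prices in dollars (illustrative only)
lot1 <- c(18300, 20100, 9600, 10700, 27000)
lot2 <- c(18200, 17400, 22600, 19700, 30000)

# Resample each lot with replacement and record the difference in means
n_boot <- 5000
boot_diffs <- replicate(n_boot, {
  mean(sample(lot1, replace = TRUE)) - mean(sample(lot2, replace = TRUE))
})

# 95% percentile interval for mu_1 - mu_2
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci
```

With only 5 cars per lot the interval will be wide; the point of the sketch is the mechanics, not the conclusion.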

Bootstrap CI: stem-cell case study

Consider the following experiment, which examines whether using embryonic stem cells (ESC) helps improve heart function following a heart attack.

  • In the experiment, people were randomly assigned to treatment (ESC) or control groups, and then had their heart pumping capacity measured.
  • We want to compute a 95% CI for the effect of ESC on heart pumping capacity.
  • Summary statistics from experiment:
Table 1
Group     n   Mean    SD
ESC       9   3.50    5.17
Control   9   -4.33   2.76
  • Point estimate of the difference in heart pumping capacity:

\[\bar{x}_{esc} - \bar{x}_{control}\ =\ 3.50 - (-4.33)\ =\ 7.83\]

Use bootstrap to estimate the distribution of difference in sample means when repeatedly sampling:

Figure 5: Histogram of differences of two bootstrapped means. The thick solid vertical line indicates the observed difference. The dashed vertical lines indicate the 2.5 and 97.5 percentiles.
  • Bootstrapped CI does not include 0.
  • Conclude: the data provide evidence that ESC increases heart pumping capacity.

If the CI did include 0, then we would not have enough evidence to conclude that ESC increases heart pumping capacity.

  • We would not say that “we have evidence that ESC does not change heart pumping capacity”.

Mathematical model for testing difference in means

Example: Is there evidence that newborns from smokers have different birth weight than non-smokers?

  • Let \(\mu_n\): average birthweight of non-smokers, \(\mu_s\): smokers. Set up hypotheses:
    • \(H_0\): no difference in average birthweight for newborns from smoking vs. non-smoking mothers: \(\mu_n - \mu_s =0\);
    • \(H_A\): There is some difference: \(\mu_n -\mu_s \neq 0\).

Data

We’ll use the openintro::births14 dataset, a random sample of births in the US. First few rows below.

fage  mage  weeks  visits  gained  weight  sex     habit
34    34    37     14      28      6.96    male    nonsmoker
36    31    41     12      41      8.86    female  nonsmoker
37    36    37     10      28      7.51    female  nonsmoker
NA    16    38     NA      29      6.19    male    nonsmoker
  • Summary statistics from the data:
Habit       n     Mean   SD
nonsmoker   867   7.27   1.23
smoker      114   6.68   1.60

Mathematical model for testing: Variability of the statistic

The test statistic for comparing two means is a T score

\[T \;\;=\;\; \frac{\text{point est.} - \text{null}}{SE} \;\;=\;\; \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}.\]

  • Ratio of how the groups differ compared to how the observations within a group vary.
  • If observed difference is much larger than observed within-group variability, then the absolute value of the T score will be large.

When the null hypothesis is true and the conditions

  1. Independent observations within and between groups.
  2. Large samples and no extreme outliers.

are met, a T score has a t-distribution with \(df = \min(n_1, n_2) - 1.\)
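As a rough illustration of the formula, the sketch below computes the T score and the conservative df from summary statistics alone. The helper name `t_score` is introduced here for illustration, and the inputs are the exam summary statistics from the randomization example, assuming the two conditions hold for those data.

```r
# T score, SE, and conservative df from summary statistics
t_score <- function(mean1, sd1, n1, mean2, sd2, n2, null = 0) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  list(t = (mean1 - mean2 - null) / se, se = se, df = min(n1, n2) - 1)
}

# Exam example: version A vs. version B (values from the summary table)
exam <- t_score(75.10, 13.87, 58, 71.96, 13.77, 55)
exam$t   # about 1.21

# Two-sided p-value from the t-distribution
p_exam <- 2 * pt(-abs(exam$t), df = exam$df)
p_exam
```

The resulting p-value lands in the same neighborhood as the randomization p-value of 0.2368, which is what we would hope for when both approaches are appropriate.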

Mathematical model for testing: Birth Weight t-test

We want to model the difference in sample means using \(t\)-distribution; check required assumptions.

  1. Independence: the births were randomly sampled, so observations are independent within and between groups.
  2. Nearly-normal data: both groups have \(>30\) observations; do the data show any extreme outliers?
Figure 6: Histogram of newborn weights, stratified by mothers who smoked and mothers who did not smoke.
  • No apparent extreme outliers, so the conditions for using the \(t\)-distribution hold.
  • So we can proceed with the analysis.

Let’s now complete the hypothesis test

  • Let’s use \(\alpha=0.05\) (a 5% significance level)
  • Summary statistics from before:
Habit       n     Mean   SD
nonsmoker   867   7.27   1.23
smoker      114   6.68   1.60

\[df = \min(n_1, n_2)-1 = 113\]

Taking group 1 to be smokers and group 2 to be nonsmokers:

\[ SE \;=\; \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \;=\; \sqrt{\frac{1.60^2}{114} + \frac{1.23^2}{867}} \;\approx\; 0.16\] \[T \;=\; \frac{\bar x_1 - \bar x_2 - 0}{SE} \;=\; \frac{6.68-7.27}{0.16} \;=\; -3.69\]

(Street-fighting math: Before formally computing the p-value, guess whether or not we reject \(H_0\) based on the T score and df.)

  • Compute the one-sided tail area:
pt(-3.69, df = 113)
[1] 0.0001733097
  • Doubling this gives a p-value of about 0.00035.
  • The p-value is much smaller than the significance level, 0.05, so we reject the null hypothesis \(H_0\).
  • Conclude: The data provide convincing evidence of a difference in the average weights of babies born to mothers who smoked during pregnancy and those who did not.
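The whole calculation can be reproduced from the summary table, as in the sketch below (variable names are chosen here for illustration). Because the code carries the SE at full precision (about 0.1556) instead of rounding it before dividing, its T score comes out near -3.79 rather than -3.69; the p-value is still far below 0.05, so the conclusion is unchanged.

```r
# Summary statistics from the table (s = smoker, n = nonsmoker)
n_s <- 114; mean_s <- 6.68; sd_s <- 1.60
n_n <- 867; mean_n <- 7.27; sd_n <- 1.23

se      <- sqrt(sd_s^2 / n_s + sd_n^2 / n_n)  # standard error of the difference
t_stat  <- (mean_s - mean_n - 0) / se         # T score under H0: difference = 0
df_cons <- min(n_s, n_n) - 1                  # conservative df, 113
p_value <- 2 * pt(-abs(t_stat), df = df_cons) # two-sided p-value

c(SE = se, T = t_stat, df = df_cons, p = p_value)
```

Note that R's built-in `t.test()` on the raw data uses the Welch approximation for the degrees of freedom rather than \(\min(n_1, n_2) - 1\), so its p-value would differ slightly from this hand calculation.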

Mathematical model for estimating difference in means (CI)

The \(t\)-distribution can be used for inference when working with the standardized difference of two means if

  • Independence (extended). The data are independent within and between the two groups, e.g., the data come from independent random samples or from a randomized experiment.
  • Normality. Check for outliers in each group separately.

The standard error may be computed as

\[SE \;\;=\;\; \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \;\;\approx\;\; \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.\]

Degrees of freedom: \(\min(n_1, n_2)-1.\)

The margin of error for \(\bar{x}_1 - \bar{x}_2\) can be directly obtained from \(SE(\bar{x}_1 - \bar{x}_2).\)

\[ \text{Margin of error} \;\;=\;\; t^\star_{df} \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\]

\(t^\star_{df}\): calculated from percentile of t-distribution w/ \(df\) degrees of freedom.

Mathematical model for CI: stem-cell case study

Let’s compute a 95% CI for effect of ESC on change in heart pump capacity:

\[\begin{aligned} \bar{x}_{esc} - \bar{x}_{control} &= 7.83 \\ SE &= \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95 \end{aligned}\]

Degrees of freedom is \(\min\{9, 9\} - 1 = 8\).

Critical value of \(t^{\star}_{8} = 2.31\) for a 95% CI (the 97.5th percentile of the \(t\)-distribution with 8 df):

qt(0.975, 8)
[1] 2.306004

95% CI is then

\[ \begin{aligned} \text{point estimate} \ \pm\ t^{\star}_8 \times SE \;&=\; 7.83 \ \pm\ 2.31\times 1.95 \\ &\implies (3.32, 12.34) \end{aligned} \]

Conclude: we are 95% confident that heart pumping function in those who received ESC treatment is between 3.32% and 12.34% higher than for those who did not receive ESC.
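The same interval can be computed in R from the rounded summary statistics in Table 1, as sketched below (variable names are chosen here for illustration). Since the inputs are rounded and the code keeps full precision in the intermediate steps, the endpoints may differ from those above in the last digit.

```r
diff_hat <- 3.50 - (-4.33)                 # point estimate, 7.83
se       <- sqrt(5.17^2 / 9 + 2.76^2 / 9)  # standard error, about 1.95
t_star   <- qt(0.975, df = min(9, 9) - 1)  # critical value with 8 df, about 2.31

# point estimate +/- margin of error
ci <- diff_hat + c(-1, 1) * t_star * se
round(ci, 2)
```

For a different confidence level, only the percentile passed to `qt()` changes, e.g. `qt(0.95, df = 8)` for a 90% interval.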