Inference for two-way tables

STA35B: Statistical Data Science 2

Akira Horiguchi

Based on Ch 18 of IMS

library(tidyverse)
library(openintro)
library(infer)

library(knitr)
library(ggpubr)
library(kableExtra)
library(gghighlight)

library(scales) # label_dollar

options(pillar.print_min = 9)  # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw() + theme(axis.text = element_text(size = 14), 
                             axis.title = element_text(size = 16), 
                             ))

The Main Example: iPod Disclosure Experiment

Study: Researchers recruited 219 participants to sell a used iPod that was known to have frozen twice in the past.

  • The participants were incentivized to get as much money as they could for the iPod since they would receive a 5% cut of the sale.

Unbeknownst to the 219 sellers, the buyers were collaborating with the researchers to evaluate the influence of different questions on the likelihood of getting the sellers to disclose the past issues with the iPod.

  • The scripted buyers started their interaction by asking one of three questions:
    • General: What can you tell me about it?
    • Positive Assumption: It does not have any problems, does it?
    • Negative Assumption: What problems does it have?
  • The seller then responds by either:
    • Disclosing the problem of iPod freezing
    • Hiding the problem

Research question: Does the wording of a question affect whether sellers disclose that an iPod has a freezing problem?

Study design:

Question type Disclose Hide Total
General \(a\) \(d\) 73
Positive assumption \(b\) \(e\) 73
Negative assumption \(c\) \(f\) 73
Total \(a+b+c\) \(d+e+f\) 219
  • Numbers \(a,b,c,d,e,f\) are the study’s outcomes.
  • Note about row sums: it must be that \(a+d=b+e=c+f=73\).

Core question (more generally)

Consider two categorical variables, each possibly taking more than two values.

  • Core: are the two variables independent? Or is there an association?

In Ch 17 (comparing two proportions), we could summarize the data with one difference in proportions. That breaks down when:

  • the explanatory variable has 3 or more groups
  • the response variable has 3 or more categories
  • there is no single obvious “success” category to compare across just two groups

Ch 18 extends hypothesis testing to two-way tables, using the chi-squared statistic and the chi-squared distribution.

iPod example: data

Question type Disclose Hide Total
General 2 71 73
Positive assumption 23 50 73
Negative assumption 36 37 73
Total 61 158 219

The rates look very different across rows, but we still need to ask whether those differences could be due to chance.

  • Because the study was an experiment, causal language is justified if evidence is strong enough.

Hypotheses for a Two-Way Table

Null hypothesis H_0

The two variables are independent.

Here: the question type does not affect whether sellers disclose the freezing problem.

Alternative hypothesis H_A

The two variables are associated.

Here: disclosure behavior does depend on the question type.

Expected Counts Under Independence

Big idea: If the variables were independent, each row should have about the same column proportions as the overall table.

  • Hence the expected count for row \(i\) and column \(j\) under independence is:

\[ E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{\text{table total}} \]

  • Worked example: For row 1 / column 1 in the iPod table:

\[ E_{11} = \frac{73 \times 61}{219} = 20.33 \]

  • So under H_0, we would expect about 20.33 disclosures in the General row.

Expected Count Table

Question type Disclose expected Hide expected Row total
General 20.33 52.67 73
Positive assumption 20.33 52.67 73
Negative assumption 20.33 52.67 73
Total 61 158 219

What to notice

  • All three rows have the same expected counts because the row totals are all 73.
  • The observed counts deviate a lot from these expectations.
  • But we still need a single statistic that summarizes how far the whole table is from independence.

The Chi-Squared Statistic

Sum is over rows (indexed by \(i\)) and columns (indexed by \(j\)) \[\begin{align*} \chi^2 = \sum_{i,j} \frac{(\text{observed}_{i,j} - \text{expected}_{i,j})^2}{\text{expected}_{i,j}} = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \end{align*}\]

Calculation for iPod example

Observed counts

Question type Disclose Hide Total
General 2 71 73
Positive 23 50 73
Negative 36 37 73
Total 61 158 219

Expected counts

Question type Disclose Hide Total
General 20.33 52.67 73
Positive 20.33 52.67 73
Negative 20.33 52.67 73
Total 61 158 219

Intermediate step: compute summand for each cell

Question type Disclose Hide
General \(\frac{(2-20.33)^2}{20.33}\) \(\frac{(71-52.67)^2}{52.67}\)
Positive \(\frac{(23-20.33)^2}{20.33}\) \(\frac{(50-52.67)^2}{52.67}\)
Negative \(\frac{(36-20.33)^2}{20.33}\) \(\frac{(37-52.67)^2}{52.67}\)

Adding up these six fractions, we get statistic value of \[\chi^2 = 40.13.\]

How to interpret “40.13”

General interpretation of chi-squared statistic:

  • Large values mean the observed table is far from the expected table.
  • Small values mean the observed and expected counts are relatively close.

What Does “40.13” Mean? By itself, not much.

  • Just like a difference in proportions needs a null distribution, a chi-squared statistic also needs a benchmark.

The inferential question: If H_0 were true, how often would random chance alone generate a chi-squared statistic at least as large as 40.13? Can answer with

  1. a randomization test
  2. a mathematical model using the chi-squared distribution

Randomization Test of Independence

\(H_0\): Sellers would disclose or hide the problem regardless of question type. Then our randomization procedure will:

  • keep the 61 disclosures and 158 hidden responses fixed
  • randomly reassign them to the three question groups
  • recompute the chi-squared statistic

One randomized table:

Question type Disclose Hide Total
General 29 44 73
Positive assumption 15 58 73
Negative assumption 17 56 73
Total 61 158 219

The chi-squared statistic for this table is [use formula]

Randomization Test: iPod example

A histogram of chi-squared statisics from 1,000 simulations produced under the null hypothesis, where the question is independent of the response. The observed statistic of 40.13 is marked by the red line. None of the 1,000 simulations had a chi-squared value of at least 40.13.

Conclusion from the randomization test: The simulated p-value is essentially 0.

Plain-language interpretation: If question wording truly had no effect, the observed table would be extraordinarily unlikely.

So the data provide very strong evidence against independence.

Randomization Test: Interpretation

It can tell us

  • whether the two variables appear to be associated
  • whether the observed departure from independence is too large to attribute to chance alone

It cannot fully tell us

  • a single directional effect the way a one-number difference in proportions can
  • which exact cell explains the whole story by itself
  • whether the association is practically important without context

Mathematical Model: Chi-Squared Distribution

Randomization is conceptually powerful, but a mathematical approximation is often faster and more standardized.

  • Consider the following conditions:
    • Independent observations
    • Large samples: expected count of at least 5 in each cell
  • When the null hypothesis is true and conditions above are met,
    the chi-squared statistic follows a chi-squared distribution with \[df = (R - 1)(C - 1)\] degrees of freedom (for a two-way table with \(R\) rows and \(C\) columns)
    • \(df\) determines shape of the chi-squared distribution: \[\frac{1}{2^{k/2} \Gamma(k/2)} x^{(k/2)-1} e^{-k/2}\qquad\text{for }k \text{ degrees of freedom}.\]

The chi-squared distribution for differing degrees of freedom. The larger the degrees of freedom, the longer the right tail extends. The smaller the degrees of freedom, the more peaked the mode on the left becomes.
  • A chi-squared distribution with \(k\) degrees of freedom can be shown to have \[\text{Mean }= k, \qquad \text{Median } \approx k\left(1-\frac{2}{9k}\right)^3.\]
    • Takeaway: mean and median increase with \(k\).
    • (Makes sense in context of two-way tables: the number of rows/columns in a table directly corresponds to the number of summands in the corresponding chi-squared statistic.)

Mathematical Model: iPod example

Conditions: The study used random assignment, and all expected counts exceed 5, so the chi-squared approximation is appropriate.

Degrees of freedom: There are 3 rows and 2 columns, so:

\[ df = (3 - 1)(2 - 1) = 2 \]

With \(\chi^2 = 40.13\) and \(df = 2\), we get p-value

1 - pchisq(40.13, df = 2)
[1] 1.93144e-09

This p-value is essentially zero (same conclusion as in the randomization test).

Student Check

Determine if the statements below are true or false. For each false statement, suggest an alternative wording to make it a true statement.

  1. The Chi-square statistic is always greater than or equal to 0.
  2. As the degrees of freedom increases, the mean of the Chi-square distribution increases.
  3. If you found \(\chi^2 = 5\) with \(df=5\) you would fail to reject \(H_0\) at the 5% discernibility level.
  4. When finding the p-value of a Chi-square test, we always shade the tail areas in both tails.
  5. As the degrees of freedom increases, the variability of the Chi-square distribution decreases.