library(tidyverse)
library(openintro)
library(infer)
library(knitr)
library(ggpubr)
library(kableExtra)
library(gghighlight)
library(patchwork)
options(pillar.print_min = 9) # to avoid annoying scroll behavior
knitr::opts_chunk$set(out.height = "100%")
theme_set(theme_bw())
Inference
We addressed questions about population parameters by estimating them using sample statistics. For example,
In the sex discrimination study, we asked if \(p_M - p_F > 0\) by estimating this proportion difference using \(\hat{p}_M - \hat{p}_F\).
In the college student savings study, we asked if \(p_T - p_C > 0\) by estimating this proportion difference using \(\hat{p}_T - \hat{p}_C\).
In the medical consultant study, we asked if \(p < 0.1\) by estimating this proportion using \(\hat{p}\).
We tried to reject \(H_0\) by seeing if the sample statistic was “unusual” under \(H_0\).
That is, we tried to see if sample-to-sample variability could have explained how far the test statistic was from the null value if \(H_0\) were true.
Inference with mathematical models
We estimated the variability of a statistic using computational techniques:
Randomization tests: the data were permuted assuming the null hypothesis.
Bootstrapping: the data were resampled in order to measure the variability.
We’ll now measure a statistic’s variability by using mathematical formulas derived from statistical models we impose on the data.
Next we will discuss what these statistical models are and why they seem justified to use.
Sampling distributions
A sampling distribution is the distribution of all possible values of a sample statistic, computed over all samples of a given size drawn from a given population.
Describes how much sample statistics vary from one sample to another.
E.g. mean GPA of 7 randomly sampled UC Davis data-science graduates.
Because a sampling distribution describes sample statistics computed from many studies, it cannot be visualized directly from a single dataset.
Can only be visualized using computational methods / mathematical models.
In previous examples, we ran 10,000 simulations under the null hypothesis \(H_0\).
Each simulation provided a version of the sample statistic under \(H_0\).
These 10,000 versions allowed us to estimate the sampling distribution of the sample statistic under \(H_0\), i.e., the null distribution.
We visualize our estimated null distributions below:
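As an illustration, here is a minimal sketch of how one of these null distributions can be simulated with infer. The data are a hypothetical reconstruction of the medical consultant study (3 complications out of 62 clients; treat these counts as assumed values for illustration):
# Hypothetical reconstruction of the medical consultant data:
# 3 complications out of 62 clients (counts assumed for illustration)
consultant <- tibble(outcome = c(rep("complication", 3), rep("no complication", 59)))
# Simulate the sampling distribution of the sample proportion under H0: p = 0.1
null_dist <- consultant |>
  specify(response = outcome, success = "complication") |>
  hypothesize(null = "point", p = 0.1) |>
  generate(reps = 10000, type = "draw") |>
  calculate(stat = "prop")
visualize(null_dist) # histogram of the 10,000 simulated proportions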
These distribution shapes look oddly similar to each other…
The central limit theorem and normal distributions
Central limit theorem: With a large enough number of independent observations from a population, the distribution of the sample proportion/mean increasingly resembles a normal distribution, which is a bell-shaped curve that looks like
\(\mu\) (Greek letter “mu”) is the distribution mean (a location parameter)
\(\sigma\) (Greek letter “sigma”) is the distribution “standard deviation” (a scale parameter)
Values of \(\mu\), \(\sigma\) may change from plot to plot, but the curve retains its bell shape. E.g.:
The sample proportion \(\hat p\) will look like a normal distribution centered at population proportion \(p\) provided that:
The observations in the sample are independent: samples are truly randomly sampled from a population.
Sample size is large enough: we generally need at least 10 observations in each class (e.g., at least 10 successes and 10 failures).
Same ideas hold for sample mean \(\bar x\): centered at population mean \(\mu\).
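A quick simulation sketch of the claim for \(\hat p\), with \(p = 0.3\) and \(n = 100\) chosen arbitrarily for illustration:
# Draw 10,000 sample proportions, each from a sample of size n = 100
# with true p = 0.3 (both values are arbitrary choices for illustration)
p_hats <- rbinom(10000, size = 100, prob = 0.3) / 100
hist(p_hats) # bell-shaped and centered near p = 0.3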
Normal distribution model
Symmetric, unimodal, bell-shaped. The area under the curve is always 1.
Exact values of center / spread can change.
Mean \(\mu\) shifts from left to right.
Standard deviation \(\sigma\) squishes or stretches the bell shape.
Figure 1: Two normal curves. The left curve has mean=0 and sd=1. The right curve has mean=19 and sd=4. Bottom: the two curves plotted together on the same scale.
We denote “normal distribution with mean \(k\) and standard deviation \(r\)” as \(N(\mu = k, \sigma = r)\).
Left curve has \(N(\mu=0, \sigma=1)\); right curve has \(N(\mu=19, \sigma=4)\).
We call \(N(\mu=0, \sigma=1)\) the standard normal distribution.
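For reference, the two curves in Figure 1 can be reproduced with ggplot2 (parameters taken from the caption):
# Overlay N(mu = 0, sigma = 1) and N(mu = 19, sigma = 4), as in Figure 1
ggplot(data.frame(x = c(-4, 35)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, args = list(mean = 19, sd = 4))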
z-score quantifies how “unusual”/“extreme” a quantity \(x\) is
How many standard deviations \(\sigma\) away from the mean \(\mu\) the quantity \(x\) is: \[Z := \frac{ x - \mu}{\sigma}.\]
z-score examples
If Ant scored 1800 on the SAT and Bug scored 24 on the ACT, who performed better?
SAT scores approximately follow a normal distribution with mean 1500 and standard deviation 300. \(\Longrightarrow\) z-score for Ant: \(\boxed{(1800 - 1500) / 300 = 1}\)
ACT scores approximately follow a normal distribution with mean 21 and standard deviation 5. \(\Longrightarrow\) z-score for Bug: \(\boxed{(24 - 21) / 5 = 0.6}\)
Ant has the larger z-score, so Ant performed better.
What score corresponds to a z-score of 2.5 in each of the SAT and ACT?
SAT: 1500 + 2.5 * 300 = 2250
ACT: 21 + 2.5 * 5 = 33.5
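The same arithmetic in R:
(1800 - 1500) / 300 # Ant's z-score: 1
(24 - 21) / 5       # Bug's z-score: 0.6
1500 + 2.5 * 300    # SAT score at z = 2.5: 2250
21 + 2.5 * 5        # ACT score at z = 2.5: 33.5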
Normal probability calculations: percentile and quantile
Ant scored 1800 on the SAT. What is the percentile of this score?
openintro::normTail(m = 1500, s = 300, L = 1800, cex.lab = 1.5, cex.axis = 1.5)
Figure 2: Proportion of people who scored below Ant is equal to the area of the shaded region, which we can calculate to be 0.8413447 using R. Hence Ant scored in the 84th percentile.
pnorm() gives the percentile associated with a cutoff on the normal curve.
pnorm(1800, mean = 1500, sd = 300) # percentile
[1] 0.8413447
# Can also calculate percentiles as a function of the z-score.
# From the previous slide, this corresponds to a z-score of 1.
pnorm(1, mean = 0, sd = 1) # percentile
[1] 0.8413447
normTail(m = 0, s = 1, L = 0.8413447)
We can also do the reverse: identify the SAT score or z-score associated with a given percentile.
Either approach: Dog did better than ~37% of SAT takers.
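For instance, assuming Dog scored 1400 on the SAT (a hypothetical score consistent with the ~37% figure above):
pnorm(1400, mean = 1500, sd = 300)  # percentile: ~0.369
(1400 - 1500) / 300                 # z-score: -1/3
qnorm(0.369, mean = 1500, sd = 300) # percentile back to score: ~1400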
Ex: Height
Suppose the height of men is approximately normal with mean 70” and s.d. 3.3”.
If Ewe is 5’7” (67”), Fox is 6’4” (76”), what percentile of men are their heights? Draw a picture and use pnorm() to calculate the percentage.
normTail(70, 3.3, L = 67, cex.lab = 2, cex.axis = 2)
Ewe has height of 67 inches.
pnorm(67, mean = 70, sd = 3.3)
[1] 0.1816511
normTail(70, 3.3, L = 76, cex.lab = 2, cex.axis = 2)
Fox has height of 76 inches.
pnorm(76, mean = 70, sd = 3.3)
[1] 0.9654818
Ex: Height Percentiles
Let’s now calculate the 40th percentile for height.
Mean: 70”, s.d.: 3.3”
Always draw a picture first:
normTail(70, 3.3, L = qnorm(0.4, 70, 3.3), col = IMSCOL["blue", "full"])
text(67, 0.03, "40%\n(0.40)", cex = 1, col = IMSCOL["black", "full"])
z-score associated with 40th percentile:
qnorm(0.4, mean = 0, sd = 1)
[1] -0.2533471
With the z-score, mean, and s.d., we can calculate the height:
\[-0.253 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3}\]
\[\implies x - 70 = 3.3 \times (-0.253)\]
\[\implies x = 70 - 3.3 \times 0.253 \approx 69.16\]
69.16” is approximately 5’9”
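We can check this directly with qnorm():
qnorm(0.4, mean = 70, sd = 3.3) # ~69.16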
Quantifying the variability of statistics
Most statistics we have seen (sample proportion, sample mean, difference in two sample proportions/means) are approximately normal when the observations are independent and the sample size is large enough.
It is thus very useful to have an intuition for how much variability there is within a few standard deviations of the mean in a normal distribution.
68 - 95 - 99.7 rule: pictorially,
68% of the data lies within 1 s.d. of the mean
95% lies within 2 s.d.’s
99.7% lies within 3 s.d.’s
Only 0.3% lies more than 3 s.d.’s away from the mean.
Hint: how many standard deviations away from the mean does a z-score represent?
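We can verify the rule with pnorm() (standard normal by default):
pnorm(1) - pnorm(-1) # ~0.683
pnorm(2) - pnorm(-2) # ~0.954
pnorm(3) - pnorm(-3) # ~0.997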
Standard error
Point estimates (e.g., sample proportions/means) vary from sample to sample.
We quantify this variability by the standard error: the standard deviation associated with the statistic.
Actual variability of the statistic in the population is unknown – we use data to estimate the standard error (just as we use data to estimate the unknown population proportion/mean)
We typically estimate the standard error using the central limit theorem.
You may also see the term “margin of error”: it describes how far away observations are from the mean. For example:
68% of the observations are within one margin of error of the mean.
95% of the observations are within two margins of error of the mean.
99.7% of the observations are within three margins of error of the mean.
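As a preview of the formulas to come, the standard error of a single sample proportion is estimated by \(\sqrt{\hat p (1 - \hat p)/n}\); a quick sketch with made-up counts:
# SE of a sample proportion (made-up numbers: 45 successes in n = 100)
p_hat <- 45 / 100
n <- 100
sqrt(p_hat * (1 - p_hat) / n) # ~0.0497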
Opportunity-cost case study
Students were reminded that they could save $15 by not buying a video game right now.
We estimated the difference in proportions to be 0.20.
We then used a randomization test to examine the variability in the difference of proportions under the null hypothesis: the decision to buy does NOT depend on being presented with the reminder.
z-score in a hypothesis test is given by substituting the standard error for the standard deviation: \[Z = \frac{\text{observed difference} - \text{null value}}{SE}\]
Here: assume we know the SE is 0.078 (methods to calculate: to come)
\[Z = \frac{0.20 - 0}{0.078} \approx 2.56\]
1 - pnorm(0.2 / 0.078, mean = 0, sd = 1) # p-value
[1] 0.005172149
Drawbacks of the central limit theorem / normal approximation
Under certain conditions, we are guaranteed that the sample mean and sample proportion will be approximately normal, with appropriate mean \(\mu\) and s.d. \(\sigma\)
We require:
Independent samples (no correlation between them!)
Large sample size (at least 10 observations per class for proportions)
We sometimes cannot be certain samples are independent
The normal approximation assigns a positive chance for EVERY value to occur - this is not ideal for variables with constraints (e.g., time is always >= 0); see the sketch after this list.
Developing statistics that incorporate constraints on the variables is more challenging and requires mathematical work.
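For example, modeling a waiting time as normal (hypothetical values below) puts positive probability on impossible negative times:
# Hypothetical: waiting time modeled as N(mu = 5, sigma = 3) minutes
pnorm(0, mean = 5, sd = 3) # P(time < 0) is about 0.048, yet time >= 0 always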