class: center, middle, inverse, title-slide .title[ # Inference with mathematical models ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Akira Horiguchi
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text ]

---

### Inference with mathematical models

(Based on Ch 13 of IMS)

.pull-left[
Previously we addressed questions about **population parameters** by estimating them using **sample statistics**. For example,

- In the sex discrimination study, we asked if `\(p_M - p_F > 0\)` by estimating this proportion difference using `\(\hat{p}_M - \hat{p}_F\)`.
- In the college student savings study, we asked if `\(p_T - p_C > 0\)` by estimating this proportion difference using `\(\hat{p}_T - \hat{p}_C\)`.
- In the medical consultant study, we asked if `\(p < 0.1\)` by estimating this proportion using `\(\hat{p}\)`.

We tried to reject the null hypothesis `\(H_0\)` by seeing if the sample statistic was "unusual" under `\(H_0\)`.

- That is, we tried to see if **sample-to-sample variability** could have explained how far the test statistic was from the null value if `\(H_0\)` were true.
]

.pull-right[
So far, we estimated the variability of a statistic using **computational techniques**:

* With randomization tests, the data were permuted assuming the null hypothesis.
* With bootstrapping, the data were resampled in order to measure the variability.

We'll now measure a statistic's variability by using **mathematical formulas** derived from statistical models we impose on the data.

- Next we will discuss what these statistical models are and why their use is justified.
]

---

.pull-left[
### Sampling distributions

In both randomization and bootstrapping, we formed **sampling distributions**.

* A **sampling distribution** is the distribution of all possible values of a sample statistic computed from samples of a given size drawn from a given population.
* It describes how much sample statistics vary from one sample to another.

In contrast, a **data distribution** shows the variability of the observed data values.

* The data distribution can be visualized from the observations themselves. Because a sampling distribution describes sample statistics computed from many studies, it cannot be visualized directly from a single dataset.
* It can only be visualized using computational methods / mathematical models.
]

.pull-right[
In previous examples, we ran 10,000 simulations under the null hypothesis.

* Each simulation provided a version of the sample statistic under the null hypothesis.
* These 10,000 versions allow us to estimate the *sampling distribution* of the sample statistic under the null hypothesis, i.e. the **null distribution**.
* We visualize our estimated null distributions below.

<img src="lec16_files/figure-html/unnamed-chunk-2-1.png" width="720" />

* These distribution shapes look oddly similar to each other...
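To build intuition for where these bell shapes come from, here is a minimal sketch of a randomization test in base R; the 0/1 outcomes and group labels below are made up for illustration, not taken from any of the studies:

``` r
# Shuffle the group labels many times, recording the difference in
# sample proportions each time; these form an estimated null distribution.
set.seed(1)
outcome <- c(rep(1, 45), rep(0, 55))            # hypothetical 0/1 outcomes
group   <- rep(c("treatment", "control"), 50)   # hypothetical group labels
null_diffs <- replicate(10000, {
  shuffled <- sample(group)                     # permute labels under H0
  mean(outcome[shuffled == "treatment"]) -
    mean(outcome[shuffled == "control"])
})
hist(null_diffs)   # bell-shaped and centered at 0
```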
]

---

.pull-left[
#### The central limit theorem and normal distributions

The "central limit theorem": with enough *independent* observations from a population, the sampling distribution of the sample proportion/mean will increasingly resemble a **normal distribution**: a bell-shaped curve that looks like

<img src="lec16_files/figure-html/fig-er6895997-1.png" width="432" />

* `\(\mu\)` is the distribution mean (a location parameter)
* `\(\sigma\)` is the distribution standard deviation (a scale parameter)
]

.pull-right[
Values of `\(\mu\)`, `\(\sigma\)` may change from plot to plot, but the shape remains the same. For example:

* The sample proportion `\(\hat p\)` will look like a normal distribution centered at the population proportion `\(p\)` provided:
  - The observations in the sample are *independent*: the sample is truly randomly drawn from the population
  - The sample size is *large enough*: we can make this exact, but generally we need `\(\geq 10\)` observations of each outcome in each class (e.g., at least 10 successes and 10 failures)
* The same ideas hold for the sample mean `\(\bar x\)`: centered at the population mean `\(\mu\)`
]

---

.pull-left[
#### Normal distribution model

The normal distribution curve follows the formula

`$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$`

* Symmetric, unimodal, bell-shaped.
* Exact values of center / spread can change: changing the mean `\(\mu\)` shifts the curve left or right, while changing the standard deviation `\(\sigma\)` stretches or squishes it.
* The area under the curve always integrates to 1.

Two normal curves. The left curve has `\(\mu=0\)` and `\(\sigma=1\)`. The right curve has `\(\mu=19\)` and `\(\sigma=4\)`.

<img src="lec16_files/figure-html/fig-twoSampleNormals-1.png" width="100%" />

The two curves plotted together on the same scale.

<img src="lec16_files/figure-html/fig-twoSampleNormalsStacked-1.png" width="100%" />
]

.pull-right[
We denote "normal distribution with mean `\(k\)` and standard deviation `\(r\)`" as `\(N(\mu = k, \sigma = r)\)`.

- If `\(k=0\)`, `\(r=1\)`, i.e. `\(N(\mu=0, \sigma=1)\)`, we call this the **standard normal distribution**
- The left curve is `\(N(\mu=0, \sigma=1)\)`; the right curve is `\(N(\mu=19, \sigma=4)\)`.

To quantify how "unusual" or "extreme" a value `\(x\)` is, look at how many standard deviations `\(\sigma\)` it lies from the mean `\(\mu\)`.

* This is called the `\(Z\)` score:
$$ Z := \frac{ x - \mu}{\sigma}. $$

E.g. which is more extreme:

1. a value of 27 under `\(N(\mu=19, \sigma=4)\)`, or
2. a value of 3 under `\(N(\mu=0, \sigma=1)\)`?

- Z-score of `\(\frac{27-19}{4} = 2\)` vs Z-score of `\(\frac{3-0}{1} = 3\)`.
- The second value is more extreme.
]

---

.pull-left[
#### Z-score examples

SAT scores are approximately normally distributed with mean 1500 and standard deviation 300; ACT scores are approximately normal with mean 21 and standard deviation 5.

* If Ant scored 1800 on the SAT and Bug scored 24 on the ACT, who performed better?
  - Z-score for Ant: (1800 - 1500) / 300 = 1
  - Z-score for Bug: (24 - 21) / 5 = 0.6
  - Ant performed better
* What score corresponds to a z-score of 2.5 on each of the SAT and ACT?
  - SAT: 1500 + 2.5 * 300 = 2250
  - ACT: 21 + 2.5 * 5 = 33.5
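A quick check of this arithmetic in R:

``` r
(1800 - 1500) / 300   # Ant's Z-score: 1
(24 - 21) / 5         # Bug's Z-score: 0.6
# reverse direction: x = mu + Z * sigma
1500 + 2.5 * 300      # SAT score with Z = 2.5: 2250
21 + 2.5 * 5          # ACT score with Z = 2.5: 33.5
```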
]

.pull-right[
#### Normal probability calculations

Suppose that Cat scored 1800 on the SAT. What is the **percentile** of this score?

<img src="lec16_files/figure-html/fig-satBelow1800-1.png" width="85%" />

* The total area under any normal curve is equal to 1.
* The proportion of people who scored below Cat is equal to the area of the shaded region, which we can calculate to be 0.8413 using a computer.
* R and other statistical software can calculate probabilities/percentiles as a function of the z-score.
* We know this score corresponds to a Z-score of 1.
]

---

.pull-left[
#### Normal probability calculations in R

* The `pnorm()` function in base R provides the percentile associated with a cutoff in the normal curve.
  - E.g. Cat got 1800 on the SAT, which has mean 1500 and sd 300:

``` r
pnorm(1800, mean = 1500, sd = 300)
#> [1] 0.8413447
```

* The `normTail()` function in **openintro** draws the associated normal curve:

``` r
normTail(m = 1500, s = 300, L = 1800)
```

<img src="lec16_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
We can also do the reverse calculation: identify the value associated with a percentile.

* `qnorm()`: identifies the **quantile** for a given percentage

``` r
qnorm(0.841, mean = 1500, sd = 300)
#> [1] 1799.573
```

``` r
normTail(m = 0, s = 1, L = qnorm(0.841))
```

<img src="lec16_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

* **quantile** and **percentile** are inverse operations:

``` r
qnorm(pnorm(3, mean = 0, sd = 1), mean = 0, sd = 1)
#> [1] 3
pnorm(qnorm(0.99, mean = 5, sd = 3), mean = 5, sd = 3)
#> [1] 0.99
```
]

---

.pull-left[
#### Normal probability calculations

Again consider SAT scores: normal, mean `\(\mu=1500\)`, s.d. `\(\sigma=300\)`.

* Suppose we take a random SAT taker. What is the probability they score `\(\geq 1630\)` on the SAT?
* Draw the normal curve and visualize the problem:

<img src="lec16_files/figure-html/fig-subtractingArea-1.png" width="432" />
]

.pull-right[
* Calculate the Z score of 1630:
$$ Z = \frac{x - \mu}{\sigma} = \frac{1630 - 1500}{300} = \frac{130}{300} = 0.433.$$
* Then we want to calculate the **p**ercentile:

``` r
pnorm(130/300, mean = 0, sd = 1)
#> [1] 0.6676137
pnorm(1630, mean = 1500, sd = 300)
#> [1] 0.6676137
```

* Thus a proportion of 0.668 of people have a Z score **lower** than 0.433.
* To compute the area **above**, take one minus this (total area = 1):
  - `\(1-0.668 = 0.332\)`.
  - The probability of scoring at least 1630 is 0.332, or 33.2%.
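Equivalently, `pnorm()` can compute the upper tail directly via its `lower.tail` argument:

``` r
# area above the cutoff, without subtracting from 1 by hand
pnorm(1630, mean = 1500, sd = 300, lower.tail = FALSE)
#> [1] 0.3323863
```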
]

---

.pull-left[
Suppose Dog scored 1400 on the SAT. What percentile is this?

* Draw a picture:

<img src="lec16_files/figure-html/fig-edward-percentile-1.png" width="60%" />

* Calculate the Z score:
$$ Z = \frac{x - \mu}{\sigma} = \frac{1400 - 1500}{300} = \frac{-100}{300} = -0.333.$$
* `pnorm()` returns the percentile

``` r
pnorm(-100/300, mean = 0, sd = 1)
#> [1] 0.3694413
pnorm(1400, mean = 1500, sd = 300)
#> [1] 0.3694413
```

* Dog did better than ~37% of SAT takers
]

.pull-right[
Suppose the height of men is approximately normal with average 70" and standard deviation 3.3".

* If Ewe is 5'7" and Fox is 6'4", what percentiles of men's heights are these? Draw a picture and use `pnorm()` to calculate the percentages.
  - 5'7" = 67"; 6'4" = 76"

<img src="lec16_files/figure-html/fig-kamron-adrian-percentiles-1.png" width="50%" /><img src="lec16_files/figure-html/fig-kamron-adrian-percentiles-2.png" width="50%" />

``` r
pnorm(67, mean = 70, sd = 3.3)
#> [1] 0.1816511
pnorm(76, mean = 70, sd = 3.3)
#> [1] 0.9654818
```
]

---

.pull-left[
* Let's now calculate the 40th percentile for height
* Mean: 70", s.d.: 3.3"
* Always draw a picture first:

``` r
normTail(70, 3.3, L = qnorm(0.4, 70, 3.3), col = IMSCOL["blue", "full"])
text(67, 0.03, "40%\n(0.40)", cex = 1, col = IMSCOL["black", "full"])
```

<img src="lec16_files/figure-html/unnamed-chunk-18-1.png" width="100%" />

* Z score associated with the 40th percentile:

``` r
qnorm(0.4, mean = 0, sd = 1)
#> [1] -0.2533471
```
]

.pull-right[
* With the Z-score, mean, and s.d., we can calculate the height:
$$-0.253 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3} $$
$$ \implies x - 70 = 3.3 \times (-0.253) $$
$$ \implies x = 70 - 3.3\times 0.253 \approx 69.16 $$
* 69.16" is approximately 5'9"
]

---

### Quantifying the variability of statistics

.pull-left[
Most of the statistics we have seen (sample proportion, sample mean, difference in two sample proportions/means, slope of a linear model fit to data) are **approximately normal** when the observations are independent and there are enough of them.

* It is thus very useful to have an intuition for how much variability there is within a few standard deviations of the mean in a normal distribution.
* **68 - 95 - 99.7 rule**: pictorially,

<img src="lec16_files/figure-html/unnamed-chunk-20-1.png" width="100%" />
]

.pull-right[
* 68% of the data lies within 1 s.d. of the mean
* 95% lies within 2 s.d.'s
* 99.7% lies within 3 s.d.'s
* Only 0.3% lies more than 3 s.d.'s away from the mean.
* We can confirm this with `pnorm()` in R:

``` r
pnorm(1, mean=0, sd=1) - pnorm(-1, mean=0, sd=1)
#> [1] 0.6826895
pnorm(2, mean=0, sd=1) - pnorm(-2, mean=0, sd=1)
#> [1] 0.9544997
pnorm(3, mean=0, sd=1) - pnorm(-3, mean=0, sd=1)
#> [1] 0.9973002
2 * pnorm(-3, mean=0, sd=1)
#> [1] 0.002699796
```
]

---

.pull-left[
#### Z score tables

In addition to using `pnorm()`, people use "Z score tables"

* It's important to be able to read and interpret these - imagine if all computers disappeared :)

Example: what is the percentile corresponding to a Z score of -2.33?

- Draw! (see chalkboard)
- Look for -2.3 on the table
- Go to the column for 0.03
- Read out the value: 0.0099
- 0.99%
- Sanity check with the 68 / 95 / 99.7 rule:
  * 5% of data is more than 2 s.d.'s from the mean; 2.5% is below -2 s.d.'s (Z of -2)
  * 0.3% of data is more than 3 s.d.'s from the mean; 0.15% is below -3 s.d.'s (Z of -3)
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

.pull-left[
#### Z score tables

Example: What is the percentile corresponding to a Z score of 1.37?

- Draw! (see chalkboard)
- By symmetry, the percentile for a Z score of 1.37 is one minus the percentile for a Z score of -1.37 (draw!)
- Look for -1.3 on the table
- Go to the column for 0.07
- Read out the value: 0.0853
- 8.53% percentile for Z = -1.37
- 1 - 0.0853 = 0.9147 --> 91.47th percentile for Z = 1.37
- Sanity check with the 68 / 95 / 99.7 rule:
  - 68% of data is within 1 s.d. of the mean; 34% is between the mean and 1 s.d. above the mean; thus 84% (=50+34) is less than 1 s.d. above the mean.
  * 95% of data is within 2 s.d. of the mean; 47.5% is between the mean and 2 s.d. above the mean; thus 97.5% (=50+47.5) is less than 2 s.d. above the mean.
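We can sanity-check these table lookups with `pnorm()`:

``` r
pnorm(-2.33)       # ~0.0099, matching the table
pnorm(-1.37)       # ~0.0853, matching the table
1 - pnorm(-1.37)   # ~0.9147: percentile for Z = 1.37, by symmetry
```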
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

.pull-left[
#### Standard error

* Whenever we estimate proportions/means, these estimates vary from sample to sample
* This variability is quantified using the **standard error**: the standard deviation of the statistic
* The actual variability of the statistic is unknown -- we use data to estimate the standard error (just as we use data to estimate the unknown population proportion/mean)
* We typically estimate the standard error using the "central limit theorem" - we will see how to calculate this in future lectures
]

.pull-right[
#### Margin of error

You may also see the term "margin of error": it describes how far away sample statistics typically fall from the parameter they estimate.

* Margin of error = `\(z^* \times \mathsf{StdError}\)`, where `\(z^*\)` is a cutoff value from the normal distribution.
* E.g. if we take `\(z^* = 1.96 \approx 2\)`, then the margin of error corresponds to the variability associated with 95% of sampled statistics.
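A small numeric sketch (the standard error value below is hypothetical, just for illustration):

``` r
se <- 0.1                # hypothetical standard error of some statistic
z_star <- qnorm(0.975)   # cutoff capturing the middle 95%: ~1.96
z_star * se              # margin of error: ~0.196
```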
]

---

.pull-left[
#### Opportunity cost case study

Recall the study where students were reminded that they could save $15 if they didn't spend it on a video game right now.

* We estimated that the difference in proportions (buy vs not buy when given the reminder) was 0.20.
* We then used a randomization test to look at the variability in the difference of proportions under the *null hypothesis*: the difference in proportions does NOT depend on being presented with the reminder

<img src="lec16_files/figure-html/fig-OpportunityCostDiffs-w-normal-1.png" width="432" />
]

.pull-right[
<img src="lec16_files/figure-html/OpportunityCostDiffs_normal_only-1.png" width="90%" style="display: block; margin: auto;" />

The Z-score in a hypothesis test is given by substituting the standard error for the standard deviation:

`$$Z = \frac{\text{observed difference} - \text{null value}}{SE}$$`

* Here: let's assume we know the standard error is 0.078 (methods to calculate it: to come)

`$$Z = \frac{0.20 - 0}{0.078} = 2.56$$`

``` r
1 - pnorm(0.2/0.078, mean = 0, sd = 1)
#> [1] 0.005172149
```
]

---

.pull-left[
#### Z score tables

What is the percentile corresponding to a Z score of 2.56? Draw! (see chalkboard)

- By symmetry, the percentile for a Z score of 2.56 is one minus the percentile for a Z score of -2.56 (draw!)
- Look for -2.5 on the table
- Go to the column for 0.06
- Read out the value: 0.0052
- 0.52% percentile for Z = -2.56
- 1 - 0.0052 = 0.9948 --> 99.48th percentile for Z = 2.56
- Sanity check with the 68 / 95 / 99.7 rule:
  * 95% of data is within 2 s.d. of the mean; 47.5% is between the mean and 2 s.d. above the mean; thus 97.5% (=50+47.5) is less than 2 s.d. above the mean.
  * 99.7% of data is within 3 s.d. of the mean; 49.85% is between the mean and 3 s.d. above the mean; thus 99.85% (=50+49.85) is less than 3 s.d. above the mean.
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

#### Drawbacks of the central limit theorem / normal approximation

* Under certain conditions, we can be guaranteed that the sample mean and sample proportion will be approximately normal, with appropriate mean `\(\mu\)` and s.d. `\(\sigma\)`
* We require:
  - Independent samples (no correlation between them!)
  - A large number of samples (`\(\geq 10\)` per treatment group for proportions)
* We sometimes cannot be certain samples are independent
* The normal approximation assigns positive probability to EVERY value - this is not ideal for variables with constraints (e.g., time is always `\(\geq 0\)`)
* Developing statistics which incorporate *constraints* on the variables is more challenging and requires mathematical work

---

<img src="summary.png" width="60%" />