Inference for a single proportion

class: center, middle, inverse, title-slide

.title[
# Inference for a single proportion
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### A more thorough treatment of inference for proportions
In this setting, for each observation there is only a single (categorical) variable taking one of two values measuring success or failure

* e.g. "surgical complication" or "no complication"
* Since there's only a single variable, no way to do a randomization test, so we resort to bootstrapping and mathematical models

]

.pull-right[
Let's return to the medical consultant example: 
* A consultant tries to attract patients by saying although US average complication rate for surgeries is 10%, only 3 of her 62 clients (4.8%) received them
* Not a randomized trial, so no way to actually assess whether her actions cause lower complication rate (she could have selectively chosen healthy patients)
* However, we can assess whether `$\hat p = 0.048$` would occur due to random chance given population average of `$p_0 = 0.10$`. 
* Can formulate this as a hypothesis test:
 - `$H_0$`: no association between consultant contributions and complication rate; `$p=0.10$`
 - `$H_A$`: patients with consultant associated with lower complication rate; `$p<0.10$` 
* We'll estimate a "p-value": if the null hypothesis is true, what is the probability of observing a test statistic `$(\hat p)$` that is as extreme as the one we saw?
]

---

.pull-left[

### Sampling under the null hypothesis

What is the *sampling distribution* of the test statistic `$\hat p$` (proportion of complications) if `$H_0$` is true?

* Dataset: 3 of 62 donors had complications
* Under `$H_0$`, 10% of donors have complications
* Now we want to simulate additional datasets of size 62, where with probability 10%, the donor has a complication.
* Each simulated dataset will produce a proportion `$\hat{p}_{sim} = (\# \text{ complications}) / 62$`

Histogram will show the distribution of these simulated proportions for 10000 simulated datasets.

]

.pull-right[

<img src="lec18_files/figure-html/fig-nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-1.png" width="100%" />


* Shaded = sample proportions under null that are as extreme as observed `$\hat{p} = 3/62 \approx 0.0484$`. 
* 1170 simulated sample proportions were `$\leq 3/62$`
* We use these to construct the null distribution's left-tail area and find the p-value:

`$$\text{left area} = \frac{\text{# sims w/ }\hat{p}_{sim} \leq \text{ (3/62)}}{\text{total # sims}}$$`

* Estimated p-value is equal to this tail area: 0.117.
]

---
.pull-left[
How do we do this in R?

* For each bootstrap sample of size `$n=62$`, we calculated the **proportion of succcesses**, where probability of a complication is `$p$` for each donor:
$$ \hat p_{sim} = \frac{\text{\# complications}}{\text{\# donors}} = \frac{\text{\# complications}}{62}. $$
* `$\#$` complications follows a **binomial distribution** with parameters `$n=62$` and `$p=0.1$`. This is denoted as **Binomial(n,p)**.

More generally, **Binomial(n,p)** models the number of successes in `$n$` independent trials when each trial has probability `$p$` of success.
* To create the 10,000 bootstrap proportions, use: 
`rbinom(n_samples, n_trials, p_success)`
    * This generates a vector of length `n_samples`, where each component is the outcome of `n_trials` where probability of success in each trial is `p_success`

]

.pull-right[
* Some examples of `rbinom()`:

``` r
rbinom(10, 5, 0.3)
#>  [1] 3 0 0 3 0 1 2 0 2 1
rbinom(10, 5, 0.3)
#>  [1] 2 0 2 1 1 2 2 0 4 3
rbinom(10, 5, 0.3)
#>  [1] 1 0 1 0 1 3 1 0 3 4
```
* To convert number of successes to proportion, need to divide by number of trials

``` r
n_trials <- 62
p_success <- 0.1
n_bootstrap_samples <- 10000
n_successes <- rbinom(
 n_bootstrap_samples,
 n_trials,
 p_success
)
 
medical_consultant_sim_dist <- tibble(
 stat = n_successes / n_trials)
```

]

---

.pull-left[
### Mathematical model

The sampling distribution for `$\hat{p}$` based on a sample of size `$n$` from a population with a true proportion `$p$` is nearly normal when:

1. The sample's observations are independent, e.g., are from a simple random sample.
1. We expected to see at least 10 successes and 10 failures in the sample, i.e., `$np\geq10$` and `$n(1-p)\geq10.$` This is called the **success-failure condition**.

When these conditions are met, then the sampling distribution of `$\hat{p}$` is nearly normal with mean `$p$` and standard error of `$\hat{p}$` as `$SE(\hat{p}) = \sqrt{\frac{\ p(1-p)\ }{n}}.$`

]

.pull-right[

#### Checking the two conditions

How do we check the success-failure condition when typically we don't know the true proportion `$p$`?

We can estimate `$p$` with...

- ...the sample proportion `$\hat{p}$`, if computing confidence intervals;
- ...the null value `$p_0$`, if performing a hypothesis test.

The independence condition is a more nuanced requirement (outside the scope of this class).

]

---

.pull-left[
### Confidence interval
Provides a range of plausible values for proportion `$p$`

* When the sample proportion `$\hat{p}$` can be modeled using a normal distribution, a confidence interval for proportion `$p$` takes the form
$$ \hat p \pm z^* \times SE(\hat{p})$$
where
$$ SE(\hat p) = \sqrt{\frac{p(1-p)}n}.$$
* Since `$p$` is unknown, we typically use
`$$SE(\hat{p}) \approx \sqrt{\frac{(\mbox{best guess of }p)(1 - \mbox{best guess of }p)}{n}}$$`
* `$z^*$` is a threshold depending upon level of confidence desired `$(z^*=1.96$`: 95% level)

]

.pull-right[
Example: random sample of 826 payday loan borrowers, assessing interest in regulation for payday loans.  70% of the responders say they support regulations.

1.  Is it reasonable to model the sample-to-sample variability of `$\hat{p}$` using a normal distribution?

1.  Estimate the standard error of `$\hat{p}.$`

1.  Construct a 95% confidence interval for `$p,$` the proportion of payday borrowers who support increased regulation for payday lenders.
]

---

.pull-left[

]
.pull-right[

1. Data are a random sample, so reasonable to assume independent observations that represent the population. Need to check success-failure condition.  We don't have `$p$`, so have to use `$\hat p$` to estimate it:
    - `$\text{Support: } n p \approx 826 \times 0.70 = 578$`
    - `$\text{Not: } n (1 - p) \approx 826 \times (1 - 0.70) = 248$`
Both are >10, so success-failure holds. 
2.  Since `$p$` is unknown, we use `$\hat p$` to estimate the standard error,
`$$SE = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{0.70 (1 - 0.70)} {826}} = 0.016.$$`
3. Using the point estimate `$\hat{p} = 0.70$`, `$z^{\star} = 1.96$` for a 95% confidence interval, and the standard error `$SE = 0.016$` from above:
`$$\hat{p} \pm z^{\star} \times SE = 0.70 \pm 1.96 \times 0.016$$`
The confidence interval is then `$(0.669, 0.731)$`.

]

---
.pull-left[
### Changing the confidence level

If we want more confidence that our confidence interval contains `$p$`, the interval should be LARGER to account for greater uncertainty.

* The 95% conf. interval takes the form 
`$$\text{point estimate} \ \pm \ 1.96 \ \times \ SE$$`
* 1.96 corresponds to the 95% confidence level
* 2.58 corresponds to 99% confidence level
* Where do these numbers come from? The normal approximation.

]

.pull-right[

<img src="lec18_files/figure-html/fig-choosingZForCI-1.png" width="90%" />
* We can compute these more exactly using `qnorm()`: quantile function
* 99% confidence interval corresponds to 0.5% tail on each side. (0.5% + 99% + 0.5% = 100%)
* By symmetry, we can just look for the value corresponding to 0.5th percentile.

``` r
qnorm(0.005) # for 99%
#> [1] -2.575829
qnorm(0.025) # for 95%
#> [1] -1.959964
```
]

---
.pull-left[
### Hypothesis test for a proportion
We use **Z scores** to quickly assess how likely/unlikely the sample proportion differs from a hypothesized proportion.

* It normalizes the observed difference by the standard error (expected variability in the sample proportion) under the null hypothesis.

`$$Z = \frac{\hat{p} - p_0}{SE(p_0)} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$`

* When null hypothesis is true, and when the samples are independent and we have sufficiently many samples,
`$$np_0 \geq 10, \quad n(1-p_0)\geq 10,$$`
then `$Z$` is approximately a standard normal distribution `$N(0,1)$`.

]

.pull-right[

]

---

.pull-left[
Example: let's again consider whether payday loan borrowers support regulation on the loans that require evaluating debt payments.  Suppose we have a random sample of 826 borrowers, and 51% said they support regulation.

Is it reasonable to model `$\hat p$` w/ a normal distribution?  
* Independence holds because it's a random sample; and `$np_0 = 413$` and `$n(1-p_0)=413$` (we are using the null parameter `$p_0=0.5$` here).  Thus normal model is reasonable.

What hypothesis should we be testing?
* `$H_0$`: not support for regulation, `$p\leq 0.5$`.
* `$H_A$`: support for regulation, `$p>0.5$`.

Under a significance level `$\alpha = 0.05$`, should we reject `$H_0$` given the data?
* Let's first try deciding using Z-score.
* `$SE(p_0) = \sqrt{\frac{p_0(1-p_0)}n} = \sqrt{\frac{0.5(1-0.5)}{826}} = 0.017$`. 
]

.pull-right[

* Based on the normal model, the test statistic can be computed as the Z score of the point estimate:
`$$Z = \frac{\hat{p} - p_0}{SE(p_0)} = \frac{0.51 - 0.5}{0.017} = 0.59$$`
`$\hat{p}$` within 1 std dev of the mean, so don't reject `$H_0$`
* Now try p-value (area of shaded region).
<img src="lec18_files/figure-html/fig-normTail-51-1.png" width="100%" />

* Tail area which represents the p-value is 0.2776.
* B/c p-value is larger than 0.05, do not reject `$H_0.$`

Conclusion: The poll does not provide convincing enough evidence that a majority of payday loan borrowers support loan regulations.
]