class: center, middle, inverse, title-slide

.title[
# Inference for linear regression
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Inference for linear regression with a single variable

* We'll consider chain sandwich stores that spent different amounts on advertising and then received different amounts of revenue.
* Suppose the data on advertising spending and revenue, **for all** chain sandwich stores, looked like:

<img src="lec23_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
The population model is:
`$$y = \beta_0 + \beta_1 x + \varepsilon.$$`

Using all data in the true population, the fitted model would be:

`$$\texttt{expected revenue} = 11.23 + 4.9 \times \texttt{advertising}.$$`

* (Can use `lm()` to compute these coefficients; see the sketch below.)
* For every \$1,000 spent on advertising, revenue increases by \$4,900 on average.
* If we spent nothing on advertising, we would expect \$11,230 of revenue on average.

However, it's often too time-consuming or expensive to get data from the **entire** population.

* More realistic to have/collect data from only a **subset** of the population.
* **Goal**: use the subset to **infer** the population's slope and intercept.
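A minimal sketch of that computation (assuming the full population data sits in a data frame `sandwich` with advertising spending `ad` and revenue `rev`, the names used in this lecture's commented-out source):

``` r
library(broom)  # tidy() turns an lm fit into a one-row-per-term summary table

# Least-squares fit of revenue on advertising spending
lm(rev ~ ad, data = sandwich) |>
  tidy()
# estimate column: intercept approx. 11.23, slope approx. 4.9
```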
]

---

.pull-left[
* The relevant hypotheses for a linear model are about whether or not the population slope parameter is zero.
  - `\(H_0\)`: `\(\beta_1 = 0\)`, no linear relationship between response and independent variable
  - `\(H_A\)`: `\(\beta_1 \neq 0\)`, some linear relationship between response and independent variable
* We can perform analyses similar to those we used before:
  - Randomization test
  - Bootstrap confidence interval
  - Mathematical approach
]

--

.pull-right[
#### Randomization test

* A randomization test randomly permutes the responses in order to destroy any relationship between the response and the independent variable.
* This provides a sense of the variability we should expect under the null hypothesis.

E.g., consider the **births14** dataset as before, which has baby birth weight and number of weeks of gestation.

<img src="lec23_files/figure-html/fig-permweightScatter-1.png" width="100%" />
]

---

.pull-left[
* Each permutation produces a line of best fit.

<img src="lec23_files/figure-html/fig-permweekslm-1.png" width="100%" />

* By doing this for many permutations,<br> we see what slope values we would get when there is no relationship between the response and the independent variable (see the sketch below).
]

--

.pull-right[
Do this for 10,000 permutations.

<img src="lec23_files/figure-html/fig-nulldistBirths-1.png" width="432" style="display: block; margin: auto;" />

* If there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
* We reject the null hypothesis:<br> we believe that the slope observed on the original data is not just due to natural variability.<br> Rather, we believe there is a linear relationship between baby `weight` and `weeks` of gestation.
]
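A minimal sketch of this permutation scheme (assuming `births14` comes from the **openintro** package; the refit-in-a-loop approach is one of several ways to do this):

``` r
library(openintro)  # assumed source of the births14 data
library(dplyr)

set.seed(1)
# Shuffle the response to break any weight-weeks relationship,
# then record the slope of the line refit to the shuffled data
perm_slopes <- replicate(10000, {
  shuffled <- births14 |> mutate(weight = sample(weight))
  coef(lm(weight ~ weeks, data = shuffled))[["weeks"]]
})

# Null distribution of the slope; compare the observed slope against it
hist(perm_slopes)
```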
---

.pull-left[
#### Bootstrap confidence interval for the slope

Now look at predicting a baby's weight using the mother's age.

* When we run the following code, we get

``` r
lm(weight ~ mage, data=births14_100) |> tidy()
```

<table class="table table-striped table-condensed" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 10em; font-family: monospace;"> (Intercept) </td> <td style="text-align:right;width: 5em; "> 7.34 </td> <td style="text-align:right;width: 5em; "> 0.60 </td> <td style="text-align:right;width: 5em; "> 12.29 </td> <td style="text-align:right;width: 5em; "> <0.0001 </td> </tr> <tr> <td style="text-align:left;width: 10em; font-family: monospace;"> mage </td> <td style="text-align:right;width: 5em; "> 0.00 </td> <td style="text-align:right;width: 5em; "> 0.02 </td> <td style="text-align:right;width: 5em; "> -0.11 </td> <td style="text-align:right;width: 5em; "> 0.915 </td> </tr> </tbody> </table>

* We will take bootstrap samples of the original data.
* Sampling with replacement `\(\Rightarrow\)`<br> a bootstrap sample might end up having **multiple** instances of a data point from the original dataset, or **one** instance of that data point, or **no** instances.
* See the overlapping points in the plots on the right.
]

--

.pull-right[

``` r
births14_100 |> n_distinct()
#> [1] 100
```

<img src="lec23_files/figure-html/fig-birth2BS-1.png" width="80%" style="display: block; margin: auto;" /><img src="lec23_files/figure-html/fig-birth2BS-2.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
* For each bootstrap sample, we compute a linear model on the data.
* We then plot a histogram of all of the different slopes we get.
* This gives us a sense of the variability of the slopes across random samples (see the sketch below).

<img src="lec23_files/figure-html/fig-mageBSslopes-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
Suppose all of the bootstrap slopes are given in the vector `slopes`.

* The following code calculates the endpoints of the bootstrap 95% confidence interval:

``` r
quantile(slopes, probs = c(0.025, 0.975))
```

From this we state:<br> we are 95% confident that, for the model of a baby's weight described by the mother's age, a one-unit increase in mother's age is associated with a change in predicted average baby weight of between -0.01 and 0.081 pounds.

* This interval includes 0, so it is possible that mother's age does not have a meaningful linear relationship with baby's weight.
]
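A minimal sketch of how such a `slopes` vector could be produced (assuming `births14_100` is the 100-row sample used on these slides):

``` r
library(dplyr)

set.seed(2)
# One slope per bootstrap sample: resample the 100 rows with replacement,
# refit the line, and keep the mage coefficient
slopes <- replicate(10000, {
  boot <- births14_100 |> slice_sample(n = nrow(births14_100), replace = TRUE)
  coef(lm(weight ~ mage, data = boot))[["mage"]]
})

quantile(slopes, probs = c(0.025, 0.975))  # bootstrap 95% confidence interval
```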
---

.pull-left[
### Mathematical model for testing a slope

* Under certain conditions, we can use the `\(t\)`-distribution to test and estimate the slope parameter in linear regression.

Conditions:
- **Linearity.** The data should show a linear trend.
- **Independent observations.** Be cautious about applying regression to data that are sequential observations in time, e.g., daily stock prices.
- **Nearly normal residuals.** Generally, the residuals should be nearly normal. When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points.
- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
]

--

.pull-right[
Four examples below: do they satisfy the conditions?

<img src="lec23_files/figure-html/unnamed-chunk-12-1.png" width="100%" />
]

---

.pull-left[
### Example

Dataset `openintro::midterms_house` has info on US elections, with unemployment rate and election outcomes in **midterm** elections.

* Question: does a higher unemployment rate entail worse performance for the current President's party?

```
#> tibble [31 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ year        : int [1:31] 1899 1903 1907 1911 1915 1919 1923 1927 1931 1935 ...
#>  $ potus       : Factor w/ 19 levels "Barack Obama",..: 18 16 16 17 19 19 3 3 11 6 ...
#>  $ party       : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 1 1 2 2 2 1 ...
#>  $ unemp       : num [1:31] 11.62 4.3 3.29 5.86 6.63 ...
#>  $ house_change: num [1:31] -9.22 -4.28 -12.29 -26.59 -20.96 ...
```
]

--

.pull-right[

``` r
midterms_house |>
  ggplot(aes(unemp, house_change)) +
  geom_text(aes(label = substr(party, 1, 1), color = party), size = 5) +
  labs(x = "Percent unemployed",
       y = "Percent change in House seats\nfor the President's party") +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1),
                     limits = c(3.5, 12.1), breaks = seq(0, 100, by = 2)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1),
                     limits = c(-31, 11), breaks = c(-30, -20, -10, 0, 10)) +
  scale_color_manual(values = c("Republican" = "red", "Democrat" = "blue")) +
  theme(legend.position = c(0.8, 0.8))
```

<img src="lec23_files/figure-html/fig-unemploymentAndChangeInHouse-1.png" width="100%" style="display: block; margin: auto;" />
]

---

Let's build a linear model for the change in House seats for a President's party as a function of the unemployment rate.

* Let's restrict to election years OUTSIDE of the Great Depression (i.e., not year 1935 or 1939):

``` r
not_gd_data <- midterms_house |>
  filter(!(year %in% c(1935, 1939)))
```

* First check that there are no substantial outliers and no clear nonlinearity.

.pull-left[
<img src="lec23_files/figure-html/unnamed-chunk-16-1.png" width="432" style="display: block; margin: auto;" />

* Let's now compute and examine the residuals (right; a sketch of the code follows below).
]

.pull-right[
<img src="lec23_files/figure-html/unnamed-chunk-17-1.png" width="432" style="display: block; margin: auto;" />

* No clear problems with the residuals.
* It does appear the linear model might not be the best fit, but let's investigate.
]
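A minimal sketch of how such a residual plot might be produced (assuming residuals are plotted against fitted values; `augment()` from **broom** attaches fitted values and residuals to the data):

``` r
library(broom)
library(ggplot2)

fit <- lm(house_change ~ unemp, data = not_gd_data)

# Residuals vs. fitted values: look for curvature or changing spread
augment(fit) |>
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
```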
---

.pull-left[
Recall the question: does a higher unemployment rate entail worse performance for the current President's party?

We can frame this as a hypothesis test:
- `\(H_0:\ \beta_1 = 0\)`, the true linear model has slope zero
- `\(H_A:\ \beta_1 \neq 0\)`, the true linear model has slope different from zero; unemployment rate is predictive of the change in seats after a midterm election

To reject the null, we use the following test statistic
`$$T = \frac{\text{estimate} - \text{null}}{SE},$$`
which follows a `\(t\)` distribution with `\(n-2\)` degrees of freedom. The 2 comes from the linear model having two parameters: `\(\beta_0\)`, the intercept, and `\(\beta_1\)`, the slope.

* This course will not teach how to compute SE; just use the R table output.
* Let's build the linear model.
]

--

.pull-right[

``` r
lm(house_change ~ unemp, data = not_gd_data) |>
  tidy() |>
  kable(digits = c(0, 3, 3, 3, 3))
```

|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |   -7.364|     5.155|    -1.429|   0.165|
|unemp       |   -0.890|     0.835|    -1.066|   0.296|

* From this we get the following predictive model:

$$
`\begin{aligned}
&\texttt{Percent change House seats for Pres's party} \newline \\
&\qquad\qquad = -7.36 - 0.89 \times \texttt{(unemployment rate)}
\end{aligned}`
$$

* Predicts that a greater unemployment rate implies worse performance in the House midterm election.
* Is there significant evidence that the "true" linear model has a negative slope?
* `\(T = \frac{-0.890 - 0}{0.835} = -1.066\)` matches the table output.
* We are testing the two-sided alternative `\((\beta_1 \neq 0)\)`,<br> so the p-value is (left-side area + right-side area):

``` r
2 * pt(-1.066, df = nrow(not_gd_data) - 2)
#> [1] 0.2958637
```
]

---

### Practice

* IMS 22.12(a,b)
* IMS 22.13(a,b)
* IMS 24.3(a,b)

* Take 3 minutes to think about the problem, discuss with your neighbors.
* I'll call on one of you and ask what your thoughts are for how to approach the problem.
* Not expecting you to have the right answer! Just want to hear how you approach the problem, then help guide you through thinking about the next steps!