class: center, middle, inverse, title-slide

.title[
# Inference for linear regression
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Akira Horiguchi
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Inference for linear regression with a single variable

* We'll consider chain sandwich stores that spent different amounts on advertising and then received different amounts of revenue.
* Suppose the data on advertising spending and revenue, **for all** chain sandwich stores, looked like:

<img src="lec23_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
The population model is:
`$$y = \beta_0 + \beta_1 x + \varepsilon.$$`

Using all data in the true population, the fitted model would be:

`$$\texttt{expected revenue} = 11.23 + 4.9 \times \texttt{advertising}.$$`

* (Can use `lm()` to compute these coefficients; see the sketch below.)
* For every \$1,000 spent on advertising, revenue increases by \$4,900 on average.
* If we spent nothing on advertising, we would expect \$11,230 of revenue on average.

However, it's often too time-consuming or expensive to get data from the **entire** population.

* More realistic to have/collect data from only a **subset** of the population.
* **Goal**: use the subset to **infer** the population's slope and intercept.
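A minimal sketch of that computation (assuming the full population data sits in a data frame `sandwich` with advertising spending `ad` and revenue `rev`, the names used in this lecture's commented-out source):

``` r
library(broom)  # tidy() turns an lm fit into a one-row-per-term summary table

# Least-squares fit of revenue on advertising spending
lm(rev ~ ad, data = sandwich) |>
  tidy()
# estimate column: intercept approx. 11.23, slope approx. 4.9
```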
]

---

.pull-left[
* The relevant hypotheses for a linear model are about whether or not the population slope parameter is zero.
  - `\(H_0\)`: `\(\beta_1 = 0\)`, no linear relationship between response and independent variable
  - `\(H_A\)`: `\(\beta_1 \neq 0\)`, some linear relationship between response and independent variable
* We can perform analyses similar to those we used before:
  - Randomization test
  - Bootstrap confidence interval
  - Mathematical approach
]

--

.pull-right[
#### Randomization test

* A randomization test randomly permutes the responses in order to destroy any relationship between the response and the independent variable.
* This provides a sense of the variability we should expect under the null hypothesis.

E.g., consider the **births14** dataset as before, which has baby birth weight and number of weeks of gestation.

<img src="lec23_files/figure-html/fig-permweightScatter-1.png" width="100%" />
]

---

.pull-left[
* Each permutation produces a line of best fit.

<img src="lec23_files/figure-html/fig-permweekslm-1.png" width="100%" />

* By doing this for many permutations,<br> we see what slope values we would get when there is no relationship between the response and the independent variable (see the sketch below).
]

--

.pull-right[
Do this for 10,000 permutations.

<img src="lec23_files/figure-html/fig-nulldistBirths-1.png" width="432" style="display: block; margin: auto;" />

* If there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
* We reject the null hypothesis:<br> we believe that the slope observed on the original data is not just due to natural variability.<br> Rather, we believe there is a linear relationship between baby `weight` and `weeks` of gestation.
]
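A minimal sketch of this permutation scheme (assuming `births14` comes from the **openintro** package; the refit-in-a-loop approach is one of several ways to do this):

``` r
library(openintro)  # assumed source of the births14 data
library(dplyr)

set.seed(1)
# Shuffle the response to break any weight-weeks relationship,
# then record the slope of the line refit to the shuffled data
perm_slopes <- replicate(10000, {
  shuffled <- births14 |> mutate(weight = sample(weight))
  coef(lm(weight ~ weeks, data = shuffled))[["weeks"]]
})

# Null distribution of the slope; compare the observed slope against it
hist(perm_slopes)
```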
---

.pull-left[
#### Bootstrap confidence interval for the slope

Now look at predicting a baby's weight using the mother's age.

* When we run the following code, we get

``` r
lm(weight ~ mage, data=births14_100) |> tidy()
```

<table class="table table-striped table-condensed" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 10em; font-family: monospace;"> (Intercept) </td> <td style="text-align:right;width: 5em; "> 7.34 </td> <td style="text-align:right;width: 5em; "> 0.60 </td> <td style="text-align:right;width: 5em; "> 12.29 </td> <td style="text-align:right;width: 5em; "> <0.0001 </td> </tr> <tr> <td style="text-align:left;width: 10em; font-family: monospace;"> mage </td> <td style="text-align:right;width: 5em; "> 0.00 </td> <td style="text-align:right;width: 5em; "> 0.02 </td> <td style="text-align:right;width: 5em; "> -0.11 </td> <td style="text-align:right;width: 5em; "> 0.915 </td> </tr> </tbody> </table>

* We will take bootstrap samples of the original data.
* Sampling with replacement `\(\Rightarrow\)`<br> a bootstrap sample might end up having **multiple** instances of a data point from the original dataset, or **one** instance of that data point, or **no** instances.
* See the overlapping points in the plots on the right.
]

--

.pull-right[

``` r
births14_100 |> n_distinct()
#> [1] 100
```

<img src="lec23_files/figure-html/fig-birth2BS-1.png" width="80%" style="display: block; margin: auto;" /><img src="lec23_files/figure-html/fig-birth2BS-2.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
* For each bootstrap sample, we compute a linear model on the data.
* We then plot a histogram of all of the different slopes we get.
* This gives us a sense of the variability of the slopes across random samples (see the sketch below).

<img src="lec23_files/figure-html/fig-mageBSslopes-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
Suppose all of the bootstrap slopes are given in the vector `slopes`.

* The following code calculates the endpoints of the bootstrap 95% confidence interval:

``` r
quantile(slopes, probs = c(0.025, 0.975))
```

From this we state:<br> we are 95% confident that, for the model of a baby's weight described by the mother's age, a one-unit increase in mother's age is associated with a change in predicted average baby weight of between -0.01 and 0.081 pounds.

* This interval includes 0, so it is possible that mother's age does not have a meaningful linear relationship with baby's weight.
]
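A minimal sketch of how such a `slopes` vector could be produced (assuming `births14_100` is the 100-row sample used on these slides):

``` r
library(dplyr)

set.seed(2)
# One slope per bootstrap sample: resample the 100 rows with replacement,
# refit the line, and keep the mage coefficient
slopes <- replicate(10000, {
  boot <- births14_100 |> slice_sample(n = nrow(births14_100), replace = TRUE)
  coef(lm(weight ~ mage, data = boot))[["mage"]]
})

quantile(slopes, probs = c(0.025, 0.975))  # bootstrap 95% confidence interval
```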
---

.pull-left[
### Mathematical model for testing a slope

* Under certain conditions, we can use the `\(t\)`-distribution to test and estimate the slope parameter in linear regression.

Conditions:
- **Linearity.** The data should show a linear trend.
- **Independent observations.** Be cautious about applying regression to data that are sequential observations in time, e.g., daily stock prices.
- **Nearly normal residuals.** Generally, the residuals should be nearly normal. When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points.
- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
]

--

.pull-right[
Four examples below: do they satisfy the conditions?

<img src="lec23_files/figure-html/unnamed-chunk-12-1.png" width="100%" />
]

---

.pull-left[
### Example

Dataset `openintro::midterms_house` has info on US elections, with unemployment rate and election outcomes in **midterm** elections.

* Question: does a higher unemployment rate entail worse performance for the current President's party?

```
#> tibble [31 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ year        : int [1:31] 1899 1903 1907 1911 1915 1919 1923 1927 1931 1935 ...
#>  $ potus       : Factor w/ 19 levels "Barack Obama",..: 18 16 16 17 19 19 3 3 11 6 ...
#>  $ party       : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 1 1 2 2 2 1 ...
#>  $ unemp       : num [1:31] 11.62 4.3 3.29 5.86 6.63 ...
#>  $ house_change: num [1:31] -9.22 -4.28 -12.29 -26.59 -20.96 ...
```
]

--

.pull-right[

``` r
midterms_house |>
  ggplot(aes(unemp, house_change)) +
  geom_text(aes(label = substr(party, 1, 1), color = party), size = 5) +
  labs(x = "Percent unemployed",
       y = "Percent change in House seats\nfor the President's party") +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1),
                     limits = c(3.5, 12.1), breaks = seq(0, 100, by = 2)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1),
                     limits = c(-31, 11), breaks = c(-30, -20, -10, 0, 10)) +
  scale_color_manual(values = c("Republican" = "red", "Democrat" = "blue")) +
  theme(legend.position = c(0.8, 0.8))
```

<img src="lec23_files/figure-html/fig-unemploymentAndChangeInHouse-1.png" width="100%" style="display: block; margin: auto;" />
]

---

Let's build a linear model for the change in House seats for a President's party as a function of the unemployment rate.

* Let's restrict to election years OUTSIDE of the Great Depression (i.e., not year 1935 or 1939):

``` r
not_gd_data <- midterms_house |>
  filter(!(year %in% c(1935, 1939)))
```

* First check that there are no substantial outliers and no clear nonlinearity.

.pull-left[
<img src="lec23_files/figure-html/unnamed-chunk-16-1.png" width="432" style="display: block; margin: auto;" />

* Let's now compute and examine the residuals (right; a sketch of the code follows below).
]

.pull-right[
<img src="lec23_files/figure-html/unnamed-chunk-17-1.png" width="432" style="display: block; margin: auto;" />

* No clear problems with the residuals.
* It does appear the linear model might not be the best fit, but let's investigate.
]
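A minimal sketch of how such a residual plot might be produced (assuming residuals are plotted against fitted values; `augment()` from **broom** attaches fitted values and residuals to the data):

``` r
library(broom)
library(ggplot2)

fit <- lm(house_change ~ unemp, data = not_gd_data)

# Residuals vs. fitted values: look for curvature or changing spread
augment(fit) |>
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
```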
---

.pull-left[
Recall the question: does a higher unemployment rate entail worse performance for the current President's party?

We can frame this as a hypothesis test:
- `\(H_0:\ \beta_1 = 0\)`, the true linear model has slope zero
- `\(H_A:\ \beta_1 \neq 0\)`, the true linear model has slope different from zero; unemployment rate is predictive of the change in seats after a midterm election

To reject the null, we use the following test statistic
`$$T = \frac{\text{estimate} - \text{null}}{SE},$$`
which follows a `\(t\)` distribution with `\(n-2\)` degrees of freedom. The 2 comes from the linear model having two parameters: `\(\beta_0\)`, the intercept, and `\(\beta_1\)`, the slope.

* This course will not teach how to compute SE; just use the R table output.
* Let's build the linear model.
]

--

.pull-right[

``` r
lm(house_change ~ unemp, data = not_gd_data) |>
  tidy() |>
  kable(digits = c(0, 3, 3, 3, 3))
```

|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |   -7.364|     5.155|    -1.429|   0.165|
|unemp       |   -0.890|     0.835|    -1.066|   0.296|

* From this we get the following predictive model:

$$
`\begin{aligned}
&\texttt{Percent change House seats for Pres's party} \newline \\
&\qquad\qquad = -7.36 - 0.89 \times \texttt{(unemployment rate)}
\end{aligned}`
$$

* Predicts that a greater unemployment rate implies worse performance in the House midterm election.
* Is there significant evidence that the "true" linear model has a negative slope?
* `\(T = \frac{-0.890 - 0}{0.835} = -1.066\)` matches the table output.
* We are testing the two-sided alternative `\((\beta_1 \neq 0)\)`,<br> so the p-value is (left-side area + right-side area):

``` r
2 * pt(-1.066, df = nrow(not_gd_data) - 2)
#> [1] 0.2958637
```
]

---

### Practice

* IMS 22.12(a,b)
* IMS 22.13(a,b)
* IMS 24.3(a,b)

* Take 3 minutes to think about the problem, discuss with your neighbors.
* I'll call on one of you and ask what your thoughts are for how to approach the problem.
* Not expecting you to have the right answer! Just want to hear how you approach the problem, then help guide you through thinking about the next steps!