STA35B: Statistical Data Science 2
We are done with the R4DS book; now moving to Introduction to Modern Statistics (IMS). The coming lectures will look at linear models and how to perform inference.
The common brushtail possum of Australia. Photo by Greg Schecter, flic.kr/p/9BAFbR, CC BY 2.0 license.
We’ll look into the possum dataset, which has measurements of possum body lengths and head lengths:
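These measurements come with the openintro package (its possum data has exactly the columns shown below), so one way to load and print them, assuming openintro and the tidyverse are installed, is:
library(tidyverse)            # for the pipe-and-ggplot2 workflow used below
possum <- openintro::possum   # body measurements for brushtail possums
possum                        # printing the tibble gives the output below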
# A tibble: 104 × 8
site pop sex age head_l skull_w total_l tail_l
<int> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 1 Vic m 8 94.1 60.4 89 36
2 1 Vic f 6 92.5 57.6 91.5 36.5
3 1 Vic f 6 94 60 95.5 39
4 1 Vic f 6 93.2 57.1 92 38
5 1 Vic f 2 91.5 56.3 85.5 36
6 1 Vic f 1 93.1 54.8 90.5 35.5
7 1 Vic m 2 95.3 58.2 89.5 36
8 1 Vic f 6 94.8 57.6 91 37
9 1 Vic f 9 93.4 56.3 91.5 37
10 1 Vic f 6 91.8 58 89.5 37.5
# ℹ 94 more rows
Can we predict the head length given the body length?
possum |>
ggplot(aes(total_l, head_l)) +
geom_point(alpha = 1, size = 2) +
labs(x = "Total Length (cm)",
y = "Head Length (mm)")
A linear model is a general framework that describes a linear relationship between a dependent variable and one or more independent variables.
Why model the relationship as linear?
The best-fitting line for these data is flat, which is not useful for describing this clearly non-linear relationship. These data are from a physics experiment.
How do we fit a “good” line to the data?
However, a perfect explanatory relationship (linear or not) is unrealistic in almost any natural process.
The basic structure of a linear model: \[ y = b_0 + b_1 x + \epsilon \]
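As a quick illustration of this structure, here is a minimal sketch with made-up values for \(b_0\), \(b_1\), and the noise level: simulate data from the model and check that a fitted line roughly recovers the coefficients.
set.seed(1)
b0 <- 41; b1 <- 0.59                    # hypothetical intercept and slope
x <- runif(100, 75, 97)                 # predictor values
y <- b0 + b1 * x + rnorm(100, sd = 2)   # add random error epsilon
coef(lm(y ~ x))                         # estimates should be close to (41, 0.59)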
Figure 2: Three datasets where a linear model may be useful even if the data do not all fall exactly on the line.
We saw two lines. Which line fits the data better?
Residuals are the leftover variation in the data after accounting for the model fit: \[\text{Data} = \text{Fit} + \text{Residual}\] One goal in picking the right linear model is for residuals to be as small as possible.
Suppose we fit a line \(\hat y = b_0 + b_1 x\) to the data \((x_i, y_i)_{i=1}^n\). The residual of the \(i\)-th observation is the difference between the observed value and the value predicted by the line, \(e_i = y_i - \hat y_i\).
Example calculation: if the line is given by the equation \[ \hat y = 41 + 0.59 x \] then the residual for the observation \((x_0,y_0) = (76.0, 85.1)\) is \[ \hat y_0 = 41 + 0.59 \times 76.0 = 85.84 \] \[ e_0 = y_0 - \hat y_0 = 85.1 - 85.84 = -0.74 \]
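The same arithmetic in R, using the hypothetical line \(\hat y = 41 + 0.59 x\) from above:
y_hat0 <- 41 + 0.59 * 76.0           # fitted value at x0 = 76.0
e0 <- 85.1 - y_hat0                  # residual: observed minus fitted
c(fitted = y_hat0, residual = e0)    # 85.84 and -0.74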
Residuals and residual plots are helpful when trying to understand how well a linear model fits the data.
Residual plot for the model predicting head length from total length for brushtail possums.
Return to earlier question: which line fits the data better? More generally, what is the “best” line (if it exists)?
Use software to compute the slope and intercept of these lines.
Let’s see what these two lines look like:
model_rlm <- MASS::rlm(head_l ~ total_l, data = possum) # robust fit (Huber M-estimator), used here as a stand-in for the LAD line
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(size = 2) +
  geom_line(aes(y = predict(model_rlm)), color = "blue", linetype = "dashed") +   # robust (LAD-style) line
  geom_smooth(method = "lm", se = FALSE, color = "orange", linetype = "solid") +  # OLS line
  labs(x = "Total Length (cm)", y = "Head Length (mm)") +
  theme_bw() +
  theme(axis.title = element_text(size = 20))
The line that minimizes the least squares criterion, \(\sum_{i=1}^n e_i^2\), is called the least squares line. Reasons to use this line over option 1 (minimizing the sum of absolute residuals, \(\sum_{i=1}^n |e_i|\)):
The rest of this course will use this option.
For possum, we could write the equation of the least squares regression line as \[\widehat{\texttt{head_length}} = \beta_0 + \beta_1 \times \texttt{body_length}\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 42.7097931 | 5.1728138 | 8.256588 | 0 |
total_l | 0.5729013 | 0.0593253 | 9.656946 | 0 |
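A table like this can be produced by fitting the model with lm() and tidying the summary; one way (assuming the broom package, not necessarily the exact code behind these slides):
fit <- lm(head_l ~ total_l, data = possum)   # least squares fit
broom::tidy(fit)                             # term, estimate, std.error, statistic, p.value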
\[\widehat{\texttt{head_length}} = \hat\beta_0 + \hat\beta_1 \times \texttt{body_length} = 42.7 + 0.573 \times \texttt{body_length}\]
But what do these values mean?
The slope describes the estimated difference in the predicted average outcome of \(y\) if the predictor variable \(x\) happened to be one unit larger.
The intercept describes the average outcome of \(y\) if \(x = 0\) and the linear model is valid all the way to \(x = 0\) (values of \(x = 0\) are not observed or relevant in many applications).
In our example: For each additional cm of body length, we would expect a possum’s head to be \(0.573\) mm longer.
In our example: A possum with no body length is expected to have a head length of \(42.7\) mm. (Here the intercept is not meaningful because there are no observations where \(x\) is near zero. Also, a possum cannot have zero body length.)
Generally speaking, we will not know how data outside of our limited window will behave.
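For a prediction within the observed range, say a hypothetical possum with total_l = 90 cm, the fitted line gives roughly \(42.7 + 0.573 \times 90 \approx 94.3\) mm:
fit <- lm(head_l ~ total_l, data = possum)
predict(fit, newdata = data.frame(total_l = 90))   # predicted head length (mm) at 90 cm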
Can we quantify the strength of a linear relationship with a single statistic?
Suppose we want to summarize a set of values \(a_1, a_2, \ldots, a_n\).
We can quantify a “representative value” using the sample mean, mean(): \[\bar{a} = \frac{1}{n} \sum_{i=1}^n a_i.\]
We can quantify the “spread” using the sample variance, var(): \[s_a^2 = \frac{1}{n-1} \sum_{i=1}^n (a_i - \bar{a})^2,\] or the sample standard deviation, sd(): \[s_a = \sqrt{s_a^2}.\]
Can we quantify the strength of a linear relationship with a single statistic?
The correlation coefficient, which always takes a value between \(-1\) and \(1\), summarizes the strength and direction of the linear relationship between two variables.
Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.
Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
It is common to use \(R^2\), called R-squared or the coefficient of determination:
var(possum$head_l) # Variance of the response head_l; equals SST / (n - 1)
[1] 12.76883
var(possum$head_l - predict(lm(head_l ~ total_l, possum))) # Variance of the residuals; equals SSE / (n - 1)
[1] 6.670301
1 - var(possum$head_l - predict(lm(head_l ~ total_l, possum))) / var(possum$head_l) # R^2 = 1 - SSE/SST: proportion of the response's variation explained by the least squares line
[1] 0.4776105
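For a simple linear regression like this one, \(R^2\) equals the squared correlation between the two variables, so the same number can be obtained with cor():
cor(possum$head_l, possum$total_l)^2   # squared correlation; should match the R^2 above (about 0.478)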
Total sum of squares: variability in the \(y\) values about their mean, \(\bar{y}\) \[SST = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2\]
Sum of squared errors (or sum of squared residuals): variability in residuals \[SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2\]
The coefficient of determination, \(R^2\), can then be calculated as \[R^2 = 1 - \frac{SSE}{SST}\]
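These quantities can also be computed directly from their definitions, using the possum fit from before:
fit <- lm(head_l ~ total_l, data = possum)
sst <- sum((possum$head_l - mean(possum$head_l))^2)   # total sum of squares
sse <- sum(residuals(fit)^2)                          # sum of squared errors
1 - sse / sst                                         # R^2, same value as above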
Interpretation for possum: \(R^2 = 0.48\), i.e., \(48\%\) of the variation in head_l is explained by the linear model.
Example: Predict the interest rate for someone based on whether that person has filed for bankruptcy. (Data from openintro::loans_full_schema.)
\[ \widehat{\texttt{interest_rate}} = b_0 + b_1 \times \texttt{bankruptcy} \]
bankruptcy is a categorical predictor with two levels/categories: 0 (no bankruptcy on the borrower’s record) and 1 (at least one past bankruptcy).
Let’s see what the interest rates look like for the two levels
Let’s inspect the resulting least-squares linear model:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 12.3380046 | 0.0532983 | 231.489782 | 0.0e+00 |
bankruptcy1 | 0.7367856 | 0.1529062 | 4.818548 | 1.5e-06 |
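A hedged sketch of how such a fit can be reproduced; the exact preprocessing may differ, but the bankruptcy indicator is presumably derived from the public_record_bankrupt column of loans_full_schema:
loans <- openintro::loans_full_schema |>
  mutate(bankruptcy = factor(if_else(public_record_bankrupt >= 1, 1, 0)))   # 1 = at least one past bankruptcy
fit_bk <- lm(interest_rate ~ bankruptcy, data = loans)
broom::tidy(fit_bk)   # estimates comparable to the table above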
Fundamentally, the intercept and slope interpretations do not change when modeling categorical variables with two levels.
Good practice for dealing with outlying observations is to produce two analyses: one with and one without the outlying observations.
Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage or leverage points.
It is tempting to remove outliers. Don’t do this without a very good reason. Models that ignore exceptional (and interesting) cases often perform poorly.
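A minimal sketch of the “two analyses” practice described above, using a hypothetical row index for a flagged observation (not an outlier actually identified in these data):
fit_all <- lm(head_l ~ total_l, data = possum)
flagged <- c(41)                                            # hypothetical index of a suspect point
fit_drop <- lm(head_l ~ total_l, data = possum[-flagged, ])
rbind(with_all = coef(fit_all), without_flagged = coef(fit_drop))   # compare how much the estimates move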