Peter Cao

Jed Quiaoit

The **least squares regression line (LSRL)** is the best linear regression line that exists in the sense that it minimizes the sum of the squared residuals. (Remember from previous sections that residuals are the differences between the observed values of the response variable, y, and the predicted values, ŷ, from the model.)

The least squares criterion picks, out of all possible lines, the one whose sum of squared residuals is smallest. In that sense, the LSRL keeps the predicted values as close as possible to the observed values overall.

The least squares regression line is given by the formula ŷ = a + bx, where ŷ is the predicted value of the response variable, x is the predictor or explanatory variable, a is the y-intercept (the value of ŷ when x is zero), and b is the slope (the change in ŷ per unit change in x). The y-intercept and slope can be calculated using the one-variable statistics of x and y.

The residuals are *squared* in the least squares criterion for two reasons. First, squaring prevents positive and negative residuals from canceling each other out. Second, squaring penalizes large deviations from the line more heavily than small ones, so the line of best fit avoids leaving any point very far away. 🪢
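To see the criterion in action, here is a minimal sketch in Python. The data set is made up for illustration, and a = 2.2, b = 0.6 is the least squares line for that made-up data. The sketch compares its sum of squared residuals with a few other arbitrary candidate lines, and also shows that the unsquared residuals cancel out.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of (y - y_hat)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# The least squares line for this data set has a = 2.2 and b = 0.6
lsrl_sse = sum_squared_residuals(xs, ys, 2.2, 0.6)

# Without squaring, the positive and negative residuals of the LSRL cancel
raw_sum = sum(y - (2.2 + 0.6 * x) for x, y in zip(xs, ys))

# A few other candidate lines (arbitrary choices), given as (a, b)
other_lines = [(2.0, 0.6), (2.2, 0.7), (2.5, 0.5), (1.0, 1.0)]
other_sses = [sum_squared_residuals(xs, ys, a, b) for a, b in other_lines]

print("LSRL sum of squared residuals:", round(lsrl_sse, 4))
print("Other lines:", [round(s, 4) for s in other_sses])
print("Sum of unsquared LSRL residuals:", round(raw_sum, 10))
```

Every other candidate line gives a larger sum of squared residuals than the LSRL does, which is exactly what "least squares" means.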

The **slope** is the *predicted increase in the response variable with an increase of one unit of the explanatory variable*. To find the slope, we have the formula: ⛰️

b = r(sy / sx)

(where b is the slope, r is the correlation coefficient between x and y, sy is the standard deviation of y, and sx is the standard deviation of x.)

The least squares estimate of the slope takes into account the variability in both x and y and the strength of the linear relationship between them. The ratio sy/sx puts the slope in units of y per unit of x, and the correlation coefficient, *r*, scales that ratio by the strength and direction of the linear relationship. Because sy and sx are never negative, the slope always has the same sign as *r*.
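As a quick check of the slope formula, the sketch below computes r, sx, and sy for a small made-up data set and then multiplies them together. The data values and variable names are assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)

# Correlation coefficient: sum of products of z-scores, divided by n - 1
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

# Slope of the LSRL: b = r * (sy / sx)
b = r * s_y / s_x

print("r =", round(r, 4))
print("b =", round(b, 4))
```

For this data the slope works out to 0.6, so the interpretation would be a predicted increase of 0.6 units of y for every 1-unit increase in x.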

When asked to interpret the slope of an LSRL, follow the template below:

⭐ "There is a predicted increase/decrease of ______ (slope in unit of y variable) for every 1 (unit of x variable)."

**Big Three**

- Context

- Word "predicted"

Once you have calculated the slope of the least squares regression line, you can use the point-slope form to find the y-intercept and the general formula for the line.

The point-slope form of a linear equation is given by:

ŷ - y1 = m(x - x1)

where ŷ is the predicted value of the response variable, m is the slope of the line, x is the predictor or explanatory variable, and (x1, y1) is a point on the line.

The LSRL always passes through the point (x̄, ȳ), the mean of x and the mean of y. Substituting x1 = x̄, y1 = ȳ, and m = b into the point-slope form, we have:

ŷ - ȳ = b(x - x̄)

Solving for ŷ, we get:

ŷ = bx + (-bx̄ + ȳ)

The expression in parentheses, a = ȳ - bx̄, is the **y-intercept** of the line, which represents the *predicted value of the response variable when the explanatory variable is zero*. 💛
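The derivation above can be sketched in Python. The helper name `lsrl` and the data are hypothetical, chosen for illustration. The function finds the slope from r, sx, and sy, then uses the point (x̄, ȳ) to get the intercept.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for y_hat = a + b*x, using b = r*sy/sx and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar  # the line must pass through (x_bar, y_bar)
    return a, b

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)
prediction_at_mean = a + b * mean(xs)

print("y-intercept a =", round(a, 4))
print("slope b =", round(b, 4))
print("prediction at x-bar =", round(prediction_at_mean, 4), "and y-bar =", mean(ys))
```

Plugging x̄ into the fitted line returns ȳ, which confirms that the line passes through the point of means.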

Template time! When asked to interpret the y-intercept of an LSRL, follow the template below:

⭐ "The predicted value of (y in context) is _____ when (x value in context) is 0 (units in context)."

**Big Three**

- Context

The **coefficient of determination**, also known as **R-squared**, is a statistic that is used to evaluate the fit of a linear regression model (how well the LSRL fits the data). It is a measure of how much of the variability in the response variable (y) can be explained by the model. 🍄

For a least squares regression line, R-squared is the *square of the correlation coefficient*, r. It is written r² and ranges from 0 to 1, with a value of 0 indicating *no* linear relationship between the explanatory and response variables (the LSRL does not model the data at all) and a value of 1 indicating a *perfect linear* relationship.

There is also another formula for r²:

r² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

The numerator is the sum of the squared residuals from the LSRL, and the denominator is the total squared variation of y about its mean. So r² is the proportional reduction in the variation of y that comes from using the LSRL instead of simply predicting ȳ every time. When interpreting this, we say that it is the “percentage of the variation of y that can be explained by a linear model with respect to x.”
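Both ways of getting r² should agree. Here is a minimal sketch that checks this on a small made-up data set. The variable names are assumptions, and the slope is computed as Sxy/Sxx, which is algebraically the same as r(sy/sx).

```python
from statistics import mean

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)                      # variation in x
syy = sum((y - y_bar) ** 2 for y in ys)                      # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / (sxx * syy) ** 0.5   # correlation coefficient
b = sxy / sxx                  # slope, equivalent to r * sy / sx
a = y_bar - b * x_bar          # y-intercept

# Sum of squared residuals from the LSRL
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared_from_r = r ** 2
r_squared_from_residuals = 1 - sse / syy

print("r squared (from r):", round(r_squared_from_r, 4))
print("r squared (from residuals):", round(r_squared_from_residuals, 4))
```

For this data both come out to 0.6, so 60% of the variation in y can be explained by the linear relationship with x.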

Template time yet again! When asked to interpret a coefficient of determination for a least squares regression model, use the template below:

⭐ "____% of the variation in (y in context) can be explained by its linear relationship with (x in context)."

**Big Three**

- Context

The last statistic we will talk about is the **standard deviation** of the residuals, also called s. It measures the size of a typical residual, that is, roughly how far the observed values fall from the values predicted by the LSRL. The formula for s is given as: 🐫

s = √( Σ(yi - ŷi)² / (n - 2) )

This looks similar to the sample standard deviation, except that we divide by n - 2 and not n - 1. Why? We will learn more about s when we learn inference for regression in Unit 9.
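Here is a short sketch of the calculation of s, again on made-up data. The variable names are assumptions, and the slope is computed as Sxy/Sxx, which is equivalent to r(sy/sx).

```python
from math import sqrt
from statistics import mean

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

# Fit the LSRL
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals: divide by n - 2, not n - 1
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print("residuals:", [round(e, 2) for e in residuals])
print("s =", round(s, 4))
```

In context, s would be read as the typical distance between an observed y value and the value the LSRL predicts for it.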

On the AP test, it is very likely that you will be expected to read a **computer printout** of the data. Here is a sample printout with a look at where most of the statistics you will need to use are (the rest you will learn in Unit 9): 🖥️

A researcher is studying the relationship between the amount of sleep (in hours) and the performance on a cognitive test. She collects data from 50 participants and fits a linear regression model to the data. The summary of the model is shown below:

a) Interpret the slope of the model in the context of the problem.

b) Interpret the y-intercept of the model in the context of the problem.

c) Interpret the correlation coefficient of the model in the context of the problem.

d) Interpret the R-squared value of the model in the context of the problem.

e) Based on the summary of the model, do you think that the amount of sleep has a significant effect on the performance on the cognitive test? Why or why not?

f) Suppose the researcher collects data from an additional 50 participants and fits a new linear regression model to the combined data. The summary of the new model is shown below:

Compare the two models and explain how the new model differs from the original model in terms of the strength and direction of the relationship between the amount of sleep and the performance on the cognitive test.

- The y-intercept is also slightly lower in the new model (48) compared to the original model (50).
- The correlation coefficient is slightly weaker in the new model (-0.6) compared to the original model (-0.7).
- Finally, the R-squared value is lower in the new model (0.36) compared to the original model (0.49).

Overall, these differences suggest that the new model has a slightly weaker and less negative relationship between the amount of sleep and the performance on the cognitive test compared to the original model.
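One quick sanity check on the numbers quoted in this answer: for a least squares regression line, R-squared should equal the square of the correlation coefficient. The small sketch below checks that for both models, using only the r and R-squared values stated above. The dictionary layout is just an illustrative choice.

```python
# Values quoted in the answer above
models = {
    "original": {"r": -0.7, "r_squared": 0.49},
    "new": {"r": -0.6, "r_squared": 0.36},
}

# Compare r^2 (rounded to 2 decimal places) with the reported R-squared
checks = {name: round(m["r"] ** 2, 2) == m["r_squared"] for name, m in models.items()}

for name, ok in checks.items():
    print(name, "model: r squared matches R-squared?", ok)
```

Note that squaring removes the sign, so R-squared alone cannot tell you the direction of the relationship. The sign of r (or of the slope) does that.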


© 2023 Fiveable Inc. All rights reserved.