AHSS Fitting a line by least squares regression

Section 8.2 Fitting a line by least squares regression

In this section, we answer the following questions:

How well can we predict financial aid based on family income for a particular college?
How does one find, interpret, and apply the least squares regression line?
How do we measure the fit of a model and compare different models to each other?
Why do models sometimes make predictions that are ridiculous or impossible?

Subsection 8.2.1 Learning objectives

Calculate the slope and y-intercept of the least squares regression line using the relevant summary statistics. Interpret these quantities in context.
Understand why the least squares regression line is called the least squares regression line.
Interpret the explained variance $R^2\text{.}$
Understand the concept of extrapolation and why it is dangerous.
Identify outliers and influential points in a scatterplot.

Subsection 8.2.2 An objective measure for finding the best line

Fitting linear models by eye is open to criticism since it is based on individual preference. In this section, we use least squares regression as a more rigorous approach.

This section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois.¹ Gift aid is financial aid that does not need to be paid back, as opposed to a loan. A scatterplot of the data is shown in Figure 8.2.1 along with two linear fits. The lines follow a negative trend in the data; students who had higher family incomes tended to receive less gift aid from the university.

¹ These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled “What Students Really Pay to Go to College,” published online by The Chronicle of Higher Education: chronicle.com/article/What-Students-Really-Pay-to-Go/131435

Figure 8.2.1. Gift aid and family income for a random sample of 50 freshman students from Elmhurst College. Two lines are fit to the data, the solid line being the *least squares line*.

We begin by thinking about what we mean by “best”. Mathematically, we want a line that has small residuals. Perhaps our criterion could minimize the sum of the residual magnitudes:

\begin{gather} |y_1 - \hat{y}_1| + |y_2-\hat{y}_2| + \dots + |y_n-\hat{y}_n|\label{sumOfAbsoluteValueOfResiduals}\tag{8.2.1} \end{gather}

which we could accomplish with a computer program. The resulting dashed line shown in Figure 8.2.1 demonstrates this fit can be quite reasonable. However, a more common practice is to choose the line that minimizes the sum of the squared residuals:

\begin{gather} (y_1 - \hat{y}_1)^2 + (y_2-\hat{y}_2)^2+ \dots + (y_n-\hat{y}_n)^2\label{sumOfSquaresForResiduals}\tag{8.2.2} \end{gather}

The line that minimizes the sum of the squared residuals is represented as the solid line in Figure 8.2.1. This is commonly called the least squares line.
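The two criteria, (8.2.1) and (8.2.2), can be compared directly in code. The following is a minimal sketch, assuming a small made-up data set (not the Elmhurst data) and an arbitrary candidate line; the function names are hypothetical and chosen only for illustration.

```python
def sum_abs_residuals(xs, ys, a, b):
    """Criterion (8.2.1): sum of |y_i - y_hat_i| for the line y_hat = a + b*x."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def sum_sq_residuals(xs, ys, a, b):
    """Criterion (8.2.2): sum of (y_i - y_hat_i)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 10]

# Hypothetical candidate line: y_hat = 0 + 2x, with residuals 0, 0, -1, 2.
a, b = 0, 2
abs_total = sum_abs_residuals(xs, ys, a, b)
sq_total = sum_sq_residuals(xs, ys, a, b)
print(abs_total, sq_total)  # 3 5
```

The residual of 2 contributes 2 to the first total but 4 to the second, so the squared criterion penalizes the largest miss more heavily.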

Both lines seem reasonable, so why do data scientists prefer the least squares regression line? One reason is that it is easier to compute by hand and in most statistical software. Another, and more compelling, reason is that in many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.
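As a small worked illustration of that last point, compare how much residuals of 4 and 2 contribute under each criterion:

\begin{gather*} |4| = 2\times|2| \qquad \text{but} \qquad 4^2 = 16 = 4\times 2^2 \end{gather*}

Under the squared criterion, the residual of 4 counts four times as much as the residual of 2, rather than twice as much.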

In Figure 8.2.2, we imagine the squared errors about a line as actual squares. The least squares regression line minimizes the sum of the areas of these squares. In the figure, the sum of the squared errors is $4+1+1=6\text{.}$ There is no other line about which the sum of the squared errors will be smaller.
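The claim that no other line does better can be checked numerically. The sketch below is an illustration under stated assumptions: the data are made up (they are not the points in Figure 8.2.2), and the least squares line is obtained from the standard closed-form expressions for the slope and intercept. Nudging that line in any direction should only increase the sum of squared residuals.

```python
# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard closed-form least squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b_ls = s_xy / s_xx
a_ls = y_bar - b_ls * x_bar


def sse(a, b):
    """Sum of squared residuals about the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


best = sse(a_ls, b_ls)

# Nudge the intercept and slope; every nudged line has a larger sum of squares.
nudges = [-0.5, -0.1, 0.1, 0.5]
all_worse = all(sse(a_ls + da, b_ls + db) > best for da in nudges for db in nudges)
print(round(best, 3), all_worse)
```

A finite set of nudges is only a spot check, not a proof, but it matches what the figure conveys: moving the line away from the least squares fit makes the total area of the squares grow.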

Figure 8.2.2. A visualization of least squares regression using Desmos. Try out this and other interactive Desmos activities at openintro.org/ahss/desmos.

Subsection 8.2.3 Finding the least squares line

For the Elmhurst College data, we could fit a least squares regression line for predicting gift aid based on a student's family income and write the equation as:

\begin{gather*} \widehat{\text{aid}} = a + b\times \text{family_income} \end{gather*}

Here $a$ is the $y$-intercept of the least squares regression line and $b$ is the slope of the least squares regression line. $a$ and $b$ are both statistics that can be calculated from the data. In the next section we will consider the corresponding parameters that these statistics attempt to estimate.

We can enter all of the data into a statistical software package and easily find the values of $a$ and $b\text{.}$ However, we can also calculate these values by hand, using only the summary statistics.

The slope of the least squares line is given by

\begin{gather} b = r\frac{s_y}{s_x}\label{slopeOfLSRLine}\tag{8.2.3} \end{gather}

where $r$ is the correlation between the variables $x$ and $y\text{,}$ and $s_x$ and $s_y$ are the sample standard deviations of $x\text{,}$ the explanatory variable, and $y\text{,}$ the response variable.

The point of averages $(\bar{x}, \bar{y})$ is always on the least squares line. Plugging this point in for $x$ and $y$ in the least squares equation and solving for $a$ gives

\begin{align*} \bar{y} & = a + b\bar{x} & & a=\bar{y}-b\bar{x} \end{align*}

Finding the slope and intercept of the least squares regression line.

The least squares regression line for predicting $y$ based on $x$ can be written as: $\hat{y}=a+bx\text{.}$

\begin{gather*} b=r\frac{s_y}{s_x} \qquad \bar{y} = a + b\bar{x} \end{gather*}

We first find $b\text{,}$ the slope, and then we solve for $a\text{,}$ the $y$-intercept.
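The two-step procedure can be sketched as a short function. This is a minimal illustration, assuming the five summary statistics are already known; the function name `line_from_summary` and the numbers passed to it are hypothetical.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    b = r * s_y / s_x        # slope, formula (8.2.3)
    a = y_bar - b * x_bar    # intercept, from the point of averages
    return a, b


# Hypothetical summary statistics, chosen only for illustration.
a, b = line_from_summary(x_bar=10, y_bar=50, s_x=2, s_y=8, r=0.5)
print(a, b)  # 30.0 2.0

# The point of averages lies on the resulting line.
print(a + b * 10 == 50)  # True
```

Computing the slope first matters because the intercept formula depends on it.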

Checkpoint 8.2.3.

Table 8.2.4 shows the sample means for family income and gift aid as \$101,800 and \$19,940, respectively. Plot the point $(101.8, 19.94)$ on Figure 8.2.1 to verify it falls on the least squares line (the solid line).²

² If you need help finding this location, draw a straight line up from the x-value of 100 (or thereabout). Then draw a horizontal line at 20 (or thereabout). These lines should intersect on the least squares line.


	family income, in \$1000s (“$x$”)	gift aid, in \$1000s (“$y$”)
mean	$\bar{x} = 101.8$	$\bar{y} = 19.94$
sd	$s_x = 63.2$	$s_y = 5.46$
correlation	$r=-0.499$

Table 8.2.4. Summary statistics for family income and gift aid.

Example 8.2.5.

Using the summary statistics in Table 8.2.4, find the equation of the least squares regression line for predicting gift aid based on family income.
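One way to carry out this computation, sketched here with rounded values, is to apply formula (8.2.3) for the slope and then use the point of averages to solve for the intercept:

\begin{gather*} b = r\frac{s_y}{s_x} = (-0.499)\frac{5.46}{63.2} \approx -0.0431 \\ a = \bar{y} - b\bar{x} \approx 19.94 - (-0.0431)(101.8) \approx 24.3 \end{gather*}

With both variables measured in thousands of dollars, this gives approximately $\hat{y} = 24.3 - 0.0431x\text{,}$ where $x$ is family income and $\hat{y}$ is predicted gift aid.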




	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	24.3193	1.2915	18.83	0.0000
family_income	-0.0431	0.0108	-3.98	0.0002
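The estimates in this output can be turned into a prediction function. The sketch below assumes the Estimate column gives the intercept and slope of the least squares line for the Elmhurst data, with both variables in thousands of dollars; the name `predict_aid` is hypothetical. It also checks numerically that the point of averages from Table 8.2.4 lies on the line, up to rounding in the reported estimates.

```python
# Estimates copied from the regression output (units: thousands of dollars).
INTERCEPT = 24.3193
SLOPE = -0.0431


def predict_aid(family_income):
    """Predicted gift aid (in $1000s) for a family income given in $1000s."""
    return INTERCEPT + SLOPE * family_income


# Sample means from Table 8.2.4.
x_bar, y_bar = 101.8, 19.94

predicted_at_mean = predict_aid(x_bar)
print(round(predicted_at_mean, 2))  # 19.93

# The small gap from 19.94 comes from rounding in the reported estimates.
print(abs(predicted_at_mean - y_bar) < 0.05)  # True
```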

