Math 225 Course Notes


Section 9.2: Regression


Key Concepts

Ordinary least squares (OLS) regression is a technique for finding the "best" line through a cloud of points: the line that makes the residual sum of squares as small as possible. Before giving equations for the slope and intercept of the regression line, we will study correlation, a measure of linear dependence between two variables. We demonstrate these ideas in an example.

Correlation

Correlation is a measure of linear dependence between two variables. The formula for correlation is

  r = (1 / (n-1)) * sum of [ (x - xbar) / sx ] [ (y - ybar) / sy ]

where the sum runs over all n data pairs, xbar and ybar are the sample means, and sx and sy are the sample standard deviations.

Notice that each variable is standardized, and that correlation is (almost) the average product of the standardized values for each pair of data values x and y. For every point where x and y are both above their means or below their means, there is a positive contribution to the sum. If one is above its mean while the other is below, there is a negative contribution to the sum. Thus, r will be positive if large x and large y tend to go together and small x and small y tend to go together. In contrast, r will be negative if small x and large y or large x and small y tend to go together.
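
As a concrete illustration, the following Python sketch computes r directly from this definition. The paired data here are hypothetical, and numpy is assumed to be available; numpy's built-in corrcoef serves as a check.

  import numpy as np

  def correlation(x, y):
      # Standardize each variable, then take the (almost) average
      # product of the standardized values. Dividing by n - 1 rather
      # than n matches the sample standard deviations (ddof=1).
      x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
      zx = (x - x.mean()) / x.std(ddof=1)
      zy = (y - y.mean()) / y.std(ddof=1)
      return np.sum(zx * zy) / (len(x) - 1)

  # Hypothetical paired data: large x tends to go with large y,
  # so r should be positive and close to 1.
  x = [161, 170, 176, 183, 205]
  y = [55, 66, 71, 78, 95]
  print(correlation(x, y))
  print(np.corrcoef(x, y)[0, 1])   # numpy's built-in agrees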

The correlation coefficient satisfies these properties:

  1. -1 <= r <= 1
  2. r = -1 if and only if the data falls exactly on a line with negative slope
  3. r = 1 if and only if the data falls exactly on a line with positive slope
  4. If r is close to -1 or 1, the data is tightly clustered around a line.
  5. The closer r is to 0, the less clustered the data is around a line.
  6. There may be a strong nonlinear relationship between two variables with r close to 0. r only measures the linear relationship.
  7. Computation of r does not depend on the units chosen for each variable. (It only depends on the standardized values of each variable.) Thus, r is a unitless quantity.

Ordinary Least Squares

For any line drawn through a scatter of points, define a residual for each point as the positive or negative vertical distance from the point to the line. Points above the line have positive residuals and points below the line have negative residuals. Each line has a residual sum of squares (RSS), the sum of the squares of each of these vertical distances. Clearly, a line fits well when this RSS is small. Theory shows that there is a unique line that makes this RSS as small as possible. This line is the OLS regression line.
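
To make the RSS concrete, here is a minimal Python sketch (with hypothetical data) that computes the RSS for any candidate line and shows that perturbing the least-squares line can only increase it:

  import numpy as np

  def rss(a, b, x, y):
      # Sum of squared vertical distances from the points
      # to the line y = a + b x.
      residuals = y - (a + b * x)
      return np.sum(residuals ** 2)

  # Hypothetical data
  x = np.array([161.0, 170.0, 176.0, 183.0, 205.0])
  y = np.array([55.0, 66.0, 71.0, 78.0, 95.0])

  # numpy's degree-1 polynomial fit is the least-squares line
  b_ols, a_ols = np.polyfit(x, y, 1)

  print(rss(a_ols, b_ols, x, y))        # the minimum RSS
  print(rss(a_ols + 1, b_ols, x, y))    # any other line does worse
  print(rss(a_ols, b_ols * 1.1, x, y))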

The Regression Line

The regression line has the equation
   y = a + bx
a is the y intercept and b is the slope. The textbook gives rather messy formulas for calculating a and b from data. Simpler equations are
  b = r (sy / sx)
  a = ybar - b xbar

In practice, a computer should be used to fit a regression line.
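
For instance, the fit takes one line with numpy, and the simpler equations above give the same answer; a sketch (with hypothetical data again):

  import numpy as np

  x = np.array([161.0, 170.0, 176.0, 183.0, 205.0])  # hypothetical
  y = np.array([55.0, 66.0, 71.0, 78.0, 95.0])

  # General-purpose fit: returns the slope b and intercept a
  b, a = np.polyfit(x, y, 1)
  print(a, b)

  # The simpler equations give the same line
  r = np.corrcoef(x, y)[0, 1]
  b2 = r * (y.std(ddof=1) / x.std(ddof=1))
  a2 = y.mean() - b2 * x.mean()
  print(a2, b2)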

Example

This is an example from exercise 9.3.6 in the textbook. The data is the heights (cm) and weights (kg) of 14 young men. We wish to use the heights to predict the weights. A portion of the output of a regression analysis from a computer package is:
Coefficients:
               Value Std. Error  t value Pr(>|t|) 
(Intercept) -60.6216  40.4924    -1.4971   0.1602
     height   0.7547   0.2289     3.2966   0.0064
In this output, a = -60.6216 and b = 0.7547.

The mean and standard deviation of the heights are 176.5 cm and 10.9 cm respectively. The mean and standard deviation of the weights are 72.6 kg and 11.9 kg respectively. The correlation between the two variables is r = 0.69. Check that these values reproduce the output of the computer package (within roundoff error).
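
The check amounts to plugging the summary statistics into the simpler equations; a quick sketch in plain Python:

  xbar, sx = 176.5, 10.9   # mean and sd of the heights (cm)
  ybar, sy = 72.6, 11.9    # mean and sd of the weights (kg)
  r = 0.69

  b = r * (sy / sx)        # 0.753..., vs. 0.7547 in the output
  a = ybar - b * xbar      # -60.36..., vs. -60.6216 in the output
  print(b, a)              # differences are roundoff (r, the means,
                           # and the sds are themselves rounded)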

Problem 1:

Use the OLS line to predict the weight of a young man who is 180 cm tall.

Solution:

  y = a + bx = -60.6216 + 0.7547(180) = 75.2 kg
Note that the individual's height is (180 - 176.5) / 10.9 = 0.32 standard deviations above the mean. We predict his weight to be (0.32)r standard deviations above the mean weight, or (0.32)(0.69)(11.9) + 72.6 = 75.2 kg.
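
Both routes to the prediction are easy to verify numerically; a quick sketch:

  a, b = -60.6216, 0.7547         # from the computer output
  xbar, sx = 176.5, 10.9
  ybar, sy = 72.6, 11.9
  r = 0.69

  print(a + b * 180)              # direct route: about 75.2 kg

  z = (180 - xbar) / sx           # 0.32 sds above the mean height
  print(ybar + r * z * sy)        # standardized route: about 75.2 kg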

Problem 2:

Give a 95% confidence interval for the population slope.

Solution:

We assume here that in the population of interest, the mean weight of all young men of the same height is well described by a linear function of the height. The slope of our OLS regression line is an estimate of the slope of this population regression line. We construct a confidence interval in the same way as usual.

  (estimate) +/- (reliability coefficient)(standard error)
Since the standard error is estimated from data, we use a reliability coefficient from a t-distribution. For regression, the number of degrees of freedom is n-2, since we need to estimate two parameters to find our OLS line. The general formula is
  b +/- t* SE(b)
We get the values of b and its SE from the computer output and the value of t* from a t table. In this case, there are 12 degrees of freedom.
  0.7547 +/- (2.1788)(0.2289)
or
  0.7547 +/- 0.4987
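
If scipy is available, the interval can be reproduced in a few lines; a minimal sketch:

  from scipy import stats

  b, se_b = 0.7547, 0.2289          # from the computer output
  df = 12                           # n - 2 with n = 14

  t_star = stats.t.ppf(0.975, df)   # 2.1788 for 95% confidence
  print(b - t_star * se_b, b + t_star * se_b)   # about (0.256, 1.253)
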
Problem 3:

Test the hypothesis that the slope of the population regression line is 0. (In other words, there is no predictive value in height for predicting weight.)

Solution:

The null hypothesis is that the slope is 0. Under this hypothesis, standardize the estimate of the slope and compare it to a t-distribution with 12 degrees of freedom.

  t = (0.7547 - 0) / 0.2289 = 3.297
This agrees with the t statistic included in the computer output. The (2-sided) p-value is reported as 0.0064, which is twice the area to the right of 3.297 under a t-distribution with 12 degrees of freedom. A p-value this low indicates strong evidence against the null hypothesis.
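
The p-value can be reproduced from a t-distribution routine; a sketch using scipy:

  from scipy import stats

  t_stat = 0.7547 / 0.2289                # 3.297, as in the output
  p_value = 2 * stats.t.sf(t_stat, 12)    # twice the upper-tail area
  print(t_stat, p_value)                  # about 0.0064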

Regression Warnings

It is very easy to have a computer churn out regression analyses. It takes some care to see that common sense prevails in their interpretation.

Always plot your data. It is not always true that the relationship between two variables is well described by a straight line. A computer package will fit a line as best it can, whether it makes sense to or not. You need to check by plotting your data.
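
A plot takes only a few lines; here is a minimal matplotlib sketch (hypothetical data) that overlays the fitted line on the scatter:

  import numpy as np
  import matplotlib.pyplot as plt

  x = np.array([161.0, 170.0, 176.0, 183.0, 205.0])  # hypothetical
  y = np.array([55.0, 66.0, 71.0, 78.0, 95.0])

  b, a = np.polyfit(x, y, 1)          # OLS slope and intercept

  plt.scatter(x, y)                   # the raw data
  xs = np.linspace(x.min(), x.max(), 100)
  plt.plot(xs, a + b * xs)            # the fitted line
  plt.xlabel("height (cm)")
  plt.ylabel("weight (kg)")
  plt.show()   # look for curvature, outliers, or other departures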

Extrapolation is dangerous. The relationship between two variables might be linear in the range of the data you have collected, but nonlinear outside of this range. The data in the previous example came from a sample of young men whose heights ranged from 161 cm to 205 cm. We shouldn't expect the same relationship to hold for people of different heights (or different age or gender, for that matter). There is nothing but common sense to prevent one from using the equation to predict the weight of a three-year-old boy who is 100 cm tall to be 14.8 kg, but we should be very skeptical of the accuracy of such a prediction, and not be misled by predictions that are clearly nonsensical.

Association is not causation. The fact that two variables are related in a linear fashion with a slope that is significantly non-zero does not in itself imply that one variable is the direct cause of the other. Sales of ice cream are positively correlated with the incidence of malaria. This is not evidence that ice cream consumption causes malaria, but merely an artifact of the dependence both variables have on warm weather.



Last modified: April 17, 1996

Bret Larget, larget@mathcs.duq.edu