Math 225

Introduction to Biostatistics


Notes from Lecture #20

Regression is a statistical tool for building a model to predict a response variable on the basis of one or more explanatory variables. In this course we will restrict attention to a single quantitative response variable and one or more quantitative explanatory variables.

To understand regression, we first need to understand how to measure the relationship between two quantitative variables.

Correlation
The correlation coefficient, r, is a measure of the strength of the linear relationship between two continuous variables. An intuitive formula for r is

r = sum( ((x - xbar)/sx) * ((y - ybar)/sy) ) / (n-1)

where xbar and sx are the sample mean and standard deviation of the x variable and ybar and sy are the sample mean and standard deviation of the y variable.

Notice that each individual measurement is standardized to its z-score: subtract the mean and divide by the standard deviation. The correlation coefficient r is almost the average of the products of the z-scores for the x and y measurements of each data point (almost, because the sum is divided by n-1 rather than n).
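
As a concrete illustration, here is a short Python sketch (using a small made-up data set) that computes r exactly as the formula describes, as a sum of products of z-scores divided by n-1:

  # Correlation coefficient r from z-scores (small made-up data set).
  import math

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]
  n = len(x)

  xbar = sum(x) / n
  ybar = sum(y) / n
  # Sample standard deviations (divide by n - 1).
  sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
  sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

  # Sum the products of z-scores, then divide by n - 1.
  r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
          for xi, yi in zip(x, y)) / (n - 1)
  print(r)   # close to 1: the points lie nearly on a line with positive slope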

The z-score form of the formula implies a couple of things. First, because z-scores are unitless, the correlation coefficient is unitless as well. If one variable is length, it does not matter if length is measured in inches, feet, centimeters, or kilometers - r would not change. Second, terms will be positive when either both z-scores are positive or both z-scores are negative. Terms will be negative when one z-score is positive and the other is negative. The sign of the correlation coefficient is therefore a measure of the association between the two variables.

Association
We say that two variables x and y are positively associated if when x is large relative to its mean, y also tends to be large, and when x is small relative to its mean, y also tends to be small. Two variables are negatively associated when big x tends to go with small y and vice versa. A scatter plot of positively associated variables will have many points in the upper right and lower left. A scatter plot of negatively associated variables will have many points in the lower right and upper left. Data that is positively associated will have a positive correlation coefficient and data that is negatively associated will have a negative correlation coefficient. This follows because products of corresponding z-scores will tend to be positive for positively associated data (+ times + or - times -) and negative for negatively associated data (+ times - or - times +).

Correlation coefficient facts
  1. -1 <= r <= 1
  2. r = -1 if and only if the data lie exactly on a line with a negative slope.
  3. r = 1 if and only if the data lie exactly on a line with a positive slope.
  4. The correlation coefficient measures only the strength of the linear relationship.

The last point requires some elaboration. A correlation coefficient near 0 does not by itself imply that two variables are unrelated. It could be the case that there is a strong nonlinear relationship between the variables. Furthermore, a correlation coefficient near -1 or 1 does not, by itself, imply that a linear relationship is most appropriate. A nonlinear curve could be a better description of the relationship.
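
To make the first of these points concrete, here is a small hypothetical example in Python: data lying exactly on a parabola, where y is completely determined by x, yet r works out to 0.

  # Points on the parabola y = x^2, symmetric about 0 (made-up data).
  x = [-2.0, -1.0, 0.0, 1.0, 2.0]
  y = [xi ** 2 for xi in x]
  n = len(x)
  xbar = sum(x) / n
  ybar = sum(y) / n
  num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
  den = (sum((xi - xbar) ** 2 for xi in x)
         * sum((yi - ybar) ** 2 for yi in y)) ** 0.5
  print(num / den)   # 0.0: no linear relationship, despite a perfect nonlinear one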

The value of r^2 is often used as a summary statistic as well. It may be interpreted as the proportion of the variability in y that is explained by the regression. When r = -1 or r = 1, r^2 = 1 and all of the variability in y is explained by x. In other words, there is no variability around the regression line. A large r^2 value implies that the data is more tightly clustered around a line with some nonzero slope than around the horizontal line y = ybar.
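
As a numerical check of this interpretation, the sketch below (made-up data again, using the slope and intercept formulas that appear in the sections below) verifies that r^2 equals one minus the ratio of the variability around the line to the total variability in y:

  # Verify: r^2 = 1 - (variability around the line) / (total variability in y).
  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]
  n = len(x)
  xbar = sum(x) / n
  ybar = sum(y) / n
  sxx = sum((xi - xbar) ** 2 for xi in x)
  syy = sum((yi - ybar) ** 2 for yi in y)
  sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

  r = sxy / (sxx * syy) ** 0.5
  b1 = sxy / sxx                 # least squares slope
  b0 = ybar - b1 * xbar          # least squares intercept
  ss_line = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

  print(r ** 2)                  # r squared
  print(1 - ss_line / syy)       # the same number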

Residuals
We will draw the response variable (dependent variable) on the y axis and the explanatory variable (independent variable) on the x axis. For any line drawn through the points, the vertical distance from a point to the line is called a residual. Residuals are positive for points above the line and negative for points below the line.

The Criterion of Least Squares
Lines which are good descriptions of the relationship between two variables will tend to have small residuals, whereas lines that give poor fits have some large residuals (in absolute value). One particular criterion for choosing a "best line" is to make the sum of all of the squared residuals as small as possible. This line is called the least squares line. Notice that because residuals are measured vertically, it matters quite a bit which variable is designated x and which is y.
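
As an illustration of the criterion (again with made-up data), the sketch below compares the sum of squared residuals for the least squares line against a slightly shifted line; no other line can do better than the least squares line:

  # Sum of squared vertical residuals for a candidate line y = b0 + b1*x.
  def ss_residuals(x, y, b0, b1):
      return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]

  # Least squares slope and intercept (the formulas are given just below).
  n = len(x)
  xbar = sum(x) / n
  ybar = sum(y) / n
  b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))
  b0 = ybar - b1 * xbar

  print(ss_residuals(x, y, b0, b1))        # the smallest possible value
  print(ss_residuals(x, y, b0 + 0.5, b1))  # any other line does worse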

The least squares line is

yhat = b0 + b1 x

where b0 is the intercept and b1 is the slope. The slope and intercept are determined by the mean and standard deviation of x and y and the correlation coefficient.

b1 = r * (s_y / s_x)

b0 = ybar - b1 * xbar
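
In Python, computing the line from summary statistics is just these two formulas. The numbers below are illustrative, not from any real data set:

  # Least squares line from summary statistics (made-up illustrative values).
  xbar, sx = 10.0, 2.0     # mean and standard deviation of x
  ybar, sy = 50.0, 6.0     # mean and standard deviation of y
  r = 0.8                  # correlation coefficient

  b1 = r * sy / sx         # slope = 2.4
  b0 = ybar - b1 * xbar    # intercept = 26.0

  def yhat(x):
      # Predicted y value for a given x.
      return b0 + b1 * x

  print(b1, b0)            # slope 2.4, intercept 26.0 (up to rounding)
  print(yhat(12.0))        # about 54.8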

Predicted values and z-scores
There is another way to look at the least squares regression line. Let's calculate the predicted y value if the initial x value is z standard deviations from the mean (x = xbar + z*sx, so that its z-score is z). We simply substitute the expressions for b0 and b1.

yhat = b0 + b1 * x
     = (ybar - b1 * xbar) + b1 * (xbar + z * sx)
     = ybar + b1 * z * sx
     = ybar + (r * sy / sx) * z * sx
     = ybar + r * z * sy

Notice this says that if x is z standard deviations from the mean, the predicted y will be rz standard deviations from the mean.

In particular, the predicted value when x is xbar is ybar.
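
A quick numerical check (reusing the made-up summary statistics from the sketch above) confirms that the two forms of the prediction agree:

  # Two equivalent ways to compute the predicted y.
  xbar, sx, ybar, sy, r = 10.0, 2.0, 50.0, 6.0, 0.8

  b1 = r * sy / sx
  b0 = ybar - b1 * xbar

  z = 1.5                    # x is 1.5 standard deviations above its mean
  x = xbar + z * sx          # x = 13.0

  print(b0 + b1 * x)         # 57.2, from the regression line directly
  print(ybar + r * z * sy)   # 57.2, from the z-score form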

Simple Linear Regression
Simple linear regression uses the least squares line to predict the value of y for each x. You may think of the line as representing the average y value for all individuals in a population with a given x value. The regression line is an estimate of the "true" line based on sample data. Note that the method of least squares will always provide a line, even when a nonlinear curve would be more appropriate. The correlation coefficient alone cannot be used as a measure of the appropriateness of a linear fit. It is necessary to plot the data to ascertain whether a linear fit is appropriate.

Regression Diagnostics
Often the relationship between two quantitative variables should not be summarized with a straight line; some nonlinear relationship may describe it better. In addition, the methods of statistical inference in a regression framework assume that the variance is constant for different values of the explanatory x variable. Often, a plot of the data itself makes it clear when a nonlinear fit is more appropriate, or when nonconstant variance is a potential problem. However, a plot of residuals versus the fitted values (or the original x values) makes it easier to see these potential problems.

Patterns in a residual plot indicate nonlinearity. For example, if the residuals tend to be positive for low x values, negative for middle x values, and positive again for high x values, this indicates that the data is scattered around a curve that is concave up.

When the size of the residuals tends to increase with the size of the fitted values, this indicates that the variance is related to the explanatory variable. A common solution is to transform the response variable. Perhaps log(y) has a linear relationship with x, with variance that is more nearly constant as x changes.
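
As a rough sketch of these diagnostics in Python (the data below are made up to lie on a curve that is concave up), the residuals from a straight-line fit show exactly the positive, negative, positive pattern described above, and taking log(y) straightens the relationship out:

  import math

  # Made-up data following an exponential (concave up) curve.
  x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
  y = [math.exp(0.4 * xi) for xi in x]

  def fit_line(x, y):
      # Least squares intercept and slope.
      n = len(x)
      xbar = sum(x) / n
      ybar = sum(y) / n
      b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
      return ybar - b1 * xbar, b1

  b0, b1 = fit_line(x, y)
  # Residuals versus fitted values: the signs run +, +, -, -, -, -, -, +,
  # a clear sign that a straight line is the wrong model.
  for xi, yi in zip(x, y):
      fitted = b0 + b1 * xi
      print(round(fitted, 2), round(yi - fitted, 2))

  # Refitting with log(y) in place of y removes the pattern.
  b0, b1 = fit_line(x, [math.log(yi) for yi in y])
  print(b0, b1)   # essentially 0 and 0.4: log(y) is linear in x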

In class, we examined my son Riley's growth over time.

A plot of the data shows that fitting a single straight line to all of the data is unwarranted. For the first year and a half of his life, his height increases rapidly. At some point between eighteen months and two years, there is a noticeable change in his growth rate. But from age 2 to age 8, the data appears to be fairly linear.

If we restrict attention to ages 24 months and higher, we have this summary data. x is Riley's age in months, y is his height in inches, and n is the number of times this data is recorded.

n = 16, xbar = 61.6 months, sx = 21.9 months, ybar = 45.7 inches, sy = 5.5 inches, and r = 0.999. If we plug in these values, we find the least squares regression line:

b1 = r * sy / sx = (0.999)(5.5)/(21.9) = 0.25 inches per month.

b0 = ybar - b1 * xbar = 45.7 - 0.25*61.6 = 30.3 inches.

The slope is generally the more important of the two coefficients. It has units (y units over x units). In this case, we can say that Riley has been growing about a quarter inch per month, or about three inches per year.

The intercept can be interpreted as the predicted value when x = 0. This interpretation is only valid when 0 is in the range of the measured data. In this problem, 0 is a meaningful x value (the time of birth), but the prediction is ridiculous - no babies are over thirty inches long at birth. The use of a regression line well outside its range of applicability is called extrapolation.
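
A short Python sketch ties the whole example together, using the summary statistics reported above; the prediction at a plausible age is sensible, while the extrapolation to x = 0 is not:

  # Riley's growth, ages 24 months and up, from the summary statistics above.
  xbar, sx = 61.6, 21.9    # age in months
  ybar, sy = 45.7, 5.5     # height in inches
  r = 0.999

  b1 = r * sy / sx         # about 0.25 inches per month
  b0 = ybar - b1 * xbar    # about 30.3 inches

  def predicted_height(age_months):
      return b0 + b1 * age_months

  print(predicted_height(72))   # age 6: roughly 48 inches, a reasonable prediction
  print(predicted_height(0))    # extrapolation: an absurd birth length over 30 inches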

You are responsible for the concepts presented above. You should know how to find a least squares regression line from the five summary statistics. You should be able to make a prediction with a regression line and to know when the prediction is invalid because of extrapolation.


Last modified: May 2, 2001

Bret Larget, larget@mathcs.duq.edu