Math 225

Introduction to Biostatistics


Multiple Regression Example

We will examine the same FEV data set you saw in lab.

We will be using this data to illustrate several concepts in multiple regression.

Fit #1

Begin by fitting the simplest possible model that uses two explanatory variables.

FEV = b0 + b1(height) + b2(age)
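A model of this form can be fit by ordinary least squares. Here is a minimal numpy sketch; the real FEV data are not reproduced here, so it simulates hypothetical data from the fitted equation below (the variable ranges are assumptions, not values from the data set):

```python
import numpy as np

# Hypothetical illustration only: simulate data that roughly match the
# fitted equation reported in the text (coefficient values from the output).
rng = np.random.default_rng(0)
n = 200
height = rng.uniform(46, 74, n)   # inches (assumed range)
age = rng.uniform(3, 19, n)       # years (assumed range)
fev = -4.6105 + 0.1097 * height + 0.0543 * age + rng.normal(0, 0.42, n)

# Ordinary least squares: design matrix with an intercept column.
X = np.column_stack([np.ones(n), height, age])
beta, *_ = np.linalg.lstsq(X, fev, rcond=None)
print(np.round(beta, 4))  # estimates of b0, b1, b2
```

Because the simulated data were generated from the model, the least-squares estimates land close to the coefficients used to generate them.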

Here is the output.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -4.6105   0.2243   -20.5576   0.0000
         ht   0.1097   0.0047    23.2628   0.0000
        age   0.0543   0.0091     5.9609   0.0000

Residual standard error: 0.4197 on 651 degrees of freedom
Multiple R-Squared: 0.7664

Look at a plot of the residuals versus fitted values.

Notice that there are patterns in the residual plot.

  1. There is a curve. This indicates that the relationship is nonlinear, so a straight-line fit is inadequate.
  2. The spread of the residuals increases as the fitted values increase. This indicates heteroscedasticity (unequal spread).

Fit #2

Now try transforming the response variable by taking logarithms. This often helps when the spread increases with fitted values (and sometimes gets rid of nonlinearity problems as well).
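To see why this helps, here is a hypothetical numpy sketch: the data are simulated with error that is multiplicative on the original scale (so the spread grows with the mean), and taking logarithms turns it into additive error with constant spread, which is what ordinary least squares assumes.

```python
import numpy as np

# Hypothetical illustration: simulate FEV with multiplicative error, so the
# spread increases with the mean on the raw scale but is constant on the log scale.
rng = np.random.default_rng(1)
n = 200
height = rng.uniform(46, 74, n)   # inches (assumed range)
age = rng.uniform(3, 19, n)       # years (assumed range)
log_fev = -1.9711 + 0.0440 * height + 0.0198 * age + rng.normal(0, 0.15, n)
fev = np.exp(log_fev)

# Fit the model with log(FEV) as the response.
X = np.column_stack([np.ones(n), height, age])
beta, *_ = np.linalg.lstsq(X, np.log(fev), rcond=None)
print(np.round(beta, 4))
```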

Here is the output.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -1.9711   0.0783   -25.1639   0.0000
         ht   0.0440   0.0016    26.7059   0.0000
        age   0.0198   0.0032     6.2305   0.0000
 
Residual standard error: 0.1466 on 651 degrees of freedom
Multiple R-Squared: 0.8071

Also, look at the residual plot.

Notice that there are no obvious patterns. The residuals have similar spread for all fitted values and there are no trends.

Fit #3

We could also see if we could do better by adding a quadratic term for age.
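A quadratic term is just one more column, age squared, in the design matrix. The following hypothetical sketch (simulated data with no true quadratic effect) also computes the usual t statistics, each coefficient divided by its standard error, from the least-squares fit:

```python
import numpy as np

# Hypothetical illustration: simulate log(FEV) data with NO quadratic age
# effect, then fit a model that includes age^2 and compute t statistics.
rng = np.random.default_rng(2)
n = 200
height = rng.uniform(46, 74, n)   # inches (assumed range)
age = rng.uniform(3, 19, n)       # years (assumed range)
y = -1.9711 + 0.0440 * height + 0.0198 * age + rng.normal(0, 0.15, n)

X = np.column_stack([np.ones(n), height, age, age**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
df = n - X.shape[1]                      # residual degrees of freedom
s2 = resid @ resid / df                  # residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se
print(np.round(t, 2))
```

Since the simulated data contain no quadratic effect, the t statistic for the age^2 column will typically be small, mirroring what happens in the real fit below.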

Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -1.9809   0.0801   -24.7360   0.0000
         ht   0.0435   0.0018    23.6722   0.0000
        age   0.0271   0.0128     2.1239   0.0341
   I(age^2)  -0.0003   0.0005    -0.5915   0.5544
 
Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8072

Notice that the p-value for the quadratic term is very large. This extra term did not add much to the quality of the fit.

Fit #4

We could also see if we could do better by adding an interaction term between age and height.
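An interaction term is likewise just a product column, height times age, appended to the design matrix. A hypothetical sketch, again on simulated data with no true interaction:

```python
import numpy as np

# Hypothetical illustration: simulate log(FEV) data with NO interaction,
# then fit a model that includes a height-by-age product column.
rng = np.random.default_rng(3)
n = 200
height = rng.uniform(46, 74, n)   # inches (assumed range)
age = rng.uniform(3, 19, n)       # years (assumed range)
y = -1.9711 + 0.0440 * height + 0.0198 * age + rng.normal(0, 0.15, n)

X = np.column_stack([np.ones(n), height, age, height * age])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 4))  # last entry is the estimated interaction effect
```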

Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -1.9666   0.1878   -10.4733   0.0000
         ht   0.0439   0.0032    13.6514   0.0000
        age   0.0193   0.0207     0.9322   0.3516
     ht:age   0.0000   0.0003     0.0267   0.9787
 
Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8071 

Notice that the interaction term does not add much value (p = 0.9787).

Among these four models, the best is Fit #2.

Fit #5

We can also add categorical variables to the multiple regression. The variable sex can be represented by a numerical indicator (dummy) variable with one value for males and another for females. Smoking status can be coded the same way.
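In matrix terms, each categorical variable becomes a 0/1 indicator column in the design matrix. A hypothetical sketch, assuming (as an illustration only, not the data set's actual coding) that sex is 1 for males and 0 for females, and smoke is 1 for smokers:

```python
import numpy as np

# Hypothetical illustration: 0/1 indicator columns for sex and smoking status.
rng = np.random.default_rng(4)
n = 200
height = rng.uniform(46, 74, n)    # inches (assumed range)
age = rng.uniform(3, 19, n)        # years (assumed range)
sex = rng.integers(0, 2, n)        # 1 = male, 0 = female (assumed coding)
smoke = rng.integers(0, 2, n)      # 1 = smoker, 0 = nonsmoker (assumed coding)
y = (-1.9524 + 0.0428 * height + 0.0234 * age
     - 0.0230 * smoke + 0.0147 * sex + rng.normal(0, 0.15, n))

X = np.column_stack([np.ones(n), height, age, smoke, sex])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 4))  # intercept, height, age, smoke, sex effects
```

The coefficient on an indicator column is the estimated shift in mean log(FEV) between the two groups, holding the other variables fixed.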

Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)  -1.9524   0.0807   -24.1811   0.0000
         ht   0.0428   0.0017    25.4893   0.0000
        age   0.0234   0.0033     6.9845   0.0000
      smoke  -0.0230   0.0105    -2.2031   0.0279
        sex   0.0147   0.0059     2.5020   0.0126
 
Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-Squared: 0.8106 

All of these variables appear to be important because they have small p-values.


Last modified: April 5, 2001

Bret Larget, larget@mathcs.duq.edu