# Math 225

## Introduction to Biostatistics

### Multiple Regression Example

We will examine the same FEV data set you saw in lab. The data is here.

We will be using this data to illustrate several concepts in multiple regression.

#### Fit #1

Begin by fitting the simplest possible model that uses two explanatory variables.

FEV = b0 + b1(height) + b2(age)
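A model of this form could be fit in R (or S-Plus, which produced the output below) with `lm`. This is a sketch: the data frame name `fevdata` is an assumption, while the column names `fev`, `ht`, and `age` follow the variable names in the output.

```r
# Fit #1: FEV as a linear function of height and age.
# Assumes a data frame `fevdata` with columns fev, ht, and age.
fit1 <- lm(fev ~ ht + age, data = fevdata)
summary(fit1)

# Residuals versus fitted values, to check for patterns.
plot(fitted(fit1), resid(fit1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```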

Here is the output.

```
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -4.6105     0.2243 -20.5576   0.0000
         ht  0.1097     0.0047  23.2628   0.0000
        age  0.0543     0.0091   5.9609   0.0000

Residual standard error: 0.4197 on 651 degrees of freedom
Multiple R-Squared: 0.7664
```

Look at a plot of the residuals versus fitted values.

Notice that there are patterns in the residual plot.

There is a curve. This suggests that the relationship is nonlinear.
The spread increases as the fitted values increase.

#### Fit #2

Now try transforming the response variable by taking logarithms.
This often helps when the spread increases with fitted values
(and sometimes gets rid of nonlinearity problems as well).
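This transformed model could be fit the same way, applying `log` to the response inside the formula. As before, the data frame name `fevdata` is an assumption.

```r
# Fit #2: log-transform the response to stabilize the spread.
fit2 <- lm(log(fev) ~ ht + age, data = fevdata)
summary(fit2)

# Check the residual plot again after the transformation.
plot(fitted(fit2), resid(fit2),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```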

Here is a summary of the fit.

```
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9711     0.0783 -25.1639   0.0000
         ht  0.0440     0.0016  26.7059   0.0000
        age  0.0198     0.0032   6.2305   0.0000

Residual standard error: 0.1466 on 651 degrees of freedom
Multiple R-Squared: 0.8071
```

Also, look at the residual plot.

Notice that there are no obvious patterns.
The residuals have similar spread for all fitted values
and there are no trends.

#### Fit #3

We could also see if we could do better by adding a quadratic term for age.
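The `I(age^2)` term in the output below indicates how the quadratic term enters the formula. A sketch, again assuming a data frame named `fevdata`:

```r
# Fit #3: add a quadratic term in age.
# I() protects ^ so it is treated as arithmetic, not formula syntax.
fit3 <- lm(log(fev) ~ ht + age + I(age^2), data = fevdata)
summary(fit3)
```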

Here is a summary of the fit.

```
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9809     0.0801 -24.7360   0.0000
         ht  0.0435     0.0018  23.6722   0.0000
        age  0.0271     0.0128   2.1239   0.0341
   I(age^2) -0.0003     0.0005  -0.5915   0.5544

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8072
```

Notice that the p-value for the quadratic term is very large.
This extra term did not add much to the quality of the fit.

#### Fit #4

We could also see if we could do better by adding an interaction term between age and height.
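The `ht:age` term in the output below is the formula syntax for an interaction. A sketch, assuming the same `fevdata` data frame:

```r
# Fit #4: add an age-by-height interaction.
# ht:age is the product term; ht and age main effects stay in the model.
fit4 <- lm(log(fev) ~ ht + age + ht:age, data = fevdata)
summary(fit4)
```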

Here is a summary of the fit.

```
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9666     0.1878 -10.4733   0.0000
         ht  0.0439     0.0032  13.6514   0.0000
        age  0.0193     0.0207   0.9322   0.3516
     ht:age  0.0000     0.0003   0.0267   0.9787

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8071
```

Notice that the interaction term does not add much of value.

Our best model was Fit #2.

#### Fit #5

We can also add categorical variables to the multiple regression.
The variable sex can be represented by a numerical indicator variable
with one value for male and another for female.
Smoking status can be coded the same way.
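With sex and smoking status coded numerically, they enter the formula just like the other predictors. A sketch, assuming `fevdata` has indicator columns `smoke` and `sex`:

```r
# Fit #5: add the categorical predictors smoke and sex,
# each coded as a numerical indicator variable.
fit5 <- lm(log(fev) ~ ht + age + smoke + sex, data = fevdata)
summary(fit5)
```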

Here is a summary of the fit.

```
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9524     0.0807 -24.1811   0.0000
         ht  0.0428     0.0017  25.4893   0.0000
        age  0.0234     0.0033   6.9845   0.0000
      smoke -0.0230     0.0105  -2.2031   0.0279
        sex  0.0147     0.0059   2.5020   0.0126

Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-Squared: 0.8106
```

All of these variables appear to be important
because they have small p-values.