Math 225

Introduction to Biostatistics


Multiple Regression

Assignment 11

Prerequisites

This lab assumes that you already know how to:
  1. Log in, find the course Web page, and run S-PLUS
  2. Use the Commands Window to execute commands
  3. Load data sets

Technical Objectives

This lab will teach you to:
  1. use S-PLUS to plot bivariate data with a scatterplot
  2. use S-PLUS to fit a multiple regression model
  3. use S-PLUS to make residual plots

Conceptual Objectives

In this lab you should begin to understand:
  1. how to interpret residual plots to determine whether fitting a linear model is justified
  2. the relationship between correlation and causation

Multiple Regression

In this lab, you will use multiple regression to examine the relationship between life expectancy and the prevalence of televisions and physicians.

In-class Activities

The World Almanac contains a great deal of information about different countries in the world. In this class we will look at the response variable life expectancy (an average of male and female life expectancy in years) and two explanatory variables, prevalence of televisions (number of people per tv) and prevalence of physicians (number of people per physician) in many countries.

  1. Which variable do you expect will be more closely related to life expectancy? Predict whether the association between each explanatory variable and the response variable is positive or negative.

  2. Load in the Life Expectancy data set.

    Attach the data set by typing

    > attach(television)
    
    in the Commands Window.
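
    To confirm that the attach worked, it can help to list the variables in the data set. This is an optional check, and it assumes the data set was loaded under the name television and contains the variables life, tv, and phys used in the steps below:

    > names(television)

    > summary(television)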

  3. Make a plot with tv on the x axis and life on the y axis.
    > plot(tv,life)
    
    Is the association positive or negative? Is this what you expected?

    Does it look like a linear relationship is adequate, or is a nonlinear relationship better?

    If a linear relationship is inadequate, try both a reciprocal and a log transformation to see which is better. The reciprocal would be televisions per person.

    > plot(1/tv,life)
    
    > plot(log(tv),life)
    

    Do the same for the physician variable.

    > plot(phys,life)
    
    Is there a negative or positive association?
    > plot(1/phys,life)
    
    > plot(log(phys),life)
    
    Which transformation makes the relationship with life expectancy most linear?
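
    One way to compare the three versions of a variable is to draw the plots side by side in one graphics window. The following is a sketch for the television variable, assuming the data set is attached. The first par command splits the window into a 1-by-3 grid, and the last one restores a single plot per window:

    > par(mfrow=c(1,3))
    > plot(tv,life)
    > plot(1/tv,life)
    > plot(log(tv),life)
    > par(mfrow=c(1,1))

    The same commands with phys in place of tv give the comparison for the physician variable.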

  4. Use S-PLUS to fit a model with both log(tv) and log(phys) as explanatory variables.

    1. Use your mouse to select Statistics:Regression:Linear....
    2. In the Formula box type
      life ~ log(tv) + log(phys)
      
      This means "life expectancy in years is modeled as a linear function of log(tv) and log(phys)". An intercept is included by default.
    3. Click on the Plots tab.
    4. Click on the plot Residuals versus Fitted Values.
    5. Click on OK.
    6. Read the Report Window and look at the graphs.

    Examine the residual plot. Do you see much of a pattern?
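
    The same model can probably also be fit from the Commands Window instead of the menus. This is a sketch, not a required step. The name fit is arbitrary, and the commands assume the data set is still attached. If the data set has missing values, lm may also need the argument na.action=na.omit:

    > fit <- lm(life ~ log(tv) + log(phys))
    > summary(fit)
    > plot(fitted(fit), residuals(fit))
    > abline(h=0)

    The last two commands plot the residuals against the fitted values and add a horizontal reference line at zero.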

    In the Report Window, there will be a table labeled "Coefficients" with the fitted parameter values.

    Coefficients:
                    Value  Std. Error  t value  Pr(>|t|)
     (Intercept)  90.6222      4.3557  20.8056    0.0000
       log(phys)  -2.2589      0.7474  -3.0221    0.0047
         log(tv)  -2.9156      0.5907  -4.9358    0.0000
    

    The column headed "Value" has the estimated intercept and the estimated coefficient (slope) for each explanatory variable. These are statistics that can be used to describe the relationship between the response and the explanatory variables.

    The column headed "Std. Error" has the estimated standard errors of the estimated coefficients.

    The column headed "t value" has the t statistic for the test of the null hypothesis that the true parameter value is 0.

    The column headed "Pr(>|t|)" is the two-sided p-value of the hypothesis test.
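
    Each t value is the estimate divided by its standard error. You can check this in the Commands Window using the numbers in the table above. Small differences in the last digit come from rounding in the printed table:

    > -2.2589/0.7474

    > -2.9156/0.5907

    The two-sided p-value is then 2*(1 - pt(abs(t), df)), where t is the t value and df is the residual degrees of freedom (the number of countries used in the fit minus 3). The value of df is not shown in the table above, so look for it in the Report Window.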

    Are both variables useful for making predictions of life expectancy?

    Notice that log(tv) has a larger t value (in absolute value) and a smaller p-value than log(phys).

    Comment on the following conclusion.
    Our model is

    (life expectancy in years) = 90.6 - 2.26 log(people per physician) - 2.92 log(people per television)

    Doubling the number of televisions in a poor country is cheaper than doubling the number of physicians. If we doubled the number of televisions, this would halve the number of people per television, which would change life expectancy by -2.92 log(0.5), or about 2 years. We can increase life expectancy in poor countries by shipping lots of televisions!

    Is this conclusion justified?
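
    The arithmetic in this conclusion can be checked directly. In S-PLUS, log is the natural logarithm, which is the log used in the fitted model. This checks only the arithmetic, not whether the interpretation is justified:

    > -2.92 * log(0.5)

    The result is about 2.02. The corresponding calculation for halving the number of people per physician is

    > -2.26 * log(0.5)

    which is about 1.57.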

  5. Discuss the difference between association and causation. There is a negative association between the number of people per television and life expectancy. Does this mean that one variable has a causal relationship with the other? If not, what else might explain the association?

You do not need to turn in any work for this lab. There will be a question on the final examination that uses information from this lab.

Last modified: April 20, 2001

Bret Larget, larget@mathcs.duq.edu