# Math 225

## Analysis of Variance

#### Prerequisites

This lab assumes that you already know how to:
1. Log in, find the course Web page, and run S-PLUS
2. Use the Commands Window to execute commands
3. Load data sets

#### Technical Objectives

This lab will teach you to:
1. use S-PLUS for analysis of variance
2. complete an ANOVA table by hand
3. interpret an ANOVA table
4. count degrees of freedom
5. find p-values from an F table

#### Conceptual Objectives

In this lab you should begin to understand:
1. when a one-way ANOVA is appropriate
2. the assumptions behind an ANOVA

#### Analysis of Variance

Analysis of variance is a general statistical method that is appropriate when there is a continuous response variable and one or more categorical explanatory variables. In this lab we are concerned with a single explanatory variable (called a factor) with at least three possible values (called levels). In a one-way analysis of variance, we may think of the data as consisting of g independent samples (possibly of different sizes) from g populations. In a previous chapter, you studied the case when g was two, and you could test whether two population means were equal with a t test. This chapter generalizes that test to three or more population means.

The basic structure of the test is similar to the two-population case: calculate a test statistic and compare it to its sampling distribution under the null hypothesis that all population means are equal. If you observe a test statistic that is very different from what you would expect to see by chance (indicated both by a small p-value and by a test statistic in the critical region), there is evidence against the null hypothesis. With three or more populations, there is a different test statistic, and you compare its value to an F distribution instead of a t distribution. An ANOVA table (ANOVA is short for ANalysis Of VAriance) is little more than a step-by-step procedure for calculating the test statistic.

If the null hypothesis were true, you would expect the sample means to be close to one another because they would all be unbiased estimates of the same population mean. Because of chance variation, the sample means will probably not be exactly equal. The basic idea is that if the variability among the sample means is greater than what you would expect due to chance, you have evidence that the population means are not all equal. The test statistic compares the variation among the sample means to the variation within the samples. If this statistic is large, the sample means are farther apart than expected by chance, given the within-sample variability and the sample sizes. The p-value is the area to the right of the test statistic under an F distribution with the appropriate degrees of freedom.

#### Assumptions

Analysis of variance makes these assumptions:
1. Populations are normal.
2. The populations have equal variances.
3. The samples are independent.

ANOVA is robust to lack of normality in the populations if the sample sizes are large or if the populations are not strongly skewed; the only real concern is small, skewed samples. The assumption of equal variances is equivalent to the assumption made when comparing two population means with independent samples. ANOVA is robust to unequal variances (heteroscedasticity) as long as the sample standard deviations are of about the same order of magnitude. If the ratio of the largest sample standard deviation to the smallest is more than ten or so, you need a more advanced method to account for the heteroscedasticity. The last assumption means that ANOVA is not appropriate for paired or matched samples, for example.
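The rule of thumb about standard deviations is easy to check numerically. Here is a minimal Python sketch (Python rather than S-PLUS, and with made-up sample data) that compares the largest and smallest sample standard deviations:

```python
# Rule of thumb from the text: if the largest sample standard deviation
# is more than about 10 times the smallest, the equal-variance
# assumption is in trouble.  The three samples below are hypothetical.
import statistics

samples = [
    [23.1, 24.8, 22.5, 25.0],  # made-up group 1
    [19.7, 21.2, 20.4, 22.1],  # made-up group 2
    [27.3, 26.0, 28.4, 25.9],  # made-up group 3
]

sds = [statistics.stdev(s) for s in samples]
ratio = max(sds) / min(sds)
print(ratio < 10)  # True: the sds are the same order of magnitude
```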

#### ANOVA Table

Your book describes analysis of variance in Chapter 9. Page 238 summarizes the formulae you need to do a calculation without software. Instead of the textbook formulae, you may find it easier to understand these equivalent formulae.

```
SS_Among = sum n_i * (xbar_i - grand_mean)^2
SS_Within  = sum (n_i - 1) * (s_i)^2
df_Among = (number of groups) - 1
df_Within  = sum (n_i - 1) = (total number of measurements) - (number of groups)
MS_Among = SS_Among / df_Among
MS_Within = SS_Within / df_Within
F = MS_Among / MS_Within
```

The p-value is the area to the right of the test statistic under an F distribution with df_Among and df_Within degrees of freedom.
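The formulae above translate almost line for line into code. Below is a Python sketch (an illustration only; the lab itself uses S-PLUS) with three hypothetical samples of unequal sizes:

```python
# The ANOVA-table formulae, computed in plain Python.
# The data are hypothetical; substitute your own samples.
import statistics

groups = [
    [10.0, 12.0, 11.0],        # made-up group 1
    [14.0, 15.0, 13.0, 14.0],  # made-up group 2
    [9.0, 8.0, 10.0],          # made-up group 3
]

g = len(groups)                     # number of groups
N = sum(len(gr) for gr in groups)   # total number of measurements
grand_mean = sum(sum(gr) for gr in groups) / N

ss_among = sum(len(gr) * (statistics.mean(gr) - grand_mean) ** 2
               for gr in groups)
ss_within = sum((len(gr) - 1) * statistics.variance(gr) for gr in groups)

df_among, df_within = g - 1, N - g
ms_among = ss_among / df_among
ms_within = ss_within / df_within
F = ms_among / ms_within

print(df_among, df_within)  # 2 7
print(round(F, 2))          # 25.9
```

A large F like this one says the variation among the three sample means is far greater than the within-sample variation would explain by chance.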

#### Multiple Comparisons

When you can reject the hypothesis that all population means are the same, you often wish to identify which population means are different. This can involve multiple pairwise comparisons. There are several methods to construct simultaneous confidence intervals. Two of these are the Scheffe and Bonferroni methods. They are discussed on pages 243 and 245 respectively. Each of these methods depends on the assumptions of normality, equal variance, and independence. The Scheffe method allows for any number of comparisons of any contrasts and is very conservative. The Bonferroni method is only valid if the comparisons are specified prior to examining the data. For the purposes of this course, we only consider comparisons between all sample means.

The basic structure of the two methods is the same. The simultaneous confidence intervals are of this form.

```
(difference in sample means) +/- (multiplier)(standard error)
```

where the standard error has the form

```
(pooled estimate of sigma)*sqrt(1/(sample size 1) + 1/(sample size 2))
```

For the Scheffe method, the multiplier comes from an F distribution, specifically

```
multiplier = sqrt( (g-1)*F(1-alpha) )
```
where F(1-alpha) is the point that cuts off an upper tail area of alpha from an F distribution with (g-1) and (N-g) degrees of freedom.

For the Bonferroni method, the multiplier is the value t so that the area between -t and t is 1 - alpha/k from a t distribution with (N-g) degrees of freedom where k is the number of comparisons.
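Both multipliers come straight from the quantile functions of the F and t distributions. The Python sketch below assumes scipy is available (in S-PLUS the analogous functions are qf and qt); the values g = 3, N = 15, and alpha = 0.05 are hypothetical, and k = g(g-1)/2 = 3 counts all pairwise comparisons:

```python
# Scheffe and Bonferroni multipliers for simultaneous confidence
# intervals, using scipy's F and t quantile functions.
# g, N, alpha are hypothetical; k = 3 pairwise comparisons for g = 3.
from scipy.stats import f, t

g, N, alpha, k = 3, 15, 0.05, 3

# Scheffe: sqrt((g-1) * F_{1-alpha}) with (g-1, N-g) degrees of freedom
scheffe = ((g - 1) * f.ppf(1 - alpha, g - 1, N - g)) ** 0.5

# Bonferroni: t value leaving area alpha/(2k) in each tail, N-g df
bonferroni = t.ppf(1 - alpha / (2 * k), N - g)

print(round(scheffe, 3), round(bonferroni, 3))
```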

#### In-class Activities

1. Load in the data from exercise 9-1.
2. Use S-PLUS to do an analysis of variance with the Scheffe method for multiple comparisons.
   1. Select Statistics:ANOVA:Fixed Effects...
   2. On the Model tab, click on hours as the dependent variable and treatment as the independent variable.
   3. On the Results tab, click on Means (along with the defaults).
   4. On the Plot tab, click on Residuals versus Fit and change Number of Extreme Points To Identify from 3 to 0.
   5. On the Compare tab, select temperature for Levels Of, click on Plot Intervals, and select Scheffe for Method.
   6. Click on the Apply button.
   7. Close the warning message box that appears.

   Look at the output in the Report Window, the residual plots, and the graphical display of the confidence intervals.
3. Change the method for multiple comparisons to Bonferroni, click on OK, and examine the output.

If time permits, you can complete the ANOVA table from the formulae using a hand calculator. You may wish to use S-PLUS to find sample means and variances.

1. Use S-PLUS to find the mean and sample variance from each sample as well as the mean and sample variance of the data treated as one large sample.
2. Fill in a one-way ANOVA table as on page 238 for your data.

```
Source      SS        df        MS       F

Among

Within

Total
```

3. Find the exact p-value by calculating the area to the right of your F test statistic. Something like the example below will do the trick. (You will get a different test statistic.)

```
> 1-pf(22.42,2,12)
```
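As a rough check on this kind of S-PLUS output: when the numerator degrees of freedom equal 2, the upper-tail area of an F distribution has the simple closed form (1 + 2x/d2)^(-d2/2), so the command above can be verified by hand. A quick Python check:

```python
# Closed-form upper-tail area of an F(2, d2) distribution at x,
# checking the S-PLUS call 1 - pf(22.42, 2, 12).
x, d2 = 22.42, 12
p = (1 + 2 * x / d2) ** (-d2 / 2)
print(p)  # roughly 8.9e-05, a very small p-value
```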

#### Homework Assignment

Use S-PLUS to do exercise 9-2 on page 264. Here is the data.
1. Construct side-by-side boxplots of count versus mouse. In this informal analysis, does it appear that the count depends on the mouse? Do the boxplots show skewed or fairly symmetric distributions?
2. Use S-PLUS to carry out a one-way ANOVA for this data using count as the response. Complete the table. What is the F statistic and the p-value?
3. If you have a very small p-value, is it reasonable to conclude that the effects of the treatment are not the same for each mouse?

Last modified: November 27 2000

Bret Larget, larget@mathcs.duq.edu