Math 225

Introduction to Biostatistics

Notes from Lecture #19

Analysis of variance can be performed when there are multiple categorical explanatory variables. Here is a motivating example.


A study investigated the concentration of alanine (in mg/100 ml) in males and females of three species of millipedes. (The source gives no indication of why this may be of interest.)

The raw data is here.

We can ask if the sex of the millipedes affects alanine concentration, if the species affects alanine concentration, and if there is an interaction between sex and species that affects alanine concentration.

Summary Statistics

Summary Data of Alanine Concentration by Sex and Species (mean, with standard deviation s and sample size n)
Sex      Species 1              Species 2              Species 3
Male     21.20 (s=1.33, n=4)    16.18 (s=1.67, n=4)    18.53 (s=1.84, n=4)
Female   15.08 (s=1.24, n=4)    12.68 (s=1.33, n=4)    13.73 (s=1.21, n=4)

Informally, we can see that the means tend to be larger for males than for females in the same species. The differences are large relative to the standard deviations, and so are probably real. We also see differences in the species means that may be statistically significant as well.

Interaction Graphs

In an interaction graph between the two factors, the continuous response variable is plotted on the y axis and one of the two explanatory variables is plotted on the x axis with evenly spaced locations between groups. The means are plotted and corresponding points for the other factor are connected.

For example, we can plot females and males on the x axis. A line for species 1 connects the points (female,15.08) with (male,21.20). Another line for species 2 connects the points (female,12.68) with (male,16.18). A third line for species 3 connects the points (female,13.73) with (male,18.53).

These lines are close to being parallel. This means that there is very little apparent interaction. The observed differences between the male and female means (6.12, 3.50, and 4.80 for species 1, 2, and 3) are similar across species.
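The arithmetic behind these differences can be checked with a short sketch. It is only an illustration: the dictionary layout and the function name `sex_differences` are assumptions made here, and the numbers are the cell means from the summary table.

```python
# Cell means (mg/100 ml) copied from the summary table above.
means = {
    "male":   {"species 1": 21.20, "species 2": 16.18, "species 3": 18.53},
    "female": {"species 1": 15.08, "species 2": 12.68, "species 3": 13.73},
}

def sex_differences(cell_means):
    """Return the male-minus-female difference in means for each species."""
    return {
        species: round(cell_means["male"][species] - cell_means["female"][species], 2)
        for species in cell_means["male"]
    }

differences = sex_differences(means)
for species, diff in differences.items():
    print(f"{species}: male - female = {diff:.2f}")
```

If the three differences were exactly equal, the lines in the interaction graph would be exactly parallel.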

ANOVA can test this more formally.


            Df Sum of Sq  Mean Sq  F Value     Pr(F) 
        sex  1  138.7204 138.7204 65.67942 0.0000002
    species  2   55.2608  27.6304 13.08207 0.0003103
sex:species  2    6.8908   3.4454  1.63129 0.2233107
  Residuals 18   38.0175   2.1121                   
      Total 23  238.8895

You are not responsible for any formulas to construct a two-way ANOVA table. Two details are worth knowing. Each mean square is a sum of squares divided by its degrees of freedom. Each F ratio is the mean square for that term divided by the mean square for residuals.
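As a check on those two details, the following sketch recomputes the mean squares and F ratios from the degrees of freedom and sums of squares in the table. The dictionary `anova` and the helper `mean_square` are names invented for this illustration; they are not part of any statistics package.

```python
# Degrees of freedom and sums of squares copied from the ANOVA table above.
anova = {
    "sex":         {"df": 1,  "ss": 138.7204},
    "species":     {"df": 2,  "ss": 55.2608},
    "sex:species": {"df": 2,  "ss": 6.8908},
    "Residuals":   {"df": 18, "ss": 38.0175},
}

def mean_square(row):
    """Mean square = sum of squares divided by degrees of freedom."""
    return row["ss"] / row["df"]

# The residual mean square is the denominator of every F ratio.
ms_resid = mean_square(anova["Residuals"])

f_ratios = {
    term: mean_square(row) / ms_resid
    for term, row in anova.items()
    if term != "Residuals"
}

# The individual rows should add up to the Total row.
total_df = sum(row["df"] for row in anova.values())
total_ss = sum(row["ss"] for row in anova.values())

print(f"residual mean square = {ms_resid:.4f}")
for term, f_value in f_ratios.items():
    print(f"{term}: F = {f_value:.5f}")
print(f"total: df = {total_df}, sum of squares = {total_ss:.4f}")
```

The printed values should agree with the Mean Sq and F Value columns up to rounding.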

There are three hypotheses that are tested. The first (sex) has a very small p-value and indicates that the observed differences in means for males and females are very difficult to explain with chance error alone. The second (species) is also highly statistically significant - species seems to be important as well. The third hypothesis, for an interaction, is not significant (p > 0.2). Chance error alone can explain why the lines in the interaction graph are not perfectly parallel. The variation we see among the male/female differences from species to species may reflect chance alone rather than a real dependence on species.
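Two of the three p-values can be reproduced without statistical software, because an F distribution with 2 numerator degrees of freedom has a simple closed-form upper tail: P(F > f) = (1 + 2f/d)^(-d/2), where d is the residual degrees of freedom. The sketch below applies that formula to the species and interaction rows. The function name `f_upper_tail_df1_2` is an assumption made here, and the shortcut does not apply to the sex row, which has 1 numerator degree of freedom.

```python
def f_upper_tail_df1_2(f_value, df_resid):
    """Upper-tail probability P(F > f_value) for an F(2, df_resid) distribution.

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_value / df_resid) ** (-df_resid / 2.0)

# F values and residual degrees of freedom copied from the ANOVA table above.
DF_RESID = 18
p_species = f_upper_tail_df1_2(13.08207, DF_RESID)
p_interaction = f_upper_tail_df1_2(1.63129, DF_RESID)

print(f"species:     p = {p_species:.7f}")
print(f"sex:species: p = {p_interaction:.7f}")
```

The results should agree with the Pr(F) column for those two rows up to rounding.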

If the lines had been less parallel in the interaction plot, I would have expected the p-value to be smaller.

You should also know that the square root of the mean square for residuals (sqrt(2.11) = 1.45) is a pooled estimate, from all six sex-by-species combinations, of the size of a typical deviation of an observation from its group mean. The fact that the six separate standard deviations are similar indicates that assuming the corresponding population standard deviations are equal is not a bad assumption.
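The pooling can be illustrated from the summary table alone. The sketch below combines the six sample standard deviations, weighting each variance by n - 1, and compares the result with the square root of the residual mean square. The variable names are assumptions made for this illustration. Because the table reports the standard deviations rounded to two decimals, the two numbers agree only approximately.

```python
import math

# (standard deviation, sample size) for each sex-by-species group,
# copied from the summary table above.
groups = [
    (1.33, 4), (1.67, 4), (1.84, 4),   # males, species 1-3
    (1.24, 4), (1.33, 4), (1.21, 4),   # females, species 1-3
]

# Pooled variance: weight each sample variance by its degrees of freedom (n - 1).
pooled_df = sum(n - 1 for _, n in groups)
pooled_variance = sum((n - 1) * s ** 2 for s, n in groups) / pooled_df
pooled_sd = math.sqrt(pooled_variance)

# Residual mean square copied from the ANOVA table.
root_ms_resid = math.sqrt(2.1121)

print(f"pooled degrees of freedom  = {pooled_df}")
print(f"pooled standard deviation  = {pooled_sd:.3f}")
print(f"sqrt(residual mean square) = {root_ms_resid:.3f}")
```

The pooled degrees of freedom match the 18 residual degrees of freedom in the ANOVA table.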

Last modified: April 27, 2001

Bret Larget,