Math 225

Introduction to Biostatistics

Notes on Goodness-of-Fit

A motivating example. If an observed genetic trait is determined by a single gene with one dominant and one recessive allele, we would expect a cross between two heterozygotes to produce offspring with the dominant and recessive traits in a 3:1 ratio. If the genetic situation is more complicated (multiple genes, multiple alleles, incomplete dominance), the true proportions may differ from 3:1. Even when the simple model is correct, the observed proportions will usually differ somewhat from 3:1 because of chance variation.

In one particular data set, we observe 99 offspring with the dominant trait and 45 with the recessive trait. Do these data fit the simple genetic model?

We can examine this question with a chi-square test. The chi-square test statistic is a measure of the discrepancy between what we observe and what we expect to see. If the discrepancy is larger than can reasonably be explained by chance alone, there is evidence that something else (such as more complicated genetics) is important.

The chi-square test statistic is as follows.

X² = Σ (O_i - E_i)² / E_i, summed over all categories i

where O_i is the observed count in the ith category and E_i is the expected count in the ith category.

Notice that the chi-square test statistic will be large if some observed counts differ greatly from the expected counts.
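
As an illustration of this point, here is a minimal Python sketch of the statistic. The function name chi_square_statistic and the counts used below are hypothetical and chosen only for illustration; they are not part of the genetics data set.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts (an assumption for illustration only):
# the same expected counts, with one sample close to them and one far away.
expected = [50, 50]
close_fit = chi_square_statistic([52, 48], expected)
poor_fit = chi_square_statistic([70, 30], expected)

print("close fit:", close_fit)
print("poor fit: ", poor_fit)
```

The sample whose counts sit far from the expected counts gives a much larger value of the statistic than the sample whose counts sit close to them.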

The chi-square test statistic will follow (approximately) a chi-square distribution if the null hypothesis (expected proportions in each category are correct) is true. The chi-square distribution with k degrees of freedom is what you get if you take k independent standard normal random variables, square them, and add them up. For this type of problem, the correct number of degrees of freedom is:
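
The description of the chi-square distribution as a sum of squared standard normals can be checked by simulation. The sketch below is only an illustration: the function name simulate_chi_square, the seed, and the number of draws are arbitrary choices of ours. It uses the fact that a chi-square distribution with k degrees of freedom has mean k, so the simulated averages should land near k.

```python
import random


def simulate_chi_square(k, n_draws, seed=0):
    """Draw n_draws values, each the sum of k squared standard normals."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_draws)
    ]


# Compare the simulated mean with k for a few choices of k.
simulated_means = {}
for k in (1, 2, 5):
    draws = simulate_chi_square(k, n_draws=20000)
    simulated_means[k] = sum(draws) / len(draws)
    print("k =", k, " simulated mean =", round(simulated_means[k], 2))
```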

df = (# of categories) - 1 - (# of estimated parameters)

For our problem, we have two categories and no estimated parameters (we are given expected proportions of 0.75 and 0.25), so there is only one degree of freedom.
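
The degrees-of-freedom rule is simple enough to write as a one-line calculation. The helper below is a sketch with a hypothetical name (degrees_of_freedom); the check that the result is at least 1 is our own safeguard, not part of the rule itself.

```python
def degrees_of_freedom(n_categories, n_estimated_parameters=0):
    """df = (# of categories) - 1 - (# of estimated parameters)."""
    df = n_categories - 1 - n_estimated_parameters
    if df < 1:
        # Our own safeguard: the test needs at least one degree of freedom.
        raise ValueError("too few categories for this many estimated parameters")
    return df


# Two categories (dominant, recessive); the proportions 0.75 and 0.25
# are given by the genetic model, so no parameters are estimated.
df = degrees_of_freedom(n_categories=2, n_estimated_parameters=0)
print("degrees of freedom:", df)
```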

In a total of 144 offspring, the expected counts are (0.75)*144 = 108 and (0.25)*144 = 36. Notice that the expected counts and the observed counts total to the same value, 144 in this case. The test statistic is

X² = (99-108)²/108 + (45-36)²/36 = 3.00
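
The arithmetic above can be reproduced in a few lines of Python. This is just a check of the hand calculation using the counts and proportions given in these notes; the variable names are our own.

```python
# Observed counts from the data set: dominant, recessive.
observed = [99, 45]

# Proportions expected under the simple 3:1 genetic model.
proportions = [0.75, 0.25]

# Expected counts are the proportions times the total number of offspring.
total = sum(observed)
expected = [p * total for p in proportions]

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i over the two categories.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print("total offspring:", total)
print("expected counts:", expected)
print("chi-square statistic:", round(statistic, 2))
```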

The p-value is the area to the right of 3.00 under a chi-square distribution with 1 degree of freedom. From the table, we see this is between 0.05 and 0.10. This p-value is marginal. There is at best weak evidence of more complicated genetics. Chance alone might explain the difference between what was observed and what was expected.
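
For readers who want a number rather than a table range, the p-value can be computed directly when there is 1 degree of freedom. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal Z, so the area to the right of x equals P(|Z| > sqrt(x)), which is erfc(sqrt(x/2)). The function name chi_square_1df_p_value is hypothetical, and the shortcut applies only to the 1 degree of freedom case.

```python
import math


def chi_square_1df_p_value(statistic):
    """Area to the right of `statistic` under chi-square with 1 df.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    which holds only for 1 degree of freedom.
    """
    if statistic < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))


p_value = chi_square_1df_p_value(3.00)
print("p-value:", round(p_value, 4))
```

The computed value falls inside the range read from the table, so the conclusion is unchanged.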

Last modified: May 2, 2001

Bret Larget, larget@mathcs.duq.edu