Math 225

Introduction to Biostatistics

Highlights from Lecture #2

Distinguishing between different species of flea beetles can be difficult.
Exploratory data analysis can be used to help do this.
This page has data and background information about using measurements of beetle anatomy to distinguish between three species.

Chapter 2
Spread is a general statistical concept that describes the variability in the distribution of a quantitative variable.
One rough measure of spread is the range, which is the maximum value minus the minimum value. This measure of spread is not very robust because it depends only on the two most extreme values.
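As a small illustration of why the range is not robust, the sketch below computes it for a made-up set of measurements and again after adding a single extreme value. The data and the function name `value_range` are invented for this example.

```python
def value_range(data):
    """Range: the maximum value minus the minimum value."""
    return max(data) - min(data)

# Made-up measurements, for illustration only.
lengths = [41, 43, 44, 46, 48]
with_extreme = lengths + [90]   # same data plus one extreme observation

print(value_range(lengths))       # 7
print(value_range(with_extreme))  # 49
```

One added observation changes the range from 7 to 49, even though the other five values are unchanged.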
The most commonly used measure of spread is the standard deviation. In words, it is (almost) the square root of the mean squared deviation from the mean.
To calculate the standard deviation by hand, follow these steps.
1. Compute the mean.
2. Subtract the mean from each observation.
3. Square each of the deviations.
4. Sum them.
5. Divide by one less than the number of observations (this is almost the mean of the squared deviations).
6. Take the square root.
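The six steps above can be sketched in Python as follows. The function name and the data are made up for illustration. The result is compared with `statistics.stdev` from the standard library, which also divides by one less than the number of observations.

```python
import math
import statistics

def standard_deviation(data):
    """Sample standard deviation, following the by-hand steps."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n                      # 1. compute the mean
    deviations = [x - mean for x in data]     # 2. subtract the mean
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. sum them
    almost_mean = total / (n - 1)             # 5. divide by n - 1
    return math.sqrt(almost_mean)             # 6. take the square root

# Made-up observations: the mean is 5 and the squared deviations sum to 24.
data = [2, 4, 4, 5, 7, 8]
sd = standard_deviation(data)
print(round(sd, 3))
print(math.isclose(sd, statistics.stdev(data)))
```

For these values step 5 gives 24 / 5 = 4.8, so the standard deviation is the square root of 4.8, a little under 2.2.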
The standard deviation may be interpreted as the size of a typical deviation from the mean.
This interpretation lets you roughly estimate the standard deviation from a histogram. Most observations will be within one standard deviation of the mean, but a fair number will be farther away.
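To make the "typical deviation" interpretation concrete, the sketch below counts how many observations in a small made-up data set lie within one standard deviation of the mean. The data are invented, and the fraction will differ from one data set to another.

```python
import statistics

# Made-up observations, for illustration only.
data = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(data)
sd = statistics.stdev(data)   # sample SD (divides by n - 1)

# Observations whose deviation from the mean is smaller than one SD.
within_one_sd = [x for x in data if abs(x - mean) < sd]
fraction_within = len(within_one_sd) / len(data)

print(within_one_sd)
print(fraction_within)
```

Here four of the six observations are within one standard deviation of the mean, and two are farther away.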
The common notation for standard deviation is s. At times I will use the notation SD as well.
The standard deviation is the square root of the variance.
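In symbols, the description above can be written as follows. Only s comes from these notes. The symbols y_i for the observations, n for the number of observations, and \bar{y} for the mean are assumed here and may differ from the textbook's notation.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
s = \sqrt{s^2}
```

Here s^2 is the variance, and dividing by n - 1 rather than n is why the standard deviation is only "almost" the square root of the mean squared deviation.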
You are not responsible for the mean (absolute) deviation.
You are not responsible for the computing formula for the standard deviation (or variance) on page 20.
Section 2-4 discusses frequency tables and histograms. You are responsible for interpreting histograms as described in the previous lecture.
You are not responsible for the nuts and bolts of making a histogram.
You are not responsible for calculating a mean from a histogram, but you should be able to estimate the balancing point and therefore the mean.
You are not responsible for frequency polygons or cumulative frequency polygons.
You should know what percentiles, quartiles, and quantiles are.
The lower quartile or first quartile is another name for the 25th percentile. It is the location that cuts off the smallest quarter of the data.
The upper quartile or third quartile is another name for the 75th percentile. It is the location that cuts off the largest quarter of the data.
A five number summary is the minimum, lower quartile, median, upper quartile, and maximum. You should know this although it is not in your textbook.
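A five number summary can be sketched in Python with the standard library. Different books and software packages use slightly different rules for locating quartiles, so the "inclusive" method of `statistics.quantiles` used here is an assumption and may not match hand calculations exactly. The function name and data are made up.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points that split the
    # data into quarters; the "inclusive" rule is one of several conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

# Made-up observations, for illustration only.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
print(five_number_summary(data))
```

For these nine values the summary is 1, 4, 7, 10, 15.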
A boxplot is a graphical display of the five number summary. With the numerical axis drawn vertically, the box represents the middle half of the data from the lower quartile to the upper quartile.
The box is split at the median.
Whiskers extend from the top of the box to the maximum and from the bottom of the box to the minimum. Each whisker represents one quarter of the data.
More detailed boxplots will identify outliers (individual observations far from the overall pattern of the data) with individual lines or points. The whiskers would then extend to the most extreme non-outliers.
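These notes do not say how far from the box an observation must be to count as an outlier. A common convention, assumed in the sketch below and not taken from the notes, flags anything more than 1.5 times the box height (upper quartile minus lower quartile) beyond either end of the box. The function name, the multiplier `k`, and the data are illustrative.

```python
import statistics

def boxplot_outliers(data, k=1.5):
    """Return (outliers, whisker_ends) under the assumed k * box-height rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_height = q3 - q1
    low_fence = q1 - k * box_height
    high_fence = q3 + k * box_height
    outliers = [x for x in data if x < low_fence or x > high_fence]
    inside = [x for x in data if low_fence <= x <= high_fence]
    # Whiskers stop at the most extreme observations that are not outliers.
    return outliers, (min(inside), max(inside))

# Made-up observations with one value far from the rest.
data = [1, 3, 4, 6, 7, 8, 10, 12, 30]
outliers, whiskers = boxplot_outliers(data)
print(outliers)
print(whiskers)
```

With these values only 30 is flagged, and the whiskers run from 1 to 12.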
Side-by-side boxplots are especially useful for comparing the distributions of two or more different quantitative variables.
The interquartile range or IQR is the upper quartile minus the lower quartile.
It is a more robust measure of spread than the range.
It is the height of the box in a boxplot.
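The sketch below compares the IQR and the range on a made-up data set before and after its largest value is replaced by a much larger one. The quartile rule (the "inclusive" method of `statistics.quantiles`) is an assumption, and the function names are illustrative.

```python
import statistics

def iqr(data):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

def value_range(data):
    """Range: maximum minus minimum."""
    return max(data) - min(data)

# Made-up observations; the second list changes only the largest value.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
with_extreme = [1, 3, 4, 6, 7, 8, 10, 12, 150]

print(value_range(data), iqr(data))
print(value_range(with_extreme), iqr(with_extreme))
```

The range jumps from 14 to 149, while the IQR stays at 6 in both cases.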

Last modified: January 16, 2001

Bret Larget, larget@mathcs.duq.edu
