Distinguishing between different species of flea beetles
can be difficult.
Exploratory data analysis can help with this task.
This page has data and background
information about using measurements of beetle anatomy
to distinguish between three species.
- Spread is a general statistical concept that describes
the variability in the distribution of a quantitative variable.
One rough measure of spread is the range,
which is the maximum value minus the minimum value.
This measure of spread is not very robust because it depends only on the two extreme values.
The most commonly used measure of spread is the standard deviation.
In words, it is (almost) the square root of the mean squared deviation
from the mean.
To calculate the standard deviation by hand, follow these steps.
- Compute the mean.
- Subtract the mean from each observation.
- Square each of the deviations.
- Sum the squared deviations.
- Divide by one less than the number of observations (this is almost the mean of the squared deviations).
- Take the square root.
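The steps above can be sketched in Python as follows. This is a minimal illustration, not a required method: the function name sample_sd and the data values are made up for the example, and the standard library function statistics.stdev is used only as a cross-check.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation, computed step by step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n                     # compute the mean
    deviations = [x - mean for x in values]    # subtract the mean from each observation
    squared = [d * d for d in deviations]      # square each deviation
    total = sum(squared)                       # sum the squared deviations
    variance = total / (n - 1)                 # divide by one less than n
    return math.sqrt(variance)                 # take the square root


# Made-up data for illustration only (not the flea beetle measurements).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_sd(data))          # approximately 2.138
print(statistics.stdev(data))   # the library value agrees with the hand calculation
```

For these values the mean is 5 and the squared deviations sum to 32, so the variance is 32/7 and the standard deviation is its square root.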
The standard deviation may be interpreted
as the size of a typical deviation from the mean.
Using this interpretation, you can roughly estimate
the standard deviation from a histogram.
Most observations will lie within one standard deviation
of the mean,
but there will be a fair number of observations farther away.
The common notation for standard deviation is s.
At times I will use the notation SD as well.
The standard deviation is the square root of the variance.
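As a rough check of this interpretation, the sketch below counts how many observations fall within one standard deviation of the mean and confirms that s is the square root of the variance. The data values are made up for illustration, and the fraction will differ from one data set to another.

```python
import math
import statistics

# Made-up data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
s = statistics.stdev(data)            # sample standard deviation
variance = statistics.variance(data)  # sample variance (divides by n - 1)

# The standard deviation is the square root of the variance.
print(math.isclose(s, math.sqrt(variance)))  # True

# Observations closer to the mean than one standard deviation.
within = [x for x in data if abs(x - mean) < s]
fraction_within = len(within) / len(data)
print(fraction_within)  # 0.75
```

Here six of the eight observations are within one standard deviation of the mean, and the two most extreme values (2 and 9) are farther away.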
- You are not responsible for the mean (absolute) deviation.
- You are not responsible for the computing formula
for the standard deviation (or variance) on page 20.
- Section 2-4 discusses frequency tables and histograms.
You are responsible for interpreting histograms as described
in the previous lecture.
You are not responsible for the nuts and bolts of making a histogram.
You are not responsible for calculating a mean from a histogram,
but you should be able to estimate the balancing point and therefore the mean.
You are not responsible for frequency polygons
or cumulative frequency polygons.
You should know what percentiles, quartiles, and quantiles are.
The lower quartile
or first quartile
is another name for the 25th percentile.
It is the location that cuts off the smallest quarter of the data.
The upper quartile
or third quartile
is another name for the 75th percentile.
It is the location that cuts off the largest quarter of the data.
A five-number summary
is the minimum, lower quartile, median, upper quartile, and maximum.
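A five-number summary can be computed with the Python standard library, as in the sketch below. One caution: there are several conventions for computing quartiles, and they can give slightly different answers on small data sets. The choice of method="inclusive" here is an assumption made for the example, and a textbook or other software may use a different rule. The function name and data are also made up.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles() with n=4 returns the three cut points that split
    # the data into quarters: the lower quartile, median, and upper quartile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Made-up data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(five_number_summary(data))  # (2, 4.0, 4.5, 5.5, 9)
```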
You should know this although it is not in your textbook.
A boxplot is a graphical display of the five-number summary.
With the numerical axis drawn vertically,
the box represents the middle half of the data
from the lower quartile to the upper quartile.
The box is split at the median.
One whisker extends from the top of the box to the maximum,
and another extends from the bottom of the box to the minimum.
Each whisker represents one quarter of the data.
More detailed boxplots will identify outliers
(individual observations far from the overall pattern of the data)
with individual lines or points.
The whiskers would then extend to the most extreme non-outliers.
Side-by-side boxplots are especially useful for comparing
the distributions of a quantitative variable across two or more groups.
The interquartile range
is the upper quartile minus the lower quartile.
It is a more robust measure of spread than the range.
It is the height of the box in a boxplot.
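The difference in robustness can be seen in a small sketch: changing a single extreme value changes the range a great deal but may leave the interquartile range untouched. The data are made up, and the quartile convention (method="inclusive") is an assumption of the example.

```python
import statistics


def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


def iqr(values):
    """Upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data[:-1] + [90]  # replace the maximum, 9, with 90

print(value_range(data), value_range(with_outlier))  # 7 88
print(iqr(data), iqr(with_outlier))                  # 1.5 1.5
```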