Math 225
Introduction to Biostatistics
Final Exam Review
Exploratory Data Analysis
Concepts:
Mean, median, standard deviation, variance, skewness, histogram, boxplot,
quartiles, interquartile range, categorical and quantitative variables.
Sample multiple choice problem:
A five number summary of exam scores is 17, 66, 76, 85, 97.
A histogram of scores is unimodal
with a peak near 75 and the left tail extending longer than the right tail.
(a) The exam scores are skewed to the right.
(b) The mean exam score is 76.
(c) The standard deviation is greater than 4.
(d) The standard deviation is the square root of the variance.
(e) The interquartile range is 19.
Solution
(a) is false because the scores are skewed to the left.
(b) is false. The median score is 76.
(c) is true. We do not know the standard deviation exactly, but a typical distance from the mean
must be larger than four when the quartiles are 19 points apart.
(d) is true.
(e) is true. 85 - 66 = 19.
Probability
Concepts:
Probability, outcome space, event, independence, mutually exclusive events,
addition principle, multiplication principle, inclusion-exclusion,
factorials, permutations, combinations, conditional probability, Bayes' Rule,
sensitivity, specificity, prevalence rate, false positive rate, false negative rate,
predictive value,
the binomial distribution,
the Poisson distribution,
the normal distribution,
standard normal distribution,
sampling distributions,
the central limit theorem,
population, sample, parameter, statistic,
random variable, simple random sample.
Sample multiple choice problem:
The distribution of volumes of soda in cans is normal
with a mean of 12.2 ounces and a standard deviation of 0.1 ounces.
(a) The proportion of soda cans with less than 12 ounces of soda is about 16%.
(b) The median of the distribution is 12.2 ounces.
(c) The proportion of soda cans with between 12.1 and 12.2 ounces of soda is about 34%.
(d) In a sample of 25 soda cans, the sample mean is about 95% likely to fall between
12.16 and 12.24 ounces.
(e) The sampling distribution of the sample mean from this population is normal for any sample size.
Solution
(a) is false. About 16% of the cans are more than one standard deviation below the mean.
12 ounces is two standard deviations below the mean, so the proportion is about 2.5%.
(b) is true. Normal curves are symmetric about the mean.
(c) is true. About 68% is within one standard deviation of the mean, above or below.
Half of these will be below the mean.
(d) is true. The CLT tells us the sampling distribution of the sample mean
in this problem will be normal with a mean of 12.2 ounces and SE 0.02.
The middle 95% of this distribution is the interval given.
(e) is true.
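The normal-curve calculations in this solution can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the original problem.

```python
import math
from statistics import NormalDist

# Population of can volumes in ounces, as given in the problem.
cans = NormalDist(mu=12.2, sigma=0.1)

# (a) Proportion below 12 ounces: about 0.023, so the 16% claim is false.
p_below_12 = cans.cdf(12.0)

# (c) Proportion between 12.1 and 12.2 ounces: about 0.34.
p_121_to_122 = cans.cdf(12.2) - cans.cdf(12.1)

# (d) Sampling distribution of the mean for n = 25 has SE = 0.1 / sqrt(25) = 0.02.
se = 0.1 / math.sqrt(25)
sample_mean = NormalDist(mu=12.2, sigma=se)
p_interval = sample_mean.cdf(12.24) - sample_mean.cdf(12.16)  # about 0.95

print(round(p_below_12, 4), round(p_121_to_122, 4), round(p_interval, 4))
```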
Sample quantitative problems:
You may use these facts:
 The mean of the binomial distribution is np.
 The standard deviation of the binomial distribution is sqrt(np(1-p)).
 Recall the 68-95-99.7 rule for normal distributions.

If a drug is fifty percent likely to be more effective than a control
in each trial,
what is the probability of it being more effective than a control
in exactly five out of six trials?
Solution
This is binomial with n=6 and p=0.5.
The probability is 0.09375.
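As a check on the arithmetic, the binomial probability is 6 * (0.5)^6. A short sketch, assuming a helper name `binomial_pmf` chosen here for illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly five successes in six trials with p = 0.5.
prob = binomial_pmf(5, 6, 0.5)
print(prob)
```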

In a classroom experiment,
a large bucket is filled with colored beads,
30% of which are red.
A student takes a random sample of 50 beads.
(a) What is the expected number of red beads in the sample?
(b) What is the size of a typical deviation between the actual
random number of red beads and the expected number?
(Standard deviation of the appropriate probability distribution.)
(c) Write an expression for the exact probability that
20 or more sampled beads are red.
Do not simplify or evaluate this expression.
(d) Describe how to approximate the probability in the previous part
with an area under a normal curve.
Solution
(a) mu = 50*(0.3) = 15.
(b) sigma = sqrt(50*0.3*0.7) = 3.24.
(c) P(X >= 20) =
_{50}C_{20} * (0.3)^{20} * (0.7)^{30}
+ _{50}C_{21} * (0.3)^{21} * (0.7)^{29}
+ ... +
_{50}C_{50} * (0.3)^{50} * (0.7)^{0}
(d) The sum of binomial probabilities will be approximately the area
to the right of z = (19.5 - 15)/3.24 = 1.39 under a standard normal curve.
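The exam does not ask for the sum to be evaluated, but comparing it with the normal approximation shows how close the two are. This is a sketch only; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, p = 50, 0.3
mu = n * p                            # expected number of red beads
sigma = math.sqrt(n * p * (1 - p))    # typical deviation, about 3.24

# Exact binomial tail: P(X >= 20) as the sum written out in part (c).
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(20, n + 1))

# Normal approximation with continuity correction, as in part (d).
z = (19.5 - mu) / sigma
approx = 1 - NormalDist().cdf(z)

print(round(mu, 2), round(sigma, 2), round(z, 2))
print(round(exact, 4), round(approx, 4))
```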

In a population,
IQ scores are normally distributed with a mean of 115 and a standard deviation of 20.
(a) What proportion of IQ scores are greater than 140?
(b) What is the 85th percentile of this distribution?
(c) State the three conclusions of the central limit theorem
as presented in the computer lab.
(d) In a sample of nine individuals,
what is the probability that the mean IQ score is greater than 120?
(e) Name a characteristic of a population
that has a great effect on the normality
of the sampling distribution of the sample mean.
Solution
(a) 0.1056; (b) 135.7;
(c) The mean of the sampling distribution of the sample mean is the population mean,
the standard deviation of the sampling distribution of the sample mean is sigma / sqrt(n),
and the shape is approximately normal if the sample size is large enough.
(d) 0.2266;
(e) skewness.
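The numerical answers to (a), (b), and (d) can be reproduced with a normal distribution object. A minimal sketch with illustrative names:

```python
import math
from statistics import NormalDist

iq = NormalDist(mu=115, sigma=20)

# (a) Proportion of scores above 140 (z = 1.25).
p_above_140 = 1 - iq.cdf(140)

# (b) 85th percentile of the population distribution.
pct_85 = iq.inv_cdf(0.85)

# (d) The mean of n = 9 scores has standard deviation sigma / sqrt(n).
mean_of_9 = NormalDist(mu=115, sigma=20 / math.sqrt(9))
p_mean_above_120 = 1 - mean_of_9.cdf(120)

print(round(p_above_140, 4), round(pct_85, 1), round(p_mean_above_120, 4))
```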

In a given population,
a disease has a prevalence rate of 0.2%.
For individuals with the disease,
a positive test result occurs 98% of the time.
For individuals without the disease, a negative test result occurs 93% of the time.
An individual from this population is randomly sampled and has a positive test result.
What is the probability the person has the disease?
Solution
(0.002)(0.98)/((0.002)(0.98)+(0.998)(0.07)) = 0.027.
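The same Bayes' Rule calculation, written out step by step. This is a sketch; the variable names are chosen here for readability.

```python
prevalence = 0.002    # P(disease)
sensitivity = 0.98    # P(positive | disease)
specificity = 0.93    # P(negative | no disease)

# False positive rate is the complement of specificity.
false_positive_rate = 1 - specificity

# P(positive) by the law of total probability.
p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate

# Bayes' Rule: P(disease | positive).
p_disease_given_positive = prevalence * sensitivity / p_positive
print(round(p_disease_given_positive, 3))
```

Even with a sensitive test, the low prevalence means most positive results come from people without the disease.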
Statistical Inference
Concepts:
Hypothesis test, null and alternative hypotheses, one- and two-sided tests,
test statistic, p-value, significance level,
confidence interval, standard error.
Sample multiple choice problem:
In a random sample of 100 individuals,
the sample mean and sample standard deviation body temperatures
are 98.2 and 0.7 respectively.
(a) If the distribution is approximately normal, about 95% of the sampled individuals
have body temperatures between 96.8 and 99.6 degrees.
(b) We can be about 95% confident that the population mean body temperature
is between 98.06 and 98.34.
(c) Any two-sided hypothesis test with null hypothesis that mu is in the interval above
would be significant at the 5% level.
(d) In a one-sided hypothesis test with null hypothesis that mu = 98.6,
the p-value is the area to the left of t = (98.2 - 98.6)/0.7
under a t distribution curve with 99 degrees of freedom.
(e) If the p-value is much less than 0.0005,
this indicates very strong evidence that the population mean is lower than 98.6.
Solution
(a) is true. The sample proportion within two standard deviations of the sample mean
from a large sample drawn from a normal population will probably be close to 95%.
(b) is true. SE = 0.7/sqrt(100) = 0.07. The 95% confidence interval is correctly constructed and interpreted.
(c) is false. The 95% confidence interval contains those values
for which the two-sided hypothesis test would NOT be significant.
(d) is false. Use the SE, not s: t = (98.2 - 98.6)/0.07.
(e) is true.
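The interval in (b) and the corrected test statistic for (d) follow directly from the standard error. The sketch below assumes the rough multiplier 2 for "about 95%", as in the 68-95-99.7 rule; the names are illustrative.

```python
import math

n, xbar, s = 100, 98.2, 0.7
mu_0 = 98.6  # null hypothesis value in part (d)

se = s / math.sqrt(n)  # standard error of the sample mean

# Approximate 95% confidence interval, using 2 as the multiplier (an assumption).
ci_low, ci_high = xbar - 2 * se, xbar + 2 * se

# Test statistic: divide by the SE, not by s.
t_stat = (xbar - mu_0) / se

print(round(se, 2), round(ci_low, 2), round(ci_high, 2), round(t_stat, 2))
```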
Sample quantitative problems:

(a) Under what condition(s) is it reasonable to approximate
the sampling distribution of a proportion
with a normal distribution?
(b) In a random sample of 100 individuals from a city,
13 are infected with HIV.
What is the test statistic in the hypothesis test
with a null hypothesis that 5% of the populace in the city has HIV
versus the alternative that the proportion is higher?
To which distribution should this test statistic be compared
to conclude the test?
Solution
(a) When np is at least 5 and when n(1-p) is at least 5.
For sample data, this corresponds to when there are at least five observations of each type.
(b) z = (12.5 - 5)/sqrt(100*.05*.95) = 3.44.
Compare to a standard normal curve.
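The 12.5 in the numerator is the observed count 13 with a continuity correction of 0.5. A minimal sketch of the check in (a) and the statistic in (b), with illustrative names:

```python
import math

n, x, p0 = 100, 13, 0.05

# (a) Rule of thumb for the normal approximation under the null hypothesis.
normal_ok = n * p0 >= 5 and n * (1 - p0) >= 5

# (b) Test statistic on the count scale, with a continuity correction of 0.5.
expected_count = n * p0
sd_count = math.sqrt(n * p0 * (1 - p0))
z = ((x - 0.5) - expected_count) / sd_count

print(normal_ok, round(z, 2))
```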

A random sample of 219 adult drug users contains 55 individuals
who began smoking cigarettes at age 12 or younger.
A second random sample of 822 people who do not use drugs contains
117 individuals who began smoking at age 12 or younger.
Find the appropriate test statistic in a test that the population proportions
of individuals who began smoking at age 12 or younger are the same
for both drug users and nonusers.
Describe how to find a p-value as an area under a curve.
Solution
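p_{1}hat = 55/219 = 0.251.
p_{2}hat = 117/822 = 0.142.
pbar = (55+117)/(219+822) = 0.165.
SE = sqrt((0.165*0.835)(1/219 + 1/822)) = 0.0282.
z = (0.251 - 0.142) / 0.0282 = 3.87.
The p-value is the total area under a standard normal curve to the left of -3.87 and to the right of 3.87.
p-value = 0.0001.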
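The pooled two-proportion calculation can be sketched as follows. Carrying full precision through the intermediate steps gives a z value that differs slightly in the second decimal from the hand calculation with rounded values; the conclusion is the same. Names are illustrative.

```python
import math
from statistics import NormalDist

x1, n1 = 55, 219     # drug users who began smoking at age 12 or younger
x2, n2 = 117, 822    # nonusers who began smoking at age 12 or younger

p1_hat = x1 / n1
p2_hat = x2 / n2
p_bar = (x1 + x2) / (n1 + n2)  # pooled estimate under the null hypothesis

se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Two-sided p-value: area in both tails of the standard normal curve.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p1_hat, 3), round(p2_hat, 3), round(p_bar, 3), round(se, 4))
print(round(z, 2), round(p_value, 4))
```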
Analysis of Variance
Sample quantitative problem:

(Circle the correct choices.)
Analysis of variance can be an appropriate model
for analyzing data
when there is a categorical/quantitative response variable
and one or more categorical/quantitative explanatory variables.
Solution
Quantitative response variables, categorical explanatory variables.

In a study to compare four different diets,
28 young pigs are randomly assigned to four treatment groups
with seven pigs in each group.
For each pig, the weight gain after three months is measured.
Name a graphical procedure that summarizes the distribution
of weight gains for all four diets on the same graph.
Solution
Side-by-side boxplots are very useful for comparing several distributions.

For the study described in the previous problem,
if a researcher analyzed the data using one-way analysis of variance,
to which F distribution would the test statistic be compared
to determine a p-value?
Solution
The F statistic has an F distribution with 3 and 24 degrees of freedom.

Twenty-four faculty members who did not regularly exercise
volunteered to participate
in a study on the effect of exercise on weight.
The faculty members were randomly assigned to three treatment groups:
seven in a control group, eight in a low intensity exercise group,
and nine in a high intensity exercise group.
The changes in weight are displayed and summarized below.
(A positive number indicates a gain in weight during the program.)
Exercise Group   n   Weight difference in pounds   mean   sd
Control          7   1, 7, 2, 0, 4, 10, 2          2.57   4.61
Low intensity    8   7, 3, 6, 2, 14, 3, 2, 2       2.13   6.27
High intensity   9   1, 0, 7, 8, 7, 4, 4, 1, 1     1.00   4.85
Complete the following ANOVA table.
Source   Sum of Squares   df   MS   F   p-value
Among    87.2                           0.24
Within   590.6
Total
Solution
Source   Sum of Squares   df   MS      F      p-value
Among    87.2             2    43.6    1.55   0.24
Within   590.6            21   28.12
Total    677.8
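The missing entries follow from the two given sums of squares and the group sizes. A minimal sketch of the arithmetic, with illustrative names (the p-value is given in the problem and is not recomputed here):

```python
ss_among, ss_within = 87.2, 590.6
group_sizes = [7, 8, 9]

k = len(group_sizes)   # number of groups
n = sum(group_sizes)   # total number of observations

df_among = k - 1       # 2
df_within = n - k      # 21

ms_among = ss_among / df_among
ms_within = ss_within / df_within
f_stat = ms_among / ms_within
ss_total = ss_among + ss_within

print(df_among, df_within, round(ms_among, 1), round(ms_within, 2))
print(round(f_stat, 2), round(ss_total, 1))
```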
Did the exercise program appear to have an effect on weight?
Summarize the study without using any formal statistical terminology.
Solution
There is no evidence that the mean weight changes
in the three exercise groups differ.
The observed differences in mean weight change
can be explained by chance variation alone.
Regression

True or False.
Ordinary regression can be an appropriate model
for analyzing data
when there is a quantitative response variable
and one or more categorical explanatory variables.
Solution
False.
In ordinary regression, all variables are quantitative.

True or False.
The criterion for selecting the best line
in ordinary least squares regression is to make the sum of absolute residuals as small as possible.
Solution
False.
Minimize the sum of the squared residuals.
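To illustrate the criterion, the sketch below fits a line by the usual closed-form least squares formulas and compares its sum of squared residuals with that of a slightly different line. The data values are hypothetical, invented only for this illustration, and the function names are not from the course.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)
other = sum_squared_residuals(a + 0.1, b - 0.05, xs, ys)  # a nearby line

print(round(a, 2), round(b, 2), best < other)
```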

Multiple choice.
A scatter plot of total serum cholesterol versus weight
among healthy males aged 25-50 for a sample of 250 men
shows a weak linear relationship and a positive association.
Which correlation coefficient could plausibly fit this description?
-1.4, -1.0, -0.8, -0.2, 0.0, 0.3, 0.97, 1.0, or 2.34.
Solution
The correlation 0.3 is the only plausible value.
Correlations must be between -1 and 1.
Correlations will be positive if the association is positive.
A weak linear relationship will have a correlation coefficient closer to 0 than to 1.
Categorical Data
Hair color and eye color are categorized for this sample of 1000 people.
Is there evidence that the two variables are not independent?
                       Hair color
Eye color   Brunette   Red   Blonde   Total
-------------------------------------------
Blue        100        20    180      300
Brown       450        30    120      600
Green       50         20    30       100
-------------------------------------------
Total       600        70    330      1000
Solution
The expected counts are:
                       Hair color
Eye color   Brunette   Red   Blonde   Total
-------------------------------------------
Blue        180        21    99       300
Brown       360        42    198      600
Green       60         7     33       100
-------------------------------------------
Total       600        70    330      1000
The chi-square statistic is 184.6 with (3-1)(3-1) = 4 degrees of freedom.
The 0.9995 quantile of the chi-square distribution with 4 degrees of freedom is 20.0.
The p-value is essentially 0.
There is overwhelming evidence that hair color and eye color are related.
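The expected counts and the test statistic can be recomputed from the observed table as printed above. This is a sketch with illustrative names; the closed-form tail area used for the p-value holds only for a chi-square distribution with 4 degrees of freedom.

```python
import math

# Observed counts: rows are eye colors (Blue, Brown, Green),
# columns are hair colors (Brunette, Red, Blonde).
observed = [
    [100, 20, 180],
    [450, 30, 120],
    [50, 20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total)(column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper tail area for 4 degrees of freedom only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(expected)
print(round(chi2, 1), df, p_value)
```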
Last modified: May 1, 2001
Bret Larget,
larget@mathcs.duq.edu