Math 225

Introduction to Biostatistics

Final Exam Review

Exploratory Data Analysis

Concepts:

Mean, median, standard deviation, variance, skewness, histogram, boxplot, quartiles, interquartile range, categorical and quantitative variables.

Sample multiple choice problem:

A five number summary of exam scores is 17, 66, 76, 85, 97. A histogram of scores is unimodal with a peak near 75 and the left tail extending longer than the right tail.

(a) The exam scores are skewed to the right.
(b) The mean exam score is 76.
(c) The standard deviation is greater than 4.
(d) The standard deviation is the square root of the variance.
(e) The interquartile range is 19.

Solution

(a) is false because the scores are skewed to the left.
(b) is false. The median score is 76.
(c) is true. The five-number summary does not determine the standard deviation exactly, but a quarter of the scores are at or below 66 and a quarter are at or above 85, so a typical distance from the mean must be larger than four.
(d) is true.
(e) is true. 85 - 66 = 19.
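
The checks for (a) and (e) can be sketched in a few lines of Python. This is only an illustration that reads the five-number summary from the problem. The variable names are made up for the sketch, and comparing tail lengths is a rough indication of skewness, not a formal measure.

```python
# Five-number summary from the problem: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = 17, 66, 76, 85, 97

iqr = q3 - q1                    # interquartile range, option (e)
left_tail = median - minimum     # distance from the median down to the minimum
right_tail = maximum - median    # distance from the median up to the maximum

print("IQR:", iqr)
print("Left tail longer than right tail:", left_tail > right_tail)
```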

Probability

Concepts:

Probability, outcome space, event, independence, mutually exclusive events, addition principle, multiplication principle, inclusion-exclusion, factorials, permutations, combinations, conditional probability, Bayes' rule, sensitivity, specificity, prevalence rate, false positive rate, false negative rate, predictive value, the binomial distribution, the Poisson distribution, the normal distribution, standard normal distribution, sampling distributions, the central limit theorem, population, sample, parameter, statistic, random variable, simple random sample.

Sample multiple choice problem:

The distribution of volumes of soda in cans is normal with a mean of 12.2 ounces and a standard deviation of 0.1 ounces.

(a) The proportion of soda cans with less than 12 ounces of soda is about 16%.
(b) The median of the distribution is 12.2 ounces.
(c) The proportion of soda cans with between 12.1 and 12.2 ounces of soda is about 34%.
(d) In a sample of 25 soda cans, the sample mean is about 95% likely to fall between 12.16 and 12.24 ounces.
(e) The sampling distribution of the sample mean from this population is normal for any sample size.

Solution

(a) is false. About 16% of the cans are more than one standard deviation below the mean. 12 ounces is two standard deviations below the mean, so only about 2.5% of cans contain less than 12 ounces.
(b) is true. Normal curves are symmetric about the mean.
(c) is true. About 68% is within one standard deviation of the mean, above or below. Half of these will be below the mean.
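(d) is true. The sampling distribution of the sample mean in this problem is normal with a mean of 12.2 ounces and SE 0.1/sqrt(25) = 0.02 ounces. The middle 95% of this distribution is 12.2 plus or minus 2(0.02), which is the interval given.
(e) is true. When the population itself is normal, the sample mean has a normal distribution for every sample size.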
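
The areas in (a), (c), and (d) can be checked numerically. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist. The variable names are illustrative only. The exact areas differ slightly from the 68-95-99.7 rule values quoted in the solution.

```python
from statistics import NormalDist

# Population of can volumes: normal, mean 12.2 ounces, SD 0.1 ounces.
volume = NormalDist(mu=12.2, sigma=0.1)

p_below_12 = volume.cdf(12.0)                          # option (a)
p_12_1_to_12_2 = volume.cdf(12.2) - volume.cdf(12.1)   # option (c)

# Option (d): the mean of n = 25 cans has standard error sigma / sqrt(n).
n = 25
se = 0.1 / n ** 0.5
sample_mean = NormalDist(mu=12.2, sigma=se)
p_mean_in_interval = sample_mean.cdf(12.24) - sample_mean.cdf(12.16)

print(round(p_below_12, 4), round(p_12_1_to_12_2, 4), round(p_mean_in_interval, 4))
```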

Sample quantitative problems:

You may use these facts:

The mean of the binomial distribution is np.
The standard deviation of the binomial distribution is sqrt(np(1-p)).
Recall the 68-95-99.7 rule for normal distributions.

If a drug is fifty percent likely to be more effective than a control in each trial, what is the probability of it being more effective than a control in exactly five out of six trials?

Solution
This is binomial with n=6 and p=0.5. The probability is ₆C₅ * (0.5)⁵ * (0.5)¹ = 6/64 = 0.09375.
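
As a check, the binomial probability can be computed directly. This is a minimal sketch using math.comb from the standard library (Python 3.8 or later). The helper name binomial_pmf is made up for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Return P(X = k) for X with a binomial(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly five successes in six trials, each with success probability 0.5.
prob_five_of_six = binomial_pmf(5, 6, 0.5)
print(prob_five_of_six)  # 0.09375
```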

In a classroom experiment, a large bucket is filled with colored beads, 30% of which are red. A student takes a random sample of 50 beads.
(a) What is the expected number of red beads in the sample?
(b) What is the size of a typical deviation between the actual random number of red beads and the expected number? (Standard deviation of the appropriate probability distribution.)
(c) Write an expression for the exact probability that 20 or more sampled beads are red. Do not simplify or evaluate this expression.
(d) Describe how to approximate the probability in the previous part with an area under a normal curve.

Solution
(a) mu = 50*(0.3) = 15.
(b) sigma = sqrt(50*0.3*0.7) = 3.24.
(c) P(X >= 20) = ₅₀C₂₀ * (0.3)²⁰ * (0.7)³⁰ + ₅₀C₂₁ * (0.3)²¹ * (0.7)²⁹ + ... + ₅₀C₅₀ * (0.3)⁵⁰ * (0.7)⁰
(d) The sum of binomial probabilities will be approximately the area to the right of z = (19.5-15)/3.24 = 1.39 under a standard normal curve.
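
For comparison, the exact binomial sum and the normal approximation with the continuity correction can both be evaluated. The following is a sketch using only the Python standard library. The variable names are chosen for the illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.3

mu = n * p                       # expected number of red beads
sigma = sqrt(n * p * (1 - p))    # binomial standard deviation

# Exact probability: sum the binomial probabilities for k = 20, 21, ..., 50.
exact_tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(20, n + 1))

# Normal approximation: area to the right of z, using 19.5 in place of 20.
z = (19.5 - mu) / sigma
approx_tail = 1 - NormalDist().cdf(z)

print(round(mu, 2), round(sigma, 2), round(z, 2))
print(round(exact_tail, 4), round(approx_tail, 4))
```

The two tail probabilities should agree to about two decimal places, which is the point of the approximation in part (d).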

In a population, IQ scores are normally distributed with a mean of 115 and a standard deviation of 20.
(a) What proportion of IQ scores are greater than 140?
(b) What is the 85th percentile of this distribution?
(c) State the three conclusions of the central limit theorem as presented in the computer lab.
(d) In a sample of nine individuals, what is the probability that the mean IQ score is greater than 120?
(e) Name a characteristic of a population that has a great effect on the normality of the sampling distribution of the sample mean.

Solution
(a) 0.1056; (b) 135.7;
(c) The mean of the sampling distribution of the sample mean is the population mean, the standard deviation of the sampling distribution of the sample mean is sigma / sqrt(n), and the shape is approximately normal if the sample size is large enough.
(d) 0.2266;
(e) skewness.
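
The numerical answers to (a), (b), and (d) can be reproduced with statistics.NormalDist (Python 3.8 or later). This is a sketch with illustrative variable names, not a required method. On an exam the same values come from a standard normal table.

```python
from statistics import NormalDist

# Population of IQ scores: normal, mean 115, SD 20.
iq = NormalDist(mu=115, sigma=20)

p_above_140 = 1 - iq.cdf(140)      # (a) proportion of scores above 140
pct_85 = iq.inv_cdf(0.85)          # (b) 85th percentile

# (d) The mean of n = 9 scores has standard error sigma / sqrt(n) = 20/3.
n = 9
se = 20 / n ** 0.5
p_mean_above_120 = 1 - NormalDist(mu=115, sigma=se).cdf(120)

print(round(p_above_140, 4), round(pct_85, 1), round(p_mean_above_120, 4))
```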

In a given population, a disease has a prevalence rate of 0.2%. For individuals with the disease, a positive test result occurs 98% of the time. For individuals without the disease, a negative test result occurs 93% of the time. An individual from this population is randomly sampled and has a positive test result. What is the probability the person has the disease?

Solution
(0.002)(0.98)/((0.002)(0.98)+(0.998)(0.07)) = 0.027.
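
The same Bayes' rule calculation, written out step by step as a small sketch. The variable names are chosen for the illustration. The three inputs are the prevalence, sensitivity, and specificity given in the problem.

```python
prevalence = 0.002     # P(disease)
sensitivity = 0.98     # P(positive | disease)
specificity = 0.93     # P(negative | no disease)

# Joint probabilities of the two ways to get a positive result.
true_positive = prevalence * sensitivity
false_positive = (1 - prevalence) * (1 - specificity)

# Positive predictive value: P(disease | positive).
ppv = true_positive / (true_positive + false_positive)
print(round(ppv, 3))
```

Even with a fairly accurate test, the low prevalence means most positive results come from the much larger disease-free group.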

Statistical Inference

Concepts:

Hypothesis test, null and alternative hypotheses, one- and two-sided tests, test statistic, p-value, significance level, confidence interval, standard error.

Sample multiple choice problem:

In a random sample of 100 individuals, the sample mean and sample standard deviation of body temperature are 98.2 and 0.7 degrees, respectively.

(a) If the distribution is approximately normal, about 95% of the sampled individuals have body temperatures between 96.8 and 99.6 degrees.
(b) We can be about 95% confident that the population mean body temperature is between 98.06 and 98.34.
(c) Any two-sided hypothesis test with null hypothesis that mu is in the interval above would be significant at the 5% level.
(d) In a one-sided hypothesis test with null hypothesis that mu=98.6, the p-value is the area to the left of t = (98.2-98.6)/0.7 under a t distribution curve with 99 degrees of freedom.
(e) If the p-value is much less than 0.0005, this indicates very strong evidence that the population mean is lower than 98.6.

Solution

(a) is true. The sample proportion within two standard deviations of the sample mean from a large sample drawn from a normal population will probably be close to 95%.
(b) is true. SE = 0.7/sqrt(100) = 0.07. The 95% confidence interval is correctly constructed and interpreted.
(c) is false. The 95% confidence interval contains those values of mu for which a two-sided hypothesis test would NOT be significant at the 5% level.
(d) is false. The denominator of the t statistic should be the SE, 0.07, not s = 0.7.
(e) is true.
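
The arithmetic behind (a), (b), and (d) can be sketched as follows. The multiplier 2 is the rough 95% value used in the problem, and the variable names are illustrative. The Python standard library has no t distribution, so the sketch stops at the test statistic. The p-value would come from a t table with 99 degrees of freedom or from statistical software.

```python
from math import sqrt

n, x_bar, s = 100, 98.2, 0.7

# (a) Range covering about 95% of individuals: mean plus or minus 2 SDs.
individual_range = (x_bar - 2 * s, x_bar + 2 * s)

# (b) Approximate 95% confidence interval for mu: mean plus or minus 2 SEs.
se = s / sqrt(n)
confidence_interval = (x_bar - 2 * se, x_bar + 2 * se)

# (d) Correct t statistic for the null hypothesis mu = 98.6 divides by the SE.
t_stat = (x_bar - 98.6) / se

print([round(v, 2) for v in individual_range])
print([round(v, 2) for v in confidence_interval])
print(round(t_stat, 2))
```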

Sample quantitative problems:

(a) Under what condition(s) is it reasonable to approximate the sampling distribution of a proportion with a normal distribution?
(b) In a random sample of 100 individuals from a city, 13 are infected with HIV. What is the test statistic in the hypothesis test with a null hypothesis that 5% of the populace in the city has HIV versus the alternative that the proportion is higher? To which distribution should this test statistic be compared to conclude the test?

Solution
(a) When np is at least 5 and when n(1-p) is at least 5. For sample data, this corresponds to when there are at least five observations of each type.
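(b) z = (12.5 - 5)/sqrt(100*.05*.95) = 3.44, where 12.5 is the observed count of 13 less a continuity correction of 0.5, and 5 = 100(0.05) is the expected count under the null hypothesis. Compare to a standard normal curve.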
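
The test statistic in (b), together with the one-sided p-value it leads to, can be sketched with the standard library. The variable names are made up for the illustration. The continuity correction of 0.5 follows the solution above.

```python
from math import sqrt
from statistics import NormalDist

n, observed, p0 = 100, 13, 0.05

expected = n * p0                       # expected count under the null
sd = sqrt(n * p0 * (1 - p0))            # binomial SD under the null

# Continuity-corrected z statistic for the alternative "proportion is higher".
z = (observed - 0.5 - expected) / sd

# One-sided p-value: area to the right of z under the standard normal curve.
p_value = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_value, 5))
```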

A random sample of 219 adult drug users contains 55 individuals who began smoking cigarettes at age 12 or younger. A second random sample of 822 people who do not use drugs contains 117 individuals who began smoking at age 12 or younger. Find the appropriate test-statistic in a test that the population proportions of individuals who began smoking at age 12 or younger are the same for both drug users and nonusers. Describe how to find a p-value as an area under a curve.

Solution
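p₁-hat = 55/219 = 0.251.
p₂-hat = 117/822 = 0.142.
p-bar = (55+117)/(219+822) = 0.165.
SE = sqrt((0.165*0.835)(1/219 + 1/822)) = 0.0282.
z = (0.251 - 0.142) / 0.0282 = 3.87. Compare z to a standard normal curve. Because the alternative is two-sided, the p-value is the total area to the left of -3.87 and to the right of 3.87, which is about 0.0001.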
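
The pooled two-proportion calculation can be sketched as below, using only the standard library and illustrative variable names. Because the sketch carries full precision instead of rounding each step to three decimals, its z comes out slightly below the 3.87 obtained from the rounded values. The conclusion is the same.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 55, 219     # drug users who began smoking at age 12 or younger
x2, n2 = 117, 822    # nonusers who began smoking at age 12 or younger

p1_hat = x1 / n1
p2_hat = x2 / n2
p_bar = (x1 + x2) / (n1 + n2)    # pooled proportion under the null

se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Two-sided p-value: area in both tails beyond |z| under the standard normal.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p1_hat, 3), round(p2_hat, 3), round(p_bar, 3))
print(round(se, 4), round(z, 2), round(p_value, 4))
```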

Analysis of Variance

Sample quantitative problem:

(Circle the correct choices.) Analysis of variance can be an appropriate model for analyzing data when there is a categorical/quantitative response variable and one or more categorical/quantitative explanatory variables.

Solution
Quantitative response variable, categorical explanatory variables.

In a study to compare four different diets, 28 young pigs are randomly assigned to four treatment groups with seven pigs in each group. For each pig, the weight gain after three months is measured. Name a graphical procedure that summarizes the distribution of weight gains for all four diets on the same graph.

Solution
Side-by-side boxplots are very useful for comparing several distributions.

For the study described in the previous problem, if a researcher analyzed the data using one-way analysis of variance, to which F distribution would the test statistic be compared to determine a p-value?

Solution
The F statistic has an F distribution with 3 and 24 degrees of freedom.

Twenty-four faculty members who did not regularly exercise volunteered to participate in a study on the effect of exercise on weight. The faculty members were randomly assigned to three treatment groups: seven in a control group, eight in a low intensity exercise group, and nine in a high intensity exercise group. The changes in weight are displayed and summarized below. (A positive number indicates a gain in weight during the program.)

Exercise Group	n	Weight difference in pounds	mean	sd
Control	7	1, 7, 2, 0, -4, 10, 2	2.57	4.61
Low intensity	8	-7, 3, 6, 2, -14, -3, -2, -2	-2.13	6.27
High intensity	9	1, 0, 7, 8, 7, -4, 4, -1, 1	1.00	4.85

Complete the following ANOVA table.

Source	Sum of Squares	df	MS	F	p-value
Among	87.2				0.24
Within	590.6
Total

Solution

Source	Sum of Squares	df	MS	F	p-value
Among	87.2	2	43.6	1.55	0.24
Within	590.6	21	28.12
Total	677.8	23
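
The entries in the ANOVA table can be rebuilt from the summary columns (n, mean, sd) of the data table. The sketch below makes that assumption explicit: it works from the rounded summary statistics, not the raw weight differences, so its sums of squares match the table only up to rounding. The variable names are illustrative.

```python
# (n, mean, sd) for each group, copied from the summary table.
groups = [
    (7, 2.57, 4.61),    # Control
    (8, -2.13, 6.27),   # Low intensity
    (9, 1.00, 4.85),    # High intensity
]

total_n = sum(n for n, _, _ in groups)
grand_mean = sum(n * mean for n, mean, _ in groups) / total_n

# Among-groups sum of squares: weighted squared distances of group means
# from the grand mean.
ss_among = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)

# Within-groups sum of squares: (n - 1) * sd^2 added over the groups.
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

df_among = len(groups) - 1
df_within = total_n - len(groups)
df_total = total_n - 1

ms_among = ss_among / df_among
ms_within = ss_within / df_within
f_stat = ms_among / ms_within

print(round(ss_among, 1), df_among, round(ms_among, 1), round(f_stat, 2))
print(round(ss_within, 1), df_within, round(ms_within, 2))
print(round(ss_among + ss_within, 1), df_total)
```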

Did the exercise program appear to have an effect on weight? Summarize the study without using any formal statistical terminology.

Solution

There is no evidence that the mean weight change differed among the three exercise groups. The observed differences in mean weight change can be explained by chance variation alone.

Regression

True or False. Ordinary regression can be an appropriate model for analyzing data when there is a quantitative response variable and one or more categorical explanatory variables.

Solution
False. In ordinary regression, the response variable and all explanatory variables are quantitative.

True or False. The criterion for selecting the best line in ordinary least squares regression is to make the sum of absolute residuals as small as possible.

Solution
False. Minimize the sum of the squared residuals.
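
To make the least squares criterion concrete, the sketch below fits a line with the usual closed-form slope and intercept and checks that nudging the line increases the sum of squared residuals. The (x, y) values are hypothetical, invented only for this illustration, and the function name is made up as well.

```python
# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least squares estimates for the line y = intercept + slope * x.
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    / sum((x - x_bar) ** 2 for x in xs)
)
intercept = y_bar - slope * x_bar

def sum_squared_residuals(a, b):
    """Sum of squared vertical distances from the points to y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = sum_squared_residuals(intercept, slope)
nudged_slope = sum_squared_residuals(intercept, slope + 0.1)
nudged_intercept = sum_squared_residuals(intercept + 0.1, slope)

print(best < nudged_slope, best < nudged_intercept)
```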

Multiple choice. A scatter plot of total serum cholesterol versus weight among healthy males aged 25-50 for a sample of 250 men shows a weak linear relationship and a positive association. Which correlation coefficient could plausibly fit this description?
-1.4, -1.0, -0.8, -0.2, 0.0, 0.3, 0.97, 1.0, or 2.34.

Solution
The correlation 0.3 is the only plausible value. Correlations must be between -1 and 1. Correlations will be positive if the association is positive. A weak linear relationship will have a correlation coefficient closer to 0 than to 1.

Categorical Data

Hair color and eye color are categorized for this sample of 1000 people. Is there evidence that the two variables are not independent?

              Hair color
Eye color | Brunette | Red | Blonde | Total
-------------------------------------------
Blue      |    100   |  20 |   180  |  300
Brown     |    450   |  30 |   120  |  600
Green     |     50   |  20 |    30  |  100
-------------------------------------------
Total     |    600   |  70 |   330  | 1000

Solution

The expected counts are:

              Hair color
Eye color | Brunette | Red | Blonde | Total
-------------------------------------------
Blue      |    180   |  21 |    99  |  300
Brown     |    360   |  42 |   198  |  600
Green     |     60   |   7 |    33  |  100
-------------------------------------------
Total     |    600   |  70 |   330  | 1000

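The chi-square statistic, the sum of (observed - expected)²/expected over the nine cells, is about 184.6 with (3-1)(3-1) = 4 degrees of freedom. The 0.9995 quantile of the chi-square distribution with 4 degrees of freedom is about 20.0. The p-value is essentially 0. There is overwhelming evidence that hair color and eye color are related.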
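
The expected counts and the test statistic can be recomputed from the observed table. The following is a sketch using only the standard library, with illustrative variable names. The p-value line uses a closed form for the chi-square upper tail that holds only when there are 4 degrees of freedom, as in this 3 by 3 table. For other table sizes a chi-square table or statistical software would be needed.

```python
from math import exp

# Observed counts: rows are eye colors, columns are Brunette, Red, Blonde.
observed = [
    [100, 20, 180],   # Blue
    [450, 30, 120],   # Brown
    [50, 20, 30],     # Green
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail area for a chi-square distribution with exactly 4 degrees of
# freedom: P(X > x) = exp(-x/2) * (1 + x/2). Not valid for other df.
p_value = exp(-chi_square / 2) * (1 + chi_square / 2)

print(expected)
print(round(chi_square, 1), df, p_value)
```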

Last modified: May 1, 2001

Bret Larget, larget@mathcs.duq.edu