Math 225

Introduction to Biostatistics


Statistical Inferences for One Population

Prerequisites

This lab assumes that you already know how to:
  1. Log in, find the course Web page, and run S-PLUS
  2. Use the Commands Window to execute commands
  3. Load data sets

Technical Objectives

This lab will teach you to:
  1. Calculate probabilities and quantiles associated with the t distribution.
  2. Estimate a population mean with a confidence interval using S-PLUS.
  3. Test a hypothesis about a single population mean using S-PLUS.

Conceptual Objectives

In this lab you should begin to understand:
  1. how to properly interpret confidence intervals
  2. how to interpret hypothesis test results
  3. what a p-value is
  4. when inference based on the t distribution is appropriate

Confidence intervals
You have learned to standardize a normal random variable so that its standardized value (z-score) can be compared to the standard normal curve.

z = (x-mu)/sigma

You also know that the same idea holds for a sample mean, xbar. In this case, you must use SE(xbar) = sigma/sqrt(n) instead of sigma in the denominator.

z = (xbar-mu)/(sigma/sqrt(n))
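
As a rough illustration of the formula above, the commands below standardize a sample mean and find the area to its left. The values of mu, sigma, n, and xbar are hypothetical numbers chosen only for this sketch; they do not come from any data set in this lab.

```r
# Hypothetical values, chosen only for illustration
mu <- 50        # population mean
sigma <- 10     # population standard deviation
n <- 25         # sample size
xbar <- 47      # observed sample mean

se <- sigma/sqrt(n)     # standard error of xbar: 10/5 = 2
z <- (xbar - mu)/se     # z-score of the sample mean: -1.5
p <- pnorm(z)           # area to the left of z under the standard normal curve
p
```

Dividing by sigma/sqrt(n) rather than by sigma is what makes z larger in absolute value for a sample mean than for a single observation at the same distance from mu.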

Consider estimating the mean body temperature of a population. You could take a random sample of people and find the body temperature of each. The sample mean would be your best single guess for a value of the unknown population mean body temperature. However, because you know that your estimate may not be exactly correct, you should include with your single best estimate an assessment of how accurate the estimate is likely to be. A confidence interval does just this.

If you wish to be 95% confident that your estimate is within some distance of the true population mean, you create an interval centered at xbar that extends 1.96 SE(xbar) in each direction. This is justified because the middle 95% of the area under the standard normal curve lies within 1.96 standard deviations of the mean, and the central limit theorem says that the sampling distribution of xbar will be approximately normal (for sufficiently large sample sizes).

A 95% confidence interval for mu:

xbar ± 1.96 sigma / sqrt(n)

A confidence interval with a different level of confidence (90% and 99% are typical choices) would use a multiplier other than 1.96 (1.645 for 90% or 2.576 for 99%), determined by the cutpoints of the corresponding central area under the standard normal curve.
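
The multipliers can be recovered from the standard normal curve with qnorm. The sketch below assumes a known sigma and uses hypothetical values of xbar, sigma, and n that are not taken from any data set in this lab.

```r
# Confidence levels and the matching normal multipliers
conf <- c(0.90, 0.95, 0.99)
mult <- qnorm(1 - (1 - conf)/2)   # approximately 1.645, 1.960, 2.576

# Hypothetical inputs, chosen only for illustration
xbar <- 47
sigma <- 10
n <- 25

se <- sigma/sqrt(n)               # standard error of xbar
lower <- xbar - mult*se           # lower endpoints, one per confidence level
upper <- xbar + mult*se           # upper endpoints, one per confidence level
cbind(conf, mult, lower, upper)
```

A central area of 95% leaves 2.5% in each tail, which is why the cutpoint is the 0.975 quantile rather than the 0.95 quantile.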

The t distribution
In a typical situation, however, the population standard deviation sigma will not be known. In this case, you can replace sigma with the sample standard deviation s. The problem that arises is that the random variable

t = (xbar-mu)/(s/sqrt(n))

does not have a standard normal distribution. Instead, it has a t distribution with n-1 degrees of freedom. This distribution is symmetric, bell-shaped, and centered at 0, just as the standard normal curve is, but it is more spread out. There is a different t distribution for every sample size. When constructing a 95% confidence interval, the formula

xbar ± 1.96 s / sqrt(n)

will give an interval that is too narrow. It covers less than 95% of the area under the sampling distribution because estimating sigma with s adds uncertainty to the standard error. To correct for this, a multiplier from the t distribution should be used instead of one from the standard normal distribution. This multiplier should correspond to the middle 95% of the t distribution with the correct number of degrees of freedom.

The t distribution in S-PLUS
The two S-PLUS functions you need to learn for this lab are pt, which finds areas under t distributions, and qt, which finds quantiles of t distributions. They are used as follows.

> pt(x,df)
is the area to the left of x under a t distribution with df degrees of freedom. For confidence intervals for a population mean, df = n-1.

> qt(p,df)
is the number t for which the area to the left of t under a t distribution with df degrees of freedom is p.
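
Because pt turns a cutpoint into an area and qt turns an area into a cutpoint, the two functions undo each other. The short sketch below checks this for a t distribution with 15 degrees of freedom; the choice of 15 is arbitrary and is used only for illustration.

```r
df <- 15                                  # arbitrary degrees of freedom
tcut <- qt(0.975, df)                     # cutpoint with area 0.975 to its left
left <- pt(tcut, df)                      # area to the left of tcut: recovers 0.975
middle <- pt(tcut, df) - pt(-tcut, df)    # area between -tcut and tcut: 0.95
c(tcut, left, middle)
```

The value tcut is the multiplier that replaces 1.96 in a 95% confidence interval when the sample size is 16.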

S-PLUS help is available in this on-line guide.

Appropriateness of the t distribution
In theory, the t distribution assumes that the population is normal. The central limit theorem implies that the t distribution will be approximately correct for nonnormal populations provided the sample size is sufficiently large. As the previous lab showed, the t distribution may not be suitable if the population is strongly skewed and the sample size is small. However, in most situations where strong skewness is not a problem, the t distribution is appropriate even for small samples.
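
One way to see this is with a small simulation, sketched below. It is not part of the original lab: the seed, the number of simulated samples, the sample size of 5, and the use of an exponential population as the "strongly skewed" example are all assumptions made for illustration. The commands count how often a 95% t interval contains the true population mean.

```r
set.seed(225)               # arbitrary seed so the run can be repeated
nsim <- 2000                # number of simulated samples (assumed)
n <- 5                      # small sample size (assumed)
mult <- qt(0.975, n - 1)    # t multiplier for a 95% interval
covered.norm <- 0
covered.exp <- 0
for(i in 1:nsim) {
    x <- rnorm(n)                          # normal population, true mean 0
    half <- mult*sqrt(var(x))/sqrt(n)      # half-width of the t interval
    if(abs(mean(x) - 0) <= half) covered.norm <- covered.norm + 1
    y <- rexp(n)                           # skewed population, true mean 1
    half <- mult*sqrt(var(y))/sqrt(n)
    if(abs(mean(y) - 1) <= half) covered.exp <- covered.exp + 1
}
# Proportion of intervals that contain the true mean
c(normal = covered.norm/nsim, exponential = covered.exp/nsim)
```

If the t distribution is appropriate, the proportion should be close to 0.95. Comparing the two proportions shows how much skewness matters at this sample size; increasing n in the sketch shows the effect of a larger sample.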

In-class Activities

The first goal is to learn to use the computer to make calculations with a t distribution and to compare these to the standard normal distribution.
  1. Open a Commands Window. [How?]
  2. Find the area to the left of -2 under the standard normal curve.
    > pnorm(-2)
    
  3. Find the area to the left of -2 under t distributions with 1, 10, 30, and 100 degrees of freedom.
    > pt(-2,c(1,10,30,100))
    

    (The S-PLUS function c collects several numbers together. You could use pt(-2,1) to find just the first of the probabilities.)

    Notice that these numbers are all larger than the corresponding area for the normal curve, but that they get closer to the normal curve area as the degrees of freedom increase. What do you think happens as the number of degrees of freedom increases to infinity?

  4. What is the number z so that the area under the standard normal curve between -z and z is 95%?
    > qnorm(0.975)
    
  5. What are the numbers t so that the area under a t distribution with 1, 10, 30, and 100 degrees of freedom between -t and t is 95%?
    > qt(0.975,c(1,10,30,100))
    
    Notice that these numbers get closer to 1.96 as the degrees of freedom (and so the sample size) increase. You can find these numbers in the table on page 419 in your textbook.

Is the mean body temperature of human adults really 98.6? The body temperature data set contains the body temperature and gender of 130 volunteers, 65 men and 65 women. For the present, we will ignore differences due to gender.

Load this data into S-PLUS.

  • To apply the methods in the textbook, we need to find the mean and standard deviation of body temperature, coded as the variable temp.

    > attach(temperature)
    > mean(temp)
    > sqrt(var(temp))
    

  • Draw a histogram of the variable temp. Is it centered about where you would expect? Does the standard deviation represent the size of a typical deviation from the mean? Is the shape approximately normal, or do you see substantial skewness or outliers?
  • Construct a 95% confidence interval for the unknown population mean using the method described in the textbook in section 6-2-3. Use the t table on page 419 to find the correct t value.
  • Now use S-PLUS to find a confidence interval.
    1. Use your mouse to select Statistics:Compare Samples:One Sample:t test
    2. Select the variable temp.
    3. Use 98.6 as the mean for the null hypothesis.
    4. Click on OK.
    5. Read the Report Window.

    A p-value is the probability of observing a result at least as extreme as that actually observed if the experiment were repeated and the null hypothesis were true. Small p-values imply strong evidence that the null hypothesis is false, but a p-value should not be interpreted as the probability that the null hypothesis is true. Later lab assignments will consider p-values in more detail.

  • Which of the statements below are justified?
    1. 95% of the sample has body temperatures between 98.12 and 98.38 degrees.
    2. 95% of the population has body temperatures between 98.12 and 98.38 degrees.
    3. There is fairly strong evidence that the mean population body temperature is not 98.6 degrees.
    4. We can be about 95% sure that the population mean body temperature is between 98.12 and 98.38.
    5. If we took another sample of 130 people, it would be 95% likely that the sample mean would be between 98.12 and 98.38.
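
The confidence interval and test described above can also be worked through in the Commands Window. The sketch below uses ten hypothetical temperatures, not the class data set, and assumes that the command t.test carries out the same one-sample calculation as the menu item. It builds the interval and the two-sided p-value from the formulas and then compares them with the t.test result.

```r
# Hypothetical measurements, NOT the class data set
x <- c(98.2, 97.9, 98.6, 98.4, 98.0, 98.8, 97.7, 98.3, 98.1, 98.5)
mu0 <- 98.6                          # mean under the null hypothesis

n <- length(x)
xbar <- mean(x)
se <- sqrt(var(x))/sqrt(n)           # estimated standard error of xbar
ci <- xbar + c(-1, 1)*qt(0.975, n - 1)*se    # 95% confidence interval

tstat <- (xbar - mu0)/se             # t statistic
pval <- 2*pt(-abs(tstat), n - 1)     # two-sided p-value

result <- t.test(x, mu = mu0)        # one-sample t test on the same data
ci
result$conf.int
pval
result$p.value
```

The interval is a statement about the population mean, so its endpoints depend on s/sqrt(n) and become closer together as n grows; it does not describe the range of individual temperatures.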


    Homework Assignment

    You may solve each problem below using the program S-PLUS. You should also know how to do each problem with paper, pencil, and your t table. Solutions with t tables may not be as accurate as solutions with the computer. Put your answers on this form.

    1. For a t distribution with 10 degrees of freedom, what is the number t such that the area between -t and t is
      1. 0.90?
      2. 0.95?
      3. 0.99?
      What are the corresponding numbers for the standard normal curve?

    2. For a t distribution with 24 degrees of freedom, what is the number t such that the area between -t and t is
      1. 0.90?
      2. 0.95?
      3. 0.99?
      What are the corresponding numbers for the standard normal curve?

    3. The rat data set is modeled after Exercise 6-1 on page 160. Each value is the time in hours until a skin cancer is gone after treatment with a drug. Find a 95% confidence interval for the mean time until cancers disappear using the textbook formula. (You may use S-PLUS to calculate summary statistics.)
    4. Use S-PLUS to find the 95% confidence interval for the population mean time until cancers disappear assuming the 24 rats are a random sample from a larger population.
      1. Use your mouse to select Statistics:Compare Samples:One Sample:t test
      2. Select the variable hours.
      3. Click on OK.
      4. Read the Report Window.

    5. Consider again the body temperature data from class. Make the assumption that the mean body temperature in the population is 98.6 degrees and that the population standard deviation is 0.73 degrees. You may solve these problems with or without S-PLUS.
      1. What is the probability that a single randomly sampled body temperature would be 98.25 or lower?
      2. What is the probability that the mean of 10 randomly sampled body temperatures would be 98.25 or lower?
      3. What is the probability that the mean of 130 randomly sampled body temperatures would be 98.25 or lower?


    Last modified: March 5, 2001

    Bret Larget, larget@mathcs.duq.edu