# Math 225

## Sampling Distributions and the Central Limit Theorem

#### Prerequisites

This lab assumes that you already know how to:
1. Log in, find the course Web page, and run S-PLUS
2. Use the Commands Window to execute commands
3. Load and run an S-PLUS program

#### Technical Objectives

This lab will teach you to:
1. Use S-PLUS to calculate probabilities of events involving the sample mean.
2. Graph the sampling distribution of the sample mean for various sample sizes and various populations.

#### Conceptual Objectives

In this lab you should learn to:
1. Understand what a sampling distribution is.
2. Understand which factors most strongly affect the accuracy of the normal approximation to a sampling distribution.
3. Understand the central limit theorem.
4. Understand that there is no single sample size for which the sampling distribution will be approximately normal for all possible populations.

##### Sampling Distributions

A statistic is a quantity calculated from sample data. If the sample is random, the statistic is a random variable, and its probability distribution is called a sampling distribution. This lab examines the sampling distribution of the sample mean.

##### The Central Limit Theorem
Roughly speaking, the central limit theorem for the sample mean states three things:

1. The mean of the sampling distribution of the sample mean is the population mean.
2. The standard deviation of the sampling distribution of the sample mean is the population standard deviation divided by the square root of the sample size.
3. If the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normal.

If we refer to the sample mean as xbar, the population mean as mu, the population standard deviation as sigma, and the sample size as n, these statements are equivalent to:

1. mean(xbar) = mu
2. se(xbar) = sigma / sqrt(n)
3. the distribution of xbar is approximately normal when n is sufficiently large.
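
The first two statements above can be checked directly by simulation. The following is a minimal sketch, written in Python rather than S-PLUS and not part of the lab's `clt.ssc` script, assuming an exponential population with rate 1 (so mu = 1 and sigma = 1):

```python
import math
import random
import statistics

# Illustration only (not part of the lab): simulate many samples of size n
# from an exponential population with mu = 1 and sigma = 1, and look at the
# mean and standard deviation of the resulting sample means.
random.seed(1)
mu, sigma, n, reps = 1.0, 1.0, 25, 20000

xbars = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    xbars.append(sum(sample) / n)

print(round(statistics.mean(xbars), 2))   # should be near mu = 1
print(round(statistics.stdev(xbars), 2))  # should be near sigma/sqrt(25) = 0.2
```

Because the samples have size n = 25, the simulated xbar values should have mean near mu = 1 and standard deviation near sigma/sqrt(25) = 0.2, matching statements 1 and 2.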

##### How big is "sufficiently large"?

When the population itself is normal, the sampling distribution of xbar is exactly normal for any sample size. When the population is not normal, the sampling distribution becomes more nearly normal as the sample size increases. Some textbooks state that the sampling distribution of the sample mean is approximately normal whenever the sample size is at least 30; as you will see in this lab, this is often untrue. The most honest answer to "how large does the sample size need to be for the sampling distribution to be approximately normal?" is "it depends". For many populations that arise in practice, a sample size of about 30 or more is large enough for approximate normality of the sampling distribution. You should, however, be aware of the characteristics of a population that lead to exceptions to this "30 is sufficiently large" rule, and the lab will guide you toward discovering them.

##### The Central Limit Theorem in S-PLUS

The S-PLUS script `clt.ssc` contains several functions for graphing sampling distributions. We will use three of these functions.

- `gskew(n,mu,sd,skew)` makes a graph of the sampling distribution of xbar for sample size `n`, where the mean, standard deviation, and skewness coefficient of the unimodal skewed population are `mu`, `sd`, and `skew`, respectively.
- `gbimod(n,mu,sd,d)` makes a graph of the sampling distribution of xbar for sample size `n`, where the mean and standard deviation of the bimodal symmetric population are `mu` and `sd`, respectively. The argument `d` is the distance from `mu` to each mode; note that `d` cannot be greater than `sd`.
- `gbinom(n,p,low,high,scale)` graphs the binomial distribution, which can be viewed as the sampling distribution of the sample mean for a population containing only zeros and ones.
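
The connection behind `gbinom`: if the population consists only of zeros and ones, with proportion `p` of ones, the sum of a sample of size `n` has a Binomial(n, p) distribution, so the sample mean sum/n takes the values 0, 1/n, ..., 1 with binomial probabilities. A quick check by simulation, written in Python rather than S-PLUS and not part of the lab script (the values n = 10 and p = 0.2 are arbitrary):

```python
import math
import random

# Illustration only (not part of the lab): the sample mean of n draws from a
# 0/1 population is (number of ones)/n, and the number of ones is Binomial(n, p).
random.seed(2)
n, p, reps = 10, 0.2, 50000

counts = [0] * (n + 1)
for _ in range(reps):
    ones = sum(1 for _ in range(n) if random.random() < p)
    counts[ones] += 1

# Exact binomial probability that the sample has exactly 2 ones (xbar = 0.2).
exact = math.comb(n, 2) * p**2 * (1 - p) ** (n - 2)
simulated = counts[2] / reps
print(round(exact, 3), round(simulated, 3))
```

The two printed values should agree to within simulation error.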

S-PLUS help is available in this on-line guide.

Note that you can use the mouse to highlight a command from Netscape, switch over to S-PLUS, and paste the command into the Commands Window. This can save on typing. Also, you may use the arrow keys to retrieve and edit previous commands.

#### In-class Activities

1. Open a Commands Window.
2. Load the script `clt.ssc` by following these steps.
   1. Click on the `clt.ssc` link above.
   2. Save the file onto the Desktop.
   3. Switch over to S-PLUS.
   4. Under the File menu, select Open.
   5. Open the file `clt.ssc`. You may need to change the "Look in" box to Desktop and the "File type" box to either all files or *.ssc files. This opens a Script Window.
   6. Under the Script menu, choose Run. This loads the script's functions into S-PLUS.
   7. Close the Script Window by clicking the x-button in its upper right corner.
3. Make several graphs of unimodal populations with mean 100, standard deviation 10, and varying skewness coefficients. Get a feel for how the skewness coefficient affects the shape of the distribution. The green curve is the actual sampling distribution. The red curve is a normal curve with the same mean and standard deviation as the sampling distribution for the purpose of comparison. This code makes a graph for a single example.
```
> gskew(1,100,10,0.5)
```
This code will draw graphs for several values simultaneously.
```
> for(skew in seq(-4,4,0.5)){gskew(1,100,10,skew)}
```
4. Now, we will see how the sampling distribution changes as the sample size increases for a population with a mild amount of skewness.
```
> for(n in c(1,2,5,10,15,20,25,30,50,100)){gskew(n,100,10,1)}
```

In the previous graphs, how do the center (mean), spread (standard deviation), and skewness change as sample size increases?

5. We will now do a similar exercise for an example with more skewness.
```
> for(n in c(1,2,5,10,15,20,25,30,50,100)){gskew(n,100,10,-5)}
```
In the previous graphs, how do the center (mean), spread (standard deviation), and skewness change as sample size increases?
6. Consider now populations that are not skewed but have nonnormal bimodal shapes. Get a feel for how the distance `d` from the mean to each mode affects the shape of the distribution. This code makes a graph for a single example.
```
> gbimod(1,100,10,8)
```
This code will draw graphs for several values simultaneously.
```
> for(d in seq(8.1,9.9,0.2)){gbimod(1,100,10,d)}
```
7. Now consider what happens as the sample size increases when there is a great deal of space between the modes.
```
> for(n in c(1,2,5,10,15,20,25,30,50,100)){gbimod(n,100,10,9.2)}
```

In the previous graphs, how do the center (mean), spread (standard deviation), and skewness change as sample size increases?

8. For a sample size of 30, will the sampling distribution of xbar be closer to symmetric and normal if the population is symmetric but nonnormal, or if the population is strongly skewed?

#### Homework Assignment

Load the script `clt.ssc` into S-PLUS and answer the questions below. You should write your answers on this form and turn it in to your lab instructor by the due date.

Further S-PLUS help is available in this on-line guide.

1. Draw a graph of a skewed population with mean 100, standard deviation 10, and skewness coefficient 3 and the sampling distribution of xbar for a sample size of 4.
```
> for(n in c(1,4)){gskew(n,100,10,3)}
```

Find the mean and the standard deviation of the sampling distribution of xbar.

Find the probability that xbar exceeds 110, assuming the sampling distribution is normal, by finding the area under the appropriate normal curve using `pnorm` or the normal table in your book. The area to the right of 110 under the green curve is the actual probability that the sample mean exceeds 110 when the population is skewed. The area to the right of 110 under the red curve is the probability calculated under the normality assumption.

Based on the graph, is the area under the normal curve too small, too large, or just about right?

2. Complete a similar problem for a larger sample size, 400.
```
> for(n in c(1,400)){gskew(n,100,10,3)}
```

Find the mean and the standard deviation of the sampling distribution of xbar.

Find the probability that xbar exceeds 101, assuming the sampling distribution is normal, by finding the area under the appropriate normal curve using `pnorm` or the normal table in your book. The area to the right of 101 under the green curve is the true probability that the sample mean exceeds 101 when the population is skewed. The area to the right of 101 under the red curve is the probability calculated under the normality assumption.

Based on the graph, is the area under the normal curve too small, too large, or just about right?

3. Consider the previous two problems. Would you conclude that the normal approximation to the sampling distribution of the sample mean from a skewed population is more accurate when the sample size is large? If so, why? Justify your response by referring to the results of the previous two problems.

4. Your textbook says the normal approximation to the binomial distribution will be fairly accurate when `np` > 5 and `n(1-p)` > 5.

If `p = 0.01`, at least how large should the sample be according to the rule?

If `p = 0.2`, at least how large should the sample be according to the rule?

If `p = 0.5`, at least how large should the sample be according to the rule?

5. Draw graphs of binomial distributions with `n` equal to 30 and `p` equal to 0.01, 0.2, 0.5, and 0.9.

```
> for(p in c(0.01,0.2,0.5,0.9)){gbinom(30,p)}
```

Which statement is more correct?

A. As long as `n` is at least 30, the binomial distribution is well approximated by a normal curve.

B. When `p` is close to 0 or 1, `n` must be larger for the normal approximation to be good than when `p` is close to 0.5.