Math 225

Introduction to Biostatistics

Exploratory Data Analysis I: Histograms, Means, and Medians

Prerequisites

This lab assumes that you already know how to:

log in to the system computers;
use a browser to find the course Web page;
start the S-PLUS software;
and move back and forth between the browser and S-PLUS.

Technical Objectives

This lab will teach you to:

enter data directly into S-PLUS;
enter data into a simple text editor to read into S-PLUS;
read in a prepared data set from the course Web page;
use S-PLUS to draw histograms;
use S-PLUS to calculate means and medians.

Conceptual Objectives

In this lab you should learn to:

estimate the mean and the median of a distribution from a histogram;
understand the qualitative differences between the mean and the median;
use a histogram to describe the shape of a distribution;
use a histogram to identify skewness in a distribution.

In-class Activities

Entering data directly into S-PLUS.
The following data is from Exercise 2-4 on page 35 of the textbook and represents the time until death in hours for thirteen sheep that were fed a toxic weed as part of an experiment.
```
44 27 24 24 36 36 44 44 120 29 36 36 36
```
Follow these steps to create a variable named deathTime with this data.
1. Open a Commands Window by clicking on the button with the ">x" symbol if you do not already have one open.
2. In the Commands Window, create a variable called deathTime using the scan function. You can put spaces or single carriage returns between numbers. A carriage return on a blank line ends the input. See the example below. The > prompt and the labels 1:, 6:, and 11: are printed by S-PLUS; you type everything else.
```
> deathTime <- scan()
1: 44 27 24 24 36
6: 36 44 44 120 29
11: 36 36 36
>
```
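If you would rather not type at the scan prompts, the same variable can be built in one command with the c function, which combines its arguments into a single vector. This is only a sketch of an alternative, not a step the lab requires. Type the commands without the leading > prompt.

```r
# Combine the thirteen survival times (in hours) into one vector.
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Check that all thirteen values were entered.
length(deathTime)
```

If the length is not 13, a value was skipped or typed twice, and it is easiest to retype the whole command.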
Calculating means and medians in the Commands Window.
Use the mean and median functions to calculate the mean and the median of deathTime.
```
> mean(deathTime)
[1] 41.23077
> median(deathTime)
[1] 36
```
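To see what these two functions do, you can reproduce both results by hand. The sketch below is an illustration rather than part of the lab; the names n, handMean, sorted, and handMedian are made up for this example. The middle-position shortcut for the median works here because the number of values is odd; with an even number of values, the median is the average of the two middle values.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# The mean is the sum of the values divided by how many there are.
n <- length(deathTime)
handMean <- sum(deathTime) / n

# The median is the middle value after sorting; with n = 13 values,
# the middle one is in position (n + 1) / 2 = 7.
sorted <- sort(deathTime)
handMedian <- sorted[(n + 1) / 2]

handMean      # same value as mean(deathTime)
handMedian    # same value as median(deathTime)
```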
Creating a file and reading the file into S-PLUS.
Exercise 2-5 on page 35 contains ten columns of ten cholinesterase indices. Assume that the first column contains measurements from men and that the second column contains measurements from women. Ignore the final eight columns of data. You can enter this data as two variables, index and sex, in a file to read into S-PLUS by following these steps.
1. Click on the Start Button and select Programs:Accessories:NotePad.
2. Enter the data into the file including a header row with the variable names.
```
index sex
2.29  male
2.67  male
3.09  male
.
.
.
1.82  male
1.95  female
1.75  female
.
.
.
1.06  female
```
3. Save this file to the Desktop naming the file chol by selecting Save As... from the File menu.
4. Import the data into S-PLUS by selecting Import Data from the File menu.
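The menu steps above also have a command-line counterpart, the read.table function. The sketch below rests on some assumptions: so that it can run on its own, it writes only the seven rows visible in the listing above to a temporary file, and the name cholFile is made up for this example. With your own file you would skip the cat step and give read.table the location where you saved chol.

```r
# Write the seven rows shown in the handout to a temporary file.
# (Your real file has all twenty rows; cholFile is a made-up name.)
# The final empty string ends the last line with a newline.
cholFile <- tempfile()
cat("index sex",
    "2.29 male", "2.67 male", "3.09 male", "1.82 male",
    "1.95 female", "1.75 female", "1.06 female",
    "",
    file = cholFile, sep = "\n")

# header = T says that the first row holds the variable names.
chol <- read.table(cholFile, header = T)

# Confirm the variable names and the number of rows read.
names(chol)
nrow(chol)
```

Checking the names and the row count right after reading a file is a quick way to catch a missing header row or a skipped line.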
Calculating means and medians of variables in a data set.
To refer to variables in a data set by name, you need to attach the data set.
```
> attach(chol)
```
You can find the mean of all the index measurements.
```
> mean(index)
[1] 1.91
```
You can find the mean of the index measurements separately for males and females.
```
> mean(index[sex=="male"])
[1] 2.239
> mean(index[sex=="female"])
[1] 1.581
```
The square brackets select the subset of the index variable for which the logical statement inside is true.
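It can help to look at the logical statement by itself before using it inside the brackets. This sketch uses the sheep data from earlier in the lab; the cutoff of 40 hours and the name longer are chosen only for illustration.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# A comparison gives one T or F for each of the thirteen values.
longer <- deathTime > 40
longer

# Square brackets keep only the values where the condition is T.
deathTime[longer]

# The two steps are usually combined, as in mean(index[sex=="male"]).
mean(deathTime[deathTime > 40])
```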
Reading in data from the Web page.
Find the HARVEST data set on the course Web page and save it to the Desktop. Import the data into S-PLUS.
Using S-PLUS to draw a histogram.
A histogram is a bar graph for displaying the distribution of a single quantitative variable.
Make a histogram in S-PLUS following these steps.
1. Click on the "2D Plots" button, which is on the ruler and has a small picture with a bar graph and a jagged line. This opens up the Plots2D palette.
2. Click the histogram button, which has a picture of a little histogram. A graphics window and a dialog box will open.
3. Click the little arrow next to "Data Set" and then click on the name of the data frame where your variable is.
4. Click on the little arrow next to "x Column(s)" and then click on the name of the variable.
5. Finally, click on the OK button.
Often, the default choice of the number of bars is not good. You can follow these steps to make a better graph.
1. Complete the first four steps above.
2. Click on the "Options" tab.
3. Change "Number of bars" from "Auto" to a number, such as 15.
4. If the variable is integer valued, select "Integer" instead of "Continuous".
5. Click on the OK button.
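Histograms can also be drawn from the Commands Window with the hist function. This is a minimal sketch using the sheep data from earlier in the lab. The value 5 for nclass is only an illustration, and the software treats it as a suggestion, so the number of bars drawn may differ slightly.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Draw a histogram with the default number of bars.
hist(deathTime)

# Ask for roughly 5 bars instead; nclass is a suggestion, so the
# software may adjust it to get convenient bar boundaries.
hist(deathTime, nclass = 5)
```

Trying two or three values of nclass is a reasonable way to check that the shape you see does not depend on one particular choice of bars.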
Interpreting histograms.
The center of a histogram may be described in two ways. The median is the location that divides the shaded area of the histogram in half. The mean is the location at which the histogram would balance if it were made from a uniform solid material. If a histogram looks similar to its mirror image, we say that the histogram is symmetrical. If the left half of the data is more spread out than the right half, we say that the distribution is skewed to the left. If the right half of the data is more spread out than the left half, we say that the distribution is skewed to the right.
Make histograms of the variables SBPCB, DBPCB, and HRCB. Which is most symmetrical? Which is skewed to the right? Which is skewed to the left?
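The balance-point description explains why the mean and the median separate when a distribution is skewed: a few values far out in one tail move the balance point but not the halfway point. The sketch below illustrates this with the sheep data from earlier in the lab, where a single survival time of 120 hours forms a long right tail. The name trimmed is made up for this example, and dropping the value is done only to show its effect, not as a recommended analysis.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# With the one very long survival time included, the long right
# tail pulls the mean above the median.
mean(deathTime) - median(deathTime)

# Drop the value 120 and compare again: the median stays at 36,
# while the mean falls to about 34.67.
trimmed <- deathTime[deathTime < 120]
mean(trimmed)
median(trimmed)
```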
Calculating means and medians when there are missing values.
The HARVEST data set includes many missing values, because not every individual was measured at every time point in the study, and for some individuals, smoking or exercise information was not collected. Missing data is represented by the code "NA" in S-PLUS. If you ask S-PLUS to calculate the mean or median of a variable that includes missing data, it gives "NA" as the result. You can override this behavior with the option na.rm=T, which removes missing values before the calculation.
```
> attach(harvest)
> mean(HRCB)
[1] NA
> mean(HRCB,na.rm=T)
[1] 74.97
> median(HRCB,na.rm=T)
[1] 74.33
```
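You can try the same behavior on a small example without loading HARVEST. In this sketch a single NA is appended to the sheep data from earlier in the lab; the missing value and the name withMissing are made up for this illustration.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Append one missing value to stand in for an unmeasured animal.
# (The extra NA is made up for this illustration.)
withMissing <- c(deathTime, NA)

# Count the missing values before summarizing.
sum(is.na(withMissing))

# Without na.rm=T the result is NA; with it, the NA is dropped first.
mean(withMissing)
mean(withMissing, na.rm = T)
median(withMissing, na.rm = T)
```

Counting the missing values first tells you how many observations the summary is actually based on.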

Homework Assignment

Print this answer sheet to record your answers.

Obtain the heights (in inches) of fifteen male and fifteen female students at Duquesne. Create a text file (using NotePad or another text editor) with this information in a format ready to read into S-PLUS. The file should begin something like this.
```
height sex
70     male
72     male
66     female
```
Read the data into S-PLUS and use S-PLUS to calculate the mean and median heights of all thirty individuals combined as well as for the males and females separately.
Load the cereal data set into S-PLUS. Find a variable that is skewed to the right by plotting its histogram. Calculate the mean and median of this variable. Is the mean larger than the median as expected?

Further S-PLUS help is available in this on-line guide.

Last modified: January 4, 2001

Bret Larget, larget@mathcs.duq.edu