Notice that each variable is standardized, and that correlation is (almost) the average product of the standardized values for each pair of data values x and y. For every point where x and y are both above their means or below their means, there is a positive contribution to the sum. If one is above its mean while the other is below, there is a negative contribution to the sum. Thus, r will be positive if large x and large y tend to go together and small x and small y tend to go together. In contrast, r will be negative if small x and large y or large x and small y tend to go together.
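As a minimal sketch of this calculation, the Python function below standardizes each variable and then averages the products. It assumes the usual n - 1 divisor, which is why the result is only "almost" an average. The function name `correlation` and the data values are made up for illustration and do not come from the textbook.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation as the (almost) average product of standardized values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # A product is positive when x and y lie on the same side of their means,
    # and negative when they lie on opposite sides.
    products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = correlation(x, y)
print(f"r = {r:.2f}")
```

Reversing the sign of every y value flips the sign of each product, and therefore the sign of r.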

The correlation coefficient satisfies these properties:

- -1 <= r <= 1
- r = -1 only if data falls exactly on a line with negative slope
- r = 1 only if data falls exactly on a line with positive slope
- If r is close to -1 or 1, the data is tightly clustered around a line.
- The closer r is to 0, the less clustered the data is around a line.
- There may be a strong *nonlinear* relationship between two variables with r close to 0. r measures only the linear relationship.
- Computation of r does not depend on the units chosen for each variable. (It only depends on the standardized values of each variable.) Thus, r is a *unitless* expression.
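A few of these properties can be checked numerically. The sketch below uses made-up data and a hypothetical helper named `correlation`. It shows r = 1 for points exactly on a line with positive slope, r = 0 for an exact but nonlinear (quadratic) relationship, and no change in r when one variable is rescaled to different units.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, for illustration only.
x = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * v + 1 for v in x]    # exactly on a line with positive slope
y_curve = [v ** 2 for v in x]      # exact relationship, but not linear

r_line = correlation(x, y_line)    # equals 1 up to rounding
r_curve = correlation(x, y_curve)  # equals 0: no linear association

# Changing units (an arbitrary rescaling of x) leaves r unchanged.
x_rescaled = [2.54 * v for v in x]
r_rescaled = correlation(x_rescaled, y_line)

print(r_line, r_curve, r_rescaled)
```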

The regression line has the equation

y = a + bx

where a is the y-intercept and b is the slope. The textbook gives rather messy formulas for calculating a and b from data. Simpler equations are

b = r (s_{y} / s_{x})

a = ȳ - b x̄

where x̄ and ȳ are the sample means and s_{x} and s_{y} are the sample standard deviations. In practice, a computer should be used to fit a regression line.

Coefficients:

| | Value | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -60.6216 | 40.4924 | -1.4971 | 0.1602 |
| height | 0.7547 | 0.2289 | 3.2966 | 0.0064 |

In this output, a = -60.6216 and b = 0.7547.

The mean and standard deviation of the heights are 176.5 cm and 10.9 cm, respectively. The mean and standard deviation of the weights are 72.6 kg and 11.9 kg, respectively. The correlation between the two variables is r = 0.69. Use these values in the formulas for a and b to check the output of the computer package (within roundoff error).
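One way to carry out this check is sketched below in Python, using only the summary statistics quoted above. The variable names are chosen for illustration.

```python
# Summary statistics quoted in the text.
mean_height, sd_height = 176.5, 10.9   # cm
mean_weight, sd_weight = 72.6, 11.9    # kg
r = 0.69

# Slope: b = r * (s_y / s_x), with x = height and y = weight.
b = r * sd_weight / sd_height
# Intercept: the line passes through the point of means.
a = mean_weight - b * mean_height

print(f"b = {b:.4f}")
print(f"a = {a:.2f}")
```

The results are roughly b = 0.753 and a = -60.4. They differ slightly from the computer output because the means, standard deviations, and r have been rounded.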

**Problem 1:**

Use the OLS line to predict the weight of a young man who is 180 cm tall.

*Solution:*

y = a + bx = -60.6216 + 0.7547(180) = 75.2 kg

Note that the individual's height is (180 - 176.5) / 10.9 = 0.32 standard deviations above the mean. We predict his weight to be (0.32)r standard deviations above the mean, or 72.6 + (0.32)(0.69)(11.9) = 75.2 kg.
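The two routes to the same prediction can be compared in a short Python sketch. The function names are hypothetical, and the numbers are the fitted coefficients and summary statistics from the text.

```python
# Fitted coefficients from the computer output.
a, b = -60.6216, 0.7547
# Summary statistics quoted in the text.
mean_height, sd_height = 176.5, 10.9   # cm
mean_weight, sd_weight = 72.6, 11.9    # kg
r = 0.69

def predict_from_line(height_cm):
    """Prediction from the fitted line y = a + bx."""
    return a + b * height_cm

def predict_from_standard_units(height_cm):
    """Prediction that is r times as many SDs from the mean as the height is."""
    z_height = (height_cm - mean_height) / sd_height
    return mean_weight + r * z_height * sd_weight

line_prediction = predict_from_line(180)
z_prediction = predict_from_standard_units(180)
print(f"{line_prediction:.1f} kg, {z_prediction:.1f} kg")
```

Both predictions round to 75.2 kg. The small difference in later decimal places comes from rounding in the summary statistics.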

**Problem 2:**

Give a 95% confidence interval for the population slope.

*Solution:*

We assume here that in the population of interest, the mean weight of all young men of the same height is well described by a linear function of the height. The slope of our OLS regression line is an estimate of the slope of this population regression line. We construct a confidence interval in the same way as usual.

(estimate) +/- (reliability coefficient)(standard error)

Since the standard error is estimated from data, we use a reliability coefficient from a t-distribution. For regression, the number of degrees of freedom is n-2, since we need to estimate two parameters to find our OLS line. The general formula is

b +/- t^{*}SE(b)

We get the values of b and its SE from the computer output and the value of t^{*} from a t-distribution with 12 degrees of freedom.

0.7547 +/- (2.1788)(0.2289)

or

0.7547 +/- 0.4987
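The arithmetic can be reproduced with a few lines of Python. This is only a sketch: the critical value 2.1788 is taken from the text rather than computed, and the variable names are illustrative.

```python
b = 0.7547        # estimated slope, from the computer output
se_b = 0.2289     # standard error of the slope, from the computer output
t_star = 2.1788   # reliability coefficient quoted in the text (12 df, 95%)

margin = t_star * se_b
lower, upper = b - margin, b + margin

print(f"margin of error = {margin:.4f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The interval runs from about 0.256 to about 1.253 and does not contain 0.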

**Problem 3:**

Test the hypothesis that the slope of the population regression line is 0. (In other words, height has no predictive value for weight.)

*Solution:*

The null hypothesis is that the slope is 0. Under this hypothesis, standardize the estimate of the slope and compare it to a t-distribution with 12 degrees of freedom.

t = (0.7547 - 0) / 0.2289 = 3.297

This agrees with the t statistic included in the computer output. The (2-sided) p-value is reported as 0.0064, which is twice the area to the right of 3.297 under a t-distribution with 12 degrees of freedom. A p-value this low indicates strong evidence against the null hypothesis.
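The test statistic and p-value can be checked without a statistics package. The sketch below integrates the t density numerically with Simpson's rule. This is an approach chosen here for illustration, not the method the computer package uses, and the function names are hypothetical.

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """Twice the upper-tail area, via Simpson's rule on [0, |t_stat|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    area_0_to_t = total * h / 3
    # The density is symmetric, so the area to the right of 0 is 0.5.
    return 2 * (0.5 - area_0_to_t)

b, se_b, df = 0.7547, 0.2289, 12
t_stat = (b - 0) / se_b
p_value = two_sided_p_value(t_stat, df)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```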

**Always plot your data.**
It is not always true that the relationship
between two variables is well described by a straight line.
A computer package will fit a line as best it can, whether it makes
sense to or not.
You need to check by plotting your data.

**Extrapolation is dangerous.**
The relationship between two variables might be linear in the range
of the data you have collected, but nonlinear outside of this range.
The data in the previous example came from a sample of young men
whose heights ranged from 161 cm to 205 cm.
We shouldn't expect the same relationship to be good for people
of different heights (or different age or gender, for that matter).
There is nothing but common sense
to prevent one from predicting the weight of a three-year-old
boy who is 100 cm tall to be 14.8 kg by the equation,
but we should be very skeptical of the accuracy of such a prediction,
and not be misled by predictions that are clearly nonsensical.
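One simple safeguard is to have the prediction report whether the input lies inside the range of the observed data. The Python sketch below uses the fitted coefficients and the height range quoted above. The function name and the returned flag are illustrative choices, not part of any package.

```python
# Fitted coefficients from the computer output.
a, b = -60.6216, 0.7547
# Range of heights in the sample, as stated in the text.
MIN_HEIGHT_CM, MAX_HEIGHT_CM = 161, 205

def predict_weight(height_cm):
    """Return (predicted weight in kg, True if height is within the observed range)."""
    within_range = MIN_HEIGHT_CM <= height_cm <= MAX_HEIGHT_CM
    return a + b * height_cm, within_range

weight, within_range = predict_weight(100)
print(f"predicted weight: {weight:.1f} kg, within observed range: {within_range}")
```

For a height of 100 cm the line still returns a number (about 14.8 kg), but the flag is False, which signals that the prediction is an extrapolation.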

**Association is not causation.**
The mere fact that two variables are related in a linear fashion
with a slope that is significantly non-zero does not in itself
imply that one variable is the direct cause of the other.
Sales of ice cream are positively correlated with the incidence of malaria.
This is not evidence that ice cream consumption causes malaria,
but merely an artifact
of the dependence both variables have on warm weather.

Last modified: April 17, 1996

Bret Larget, larget@mathcs.duq.edu