The discussion of the construction, evaluation and application of psychological tests is beyond the scope of this course. However, issues of the reliability and validity of a psychological test are parallel to concerns that one may have about any measure. A psychological test is simply an approach to measurement often used in psychology.

So here we provide a brief overview of some of the traditional ways of thinking about the reliability and validity of tests. Be warned, though, that when we discuss the threats to validity in experimentation we shall be using a rather different conceptual framework.

I. Test construction: Introduction and Overview

II. Reliability

III. Validity

IV. Item Analysis

V. Test Interpretation


I. Test construction: Introduction and Overview

A. Definition of Psychological Tests:

"an objective and standardized measure of a sample of behavior."

B. Standards of Test Construction and Test Use:

A good test should be reliable and valid. Concerns relating to standards include user qualification, security of test content, confidentiality of test results, and the prevention of the misuse of tests and results.

C. Test Characteristics and Response sets.

Characteristics include

i. a test of maximum performance (e.g., achievement test) which tells us what a person can do.

ii. a test of typical performance (e.g., personality test) which tells us what a person usually does.

iii. a speed test, in which response rate is assessed.

iv. a mastery test, which assesses whether or not the person can attain a pre-specified mastery level of performance.

Response sets include: social desirability (giving responses that are perceived to be socially acceptable), acquiescence (agreeing or disagreeing with everything), and deviation (giving unusual or uncommon responses).

All of the above can threaten the validity of a given set of results.


II. Reliability


A. Classical test theory states:

a test is reliable a) to the degree that it is free from error and provides information about examinees’ "true" test scores and b) to the degree that it provides repeatable, consistent results.

B. Methods of estimating reliability

[we did not go over this as it involves calculating a reliability coefficient--a correlation coefficient--which we have not discussed yet]

C. Standard error of measurement.

[we skipped over, as we are not ready to discuss yet]

D. Factors affecting the reliability coefficient

Any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests where examinees’ scores are affected by guessing (e.g., true-false) have lowered reliability coefficients.
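The effect of test length can be illustrated with the Spearman-Brown formula. These notes do not cover that formula; it is brought in here only as a sketch, and it assumes that any added items are parallel to the existing ones. The starting reliability of 0.60 is a made-up value.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


original = 0.60                          # hypothetical reliability of the current test
halved = spearman_brown(original, 0.5)   # same test cut to half its length
doubled = spearman_brown(original, 2.0)  # same test doubled in length

print(f"half length:   {halved:.3f}")
print(f"original:      {original:.3f}")
print(f"double length: {doubled:.3f}")
```

Under these assumptions the projected reliability falls when the test is shortened and rises when it is lengthened, which matches the claim that short tests are less reliable than long ones.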


III. Validity

Three major categories:

content, criterion-related, and construct validity

1) content validity:

A test has content validity if it measures knowledge of the content domain it was designed to measure. Put another way, content validity concerns how adequately and representatively the test items sample the content area to be measured. For example, a comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra). Content validity is primarily an issue for educational tests, certain industrial tests, and other tests of content knowledge like the Psychology Licensing Exam.

Expert judgement (not statistics) is the primary method used to determine whether a test has content validity. Nevertheless, the test should have a high correlation with other tests that purport to sample the same content domain.

This is different from face validity: a test has face validity when it appears valid to the examinees who take it, the personnel who administer it, and other untrained observers. Face validity is not validity in the technical sense; i.e., just because a test has face validity does not mean it will be valid in the technical sense of the word. "just cause it looks valid doesn’t mean it is."


2) criterion-related validity:

Criterion-related validity is a concern for tests that are designed to predict someone’s status on an external criterion measure. A test has criterion-related validity if it is useful for predicting a person’s behavior in a specified situation.

2a) Concurrent vs. predictive validation:

First, the term "validation" refers to the procedures used to determine how valid a predictor is. There are two types: concurrent and predictive.

In concurrent validation, the predictor and criterion data are collected at or about the same time. This kind of validation is appropriate for tests designed to assess a person’s current criterion status. It is well suited to diagnostic screening tests.

In predictive validation, the predictor scores are collected first and the criterion data are collected at some later point. This is appropriate for tests designed to assess a person’s future status on a criterion.

2b) Standard Error of estimate

The standard error of estimate (s_est) is used to estimate the range in which a person’s true score on a criterion is likely to fall, given his/her score as estimated by a predictor.
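These notes do not give a formula for the standard error of estimate. One commonly used form, assumed here, is s_est = s_y * sqrt(1 - r^2), where s_y is the standard deviation of the criterion scores and r is the criterion-related validity coefficient. The sketch below uses made-up values and builds a band of one s_est on either side of the predicted criterion score.

```python
from math import sqrt


def standard_error_of_estimate(sd_criterion: float, validity: float) -> float:
    """s_est = s_y * sqrt(1 - r^2), the assumed textbook form."""
    return sd_criterion * sqrt(1 - validity ** 2)


# Hypothetical values, chosen only for illustration.
SD_CRITERION = 10.0     # standard deviation of criterion scores (s_y)
VALIDITY = 0.6          # criterion-related validity coefficient (r)
PREDICTED_SCORE = 50.0  # criterion score predicted from the predictor

s_est = standard_error_of_estimate(SD_CRITERION, VALIDITY)
band = (PREDICTED_SCORE - s_est, PREDICTED_SCORE + s_est)

print(f"s_est = {s_est:.2f}")
print(f"band of +/- 1 s_est: {band[0]:.1f} to {band[1]:.1f}")
```

With this formula, a perfectly valid predictor (r = 1) gives s_est = 0, and a predictor with no validity (r = 0) gives s_est equal to the criterion standard deviation.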

2c) Decision-Making

In many cases when using predictor tests, the goal is to predict whether or not a person will meet or exceed a minimum standard of criterion performance — the criterion cutoff point. When a predictor is to be used in this manner, the goal of the validation study is to set an optimal predictor cutoff score; an examinee who scores at or above the predictor cutoff is predicted to score at or above the criterion cutoff.

We then get

True positives (valid acceptances): accurately identified by the predictor as meeting the criterion standard.

False positives (false acceptances): incorrectly identified by the predictor as meeting the criterion standard.

True negatives (valid rejections): accurately identified by the predictor as not meeting the criterion standard.

False negatives (invalid rejections): incorrectly identified by the predictor as not meeting the criterion standard; these examinees meet the standard even though the predictor indicated they would not.
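The four outcomes can be sketched as a small classification routine. The scores, the cutoff values, and the function name `classify` are all hypothetical; the only rule taken from the text is that scoring at or above a cutoff counts as meeting it.

```python
from collections import Counter


def classify(predictor: float, criterion: float,
             predictor_cutoff: float, criterion_cutoff: float) -> str:
    """Label one examinee by comparing both scores with their cutoffs."""
    predicted_pass = predictor >= predictor_cutoff
    actual_pass = criterion >= criterion_cutoff
    if predicted_pass and actual_pass:
        return "true positive"
    if predicted_pass:
        return "false positive"
    if actual_pass:
        return "false negative"
    return "true negative"


# Hypothetical (predictor score, criterion score) pairs.
examinees = [(72, 80), (65, 55), (40, 75), (35, 30), (58, 62)]
PREDICTOR_CUTOFF = 60
CRITERION_CUTOFF = 60

counts = Counter(
    classify(p, c, PREDICTOR_CUTOFF, CRITERION_CUTOFF) for p, c in examinees
)
for outcome, n in counts.items():
    print(f"{outcome}: {n}")
```

Raising or lowering `PREDICTOR_CUTOFF` and rerunning shows how the balance between false positives and false negatives shifts, which is the trade-off a validation study weighs when setting the predictor cutoff.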

2d) Factors affecting the criterion-related validity coefficient:

This is about factors that potentially affect the magnitude of the criterion-related validity coefficient. We will not go into this as it relates to correlation.

3) Construct validity

A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity.

Two methods of establishing a test’s construct validity are convergent/divergent validation and factor analysis.

3a) Convergent/divergent validation

A test has convergent validity if it has a high correlation with another test that measures the same construct. By contrast, a test’s divergent validity is demonstrated through a low correlation with a test that measures a different construct. Note this is the only case when a low correlation coefficient (between two tests that measure different traits) provides evidence of high validity.
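As a rough illustration, the sketch below computes a Pearson correlation coefficient (a statistic these notes defer until later) between a new test and two other tests. All scores and test names are invented.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for the same six examinees on three tests.
new_anxiety_test = [10, 14, 18, 22, 26, 30]
established_anxiety_test = [12, 15, 17, 24, 25, 31]  # same construct
vocabulary_test = [20, 31, 18, 29, 22, 26]           # different construct

convergent_r = pearson_r(new_anxiety_test, established_anxiety_test)
divergent_r = pearson_r(new_anxiety_test, vocabulary_test)

print(f"convergent r = {convergent_r:.2f}")
print(f"divergent r  = {divergent_r:.2f}")
```

In this made-up data the new test correlates highly with the test of the same construct and only weakly with the test of a different construct, which is the pattern that would count as evidence of convergent and divergent validity.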

The multitrait-multimethod matrix is one way to assess a test’s convergent and divergent validity. We will not get into this right now.

3b) Factor analysis

Factor analysis is a complex statistical procedure which is conducted for a variety of purposes, one of which is to assess the construct validity of a test or a number of tests. We will get there.

3c) Other methods of assessing construct validity:

We can assess the test’s internal consistency. That is, if a test has construct validity, scores on the individual test items should correlate highly with the total test score. This is evidence that the test is measuring a single construct.

We can also look at developmental changes. Tests measuring certain constructs can be shown to have construct validity if the scores on the tests show predictable developmental changes over time.

Finally, we can use experimental intervention. That is, if a test has construct validity, scores should change following an experimental manipulation, in the direction predicted by the theory underlying the construct.

4) Relationship between reliability and validity


If a test is unreliable, it cannot be valid.

For a test to be valid, it must be reliable.

However, just because a test is reliable does not mean it will be valid.


Reliability is a necessary but not sufficient condition for validity!

IV. Item Analysis

There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis is used to help "build" reliability and validity "into" the test from the start. Item analysis can be both qualitative and quantitative. The former focuses on issues related to the content of the test, e.g., content validity. The latter primarily includes measurement of item difficulty and item discrimination.

Item difficulty: an item’s difficulty level is usually measured in terms of the percentage of examinees who answer the item correctly. This percentage is referred to as the item difficulty index, or "p".
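A minimal sketch of the calculation, using an invented response matrix in which each row is one examinee and each column is one item (1 = correct, 0 = incorrect). Here p is reported as a proportion between 0 and 1 rather than as a percentage, which is an assumption about convention.

```python
def item_difficulty(responses: list[list[int]]) -> list[float]:
    """Proportion of examinees answering each item correctly (the "p" index)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[item] for row in responses) / n_examinees
        for item in range(n_items)
    ]


# Hypothetical data: 5 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]

p_values = item_difficulty(responses)
for item, p in enumerate(p_values, start=1):
    print(f"item {item}: p = {p:.2f}")
```

Note that a higher p means an easier item: in this data every examinee answers the first item correctly, while only one in five answers the third.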

Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest correlation with the total score are retained for the final version of the test. This would be appropriate when a test measures only one attribute and internal consistency is important.

Another way is a discrimination index (D).
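These notes do not define D. A common definition, assumed here, is the proportion of a high-scoring group answering the item correctly minus the proportion of a low-scoring group doing so. The sketch splits examinees into upper and lower halves by total score; other splits (such as the top and bottom 27%) are also used. The data and the function name are hypothetical.

```python
def discrimination_index(responses: list[list[int]], item: int,
                         group_fraction: float = 0.5) -> float:
    """D = p(upper group) - p(lower group) for one item.

    Groups are formed by ranking examinees on total score; group_fraction
    is the share of examinees placed in each group (an assumed convention).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, int(len(ranked) * group_fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


# Hypothetical data: 6 examinees (rows) by 3 items (columns), 1 = correct.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

d_values = [discrimination_index(responses, item) for item in range(3)]
for item, d in enumerate(d_values, start=1):
    print(f"item {item}: D = {d:.2f}")
```

Under this definition D ranges from -1 to 1. A value near 1 means high scorers get the item right and low scorers do not; a value near 0, as for the third item in this data, means the item does not separate the two groups.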

V. Interpretation of Test Scores

1. Norm-referenced interpretation:

Involves comparing an examinee’s score to the scores of others in a normative sample. It provides an indication of where the examinee stands in relation to others who have taken the test. Norm-referenced scores include developmental scores (e.g., mental age scores and grade equivalents), which indicate how far along the normal developmental path an individual has progressed, and within-group norms, which provide a comparison to the scores of other individuals whom the examinee most resembles.

2. Criterion-referenced interpretation:

Involves interpreting an examinee’s test score in terms of an external pre-established standard of performance.
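The contrast between the two kinds of interpretation can be sketched with one score read both ways. The normative sample, the examinee's score, and the mastery standard are all invented, and percentile rank is computed here as the percentage of the normative sample scoring below the examinee, which is only one of several definitions in use.

```python
def percentile_rank(score: float, normative_sample: list[float]) -> float:
    """Norm-referenced: percentage of the normative sample scoring below score."""
    below = sum(1 for s in normative_sample if s < score)
    return 100 * below / len(normative_sample)


def meets_standard(score: float, standard: float) -> bool:
    """Criterion-referenced: compare the score with a pre-established standard."""
    return score >= standard


# Hypothetical values, chosen only for illustration.
normative_sample = [52, 58, 61, 64, 67, 70, 73, 77, 82, 90]
examinee_score = 75
MASTERY_STANDARD = 80

rank = percentile_rank(examinee_score, normative_sample)
passed = meets_standard(examinee_score, MASTERY_STANDARD)

print(f"percentile rank: {rank:.0f}")
print(f"meets standard:  {passed}")
```

In this example the same score looks good relative to the normative sample yet falls short of the external standard, so the two interpretations answer different questions.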