The discussion of the construction, evaluation and application of psychological tests is beyond the scope of this course. However, issues of the reliability and validity of a psychological test are parallel to concerns that one may have about any measure. A psychological test is simply an approach to measurement often used in psychology.
So here we provide a brief overview of some of the traditional ways of thinking about the reliability and validity of tests. Be warned, though, that when we discuss the threats to validity in experimentation we shall be using a rather different conceptual framework.
A psychological test is "an objective and standardized measure of a sample of behavior."
A good test should be reliable and valid. Concerns relating to standards include user qualification, security of test content, confidentiality of test results, and the prevention of the misuse of tests and results.
Characteristics include
i. a test of maximum performance (e.g., achievement test) which tells us what a person can do.
ii. a test of typical performance (e.g., personality test) which tells us what a person usually does.
iii. a speed test, in which response rate is assessed.
iv. a mastery test, which assesses whether or not the person can attain a pre-specified mastery level of performance.
Response sets include: social desirability (giving responses that are perceived to be socially acceptable), acquiescence (agreeing or disagreeing with everything), and deviation (giving unusual or uncommon responses).
All of the above can threaten the validity of a given set of results.
A test is reliable a) to the degree that it is free from error and provides information about examinees' "true" test scores and b) to the degree that it provides repeatable, consistent results.
[we did not go over this as it involves calculating a reliability coefficient--a correlation coefficient--which we have not discussed yet]
[we skipped over, as we are not ready to discuss yet]
Any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests where examinees' scores are affected by guessing (e.g., true-false) have lowered reliability coefficients.
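The link between test length and reliability can be illustrated with the Spearman-Brown formula. These notes do not name a formula, so this is an assumed, standard choice, and it holds only if the added or removed items are comparable to the existing ones. The function name and the example values are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after multiplying test length by length_factor.

    Assumes the items added or removed are parallel to the existing items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical test with a reliability coefficient of 0.80.
halved = spearman_brown(0.80, 0.5)   # test cut to half its length
doubled = spearman_brown(0.80, 2.0)  # test doubled in length

print(f"half length:   {halved:.3f}")
print(f"double length: {doubled:.3f}")
```

Under these assumptions the shortened test has a lower predicted reliability than the original, and the lengthened test a higher one.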
Three major categories:
content, criterion-related, and construct validity
A test has content validity if it measures knowledge of the content domain it was designed to cover. Another way of saying this is that content validity concerns, primarily, how adequately and representatively the test items sample the content area to be measured. For example, a comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra). Content validity is primarily an issue for educational tests, certain industrial tests, and other tests of content knowledge like the Psychology Licensing Exam.
Expert judgement (not statistics) is the primary method used to determine whether a test has content validity. Nevertheless, the test should have a high correlation with other tests that purport to sample the same content domain.
This is different from face validity: face validity is when a test appears valid to examinees who take it, personnel who administer it, and other untrained observers. Face validity is not validity in the technical sense; i.e., just because a test has face validity does not mean it will be valid in the technical sense of the word. "Just because it looks valid doesn't mean it is."
Criterion-related validity is a concern for tests that are designed to predict someone's status on an external criterion measure. A test has criterion-related validity if it is useful for predicting a person's behavior in a specified situation.
First, the term "validation" refers to the procedures used to determine how valid a predictor is. There are two types.
In concurrent validation, the predictor and criterion data are collected at or about the same time. This kind of validation is appropriate for tests designed to assess a person's current criterion status. It is well suited to diagnostic screening tests, where the goal is to diagnose.
In predictive validation, the predictor scores are collected first and criterion data are collected at some later point. This is appropriate for tests designed to assess a person's future status on a criterion.
The standard error of estimate (s est) is used to estimate the range in which a person's true score on a criterion is likely to fall, given his/her score as estimated by a predictor.
In many cases when using predictor tests, the goal is to predict whether or not a person will meet or exceed a minimum standard of criterion performance: the criterion cutoff point. When a predictor is to be used in this manner, the goal of the validation study is to set an optimal predictor cutoff score; an examinee who scores at or above the predictor cutoff is predicted to score at or above the criterion cutoff.
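As a rough sketch of how such a range might be computed, the usual textbook formula is s est = (standard deviation of the criterion) x sqrt(1 - r^2), where r is the correlation between predictor and criterion. These notes do not give the formula, so treat it and the numbers below as assumptions. The function names are hypothetical.

```python
import math


def standard_error_of_estimate(sd_criterion: float, validity_coefficient: float) -> float:
    """s_est = SD of criterion * sqrt(1 - r^2), with r the predictor-criterion correlation."""
    return sd_criterion * math.sqrt(1.0 - validity_coefficient ** 2)


def prediction_band(predicted_score: float, s_est: float, z: float = 1.96) -> tuple:
    """Range of plus or minus z standard errors around the predicted criterion score."""
    return (predicted_score - z * s_est, predicted_score + z * s_est)


# Hypothetical values: criterion SD of 10, validity coefficient of 0.60,
# and a predicted criterion score of 50.
s_est = standard_error_of_estimate(10.0, 0.60)
low, high = prediction_band(50.0, s_est, z=1.0)

print(f"s_est = {s_est:.2f}; about 68% band: {low:.1f} to {high:.1f}")
```

The better the predictor (the larger r), the smaller s est and the narrower the band. With r = 0 the band is as wide as the criterion's own standard deviation.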
We then get four possible outcomes:
True positive (valid acceptance): accurately identified by the predictor as meeting the criterion standard.
False positive (false acceptance): incorrectly identified by the predictor as meeting the criterion standard.
True negative (valid rejection): accurately identified by the predictor as not meeting the criterion standard.
False negative (invalid rejection): meets the criterion standard, even though the predictor indicated s/he wouldn't.
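The four outcomes can be sketched in a few lines of code. The cutoff values and the examinee scores below are invented for illustration only, and the function name is hypothetical.

```python
from collections import Counter


def classify(predictor: float, criterion: float,
             predictor_cutoff: float, criterion_cutoff: float) -> str:
    """Label one examinee by comparing both scores with their cutoffs."""
    predicted_pass = predictor >= predictor_cutoff
    actual_pass = criterion >= criterion_cutoff
    if predicted_pass and actual_pass:
        return "true positive"
    if predicted_pass:
        return "false positive"
    if actual_pass:
        return "false negative"
    return "true negative"


# Hypothetical (predictor score, criterion score) pairs; both cutoffs set at 60.
examinees = [(72, 80), (65, 55), (40, 70), (35, 30), (90, 95)]
counts = Counter(classify(p, c, 60, 60) for p, c in examinees)

for outcome, n in counts.items():
    print(outcome, n)
```

Raising the predictor cutoff generally trades false positives for false negatives, which is why the validation study looks for an optimal cutoff rather than simply a strict one.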
This is about factors that potentially affect the magnitude of the criterion-related validity coefficient. We will not go into this as it relates to correlation.
A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity.
Two methods of establishing a test's construct validity are convergent/divergent validation and factor analysis.
A test has convergent validity if it has a high correlation with another test that measures the same construct. By contrast, a test's divergent validity is demonstrated through a low correlation with a test that measures a different construct. Note this is the only case when a low correlation coefficient (between two tests that measure different traits) provides evidence of high validity.
The multitrait-multimethod matrix is one way to assess a test's convergent and divergent validity. We will not get into this right now.
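A minimal sketch of the convergent/divergent comparison follows, using a hand-written Pearson correlation. All scores are made up, and the three "tests" (two anxiety scales and a vocabulary test) are hypothetical stand-ins for same-construct and different-construct measures.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for the same six examinees on three tests.
anxiety_scale_a = [10, 12, 15, 18, 20, 25]   # test being validated
anxiety_scale_b = [11, 13, 14, 19, 21, 24]   # same construct, different test
vocabulary_test = [30, 22, 28, 21, 29, 25]   # different construct

r_convergent = pearson_r(anxiety_scale_a, anxiety_scale_b)
r_divergent = pearson_r(anxiety_scale_a, vocabulary_test)

print(f"convergent r = {r_convergent:.2f}")
print(f"divergent  r = {r_divergent:.2f}")
```

With these invented data the first correlation is high and the second is close to zero, which is the pattern that would count as evidence of both convergent and divergent validity.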
Factor analysis is a complex statistical procedure which is conducted for a variety of purposes, one of which is to assess the construct validity of a test or a number of tests. We will get there.
We can also assess the test's internal consistency. That is, if a test has construct validity, scores on the individual test items should correlate highly with the total test score. This is evidence that the test is measuring a single construct.
Developmental changes can also be used: tests measuring certain constructs can be shown to have construct validity if scores on the tests show predictable developmental changes over time.
Experimental intervention is another source of evidence: if a test has construct validity, scores should change following an experimental manipulation, in the direction predicted by the theory underlying the construct.
If a test is unreliable, it cannot be valid.
For a test to be valid, it must be reliable.
However, just because a test is reliable does not mean it will be valid.
Reliability is a necessary but not sufficient condition for validity!
There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis is used to help "build" reliability and validity "into" the test from the start. Item analysis can be both qualitative and quantitative. The former focuses on issues related to the content of the test, e.g., content validity. The latter primarily includes measurement of item difficulty and item discrimination.
Item difficulty: an item's difficulty level is usually measured in terms of the percentage of examinees who answer the item correctly. This percentage is referred to as the item difficulty index, or "p".
Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest correlation with the total score are retained for the final version of the test. This would be appropriate when a test measures only one attribute and internal consistency is important.
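A small sketch of the computation, with p expressed as a proportion between 0 and 1 (multiply by 100 for a percentage). The item names and response data are invented for illustration.

```python
def item_difficulty(responses: list) -> float:
    """Proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)


# Hypothetical responses of five examinees to three items.
responses_by_item = {
    "item_1": [1, 1, 1, 1, 0],
    "item_2": [1, 0, 1, 0, 0],
    "item_3": [0, 0, 0, 1, 0],
}

p_values = {item: item_difficulty(r) for item, r in responses_by_item.items()}

for item, p in p_values.items():
    print(f"{item}: p = {p:.2f}")
```

Note that a higher p means an easier item: here item_1 is the easiest and item_3 the hardest.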
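The item-total correlation method might be sketched as follows. The data are made up, and for simplicity the total score here includes the item itself (an uncorrected item-total correlation), which is an assumption of this sketch rather than something the notes specify.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical total test scores for six examinees, and their 0/1
# responses to two individual items.
total_scores = [9, 8, 7, 4, 3, 2]
strong_item = [1, 1, 1, 0, 0, 0]  # passed only by high scorers
weak_item = [1, 0, 1, 0, 1, 0]    # largely unrelated to total score

r_strong = pearson_r(strong_item, total_scores)
r_weak = pearson_r(weak_item, total_scores)

print(f"strong item: r = {r_strong:.2f}")
print(f"weak item:   r = {r_weak:.2f}")
```

Under this approach the first item would be the better candidate to retain, since its responses track the total score much more closely.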
Another way is a discrimination index (D).
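One common way to define D, assumed here because these notes do not spell it out, is the proportion of high scorers who answer the item correctly minus the proportion of low scorers who do. The size of the upper and lower groups (the top and bottom 30% below) is likewise an illustrative choice, and all data are invented.

```python
def discrimination_index(item_scores: list, total_scores: list,
                         group_fraction: float = 0.30) -> float:
    """D = p(upper group) - p(lower group), with groups formed from total scores."""
    n = len(total_scores)
    group_size = max(1, round(n * group_fraction))
    # Examinee indices ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical data for ten examinees: total scores and 0/1 responses to one item.
total_scores = [95, 90, 85, 80, 70, 65, 60, 50, 40, 30]
item_scores = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

d = discrimination_index(item_scores, total_scores)
print(f"D = {d:.2f}")
```

D ranges from -1 to +1. Values near +1 mean the item separates high from low scorers well, and a negative value means low scorers did better on the item than high scorers.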
Norm-referenced interpretation involves comparing an examinee's score to that of others in a normative sample. It provides an indication of where the examinee stands in relation to others who have taken the test. Norm-referenced scores include developmental scores (e.g., mental age scores and grade equivalents), which indicate how far along the normal developmental path an individual has progressed, and within-group norms, which provide a comparison to the scores of other individuals whom the examinee most resembles.
Criterion-referenced interpretation involves interpreting an examinee's test score in terms of an external, pre-established standard of performance.
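Two familiar within-group norm scores are the percentile rank and the z-score. The sketch below assumes a small made-up normative sample and defines percentile rank as the percentage of the sample scoring strictly below the examinee, which is one convention among several.

```python
import statistics


def percentile_rank(score: float, norm_scores: list) -> float:
    """Percentage of the normative sample scoring strictly below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def z_score(score: float, norm_scores: list) -> float:
    """Distance from the normative mean, in normative standard deviation units."""
    mean = statistics.mean(norm_scores)
    sd = statistics.pstdev(norm_scores)
    return (score - mean) / sd


# Hypothetical normative sample and one examinee's raw score.
norm_sample = [40, 45, 50, 50, 55, 60, 65, 70, 75, 90]
raw_score = 65

pr = percentile_rank(raw_score, norm_sample)
z = z_score(raw_score, norm_sample)

print(f"percentile rank = {pr:.0f}, z = {z:.2f}")
```

The same raw score would receive a different percentile rank and z-score against a different normative sample, which is why the choice of comparison group matters.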
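By contrast with a comparison against other examinees, an interpretation against a fixed standard needs no normative sample at all. A minimal sketch, assuming a hypothetical mastery standard of 80% of items correct:

```python
def meets_standard(items_correct: int, items_total: int,
                   mastery_proportion: float = 0.80) -> bool:
    """True if the proportion correct reaches the pre-established standard."""
    return items_correct / items_total >= mastery_proportion


# Hypothetical examinees on a 20-item test.
print(meets_standard(17, 20))  # 85% correct
print(meets_standard(15, 20))  # 75% correct
```

Each examinee's result depends only on his/her own score and the standard, so in principle everyone (or no one) could meet it.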