Classical Test Theory
Classical test theory (CTT) is both a philosophical argument in psychological science and a set of operations in mathematical statistics that focus on measuring mental attributes in humans. This broad description of CTT contains all the elements one needs to understand and appreciate it, both as a theory and in its operation in testing programs, but it does require an extended explanation. In the description, the phrases “philosophical argument in psychological science,” “operations in mathematical statistics,” and “measuring mental attributes” deserve particular attention because understanding them is the key to learning CTT. In this entry each term is explained in nontechnical language. Then CTT is discussed regarding its mathematical underpinnings, again presented nontechnically. In this second part, four formulas are explained that illustrate two essential aspects of CTT: reliability and standard error of measurement.
CTT is a scientific endeavor. In fact, realizing that CTT is science is an important step in learning about it. CTT meets all the criteria of any true science: It has a philosophical underpinning, a coherent methodology, and its methods are replicable by other scientists. Its place in science is so well established that it may be claimed that measuring mental attributes—the purpose of CTT—is psychology's greatest contribution to the world of science.
Even before exploring the terms listed in its description, however, it is useful to note the words that compose CTT: classical, test, and theory. Classical suggests something old as well as tried-and-true. CTT is classical in the sense that it is fundamental to measurement science, but it is not ancient. In fact, many persons are surprised to learn that the field of measurement science formed only in the mid-to-late 19th century with much of its development not until the 20th century, stemming from the groundbreaking work of Spearman, Binet, Thurstone, and later Thorndike, all persons with unusually high IQ. (For a history of the field, see the 1997 special issue of Educational Measurement: Issues and Practice or Sternberg's delightfully readable Metaphors of Mind: Conceptions of the Nature of Intelligence.) The noun test is widely known, of course, meaning an instrument used to appraise, examine, or analyze. The last word in CTT, theory, is also descriptive and accurate. Theories are, by definition, ontologically unprovable; however, like most robust theories, CTT has provided over the years ample evidence that applying it appropriately yields meaningful and useful information.
The descriptive terms of CTT, “philosophical argument in psychological science,” “operations in mathematical statistics,” and “measuring mental attributes,” need to be explained next. Regarding the philosophical argument supporting CTT, the ontological contention is made that there are certain malleable aspects to the fundamental being of humans, such as ability, proficiency, beliefs, attitudes, opinions, and probably desire and intent, too. In psychological science these mental attributes are conceived of as cognitive processes and get hypothesized as constructs. Constructs are often more fully called latent constructs to emphasize that they are deeply embedded in the human psyche and represent cognitive processes. Cognitive processes are malleable (after all, the very purpose of education is to develop them), but they are not directly observable. People cannot see reasoning even with sophisticated devices, although they can measure brain waves and quantify electrical and chemical actions. Still, despite not seeing into the brain, individuals perform myriad behavioral operations based on cognitive processes, such as reading or voting or expressing an opinion.
It follows that while latent constructs are not observable, the behaviors people use to express them are. For instance, people can observe an individual reading a story, reciting historical incidents, solving a mathematical or reasoning problem, or expressing a belief or an opinion. Hence, by observing behaviors they infer stimulation of a cognitive process. It is important to recognize that mental measurements of hypothesized latent constructs are only—and at best—inferences of more deeply seated cognitive processes.
Furthermore, people can reliably make distinctions between observations. That is, they can suggest that one person consistently comprehends more of a given passage in a text than is comprehended by another person, or that a particular individual can reason through a problem more thoughtfully than another person, or that one man or woman has an opinion that is more extreme than the opinion expressed by others. In a 1904 foundational work on CTT, Introduction to the Theory of Mental and Social Measurements, Thorndike described the capacity to reliably make distinctions in behavioral observations as a concept called a just noticeable difference. The just noticeable difference had earlier been given a mathematical structure in Fechner's law, set out in Elements of Psychophysics (first published in 1860; English translation, 1966).
Putting the facts together that (1) observation of behaviors leads to inferences about cognitive activity, and (2) distinctions can be reliably made between degrees of a hypothesized construct gives us a test. A test—simply but significantly—standardizes the examinees' behavioral responses so that others can scale and score them and then make interpretations, usually by comparisons to peers or to defined standards. In testing contexts, a test item or test exercise is considered to be a carefully prescribed stimulus.
However, the testing situation becomes complex when this basic idea is implemented with people. Several obvious reasons contribute to the complexity: (1) latent constructs are difficult to specify with precision, (2) the constructs are malleable, and (3) the stimuli meant to engender the examinee's behavioral response may be constructed such that an examinee's response does not completely or accurately engage the targeted construct (for example, in the case of a poorly crafted test item). All this adds up to imprecision in measurement—termed measurement error. Without measurement error, people would know the true ability for any individual on a given cognitive process. And the notion of true ability is a central feature of CTT. Of course, people never measure mental attributes without error.
Mathematically, the notion of measuring true ability in CTT is expressed in this famous formula:

\[ X_{ij} = \tau_{ij} + \varepsilon \tag{1} \]
Formula 1 states that in CTT an observed score is the sum of a true score plus some error. While this is a logical supposition, expressing it as a mathematical formula allows testers to use it to model an individual's behavior in a testing context. Examining the CTT formula shows how this works.
Each term in the equation has a special meaning. The left-hand term (X) is a Roman letter indicating that this outcome is observed; namely, it is the test score of a given examinee. The first term on the right-hand side of the equation is the true score (τ, tau) and denotes the structural part of the equation. That is to say, it represents the latent construct being measured. The error (ε, epsilon) is called the stochastic part of the equation, from the Greek word for aim or guess. It is characterized by randomness in the population, an accurate description since no two individuals' true scores are estimated with the same degree of precision. Greek letters are used for these terms to indicate that they apply globally; that is, everyone in the population has a true score (although it differs from person to person) and measuring always includes error.
The subscripts in the formula have meaning, too. They specify that the observation is on a given item (i) and for a particular individual (j). There is no subscript for the error since it is not tied directly to any individual, again reflecting the randomness of all measurement error.
From the formula, it is easy to see that as the error decreases to a limit of zero (attenuates, in statistical language), the observed score grows closer to the value of the true score. In perfect measurement, they are the same: X = τ. Unfortunately, measurements are never perfect, which explains the existence of the error term in CTT.
CTT is often used to estimate the degree of error in a particular testing context. This application of the theory usually centers on two statistics: the standard error of measurement (SEM) and an index of the reliability of scores for a particular testing occurrence. Both concepts are mathematical estimations of measurement error, but they convey different information. In a global sense both measures indicate stability in measurement, or reliability. In fact, CTT is often called a theory of reliability, encompassing both indices.
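The relationship between the error term and the observed score can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true score of 70, the error standard deviations, and the choice of normally distributed error are all hypothetical and are not part of the theory as described here.

```python
import random

def observed_score(true_score, error_sd, rng):
    """One observed score X = tau + epsilon, with epsilon drawn from Normal(0, error_sd)."""
    return true_score + rng.gauss(0.0, error_sd)

def mean_abs_gap(true_score, error_sd, n_draws=2000, seed=0):
    """Average distance between observed and true score over many simulated testings."""
    rng = random.Random(seed)
    gaps = [abs(observed_score(true_score, error_sd, rng) - true_score)
            for _ in range(n_draws)]
    return sum(gaps) / n_draws

TRUE_SCORE = 70.0  # hypothetical true score, chosen only for illustration

# Shrink the error toward zero and watch the observed score close in on tau.
gaps = {sd: mean_abs_gap(TRUE_SCORE, sd) for sd in (8.0, 4.0, 1.0, 0.0)}
for sd, gap in gaps.items():
    print(f"error SD {sd:>4}: mean |X - tau| = {gap:.2f}")
```

With an error standard deviation of zero the observed and true scores coincide, which is the perfect-measurement case X = τ.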
Reliability means consistency of measurement. A perfectly reliable test is one that would yield the identical score (i.e., the examinee's true score) over many occasions, presuming no learning or other confounding factor intervened between the test administrations. But, of course, perfect reliability is never achieved in real-world scenarios. Were a test administered many times, any given examinee would likely not obtain the same score over and over again. Even with a carefully constructed test some variation in scores would occur. As illustration, in a case in which the raw score on a first testing occasion was 72, while on the second it was 74, and on the third a 68 was obtained, the inconsistency in scores is evidence of measurement error, or less than perfect reliability.
Given the improbable—but theoretically interesting— scenario in which a tolerant examinee took the test (or equivalent tests) a very large number of times and (also improbably) no learning or other factor such as fatigue influenced any attempt, the examinee would eventually have an entire distribution of scores that ranged from the examinee's lowest obtained score to the highest obtained score. From this theoretical distribution of scores, the mean is considered to be the examinee's true score, and the standard deviation is the SEM.
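The thought experiment above can be mimicked in a few lines of code. The sketch below assumes a hypothetical true score of 72 and normally distributed error with a standard deviation of 3; neither value comes from a real test. It also summarizes the three scores used in the earlier illustration (72, 74, and 68).

```python
import random
import statistics

rng = random.Random(42)

TRUE_SCORE = 72.0    # hypothetical true score
ERROR_SD = 3.0       # hypothetical spread of measurement error
N_ATTEMPTS = 50_000  # the "very large number" of error-free-of-learning retakes

# Each attempt is the true score plus a fresh random error.
scores = [TRUE_SCORE + rng.gauss(0.0, ERROR_SD) for _ in range(N_ATTEMPTS)]

# Mean of the distribution estimates the true score; its SD estimates the SEM.
estimated_true_score = statistics.fmean(scores)
estimated_sem = statistics.stdev(scores)
print(f"estimated true score: {estimated_true_score:.2f}")
print(f"estimated SEM:        {estimated_sem:.2f}")

# With only three occasions the same summaries are far rougher estimates.
three_scores = [72, 74, 68]
three_mean = statistics.fmean(three_scores)
three_sd = statistics.stdev(three_scores)
print(f"three occasions: mean = {three_mean:.2f}, SD = {three_sd:.2f}")
```

The large simulated distribution recovers the assumed values closely, while three occasions give only a coarse picture, which is why the scenario calls for a very large number of attempts.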
Thus, reliability and SEM are indicators of consistency in CTT, and there are methods for calculating various statistics that represent them. Of course, calculating indicators of consistency requires multiple occasions; with only one occasion, consistency cannot be determined. This point, while obvious, is important: For reliability estimation, regarding the test, there needs to be more than one item, and for the examinee, there needs to be more than one testing occasion.
To address this constraint it is necessary to introduce another important aspect of CTT, its additive characteristic. In CTT, tests are considered to be composed of some number of items that work together in an additive fashion. An additive function is one that conserves the addition operation, as shown for a test with n items in Equation 2.
Essentially, this equation shows that in CTT the scores of individual items, which are often but not always dichotomous (meaning right or wrong), can be summed to a cumulative whole, a test score. More technically, the overall test score is a linear combination of the individual test item scores. Working from the additivity rule, reliability is determined by calculating how consistently examinees respond across the items.
Traditional reliability strategies use one of two ways to estimate the consistency: either temporal stability or internal consistency. Temporal stability is a family of techniques that gauge the extent to which a test yields consistent scores from one occasion to the next; test-retest is the typical strategy. Internal consistency looks to the covariance structure of the item responses within a single administration to produce an index of reliability. Splitting the test in half (i.e., split-half) is one such strategy, and one popular measure of internal consistency is called Cronbach's alpha (α). By any of these means, however, reliability is theoretically conceived as the correlation of the observed score with the individual's true score, and it is expressed syntactically in Formula 3. In statistical contexts, a ρ (rho) symbolizes a correlation, and the subscripts denote the variables.

\[ \rho_{X\tau} \tag{3} \]
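In the notation used for the observed score, the additive rule can be sketched as follows. The total-score symbol X_j (examinee j's score on the whole test) is an assumption introduced here for illustration; the item scores X_ij and the number of items n follow the surrounding text.

```latex
\[
X_{j} \;=\; \sum_{i=1}^{n} X_{ij} \;=\; X_{1j} + X_{2j} + \cdots + X_{nj}
\]
```

For dichotomous items scored 0 or 1, this sum is simply the number of items answered correctly.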
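As one concrete instance of an internal-consistency estimate, Cronbach's alpha can be computed from the item variances and the variance of the summed test scores. The sketch below uses the standard textbook formula; the response matrix is invented purely for illustration (six hypothetical examinees, four dichotomous items), and the function name is likewise hypothetical.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix of item scores (rows = examinees, columns = items).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*responses))                      # one tuple per item
    item_variances = [statistics.variance(col) for col in item_columns]
    total_scores = [sum(row) for row in responses]            # additive test score
    total_variance = statistics.variance(total_scores)
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical right/wrong (1/0) responses, made up for this example.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Note that the function needs more than one item, which mirrors the point made earlier that consistency cannot be judged from a single observation.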
This formula represents an index of reliability and applying it to any of the calculation strategies yields a coefficient. Coefficients in this context range from 0 to 1: no reliability (i.e., randomness) to perfect reliability.
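The range of the coefficient can be seen in a simulation where, unlike in real testing, the true scores are known. The sketch below is illustrative only: the population mean of 50, the true-score standard deviation of 10, and the normally distributed error are all assumptions, and the helper names are hypothetical.

```python
import math
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulated_index(true_sd, error_sd, n_examinees=20_000, seed=1):
    """Correlation of observed scores with (simulated, hence known) true scores."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_examinees)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(observed, true_scores)

no_error = simulated_index(true_sd=10.0, error_sd=0.0)       # perfect measurement
some_error = simulated_index(true_sd=10.0, error_sd=5.0)     # moderate error
mostly_error = simulated_index(true_sd=10.0, error_sd=100.0) # error swamps the signal

for label, r in [("no error", no_error), ("some error", some_error),
                 ("mostly error", mostly_error)]:
    print(f"{label:>12}: correlation of X with tau = {r:.3f}")
```

As the error grows relative to the spread of true scores, the correlation slides from 1 toward 0, matching the interpretation of the coefficient as running from perfect reliability to randomness.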
Finally, as explained above, SEM indicates discrepancies between observed scores and true scores, and it too is a good indicator of reliability. Syntactically, it is the standard deviation of the errors (expressed as sigma, σ), which is tied to the standard deviation of the observed scores through the test's reliability, as shown in Formula 4. When the test's reliability is known, the SEM is easily calculated.
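To make the last point concrete, the usual textbook relationship is that the SEM equals the observed-score standard deviation multiplied by the square root of one minus the reliability coefficient. The sketch below assumes that form, where the coefficient is the squared correlation between observed and true scores; the function name and the numbers (a standard deviation of 10, a reliability of .91, an observed score of 72) are hypothetical.

```python
import math

def standard_error_of_measurement(observed_sd, reliability):
    """SEM = SD of observed scores * sqrt(1 - reliability coefficient)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return observed_sd * math.sqrt(1.0 - reliability)

# Hypothetical test: observed-score SD of 10 and a reliability coefficient of .91.
sem = standard_error_of_measurement(observed_sd=10.0, reliability=0.91)

# A band of one SEM on either side of a hypothetical observed score of 72.
observed = 72.0
band = (observed - sem, observed + sem)
print(f"SEM = {sem:.2f}; band = {band[0]:.1f} to {band[1]:.1f}")
```

A perfectly reliable test (coefficient of 1) gives an SEM of zero, and a coefficient of 0 gives an SEM equal to the full observed-score standard deviation.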
In sum, CTT is clearly a powerful theory of measurement with a philosophical base and set of mathematics useful to implement it. From learning about them, people can readily appreciate why CTT is the most commonly used basis for educational and psychological tests.
See also: Item Response Theory
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
Fechner, G. T. (1966). Elements of psychophysics. New York: Holt, Rinehart, & Winston.
Thorndike, E. L. (1904). Introduction to the theory of mental and social measurements. New York: Teachers College, Columbia University.