Item Response Theory
Item response theory (IRT) is an approach to modern educational and psychological measurement that posits a particular conception of cognition and sets forth sophisticated statistical methods for appraising cognitive processes. Its objective is to reliably calibrate individuals and test stimuli (i.e., items and exercises) on a common scale, which is interpreted to show the individuals' ability or proficiency and specified characteristics of the test stimuli.
IRT is attractive for a number of reasons but principally because it is presumed that IRT-based estimates of examinees' ability are more precise than can be garnered through traditional means, such as summing the number of correct responses to a set of test items or exercises. Also, IRT is applicable to many practical testing problems, such as generalizability of test results, various item analyses, examining test bias and differential item functioning, equating test forms, estimating construct parameters, domain scoring, and adaptive testing.
In IRT, cognitive processes are hypothesized as abilities or proficiencies. Some examples are reading, computing, and reasoning problems through to credible solutions, as well as beliefs, attitudes, opinions, and even desires and aspirations: in short, almost anything that is a cognitive process. Some skills or talents, such as playing a musical instrument or giving a theatrical performance, and some physical acts, such as running or successfully hitting a baseball, can be accommodated in the theory as well. Each ability or proficiency is conceived as lying along a continuum that ranges from none at all to complete mastery. In statistical terms, the range is infinite (±∞). Figure 1 depicts this notion graphically.
Figure 1. Graphical depiction of the ability or proficiency continuum.
As cognitive processes, abilities and proficiencies are deeply seated in the brain and cannot be directly observed. For this reason, they are described as latent, and often as latent traits. Some persons working with IRT believe that describing aspects of cognition as traits is too limiting: It does not capture the fact of their malleability or the notion that they may be influenced by environmental and social factors; hence, more generic terms such as abilities and proficiencies are sometimes used. In this essay, the term ability is used.
The notion of mental abilities ranging along a continuum contrasts with classical test theory (CTT), in which knowledge is conceived as being circumscribed within a domain (e.g., a reading domain), and a true score for any particular examinee can be estimated for that knowledge domain. In CTT, the more precisely the true score is estimated, the less the error in measurement and hence the greater the reliability.
A principal objective of IRT is to determine the point along the ability continuum that best calibrates a particular individual to the scale. This point—the test score expressed in IRT terms—is interpreted to reflect the individual's ability on whatever is the object of measurement (e.g., reading). Figure 2 illustrates this foundational IRT notion. For mathematical reasons, ability is expressed as theta (θ), and when referring to individuals, the continuum is called the θ scale.
As seen in Figure 2, a given person may be low in the ability (shown toward the left side of the θ scale), whereas another may be in the middle, and a third person may be high (shown toward the right side of the θ scale). There is no assumption that individuals spread in a bell-shaped (normal) distribution: all persons in a population could be high in the ability, all could be in the middle, or they could have any other dispersion.
Of course, determining where a given individual's θ is situated along the scale (i.e., his or her ability) requires that some questions (test stimuli: items or exercises) be administered to the individual so that the θ estimate may be calculated. For the calculation, it is necessary to know characteristics of these test stimuli, a feature of IRT called item characteristics. Three commonly used item characteristics are (1) the item's level of difficulty along the continuum, (2) its discrimination in detecting differences in ability between examinees (an item that everyone responds to correctly yields no discrimination), and (3) the likelihood of low-ability examinees guessing a correct answer. Figure 3 depicts the notion of item characteristics being placed along the continuum.
Just as examinees may be located at any particular point along the continuum, so too may characteristics of items be situated at any point. In other words, a given item may be low in difficulty, have low discriminating power, or be relatively easy to guess, whereas another item may fall in the middle or at the high end on these characteristics. Many items are typically situated all along the scale, reflecting wide dispersion.
To repeat, the three ingredients of IRT are the scale, the examinees, and the items. As stated earlier, the scale's range is infinite (±∞). What remains is to determine the θ value for each examinee in the tested sample and the characteristics of the employed test items. The characteristics of the items are discussed next.
In IRT (as well as in some other statistical contexts), item characteristics are plotted as a curve, called an ICC (item characteristic curve). While any number of characteristics may be plotted, it is common to display three of them: the discrimination, the difficulty, and the guessing. By convention, these are labeled a, b, and c. In a statistical sense, the known characteristics of any given item are representative of a population of like items and hence are labeled parameters. The a, b, and c characteristics are thus called the a parameter, b parameter, and c parameter. When all three characteristics are estimated, the IRT model is called the three-parameter model (3PL).
In IRT, there are many variations of ICCs. For instance, a common circumstance in IRT work is to estimate only a single item parameter, its difficulty (the b parameter). This is the 1PL, which usually falls into an IRT category of estimation called the Rasch model, named after the Danish mathematician Georg Rasch. The 3PL accounts for the most common IRT applications. Item characteristics for the 3PL are plotted in Figure 4.
In Figure 4, there are two scales, represented on the vertical (ordinate) and horizontal (abscissa) axes. The vertical axis is the probability of a correct response given θ, labeled P(θ); it ranges from 0 to 1.0 and, for dichotomous items, is interpreted as the likelihood of getting the item correct. The horizontal axis is the IRT θ scale, which (theoretically at least) ranges over ±∞. For interpretability, however, the θ scale is expressed in standardized units with a mean of 0 and a standard deviation of 1; since nearly all of a population is contained within ±3 standard deviations, this is all of the range that is typically shown.
Regarding the ICC itself, the reader can see that the curve is shaped like a lazy S for an item with meritorious characteristics (though it can assume almost any shape). Technically, the curve is an ogive, but many authors simply refer to it as the ICC. It begins at the lower left (here, almost 3 standard deviations below the mean of 0), where its height indicates the c parameter, guessing by low-ability examinees. In the figure, this starting point is about .10, meaning that persons of very low ability still have about a ten percent probability of getting the item correct, even by mere guessing. As the curve progresses to the right, reflecting more and more ability, it slopes upward, indicating that as ability increases so does the probability of a correct response. The slope of the ICC represents the a parameter, the item's discriminating power. Next, the reader can observe the overall location of the curve along the θ scale and imagine a vertical line drawn from its mid-point down to the horizontal axis. In this example, that line would intersect the θ scale at about 0.5, meaning that this item is best suited to examinees who are slightly more able than average (about a half standard deviation above the mean ability). It is important to realize from this example that an ICC can be situated anywhere along the θ scale, and its mid-point reflects its difficulty, the b parameter. In Figure 4, then, all three item characteristics can be seen: (a) the discrimination, (b) the difficulty, and (c) the guessing parameters.
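The curve just described can be evaluated numerically. The sketch below is illustrative only: it uses the approximate values read off Figure 4 (c ≈ .10, b ≈ 0.5) together with an assumed discrimination of a = 1.0, which the article does not state.

```python
import math

def icc_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Approximate values read from Figure 4; a = 1.0 is an assumption.
a, b, c = 1.0, 0.5, 0.10

# Very low ability: the probability approaches the guessing floor c.
print(round(icc_3pl(-3.0, a, b, c), 3))
# At theta = b, the curve is at its mid-point between c and 1.0:
# (c + 1) / 2 = 0.55.
print(round(icc_3pl(b, a, b, c), 3))
```

At θ = b the probability is exactly halfway between the guessing floor and 1.0, which is why the mid-point of the ogive marks the item's difficulty.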
When a person is learning IRT it is important to appreciate the fact that persons and items are calibrated on the same scale. This allows observers to learn features of each ingredient in IRT (persons and items) from the other. In other words, when test makers know the characteristics of items, they can observe (through the test) which ones a particular examinee gets right and wrong and thereby determine his or her ability. Conversely, when test makers know the y values for a relatively large number of examinees, they can calibrate items to the scale. This reciprocal finding is something akin to saying that if a teacher knows a student's grade point average the teacher can ipso facto identify something about the student's study habits and vice versa (not a perfect indicator but in the main a reliable one).
To persons new to IRT or not experienced in statistics, it may seem perplexing to state that items are calibrated to the scale from what test makers know about examinees' abilities, while examinees are fitted to the scale from what test makers know about items. It seems a bit like the question: "Which came first, the chicken or the egg?" This is a relevant observation in IRT. Mathematically, the issue is addressed through the maximum likelihood function, a specialized statistical approach that determines the likelihood of observing a set of data under a hypothesized model. To explain how this works in IRT, one can suppose an item is well crafted and appropriate in its characteristics to a given examinee. The test maker presumes the examinee has a .5 chance of giving a correct response. This is expressed syntactically in Equation 1.
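In symbols (a plausible reconstruction, since the displayed equation does not survive in this copy):

```latex
P(\theta) = .5 \tag{1}
```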
Next, the test maker can generalize this notion to a response (either correct or incorrect) to any item. Since the probability rests on ability, the generalization is written as follows.
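A plausible reconstruction of the generalization, in the conditional-probability notation used in the next paragraph:

```latex
P(U_i \mid \theta) \tag{2}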
Equation 2 is read as: the probability (P) of a response (U) on any given appropriate item (i) is a function of ability (θ). Thus, for the hypothetical examinee described for Equation 1, taking that item, it is presumed that P(Ui | θ) = .5. Examinees of another ability level would have a different probability.
Tests are composed of more than one item, of course, so the probability function is extended to include a test of any length (n items). Now, the probability of a correct response is conditioned upon several items and is accordingly a joint probability, meaning the probability of a response on all the items. A joint probability is calculated as the probability of a response on the first item times the probability of a response on the second item, and so forth to n items. This is written in Equation 3.
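A plausible reconstruction of the joint probability for n items (the displayed equation does not survive in this copy):

```latex
P(U_1, U_2, \ldots, U_n \mid \theta)
  = \prod_{i=1}^{n} P(U_i \mid \theta) \tag{3}
```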
To see Equation 3 in action, the reader may imagine that a particular examinee is presented with 3 items on a test: a perfectly suited item (one for which the probability is .5 at his ability), another that is very easy for his ability (with, say, a probability of .8), and a third item that is difficult relative to his ability (say, a probability of .4). The joint probability of responding correctly to all the items on this short test is .16 (.5 × .8 × .4), or about 16 percent. Determining item characteristics is similarly done with a likelihood function, but this time using the examinee's presumed ability to inform an item's characteristics.
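The arithmetic of this short example can be checked directly; the sketch below simply multiplies the three item probabilities given in the text.

```python
# Probabilities of a correct response for one hypothetical examinee on
# three items: well suited (.5), very easy (.8), and difficult (.4).
probs = [0.5, 0.8, 0.4]

joint = 1.0
for p in probs:
    joint *= p  # joint probability = product of the item probabilities

print(round(joint, 2))  # 0.16
```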
The mathematics of solving likelihood equations involves calculus and is not easily done when there are many items on a test. However, if the metric is logistic, the calculations are much simpler, and applying a small scaling constant yields answers that are nearly identical to what would be obtained in the normal metric. Hence, most IRT calculations are done in the logistic metric, and the θ scale is expressed in log-odds units, called logits.
The mathematical expression for the 3PL, in logistic units, is as follows.
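A plausible reconstruction of the 3PL expression (the displayed equation does not survive in this copy), consistent with the terms named in the following paragraph:

```latex
P_i(\theta) = c_i + (1 - c_i)\,
  \frac{e^{\,D a_i (\theta - b_i)}}{1 + e^{\,D a_i (\theta - b_i)}} \tag{4}
```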
While Equation 4 appears formidable, it is straightforward. In the equation, most terms are already familiar, including the probability (P) and the a, b, and c parameters. The e merely denotes that the expression uses the natural base e, and D is a scaling constant that allows the logistic results to closely approximate the normal metric.
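The role of D can be shown numerically. The sketch below (an illustration, not from the article) compares the logistic ogive with D = 1.702 to the cumulative normal ogive; over the whole scale the two never differ by as much as .01.

```python
import math

def logistic(x, D=1.702):
    """Logistic ogive with the scaling constant D."""
    return 1 / (1 + math.exp(-D * x))

def normal_ogive(x):
    """Cumulative normal distribution, computed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Maximum absolute difference between the two curves over a fine grid.
max_diff = max(abs(logistic(x) - normal_ogive(x))
               for x in [i / 100 for i in range(-400, 401)])
print(max_diff < 0.01)  # True
```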
From this point on, IRT is mostly a search process wherein examinee responses to items inform estimates of the item characteristics, and the estimated item characteristics are used in turn to search for the best-fitting examinee ability. When the process is complete, the test maker knows both IRT ingredients: examinee ability and item characteristics.
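One half of this search can be illustrated with a toy maximum-likelihood estimate of θ for a single examinee, holding item parameters fixed. All item values and responses below are invented for illustration, and the grid search stands in for the calculus-based methods real IRT software uses.

```python
import math

def p_correct(theta, a, b, c, D=1.702):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Hypothetical, pre-calibrated items as (a, b, c) triples.
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (0.8, 1.0, 0.2)]
responses = [1, 1, 0]  # correct, correct, incorrect

def likelihood(theta):
    """Joint probability of the observed response pattern at theta."""
    L = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        L *= p if u == 1 else (1 - p)
    return L

# Grid search: the theta with the highest likelihood is the estimate.
grid = [i / 100 for i in range(-300, 301)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)
```

The estimate lands where the response pattern (right on the easier items, wrong on the hard one) is most probable; calibrating items from known abilities proceeds analogously, with the roles of person and item parameters reversed.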
IRT is a powerful route to estimating an examinee's ability on the tested construct as well as to learning about the characteristics of test items. For these reasons, it is commonly used in many national testing programs, such as the NAEP (National Assessment of Educational Progress), the SAT (Scholastic Assessment Test), the GRE (Graduate Record Examinations), the LSAT (Law School Admission Test), the MCAT (Medical College Admission Test), and many other assessment programs.
See also: Classical Test Theory