Item analysis is a general term for the specific methods used in education to evaluate test items, typically for the purpose of test construction and revision. Regarded as one of the most important aspects of test construction, and one receiving increasing attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory (CMT), also called classical test theory (CTT). Classical measurement theory treats an observed score as the direct result of a person's true score plus error. This error is of particular interest because earlier measurement theories were unable to specify its source. Item response theory, by contrast, uses item analysis to differentiate between types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, the probability of answering items correctly, the overall ability of the test taker, and the degrees or levels of knowledge being assessed.
There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to frustrate students and deflate scores, whereas tests that are too easy (and have an insufficient ceiling) tend to reduce motivation and inflate scores. Tests can be improved by maintaining and developing a pool of valid items that cover a reasonable span of difficulty levels and from which future tests can be drawn.
Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for correctness. This situation also provides an opportunity to identify and examine common misconceptions among students about a particular concept.
In general, once test items have been created, their value can be systematically assessed using several methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve. Difficulty is assessed by examining the proportion of persons answering the item correctly. Discrimination can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item characteristic curve can be used to plot the likelihood of answering correctly against the level of success on the test.
In test construction, item difficulty is determined by the number of people who answer a particular test item correctly. For example, if the first question on a test was answered correctly by 76% of the class, then the difficulty level (p or percentage passing) for that question is p = .76. If the second question on a test was answered correctly by only 48% of the class, then the difficulty level for that question is p = .48. The higher the percentage of people who answer correctly, the easier the item, so that a difficulty level of .48 indicates that question two was more difficult than question one, which had a difficulty level of .76.
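The computation is straightforward: p is simply the proportion of correct responses. A minimal sketch in Python (the response data below are hypothetical, chosen to reproduce the percentages above):

```python
def difficulty(responses):
    """Item difficulty p: proportion of test takers answering the item correctly."""
    return sum(responses) / len(responses)

# Hypothetical 1/0 (correct/incorrect) responses for two items in a class of 25
item_1 = [1] * 19 + [0] * 6   # 19 of 25 correct
item_2 = [1] * 12 + [0] * 13  # 12 of 25 correct

print(difficulty(item_1))  # 0.76
print(difficulty(item_2))  # 0.48
```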
Many educators find themselves wondering how difficult a good test item should be. Several things must be taken into consideration to determine the appropriate difficulty level. The first task of any test maker should be to determine the probability of answering an item correctly by chance alone, also referred to as guessing or luck. For example, a true-false item, because it has only two choices, could be answered correctly by chance half of the time. Therefore, a true-false item with a demonstrated difficulty level of only p = .50 would not be a good test item because that level of success could be achieved through guessing alone and would not be an actual indication of knowledge or ability level. Similarly, a multiple-choice item with five alternatives could be answered correctly by chance 20% of the time. Therefore, an item difficulty greater than .20 would be necessary in order to discriminate between respondents' ability to guess correctly and respondents' level of knowledge. Desirable difficulty levels usually can be estimated as halfway between 100% and the percentage of success expected by guessing. So the desirable difficulty level for a true-false item, for example, should be around p = .75, which is halfway between 100% and 50% correct.
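The halfway rule can be sketched as follows (an illustration only; the function names are our own):

```python
def chance_level(num_choices):
    """Probability of answering correctly by guessing alone."""
    return 1 / num_choices

def desirable_difficulty(num_choices):
    """Halfway between the chance level and 100% correct."""
    chance = chance_level(num_choices)
    return chance + (1 - chance) / 2

print(desirable_difficulty(2))  # 0.75 for a true-false item
print(desirable_difficulty(5))  # about 0.60 for a five-choice item
```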
In most instances, it is desirable for a test to contain items of various difficulty levels in order to distinguish between students who are not prepared at all, students who are fairly prepared, and students who are well prepared. In other words, educators do not want the same level of success for those students who did not study as for those who studied a fair amount, or for those who studied a fair amount and those who studied exceptionally hard. Therefore, it is necessary for a test to be composed of items of varying levels of difficulty. As a general rule for norm-referenced tests, items in the difficulty range of .30 to .70 yield important differences between individuals' levels of knowledge, ability, and preparedness. There are a few exceptions, however, depending on the purpose of the test and the characteristics of the test takers. For instance, if the test is to help determine entrance into graduate school, the items should be more difficult to allow finer distinctions between test takers. For a criterion-referenced test, most of the item difficulties should be clustered around the criterion cut-off score or higher. For example, if a passing score is 70%, the vast majority of items should have percentage passing values of p = .60 or higher, with a number of items in the p > .90 range to enhance motivation and to test for mastery of certain essential concepts.
According to Wilson (2005), item difficulty is the most essential component of item analysis. However, it is not the only way to evaluate test items. Discrimination goes beyond determining the proportion of people who answer correctly and looks more specifically at who answers correctly. In other words, item discrimination determines whether those who did well on the entire test also did well on a particular item. An item should in fact be able to discriminate between upper and lower scoring groups. Membership in these groups is usually determined by total test score, and it is expected that those scoring higher on the overall test will also be more likely to endorse the correct response on a particular item. Sometimes an item will discriminate negatively; that is, a larger proportion of the lower scoring group than of the higher scoring group selects the correct response. Such an item should be revised or discarded.
One way to determine an item's power to discriminate is to compare those who have done very well with those who have done very poorly, known as the extreme group method. First, identify the students who scored in the top one-third as well as those in the bottom one-third of the class. Next, calculate the proportion of each group that answered a particular test item correctly (i.e., percentage passing for the high and low groups on each item). Finally, subtract the p of the bottom performing group from the p for the top performing group to yield an item discrimination index (D). Item discriminations of D = .50 or higher are considered excellent. D = 0 means the item has no discrimination ability, while D = 1.00 means the item has perfect discrimination ability.
In Figure 1, it can be seen that Item 1 discriminates well, with those in the top performing group obtaining the correct response far more often (p = .92) than those in the low performing group (p = .40), resulting in an index of .52 (i.e., .92 - .40 = .52). Next, Item 2 discriminates poorly, with an index of only .04, meaning this particular item was not useful in discriminating between the high and low scoring individuals. Finally, Item 3 is in need of revision or discarding, as it discriminates negatively, meaning low performing group members actually obtained the correct keyed answer more often than high performing group members.
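The arithmetic behind these indexes is simple subtraction. A sketch of the extreme group method (Item 1's p values come from the discussion of Figure 1; the other p values are hypothetical stand-ins that yield indexes of the same character):

```python
def discrimination_index(p_top, p_bottom):
    """Extreme group discrimination index D = p(top third) - p(bottom third)."""
    return p_top - p_bottom

print(round(discrimination_index(0.92, 0.40), 2))  # 0.52  -- discriminates well
print(round(discrimination_index(0.52, 0.48), 2))  # 0.04  -- little discrimination
print(round(discrimination_index(0.40, 0.55), 2))  # -0.15 -- negative; revise or discard
```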
Another way to determine the discriminability of an item is to determine the correlation coefficient between performance on an item and performance on a test, or the tendency of students selecting the correct answer to have high overall scores. This coefficient is reported as the item discrimination coefficient, or the point-biserial correlation between item score (usually scored right or wrong) and total test score. This coefficient should be positive, indicating that students answering correctly tend to have higher overall scores or that students answering incorrectly tend to have lower overall scores. Also, the higher the magnitude, the better the item discriminates. The point-biserial correlation can be computed with procedures outlined in Figure 2.
In Figure 2, the point-biserial correlation between item score and total score is evaluated similarly to the extreme group discrimination index. If the resulting value is negative or low, the item should be revised or discarded. The closer the value is to 1.0, the stronger the item's discrimination power; the closer the value is to 0, the weaker the power. Items that are very easy and answered correctly by the majority of respondents will have poor point-biserial correlations.
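The point-biserial coefficient is equivalent to a Pearson correlation in which one variable is dichotomous, so it can be computed directly from the standard formula. A sketch using only the Python standard library (the item and total scores below are hypothetical):

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between item score (1 = correct, 0 = wrong)
    and total test score: r_pb = (M1 - M0) / s * sqrt(p * q), where M1 and M0
    are the mean totals of those passing and failing the item, s is the
    population SD of the totals, p is the item difficulty, and q = 1 - p."""
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    wrong = [t for s, t in zip(item_scores, total_scores) if s == 0]
    p = len(correct) / len(total_scores)
    sd = statistics.pstdev(total_scores)
    return (statistics.mean(correct) - statistics.mean(wrong)) / sd * (p * (1 - p)) ** 0.5

# Hypothetical item and total scores for ten students
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [88, 92, 75, 60, 81, 66, 58, 90, 70, 62]
print(round(point_biserial(item, totals), 2))  # positive: higher scorers tend to pass
```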
A third parameter used to conduct item analysis is known as the item characteristic curve (ICC). This is a graphical depiction of the characteristics of a particular item; taken collectively, such curves can represent the entire test. In the item characteristic curve, the total test score is represented on the horizontal axis and the proportion of test takers passing the item within that range of test scores is scaled along the vertical axis.
For Figure 3, three separate item characteristic curves are shown. Line A is considered a flat curve and indicates that test takers at all score levels were equally likely to get the item correct. This item was therefore not a useful discriminating item. Line B demonstrates a troublesome item, as it gradually rises and then drops for those scoring highest on the overall test. Though this is unusual, it can sometimes result from those who studied most having ruled out the answer that was keyed as correct. Finally, Line C shows the item characteristic curve for a good test item. The gradual and consistent positive slope shows that the proportion of people passing the item gradually increases as test scores increase. Though it is not depicted here, if an ICC were seen in the shape of a backward S, negative item discrimination would be evident, meaning that those who scored lowest were most likely to endorse the correct response on the item.
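An empirical version of such a curve can be tabulated by grouping test takers into total-score bands and computing the proportion passing the item in each band; a rising set of proportions corresponds to a curve like Line C. A minimal sketch (the band width and all scores below are hypothetical):

```python
from collections import defaultdict

def icc_points(item_scores, total_scores, band_width=10):
    """Empirical item characteristic curve: proportion passing the item
    within each band of total test scores (keyed by the band's lower bound)."""
    bands = defaultdict(list)
    for s, t in zip(item_scores, total_scores):
        bands[(t // band_width) * band_width].append(s)
    return {b: sum(v) / len(v) for b, v in sorted(bands.items())}

# Hypothetical item scores (1/0) and total scores for eight students
item = [0, 0, 1, 0, 1, 1, 1, 1]
totals = [45, 52, 55, 63, 68, 74, 78, 85]
print(icc_points(item, totals))  # {40: 0.0, 50: 0.5, 60: 0.5, 70: 1.0, 80: 1.0}
```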
See also: Item Response Theory
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Brown, F. (1983). Principles of education and psychological testing (3rd ed.). New York: Holt, Rinehart, & Winston.
DeVellis, R. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage.
Gronlund, N. (1993). How to make achievement tests and assessments (5th ed.). Boston: Allyn and Bacon.
Kaplan, R., & Saccuzzo, D. (2004). Psychological testing: Principles, applications, and issues (6th ed.). Pacific Grove, CA: Brooks/Cole.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4(10), retrieved April 1, 2008, from http://pareonline.net/getvn.asp?v=4&n=10.
Patten, M. (2001). Questionnaire research: A practical guide (2nd ed.). Los Angeles: Pyrczak.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.