Reliability is the consistency of a measure. In educational testing, reliability refers to the confidence that the test score will be the same across repeated administrations of the test. There is a close relation between the construct of reliability and the construct of validity. Many sources note that a test can have reliability without validity but cannot have validity without reliability. These statements are true in theory but not in any practical sense. A test is designed to be both reliable and valid, that is, both consistent and accurate. Practical conceptualizations of reliability cannot be discussed separately from examples with validity.
Reliability without validity would be similar to an archer consistently hitting the target in the same place but missing the bull's eye by a foot. The archer's aim is reliable because it is predictable, but it is not accurate. The archer never hits what he or she is expected to hit. In this analogy, validity without reliability would be the arrows hitting the target in a haphazard manner but close to the bull's eye and centering around it. This second example shows that validity is evidence that the archer is aiming at the right place. However, it also demonstrates that, even though the reliability is low, there is still some degree of reliability. That is, at least the arrows are consistently hitting the target. In addition, if the arrows are centered around the bull's eye, the error of one shot landing too far to the right is balanced by another shot landing too far to the left. Looking at the unpainted backside of the target's canvas, someone would be able to identify where the bull's eye was by averaging the positions of all the shots.
Reliability of a test is an important selling point for publishers of standardized tests, especially in high-stakes testing. If an institute asserts that its instrument can identify children who qualify for a special education program, the users of that test would hope that it has high reliability. Otherwise, some children requiring special education may not be identified, whereas others who do not require it may be unnecessarily assigned to the special education program.
Even in situations perceived as low-stakes testing, such as classroom testing, reliability and validity are serious concerns. Classroom teachers are concerned that the tests they administer are truly reflective of their students' abilities. If a teacher administered a test that was reliable but not valid, it would not have much practical use. An example would be a teacher in a grade-school history class administering, as a midterm exam, a standardized test from a reputable publisher. If that exam was suggested by the test developer as the final exam, the results would most likely be reliable but not valid. The test results would be reliable because they would most likely reflect the students' rank order in class performance. However, the test would not be valid as most students would not be ready for half of the material being tested. If the grade-school history teacher administered as a midterm exam a standardized test recommended by the test developer as a midterm exam for the appropriate grade level, the test could be considered valid. However, if the students (for some strange reason) did not receive uniform instruction in grade-appropriate history, the test would most likely not be reliable.
From these examples, it is clear that it is easier to increase the validity of a reliable measure than to increase the reliability of an otherwise valid measure. The reliable archer could be trained, little by little, to move the aim in the direction of the bull's eye. Alternatively, the target could be moved over one foot, so that the bull's eye is at the spot on the target that the archer usually hits. The teacher could take the time necessary (half a school year) to teach the students what they need to know to pass the valid final exam. (This is similar to training the archer to shoot in the right direction.) Another solution for the classroom situation would be for the teacher to adapt the test items in the exam, so that it is more appropriate as a midterm exam than as a final exam. (This is similar to moving the target so that the archer's aim hits the bull's eye.)
In the test publishing world, the reliability of a draft test instrument is often quickly established. However, after the test is used many times, its validity might be questioned. A test designed as a verbal reasoning test may rely heavily on the test-taker's knowledge of music, art, and history. Because the test is reliable, the publishers might redefine what construct it is measuring; in this case, it is a better measure of the students' knowledge of the humanities than of verbal reasoning. The publisher would recall all copies of the Verbal Reasoning test; then, with little change to the test, the publisher could offer it again as the Humanities Achievement test.
Test reliability is explained through the true score theory and the theory of reliability. True score theory states that the observed score on a test is the result of a true score plus some error in measurement. The theory of reliability compares the reliability of a test of human characteristics with the reliability of measuring instruments in the physical sciences.
True score is the exact measure of the test taker's true ability in the area being tested. With a perfect test, the observed score would be equal to the true score. However, there is no perfect test. As one example of where the error may occur, the wording of test items may not be detailed enough for some test takers yet be too detailed for others. The examples of test errors are innumerable. According to true score theory, no one can know what the reliability of the test is unless one knows how much random error exists in the test. One cannot know how much error exists in the test unless one knows what the true score is. As a theoretical concept, the true score cannot be known. Therefore, what the reliability is can never be known with certainty. However, one can still estimate what the reliability is through repeated measures. As the error is assumed to be random, it should be balanced out over many administrations of the same test. If the test-taker's ability measured by the test is unchanging, when the error inflates the observed score one time it can be expected to deflate the observed score to the same degree at another time.
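The balancing-out of random error over many administrations can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the error is drawn from a normal distribution with mean zero, the test taker's ability does not change, and the function names and the numbers (a true score of 100, an error standard deviation of 5) are hypothetical.

```python
import random
from statistics import mean


def administer(true_score, error_sd, rng):
    """Return one observed score: the true score plus random error."""
    # Assumption: the error is normally distributed around zero.
    return true_score + rng.gauss(0.0, error_sd)


def average_observed(true_score, error_sd, n_administrations, seed=0):
    """Average the observed scores over repeated administrations."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = [administer(true_score, error_sd, rng)
              for _ in range(n_administrations)]
    return mean(scores)


# Hypothetical test taker: true score 100, error standard deviation 5.
for n in (1, 10, 1000):
    print(n, round(average_observed(100.0, 5.0, n), 2))
```

As the number of administrations grows, the average observed score should drift toward the true score, because the inflating and deflating errors tend to cancel.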
According to the theory of reliability, when a test is administered to a group of individuals, the observed variance in the distribution of scores should be due only to the true variance in the ability levels of the test takers. The degree that the two variances match is the reliability evaluation.
Another principle of the theory of reliability is that the reliability of a test is the ratio of the true score to the observed score. This relates to the true score theory because if the true score equals the observed score, then the error term must equal zero and the ratio of the true score to the observed score will be a perfect 1. Any deviation from this perfect ratio is caused by the size of the error term (whether positive or negative). Therefore, expressing reliability as the ratio of the true score to the observed score is in agreement with true score theory.
Reliability can also be expressed as the ratio of the variance of the true score to the variance of the observed score. Still, as noted above, one cannot know what the true score is, so one cannot know what the variance of the true score is. One way to estimate the variance of the true score is to calculate the covariance of two observed scores from the same test with the same test takers with unchanging ability.
The covariance of two measures is the variance that the two measures share. It is the numerator in the calculation of the correlation of the two measures. The denominator of the correlation is the standard deviation of one measure times the standard deviation of the second. Because the measures are assumed to be the same, their standard deviations should be the same. Therefore, the product of the two standard deviations should equal the square of the standard deviation of the one measure. Statistically, this is the same as the variance of the observed score. The conclusion is that the estimation of reliability is the same as the correlation of two distributions of matching scores from the same test.
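The argument in this paragraph can be written out as a short derivation. The symbols are introduced here only for illustration and do not appear in the sources: X_1 and X_2 are the observed scores from two administrations, T is the true score, and E_1 and E_2 are the random errors, assumed to be uncorrelated with T and with each other. The two administrations are assumed to have the same standard deviation, written \sigma_X.

```latex
% Notation is an assumption made for this sketch.
\begin{align*}
X_1 &= T + E_1, \qquad X_2 = T + E_2 \\
\operatorname{Cov}(X_1, X_2)
    &= \operatorname{Var}(T) + \operatorname{Cov}(T, E_2)
       + \operatorname{Cov}(E_1, T) + \operatorname{Cov}(E_1, E_2)
     = \operatorname{Var}(T) \\
r_{X_1 X_2}
    &= \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}}
     = \frac{\operatorname{Var}(T)}{\sigma_X^{2}}
     = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
\end{align*}
```

The three error covariances vanish under the stated assumptions, which is why the covariance of the two observed scores can stand in for the variance of the true score.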
In summary, according to reliability theory, reliability is equal to the ratio of the variance of the true score to the variance of the observed score. Calculating the ratio of the estimated variance of the true score to the variance of the observed score is the same as calculating the correlation between two observed scores. Therefore, the correlation of two repeated measures of the same test is accepted as an appropriate estimate of the reliability of the test.
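The equivalence summarized here can be checked numerically. The following is a minimal sketch rather than a definitive procedure: it simulates true scores (which can be known only in a simulation), adds independent normal errors for two administrations, and compares the variance ratio with the correlation of the two sets of observed scores. All names and parameter values are hypothetical.

```python
import random
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate(n_takers=5000, true_sd=10.0, error_sd=5.0, seed=1):
    """Return (variance ratio, correlation of two administrations)."""
    rng = random.Random(seed)
    # Assumption: true scores and errors are normally distributed.
    true = [rng.gauss(50.0, true_sd) for _ in range(n_takers)]
    first = [t + rng.gauss(0.0, error_sd) for t in true]
    second = [t + rng.gauss(0.0, error_sd) for t in true]
    variance_ratio = pvariance(true) / pvariance(first)
    return variance_ratio, pearson(first, second)


ratio, correlation = simulate()
print(round(ratio, 3), round(correlation, 3))
```

With these assumed values the theoretical reliability is 100 / (100 + 25) = 0.8, and both printed figures should fall close to it.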
Inter-rater (or inter-observer) reliability is an important consideration in the social sciences because there are many conditions for which the best means of measurement is the report of trained observers. Some classes such as gymnastics can only be assessed through the ratings of expert judges. As another example, external observers may be brought into a classroom to assess a student's inappropriate behavior. The observations of only one observer can be challenged from so many points of view. A lone observer may have some personal expectancies of what is supposed to occur. The lone observer may get tired and bored, so that earlier observations are more precise than later ones. It is less likely that the reports of two or more observers would be challenged. Particularly, the acceptability of the reports of two or more observers increases when their observations are similar. The measure of the similarities of the observations coming from two or more sources is the inter-rater reliability.
One method to establish inter-rater reliability is to calculate the proportion of agreement between or among the observers. This is appropriate if the ratings or observations are in mutually exclusive categories. The two observers recording the behavior of the student with the inappropriate behavior would do well to have a common checklist of the likely behaviors. If they agree on the occurrence of 16 out of 20 behaviors, their inter-rater reliability would be 80 percent.
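The proportion-of-agreement calculation can be sketched as follows. The checklist data are hypothetical and are constructed so that the two observers agree on 16 of 20 behaviors, as in the example above.

```python
def proportion_agreement(rater_a, rater_b):
    """Share of observations on which two raters recorded the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same, non-zero number of ratings")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical checklist of 20 behaviors: True means "behavior occurred".
observer_1 = [True] * 10 + [False] * 10
observer_2 = [True] * 8 + [False] * 2 + [False] * 8 + [True] * 2

print(proportion_agreement(observer_1, observer_2))  # 0.8, i.e. 80 percent
```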
Another method is to calculate the correlation between the ratings of the two or more observers. This is possible if the ratings or observations are two or more sets of interval numbers. The gymnastic judges would have different ratings. Some may be consistently rating high while others consistently rating low. However, there should be some general agreement on the ranking of the different performers. The strength of this agreement would be reflected in the correlation of their ratings.
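A minimal sketch of the correlational approach follows. The ratings are invented for illustration: one judge rates consistently higher than the other, yet the two largely agree on how the performers compare, so the correlation remains high.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores for five gymnasts from a lenient and a strict judge.
lenient_judge = [9.5, 9.0, 8.5, 9.8, 8.0]
strict_judge = [8.9, 8.6, 7.9, 9.1, 7.5]

print(round(pearson(lenient_judge, strict_judge), 2))
```

Because the correlation ignores a constant difference in level, a judge who is uniformly half a point stricter than another would still correlate perfectly with that judge.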
Inter-rater reliability is increased if the observers have appropriate training. The training should focus on what exactly is meant to be observed. The raters need to be given a clear description of the event to be observed. The classroom observers would need to know what is and is not appropriate behavior. The raters also need concrete examples of what constitutes an occurrence or what constitutes achievement at each criterion level. The gymnastic judges need to know the standards for each element of the gymnastic routine. Training is best when it includes much practice with feedback.
Test-retest reliability is appropriate for tests that measure a construct that is not likely to change. The construct that intelligence tests measure is not expected to change. Another well-known test with an expectedly unchanging construct is the Scholastic Aptitude Test (SAT). Although a test taker is allowed to take the SAT up to three times, the developers claim that the score on repeated administrations will not change. The construct that the SAT is measuring is the predicted adaptability to college. By the time students take the SAT, they are as prepared for college as they are going to be.
Test-retest reliability is described as the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group is administered a pretest and posttest with treatment in between and the control group only receives the pretest and the posttest. Any analysis of the difference noted in the results of the posttest (compared to the pretest) of the treatment group is confounded unless there is a strong reliability between the pretest and posttest of the control group.
Parallel-forms reliability evaluates the consistency of the results of two tests constructed in the same manner from the same content domain. For every item on the test, a similar item is developed with the same difficulty level. The items from each pair are then randomly assigned to one form of the test or the other. The resulting two tests are the same in content and difficulty but not expression. The reliability is described as the correlation of the two distributions of scores. This type of reliability is important in the development of standardized tests.
Split-half reliability is similar to parallel-forms reliability except that the two forms are both incorporated into one test. After the test is administered, the scores are divided into the two forms and the correlation between the two distributions of scores is calculated. Like parallel-forms reliability, it is important in the development of standardized tests. However, it could have classroom applications if the classroom teacher were willing to make the effort to develop a test with twice as many items as an ordinary test. In the classroom, split-half reliability could detect the effect of students' guessing on the test.
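The split-half calculation can be sketched as follows. The way the items are divided is an assumption made here: odd-numbered items form one half and even-numbered items the other. The item scores (1 for right, 0 for wrong) are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def split_half_correlation(item_scores):
    """Correlate each test taker's score on the two halves of one test."""
    # Assumption: items alternate between the two forms (odd/even split).
    half_a = [sum(row[0::2]) for row in item_scores]
    half_b = [sum(row[1::2]) for row in item_scores]
    return pearson(half_a, half_b)


# Hypothetical results: one row per student, one column per item.
results = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]

print(round(split_half_correlation(results), 2))
```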
Inter-item reliability is another means of evaluating the reliability of one administration of one test. Most tests are made up of items that are related to one another because they are measuring similar concepts. Because the items are similar in design, there should be a measurable correlation between the items in any pair of items. The evaluation of inter-item reliability begins with predicting all correlations between all pairs of items. The inter-item reliability is expressed as the proportion of correct predictions. A classroom teacher might want to use inter-item reliability to identify the items that were not related to any other items or to identify the effects of students' guessing.
Cronbach's Alpha and the Kuder-Richardson methods are systems of reporting internal consistency of a test. The essential results of the internal consistency methods are comparable to the average of all correlations between all pairs of items. These methods can estimate reliability using the results of only one administration of the test. The main difference between the two approaches is how the items are scored. Cronbach's Alpha can be used on items with a range of responses such as a Likert scale. The Kuder-Richardson methods require that all items be scored dichotomously right or wrong (Borg and Gall, 1983).
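The two approaches can be compared in a short sketch. The formulas used are the standard textbook forms of Cronbach's Alpha and Kuder-Richardson Formula 20 (KR-20), which the text above does not spell out. The use of population variances throughout is a choice made for this sketch, and the data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(item_scores[0])  # number of items; must be at least 2
    item_variance_sum = sum(pvariance(col) for col in zip(*item_scores))
    total_variance = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def kr20(item_scores):
    """KR-20 for items scored 1 (right) or 0 (wrong)."""
    n = len(item_scores)
    k = len(item_scores[0])
    pq_sum = 0.0
    for col in zip(*item_scores):
        p = sum(col) / n          # proportion answering the item correctly
        pq_sum += p * (1 - p)
    total_variance = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - pq_sum / total_variance)


# Hypothetical right/wrong data: one row per student, one column per item.
right_wrong = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]
# Hypothetical Likert responses (1 to 5); only alpha applies here.
likert = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1]]

print(round(cronbach_alpha(right_wrong), 3), round(kr20(right_wrong), 3))
print(round(cronbach_alpha(likert), 3))
```

On right/wrong data the two functions should return the same value, since the variance of a 0/1 item equals the proportion correct times the proportion incorrect.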
The general approach to increasing the reliability of a measure is to increase the true variance while reducing the error variance. Three recommended procedures to accomplish this are: 1) decrease the ambiguity of the test items; 2) increase the number of items per objective; and 3) provide clear test-taking instructions (Kerlinger, 1986).
If an item is ambiguous, it can be interpreted in more than one way. Two test takers of equal ability could conceivably interpret an ambiguous item in two different ways, one getting it right and the other getting it wrong. Their scores would differ based on their interpretation of the item and not on any difference in true ability.
Where there is error in a test item, it will have less effect if that item is one among many for the same objective than if that item is one among few. A test taker whose ability is mismeasured by a faulty item will need to balance the effect of that item with the effect of the items that are measuring more accurately.
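A small piece of arithmetic illustrates the point. The sketch assumes equally weighted items and a percent-correct score for the objective; the function name is hypothetical.

```python
def score_shift_from_one_item(n_items):
    """Percentage points by which one mis-scored item moves a percent-correct score."""
    if n_items < 1:
        raise ValueError("an objective needs at least one item")
    # Assumption: every item carries the same weight.
    return 100.0 / n_items


print(score_shift_from_one_item(5))   # 20.0 points with few items
print(score_shift_from_one_item(20))  # 5.0 points with many items
```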
Clear test-taking instructions help test takers to interpret the test items correctly and to indicate their chosen answers properly. Test instructions might remind the test takers of the types of items that require special attention. In addition, if there is a special procedure for answering, such as using an answer sheet, test instructions can remind test takers how to respond correctly.
Borg, W. R., & Gall, M. D. (1983). Educational research: An introduction (4th ed.). White Plains, NY: Longman.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Holt, Rinehart and Winston.
Shaughnessy, J. J., Zechmeister, E. B., & Zechmeister, J. S. (2006). Research methods in psychology (7th ed.). New York: McGraw-Hill.
Rudner, L. M., & Schafer, W. D. (2001). Reliability: ERIC digest. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. ERIC Identifier: ED458213.