There are many ways of reporting test performance, and a variety of scores can be used to interpret students' results.

Raw Scores

The raw score is the number of items a student answers correctly without adjustment for guessing. For example, if there are 15 problems on an arithmetic test, and a student answers 11 correctly, then the raw score is 11. Raw scores, however, do not provide us with enough information to describe student performance.

Percentage Scores

A percentage score is the percent of test items answered correctly. These scores can be useful when describing a student's performance on a teacher-made test or on a criterion-referenced test. However, percentage scores have a major disadvantage: We have no way of comparing the percentage correct on one test with the percentage correct on another test. Suppose a child earned a score of 85 percent correct on one test and 55 percent correct on another test. The interpretation of the score is related to the difficulty level of the test items on each test. Because each test has a different or unique level of difficulty, we have no common way to interpret these scores; there is no frame of reference.

To interpret raw scores and percentage-correct scores, it is necessary to change the raw or percentage score to a different type of score in order to make comparisons. Evaluators rarely use raw scores and percentage-correct scores when interpreting performance because it is difficult to compare one student's scores on several tests or the performance of several students on several tests.

Derived Scores

Derived scores are a family of scores that allow us to make comparisons between test scores. Raw scores are transformed to derived scores. Developmental scores and scores of relative standing are two types of derived scores. Scores of relative standing include percentiles, standard scores, and stanines.

Developmental Scores

Sometimes called age and grade equivalents, developmental scores are scores that have been transformed from raw scores and reflect the average performance at age and grade levels. Thus, the student's raw score (number of items correct) is the same as the average raw score for students of a specific age or grade. Age equivalents are written with a hyphen between years and months (e.g., 12–4 means that the age equivalent is 12 years, 4 months old). A decimal point is used between the grade and month in grade equivalents (e.g., 1.2 is the first grade, second month).

Developmental scores can be useful (McLean, Bailey, & Wolery, 1996; Sattler, 2001): parents and professionals find them easy to interpret, and they place a student's performance within a context. Yet because these scores are so easily misinterpreted, parents and professionals should approach them with extreme caution. There are a number of reasons for criticizing these scores.

For a student who is 6 years old and in the first grade, grade and age equivalents presume that for each month of first grade an equal amount of learning occurs. But, from our knowledge of child growth and development and theories about learning, we know that neither growth nor learning occurs in equal monthly intervals. Age and grade equivalents do not take into consideration the variation in individual growth and learning.

Teachers should not expect that students will gain a grade equivalent or age equivalent of one year for each year that they are in school. For example, suppose a child earned a grade equivalent of 1.5, first grade, fifth month, at the end of first grade. To assume that at the end of second grade the child should obtain a grade equivalent of 2.5, second grade, fifth month, is not good practice. This assumption is incorrect for two reasons: (1) The grade and age equivalent norms should not be confused with performance standards, and (2) a gain of 1.0 grade equivalent is representative only of students who are in the average range for their grade. Students who are above average will gain more than 1.0 grade equivalent a year, and students who are below average will progress less than 1.0 grade equivalent a year (Gronlund & Linn, 1990).

A second criticism of developmental scores is the underlying idea that because two students obtain the same score on a test they are comparable and will display the same thinking, behavior, and skill patterns. For example, a student who is in second grade earned a grade equivalent score of 4.6 on a test of reading achievement. This does not mean that the second grader understands the reading process as it is taught in the fourth grade. Rather, this student just performed at a superior level for a student who is in second grade. It is incorrect to compare the second grader to a child who is in fourth grade; the comparison should be made to other students who are in second grade (Sattler, 2001).

A third criticism of developmental scores is that age and grade equivalents encourage the use of false standards. A second-grade teacher should not expect all students in the class to perform at the second-grade level on a reading test. Differences between students within a grade mean that the range of achievement actually spans several grades. In addition, developmental scores are calculated so that half of the scores fall below the median and half fall above the median. Age and grade equivalents are not standards of performance.

A fourth criticism of age and grade equivalents is that they promote typological thinking. The use of age and grade equivalents causes us to think in terms of a typical kindergartener or a typical 10-year-old. In reality, students vary in their abilities and levels of performance. Developmental scores do not take these variations into account.

A fifth criticism is that most developmental scores are interpolated and extrapolated. A normed test includes students of specific ages and grades—not all ages and grades—in the norming sample. Interpolation is the process of estimating the scores of students within the ages and grades of the norming sample. Extrapolation is the process of estimating the performance of students outside the ages and grades of the normative sample.

Developmental Quotient

A developmental quotient is an estimate of the rate of development. If we know a student's developmental age and chronological age, it is possible to calculate a developmental quotient. For example, suppose a student's developmental age is 12 years (12 years × 12 months per year = 144 months) and the chronological age is also 12 years, or 144 months. Using the following formula, we arrive at a developmental quotient of 100.

Developmental quotient = (Developmental age / Chronological age) × 100

(144 months / 144 months) × 100 = 100

But, suppose another student's chronological age is also 144 months and that the developmental age is 108 months. Using the formula, this student would have a developmental quotient of 75.

(Developmental age 108 months / Chronological age 144 months) × 100 = 75

(108 / 144) × 100 = 75
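
The calculation above can be sketched as a small Python helper (the function name is ours, not a published scoring routine):

```python
def developmental_quotient(developmental_age_months, chronological_age_months):
    """Rate-of-development estimate: (developmental age / chronological age) x 100."""
    return round(developmental_age_months / chronological_age_months * 100)

# First student: developmental age and chronological age both 144 months
print(developmental_quotient(144, 144))  # 100

# Second student: developmental age 108 months, chronological age 144 months
print(developmental_quotient(108, 144))  # 75
```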

Developmental quotients have all of the drawbacks associated with age and grade equivalents. In addition, they may be misleading because developmental age may not keep pace with chronological age as the individual gets older. Consequently, the gap between developmental age and chronological age becomes larger as the student gets older.

Scores of Relative Standing

Percentile Ranks  A percentile rank is the point in a distribution at or below which the scores of a given percentage of students fall. Percentiles provide information about the relative standing of students when compared with the standardization sample. Look at the following test scores and their corresponding percentile ranks.

Student Score Percentile Rank
Delia 96 84
Jana 93 81
Pete 90 79
Marcus 86 75

Jana's score of 93 has a percentile rank of 81. This means that 81 percent of the students who took the test scored 93 or lower. Said another way, Jana scored as well as or better than 81 percent of the students who took the test.
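
The "scored as well as or better than" idea can be sketched in Python. The norm group below is made up for illustration; it is not the standardization sample behind the table above, so it will not reproduce those percentile ranks:

```python
def percentile_rank(score, norm_group):
    """Percent of scores in the norm group at or below the given score."""
    at_or_below = sum(1 for s in norm_group if s <= score)
    return 100 * at_or_below / len(norm_group)

# Hypothetical norm group of ten scores
scores = [70, 75, 80, 85, 88, 90, 93, 95, 96, 100]
print(percentile_rank(90, scores))  # 60.0 -- 6 of 10 scores are 90 or lower
```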

A percentile rank of 50 represents average performance. In a normal distribution, both the mean and the median fall at the 50th percentile. Half the students fall above the 50th percentile and half fall below. Percentiles can be divided into quartiles. A quartile contains 25 percentiles or 25 percent of the scores in a distribution. The 25th and the 75th percentiles are the first and the third quartiles. In addition, percentiles can be divided into groups of 10 known as deciles. A decile contains 10 percentiles. Beginning at the bottom of a group of students, the first 10 percent are known as the first decile, the second 10 percent are known as the second decile, and so on.
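
The quartile and decile groupings can be expressed as a simple banding of the percentile rank. A sketch assuming percentile ranks of 1 through 100 (boundary conventions vary across test publishers):

```python
import math

def quartile(percentile_rank):
    """Quartile containing a percentile rank: 1-25 -> 1st, 26-50 -> 2nd, and so on."""
    return math.ceil(percentile_rank / 25)

def decile(percentile_rank):
    """Decile containing a percentile rank: each decile spans 10 percentiles."""
    return math.ceil(percentile_rank / 10)

print(quartile(81))  # 4 -- a percentile rank of 81 falls in the fourth quartile
print(decile(81))    # 9 -- and in the ninth decile
```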

The position of percentiles in a normal curve is shown in Figure 4.5. Despite their ease of interpretation, percentiles have several problems. First, the intervals they represent are unequal, especially at the lower and upper ends of the distribution. A difference of a few percentile points at the extreme ends of the distribution is more serious than a difference of a few points in the middle of the distribution. Second, because percentiles are not equal-interval scores, they cannot be used in mathematical calculations such as averaging (Gronlund & Linn, 1990). Last, percentile scores are reported in one-hundredths; but, because of errors associated with measurement, they are only accurate to the nearest 0.06 (six one-hundredths) (Rudner, Conoley, & Plake, 1989). These limitations require caution when interpreting percentile ranks. Confidence intervals, which are discussed later in this chapter, are useful when interpreting percentile scores.

Standard Scores   Another type of derived score is a standard score. Standard score is the name given to a group or category of scores; each specific type of standard score within this group has the same mean and the same standard deviation. Because of this, standard scores are an excellent way of representing a child's performance: they allow us to compare a child's performance on several tests and to compare one child's performance with the performance of other students. Unlike percentile scores, standard scores are equal-interval scores and can be used in mathematical operations; for instance, they can be averaged. In the Snapshot, teachers Lincoln Bates and Sari Andrews discuss test scores. The different types of standard scores, some of which we discuss in the following subsections, are:

  1. z-scores: have a mean of 0 and a standard deviation of 1.
  2. T-scores: have a mean of 50 and a standard deviation of 10.
  3. Deviation IQ scores: have a mean of 100 and a standard deviation of 15 or 16.
  4. Normal curve equivalents: have a mean of 50 and a standard deviation of 21.06.
  5. Stanines: standard score bands divide a distribution of scores into nine parts.
  6. Percentile ranks: point in a distribution at or below which the scores of a given percentage of students fall.

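Because each of these scales is a linear rescaling of the z-score, converting between them is a one-line formula. A minimal sketch (the helper name is ours):

```python
def z_to_standard(z, mean, sd):
    """Rescale a z-score onto a standard-score scale with the given mean and SD."""
    return mean + z * sd

z = 1.0  # one standard deviation above the mean
print(z_to_standard(z, 50, 10))     # T-score: 60.0
print(z_to_standard(z, 100, 15))    # deviation IQ: 115.0
print(z_to_standard(z, 50, 21.06))  # normal curve equivalent
```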
Deviation IQ Scores   Deviation IQ scores are frequently used to report the performance of students on norm-referenced standardized tests. The deviation scores of the Wechsler Intelligence Scale for Children–III and the Wechsler Individual Achievement Test–II have a mean of 100 and a standard deviation of 15, while the Stanford-Binet Intelligence Scale–IV has a mean of 100 and a standard deviation of 16. Many test manuals provide tables that allow conversion of raw scores to deviation IQ scores.

Normal Curve Equivalents  Normal curve equivalents (NCEs) are a type of standard score with a mean of 50 and a standard deviation of 21.06. When the baseline of the normal curve is divided into 99 equal units, the percentile ranks of 1, 50, and 99 are the same as NCE units (Lyman, 1986). One test that does report NCEs is the Developmental Inventory-2. However, NCEs are not reported for some tests.

Stanines  Stanines are bands of standard scores that have a mean of 5 and a standard deviation of 2. Stanines range from 1 to 9. Despite their relative ease of interpretation, stanines have several disadvantages. A change in just a few raw score points can move a student from one stanine to another. Also, because stanines are a general way of interpreting test performance, caution is necessary when making classification and placement decisions. As an aid in interpreting stanines, evaluators can assign descriptors to each of the 9 values:

9—very superior

8—superior

7—very good

6—good

5—average

4—below average

3—considerably below average

2—poor

1—very poor
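
A sketch of how an evaluator might attach these descriptors, plus an approximate stanine from a z-score. The half-SD banding in `stanine_from_z` is a common approximation, assumed here rather than taken from any particular test manual:

```python
DESCRIPTORS = {
    9: "very superior", 8: "superior", 7: "very good",
    6: "good", 5: "average", 4: "below average",
    3: "considerably below average", 2: "poor", 1: "very poor",
}

def stanine_from_z(z):
    """Approximate stanine: mean 5, SD 2, bands half an SD wide, clipped to 1-9."""
    return max(1, min(9, round(2 * z + 5)))

s = stanine_from_z(1.1)          # a score about 1.1 SDs above the mean
print(s, DESCRIPTORS[s])         # 7 very good
```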

Basal and Ceiling Levels

Because test authors construct them for students of differing abilities, many tests contain more items than any one student needs to attempt. To determine the starting and stopping points for administering a test, test authors designate basal and ceiling levels. (Although these are not really types of scores, basal and ceiling levels are sometimes called rules or scores.) The basal level is the point below which the examiner assumes that the student would obtain all correct responses; it is, therefore, the point at which the examiner begins testing.

The test manual will designate the point at which testing should begin. For example, a test manual states, "Students who are 13 years old should begin with item 12. Continue testing when three items in a row have been answered correctly. If three items in a row are not answered correctly, the examiner should drop back a level." This is the basal level.

Let's look at the example of a student who is 9 years old. Although the examiner begins testing at the 9-year-old level, the student fails to answer three items in a row correctly, so the examiner is unable to establish a basal level at the suggested beginning point. Many manuals instruct the examiner to continue testing backward, dropping back one item at a time, until the student correctly answers three items. Some test manuals instead instruct examiners to drop back an entire level, for instance to age 8, and begin testing there.

When computing the student's raw score, the examiner counts the items below the basal point as items answered correctly. Thus, the raw score includes all the items the student actually answered correctly plus the test items below the basal point.

The ceiling level is the point above which the examiner assumes that the student would obtain all incorrect responses if testing were to continue; it is, therefore, the point at which the examiner stops testing. "To determine a ceiling," a manual may read, "discontinue testing when three items in a row have been missed."
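
This scoring rule can be sketched in Python. The function name is ours, and the three-in-a-row ceiling rule follows the example manual wording above; real tests specify their own rules:

```python
def raw_score(basal_item, responses):
    """Raw score under a basal/ceiling rule: every item below the basal point
    counts as correct, plus items actually answered correctly before the
    ceiling (three consecutive misses)."""
    miss_run = 0
    correct = 0
    for answered_correctly in responses:
        if answered_correctly:
            correct += 1
            miss_run = 0
        else:
            miss_run += 1
            if miss_run == 3:   # ceiling reached: stop scoring
                break
    return (basal_item - 1) + correct   # items below the basal count as correct

# Testing began at item 12; the student's answers from that point on:
answers = [True, True, True, True, False, True, False, False, False]
print(raw_score(12, answers))  # 16 -- 11 items below the basal + 5 correct
```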

A false ceiling can be reached if the examiner does not carefully follow directions for determining the ceiling level. Some tests require students to complete a page of test items to establish the ceiling level.