A criterion-referenced test is a test that provides a basis for determining a candidate's level of knowledge and skills in relation to a well-defined domain of content. Often one or more performance standards are set on the test score scale to aid in test score interpretation. Criterion-referenced tests, a type of test introduced by Glaser (1963) and Popham and Husek (1969), are also known as domain-referenced tests, competency tests, basic skills tests, mastery tests, performance tests or assessments, authentic assessments, objective-referenced tests, standards-based tests, credentialing exams, and more. What all of these tests have in common is that they attempt to determine a candidate's level of performance in relation to a well-defined domain of content. This can be contrasted with norm-referenced tests, which determine a candidate's level of the construct measured by a test in relation to a well-defined reference group of candidates, referred to as the norm group. So it might be said that criterion-referenced tests permit a candidate's score to be interpreted in relation to a domain of content, and norm-referenced tests permit a candidate's score to be interpreted in relation to a group of examinees. The first interpretation is content-centered, and the second interpretation is examinee-centered.
Because these two types of tests have fundamentally different purposes, it is not surprising that they are constructed differently and evaluated differently. Criterion-referenced tests place a primary focus on the content and what is being measured. Norm-referenced tests are also concerned with what is being measured, but to a lesser degree, since the domain of content is not the primary focus for score interpretation. In norm-referenced test development, item selection, beyond the requirement that items meet the content specifications, is driven by item statistics. Items are needed that are not too difficult or too easy, and that are highly discriminating. These are the types of items that contribute most to score spread, and enhance test score reliability and validity. With criterion-referenced test development, extensive efforts go into ensuring content validity. Item statistics play less of a role in item selection, though highly discriminating items are still greatly valued, and sometimes item statistics are used to select items that maximize the discriminating power of a test at the performance standards of interest on the test score scale.
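As a concrete illustration of the item statistics referred to above, the brief sketch below computes item difficulty (the proportion of candidates answering an item correctly) and a simple discrimination index (the correlation between item score and total test score) from a small, hypothetical set of 0/1-scored responses; the data are invented for the example and are not drawn from any particular test.

```python
# Illustrative only: hypothetical 0/1-scored responses (rows = candidates, columns = items).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

total_scores = responses.sum(axis=1)

# Item difficulty: proportion of candidates answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation between item score and total score
# (a point-biserial-style index; in practice the "rest" score, excluding the item, is often used).
discrimination = np.array([
    np.corrcoef(responses[:, j], total_scores)[0, 1]
    for j in range(responses.shape[1])
])

print("difficulty:    ", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
```

In norm-referenced test development such statistics would drive item selection; in criterion-referenced test development they would typically be reviewed only after the content requirements had been satisfied.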
Some scholars have argued that there is little difference between norm-referenced tests and criterion-referenced tests, but this is not true. A good norm-referenced test is one that will result in a wide distribution of scores on the construct being measured by the test. Without score variability, reliable and valid comparisons of candidates cannot be made. A good criterion-referenced test will permit content-referenced interpretations and this means that the content domains to which scores are referenced must be very clearly defined. Each type of test can serve the other main purpose (norm-referenced versus criterion-referenced interpretations), but this secondary use will never be optimal. For example, since criterion-referenced tests are not constructed to maximize score variability, their use in comparing candidates may be far from optimal if the test scores that are produced from the test administration are relatively similar (see Hambleton & Zenisky, 2003).
Because the purpose of a criterion-referenced test is quite different from that of a norm-referenced test, it should not be surprising to find that the approaches used for reliability and validity assessment are different too. With criterion-referenced tests, scores are often used to sort candidates into performance categories. Consistency of scores over parallel administrations becomes less central than consistency of classifications of candidates to performance categories over parallel administrations. Variation in candidate scores is not so important if candidates are still assigned to the same performance category. Therefore, it has been common to define reliability for a criterion-referenced test as the extent to which performance classifications are consistent over parallel-form administrations. For example, it might be determined that 80% of the candidates are classified in the same way by parallel forms of a criterion-referenced test administered with little or no instruction in between test administrations. This is similar to parallel form reliability for a norm-referenced test except the focus with criterion-referenced tests is on the decisions rather than the scores. Because parallel form administrations of criterion-referenced tests are rarely practical, over the years methods have been developed to obtain single administration estimates of decision consistency (see, for example, Livingston & Lewis, 1995) that are analogous to the use of the corrected split-half reliability estimates with norm-referenced tests.
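A minimal sketch of a decision consistency calculation follows, using hypothetical pass-fail classifications from two parallel forms; it reproduces the 80% agreement figure mentioned above and adds Cohen's kappa, a commonly reported chance-corrected companion index. The classifications are illustrative assumptions only.

```python
# Illustrative only: hypothetical pass/fail classifications of ten candidates on two parallel forms.
form_a = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
form_b = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

# Decision consistency: proportion of candidates placed in the same category by both forms.
agreements = sum(a == b for a, b in zip(form_a, form_b))
p_observed = agreements / len(form_a)

# Cohen's kappa corrects the observed agreement for the agreement expected by chance.
categories = set(form_a) | set(form_b)
p_chance = sum(
    (form_a.count(c) / len(form_a)) * (form_b.count(c) / len(form_b))
    for c in categories
)
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"decision consistency: {p_observed:.2f}")  # 0.80, i.e., 80% classified the same way
print(f"kappa:                {kappa:.2f}")
```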
With criterion-referenced tests, the focus of validity investigations is on (1) the match between the content of the test items and the knowledge or skills they are intended to measure, and (2) the match between the collection of test items, and what they measure, and the domain of content that the test is expected to measure. The “alignment” of the content of the test to the domain of content that is to be assessed is called content validity evidence. This term is well known in testing practice.
Many criterion-referenced tests are constructed to assess higher-level thinking and writing skills, such as problem solving and critical reasoning. Demonstrating that the tasks in a test are actually assessing the intended higher-level skills is important, and this involves judgments and the collection of empirical evidence. So, construct validity evidence too becomes crucial in the process of evaluating a criterion-referenced test.
Probably the most difficult and controversial part of criterion-referenced testing is setting the performance standards, i.e., determining the points on the score scale for separating candidates into performance categories such as “passers” and “failers.” The challenges are great because with criterion-referenced tests in education, it is common on state and national assessments to separate candidates into not just two performance categories, but more commonly, three, four, or even five performance categories. With four performance categories, these categories are often called failing, basic, proficient, and advanced.
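For illustration, the sketch below shows how a set of hypothetical cut scores on a 0–100 score scale would sort candidates into the four performance categories just named. The particular cut scores, and the convention that a candidate who reaches a cut score is placed in the higher category, are assumptions made for the example.

```python
# Illustrative only: hypothetical cut scores on a 0-100 score scale.
import bisect

cut_scores = [40, 60, 80]   # failing/basic, basic/proficient, proficient/advanced boundaries
labels = ["failing", "basic", "proficient", "advanced"]

def classify(score: float) -> str:
    """Assign a score to a performance category; reaching a cut score counts as the higher category."""
    return labels[bisect.bisect_right(cut_scores, score)]

for score in [35, 40, 59, 72, 91]:
    print(score, "->", classify(score))
```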
What makes the setting of performance standards on criterion-referenced tests controversial is that the process itself is highly judgmental, and the implications are far-reaching. Candidates who fail the test may be denied a high school diploma or a license to practice in the profession they trained for. Teachers and administrators can lose their jobs if student test performance does not meet the performance standards. Perceptions of the quality of education in a state can be affected by large percentages of students being assigned to the failing or basic performance categories. With international assessments such as Trends in Mathematics and Science Study (TIMSS), the educational reputations of countries are based on criterion-referenced test performance.
The process of setting performance standards proceeds through many steps (see Cizek, 2001; Hambleton & Pitoniak, 2006). First, it is common to set a policy about the composition of the panel that will set the performance standards. Here, decisions about the demographic make-up of the panel, such as gender, ethnicity, years of experience, geographical distribution, role (e.g., teachers, administrators, curriculum specialists, parents), are usually considered, as well as other factors. Then a plan is put in place to draw a representative panel to meet the specifications.
Another big decision concerns the choice of standard-setting method. There are probably 10 to 20 major methods, and large numbers of variations of each. The methods include Angoff, Ebel, Nedelsky, contrasting groups, borderline groups, direct consensus, item cluster, booklet selection, extended Angoff, bookmark, and more.
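To show how one of these methods turns panelist judgments into a recommended performance standard, here is a minimal sketch of the basic Angoff calculation: each panelist estimates, for every item, the probability that a minimally qualified candidate would answer it correctly, and the recommended cut score is the average across panelists of the sums of those ratings. The panel size, item count, and ratings below are hypothetical.

```python
# Illustrative only: hypothetical Angoff ratings (rows = panelists, columns = items).
# Each entry is a panelist's estimate of the probability that a minimally
# qualified candidate would answer that item correctly.
import numpy as np

ratings = np.array([
    [0.70, 0.55, 0.80, 0.60, 0.45],
    [0.65, 0.60, 0.75, 0.55, 0.50],
    [0.75, 0.50, 0.85, 0.65, 0.40],
])

# Each panelist's implied cut score is the sum of his or her item ratings;
# the panel's recommended standard is typically the average of these sums.
panelist_cut_scores = ratings.sum(axis=1)
recommended_cut = panelist_cut_scores.mean()

print("panelist cut scores:", np.round(panelist_cut_scores, 2))   # [3.10, 3.05, 3.15]
print("recommended cut score (out of 5 items):", round(recommended_cut, 2))   # 3.1
```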
Prior to the meeting of the panel to set the performance standards it is common for a different panel to prepare performance category descriptions. These descriptions lay out for the standard-setting panel what it means to be a failing student, a basic student, and so on. The descriptions provide a basis for the standard-setting panel to carry out its work of determining just how well candidates must perform on the test to demonstrate basic, proficient, and advanced level performance. The descriptions are also helpful in communicating the expectations for students in each performance category, and again at the time of score reporting.
Next, the panel is brought together and the chosen method is applied to produce performance standards. A typical panel meeting often begins with discussion of the purpose of the test and exposure to the performance category descriptions. Having the panelists take a portion of the test, or even the entire test, is another activity often included as part of the training. Then the method is introduced, and practice is given prior to the panel starting on its task of setting the standards.
The meeting continues, and often two to three days are needed for the panelists to work through the method and related discussions until a final recommended set of performance standards is produced. Validity evidence is compiled about the process and the panelists' impressions of it, a technical manual is often written, and then all of the information is forwarded to a board for setting the final performance standards for the criterion-referenced test. If multiple tests are involved (e.g., mathematics, reading, and science tests at several grade levels), the task of making the complete set of performance standards across subjects and grades consistent or coherent is especially challenging.
Criterion-referenced tests are used in many ways. Classroom teachers use them to monitor student performance in their day-to-day activities. States find them useful for evaluating student performance and generating educational accountability information at the classroom, school, district, and state levels. The tests are based on the curricula, and the results provide a basis for determining how much is being learned by students and how well the educational system is producing desired results. Criterion-referenced tests are also used in training programs to assess learning. Typically pretest-posttest designs with parallel forms of criterion-referenced tests are used. Finally, criterion-referenced tests are used in the credentialing field to determine persons qualified to receive a license or certificate. There are hundreds of credentialing agencies in the United States that are using criterion-referenced tests to make pass-fail credentialing decisions.
See also: Classroom Assessment
Cizek, G. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519–521.
Hambleton, R. K., & Pitoniak, M. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (pp. 433–470). Westport, CT: American Council on Education.
Hambleton, R. K., & Zenisky, A. (2003). Issues and practices of performance assessment. In C. Reynolds & R. Kamphaus (Eds.), Handbook of psychological and educational assessment of children (pp. 377–404). New York: Guilford.
Livingston, S., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179–197.
Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1–9.