Classroom assessment is the process, usually conducted by teachers, of designing, collecting, interpreting, and applying information about student learning and attainment to make educational decisions. There are four interrelated steps in the classroom assessment process. The first step is to define the purposes for the information. During this step, the teacher considers how the information will be used and how the assessment fits into the students' educational program. The teacher must consider whether the primary purpose of the assessment is diagnostic, formative, or summative. Gathering information to detect student learning impediments, difficulties, or prerequisite skills is an example of diagnostic assessment. Collecting information on a frequent basis to provide student feedback and guide either student learning or instruction serves formative purposes, and collecting information to gauge student attainment at some point in time, such as at the end of the school year or grading period, is summative assessment.
The next step in the assessment process is to measure student learning or attainment. Measurement involves using tests, surveys, observation, or interviews to produce either numeric or verbal descriptions of the degree to which a student has achieved academic goals. The third step is to evaluate the measurement data, which entails making judgments about the information. During this stage, the teacher interprets the measurement data to determine whether students have certain strengths or limitations and whether they have sufficiently attained the learning goals. In the last stage, the teacher applies the interpretations to fulfill the aims of assessment that were defined in the first stage. The teacher uses the data to guide instruction, render grades, or help students with any particular learning deficiencies or barriers.
Hundreds of books and articles on classroom assessment have been written, but most, if not all, subscribe to an assessment framework articulated in the 1930s and 1940s by Ralph Tyler (1949), who believed that assessment was an integral component of curriculum and instruction planning. Tyler developed a multistep model of curricular and instructional design that began with consideration of what the educator expected the student to know and be able to do after teaching had occurred. He termed these end results of education “instructional objectives,” which he stated should be crafted by considering both the mental skill, such as “applies” or “creates,” and the subject matter content the student will develop. Good planning, according to Tyler, involved developing a table that specifies the body of objectives students will develop during the course of a school year, semester, or lesson.
After the instructional objectives are formulated, educational experiences can be developed that encompass the teaching materials and instructional opportunities that will be provided to students. Also during this planning stage, teachers must consider how they will determine if students have attained the instructional objectives. Indeed, good objectives are those that clearly define the type of activity the students will accomplish to indicate the degree to which the students have attained the objective. After students experience the learning opportunities provided by the teacher and after assessment has occurred, the teacher's task is to examine the assessment results and decide whether students have sufficiently reached the objectives. If they have not, the teacher can revise the educational experiences until attainment has occurred. Thus, Tyler's model of testing emphasized the formative role of classroom assessment.
Tyler did not organize the mental skills that make up objectives in any meaningful way. Benjamin Bloom, who earlier had been a graduate student of Tyler's at the University of Chicago, led a committee during the 1950s to develop a Taxonomy of Educational Objectives (Bloom et al., 1956). The committee organized mental, or intellectual, skills in a hierarchical fashion from the most basic levels, knowledge and comprehension, to the more advanced levels, application, analysis, synthesis, and evaluation. The Taxonomy has been widely used to organize the types of objectives students of all ages are expected to attain in schools worldwide.
Selected- and Constructed-response Formats. Teachers have an array of item formats with which to measure student attainment of objectives (see Linn & Miller, 2005; Oosterhof, 2003). Assessment items can be classified into two categories: selected- and constructed-response formats. In selected-response items, the student's task is to choose one or a few correct options among multiple alternatives. Examples of selected-response item formats include multiple-choice, ranking of options, interpretive exercises, matching, true-false, alternate-choice, embedded alternate-choice, sequential true-false, and checklists. In constructed-response items, students must supply an answer to a question prompt. Short answer and essay items are common constructed-response items. Essay items can require students to write either extended or restricted responses. Responses can be restricted by limiting the amount of space available to supply the answer, dictating the number of acceptable answers (“state three reasons …”), or qualifying in the prompt the expected response length (“briefly describe …”). Restricted-response essays are useful for measuring student attainment of factual knowledge and basic comprehension. Extended-response essays are more appropriate if the goal is to measure students' skills at analyzing, synthesizing, constructing, or evaluating information because they offer students greater latitude in how to organize and present their thoughts.
Performance assessments are another type of constructed-response item. With this format, students are expected to perform an activity or set of activities. They can be asked to perform a process, such as delivering a public speech, or produce a product, such as a science notebook or work of art. Many performance assessments, but not all, attempt to represent real-life contexts or applications and are therefore considered authentic assessments. Because students perform activities during these assessment tasks, performance assessments can be integrated well with regular instructional activities.
Scoring. Constructed-response items must be scored by a judge, using either a norm- or criterion-referenced scoring procedure. In norm referencing, the teacher compares the quality of a student's response to the responses of a reference group, which might include the other students currently in the class or prior students the teacher has taught. The teacher then assigns a score to the student's response based on how the response ranks or where it falls in the distribution of responses in the reference group. Criterion-referenced scoring involves basing a student's score on the degree to which the student has demonstrated the attainment of specified knowledge or skills. Academic standards stipulate what students should know and be able to do, and performance standards specify the degree to which they have mastered the academic expectations.
The criteria or expectations often are defined in a scoring rubric, which provides descriptions of responses on a scale. Teachers can use either holistic or analytic scoring rubrics to render criterion-referenced scores. An analytic rubric allows the teacher to score the constructed response on separate and multiple dimensions, such as organization, accuracy, and voice. For holistic scoring, the teacher produces one overall score. A holistic rubric could be based on multiple dimensions, but the teacher considers all of the dimensions simultaneously to yield the score. Analytic rubrics are more useful if the goal is to provide more extensive and deeper feedback to the student, because the student gets separate scores on multiple dimensions. Holistic scoring typically takes less time because only one score per response is made. It works, however, only when the dimensions are strongly related across responses. For example, if students who are high on organization also tend to be high on accuracy and voice, then holistic scoring can work effectively. If the dimensions are not correlated well (e.g., responses can be high on voice but low on accuracy), analytic scoring is more suitable.
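The choice between holistic and analytic scoring described above can be sketched in code. The following is a minimal illustration, not a procedure prescribed by the sources cited here: it computes Pearson correlations between rubric dimensions from previously scored responses and suggests holistic scoring only when every pair of dimensions is strongly related. The function names and the 0.7 threshold are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Assumes each list has some variation; a constant list would
    cause a division by zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def suggest_rubric(dimension_scores, threshold=0.7):
    """Suggest 'holistic' if all dimension pairs correlate at or above
    the threshold, otherwise 'analytic'.

    dimension_scores maps a dimension name to a list of scores, one per
    response, in the same response order. The threshold is a
    hypothetical cutoff, not an established standard.
    """
    names = list(dimension_scores)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(dimension_scores[first], dimension_scores[second])
            if r < threshold:
                return "analytic"
    return "holistic"


# Four responses scored 1-4 on three dimensions (made-up data).
aligned = {
    "organization": [1, 2, 3, 4],
    "accuracy": [1, 2, 3, 4],
    "voice": [2, 3, 4, 4],
}
print(suggest_rubric(aligned))
```

With the made-up data above, the three dimensions rise together, so the sketch suggests holistic scoring; if voice ran opposite to accuracy, it would suggest analytic scoring instead.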
Advantages and Limitations of Test Formats. There are advantages and limitations with each item format, and teachers should choose the format that best suits the purposes for assessment. If teachers have less time to score the assessments, selected-response questions are advantageous because they can be scored faster than constructed-response items. Selected-response items also are superior to constructed-response items if the goal is to measure basic levels of Bloom's Taxonomy, such as knowledge or comprehension. Students can respond more quickly to selected-response items, allowing the teacher to assess a broader range of objectives across a given timeframe. Selected-response items also are considered more objective than constructed-response questions because the latter items require teachers to score the responses, introducing rater error to the scores. Because reliability is increased by having more items with less error, selected-response items tend to yield more consistent scores relative to constructed-response items.
But given that selected-response items present both correct and incorrect options to students, those items are more prone to guessing than constructed-response items. The probability that students can guess correctly depends on the number of distracters for each question, the test-taking skills of the student, and the quality of the distracters. Constructed-response items also take less time to create, so if teachers have little time to construct an exam, they should consider including more of those items on the test. Crafting reasonable and high-quality distracters and selected-response items that are not prone to guessing is an arduous and time-consuming process. Also, because students must supply an answer for constructed-response items, the format is more suited for measuring more advanced levels of Bloom's Taxonomy in a direct manner. For example, if students are to demonstrate their evaluation skills or show that they can apply their knowledge in a novel situation, teachers must rely on constructed-response questions. Students would only be able to demonstrate that they can identify a proper application or accurate evaluation with selected-response items. Constructed-response items test the recall of information and actual demonstration of advanced skills, whereas selected-response items focus on mental recognition and serve, at best, as indirect indicators of advanced intellectual skills.
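The effect of the number of distracters on guessing can be illustrated with a short calculation. This sketch assumes purely blind guessing, in which every option is equally attractive and items are independent; it therefore ignores the test-taking skill of the student and the quality of the distracters, both of which the text notes also matter. The function names are hypothetical.

```python
from math import comb


def blind_guess_probability(num_distracters):
    """Chance of guessing one item correctly when one keyed answer sits
    among num_distracters equally plausible wrong options."""
    return 1 / (num_distracters + 1)


def prob_at_least(num_items, min_correct, p):
    """Binomial probability of getting at least min_correct items right
    out of num_items when each item is guessed correctly with chance p."""
    return sum(
        comb(num_items, k) * p ** k * (1 - p) ** (num_items - k)
        for k in range(min_correct, num_items + 1)
    )


# A four-option multiple-choice item has three distracters.
p_item = blind_guess_probability(3)
print(p_item)
# Chance of reaching 6 of 10 correct by blind guessing alone.
print(prob_at_least(10, 6, p_item))
```

Under these assumptions, adding distracters lowers the per-item chance of a correct guess, and lengthening the test makes a high score by guessing alone increasingly unlikely.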
Report Card Grades. Teachers typically must assign grades indicating student performance based on assessment information. Often the types of grades to be assigned on report cards are determined by the district office. Many districts rely on letter grades, which require the teacher to report student performance in ordinal categories (e.g., A-E), while other districts use percentage grades (0–100), pass-fail marks, checklists, or narratives. It is not uncommon for report cards to consist of multiple grading methods (Guskey & Bailey, 2000). A relatively new form of grading is standards-based reporting. With this method, teachers report student performance on state or district academic standards using performance levels such as “Falls Below Expectations,” “Approaches Expectations,” “Meets Expectations,” and “Exceeds Expectations.” Many districts have moved to this newer method to encourage teachers to focus on academic standards and to provide students and parents with an alternative report on students' performance on the standards besides state achievement tests.
Though districts often determine the grading method, teachers usually have considerable freedom in deciding how they will transform student performance into grades. Teachers can employ either norm-referenced or criterion-referenced scoring procedures. Norm-referenced methods first require teachers to rank students from the highest to lowest performers. Curving is perhaps the most conventional normative method. After ranking students, teachers set thresholds between performance levels based on percentages that roughly follow the normal (i.e., bell-shaped) distribution. For example, the top 10 to 15 percent of students would be assigned A's, the next 20 to 30 percent of students would be assigned B's, and so on. Teachers can modify curving by changing the proportions of students who receive various grades. Percentage scores also can be assigned based on norm referencing, that is, by giving students percentage scores based on their percentile standing in the class distribution.
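Curving as described above can be sketched as a rank-then-slice procedure. The example below is an illustration under stated assumptions, not a standard algorithm: the function name is hypothetical, the grade proportions are made up (only the A and B shares fall within the ranges mentioned in the text), and tied scores are simply split in their original order.

```python
def curve_grades(scores, proportions):
    """Assign grades by class rank.

    scores maps a student name to a raw score. proportions is an ordered
    list of (grade, share) pairs from best to worst grade, with shares
    summing to 1. Ties are broken by original order, a simplification a
    teacher would likely want to handle more carefully.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    cumulative = 0.0
    start = 0
    for grade, share in proportions:
        cumulative += share
        end = round(cumulative * n)  # rank position where this grade stops
        for name in ranked[start:end]:
            grades[name] = grade
        start = end
    # Any student left over because of rounding receives the last grade.
    for name in ranked[start:]:
        grades[name] = proportions[-1][0]
    return grades


# Hypothetical class of ten students and a hypothetical curve.
scores = {f"s{i}": 100 - i for i in range(10)}
proportions = [("A", 0.1), ("B", 0.2), ("C", 0.4), ("D", 0.2), ("E", 0.1)]
print(curve_grades(scores, proportions))
```

Changing the shares in `proportions` corresponds to the teacher modifying the curve; note that the grade a student receives depends on classmates' scores, not on any fixed standard.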
Many teachers have moved away from norm-referenced grading because it encourages competition for a limited number of desirable grades and because it provides limited information regarding what students actually have learned. Most grading in classrooms in the early 2000s is based on criterion-referenced scoring. The point system probably is the most prevalent grading procedure used by teachers. This method involves assigning maximum possible points for each assignment or exam that makes up the final grade, allocating points to students for each of the assignments or exams based on their performance, and then tallying the total number of earned points for each student. If letter grades are used for reporting, teachers can assign A's to those students who earned 90 percent or more of the points, B's to students who earned at least 80 but less than 90 percent of the points, and so on. Percentage grades also can be assigned by reporting the percent of total possible points earned by each student. Other criterion-referenced methods can be used by teachers to produce standards-based grades (Ainsworth & Viegut, 2006).
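The point system lends itself to a short worked example. The sketch below totals earned and possible points, converts the total to a percentage, and maps it to a letter. The 90 and 80 percent cutoffs come from the text; the 70 and 60 percent cutoffs, the A through E labels, and the function name are assumptions filled in for illustration.

```python
def point_system_grade(
    earned,
    possible,
    cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D")),
    lowest="E",
):
    """Return (percent, letter) for one student.

    earned and possible are parallel lists of points per assignment or
    exam. cutoffs lists minimum percentages from highest to lowest; the
    70 and 60 values are hypothetical extensions of 'and so on'.
    """
    percent = 100 * sum(earned) / sum(possible)
    for minimum, letter in cutoffs:
        if percent >= minimum:
            return percent, letter
    return percent, lowest


# Three graded tasks worth 50, 20, and 30 points (made-up data).
print(point_system_grade([45, 18, 27], [50, 20, 30]))  # (90.0, 'A')
print(point_system_grade([40, 15, 25], [50, 20, 30]))  # (80.0, 'B')
```

Because each student is compared with fixed cutoffs rather than with classmates, every student in the class could in principle earn an A.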
Teacher classroom assessment commonly is compared to external achievement tests to articulate its strengths and weaknesses. Such comparisons, however, are misguided because the two types of assessment serve quite different purposes. Being standardized assessments, external achievement tests serve to compare the achievement levels of students across many schools, districts, states, or countries at discrete points in time (usually fall and spring) on broad knowledge and skills. Although these tests can serve a formative role, they typically are used for summative purposes. Classroom assessments usually are developed to reflect if students developed the knowledge and skills taught in a given classroom and, thus, are more focused on the specific curriculum and instruction delivered by the teacher. Assessment in the classroom also is an ongoing and continuous process, so its strength is the provision of formative information about student learning and teacher instruction.
Items on external achievement tests are subjected to extensive development and review processes. Items are carefully examined for content accuracy and lack of bias and other test flaws. Often external test items are field tested and statistically analyzed before they can be used operationally on test forms. Developers of external tests also expend considerable effort to systematize the scoring of constructed response items. Often they employ multiple judges who have received extensive training to calibrate their stringency levels and increase their reliability.
Teachers commonly develop their own items or use items provided in teachers' manuals that accompany textbooks. Items found on most classroom assessments, thus, have not been constructed with the same level of quality control as items on external tests. Further, teachers usually score constructed responses by themselves without applying preliminary procedures to reduce scorer error. Not only are external achievement tests developed with more deliberation, they typically contain more items than classroom final exams, quizzes, or graded assignments. For these reasons, scores from classroom assessments tend to be much less reliable than scores from external tests.
Besides yielding less reliable scores relative to external tests, scores across teachers often are not comparable. Teachers in the same school and teaching the same grade can administer tests that differ considerably in terms of item difficulty, cognitive demand, and scoring methods. Thus, students with the same levels of achievement can earn different grades in different classrooms. This situation would be unlikely if the students took the same standardized achievement test. Indeed, the lack of comparability across high school grades led to the development of standardized college admissions tests.
There are advantages, however, of assessments that are unique to classroom curriculum, instruction, and teacher expectations. Because most teacher tests are tailored to what students learned in the classroom, they usually provide teachers with richer information about student learning within the context of students' classroom experiences. This more targeted information can be used more effectively by the teacher to modify instruction to actual student needs. Teacher tests, therefore, likely produce more valid scores of the degree to which students attained the instructional objectives generated by the teacher.
Frequency of Testing. Though external tests contain more items than classroom assessments, the teacher has the opportunity to administer more items representing a far greater array of item formats during the school year. External tests typically are administered once or twice at most in a given year, and they usually contain one to three item formats. If teachers assess frequently and use an array of formats, they can collect a body of student information that has four major advantages.
First, frequent assessment allows the teacher to track student growth and to detect areas in need of more or different instruction. Second, assessing often yields learning information that teachers can use to give constructive feedback to students. Timely feedback focused on what students have mastered and where they need to improve has been linked to greater learning gains (Black & Wiliam, 1998). Third, if teachers base final grades on frequent small assessments containing items representing various formats, the final grades likely would be as reliable as, or more reliable than, scores from external achievement tests, given teachers' opportunity to gather more information regarding student attainment than a single test administration provides. Finally, besides increasing reliability, this assessment approach yields more valid scores. As Campbell and Fiske (1959) noted, validity is limited by relying on a sole item format or test method because scores are influenced to some degree by those factors. By using various methods, including different item formats and paper-pencil as well as oral testing and observation, teachers can generate information about each student's learning that transcends the method type.
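The claim that gathering more items raises reliability can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result that the sources cited here do not name. The formula projects the reliability of a lengthened assessment as k·r / (1 + (k − 1)·r), where r is the current reliability and k is the factor by which the number of items grows. It assumes the added items are comparable in quality to the existing ones, which may not hold for teacher-made items; the function name and the sample values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability after multiplying test length by
    length_factor, assuming the added items are parallel to the
    existing ones."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


# A single quiz with a hypothetical reliability of 0.50, compared with
# grades pooled over 2, 5, and 10 quizzes of similar length and quality.
for k in (1, 2, 5, 10):
    print(k, round(spearman_brown(0.50, k), 3))
```

Under these assumptions the projected reliability climbs steadily as quizzes are pooled, which is consistent with the argument that many small assessments can rival a single long test.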
Ultimately, it is the responsibility of the teacher to maximize the strengths and limit the weaknesses of classroom assessments. Teachers must make a concerted effort to integrate testing into their teaching plans and practices. Research indicates that teachers who prioritize assessment and use test results to improve their instruction tend to be more effective instructors (Black & Wiliam, 1998). Unfortunately, teachers vary greatly in the degree to which they value assessment. Some teachers are opposed to testing, while others assess in a haphazard manner. Still others use the same favorite item format for all assessments, consequently limiting the validity of students' scores. By contrast, those teachers who systematize assessment and rely on it to guide their practice likely produce the highest quality of information available on student learning.
Ainsworth, L., & Viegut, D. (2006). Common formative assessments: How to connect standards-based instruction and assessment. Thousand Oaks, CA: Corwin Press.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives, Book 1: Cognitive domain. New York: Longman.
Campbell, D., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Guskey, T. R., & Bailey, J. M. (2000). Developing grading and reporting systems for student learning. Thousand Oaks, CA: Corwin Press.
Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th ed.). Upper Saddle River, NJ: Merrill/Prentice Hall.
Oosterhof, A. (2003). Developing and using classroom assessments. Upper Saddle River, NJ: Merrill/Prentice Hall.
Tyler, R. W. (1949). Basic principles of curriculum and instruction. Chicago: University of Chicago Press.