Subjective test items are more commonly called constructed response (CR) items. They require examinees to create their own responses, rather than selecting a response from a list of options (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999). No single wording (or set of actions) can be regarded as the only correct response, and a response may earn full or partial credit. Responses must be scored subjectively by content experts. The term constructed response item may refer to an essay item or performance assessment. Measurement experts traditionally distinguish between two variations of these subjective item types: the restricted response item and the extended response item.
Restricted response items. On restricted response items examinees provide brief answers, usually no more than a few words or sentences, to fairly structured questions. An example in seventh grade science could be: Why are day lengths shorter in December than in July in the northern hemisphere?
Extended response items. Extended response items require lengthy responses that count heavily in scoring. Ideally these items focus on major concepts of the content unit and demand higher level thinking. Typically examinees must organize multiple ideas and provide supporting information for major points in crafting responses. An example of such an item from 12th grade literature might be: The title of Steinbeck's novel, The Winter of Our Discontent, is found in the opening line of Shakespeare's play, Richard III. Having read both works, explain why you do, or do not, think this title is appropriate. Support your reasoning by comparing main characters, plots, and use of symbolism in these two works.
Performance assessment, as conceived in personnel psychology, requires the examinee to create a product or deliver a performance in a real-world situation or simulation that can be evaluated using specified criteria. Raters typically score the performance using checklists or rating scales (Fitzpatrick & Morrison, 1971). When educators transported this procedure to classroom settings, their early performance exercises reflected this definition, as shown in the following examples: determine the cost of carpeting a classroom, given a tape measure and carpet cost per square foot; determine the chemical composition of an unknown powdered compound.
Gradually, educators' views of performance assessment evolved to include pencil and paper items couched in real-world contexts. Coffman (1971) successfully argued that an essay item could be a performance assessment in some content areas. In performance assessments in the early 2000s, examinees may respond to questions containing diagrams, data tables, written scenarios, or text passages, but quite often their responses are written essays. In modern usage, the defining characteristic of a performance assessment is that it requires behaviors that are meaningful end-products of instruction derived from content standards (Lane & Stone, 2006). Performance assessments may also include portfolios or assigned out-of-class projects, but the principles for construction and scoring are the same as for essay items.
Content standards and test specifications operationally define the domain of subject matter knowledge and levels of cognitive complexity that are sampled by the achievement test items. Within these parameters, the test developer must develop the questions (or prompts), create scoring rubrics (or keys), and plan the scoring process. Welch (2006) provides a comprehensive summary of this process.
Developing the prompt. The prompt for a subjective item poses a question, presents a problem, or prescribes a task. It sets forth a scenario or set of circumstances to provide a common context for framing responses. Action verbs direct the examinee to focus on the desired behavior (e.g., solve, interpret, compare, contrast, discuss, or explain). Appropriate directions indicate expected length and format of the response, allowable resources or equipment, time limits, and features of the response that count in scoring (e.g., originality, organization, grammar, labeling diagrams, or numeric precision; see Gronlund & Linn, 1995).
Creating the scoring rubric. Scoring rubrics are usually analytic or holistic in nature. For an analytic rubric the item writer lists desired features of the response with a number of points awarded for each specific feature. A holistic rubric provides a scale for assigning points to the response based on overall impression. A range of possible points is specified (e.g., 0–8 or 0–3), and verbal descriptors are developed to characterize a response located at each possible point on the scale. Illustrative responses that correspond to each scale point are often developed or selected from actual examinee responses. These exemplars are called anchor papers because the scorer uses them as benchmarks for comparison when deciding where an examinee's response falls on the score scale.
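The difference between the two rubric types can be made concrete with a small sketch. The Python below is illustrative only: the feature list, point values, and descriptors are hypothetical, written for the seventh grade day-length item, and are not taken from any published rubric. An analytic rubric is modeled as a list of features with maximum points, and a holistic rubric as a scale of verbal descriptors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One desired feature of a response and the most points it can earn."""
    description: str
    max_points: int


# Hypothetical analytic rubric for the day-length item (assumed, not sourced).
ANALYTIC_RUBRIC = [
    Feature("States that Earth's axis is tilted", 2),
    Feature("Explains that the northern hemisphere tilts away from the Sun in December", 2),
    Feature("Uses appropriate scientific vocabulary", 1),
]

# Hypothetical holistic rubric: one verbal descriptor per scale point (0-3).
HOLISTIC_RUBRIC = {
    0: "No response, or response unrelated to the question",
    1: "Mentions seasons or the Sun but gives no mechanism",
    2: "Identifies axial tilt but explains its effect incompletely",
    3: "Identifies axial tilt and explains its effect on day length",
}


def score_analytic(awarded, rubric=ANALYTIC_RUBRIC):
    """Sum the points a rater awarded, one entry per rubric feature."""
    if len(awarded) != len(rubric):
        raise ValueError("one score is required for each rubric feature")
    for points, feature in zip(awarded, rubric):
        if not 0 <= points <= feature.max_points:
            raise ValueError(f"score out of range for: {feature.description}")
    return sum(awarded)


def describe_holistic(score, rubric=HOLISTIC_RUBRIC):
    """Return the verbal descriptor attached to a holistic scale point."""
    if score not in rubric:
        raise ValueError("score is not a point on the holistic scale")
    return rubric[score]


if __name__ == "__main__":
    print(score_analytic([2, 1, 1]))
    print(describe_holistic(2))
```

With the analytic form the rater records a judgment per feature and the total follows mechanically; with the holistic form the rater makes a single judgment and the descriptors (together with anchor papers) document what that judgment means.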
Scoring responses. During subjective scoring at least four types of rater errors may occur as the rater (a) becomes more lenient or severe over time or scores erratically due to fatigue or distractions; (b) has knowledge or belief about an examinee that influences perception of the response; (c) is influenced by the examinee's good or poor performance on items previously scored; or (d) is influenced by the strength or weakness of a preceding examinee's response. To reduce these effects, a scoring process recommended for classroom teachers includes the following:
- Mask student names to facilitate “blind” scoring;
- Use the key on a trial basis for a small sample of papers and revise as necessary;
- Grade all responses to a single item at one sitting if possible;
- Shuffle papers between scoring different items so that examinees' responses are scored in varying order;
- Mask the scores after initial scoring and rescore at least a sample of responses.
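A teacher who keeps responses electronically could set up the masking and shuffling steps above with a short script. The following is a minimal sketch under stated assumptions: the function name `blind_scoring_plan`, the code format `P001`, and the sample names and item labels are all hypothetical. It assigns each student an opaque code and produces a separately shuffled scoring order for every item.

```python
import random


def blind_scoring_plan(student_names, item_ids, seed=None):
    """Return (codes, plan) for blind, item-by-item scoring.

    codes maps each student name to an opaque code so that the rater
    does not see names while scoring. plan maps each item to its own
    shuffled order of codes, so no paper is always scored first or last.
    """
    rng = random.Random(seed)

    # Assign codes in random order so a code does not reveal roster position.
    shuffled_names = rng.sample(list(student_names), len(student_names))
    codes = {name: f"P{i:03d}" for i, name in enumerate(shuffled_names, start=1)}

    plan = {}
    for item in item_ids:
        order = list(codes.values())
        rng.shuffle(order)  # new order for each item
        plan[item] = order
    return codes, plan


if __name__ == "__main__":
    codes, plan = blind_scoring_plan(
        ["Avery", "Blake", "Casey", "Devon"], ["item1", "item2"], seed=7
    )
    for item, order in plan.items():
        print(item, order)
```

The name-to-code table would be stored apart from the papers until scoring is finished. Scoring one item at a time in its own order addresses the carryover effects described in (c) and (d) above.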
In large-scale testing programs, many raters participate in the scoring process. Prior to scoring, raters are trained to use common standards in extensive practice sessions using previously scored anchor papers. During scoring, each response is typically scored by at least two raters. Rater performance is monitored throughout the scoring process (Lane & Stone, 2006).
Using constructed response items presents issues that do not arise with objective item formats. Lane and Stone (2006) identify and discuss a number of these; a few selected examples are as follows: Should the assessment present a few important tasks that demand complex, lengthy responses, or more items requiring briefer responses, which provide broader sampling of the content and more reliable scores? Should examinees have a choice of prompts or should all respond to the same prompts? Does handwriting quality affect scores? Should examinees have a choice between handwriting and composing their responses at a computer keyboard? Do electronic scoring programs yield results comparable to those of human raters?
Such questions spark continuing debate for two reasons. First, research results from published studies on the issue may be conflicting or may not apply to other testing situations or populations. Second, these issues often involve value judgments rooted in differing educational philosophies. The final decision on such issues in a specific situation should rest on a rationale that weighs available research evidence, viewpoints of stakeholders, and possible consequences.
Despite the item writer's best efforts, subjective item prompts and rubrics can still contain flaws. The most serious flaws are: mismatch to the content standards; ambiguity of wording; incorrect information in rubrics; inappropriate level of difficulty; content or wording that is offensive to some examinees; and potential for creating gender or ethnic bias due to the problem context or particular wording. Classroom teachers may ask colleagues to review a draft of items and critique them for such flaws. In large-scale testing programs such reviews are conducted by multiple independent panels of experts (Welch, 2006).
After item administration, the examinees' numeric scores on each item provide a data set that can be analyzed to determine if the item functioned properly. This analysis typically includes computations of (a) mean item score and distribution statistics; (b) rater consistency indices (i.e., percentage of examinees receiving identical and contiguous scores from multiple raters or the correlations between the raters' scores); (c) consistency of examinee performance across different subjective items; and (d) relationship between item score and score on an objective section of the test (Schmeiser & Welch, 2006). In large-scale assessments, these analyses are conducted after the items are field tested and before live use, so that flawed items can be revised or eliminated. At the classroom level, when the item analysis reveals a problem, adjustments to the scoring rubric can be made and responses can be rescored before examinees' scores are reported.
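The rater consistency indices in (b) are simple to compute when two raters have scored the same responses. The sketch below is illustrative: it assumes that "contiguous" means scores differing by exactly one scale point, uses the Pearson coefficient for the correlation, and runs on made-up scores on a 0–4 scale. The function names are hypothetical.

```python
from math import sqrt


def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement as proportions of responses.

    exact:    both raters gave the identical score
    adjacent: the two scores differ by exactly one point (assumed meaning
              of "contiguous")
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(rater_a, rater_b)) / n
    return exact, adjacent


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a rater's scores do not vary")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up scores for eight responses, for illustration only.
    rater_a = [3, 2, 4, 1, 3, 2, 0, 4]
    rater_b = [3, 3, 4, 1, 2, 2, 1, 3]
    exact, adjacent = agreement_rates(rater_a, rater_b)
    print(f"exact agreement:    {exact:.0%}")      # 50%
    print(f"adjacent agreement: {adjacent:.0%}")   # 50%
    print(f"correlation:        {pearson_r(rater_a, rater_b):.2f}")
```

Agreement rates and correlation answer different questions: two raters can correlate highly while one is consistently a point more severe, which exact agreement would reveal and the correlation would not.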
See also: Classroom Assessment
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
Coffman, W. E. (1971). Essay examination. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 271–302). Washington, DC: American Council on Education.
Fitzpatrick, R., & Morrison, E. J. (1971). Performance and product evaluation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 237–270). Washington, DC: American Council on Education.
Gronlund, N. E., & Linn, R. L. (1995). Measurement and evaluation in teaching. Englewood Cliffs, NJ: Merrill.
Lane, S., & Stone, C. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Washington, DC: American Council on Education.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 307–353). Washington, DC: American Council on Education.
Welch, C. J. (2006). Item and prompt development in performance testing. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 303–327). Mahwah, NJ: Erlbaum.