Psychological and educational assessment of students has been prominent since the early 1900s. The results of these assessments have been used for a myriad of purposes, such as identifying children who (a) are suspected of having learning difficulties, (b) qualify for gifted programs or programs requiring specific talents (e.g., enrollment in a music or arts program), or (c) may be suffering from emotional distress (e.g., depression, anxiety). Further, group-based data focusing on a chosen index (e.g., standardized achievement scores, school drop-out rates) have been compared across school districts and across states to examine learning outcomes. Common to all assessment processes is the gathering of data to make informed educational and mental health decisions (Neukrug & Fawcett, 2006).

The term assessment includes a broad array of methods. Sattler (2001) outlined four main types, or pillars, of assessment, each of which is complementary and adds unique information not found in the other methods. These pillars include (a) interviews with parents, teachers, and children, (b) observations of the child's behavior, and (c) informal assessment procedures such as reviewing class work, school records, or personal documents, for example, diaries, drawings, and self-report logs. The fourth and most frequently used method is the administration of norm-referenced tests (most notably intelligence and achievement measures) for diagnostic purposes. Such tests compare a child's results, in standardized form, against an established norm group. Considering the emphasis that teachers, parents, and schools often place on the results of norm-referenced tests, this entry focuses on this particular assessment method.

The results derived from norm-referenced tests often have been used to establish or support many legislative and educational policies (see Cohen & Swerdlik, 2002). The ramifications of these policies for schools, students, and the larger community underscore the need for tests to demonstrate sound psychometric properties (i.e., evidence of reliability and validity) and appropriateness for their intended use (Anastasi, 1988; Moreland, Fowler, & Honaker, 1997). A key factor in this regard is the manner in which tests are administered. Indeed, from the beginning of psychological testing much attention has been devoted to ensuring that test scores are invariant regardless of whether the tests are administered individually or in a group setting (Geisinger, 2000; Kline, 2005). Although stringent measures have been taken to control method variance, each format contains inherent advantages and disadvantages. This entry reviews the most salient aspects of each format.


The history of psychological testing is one marked by necessity followed by innovation (Geisinger, 2000). Although the lineage of psychological measurement dates back at least 4,000 years (Thorndike & Lohman, 1990), the first modern test of intelligence was created at the turn of the 20th century, in response to problems reported by French public schools experiencing a sharp increase in student enrollment. The overcrowded conditions prompted a decision to remove from the regular education classroom children who were not learning at a satisfactory level. The 1905 work of Alfred Binet (1857–1911) and Theophile Simon led to the construction of a test containing a series of brief mental tasks, which could be individually administered to children ages 3 to 11. Revisions to this original test included expanding the age range to adulthood, increasing the number of items comprising each test, providing standard directions for administration, and introducing the concept of mental age, which was the chronological age of a group of typically developing children who performed at the same level as the examinee (Aiken, 2006).

A short time later, American psychologists made significant revisions to the original Binet-Simon scale. These changes included arranging the items in order of difficulty and assigning points for the level of correctness of an examinee's response. The most durable of these changes was provided by Lewis Terman and colleagues, who gathered extensive normative data on the Binet-Simon scale from hundreds of children in the Stanford, California, area. From these data, numerous revisions were made, most notably the inclusion of an intelligence quotient or IQ, a quantitative score that was defined as the ratio of the child's mental age to chronological age (Neukrug & Fawcett, 2006). Later known as the Stanford-Binet Intelligence Test, Terman's revised test served as the standard of testing for the first two decades of the 20th century (Aiken, 1996) and ushered in both the intelligence testing movement and the clinical testing movement (Boake, 2002; Cohen, Swerdlik, & Smith, 1992). The original Binet-Simon scales and their subsequent modifications also served as the prototype for most modern intelligence tests.
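
The ratio IQ is conventionally written with mental age in the numerator and the quotient multiplied by 100, a scaling the passage above leaves implicit. As a worked illustration (the ages are hypothetical and chosen only for arithmetic convenience):

```latex
\[
\mathrm{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100,
\qquad \text{e.g.,} \quad \frac{10}{8} \times 100 = 125 .
\]
```

Under this definition, a child who performs exactly at the level typical of his or her chronological age obtains a score of 100.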

Group testing also was developed to address a pressing concern. The involvement of the United States in World War I prompted the need for a group-based cognitive test that could quickly determine whether recruits were fit for military service and identify those who could be trained as officers. Although considered crude by 21st century standards, two multiple-choice “intelligence” tests were created and administered to almost 2 million recruits. The Army Alpha was designed for and administered to recruits who were literate and proficient in English, while the Army Beta was administered to foreign-born recruits or those who could not read with proficiency. Although both versions underwent significant revisions, they served as the foundation for most group-based intelligence tests used today. Further, these early tests served as the prototype for group tests examining other constructs (e.g., achievement, specific aptitude, psychopathology) across various environments (e.g., schools, mental health clinics) (Neukrug & Fawcett, 2006).


Individually administered norm-referenced cognitive and achievement tests have been used for a wide variety of purposes. In conjunction with other norm-referenced tests and assessment methods, cognitive and achievement tests have typically been used to provide in-depth information on a student's (a) intellectual functioning or academic standing in comparison to same-age peers, (b) ability to process certain mental or academic tasks, which can facilitate diagnostic impressions such as learning disabilities, giftedness, or mental retardation, and (c) initial or continuing eligibility for special education services. In most cases, cognitive and achievement tests provide a wide array of specific tasks that the student is asked to perform either within a given time frame or according to specific scoring guidelines. Each test is concluded when the allotted time is reached (i.e., a timed test) or the student gives a specified number of consecutive failed responses (i.e., a power test). Most tests aggregate the task scores to yield a general or overall score, which is assumed to indicate a student's global level of cognitive or achievement functioning. Nevertheless, because the general score can be influenced by extremely high (or low) performance on one or more specific tasks, composite scores also are computed to assess functioning within a specific domain. For example, cognitive tasks that measure a student's verbal reasoning abilities and general knowledge would be combined to form a verbal composite. Likewise, achievement tasks that assess a student's aptitude for solving math problems or for identifying numbers would be combined to form a math composite.
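
The stopping rule for a power test described above can be illustrated with a brief sketch. The function name, the item scores, and the threshold of three consecutive failures are hypothetical choices made for illustration; each published test specifies its own discontinue rule and scoring guidelines.

```python
from typing import Sequence


def raw_score_with_discontinue(
    item_scores: Sequence[int],
    max_consecutive_failures: int = 3,  # hypothetical threshold, not taken from any real test
) -> int:
    """Sum item scores in administration order, stopping once the
    examinee has failed the specified number of consecutive items."""
    total = 0
    consecutive_failures = 0
    for score in item_scores:
        if score == 0:
            # A score of 0 is treated here as a failed item.
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break  # discontinue: remaining items are not administered
        else:
            consecutive_failures = 0
            total += score
    return total


# Items after the third consecutive failure are never reached.
print(raw_score_with_discontinue([1, 1, 0, 1, 0, 0, 0, 1, 1]))  # prints 3
```

In this sketch the last two items do not count toward the raw score even though the examinee would have passed them, because testing stops once the discontinue criterion is met.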

There are numerous advantages to using individually administered cognitive and achievement tests. First, direct one-to-one attention allows the student and examiner to establish solid rapport, which is essential for obtaining valid results (Sattler, 2001). In addition, the examiner has direct control of the testing environment, which includes ensuring that the environment itself is conducive to optimal student performance (e.g., making sure the temperature of the room is not too hot or cold, eliminating non-relevant stimuli that would distract the student). Second, the one-to-one attention allows the examiner to observe student behaviors that may interfere with task performance without being reflected in the score (e.g., fatigue) or that may assist in diagnosis (e.g., difficulty remaining in seat). Third, because most task items are orally administered by the examiner, little reading is required of the student, which makes it possible to test very young students or those with limited reading skills (Thorndike, 2005). Finally, scores from individual tests can be interpreted at a variety of levels. For example, in addition to determining the general cognitive or achievement level of the student, composite and even subtest scores can be examined to determine specific processing deficits. Thus, individually administered tests yield detailed information on a student's cognitive or achievement functioning that is not typically obtained from group administered tests.

Nevertheless, there are some disadvantages to using individually administered tests. Perhaps the biggest limitation is the cost of the tests themselves, both monetarily and with respect to time. With the cost of most major tests close to or exceeding $1,000 (U.S.), purchasing these tests places a financial burden on school districts, especially those with limited economic resources. Further, learning to administer and interpret the results from the tests (particularly the cognitive tests) requires extensive training, and administration time ranges from one to four hours. Finally, although the tests purposely include a wide array of tasks so that an adequate sampling of important cognitive or achievement domains is covered (Anastasi, 1988), one of the most persistent criticisms (particularly with regard to cognitive tests) is that the underlying conceptual framework of most tests (and thus the tasks included) is either largely atheoretical or based on differing theories of intelligence (Flanagan & Ortiz, 2006). For example, the commonly used scales in the Wechsler series were originally based on clinical practice rather than a specific theory, while the original Stanford-Binet scales were based on the theory of general intelligence, or g, which proposed that all mental abilities can be explained by a single global intellectual factor (Thorndike, 2005). Some more contemporary cognitive tests are based on neurophysiological modeling of the brain, while others are based on theories that emphasize broad fluid (i.e., innate) and crystallized (i.e., learned) abilities (Harrison & Flanagan, 2005). Thus, the use of one test may not adequately address domains covered by another test. Beginning in the late 1990s, attempts were made to create a cross-battery approach, whereby examiners are not restricted to using one test but instead use portions of multiple tests to ensure that specific domains are covered (e.g., McGrew & Flanagan, 1998). However, more research is needed to verify the clinical utility of this approach.


Most students will take a group administered cognitive or achievement test at some point during their studies. Indeed, of the millions of cognitive tests that are administered to students annually, only a small fraction are individually administered (Cohen & Swerdlik, 2002). Because of their practicality, group tests are used across a variety of environments, including military, industrial/organizational, and educational settings. Thus, group administered tests have a broader application than individual tests (Aiken, 2006). Like their individually administered counterparts, most group administered tests consist of subtests that assess a variety of cognitive or academic domains and are either timed or power tests. However, the response format for most group administered tests is multiple-choice, which is less flexible and yields much less diagnostic information. For this reason, school-based group administered tests are used as screeners to determine whether further evaluation (often using an individually administered test) is warranted.

From their inception, it was clear that group administered tests could address some of the limitations inherent in individually administered tests. For example, because they use only printed materials and follow a standardized administration procedure, the financial and personnel resources they require are much lower than those associated with individually administered tests. Further, most group administered tests have standardized and computerized scoring systems, which reduce both the time necessary to score the protocols and the likelihood of scoring error. Moreover, given the nature of the format, group administered tests can be given to as many students as can comfortably fit into a room, which reduces test administration time and increases testing efficiency. Finally, considering the potentially unlimited number of students who could take a group administered test, the norms created are often based on a sample that is much larger than the samples used for individually administered tests. This advantage allows for a direct comparison of scores across select demographic variables (e.g., race, disability status) that may not be possible when using individually administered tests.

Nevertheless, there are important disadvantages to consider with group administered tests. For example, the format does not allow for in-depth observations of individual students as they complete the test. Thus, fatigue, low motivation, anxiety, hunger, and other states that may interfere with performance go unobserved. Further, because the examiner may be less trained in the nuances of the test (in comparison to those who administer individual tests), the examiner may break standardization by inadvertently (and inappropriately) answering students' queries, or may be unable to monitor the testing environment as closely as is possible in an individual testing session. Another limitation is the restriction of responses to multiple choice, whereas items on many individually administered tests have different levels of scoring depending on the complexity of the response. In this regard, group administered items may unduly penalize creative or original thinkers. Further, although the norm sample of a group administered test may be large, it may not be representative of children of a particular demographic. For example, many group administered cognitive and achievement tests are normed on students who take the test in the fall and in the spring. However, many students may choose not to take the test (when given a choice) or may not be motivated to perform their best on the test (Aiken, 2006). Finally, the results of group administered tests can be used inappropriately. For example, the data obtained from such tests can be used to diagnose and place students into special programs, decisions that should be made only on the basis of individually administered tests (Cohen & Swerdlik, 2002).


Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Boake, C. (2002). From the Binet-Simon to the Wechsler-Bellevue: Tracing the history of intelligence testing. Journal of Clinical and Experimental Neuropsychology, 24, 383–405.

Cohen, R. J., & Swerdlik, M. E. (2002). Psychological testing and measurement: An introduction to tests and measurement (5th ed.). New York: McGraw-Hill.

Cohen, R. J., Swerdlik, M. E., & Smith, D. K. (1992). Psychological testing and measurement: An introduction to tests and measurement (2nd ed.). Mountain View, CA: Mayfield.

Flanagan, D. P., & Ortiz, S. O. (2006). Best practices in intellectual assessment: Future directions. In J. Grimes & A. Thomas (Eds.), Best practices in school psychology IV (pp. 1351–1372). Washington, DC: National Association of School Psychologists.

Geisinger, K. F. (2000). Psychological testing at the end of the millennium: A brief historical review. Professional Psychology: Research and Practice, 31, 117–118.

Harrison, P. L., & Flanagan, D. P. (2005). Contemporary intellectual assessment: Theories, tests, and issues (2nd ed.). New York: Guilford.

Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Thousand Oaks, CA: Sage.

McGrew, K. S., & Flanagan, D. P. (1998). The intelligence test desk reference (ITDR): Gf-Gc cross-battery assessment. Needham Heights, MA: Allyn & Bacon.

Moreland, K., Fowler, R. D., & Honaker, L. M. (1997). Future directions in the use of psychological assessment for treatment planning and outcome assessment: Predictions and recommendations. In M. E. Maruish (Ed.), The use of psychological testing for treatment planning and outcomes assessment (2nd ed.) (pp. 1415–1436). Mahwah, NJ: Erlbaum.

Neukrug, E. S., & Fawcett, R. C. (2006). Essentials of testing and assessment: A practical guide for counselors, social workers, and psychologists. Belmont, CA: Thomson Brooks/Cole.

Roid, G. H. (2003). The Stanford-Binet Intelligence Scales (5th ed.). Itasca, IL: Riverside.

Sattler, J. M. (2001). Assessment of children: Cognitive applications (4th ed.). San Diego, CA: Author.

Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Pearson Education.

Thorndike, R. M., & Lohman, D. F. (1990). A century of ability testing. Chicago: Riverside.