Any book that focuses on psychology is incomplete without a discussion of testing, including the topic of test bias. Psychologists, more than professionals in other disciplines, are the primary administrators of intelligence tests, with schools being the primary users. In the high-stakes testing of the early 2000s, employment opportunities, high school graduation, grade promotion, college admission, gifted education placement, and special education placement rely extensively on test results. Thus, the discussion of how tests impact the decisions of test users and the opportunities of those tested is by no means insignificant. Stated another way, “an intelligence test is a neutral, inconsequential tool until someone assigns significance to the results derived from it. Once meaning is attached to a person's score, that individual will experience many repercussions, ranging from superficial to life changing. These repercussions will be fair or prejudiced, helpful or harmful, appropriate or misguided—depending on the meaning attached to the test score” (Gregory, 2004, p. 240).
This entry presents an overview of issues surrounding bias in intelligence tests, primarily as it relates to African Americans. It defines bias, gives examples of test bias, and recommends ways to reduce bias. Two caveats are in order. First, test bias is clearly not unique to African Americans, but the bulk of research and discussion focuses on this group. Thus, the focus here on African Americans is not a slight to other culturally and linguistically diverse (CLD) groups. Second, different types of tests exist beyond intelligence tests (aptitude, achievement, career/vocational), and they are not exempt from discussions about bias. However, the focus here is specifically on bias regarding intelligence tests, as this is the most controversial type of test. One complication is that intelligence tests and the meaning attached to the word intelligence carry more significance than those associated with achievement tests. Intelligence tests (also called cognitive ability and ability tests) are often associated with genetic endowment and capacity, while achievement tests are more often associated with the environment, that is, with learning opportunities and educational experiences and their effect on test performance. As Gregory (2004) noted, beyond a doubt, no practice in modern psychology has been more assailed than psychological testing. Commentators reserve a special and often vehement condemnation for ability testing in particular. Additionally, Jensen (1980) contended that test bias is the most common rallying point for critics.
The test bias controversy and debate have their origins in the observed differences in average IQ scores between various racial groups (Blacks) and ethnic groups (immigrants) in the early 1900s (Cole & Zieky, 2001). Specifically, several studies indicate that African Americans score, on average, 15 points lower than their White counterparts on traditional intelligence tests, that is, tests with high linguistic/verbal and cultural loadings (Flanagan & Ortiz, 2001). This finding of differential group test score performance in intelligence heightened the controversy over test bias (Gregory, 2004). Under scrutiny have been all versions and editions of traditional intelligence tests, including the Wechsler tests (e.g., WISC-IV, WAIS, WPPSI), the Binet tests (e.g., Stanford-Binet, Binet-IV), the Otis-Lennon School Ability Test, and the Peabody Picture Vocabulary Test. Non-verbal intelligence tests, such as Raven's Progressive Matrices and the Naglieri Nonverbal Ability Test, have also been examined for bias (e.g., Bracken & Naglieri, 2003).
In addition to coming under professional scrutiny, intelligence tests have been challenged legally. One of the most famous cases is Larry P. v. Wilson Riles (1979), in which U.S. District Judge Robert F. Peckham of the Northern District of California ruled that intelligence tests used for the assessment of Black children for special education classes for the educable mentally retarded are culturally biased. One year later, Judge John F. Grady in Illinois ruled in Parents in Action in Special Education v. Joseph P. Hannon that intelligence tests are not racially biased; they do not discriminate against Black children. Decided one year apart, these two cases took opposing positions that helped to create or sustain the debates that continued into the early 2000s.
Opponents of using intelligence tests with Black and other CLD groups often focus on the social and educational consequences—fairness and disparate impact. The primary argument and belief is that persons from backgrounds other than the culture in which the test was developed will always be penalized; they will likely score lower on the test and, thus, have their opportunities limited and face misinterpretations about their worth and potential (academically, as students, as employees, etc.). They argue that too few intelligence tests have been normed with representative numbers (not just percentages) of CLD populations. Therefore, the test scores are not valid and reliable for them, rendering the test inappropriate to use. This argument or position also applies to topics other than race and ethnicity. For example, if few in the norming group are low income or linguistically diverse, then the test is viewed as inappropriate and potentially useless and harmful to that group. Further, if few gifted students or students with learning disabilities were in the norming group, the test's usefulness for them is questionable (Ford, 2004, 2007).
Recognizing that Black students in particular were and are negatively affected by their test performance or scores, the Association of Black Psychologists (Williams, 1970) charged that Black students were/are subsequently denied many educational opportunities; they charged that intelligence tests are not valid measures for Black students and that they are more harmful than helpful. This notion of tests being harmful goes against the principles of fair and equitable testing, a key feature of professional testing standards (e.g., American Psychological Association, American Educational Research Association, National Council on Measurement in Education, 1999). Simply put, tests should be used to help not harm; they should benefit the test taker.
Proponents of intelligence tests maintain that tests are valid and reliable tools for all groups. According to Armour-Thomas and Gopaul-McNicol (1998), support for this position falls into at least three categories or assumptions: (1) tests are culturally fair and items do not favor a particular cultural group; (2) the tasks assess the cognitive abilities underlying intellectual behavior for all groups; and (3) the tests accurately predict performance for all groups.
It is also important to note that test construction is grounded in the assumption of homogeneity and equal opportunity to learn and acquire knowledge and experiences (Armour-Thomas & Gopaul-McNicol, 1998; Flanagan & Ortiz, 2001), meaning that (a) the test items measure the everyday experiences of populations and (b) everyone has had an equal opportunity to learn and be exposed to the tasks in the tests and their format (Ford, 2004). Essentially, it is believed that tests are not discriminatory.
It is important to keep in mind that tests are often viewed as being biased against Black and other culturally and linguistically diverse groups, and against low-income students, but biased in favor of White and middle-class students. Gregory (2004) defined test bias as “objective statistical indices that examine the patterning of test scores for relevant subpopulations” (p. 242). He added that consensus exists about the statistical criteria that indicate when a test is biased. A review of definitions indicates that test bias can be categorized in two ways: technically and socially. Technically, test bias refers to differential validity for definable, relevant subgroups of persons (Sattler, 1992, p. 616). Hence, a test would be considered biased if the scores from subpopulations did not fall on the same regression line for a relevant criterion.
Bias is present when a test score has meanings or implications for a relevant, definable subgroup of test takers that are different from the meanings or implications for the remainder of test takers. Thus, bias is the differential validity of a given interpretation of a test score for any definable, relevant subgroup of test takers (Cole & Moss, 1998, cited in Gregory, 2004, p. 242). When a test is biased, from a social or social values viewpoint, the concern relates to denial of opportunity and the false negative hypothesis. Two other terms or concepts are relevant to discussions regarding testing CLD groups. It can be argued that while a test might not be biased technically, it can still be unfair (see Cole & Zieky, 2001). Test fairness is fundamentally about the social consequences of test results (Gregory, 2004, p. 249; Hunter & Schmidt, 1976). Test fairness is the extent to which the social consequences of test usage are considered fair or unfair to relevant subgroups; test fairness is especially important to consider when used for selection or placement decisions. From a legal point of view, this is related to the notion of disparate impact (see Griggs v. Duke Power, 1971). If a test negatively affects opportunities for a group to participate in, for example, gifted education, then it has a disparate impact and should not be used. Out of Griggs v. Duke Power came the fundamental question: “If a group consistently performs poorly on a test, why do we continue to use it?”
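As a rough illustration of how disparate impact might be quantified, the sketch below compares selection rates between two groups. The function name, the counts, and the 0.8 threshold (the "four-fifths" rule of thumb used in U.S. employment selection guidelines) are assumptions added for illustration; they do not come from the sources cited in this entry. A ratio below the threshold is only a flag for closer review, not a finding of bias or unfairness.

```python
def selection_rate_ratio(selected_focal, total_focal, selected_ref, total_ref):
    """Return the focal group's selection rate divided by the reference group's."""
    if total_focal <= 0 or total_ref <= 0:
        raise ValueError("group totals must be positive")
    rate_focal = selected_focal / total_focal
    rate_ref = selected_ref / total_ref
    if rate_ref == 0:
        raise ValueError("reference group selection rate is zero")
    return rate_focal / rate_ref


# Hypothetical counts: students placed in a gifted program using a test cutoff.
ratio = selection_rate_ratio(selected_focal=6, total_focal=200,
                             selected_ref=30, total_ref=400)
FOUR_FIFTHS = 0.8  # assumed rule-of-thumb threshold, not a legal determination
flagged = ratio < FOUR_FIFTHS
print(f"selection-rate ratio = {ratio:.2f}, flagged for review: {flagged}")
```

A check of this kind speaks only to the social consequences of test use; it says nothing about whether the test is technically biased, which is a separate question.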
Fundamentally, all concerns about bias relate to differential performance between and among groups. Why does one group perform differently than another group (Black/CLD or White, female or male, high income or low income) on a consistent basis? Attempts to account for differential performance target the individual characteristics of examinees, the testing environment, and/or characteristics of the test or test items (Scheuneman, 1985). Four types of bias are often discussed.
Bias in Construct Validity. Bias in construct validity is present when a test is shown to measure different hypothetical constructs or traits for one group than another; this type of bias also exists when the test measures the same trait for groups but with differing degrees of accuracy. Statistics regarding factor structure are often employed here. Specifically, a biased test will show different factor structures across subgroups. There will be a lower degree of similarity for the factor structure and the rank or item difficulty across groups (Sattler, 1992). The basic question here is: Does the item or test measure what it is intended to measure? A key illustration relates to language. Testing a student in English who has yet to become proficient in English is problematic. An intelligence test then becomes a language test. Certain students or groups may have the knowledge and experiences needed to answer the item correctly but cannot do so if they do not understand the question due to language barriers.
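One common way to summarize the similarity of factor structures across groups is Tucker's congruence coefficient, sketched below. This particular index is an assumption chosen for illustration, not necessarily the statistic used by the authors cited above, and the loadings are hypothetical.

```python
import math


def congruence_coefficient(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two vectors of factor loadings."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("both groups need loadings for the same subtests")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross / (norm_a * norm_b)


# Hypothetical loadings of five subtests on one factor, estimated per group.
group_1 = [0.80, 0.75, 0.70, 0.65, 0.60]
group_2 = [0.78, 0.74, 0.30, 0.66, 0.58]  # third subtest loads much lower here

phi = congruence_coefficient(group_1, group_2)
print(f"congruence = {phi:.3f}")
```

Values near 1 suggest a similar pattern of loadings across groups. Because the coefficient is a single summary number, a marked difference on one subtest can be masked, so individual loadings would still need to be inspected.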
Bias in Content Validity. Bias in content validity is present when an item or subscale is relatively more difficult for members of one group than another after the general ability level of the two is held constant. For example, if asked the question, “How are soccer and football alike?” a student or group who has never played or watched or had discussions about soccer is at a disadvantage. Lack of exposure and experience places them at a disadvantage. Reynolds (1998) defined content bias in this way: “an item or subscale of a test is considered biased when it is demonstrated to be relatively more difficult for members of one group than another when the general ability of both groups is held constant and no reasonable theoretical rationale exists to explain group differences on the item or subscale in question” (cited in Gregory, 2004, p. 243). Reynolds (1998) lists three examples of content bias:
The items ask for information that minority persons have not had equal opportunity to learn;
The scoring of the item is inappropriate, because the test author/developer has arbitrarily decided on the only correct answer and minority groups are inappropriately penalized for giving answers that would be correct in their own culture;
The wording of questions is unfamiliar, and minority groups who may know the answer may not be able to respond because they do not understand the question(s) and/or are unfamiliar with the test format.
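The idea of comparing item difficulty across groups while holding general ability constant can be sketched with the Mantel-Haenszel common odds ratio, a standard technique for detecting differential item functioning. It is offered here as an assumed illustration rather than as the method used in the sources above; the function name, the score bands, and all counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across ability strata.

    Each stratum is a tuple (ref_correct, ref_wrong, focal_correct, focal_wrong)
    for examinees with a similar total test score.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_correct, ref_wrong, focal_correct, focal_wrong in strata:
        n = ref_correct + ref_wrong + focal_correct + focal_wrong
        if n == 0:
            continue  # skip empty score bands
        numerator += ref_correct * focal_wrong / n
        denominator += ref_wrong * focal_correct / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    return numerator / denominator


# Hypothetical counts for one item in low, middle, and high total-score bands.
item_counts = [
    (20, 30, 10, 40),
    (35, 15, 20, 30),
    (45, 5, 35, 15),
]
odds_ratio = mantel_haenszel_odds_ratio(item_counts)
print(f"Mantel-Haenszel odds ratio = {odds_ratio:.2f}")
```

An odds ratio near 1 suggests the item is about equally difficult for both groups at matched ability. A value well above 1 suggests the item favors the reference group. In practice the result would be accompanied by a significance test and a substantive review of the item.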
Bias in Item Selection. Bias in item selection is present when the items and tasks selected are based on the learning experiences and language of the dominant group. This bias is closely related to content validity, but addresses more directly concerns regarding the appropriateness of individual items. While the overall test may not be biased statistically, a few items in it can be. Essentially, this issue concerns how one item gets included in a test while another does not.
Bias in Predictive or Criterion-Related Validity. Bias in predictive or criterion-related validity is present when the inference drawn from the test score is not made with the smallest feasible random error or when there is constant error in an inference or prediction as a function of membership in a particular group. The overarching question here is: “Do the test scores accurately predict how the student or group will perform on a task in the future?” It is often presumed that a high intelligence score predicts a high grade point average and success in college and on the job, and so much more. A concern of opponents is that intelligence tests are given too much power, and if a student or group scores low on an intelligence test, there is a high probability that they will be denied an opportunity to access a program or service because expectations for them are low. From a statistical standpoint, a test is considered “unbiased if the results for all relevant subpopulations cluster equally well around a single regression line … an unbiased test predicts performance equally for all groups, even though their means may be different” (Gregory, 2004, p. 244).
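The single-regression-line criterion can be sketched by fitting a separate least-squares line for each group and comparing slopes and intercepts. The data below are invented for illustration, the helper name is hypothetical, and a real analysis would also test whether any differences exceed sampling error.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("test scores must vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (test score, later grade point average) pairs for two groups.
scores_a = [85, 95, 105, 115, 125]
gpa_a = [2.0, 2.4, 2.8, 3.2, 3.6]
scores_b = [75, 85, 95, 105, 115]
gpa_b = [2.1, 2.5, 2.9, 3.3, 3.7]

slope_a, intercept_a = fit_line(scores_a, gpa_a)
slope_b, intercept_b = fit_line(scores_b, gpa_b)
print(f"group A: slope={slope_a:.3f}, intercept={intercept_a:.2f}")
print(f"group B: slope={slope_b:.3f}, intercept={intercept_b:.2f}")
```

In this made-up data the slopes match but the intercepts differ by half a grade point, so a single pooled line would systematically under-predict the later performance of group B at any given test score. That is the kind of constant error associated with group membership that the definition above describes. A difference in mean test scores alone would not indicate predictive bias.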
In newer editions of intelligence tests, most producers endeavor to ensure that their tests are low in bias, and their manuals address such efforts. No matter how diligent these efforts are, there is no such thing as a bias-free test; nonetheless, we must aim for bias-reduced tests. Some suggestions for achieving this goal are as follows:
Translate tests into the language of the examinee;
Use interpreters to translate test items for examinees;
Examine all test items/tasks to see whether groups perform differently on them, and eliminate those on which they do;
Eliminate items that are offensive to examinees;
When interpreting test scores, always consider the examinee's background experience;
Do not support the assumption of homogeneous experience or equal opportunity to learn; groups have different backgrounds and experiences that affect their test performance;
Never base decisions on one test and/or one score. One piece of information or lone score cannot possibly be useful in making effective and appropriate decisions;
Do not interpret test scores in isolation; collect multiple data and use this comprehensive method to make decisions;
When an individual or group scores low, consider that the test may be the problem; it may be inappropriate and should be eliminated;
If a group consistently performs poorly on an intelligence test, explore contributing factors and the extent to which it is useful/helpful for that group (Griggs Principle);
Always consider the technical and social merits of tests. A test can be technically unbiased and simultaneously unfair (i.e., have a disparate impact);
Review norming data and sample sizes; while diverse groups can be proportionately represented in the standardization sample, their actual numbers may be too small to be representative, which hinders generalizability;
Include culture-fair or culture-reduced tests in the assessment or decision making process; these tests are designed to minimize irrelevant influences of cultural learning and social climate and, thereby, produce a clearer separation of ability or performance from learning opportunities; non-verbal intelligence tests fall into this category, with their reduced cultural and linguistic loadings (see Bracken & Naglieri, 2003; Flanagan & Ortiz, 2001);
Always use and interpret test scores with testing principles and standards in mind, such as those published by the American Psychological Association and others (1999), which address professional responsibility and ethics, as well as working effectively with culturally diverse populations (Ford & Whiting, 2006; Whiting & Ford, 2006).
As of 2004, culturally diverse students comprised some 43% of the U.S. public school population, and demographers predicted that this percentage would increase. Given the rapid changes in school demographics and the ever-increasing reliance on tests for decision-making purposes, the discussion of test bias was anticipated to continue. Testing is here to stay, and high-stakes testing is on the rise as of the early 2000s. Thus, the power of tests to open or close doors is increasing and of increasing concern.
While test developers increasingly work to decrease biases in their tests and, in effect, to increase the usefulness of their measures, controversy continues. It has been argued that tests in and of themselves are harmless tools, a philosophical viewpoint that often fails to hold true in actual practice. “Unfortunately, the tendency to imbue intelligence test scores with inaccurate and unwarranted connotations is rampant … Test results are variously over-interpreted or under-interpreted, viewed by some as a divination of personal worth but devalued by others as trivial and unfair” (Gregory, 2004, p. 240). While not intended for this purpose, in practice, tests do serve as gatekeepers, often resulting in closed doors and limited options for Black and other diverse groups (Ford & Joseph, 2006). Moreover, if misuse and misinterpretation were not problematic, there would be no need for task forces and standards to hold educators accountable (see works by Association of Black Psychologists, and the joint testing standards of APA, AERA, and NCME, 1999).
Despite the best intentions to develop tests that are low or reduced in bias, human error, in the form of stereotypes and prejudice, undermines test administration, interpretation, and use. More often than not, African American and other culturally diverse students are the recipients of this inequity.
American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Armour-Thomas, E., & Gopaul-McNicol, S. (1998). Assessing intelligence: Applying a bio-cultural model. Thousand Oaks, CA: Sage.
Bracken, B. A., & Naglieri, J. A. (2003). Assessing diverse populations with nonverbal tests of general intelligence. In C. R. Reynolds & R. W. Kamphaus (Eds.), Handbook of psychological and educational assessment of children (2nd ed.). New York: Guilford.
Cole, N. S., & Zieky, M. J. (2001). The new faces of fairness. Journal of Educational Measurement, 38(4), 369–382.
Ford, D. Y. (2004). Intelligence testing and cultural diversity: Concerns, cautions, and considerations. Storrs, CT: University of Connecticut, National Research Center on the Gifted and Talented.
Ford, D. Y. (2007). Intelligence testing and cultural diversity: The need for alternative instruments, policies, and procedures. In J. L. VanTassel-Baska (Ed.), Alternative assessments with gifted and talented students (pp. 107–128). Waco, TX: Prufrock Press and the National Association for Gifted Children.
Ford, D. Y., & Joseph, L. M. (2006). Non-discriminatory assessment: Considerations for gifted education. Gifted Child Quarterly, 50(1), 41–51.
Ford, D. Y., & Whiting, G. W. (2006). Under-representation of diverse students in gifted education: Recommendations for nondiscriminatory assessment (part 1). Gifted Education Press Quarterly, 20(2), 2–6.
Gregory, R. J. (2004). Psychological testing: History, principles, and applications. Boston: Allyn & Bacon.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Reynolds, C. R. (1998). Cultural bias in testing of intelligence and personality. In A. Bellack & M. Hersen (Series Eds.) & C. Belar (Vol. Ed.), Comprehensive clinical psychology: Sociocultural and individual differences. New York: Elsevier Science.
Sattler, J. M. (1992). Assessment of children (3rd ed.). San Diego: Jerome M. Sattler.
Scheuneman, J. (1985). Exploration of causes of bias in test items. GRE Board Professional Report GREB No. 81–21P, ETS Research Report 85–42. Princeton, NJ: Educational Testing Service.
Whiting, G. W., & Ford, D. Y. (2006). Under-representation of diverse students in gifted education: Recommendations for nondiscriminatory assessment (part 2). Gifted Education Press Quarterly, 20(3), 6–10.
Williams, R. (1970). Danger: Testing and dehumanizing Black children. Clinical Child Psychology Newsletter, 9(1), 5–6.