Testing and Individual Differences for AP Psychology
Standardization and Norms
Psychometrics is the measurement of mental traits, abilities, and processes. Psychometricians develop tests to measure some construct or behavior that distinguishes among people. Constructs are ideas that help summarize a group of related phenomena or objects; they are hypothetical abstractions related to behavior and defined by groups of objects or events. For example, we can't measure happiness, honesty, or intelligence in feet or meters. If someone tells the truth in a wide variety of situations, however, we might consider that person honest. Although we cannot observe happiness, honesty, or intelligence directly, they are useful concepts for understanding, describing, and predicting behavior.

Psychological tests include tests of abilities, interests, creativity, personality, and intelligence. A good test is standardized, reliable, and valid. After many questions for a test have been written, edited, and pretested, questions are thrown out if nearly everyone answers them correctly or if almost no one does, because such questions tell us nothing about individual differences. A test is then assembled from questions that differentiate among test takers and that fairly sample all aspects of the behavior to be assessed. It is administered to a sample of hundreds or thousands of people who fairly represent all of the people who are likely to take the test. This sample is used to standardize the test. Standardization is a two-part test development procedure: it first establishes test norms from the results of the large representative sample that initially took the test, then ensures that the test is administered and scored uniformly for all test takers.
Norms are scores established from the test results of the representative sample, which are then used as a standard for assessing the performances of subsequent test takers; more simply, norms are the standards used to compare the scores of later test takers. For example, based on the standardization sample, the mean score for the SAT is 500 with a standard deviation of 100, whereas the mean score for the Wechsler Adult Intelligence Scale (an IQ test) is 100 with a standard deviation of 15. When administering a standardized test, all proctors must give the same directions and time limits and provide the same conditions as all other proctors. All scorers must use the same scoring system, applying the same standards to rate responses as all other scorers. Thus, we should earn the same test score no matter where we take the test or who scores it.
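The role of norms can be shown with a short calculation. A raw score is converted to a z-score (the number of standard deviations above or below the norm group's mean), which lets scores from differently scaled tests be compared. This is a minimal sketch using the means and standard deviations cited above; the function name and sample scores are illustrative, not part of any published test material.

```python
def z_score(raw, mean, sd):
    """Standardize a raw score against the norm group's mean and standard deviation."""
    return (raw - mean) / sd

# SAT norms: mean 500, SD 100; WAIS norms: mean 100, SD 15
sat_z = z_score(650, 500, 100)     # (650 - 500) / 100 = 1.5
iq_z = z_score(122.5, 100, 15)     # (122.5 - 100) / 15 = 1.5

# Both test takers stand at the same point relative to their norm groups
print(sat_z, iq_z)  # 1.5 1.5
```

Because both z-scores are 1.5, an SAT score of 650 and a WAIS score of 122.5 represent the same standing relative to their respective standardization samples.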
Reliability and Validity
Not only must a good test be standardized, it must also be reliable and valid.
If a test is reliable, we should obtain the same score no matter where, when, or how many times we take it (if other variables remain the same). Several methods are used to determine whether a test is reliable. In the test-retest method, the same exam is administered to the same group on two different occasions and the scores are compared; the closer the correlation coefficient is to +1.0, the more reliable the test. The problem with this method of determining reliability, or consistency, is that performance on the second test may be better because test takers are already familiar with the questions. In the split-half method, the score on one half of the test questions is correlated with the score on the other half to see whether they are consistent; one way to do this is to compare the score on all the odd-numbered questions with the score on all the even-numbered questions. In the alternate-form (or equivalent-form) method, two different versions of a test on the same material are given to the same test takers, and the scores are correlated. The SAT administered on Saturday differs from the SAT administered on Sunday in October; each form has different questions. Although this does not happen in practice, if the same people took both exams and the tests were highly reliable, their scores should be nearly the same on both. High reliability also requires high interrater reliability, the extent to which two or more scorers evaluate the same responses in the same way.
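The split-half method described above can be sketched in a few lines: correlate each test taker's score on the odd-numbered items with their score on the even-numbered items. The score data below are hypothetical, and the Spearman-Brown step is a standard psychometric adjustment (not mentioned in this guide) that estimates full-length-test reliability from the half-test correlation.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores of five test takers on the two halves of one test
odd_half = [10, 14, 8, 12, 16]
even_half = [11, 13, 9, 12, 15]

r = pearson_r(odd_half, even_half)
# Spearman-Brown correction: estimated reliability of the full-length test
full_test_r = 2 * r / (1 + r)
print(round(r, 3), round(full_test_r, 3))
```

A correlation near +1.0, as here, indicates that the two halves rank test takers consistently, which is the evidence of internal consistency the split-half method looks for.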
Tests can be very reliable, but if they are not also valid, they are useless for measuring the particular construct or behavior. Psychometricians must present data to show that a test measures what it is supposed to measure accurately, and that the results can be used to make accurate decisions. Because there are no universal standards against which test scores can be compared, validation is most frequently accomplished by obtaining high correlations between the test and other assessments. Validity is the extent to which an instrument accurately measures or predicts what it is supposed to measure or predict. Just as there are several methods for measuring reliability, there are also several methods for measuring validity.
- Face validity is a measure of the extent to which the content of the test appears, according to the test takers, to cover the knowledge or skills that are supposed to be included within the domain being tested. For example, we expect the AP Psychology exam to ask between five and seven questions dealing with testing and individual differences on the multiple-choice section of the test, as defined by the content outline for the course, which sets the structure and boundaries for the content domain.
- Content validity is a measure of the extent to which the content of the test measures all of the knowledge or skills that are supposed to be included within the domain being tested, according to expert judges.
- Criterion-related validity is a measure of the extent to which a test's results correlate with other accepted measures of what is being tested.
- Predictive validity is a measure of the extent to which the test accurately forecasts a specific future result. For example, the SAT is designed to predict how well someone will succeed in his/her freshman year in college. High scores on the SAT should predict high grades for the first year in college.
- Construct validity, which some psychologists consider the true measure of validity, is the extent to which the test actually measures the hypothetical construct or behavior it is designed to assess. The MMPI-2, for example, has a clinical scale of questions for schizophrenia. The test has construct validity if this subset of questions successfully discriminates people with schizophrenia from other people taking the MMPI-2. Many people question whether intelligence tests have construct validity for measuring intelligence.
Types of Tests
Ask different psychometricians to categorize types of tests and they may give different answers, because tests can be categorized along many dimensions.
Performance, Observational, and Self-Report Tests
Psychological tests can be sorted into the three categories of performance tests, observational tests, and self-report tests. For a performance test, the test taker knows what he/she should do in response to questions or tasks on the test, and it is assumed that the test taker will do the best he/she can to succeed. Performance tests include the SATs, AP tests, Wechsler intelligence tests, Stanford-Binet intelligence tests, and most classroom tests, including finals, as well as computer tests and road tests for a driver's license. Observational tests differ from performance tests in that the person being tested does not have a single, well-defined task to perform, but rather is assessed on typical behavior or performance in a specific context. Employment interviews and formal on-the-job observations for evaluation by supervisors are examples of observational tests. Self-report tests require the test taker to describe his/her feelings, attitudes, beliefs, values, opinions, physical state, or mental state on surveys, questionnaires, or polls. The MMPI-2 exemplifies the self-report test.
Performance tests in which there is a correct answer for each item can be divided into two types, speed tests and power tests. Speed tests generally include a large number of relatively easy items administered with strict time limits under which most test takers find it impossible to answer all questions. Given more time, many test takers would probably score higher, so differences in scores among test takers are at least partly a function of the speed with which they respond. This differs from power tests, which allot enough time for test takers to complete the items of varying difficulty on the test, so that differences in scores among test takers are a function of the test taker's knowledge, and possibly good guessing.