Questions to Ask of Any Test
Whether you are about to take a test or build one, there are three questions to keep in mind. However, note that at this point we are considering the test as an information-gathering instrument—we are not talking about the test results.
Formative or Summative
The first question to ask is whether the test is to be formative or summative. Why does that matter? Certainly it matters to you as a student, right? And it would similarly matter to your students because formative tests do not carry that “evaluative” weight of summative tests. But as a test designer it is a particularly important question to ask. As we have already seen, formative and summative tests serve different purposes, and so it is important that you know just what you are trying to accomplish with the test you are about to write. Is it to get information for designing instruction or are you trying to determine your students’ level of achievement so that you can assign a grade? Clearly, the answer to this question of formative or summative has implications for what happens after the test.
Internal Consistency
The next question concerns what is called the internal consistency of the test, that is, the integrity of the instrument (the test) for doing what it is supposed to do. Internal consistency is addressed in terms of validity and reliability.
Validity
Validity, quite simply, is a matter of whether the test measures what it is supposed to measure. If you are going to write a test for what you taught in science but include a few questions from social studies “just to see whether they were paying attention,” then the test is not valid. If you write a test about that same science lesson but include something that was not in the chapter the class actually studied, or that was in a part of the chapter you didn’t get to, then the test is not valid. You could announce that anything from the chapter could be on the test whether or not you discussed it in class, but then you would have to ask yourself whether the teacher is really necessary or whether we should just assign readings and give the students a test. We don’t want to seem silly about this, but you and a classmate have very likely walked away from a test complaining that something was on it that you hadn’t expected. Now that you are writing the test, you want to be sure that only the appropriate material is included, that is, that the test is valid. So, before giving a test, ask yourself whether the test you have written is valid rather than waiting for your students to tell you that it isn’t.
Content Validity
As background, there are three basic categories of validity, though your test writing will almost always be focused on one type in particular. Content validity refers to a test of knowledge or comprehension of knowledge. Spelling, math, history, science—these are all content areas. Obviously, the vast majority of tests that you write will revolve around the idea of content validity. That means that you will want to be sure that the test addresses what you have actually done in class (or assigned) and nothing else.
Construct Validity
Constructs are areas that can’t be assessed by just asking for knowledge-based responses. Musical ability is a construct. So is creative thinking. Another is intelligence. For these areas, and others like them, an assessment instrument will ask a variety of questions or perhaps require the demonstration of various skills, and from the responses we infer a degree of intelligence, or of creative thinking, or of whatever construct is under consideration.
Predictive Validity
You will rarely, if ever, need to design a test based upon predictive validity. It could be argued that mastering the work in one grade level could predict success at the next grade level, but that would not really be accurate. Mastering work at one level provides the foundation for attempting the next level, so it does not make a prediction so much as set up an expectation. The tests you may have taken to get into college, on the other hand, were designed to predict your future academic success. As we all know, the predictions are not always accurate, though that is the nature of predictions. On the whole, however, a high score on something such as the SAT or the ACT tends to correlate positively with college success.
Reliability
Supposing that we feel pretty confident that a particular test is valid, the next question is that of reliability, which is the question of whether or not a test gets consistent results. A test that every class aces each time it is administered is reliable. So is a test that everybody fails each time. Even one where students routinely have so-so scores may be considered reliable. Admittedly, this is a concern that is most closely related to standardized tests that will be used in their same form over and over again. If on such a test the scores from one administration to the next vary widely, the test designers would have to consider whether that variation comes from the different populations taking the test or from something about the test itself that is coming across differently to different groups. It would not be good assessment practice to attach a particular score to a student’s performance (something that could have significant implications for that child’s further academic work) only to decide some years later that the test was flawed. Reliability becomes a very important consideration.
During your teacher-education program you likely had a course in tests and measurement. At that time you may have addressed several methods of determining the reliability of a test such as test–retest, split-half, and alternate forms. With that in mind, and considering that this book is not intended as a course in assessment, we are going to briefly consider those three approaches in a nonstatistical manner.
Test–Retest Reliability
The test–retest approach to checking the reliability of an assessment is not the same as a pretest/posttest situation. In the latter, you may be giving your students a test about some topic before you teach the topic (that’s the pretest). Your intent is to find out what they already know (or don’t know) so that you can design appropriate instructional experiences. Following your teaching you will again test the students (the posttest) to see what progress they have made. The key element here is that instruction on the topic occurs between the pretest and the posttest.
In a test–retest consideration of reliability the key element is that no instruction (on that particular topic) occurs between the test and the retest. Having finished a particular lesson or unit you may decide to test your students. You put together a nice valid test of the content taught and let the students have at it. Then you let some time go by. Could be that everybody goes home for a holiday vacation or you continue on with some other topics to be presented. In either case, after a week or so you give the same test to the students. They, of course, will be wondering why they are taking the same test again. You will know that the reason you have readministered the test is not actually to check the students but instead to check the reliability of the test.
There probably aren’t many teachers who really want to see how the students do a week or two later (expecting the scores to drop precipitously), but keep two things in mind. First, if the scores are relatively parallel in terms of their distribution (even if, e.g., they are ten points lower across the board), then you probably have a reliable test that you can feel comfortable using again, as long as the instructional experiences you provide to the next set of students are consistent. Second, it’s that second set of scores that will really tell you what your students learned. And that could be very worthwhile to know.
If the scores fluctuate substantially, however, you may want to take a look at the test to see whether there is something about the wording or the tasks involved that is coming across differently to students the second time through. In terms of a paradigm, test–retest uses one test, administered to one group, two times.
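Although we are treating these approaches in a nonstatistical manner, the comparison of a test and a retest can be sketched in a few lines of Python for those who keep scores in a spreadsheet or gradebook file. This is only an illustration: the scores are made up, and the use of a Pearson correlation as the measure of "relatively parallel" is our assumption, not a requirement of the approach.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same six students, listed in the same order.
first_scores = [95, 88, 76, 70, 64, 52]
retest_scores = [85, 78, 66, 60, 54, 42]  # ten points lower across the board

# A correlation near 1 means students kept their relative positions.
r = pearson_r(first_scores, retest_scores)

# The average drop tells you what was retained, not whether the test is reliable.
average_drop = sum(a - b for a, b in zip(first_scores, retest_scores)) / len(first_scores)

print(f"correlation between test and retest: {r:.2f}")
print(f"average drop in score: {average_drop:.1f}")
```

Here every student dropped by the same amount, so the two sets of scores are perfectly parallel even though the retest scores are lower. A correlation well below 1 would be the cue to look at the wording or the tasks on the test.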
Split-Half Reliability
Don’t want to take the time to test them twice? Well, there are other options. If you are writing a test for a subject in which multiple questions could be asked that are parallel in what they assess (e.g., a math test that could have twenty or thirty variations of two-digit by three-digit multiplication problems), then the split-half approach would give you an idea of the reliability of the test. Let’s use the example of two-digit by three-digit multiplication problems. Write out perhaps twenty problems, all of which follow the same two-digit by three-digit format. When you score the tests, however, score Items 1–10 and write out a result for each student (5/10, 7/10, etc.) and then score Items 11–20 separately. If the two scores are fairly similar, then chances are good that the collection of problems you have assembled is a reliable test for that topic. Hold on to it and keep using it instead of staying up late every year coming up with a new test.
If the scores seem significantly different, then you need to look over the problems to see whether there was a pattern in which problems were missed. Keep in mind, just as we said about test–retest, that your interest here is in the test instrument more so than in student achievement. You are in the process of designing a good, strong test for that particular topic. A paradigm for this approach: one test, given to one group, one time.
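The split-half scoring described above can be sketched as follows. The student names, the item results, and the gap of three points used to flag a student are all hypothetical; choose whatever threshold seems sensible for your own test.

```python
# Hypothetical item-level results for a twenty-item test:
# 1 = correct, 0 = incorrect, listed in item order.
results = {
    "Student A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "Student B": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    "Student C": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
}

GAP_TO_FLAG = 3  # assumed threshold, not a standard value


def split_half_scores(item_results):
    """Score Items 1-10 and Items 11-20 separately."""
    first_half = sum(item_results[:10])
    second_half = sum(item_results[10:])
    return first_half, second_half


flagged = []
for name, items in results.items():
    first, second = split_half_scores(items)
    print(f"{name}: Items 1-10 = {first}/10, Items 11-20 = {second}/10")
    if abs(first - second) >= GAP_TO_FLAG:
        flagged.append(name)

print("Halves differ noticeably for:", flagged)
```

If many students end up flagged, that points toward the test rather than the students, and the problems in the weaker half deserve a closer look.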
Alternate Forms Reliability
The third approach is a little more labor intensive. On the upside, it can leave you with two or more good tests to use so that throughout the day (if you are teaching middle school or high school), or year-to-year (for you elementary school teachers), students are not tipping off other students about the upcoming test. What we are talking about is the alternate forms approach to checking test reliability.
Alternate forms reliability can work well with subject areas that require responding to questions (as opposed to solving problems). Let’s suppose you are writing a test for a social studies lesson and will use a short-answer format. This approach to checking for reliability will require that you write out two (or more) separate tests. On both tests you will ask questions about the same topic but in slightly different ways. The questions need to be roughly parallel; that is, both tests should ask about the same items rather than covering some items on one test and others on the second version. Give some of your students Form A and some Form B. When you compare the results, scores should be essentially equivalent on both forms. If that is not the case, examine the two tests to find out where the disparity lies. For some reason, those particular questions are not coming across to the students as you had intended.
As you can probably see, there is a real advantage to doing this. Once you have constructed two tests that yield equivalent results, you could use both at the same time (alternating from one student’s desk to another) and eliminate cheating during the test—at least once you’ve pointed out that the students all around them have different tests. This was likely the case when you took standardized tests. The test designers have many forms of those tests, and even if they use just one during your particular test session, the one they use for the next test session is likely different. The paradigm for alternate forms: two tests, given to one group, one time.
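For those who want to see the comparison of two forms laid out, here is a minimal sketch. All of the numbers are invented, and the tolerance of 0.2 for deciding that a pair of parallel questions is behaving differently is our assumption.

```python
from statistics import mean

# Hypothetical total scores (out of 20) for the students who took each form.
form_a_scores = [18, 15, 16, 12, 17]
form_b_scores = [17, 16, 15, 13, 16]

mean_a = mean(form_a_scores)
mean_b = mean(form_b_scores)
print(f"Form A mean: {mean_a}, Form B mean: {mean_b}")

# Hypothetical proportion of students answering each parallel question correctly.
# Position 0 is Question 1 on both forms, position 1 is Question 2, and so on.
form_a_item_rates = [0.90, 0.80, 0.85, 0.70]
form_b_item_rates = [0.85, 0.80, 0.40, 0.75]

TOLERANCE = 0.2  # assumed cutoff for "not essentially equivalent"

# Collect the question numbers where the two forms disagree by more than the tolerance.
disparate_questions = [
    number
    for number, (rate_a, rate_b) in enumerate(
        zip(form_a_item_rates, form_b_item_rates), start=1
    )
    if abs(rate_a - rate_b) > TOLERANCE
]

print("Questions to re-examine:", disparate_questions)
```

In this made-up case the overall means are close, but one pair of questions stands out, which is where you would look first for wording that is not coming across as intended.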
Validity and reliability are important considerations for test designers. Even for tests that you write for one-time use with a single class, or even for one student, ensuring that the test assesses what it is supposed to assess and that it is a reliable assessment instrument is part of your responsibility as a professional educator.
Statistical Referents
The third question that should be asked is a matter of how the results will be referenced. This question addresses several concerns. The first is whether students will have to reach a particular score to receive a particular grade, no matter how the other students do on the test (criterion referencing), or whether each student’s grade will be determined by the overall performance of the group taking the test (group referencing), sometimes known in the classroom setting as grading on the curve. The overwhelming majority of tests that you write will be criterion referenced. For example, the test is worth 100 points and students must reach 93 or above to receive an A. Though an interesting mix of criterion-referenced and group-referenced tests is used in the schools, the classroom tends to emphasize an individual’s performance against an announced standard rather than comparing students to each other.
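The difference between the two kinds of referencing can be sketched in code. Only the cutoff of 93 for an A comes from the example above. The other cutoffs, the class scores, and the rule that the top 20 percent of the group receives an A are hypothetical, and the curve rule is just one simple way of grading on the curve.

```python
# Only the 93-point cutoff for an A comes from the text; the rest are assumed.
CUTOFFS = [(93, "A"), (85, "B"), (77, "C"), (70, "D")]


def criterion_grade(score, cutoffs=CUTOFFS):
    """Criterion referencing: the grade depends only on the student's own score."""
    for minimum, letter in cutoffs:
        if score >= minimum:
            return letter
    return "F"


def earns_a_on_curve(score, all_scores, top_fraction=0.2):
    """Group referencing (one simple version): an A goes to the top fraction of the group."""
    ranked = sorted(all_scores, reverse=True)
    number_of_as = max(1, round(len(ranked) * top_fraction))
    lowest_a_score = ranked[number_of_as - 1]
    return score >= lowest_a_score


# Hypothetical class of ten students.
class_scores = [96, 94, 93, 90, 88, 81, 79, 75, 72, 60]

for score in (96, 94, 93):
    print(
        f"score {score}: criterion grade = {criterion_grade(score)}, "
        f"A on the curve = {earns_a_on_curve(score, class_scores)}"
    )
```

With these numbers, a student scoring 93 earns an A against the announced standard but not on the curve, because two classmates scored higher. The same score leads to a different grade depending on how the results are referenced.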
Also of concern to you, particularly as a test giver rather than as a test taker, will be the summary statistics (the measures of central tendency and the standard deviation). A look at the summary statistics will help determine whether you need to make some adjustments in your teaching or test preparation. In other cases you may find that the test simply was a good detector of student strengths and weaknesses. And that, by the way, is the kind of test we want to use.
You will almost certainly calculate the mean for the distribution of scores to give you a sense of how the class as a whole fared on the test. But you should also take the time to look at the mode and the median. If all three measures of central tendency are the same value, then you likely have a symmetrical distribution of scores. That is an indication that the test itself may have been a good assessment instrument. On the other hand, a mean that falls well above or below the mode should start you considering whether the test was simply too difficult or too easy. Note: We are not suggesting that you write tests where only a few students do well and everybody else is average! We are simply saying that you should verify that the results of the test you are using truly reflect academic achievement rather than simply being a test that was not well prepared.
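The three measures of central tendency are quick to compute with Python's standard `statistics` module. The two sets of scores below are invented for illustration: one where the three measures agree and one where they do not.

```python
from statistics import mean, median, mode


def central_tendency(scores):
    """Return the mean, median, and mode for a list of scores."""
    return {"mean": mean(scores), "median": median(scores), "mode": mode(scores)}


# Hypothetical class where the scores are spread evenly around the middle.
balanced_scores = [70, 75, 80, 80, 80, 85, 90]

# Hypothetical class where several students scored high and the rest trailed off.
lopsided_scores = [40, 55, 60, 65, 90, 90, 90]

balanced = central_tendency(balanced_scores)
lopsided = central_tendency(lopsided_scores)

for label, summary in (("balanced", balanced), ("lopsided", lopsided)):
    gap = summary["mean"] - summary["mode"]
    print(f"{label}: {summary}, mean minus mode = {gap}")
```

In the first set all three measures are 80. In the second the mean is 70, the median 65, and the mode 90, the kind of gap between mean and mode that should send you back to look at both the test and the instruction.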