Suppose teachers wished to determine which of two methods of reading instruction was more effective—one that involved 20 minutes of direct instruction in phonics each day throughout the academic year in grade 1 or one that involved the current practice of having the teacher read a book to the class for 20 minutes each day throughout the year in grade 1. Similarly, suppose they wished to determine whether children learn better in a small class (i.e., with 15 students) or a large class (i.e., with 30 students). Finally, suppose they wished to determine whether requiring students to take a short quiz during each meeting of a college lecture class would result in better performance on the final exam than not giving quizzes.

Each of these situations can be examined best by using experimental research methodology in which investigators compare the mean performance of two or more groups on an appropriate test. In experimental research, it is customary to distinguish between the independent variable and the dependent measure. The independent variable is the feature that is different between the groups—for example, whether 20 minutes of time each day is used for phonics instruction or reading aloud to students, whether the class size is small or large, or whether a short quiz is given during each class meeting. The dependent measure is the score that is used to compare the performance of the groups—for example, the score on a reading test administered at the end of the year, the change in performance on academic tests from the beginning of the year to the end of the year, or the score on a final exam in the class. When researchers compare two or more groups on one or more measures, they use experimental research methodology.


Experimental research is based on a methodology that meets three criteria: (a) random assignment—the subjects (or other entities) are randomly assigned to treatment groups, (b) experimental control—all features of the treatments are identical except for the independent variable (i.e., the feature being tested), and (c) appropriate measures—the dependent measures are appropriate for testing the research hypothesis. For example, in the class size example, random assignment involves finding a group of students and randomly choosing some to be in small classes (i.e., consisting of 15 students) and some to be in large classes (i.e., consisting of 30 students). The researcher cannot use pre-existing small or large classes because doing so would violate the criterion of random assignment. The problem with violating random assignment is that the groups may systematically differ; for example, students in the smaller classes may be at more wealthy schools that also have more resources, better teachers, and better-prepared students. This violation of the random assignment criterion, sometimes called self-selection, is a serious methodological flaw in experimental research.
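The logic of random assignment can be sketched in a few lines of code. The student pool, group labels, and equal group sizes below are hypothetical illustrations, not details from any study described here:

```python
import random

def randomly_assign(students, group_labels):
    """Shuffle the pool and deal students into equal-sized groups.

    Random assignment (rather than using pre-existing classes) guards
    against self-selection: no student characteristic systematically
    determines which treatment group a student ends up in.
    """
    pool = list(students)
    random.shuffle(pool)
    n = len(pool) // len(group_labels)
    return {label: pool[i * n:(i + 1) * n]
            for i, label in enumerate(group_labels)}

# Hypothetical pool of 90 students assigned to two treatment conditions.
students = [f"student_{i}" for i in range(90)]
groups = randomly_assign(students, ["small_class", "large_class"])
```

Because assignment depends only on the shuffle, any pre-existing difference among students (school wealth, prior achievement) is spread across both groups on average.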

In the class size example, the criterion of experimental control is reflected in having the classes equivalent on all relevant features except class size. That is, large and small classes should have teachers who are equivalent in teaching skill, students who are equivalent in academic ability, and classrooms that are physically equivalent; they should also have equivalence in support services, length of school day, percentages based on gender, English language proficiency, ethnicity, and so on. If the groups differ on an important variable other than class size, determining whether differences in test performance can be attributed to class size will be difficult. This violation of the experimental control criterion, called confounding, is a serious methodological flaw in experimental research.

Finally, in the class size example, the dependent measure should test the research hypothesis that class size affects academic learning, so an appropriate measure would be to give an achievement test covering the curriculum at the start and end of the year. The appropriate measures criterion would be violated if the dependent measure were a survey asking students how well they enjoyed school this year or an ungraded portfolio of their artwork over the year. When a test does not measure what is intended, the test lacks validity; invalid tests represent a serious methodological flaw in experimental research.


Experimental research is generally recognized as the most appropriate method for drawing causal conclusions about instructional interventions, for example, which instructional method is most effective for which type of student under which conditions. In a careful analysis of educational research methods, Richard Shavelson and Lisa Towne concluded that “from a scientific perspective, randomized trials (we also use the term experiment to refer to causal studies that feature random assignment) are the ideal for establishing whether one or more factors caused change in an outcome because of their strong ability to enable fair comparisons” (2002, p. 110). Similarly, Richard Mayer notes: “experimental methods— which involve random assignment to treatments and control of extraneous variables—have been the gold standard for educational psychology since the field evolved in the early 1900s” (2005, p. 74). Mayer states, “when properly implemented, they allow for drawing causal conclusions, such as the conclusion that a particular instructional method causes better learning outcomes” (p. 75). Overall, if one wants to determine whether a particular instructional intervention causes an improvement in student learning, then one should use experimental research methodology.

Although experiments are widely recognized as the method of choice for determining the effects of an instructional intervention, they are subject to limitations involving method and theory. First, concerning method, the requirements for random assignment, experimental control, and appropriate measures can impose artificiality on the situation. Perfectly controlled conditions are generally not possible in authentic educational environments such as schools. Thus, there may be a tradeoff between experimental rigor and practical authenticity, in which highly controlled experiments may be too far removed from real classroom contexts. Experimental researchers should be sensitive to this limitation by incorporating mitigating features in their experiments that maintain ecological validity.

Second, concerning theory, experimental research may be able to tell that one method of instruction is better than conventional practice, but may not be able to specify why; it may not be able to pinpoint the mechanisms that create the improvement. In these cases, it is useful to derive clear predictions from competing theories so that experimental research can be used to test them. In addition, more focused research methods—such as naturalistic observation or in-depth interviews—may provide richer data that allow for the development of a detailed explanation for why an intervention has its effect. Experimental researchers should be sensitive to this limitation by using complementary methods in addition to experiments that provide new kinds of evidence.


Three common research designs used in experimental research are between subjects, within subjects, and factorial designs. In between-subjects designs, subjects are assigned to one of two (or more) groups, with each group receiving a specific treatment. For example, in a between-subjects design, students may be assigned to spend two school years in a small class or a large class. In within-subjects designs, the same subject receives two (or more) treatments. For example, students may be assigned to a small class for one year and a large class for the next year, or vice versa. Within-subjects designs are problematic when experience with one treatment may spill over and affect the subject's experience in the following treatment, as would likely be the case with the class size example. In factorial designs, groups are based on two (or more) factors, such as one factor being large or small class size and another factor being whether the subject is a boy or girl, which yields four cells (corresponding to four groups). In a factorial design it is possible to test for main effects, such as whether class size affects learning, and interactions, such as whether class size has equivalent effects for boys and girls.
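As a sketch of how main effects and interactions are read off a 2 x 2 factorial design, consider the following hypothetical cell means (all numbers invented for illustration):

```python
# Hypothetical mean test scores for a 2 x 2 factorial design:
# factor 1 is class size (small/large), factor 2 is gender (boy/girl).
cell_means = {
    ("small", "boy"): 78.0,
    ("small", "girl"): 82.0,
    ("large", "boy"): 70.0,
    ("large", "girl"): 74.0,
}

def marginal_mean(level, position):
    """Average the cell means that share one level of one factor."""
    vals = [m for cell, m in cell_means.items() if cell[position] == level]
    return sum(vals) / len(vals)

# Main effect of class size: small-class marginal mean minus large-class marginal mean.
main_effect_size = marginal_mean("small", 0) - marginal_mean("large", 0)

# Interaction: does the class-size effect differ for boys and girls?
effect_for_boys = cell_means[("small", "boy")] - cell_means[("large", "boy")]
effect_for_girls = cell_means[("small", "girl")] - cell_means[("large", "girl")]
interaction = effect_for_boys - effect_for_girls  # zero here: the effect is equivalent
```

With these invented numbers, small classes show an 8-point advantage for both boys and girls, so there is a main effect of class size but no interaction with gender.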


Experimental research tests hypotheses and can provide evidence on which to base a causal relationship between factors. In the 1920s and 1930s, the English statistician Ronald A. Fisher (1890–1962) began testing hypotheses on crops by dividing them into groups that were similar in composition and treatment in order to isolate certain effects on the crops. Soon he and others began refining the same principles for use in human research.

To ensure that groups are similar when testing variables, researchers began using randomization. By randomly placing subjects into groups that, say, receive a treatment or receive a placebo, researchers help ensure that participants with the same features do not cluster into one group. The larger the study groups, the more likely randomization will produce groups approximately equal on relevant characteristics. Nonrandomized trials and smaller participant groups leave greater room for bias in group formation. In education research, these experiments also involve randomly assigning participants to an experimental group and at least one control group.

The Elementary and Secondary Education Act (ESEA) of 2001 and the Education Sciences Reform Act (ESRA) of 2002 both established clear policies from the federal government concerning a preference for “scientifically based research.” A federal emphasis on the use of randomized trials in educational research is reflected in the fact that 70% of the studies funded by the Institute of Education Sciences in 2001 were to employ randomized designs.

The federal government and other sources say that the field of education lags behind other fields in use of randomized trials to determine effectiveness of methods. Critics of experimental research say that the time involved in designing, conducting, and publishing the trials makes them less effective than qualitative research. Frederick Erickson and Kris Gutierrez of the University of California, Los Angeles argued that comparing educational research to medical research failed to consider social facts, as well as possible side effects.

Evidence-based research aims to bring scientific authority to all specialties of behavioral and clinical medicine. However, the effectiveness of clinical trials can be marred by bias from financial interests and other biases, as evidenced in recent medical trials. In a 2002 Hastings Center Report, physicians Jason Klein and Albert Fleischman of the Albert Einstein College of Medicine argued that financial incentives to physicians should be limited. In 2007 many drug companies and physicians were under scrutiny for financial incentives and full disclosure of clinical trial results.


CONSORT Transparent Reporting of Trials. (2008). Retrieved April 22, 2008, from http://www.consortstatement.org.

Constas, M. A. (2007). Reshaping the methodological identity of education research. Evaluation Review, 31(4), 391–399.

Erickson, F., & Gutierrez, K. (2002). Culture, rigor, and science in educational research. Educational Researcher, 31(8), 21–24.

Healy, B. (2006, September 11). Who says what's best? U.S. News and World Report, 141(9), 75.

Klein, J. E., & Fleischman, A. R. (2002). The private practicing physician-investigator: Ethical implications of clinical research in the office setting. Hastings Center Report, 32(4), 22–26.

Kopelman, L. M. (2004). Clinical trials. In S. Post (Ed.), Encyclopedia of bioethics (3rd ed., pp. 2334–2343). New York: Macmillan Reference USA.

National Cancer Institute. (2006). Clinical trials: questions and answers. Retrieved February 11, 2008, from http://www.cancer.gov/cancertopics/factsheet/Information/clinical-trials.

Finally, a quasi-experiment has the trappings (and some of the advantages) of an experiment but may not fully meet all of the criteria, such as a study in which matched groups are used (rather than randomly assigned groups) or a study that compares people based on a pre-existing characteristic (such as differences between boys and girls or between high- and low-achieving students).


In educational research, it is customary to distinguish between experimental and observational research methods, quantitative and qualitative measures, and applied versus basic research goals.

First, if experimental methods are preferred for testing causal hypotheses, what is the role of observational methods, in which a researcher carefully describes what happens in a natural environment? Observational methods can be used in an initial phase of research, as a way of generating more specific hypotheses to be tested in experiments, or in conjunction with experiments to help provide a richer theoretical explanation for the observed effects. However, a collection of observations, such as portions of transcripts of conversations among students, is generally not sufficient for testing causal hypotheses. An important type of observational method is a correlational study, in which subjects generate scores on a variety of measures. By looking at the pattern of correlations, using a variety of statistical techniques, it is possible to see which factors tend to go together. However, controlled experiments are required in order to determine whether the correlated factors are causally related.
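A minimal sketch of how a correlational study quantifies which factors "go together" is given below; the score lists and the `pearson_r` helper are hypothetical illustrations, and a high correlation like this one says nothing about which factor (if either) is the cause:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical reading and writing scores for five students.
reading = [60, 70, 75, 85, 90]
writing = [58, 66, 77, 83, 91]

r = pearson_r(reading, writing)  # close to 1: the two scores tend to go together
```

A controlled experiment, not the correlation itself, is what would be needed to test whether improving reading causes better writing.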

Second, should educational research be based on quantitative measures (e.g., those involving numbers) or qualitative measures (e.g., those involving verbal descriptions)? Experiments may use either type of measure, depending on the research hypothesis being tested, but even qualitative descriptions can often be converted into quantitative measures by counting various events.

Third, should educational research be basic or applied? In a compelling answer to this question, Donald Stokes argues for “use-inspired basic research” (1997, p. 73). For example, in educational research, experimental researchers could examine basic principles of how instruction influences learning, that is, experiments aimed at the basic question of how to help people learn within the practical setting of schools.


Applying experimental research methods to questions about human behavior is recognized as one of the greatest scientific advances of the 20th century. Between 1975 and 2005, in particular, experimental research methodology enabled an explosion of educationally relevant findings on how to design effective instruction in subject areas such as reading, writing, mathematics, and science.

In spite of these advances, Peggy Hsieh and colleagues (2005) found that the percentage of articles based on randomized experiments declined from 40 percent in 1983 to 26 percent in 2004 in primary educational psychology journals and from 33 percent in 1983 to 4 percent in 2004 in primary educational research journals. The authors conclude that “the use of experimental methodology in educational research appears to be on the decline” (Hsieh et al., 2005, p. 528). They characterize the decline as “unfortunate” especially in light of growing concerns about “the untrustworthiness of educational research findings” (Hsieh, et al., 2005, p. 528). In a slightly earlier report to the National Research Council, Shavelson and Towne also noted the consensus view that the “reputation of educational research is quite poor” (2002, p. 23). The decline in training in experimental research methods in schools of education can be seen as an example of the deskilling of educational researchers, marginalizing one of the most powerful and productive research methodologies and ultimately marginalizing educational researchers as well.

Valerie Reyna notes that, as a reaction against the perceived low quality of educational research, members of the U.S. Congress passed bills that were signed into law in 2001 and 2002 requiring that educational practices in the United States be based on “scientifically-based research” (2005, p. 30). Reyna shows that the definition of scientifically based research includes research using “experimental or quasi-experimental designs in which individuals … are assigned to different conditions and with appropriate controls to evaluate the effects of the condition of interest” and using “measures … that provide reliable and valid data” (2005, p. 38). According to Reyna, “two landmark pieces of legislation were passed that could substantially change educational practice” not by endorsing a particular program or policy but rather by calling for educational researchers to “embrace … the scientific method for generating knowledge that will govern educational practice in classrooms” (2005, p. 49). Similarly, in their report to the National Research Council, Shavelson and Towne call for “evidence-based research” in education—the fundamental principle of science that hypotheses should be tested against relevant empirical evidence rather than ideology, opinion, or random observation (2002, p. 3).

Early 21st-century trends in experimental research include the use of effect size, meta-analysis, randomized field trials, and net impact.

The Use of Effect Size. Effect size is a measure of the strength of an effect in an experiment. Jacob Cohen (1988) suggested a simple measure of effect size—referred to as Cohen's d—in which the mean of the control group is subtracted from the mean of the treatment group and this difference is divided by the pooled standard deviation of the groups. According to Cohen, effect sizes can be classified as small (d = .2), medium (d = .5), and large (d = .8). Use of effect size allows educational policy makers to determine whether an instructional treatment causes a statistically significant effect and whether it has a practical effect. Hsieh et al. (2005) reported that the percentage of studies reporting effect size in educational psychology journals rose from 4 percent in 1995 to 61 percent in 2004, whereas the rate remained steady at about 25 percent over the same period for a primary educational research journal.
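Cohen's d as described above can be computed directly; the two score lists here are invented for illustration:

```python
def cohens_d(treatment, control):
    """Cohen's d: treatment mean minus control mean,
    divided by the pooled standard deviation of the two groups."""
    n1, n2 = len(treatment), len(control)
    m1 = sum(treatment) / n1
    m2 = sum(control) / n2
    # Sample variances (n - 1 in the denominator).
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Hypothetical final-exam scores for a treatment and a control group.
treatment_scores = [82, 86, 90, 94]
control_scores = [78, 82, 86, 90]

d = cohens_d(treatment_scores, control_scores)  # about 0.77, nearly "large" by Cohen's benchmarks
```

Because d is expressed in standard-deviation units, the same value can be compared across experiments that used different tests.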

The Use of Meta-analysis. The effect size measure allows for a particular instructional effect to be compared across experiments using a common metric, yielding a new kind of literature synthesis called meta-analysis. In meta-analysis, researchers tally the effect sizes of the same comparison across many different experiments, yielding an average effect size. For example, Gene Glass and Mary Smith (1978) reported a pioneering meta-analysis of research on class size revealing small positive effects of smaller class size. In the early 2000s meta-analysis is commonly used to review and summarize experimental research.
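The core tally in a meta-analysis can be sketched in a few lines; the effect sizes and sample sizes below are hypothetical, and the sample-size weighting shown is one common refinement rather than the only option:

```python
# Hypothetical (effect size d, sample size n) pairs from five experiments
# that all compare the same intervention against the same control condition.
studies = [(0.30, 40), (0.55, 120), (0.20, 60), (0.45, 200), (0.10, 30)]

# Simple unweighted mean effect size across studies.
unweighted_mean_d = sum(d for d, _ in studies) / len(studies)

# Weighting by sample size gives larger (more precise) studies more influence.
total_n = sum(n for _, n in studies)
weighted_mean_d = sum(d * n for d, n in studies) / total_n
```

Here the weighted average is larger than the unweighted one because the two biggest studies happened to find the biggest effects, which is exactly the kind of pattern a meta-analysis makes visible.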

The Use of Randomized Trials. Randomized field trials (RFT), randomized clinical trials (RCT) and randomized trials (RT) refer to a particularly rigorous form of experimental research in which students (or other entities) are randomly assigned to treatments within an authentic field setting. Gary Burtless states that “a randomized field trial … is simply a controlled experiment that takes place outside a laboratory setting” (2002, p. 180).

Although randomized trials have been used in medical research and research on public policy, they are rarely used in educational research. However, there are some notable exceptions such as a study of effects of class size conducted in Tennessee, reported by Jeremy Finn and Charles Achilles (1999). As part of the study, 11,600 students in 79 schools across the state were assigned along with their teachers to small classes (13–17 students), regular classes (22–26 students), or regular classes with full-time teacher aides. Students stayed in the program from kindergarten through third grade, and then all were returned to regular classes. Importantly, the study showed that students in the small classes outperformed those in the regular classes, with or without aides, and the effects were greatest for minorities. Frederick Mosteller called the Tennessee class size study “one of the most important educational investigations ever carried out” (1995, p. 113). In the foreword to Evidence Matters by Frederick Mosteller and Robert Boruch, the authors observe, “When properly conducted, randomized field trials—often called the gold standard in research involving human subjects—allow for fairly precise estimates of programmatic effects” (2002, p. vi). Using an appropriate unit of measure (for example, individual students, classrooms, or schools) is an important consideration in research using randomized field trials.

Net Impact. Judith Gueron (2002, p. 18) distinguishes between an intervention's outcomes (e.g., the percentage of students graduating from a school or passing a certification test) and its net impact (e.g., the percentage who graduate or who pass a certification test who would not have without the intervention). Gueron argues that “administrators often know and tout their program's outcomes, but they rarely know the program's net impacts” (p. 18). When administrators focus on the question, “Is the new intervention effective?” they focus only on outcomes. When they focus on the question, “Does the new intervention have more impact than the current practice?” they focus on net impact. In order to determine an intervention's net impact, experimental researchers compare the outcomes with current practice (e.g., current instructional method) to the outcomes with the new intervention (e.g., the new instructional method). In short, Gueron argues that the question “Compared to what?” is an important and profound issue in experimental research.
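Gueron's distinction reduces to a simple subtraction, which the hypothetical graduation rates below illustrate:

```python
# Hypothetical graduation rates: an outcome alone can flatter a program.
rate_with_intervention = 0.85   # outcome under the new instructional method
rate_with_current_practice = 0.80  # outcome under current practice (the comparison group)

# Net impact: the share who graduate who would not have without the intervention.
net_impact = rate_with_intervention - rate_with_current_practice
```

An administrator touting the 85 percent outcome answers "Is the intervention effective?"; the 5-point net impact answers the harder question, "Compared to what?"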

In their analysis of educational research methodologies, Shavelson and Towne note: “decisions about education are sometimes instituted with no scientific basis at all, but rather are derived from ideology or deeply held beliefs” (2002, p. 17). In contrast, experimental research methodology has the potential to be a tool for promoting effective change in education in which decisions about instructional interventions are guided by scientific evidence and grounded in research-based theory. In the preface to Gary Phye, Daniel Robinson, and Joel Levin's Empirical Methods for Evaluating Educational Interventions, Gary Phye observed: “we are on the cusp of a reaffirmation that experimental research strategies provide the strongest evidence” for testing the effects of educational interventions (2005, p. xi). Finally, Robert Boruch quotes Walter Lippmann, who, in the 1930s, said: “Unless we are honestly experimental, we will leave the great questions of society and its improvement to the ignorant opponents of change on the one hand, and the ignorant advocates of change on the other” (2005, p. 189). In short, the experimental research methodology that fueled an explosion of scientific research about humans in the 1900s remains a powerful and indispensable tool for educational researchers in the new millennium.


Burtless, G. (2002). Randomized field trials for policy evaluation: Why not in education? In F. Mosteller & R. Boruch (Eds.), Evidence matters: Randomized trials in educational research (pp. 179–197). Washington, DC: Brookings Institution Press.

Boruch, R. (2005). Beyond the laboratory or classroom: The empirical basis of educational policy. In G. D. Phye, D. H. Robinson, & J. Levin (Eds.), Empirical methods for evaluating educational interventions (pp. 177–192). San Diego: Elsevier Academic Press.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Erlbaum.

Finn, J. D., & Achilles, C. M. (1999). Tennessee's class size study: Findings, implications, misconceptions. Educational Evaluation and Policy Analysis, 21, 97–109.

Glass, G. V., & Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. San Francisco: Far West Laboratory of Educational Research and Development.

Gueron, J. M. (2002). The politics of random assignment: Implementing studies and affecting policy. In F. Mosteller & R. Boruch (Eds.), Evidence matters: Randomized trials in educational research (pp. 15–49). Washington, DC: Brookings Institution Press.

Hsieh, P., Acee, T., Chung, W., Hsieh, Y., Kim, H., Thomas, G. D., et al. (2005). Is educational intervention research on the decline? Journal of Educational Psychology, 97, 523–530.

Mayer, R. E. (2005). The failure of educational research to impact educational practice: Six obstacles to educational reform. In G. D. Phye, D. H. Robinson, & J. Levin (Eds.), Empirical methods for evaluating educational interventions (pp. 67–81). San Diego: Elsevier Academic Press.

Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5, 113–127.

Mosteller, F., & Boruch, R. (2002). Evidence matters: Randomized trials in educational research. Washington, DC: Brookings Institution Press.

Phye, G. D., Robinson, D. H., & Levin, J. (Eds.). (2005). Empirical methods for evaluating educational interventions. San Diego: Elsevier Academic Press.

Reyna, V. F. (2005). The No Child Left Behind Act, scientific research, and federal education policy: A view from Washington, D.C. In G. D. Phye, D. H. Robinson, & J. Levin (Eds.), Empirical methods for evaluating educational interventions (pp. 29–52). San Diego: Elsevier Academic Press.

Shavelson, R. J., & Towne, L. (Eds.). (2002). Scientific research in education. Washington, DC: National Academy Press.

Stokes, D. E. (1997). Pasteur's quadrant: Basic science and technological innovation. Washington, DC: Brookings Institution Press.