When you do an experiment or a survey comparing two sets of people or things, the job of statistics is to show whether you have a significant difference between the two sets. A small difference between the averages or means may or may not be significant. How do you decide?
The experts use a system that has two basic stages:
- They examine the data to find out how much variation there already is among the specimens.
- They use that variation as a basis for deciding whether the experimental difference, or survey difference, is large enough to be a significant difference.
Another useful statistic is the median. In the test of reaction times by the ruler-drop method, we asked each partner to measure five catches by the other partner, then took the average, or mean. We might instead have chosen the number in the middle, which is called the median. It can often be as useful as the mean and takes less time to calculate.
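As a small illustration of the difference between the two statistics, here is a minimal Python sketch. The five catch distances are invented for the example (they are not measurements from the ruler-drop test), and the variable name `catches_cm` is simply an assumption made for this sketch.

```python
from statistics import mean, median

# Hypothetical catch distances in centimeters for five ruler drops
# (illustrative values only, not data from the chapter).
catches_cm = [18.0, 15.5, 21.0, 16.5, 30.0]

# The mean adds every value and divides by how many there are.
print("mean:  ", mean(catches_cm))

# The median is the middle value once the catches are put in order.
print("median:", median(catches_cm))
```

With these made-up numbers, the one slow catch of 30.0 cm pulls the mean up to 20.2, while the median stays at the middle value, 18.0.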
Statistically Meaningful Results
The example of taking five measurements across a room in the measurement chapter may be thought of as a simple sample among all the possible measurements that could be made. We could also set up a program of making many such measurements, or of many people each making many measurements, so that one might eventually have thousands or millions of measurements.
The large, unknown number of measurements of which any one measurement is considered a sample can be known only from the sample. This is like eating cookies from a cookie jar. No matter how enjoyable the first, second, or third, we will never know how good the remaining cookies are from the samples only. We can only predict or infer that the uneaten cookies, the population from which the sample came, are like the sample.
How do scientists judge whether their sample of measurements (or findings expressed other ways) fairly represents all possible measurements? Let's say that Alice is doing an experiment as her science project in which she has planted popcorn seeds in two planters to test the value of a fertilizer. She is going to compare the two plantings, one with the fertilizer and the other without but otherwise grown under uniform conditions. To keep the numbers small for quick, easy measuring, let's say that each planter has five healthy, growing plants. Suppose that Alice measures the heights of the five plants in one of the planters and finds the following:
What? All the same? Most of us who have had experience with growing things would immediately say that this is highly improbable, that it is just a coincidence that all the plants would be precisely the same height. Correct! It is a matter of chance or probability. Probability, you will find, is the main theme in the evaluation of scientific findings. Suppose, now, that Alice's measuring had brought the following results:
"That's more like it," we would say. We expect differences in things, especially in living, growing things. That is, it is highly probable that the heights would not be all the same.
Now, whether we like the sample or not, it is all we know about the larger population of plants that Alice's supply of seed might grow. Suppose, again, that Ken planted 100 seeds from the same supply as Alice's and under very much the same conditions. Then suppose he went to work measuring them at the same stage as Alice's plants. We would like to see how the sizes vary in this much larger sample, so we make a frequency distribution (see figure 13.1) showing the sizes. That is, an "X" mark is made for each corn plant over its height measurement, which is listed along the bottom of the chart.
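A frequency distribution of this kind can be sketched in a few lines of Python. Ken's actual measurements are not listed here, so the sketch simulates 100 heights. The mean of 20 cm, the spread of 3 cm, the random seed, and the function name `frequency_chart` are all assumptions made for illustration.

```python
import random
from collections import Counter

def frequency_chart(heights_cm):
    """Return a text chart with one X per plant, stacked over each height."""
    counts = Counter(heights_cm)
    sizes = list(range(min(counts), max(counts) + 1))
    rows = []
    # Build the chart from the tallest column down to the baseline.
    for level in range(max(counts.values()), 0, -1):
        rows.append(" ".join(" X" if counts[s] >= level else "  " for s in sizes))
    # The height measurements are listed along the bottom of the chart.
    rows.append(" ".join(f"{s:2d}" for s in sizes))
    return "\n".join(rows)

# Hypothetical stand-in for Ken's 100 plants: heights in whole centimeters,
# simulated around an assumed mean of 20 cm with an assumed spread of 3 cm.
rng = random.Random(13)
ken_heights_cm = [round(rng.gauss(20, 3)) for _ in range(100)]

print(frequency_chart(ken_heights_cm))
```

Running it prints columns of X marks over each height. The columns should tend to be tallest near the middle of the range and shortest at the ends.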

We see that there are not many of the shortest and tallest plants but more of each size in the middle of the range. If we drew a line over the tops of the columns of sizes (and if we had many more specimens measured and recorded), the line, or line graph, would look something like the one in figure 13.2.

Such a distribution of a large number of things (and it must be large, preferably in the thousands) is called a normal distribution. Many things show a normal distribution when they are measured and graphed like this, for example, the heights of large numbers of people picked at random and the amounts of food eaten per person per year. This widespread tendency of things to show a normal distribution has been used by scientists and statisticians to work out ever more meaningful designs for science investigations. Most modern scientists are thinking about the statistics they will use to analyze their findings from the beginning or planning stages of their investigations. They are saying something like this: "I don't want my experiment to come out as some queer, quirky thing that proves nothing. How must I plan now so that in the end my results will be statistically meaningful?" Scientists know, however, that there can be no perfect answer to their questions. They can always, just by chance, get results that show unexpected quirks.
Nevertheless, as a scientist does her investigation, she is trying to uncover some meaningful results. This means more than just saying, "Yes" or "No" to the hypothesis. It means going beyond the small number of subjects she may be dealing with in her experiment or survey. It means having confidence that her findings may be stretched, or generalized, to any larger group of similar subjects. Did ingredient Q seem to prevent sunburn in the experimental group of people who used it? If so, and if that experimental group fairly represents the larger population, we may then reasonably expect that ingredient Q will prevent sunburn in most of the larger population.
The use of random choices in the first stages of an investigation means more than just helping to keep the scientist's prejudices from affecting the results. It helps to assure that the sample of people, or other subjects, used in the investigation will allow us to generalize to the larger group that the sample is intended to represent.
Can You Prove It?
Let's say that Alice is doing an experiment as her science project in which she has planted popcorn seeds in two planters to test the value of a fertilizer. She uses the controlled experiment design.
To the experimental group she adds a chemical fertilizer, urea, a nitrogen compound that may be put into the soil or dissolved in the water given the plants. Her independent variable is the addition of the urea to the experimental group. Her dependent variable, if she observes one, is the difference in growth rate (height or weight) of the plants in her two planters.
At a proper time in her experiment, she measures the heights of the plants with the following results:

We see that there is a difference between the means (commonly called averages) of the two groups. The difference is 1.9 cm in favor of the experimental group; the average height of the plants in that group is 1.9 cm taller than the average height of the plants in the control group. This looks good.
"See!" Alice says. "Adding urea to the experimental planting has made the corn grow faster." Can she be sure of this? No, she cannot. Maybe it was a chance happening that she got five taller growing plants in the experimental group and five shorter growing plants in the control. She should not make any decision just yet. She should get someone to make a good statistical treatment (unless she can do it herself) that would go beyond comparing the mean heights of the two groups.
A statistical analysis would show how much the heights vary among themselves. Then it would show how the means compare with a larger "population" of plants like Ken's 100 plants. Where would this larger population be found? It would be imagined, inferred, or hypothetical: it would be created out of the variability, the range, the scatter of her sample and the size of the sample. It would be created by the use of equations in statistics books.
Furthermore, a judgment would be made about the chance, or the probability, that the difference Alice found was or was not simply a chance difference. This, too, would be done by reference to appropriate tables in statistics books. Actually, the number of plants in Alice's experiment is too small (only five) to make it worth all of that analysis, yet her results are supported by agricultural research by professional scientists and by the experiences of the thousands of farmers who have found it useful to apply urea and other nitrogen compounds to their corn plantings.
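To make the idea of a "chance difference" concrete, here is a minimal Python sketch. It does not use the equations and tables in statistics books mentioned above. Instead it swaps in a simpler technique, an exact permutation test, which regroups the ten plants in every possible way (252 ways for two groups of five) and counts how often chance alone gives a difference at least as large as the one observed. The heights are invented so that the means differ by 1.9 cm; they are not Alice's measurements, and the function name `permutation_p_value` is an assumption made for this sketch.

```python
from itertools import combinations
from statistics import mean

# Hypothetical heights in cm, invented so the means differ by 1.9 cm
# as in the text; these are NOT Alice's actual measurements.
control = [10.0, 11.0, 12.0, 13.0, 14.0]
experimental = [12.0, 13.0, 14.0, 15.0, 15.5]

def permutation_p_value(group_a, group_b):
    """Return the observed mean difference (group_b minus group_a) and the
    share of all regroupings whose difference is at least that large."""
    pooled = group_a + group_b
    observed = mean(group_b) - mean(group_a)
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    at_least_as_large = 0
    n_splits = 0
    # Try every way of choosing which plants are labeled "group_b".
    for idx in combinations(range(len(pooled)), n_b):
        sum_b = sum(pooled[i] for i in idx)
        diff = sum_b / n_b - (total - sum_b) / n_a
        n_splits += 1
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-9:
            at_least_as_large += 1
    return observed, at_least_as_large / n_splits

observed, p_value = permutation_p_value(control, experimental)
print(f"observed difference: {observed:.1f} cm")
print(f"share of regroupings at least this large: {p_value:.3f}")
```

The smaller that share, the harder it is to explain the difference as luck in how the seeds happened to be divided between the planters. With only five plants per group, though, the result still says little about future plantings.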
With all of that support, why wouldn't scientists declare that they have proven the value of this treatment of corn? The problem lies partly in this question: How can you know when you have proven a thing to be true? And it lies partly in the way the words "prove" and "true" are used in mathematics and logic as compared to the way they are used in ordinary speech.
First, the mathematics and logic. You and I can agree that this is a true statement in arithmetic: 148 + 293 + 167 = 608. That is, we follow certain rules of mathematics to prove whether the statement is an equality. Mathematicians would not agree, however, that we had proven it by following the rules of addition. They are more concerned about the sources of those rules. In the end, they would show that the statement was proven by agreeing on certain things about arithmetic and its rules.
In logic of the formal sort, proof would be much the same, as in this example:
If all wangtups have gitly speekrongs,
And if Q is a wangtup,
Then Q has gitly speekrongs.
Even though the statements do not mean anything in real life, if we accept the first and second statements as true, then the conclusion, the third statement, is also true. The "proof" is all right there in the statement. It has nothing to do with real people or things and their mixed-up ways.
Still, these simple examples do not do justice to mathematics and logic. Both are fascinating and powerful tools of thought or reasoning that humankind has created. The proof or truth of these examples, however, is so very much different from the kinds of proof that scientists are seeking that it becomes awkward to try to use the same language to describe them all. Even though mathematicians and logicians got there first with the terms "prove" and "true," scientists in recent times have pulled away from using these terms.
In ordinary experience as well there is a problem with these key words. Most people would say, "See, Alice proved it! It is true that urea makes corn grow faster." Or they might say, "That proves it! Hocus is better for a headache than Pocus," even though they may have used the medication only one time and their test has serious weaknesses. Or, again: "That proves it! Dreams do foretell the future. I knew that you were coming because I dreamed about it!"
These difficulties with the language, however, do not provide the main objection to the use of "prove" in scientific work. When we talk about "proving" something in science we are, in effect, predicting the future as well as examining the present. How much can we depend on something happening in the future just because today's scientific findings show it to be probable now?
In Alice's experiment, for example, she used only five plants in each planter. Such a small sample cannot tell us much about the larger population of future corn plantings, no matter how much statistical analysis we apply to it. However, let's do some more analysis of Alice's results to see how this helps us to learn about the predictive value of her findings. Let's rearrange the measurements of the corn plants according to height (see table 13.2).
Does this tell us more than a simple comparison of the means? Suppose her results in the experimental group had been as in table 13.3 (also ranked by height).
Here we see that the difference between the means of the two groups is the same as in table 13.2. But notice the range of heights in table 13.3. The experimental plants are not as uniformly taller than the control plants as they were in table 13.2. There is more variability. These results would provide a less reliable basis for predicting about future plantings.
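The same point can be sketched with numbers. The heights below are hypothetical stand-ins for the two tables, chosen so that both experimental groups have the same mean but different scatter. The group names and the function name `summarize` are assumptions made for this illustration.

```python
from statistics import mean, stdev

# Hypothetical heights in cm (not the values from the chapter's tables).
control = [10.0, 11.0, 12.0, 13.0, 14.0]
uniform_group = [12.0, 13.0, 14.0, 15.0, 15.5]    # little scatter
scattered_group = [9.0, 11.5, 14.0, 16.5, 18.5]   # same mean, more scatter

def summarize(group, control):
    """Compare an experimental group with the control in several ways."""
    # Rank both groups by height and compare them pair by pair.
    taller_pairs = sum(e > c for e, c in zip(sorted(group), sorted(control)))
    return {
        "mean_difference": mean(group) - mean(control),
        "range": max(group) - min(group),
        "stdev": stdev(group),
        "taller_pairs": taller_pairs,
    }

for name, group in [("uniform", uniform_group), ("scattered", scattered_group)]:
    print(name, summarize(group, control))
```

Both groups show the same difference in means, yet the scattered group has a much wider range and a larger standard deviation. When the plants are ranked, only four of its five beat their control partners, against all five for the uniform group.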

I hope that you begin to agree, if you had not already known, that statistical treatment of data can reveal useful information. Finding the means and their difference is statistical analysis. Ranking the heights and comparing the pairs of plants is statistical analysis. These two ways of analyzing data are very elementary (even antiquated) when compared with the methods used by people with more mathematical and statistical knowledge.
Replicating and Expanding on Experiments
How could Alice "prove" more, besides just making statistical analyses of her data? She could replicate the experiment. This would raise the predictive power of her test if results were as good as the first test or better, even though it would still not finally prove anything. We must accept this because there is always uncertainty about the future. Some things are more highly probable than others, of course. We are all fairly sure that the sun will come up tomorrow, while we may not be so sure that another planting of corn, treated as Alice's was treated, will turn out the same. So we are always dealing in probabilities.

Scientists like to show that their findings allow them to predict, or generalize, in another way than in the simple replication of an experiment or other investigation. Alice could expand her research in several ways:
Plan A: One experimental level of urea, applied in water (Alice's first plan).
| Planter | Description |
| --- | --- |
| 1 | Control: no urea |
| 2 | Experimental: 2 g (grams) per liter of water used to water the plantings |
Plan B: Three experimental levels of urea, applied in water
| Planter | Description |
| --- | --- |
| 1 | Control: no urea |
| 2 | Experimental: 2 g urea per liter of water |
| 3 | Experimental: 4 g urea per liter of water |
| 4 | Experimental: 6 g urea per liter of water |
Plan C: Three experimental levels of urea, applied in soil
| Planter | Description |
| --- | --- |
| 1 | Control: no urea |
| 2 | Experimental: 10 g urea mixed in the soil |
| 3 | Experimental: 20 g urea mixed in the soil |
| 4 | Experimental: 30 g urea mixed in the soil |
If she were to test both variables—two ways of applying urea and three different levels of urea—at the same time Alice would need an arrangement of planters (or outdoor plots) as in figure 13.3.
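One way to list the combinations in such a two-variable design is sketched below in Python. It assumes a single shared control planter and labels the three urea levels simply as low, medium, and high. Both choices are assumptions for illustration, and the actual arrangement in the figure may differ.

```python
from itertools import product

# The two variables being tested together.
methods = ["in water", "in soil"]
levels = ["low", "medium", "high"]  # assumed labels for the three urea levels

# One shared control (an assumption), plus every method-and-level combination.
planters = [("control", "no urea")] + list(product(methods, levels))

for number, (method, level) in enumerate(planters, start=1):
    print(f"planter {number}: {method}, {level}")
```

Two methods crossed with three levels gives six experimental planters, so this sketch lists seven planters once the control is counted.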

You may be interested in figuring out how many different experiments would be needed to test each of these plans one at a time against a control and against each different level of urea. Also consider that there are other nitrogen compounds that should be compared with urea; each should be applied in different amounts. Then there are other kinds of soil, other varieties of corn, other planting methods, other methods of applying the fertilizer, and other chemicals that may be as important as nitrogen for promoting healthy growth in the corn. Many of these variables would best be tested in combination with certain other variables. Therefore, the designs in some cases would be more complicated than in the above Plan C. For the most significant results, most of the experiments would be conducted all the way through to the mature stage of the crop. Therefore, the testing would need to be done outdoors in plots of land large enough to accommodate farm equipment.
Surely under these expanded conditions there would be enough "population" to make the results prove something! Well, perhaps not surely, but more so. And yet these methods would create other problems. Rarely would individual plants be measured in order to determine results. Instead, more gross measures, such as weighing the grain from each plot or weighing the grain and other plant matter, would be used. This would increase our confidence in the results in that they would not be affected so much by variations among individual plants as in Alice's small groups. Nevertheless, the different plots might vary as to quality of soil, drainage conditions, and the like, and so scientists have found that each "treatment" must be used over several smaller plots that are spread in a randomized pattern around an entire field. For example, instead of two larger plots, one experimental and one control, the experimental plot is divided into five smaller experimental plots (each given the same treatment) and the control plot is divided into five smaller control plots. These plots are distributed randomly throughout the entire field. As a consequence, we find that we are dealing with a small number of things (five plots) as in Alice's experiment with five corn plants. While this gives important improvements to the overall plan, it still shows somewhat the same problem of a small sample (small number of plots). As a consequence, the statistical treatment for such a study must be highly developed if you are to squeeze the most meaning out of the results.
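The randomized pattern of plots can be sketched in Python as follows. The function name `randomized_layout`, the seed value, and the simple numbering of plots across the field are assumptions made for illustration. Real field trials often use more elaborate layouts than a plain shuffle.

```python
import random

def randomized_layout(treatments, plots_per_treatment, seed=None):
    """Return a list of treatments, one per plot, in random order."""
    plots = [t for t in treatments for _ in range(plots_per_treatment)]
    # A seeded generator makes the layout repeatable for record keeping.
    random.Random(seed).shuffle(plots)
    return plots

# Five experimental plots and five control plots spread around one field.
layout = randomized_layout(["experimental", "control"], 5, seed=13)
for plot_number, treatment in enumerate(layout, start=1):
    print(f"plot {plot_number:2d}: {treatment}")
```

Because chance decides which plot gets which treatment, differences in soil or drainage are less likely to favor one treatment consistently.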
All said and done, there is still uncertainty about the evaluation of results, as there is elsewhere in scientific method. We must not be disheartened about this uncertainty, however. Unfortunately, many people have been oversold on science and its powers for finding out the "truth" about things. Others have shown disappointment over the way science has not been able to solve more problems. It is important to understand that scientific methods are the best that have been found so far for learning about many things, and that they are superior to ordinary, everyday, "commonsense" methods. That's why scientific methods are called "scientific": they are better than unscientific methods. Yet, by comparison, humankind has been working with scientific methods only a short time. Not all kinds of human problems can be solved by using scientific kinds of knowledge, but those problems that might be solved by scientific methods seem to be limitless. Nevertheless, in spite of the uncertainty of science and the limited speed with which scientists can move into new areas, we must use scientific methods to find out all we can about the world and the people and things in it. Even with their uncertainty, they are still the best we have for trying to resolve many of the problems of humankind.