**Introduction to Source Data and Sampling Frames**

In the real world, data is gathered by a process called *sampling*. It's important that the sampling process be carried out correctly, and that errors of all kinds be minimized (unless the intent is to deceive).

When conducting a statistical experiment, four steps are followed. They must be done in order. Here they are:

1. Formulate the question. What do we want to know, and what (or who) do we want to know it about?
2. Gather the data from the required places, from the right people, and over a sufficient period of time.
3. Organize and analyze the data, so it becomes *information*.
4. Interpret the information gathered and organized from the experiment, so it becomes *knowledge*.

**Source Data**

**Primary Versus Secondary**

If no data are available for analysis, we have to collect them ourselves. Data collected by the statistician are called *primary source data*. If data are already available and all a statistician has to do is organize and analyze them, then they are called *secondary source data*. Certain precautions must be taken when using either kind of data.

In the case of primary source data, we must be careful to follow the proper collection schemes, and then we have to be sure we use the proper methods to organize, evaluate, and interpret the data. That is, we have to ensure that each of the above Steps 2, 3, and 4 is done properly. With secondary source data, the collection process has already been done for us, but we still have to organize, evaluate, and interpret the data, so we have to carry out Steps 3 and 4. Either way, there's plenty to be concerned about. There are many ways for things to go wrong with an experiment, but only one way to get it right.

**Sampling Frames**

The most common data-collection schemes involve obtaining samples that represent a population with minimum (and ideally no) bias. This is easy when the population is small, because then the entire population can be sampled. However, a good sampling scheme can be difficult to organize when a population is large, and especially when it is not only huge but is spread out over a large region or over a long period of time.

The term *population* refers to a particular set of items, objects, phenomena, or people being analyzed. An example of a population is the set of all the insects in the world. A sample of a population is a subset of that population. Consider, as a sample from the foregoing population, the set of all the mosquitoes in the world that carry malaria.

It can be useful in some situations to define a set that is intermediate between a sample and a population. This is often the case when a population is huge. A *sampling frame* is a set of items within a population from which a sample is chosen. The idea is to whittle down the size of the sample, while still obtaining a sample that fairly represents the population. In the mosquito experiment, the sampling frame might be the set of all mosquitoes caught by a team of researchers, one for each 10,000 square kilometers (10⁴ km²) of land surface area in the world, on the first day of each month for one complete calendar year. We could then test all the recovered insects for the presence of the malaria protozoan.

In the simplest case, the sampling frame coincides with the population (Fig. 5-1A). However, in the mosquito experiment described above, the sampling frame is small in comparison with the population (Fig. 5-1B). Occasionally, a population is so large, diverse, and complicated that two sampling frames might be used, one inside the other (Fig. 5-1C). If the number of mosquitoes caught in the above process is so large that it would take too much time to individually test them all, we could select, say, 1% of the mosquitoes at random from the ones caught, and test each one of them.
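The final step described above, drawing 1% of the caught mosquitoes at random, might be sketched as follows. This is only an illustration: the function name `subsample`, the ID list `caught`, and the count of 50,000 insects are hypothetical and do not come from any real study.

```python
import random

def subsample(frame, fraction, seed=None):
    """Draw a simple random sample, without replacement, of `fraction` of `frame`."""
    if not frame:
        return []
    rng = random.Random(seed)
    # Keep at least one item, and never ask for more items than the frame holds.
    k = min(len(frame), max(1, round(len(frame) * fraction)))
    return rng.sample(frame, k)

# Hypothetical outer sampling frame: ID numbers of the mosquitoes that were caught.
caught = list(range(50_000))

# Inner sampling frame: 1% of the caught insects, chosen at random for testing.
to_test = subsample(caught, 0.01, seed=1)
print(len(to_test))
```

Because every caught insect has the same chance of being selected, the inner frame is a fair representation of the outer one. It says nothing about whether the outer frame fairly represents the population.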

**Choosing Frames**

The choice of sampling frames is important, because each frame must be a fair (unbiased) representation of the next larger element set in which it is contained. Let's look at another frames-within-frames example that can be descriptive, even though it does not lend itself to illustration.

Imagine that we want to evaluate some characteristic of real numbers. The population is the set of all real numbers. Sampling frames are a matter of choice. How about the irrational numbers? Or the rational numbers? How about the set of all real numbers that are square roots of whole numbers? Suppose we choose the set of rational numbers as the sampling frame. Within this set, we might further specify subframes. How about the set of integers? Or the set of rational numbers whose quotients have denominators that are natural numbers between and including 1 and 100? How about the set of even integers? Or do you prefer the set of odd integers? Finally, within the chosen subframe, we choose a sample. How about the set of integers divisible by 100? Or the set of odd integers that are each 1 greater than some integer divisible by 100?

Throughout this process, we must keep one thing in mind: All the sampling frames we choose, and the final sample as well, must be an unbiased representation of the population for the purposes of our experiment. Depending on what this purpose happens to be, the whittling-down process we choose might be satisfactory, or it might be questionable, or it might put us way off the track.
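Whether a frame is fair depends on the characteristic being studied, and a small check can make that concrete. The sketch below compares how often a property holds in a frame with how often it holds in the larger set. The finite range standing in for the integers and the helper name `share` are assumptions made for illustration.

```python
def share(frame, predicate):
    """Fraction of the items in `frame` that satisfy `predicate`."""
    items = list(frame)
    return sum(1 for x in items if predicate(x)) / len(items)

# Finite stand-in for the set of integers (an assumption; the real set is infinite).
integers = range(-10_000, 10_001)

# Two candidate samples from the text.
div_by_100 = [n for n in integers if n % 100 == 0]
one_more = [n + 1 for n in div_by_100]

def is_even(n):
    return n % 2 == 0

def is_negative(n):
    return n < 0

for name, frame in [("integers", integers),
                    ("divisible by 100", div_by_100),
                    ("one more than that", one_more)]:
    print(name, share(frame, is_even), share(frame, is_negative))
```

For the question "how often is an integer negative?", both samples come close to the share found in the full range. For the question "how often is an integer even?", one sample contains only even numbers and the other only odd numbers, so each is badly biased for that purpose.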

In any real-life experiment, the sample should not be too large or too small. If the sample is too large, it becomes difficult to collect all the data because the process takes too many human-hours, or requires too much travel, or costs too much. If the sample is too small, it will not be a fair representation of the population for the purposes of the experiment. As the sample gets smaller, the risk of its being a poor representation increases.
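The claim that smaller samples carry a greater risk of misrepresenting the population can be explored with a small simulation. The sketch below draws many random samples of a given size and measures how much their means vary. The population here is invented (10,000 values centered near 120 with a spread of about 15), and the function name `spread_of_sample_means` is hypothetical.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, trials=200, seed=0):
    """Standard deviation of the means of `trials` random samples of `sample_size`."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Invented population for illustration only.
gen = random.Random(42)
population = [gen.gauss(120, 15) for _ in range(10_000)]

for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(population, n), 2))
```

The printed spread shrinks as the sample size grows, which is the trade-off described above: larger samples represent the population more reliably, but cost more to collect.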

**Source Data and Sampling Frames Practice Problems**

**Practice 1**

Suppose you want to describe the concept of a "number" to pre-college school students. In the process of narrowing down sets of numbers described above into sampling frames in an attempt to make the idea of a number clear to a child, name a few possible assets, and a few limitations.

**Solution 1**

Think back to when you were in first grade. You knew what a whole number was. The concept of whole number might make a good sampling frame when talking about the characteristics of a number to a six-year-old. But by the time you were in third grade, you knew about fractions, and therefore about rational numbers. So the set of whole numbers would not have been a large enough sampling frame to satisfy you at age eight. But try talking about irrational numbers to a third grader! You won't get far! A 12th-grader would (we hope) know all about the real numbers and the various subcategories of numbers within them. Restricting the sampling frame to the rational numbers would leave a 12th-grader unsatisfied. Beyond the real numbers are the realms of the complex numbers, vectors, quaternions, tensors, and transfinite numbers.

**Practice 2**

Suppose you want to figure out the quantitative effect (if any) that cigarette smoking has on people's blood pressure. You conduct the experiment on a worldwide basis, for all races of people, female and male. You interview people and ask them how much they smoke, and you measure their blood pressures. The population for your experiment is the set of all people in the world. Obviously you can't carry out this experiment for this entire population! Suppose you interview 100 people from each country in the world. The resulting group of people constitutes the sampling frame. What are some of the possible flaws with this scheme? Pose the issues as questions.

**Solution 2**

Here are some questions that would have to be answered, and the issues resolved, before you could have confidence in the accuracy of this experiment.

- How do you account for the fact that some countries have far more people than others?
- How do you account for the fact that the genetic profiles of the people in various countries differ?
- How do you account for the fact that people smoke more in some countries than in others?
- How do you account for the fact that the average age of the people in various countries differs, and age is known to affect blood pressure?
- How do you account for differences in nutrition in various countries, a factor that is also known to affect blood pressure?
- How do you account for differences in environmental pollutants, a factor that may affect blood pressure?
- Is 100 people in each officially recognized country a large enough sampling frame?
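The first question in the list, about unequal country sizes, is often addressed by making each country's share of the sample proportional to its share of the population. The sketch below contrasts that with the fixed 100-per-country scheme. The three countries and their populations are made up, and `proportional_allocation` is a hypothetical helper, not part of the experiment as posed.

```python
def proportional_allocation(populations, total_sample):
    """Split `total_sample` across groups in proportion to each group's size."""
    total = sum(populations.values())
    # Rounding means the parts may not add up to exactly `total_sample`.
    return {name: round(total_sample * size / total)
            for name, size in populations.items()}

# Made-up populations for three hypothetical countries.
populations = {"A": 1_000_000_000, "B": 50_000_000, "C": 500_000}

equal = {name: 100 for name in populations}            # scheme from the problem
proportional = proportional_allocation(populations, 300)

print("equal:       ", equal)
print("proportional:", proportional)
```

Proportional allocation keeps large countries from being underrepresented, but it can leave very small countries with few or no interviewees, so it answers only one of the questions above and raises another.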
