**Introduction to Sample Surveys**

*The results of surveys are presented almost daily in newspapers, over the radio, and on television. From surveys, the proportion p of the population with a certain trait or opinion can be estimated. In fact, if the sample size is 1,500, we can be almost sure that our estimate is within 0.03 of the population proportion. Remarkably, being able to estimate the population proportion with this precision does not depend on the size of the population. A sample of 1,500 people is sufficient whether we are drawing inference to the people living in a particular state, to the people living within the United States, or to the people living on Earth, provided the sample is taken properly. Taking a proper sample is challenging. In this lesson, we will learn more about conducting sample surveys.*

**Margin of Error**

From June 24 through 26, 2005, the Gallup Organization contacted 1,009 adults nationally and asked them, "How patriotic are you? Would you say -- extremely patriotic, very patriotic, somewhat patriotic, or not especially patriotic?" Of the respondents, 72% said "extremely or very patriotic." Thus, the sample proportion *p̂* = 0.72 is the estimate of the proportion *p* of adults in the United States who would state they are extremely or very patriotic. The sample proportion is a point estimate of the population proportion. A *point estimate* of a population parameter is a single number that is based on sample data and represents a plausible value of the parameter.

The Gallup Organization also reported that there was a ±3 percentage point margin of error associated with the survey. The margin of error provided by this and other media descriptions of survey results has two important characteristics. First, the difference between the sample proportion and the population proportion *p* is less than the margin of error about 95% of the time; that is, for about 19 of every 20 random samples of the same size from the same population, the sample proportion will be within the margin of error of the population proportion. Second, the sample proportion will differ from the population proportion by more than the margin of error about 5% of the time; that is, for about one in every 20 samples of the same size from the same population, the difference between the sample proportion and the population proportion will be greater than the margin of error.
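The "19 of every 20 samples" interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes the population proportion really is 0.72, that each of the 1,009 respondents is drawn independently, and it uses a hypothetical helper name, `coverage_rate`, that does not come from the survey itself.

```python
import random

def coverage_rate(p=0.72, n=1009, margin=0.03, reps=2000, seed=1):
    """Fraction of simulated samples whose sample proportion lands within
    `margin` of the (assumed) population proportion p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Each respondent independently has the trait with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        if abs(p_hat - p) <= margin:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    print(f"Share of samples within the margin: {coverage_rate():.3f}")
```

The printed share should be in the neighborhood of the 95% figure quoted above; the exact value depends on the seed and on the rounding of the reported margin.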

The margin of error can be used to obtain an interval of plausible values for the parameter of interest. For the survey on patriotism, the point estimate was 0.72, and the margin of error was 0.03. Thus, the interval of plausible values based on this sample is 0.69 to 0.75.
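The lesson does not say how the reported margin of error was computed. A common large-sample approximation for a simple random sample, assumed here purely for illustration, is 1.96 × √(p(1 − p)/n), evaluated at the worst case p = 0.5. The function names below are hypothetical. Note that the population size does not appear anywhere in the calculation.

```python
import math

def conservative_margin(n, z=1.96):
    """Approximate 95% margin of error for a sample proportion, using the
    worst case p = 0.5. Assumes a simple random sample from a large population."""
    return z * math.sqrt(0.5 * 0.5 / n)

def plausible_interval(p_hat, margin):
    """Interval of plausible values: point estimate plus or minus the margin."""
    return (p_hat - margin, p_hat + margin)

if __name__ == "__main__":
    margin = conservative_margin(1009)
    low, high = plausible_interval(0.72, round(margin, 2))
    print(f"margin is about {margin:.3f}; interval {low:.2f} to {high:.2f}")
```

Under this approximation a sample of 1,009 gives a margin of roughly 0.03, in line with the figure Gallup reported, and a sample of 1,500 gives a slightly smaller one.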

**Example**

A sample of high school students was randomly selected from a very large city. Each student was asked, "Are you employed either part time or full time during the school year?" Of those sampled, 38% reported that they had a part-time or a full-time job during the school year. The margin of error was reported to be 5%. Give a point estimate and an interval of reasonable values for the proportion of this city's high school students having employment that, with 95% certainty, includes the true proportion.

**Solution**

The point estimate of the proportion of the high school students in this city who are employed, either part time or full time, is 0.38. An interval of plausible values for this proportion is between 0.38 – 0.05 = 0.33 and 0.38 + 0.05 = 0.43.

**Census versus Sample Survey**

In a census, every unit in the population is included in the sample. This is the only way to determine a parameter exactly. If our goal is to determine a parameter's value, why do we usually sample and not take a census? There are various reasons that we must, or want to, sample instead of taking a census.

It may not be feasible to take a census. When a nurse draws blood for a test, you certainly want her to be satisfied with a sample and not to take all of your blood as a census would require. A manufacturer who takes a census to determine the mean lifetime of the batteries the company produces will have nothing left to sell!

Many times, a census takes too long to complete. Suppose we want to know what proportion of the cotton plants in a 160-acre field have at least one insect on them. (The number of plants per acre can vary from 30,000 to 58,000.) It would take days to check each plant. By then, the plants inspected early that had insects may no longer have them, and the plants inspected early that did not have insects might now have them. The U.S. Census, which is completed every ten years, takes years to plan and more time to compile the results after the data are collected; it would not be feasible to take a census of the U.S. population each year.

A census is often not as accurate as a sample survey. A small group of interviewers can be trained more easily than can a large one. Finding a small number of nonrespondents is a much more manageable task than finding a large number of nonrespondents. For the U.S. Census, it is difficult to actually count all residents. Some do not have a home; others do not want to be counted. Various techniques have been used to count these people. This has led some to argue that a more accurate count of the U.S. population would be obtained if it were estimated from a sample; others disagree.

**Simple Random Samples**

Earlier, we noted that the sample proportion will be within the margin of error of the population proportion about 95% of the time *provided* that the sample was properly taken. Before describing some of the methods that can be used to select samples properly, we need to think more carefully about some of the elements of sampling.

In a sample survey, the target population is the set of units that is of interest. The sampled population is the set of units from which the sample is selected. Although we want the target and sampled populations to be the same, this is rarely the case. As an illustration, suppose the target population is every household in the United States. If a telephone survey is conducted using the white pages from phone books across the nation, only households with telephone numbers in the white pages are part of the sampled population. Households without a telephone or with unlisted numbers are not part of the sampled population. The sample frame is a list of all units from which the sample is drawn; it is a list of the units in the sampled population.

Generally, the purpose of a sample survey is to draw inferences about some population characteristic(s). For a relatively small sample to accurately reflect the characteristics of a large population, the sample cannot be drawn haphazardly. Proper sampling methods, specifically, probability sampling plans, must be used. A *probability sampling plan* is one in which every unit of the sampled population has a known probability of being included in the sample. In Lesson 2, we learned that a simple random sample is one in which every set of units of size *n* in the sampled population has an equal chance of being selected; a simple random sample is therefore a probability sample.

Suppose we want to take a simple random sample of size 30 from the people who have donated funds to the local public radio station within the past year. Working with the station, we could write each contributor's name on a slip of paper, place it in a bowl, mix the pieces of paper thoroughly, and draw out 30 slips. The names on the 30 slips of paper constitute the people in the sample. This approach becomes impractical as the population of interest becomes large. Writing the names of all residents of a city, much less a state or nation, on slips of paper would take a prohibitive amount of time. Instead, the sample frame (list of names) is usually generated from one or more sources, such as tax rolls or residential addresses, and the computer is used to make selections from the list in a manner that permits every listed unit (person) to have an equal chance of being chosen. Those units (people) selected by the computer constitute the simple random sample.
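The computer-based selection described above can be sketched with Python's standard `random.sample`, which draws without replacement so that every subset of the requested size is equally likely. The contributor IDs and the helper name `simple_random_sample` are made up for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

if __name__ == "__main__":
    # Hypothetical sample frame: 1,200 made-up contributor IDs.
    contributors = [f"contributor_{i:04d}" for i in range(1, 1201)]
    chosen = simple_random_sample(contributors, 30, seed=42)
    print(chosen[:5])
```

Fixing the seed makes the draw reproducible, which is useful when a selection has to be documented or audited.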

Generating the sample frame is a major challenge, especially if the population is large and/or geographically dispersed. The resources available to create the frame may not be sufficient. Sometimes, even if they are sufficient, it is impossible to create the sample frame, at least within the desired time frame. Other sampling methods have been developed as alternatives to simple random sampling. These tend to be more complicated both in selecting the sample and in obtaining parameter estimates from the sample. Depending on the circumstance, they may have some advantages over simple random sampling. We will consider four such methods: stratified random sampling, cluster sampling, systematic sampling, and multistage sampling.

**Stratified Random Sampling**

Sometimes, the population has natural groups, called *strata*. As an illustration, when estimating the literacy rate for the nation, estimates of the literacy rate for each state may be useful in assessing state and regional differences. A *stratified random sample* is one in which the population is first divided into groups (strata) and then a simple random sample is taken within each stratum. Estimates are made for each stratum and then combined to obtain the population estimate. Because the sizes of strata usually vary, a weighted average of the stratum estimates, with weights proportional to the strata sizes (not a simple average), is used to estimate the population parameter.
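To see why the weighting matters, the following sketch combines stratum estimates with weights proportional to stratum size and compares the result with a simple average. The stratum sizes and literacy rates are invented for illustration, and `stratified_estimate` is a hypothetical helper name.

```python
def stratified_estimate(strata):
    """Combine stratum estimates using weights proportional to stratum size.

    `strata` is a list of (stratum_size, stratum_estimate) pairs.
    """
    total = sum(size for size, _ in strata)
    return sum(size * est for size, est in strata) / total

if __name__ == "__main__":
    # Hypothetical strata: (population size, estimated literacy rate).
    strata = [(8_000_000, 0.90), (1_000_000, 0.80), (1_000_000, 0.70)]
    weighted = stratified_estimate(strata)
    simple = sum(est for _, est in strata) / len(strata)
    print(f"weighted: {weighted:.3f}, simple average: {simple:.3f}")
```

With these made-up numbers the large stratum dominates the weighted estimate, while the simple average gives the two small strata far more influence than their share of the population warrants.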

**Cluster Sampling**

Although cluster sampling is often confused with stratified random sampling, it is very different. In cluster sampling, a population is divided into groups, called *clusters*, a random sample of clusters is selected, and only units in those clusters are measured. In most applications of stratified random sampling, the population is divided into a few large strata and a simple random sample is selected from each stratum. In contrast, in most applications of cluster sampling, the population is divided into many small clusters, a sample of clusters is randomly selected, and every unit in the cluster is measured.

Cluster sampling is often used because it is easier and more cost effective than other alternatives. For example, suppose we want to sample the households in a large city, using door-to-door interviews. It may be very expensive to construct a list of all households, select *n* addresses at random, and visit each selected household. A cluster sample in which blocks within the city are randomly selected and all households within each selected block are interviewed may be more cost effective. Once a block is selected, the interviewer can conduct several interviews before moving to the next block, reducing the time needed to obtain interviews from the same number of households. However, households within the same block may tend to be more alike than households in different blocks. This tendency of units in the same cluster to be more alike than units in different clusters must be addressed in the analysis. Methods for doing so are described in books on sampling.
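The block-based plan can be sketched as follows: select whole blocks at random, then keep every household in each selected block. The city layout is invented, and `cluster_sample` is a hypothetical helper; the sketch covers selection only, not the analysis adjustments mentioned above.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and return every unit in them.

    `clusters` maps a cluster label (e.g. a block) to its list of units.
    """
    rng = random.Random(seed)
    selected = rng.sample(sorted(clusters), n_clusters)
    return {label: list(clusters[label]) for label in selected}

if __name__ == "__main__":
    # Hypothetical city: 50 blocks with 8 to 15 households each.
    layout = random.Random(0)
    city = {
        f"block_{b:02d}": [
            f"block_{b:02d}_house_{h}" for h in range(layout.randint(8, 15))
        ]
        for b in range(50)
    }
    sample = cluster_sample(city, 5, seed=1)
    for block, households in sample.items():
        print(block, len(households))
```

Because block sizes differ, the number of households interviewed is not fixed in advance; only the number of blocks is.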

**Systematic Sampling**

Suppose that we have a sample frame consisting of a list of 5,000 names and want to draw a sample of 100. To use a systematic sampling plan, we would divide the list into 100 consecutive segments of size 5,000/100 = 50, choose a random point in the first segment, and include in the study that unit and the unit at the same point in every other segment. Upon completion, the sample would consist of 100 units equally spaced throughout the list. Systematic sampling is also used in many natural resource studies. Here, a grid of points is randomly placed over the region. To randomize, one point, say a corner point, is randomly assigned a location within a small area, and the whole grid is set relative to the random placement of that point.

Systematic sampling can be a good alternative to simple random sampling. If the sample units are randomly listed in the sample frame, the systematic sample is usually treated as a simple random sample. Care must be taken, however, because systematic sampling can lead to bias, for example when the list has a periodic pattern that lines up with the sampling interval. The potential biases associated with treating a systematic sample as a simple random sample when using a grid have been discussed in the natural resources literature.
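The list-based procedure (5,000 names, 100 segments of 50) can be sketched as below. The helper name `systematic_sample` is hypothetical, and the sketch assumes the frame size is an exact multiple of the sample size so that all segments are equal.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick one random position in the first segment, then take the unit at
    that same position in every later segment.

    Assumes len(frame) is a multiple of n so that all segments have equal size.
    """
    if len(frame) % n != 0:
        raise ValueError("this sketch assumes the frame size is a multiple of n")
    k = len(frame) // n                       # segment size, e.g. 5000 // 100 = 50
    start = random.Random(seed).randrange(k)  # random point in the first segment
    return frame[start::k]

if __name__ == "__main__":
    names = [f"name_{i:04d}" for i in range(5000)]  # made-up frame
    sample = systematic_sample(names, 100, seed=3)
    print(len(sample), sample[:3])
```

Only one random number is drawn; every other selection is determined by it, which is why the ordering of the list matters so much for this plan.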

**Multistage Sampling**

Many large surveys use a combination of the methods we have discussed. As an illustration, a large national survey may first stratify by regions of the country. Within each regional stratum, we might then stratify by state. Within each state, we could stratify by urban, suburban, and rural areas. We could then randomly select communities within each of the urban, suburban, and rural strata. Finally, we could randomly select blocks or fixed areas within each selected community and interview everyone within that fixed area. A *multistage sampling* plan is one that combines methods as illustrated here.
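A simplified version of such a plan, with one level of stratification followed by two stages of random selection, might look like the sketch below. The nested population, the names `make_toy_population` and `multistage_sample`, and the numbers of communities and blocks are all invented for illustration.

```python
import random

def make_toy_population():
    """Hypothetical population: {stratum: {community: {block: [people]}}}."""
    return {
        area: {
            f"{area}_community_{c}": {
                f"{area}_c{c}_block_{b}": [
                    f"{area}_c{c}_b{b}_person_{i}" for i in range(4)
                ]
                for b in range(6)
            }
            for c in range(5)
        }
        for area in ("urban", "suburban", "rural")
    }

def multistage_sample(strata, communities_per_stratum, blocks_per_community, seed=None):
    """Within every stratum, randomly select communities; within each selected
    community, randomly select blocks; keep everyone in the selected blocks."""
    rng = random.Random(seed)
    people = []
    for stratum in sorted(strata):
        communities = strata[stratum]
        for community in rng.sample(sorted(communities), communities_per_stratum):
            blocks = communities[community]
            for block in rng.sample(sorted(blocks), blocks_per_community):
                people.extend(blocks[block])
    return people

if __name__ == "__main__":
    sample = multistage_sample(make_toy_population(), 2, 3, seed=0)
    print(len(sample), sample[:2])
```

Every stratum contributes to the sample (stratification), while only some communities and blocks do (cluster-style selection), which is the combination the paragraph describes.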

**Random-Digit Dialing**

Most of the national polling organizations and many of the government surveys in the United States now use a sampling plan called random-digit dialing. This method approximates a random sample of all households in the region of interest that have telephones. To initiate a random-digit dialing plan, the polling organization must first get a list of all telephone exchanges in the region of interest. A telephone exchange consists of the area code and the next three digits. Using the numbers listed in the white pages, the proportion of all households in the region with that specific exchange can be approximated. That proportion is used to determine the chance that the telephone exchange is randomly selected for inclusion in the sample. Next, the same process is followed to randomly select banks within each exchange. A telephone bank consists of the next two digits. Finally, the last two digits are randomly selected from 00 to 99. Although the process is quite involved, it has been computerized, and random telephone numbers can be generated rapidly.
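The three stages (exchange, bank, last two digits) can be sketched as follows. This is only an illustration of the idea, not the procedure any particular polling organization uses; the exchanges, banks, weights, and the name `random_digit_dial` are all made up.

```python
import random

def random_digit_dial(exchange_weights, bank_weights, seed=None):
    """Generate one telephone number in three stages.

    exchange_weights: {exchange (area code + next three digits): weight}
    bank_weights: {exchange: {two-digit bank: weight}}
    Weights stand in for the approximate share of households.
    """
    rng = random.Random(seed)
    # Stage 1: choose an exchange with probability proportional to its weight.
    exchanges = list(exchange_weights)
    exchange = rng.choices(
        exchanges, weights=[exchange_weights[e] for e in exchanges]
    )[0]
    # Stage 2: choose a bank within that exchange the same way.
    banks = list(bank_weights[exchange])
    bank = rng.choices(
        banks, weights=[bank_weights[exchange][b] for b in banks]
    )[0]
    # Stage 3: choose the last two digits uniformly from 00 to 99.
    last_two = f"{rng.randrange(100):02d}"
    return f"{exchange}-{bank}{last_two}"

if __name__ == "__main__":
    # Hypothetical exchanges and banks with made-up household shares.
    exchange_weights = {"405-555": 0.6, "918-555": 0.4}
    bank_weights = {
        "405-555": {"01": 0.5, "12": 0.5},
        "918-555": {"34": 1.0},
    }
    print(random_digit_dial(exchange_weights, bank_weights, seed=7))
```

Because the last two digits are generated rather than looked up, unlisted numbers within a selected bank can be reached, which a white-pages sample frame would miss.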

Once a telephone number has been generated, pollsters should make multiple attempts to reach someone at that household if no one responds initially. They may also ask to speak to a male or an adult because females and children are more likely to answer the phone; if the interview were always conducted with whoever answers, these groups would be overrepresented in the sample, potentially biasing the results.