Introduction to Populations, Samples, and Variables
Numerical information permeates our lives. The morning weather report forecasts the chance of rain, and we make a decision as to whether or not to take an umbrella. Given the latest study results on the health risks of over-the-counter painkillers, we decide whether to take something to reduce the pain from a sore knee. A friend wants to attend a very selective university and wonders whether an SAT score of 1,400 or higher will ensure her admission. A neighbor was told that there was a peculiar shadow on an X-ray and must decide whether to have a biopsy taken. The stock market has had several days of losses, and an investor wonders whether this trend will continue. Our understanding of these and many other issues will be deeper as we learn more about the discipline of statistics. But what is statistics? Learning the answer to this question, as well as some fundamental terms in statistics, such as population, sample, and variable, is the focus of this lesson.
Populations and Samples
Statistics is the science of collecting, analyzing, and drawing conclusions from data. This process of collecting, analyzing, and drawing conclusions begins with the desire to answer a question about a specific population. In statistics, a population is the collection of individuals or objects of interest. These individuals or objects may be referred to as members or units of the population. If we are able to record all desired information on each unit in the population, then we have taken a census. The problem is that we rarely have the ability to gather the information from every unit of the population, due to financial constraints, time limitations, or some other reason. We must be satisfied with observing the information for only a sample, or a subset of the population of interest.
Care must be taken in obtaining the sample if we are to be able to draw solid conclusions from it. For example, if we are interested in whether a majority of the voters in a particular state would favor increasing the minimum driving age, then we would not want to simply call several households and ask the person answering the phone whether he or she favored increasing the minimum driving age. In households with children, the children are more likely to answer the phones than the adults, and the views of these nonvoters might be quite different from those of their voting parents. Deciding how to select the sample from the population is an important aspect of data collection.
Example
A proposal before a state's legislature would increase the gasoline tax. The additional funds would be used to improve the state's roads. Some state legislators are concerned about how the voters view this proposal. To gain this information, a pollster randomly selects 1,009 registered voters in the state and asks each whether or not he or she favors the additional tax for the designated purpose. Describe the population and sample.
Solution
The population is all registered voters in the state. The sample is made up of the 1,009 registered voters who were polled.
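The distinction between the population and the sample in this example can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the population size of 50,000 and the 52% level of support are hypothetical, as is the helper name draw_sample. Only the sample size of 1,009 comes from the example.

```python
import random

def draw_sample(population, n, seed=None):
    """Return a simple random sample of n units, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population of registered voters (size and split are invented):
# 1 = favors the additional tax, 0 = does not favor it.
population = [1] * 26_000 + [0] * 24_000

# The pollster observes only these 1,009 voters.
sample = draw_sample(population, 1_009, seed=1)

# Knowable exactly only if a census is taken.
population_share = sum(population) / len(population)

# What the pollster can actually compute from the sample.
sample_share = sum(sample) / len(sample)
```

The sample share will usually be close to, but not exactly equal to, the population share, which is why the way the sample is selected matters.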
Types of Variables
There are two primary branches of statistics: descriptive statistics and inferential statistics. Once data have been collected or an appropriate data source identified, the information should be organized and summarized. Tables, graphs, and numerical summaries allow increased understanding and are efficient ways to present the data. Descriptive statistics is the branch of statistics that focuses on summarizing and displaying data.
Sometimes, description alone is not enough. People want to use data to answer questions or to evaluate decisions that have been made. Inferential statistics is the branch of statistics that uses the information gathered from a sample to make statements about the population from which it was selected. Because we have seen only a portion of the population (the sample), there is a chance that an incorrect conclusion can be made about the population. One role of statistics is to quantify the chance of making an incorrect conclusion.
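One way to get a feel for the chance of an incorrect conclusion is to simulate many samples from a population whose true answer is known. The sketch below is a rough illustration, not a formal inferential procedure. It assumes a hypothetical population in which 52% favor a proposal, treats each sampled person as an independent draw, and uses an invented function name, wrong_conclusion_rate.

```python
import random

def wrong_conclusion_rate(true_share, n, trials, seed=0):
    """Estimate how often a sample of size n lands on the wrong side of 50%."""
    rng = random.Random(seed)
    population_majority = true_share > 0.5
    wrong = 0
    for _ in range(trials):
        # Each sampled person favors the proposal with probability true_share.
        favor = sum(rng.random() < true_share for _ in range(n))
        # A tie is counted as "no majority in the sample".
        sample_majority = favor / n > 0.5
        if sample_majority != population_majority:
            wrong += 1
    return wrong / trials

# Hypothetical setting: 52% of the population favors the proposal.
rate_small = wrong_conclusion_rate(0.52, n=100, trials=2000)
rate_large = wrong_conclusion_rate(0.52, n=1009, trials=2000)
```

Under these assumptions the larger sample points to the wrong majority far less often than the smaller one, although neither rate is zero.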
If every population unit (person or object) were identical, no need would exist for statistics. For example, if all adult men in the United States were exactly the same number of inches tall, we could measure the height of one adult male in the United States and then know exactly how tall all U.S. adult males are. Obviously, that will not work. The heights of men vary. Some are taller than 70 inches; some are shorter than 70 inches; a small proportion is 70 inches tall. It is this variability in heights that makes determining height characteristics about the population of U.S. adult males a statistical challenge.
Every person or object in a given population typically has several characteristics that might be studied. Suppose we are interested in studying the fish in a lake. The length, weight, age, gender, and the level of methyl mercury are but a few of the characteristics that could be recorded from each. A variable is a characteristic that may be recorded for each unit in the population, and the observed value of the variable is generally not the same for all units. The length, weight, age, gender, and mercury level of the fish are five variables, some or all of which might be of interest in a particular study.
Data consist of observations made on one or more variables for each sampled unit. A univariate data set consists of observations collected regarding only one variable from each unit in the sample or population. A bivariate data set results from observations collected regarding two variables from each unit in the sample or population. When observations are collected on three or more variables, then we have a multivariate data set. (Sometimes, bivariate data sets are called multivariate data sets. Because multi implies more than one, this is an acceptable use of the term.)
When working with bivariate or multivariate data, the variables may have different uses. For the fish data, the goal of the study may be to predict the level of methyl mercury in fish; that is, methyl mercury level is the response variable. A response variable, or outcome variable, is one whose outcome is of primary interest. The methyl mercury level could depend on many factors, including the environment and traits of the fish. Fish length, age, weight, and gender may be potentially useful in explaining the level of methyl mercury and are called explanatory variables. An explanatory variable is one that may explain or cause differences in the response variable.
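The fish study can be laid out as a small table of records to make these terms concrete. The sketch below is illustrative only: the three records are invented, the field names and units (millimeters, grams, parts per million) are assumptions, and select is a hypothetical helper.

```python
# Hypothetical fish records; every value here is invented for illustration.
fish = [
    {"length_mm": 310.0, "weight_g": 420.0, "age_yr": 3,
     "gender": "female", "mercury_ppm": 0.21},
    {"length_mm": 275.0, "weight_g": 305.0, "age_yr": 2,
     "gender": "male", "mercury_ppm": 0.15},
    {"length_mm": 402.0, "weight_g": 910.0, "age_yr": 5,
     "gender": "female", "mercury_ppm": 0.38},
]

def select(records, variables):
    """Keep only the named variables from each unit, one tuple per unit."""
    return [tuple(record[v] for v in variables) for record in records]

# One variable per unit: univariate data.
univariate = select(fish, ["length_mm"])

# Two variables per unit: bivariate data.
bivariate = select(fish, ["length_mm", "mercury_ppm"])

# Three or more variables per unit: multivariate data.
multivariate = select(fish, ["length_mm", "weight_g", "age_yr", "mercury_ppm"])

# The response variable is the outcome of primary interest;
# the remaining variables are candidate explanatory variables.
response = "mercury_ppm"
explanatory = [name for name in fish[0] if name != response]
```

The same three fish yield a univariate, bivariate, or multivariate data set depending only on how many variables are kept for each unit.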
Notice, in the fish example, that the natures of the variables differ. Length, weight, and age are numerical (or quantitative) variables; that is, each observation for these variables is a number. A numerical variable is said to be continuous if the set of possible values that may be observed for the variable has an uncountable number of points; that is, the set of possible values of the variable includes one or more intervals on the number line. Length and weight represent two continuous variables. Both must be positive. Although the sensitivity of the measuring device may limit us to recording observations to the nearest millimeter or gram, the true values could be any value in an interval.
A numerical variable is said to be discrete if the set of possible values that may be observed for this variable has a countable number of points. The ages of fish are often determined by growth rings on the scales. In the summer, fish grow rapidly, forming a band of widely separated, light rings. During the winter, slower growth is indicated by narrow separations between the rings, resulting in a dark band. Each pair of rings indicates one year. Because fish spawn at a specific time of year, during the spring for many species, age is generally recorded by year. Age 0 fish are less than a year old, age 1 fish are between 1 and 2 years, etc. Thus, age is a discrete variable with possible values of 0, 1, 2, . . . Although there is undoubtedly an upper limit to age, we have represented the possible ages as being a countably infinite number of values.
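The recording conventions for age and length can be written as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, and a fish that is exactly 1 year old is assumed to be recorded as age 1.

```python
import math

def recorded_age(exact_age_years):
    """Whole-year age class: 0 for under a year, 1 for 1 up to 2 years, etc."""
    if exact_age_years < 0:
        raise ValueError("age cannot be negative")
    return math.floor(exact_age_years)

def recorded_length_mm(true_length_mm):
    """Length as written down when the device reads to the nearest millimeter."""
    return round(true_length_mm)

# Recorded ages take only the values 0, 1, 2, ... (a discrete variable).
ages = [recorded_age(a) for a in (0.4, 1.0, 1.99, 6.25)]

# The true length may be any value in an interval (a continuous variable),
# even though the recorded value is limited by the measuring device.
length = recorded_length_mm(312.37)
```

Rounding a length does not make length a discrete variable; the underlying set of possible true values is still an interval.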
Gender is a different type of variable; it is categorical (or qualitative) in nature. A variable is categorical if the possible responses are categories. Each fish is in one of two categories: male or female. We may arbitrarily associate a number with the category, but that does not change the nature of the variable. Car manufacturers, brands of battery, and types of injury are other examples of categorical variables.
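A short sketch can show why attaching numbers to categories does not turn a categorical variable into a numerical one. The five gender observations and the two codings below are invented for illustration, and the mapping VARIABLE_TYPES simply restates the classifications given in this lesson.

```python
from collections import Counter

# Classification of the fish variables as described in this lesson.
VARIABLE_TYPES = {
    "length": "numerical, continuous",
    "weight": "numerical, continuous",
    "age": "numerical, discrete",
    "gender": "categorical",
}

# Hypothetical observations of gender for five fish.
genders = ["female", "male", "female", "female", "male"]

# Two equally arbitrary ways of attaching numbers to the categories.
coding_a = {"male": 0, "female": 1}
coding_b = {"male": 2, "female": 1}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# The "average gender" depends entirely on which arbitrary coding was chosen.
mean_a = mean([coding_a[g] for g in genders])
mean_b = mean([coding_b[g] for g in genders])

# Counting units per category does not depend on any coding.
counts = Counter(genders)
```

Because the two codings give different averages for the same fish, the average describes the coding rather than the fish, whereas the category counts stay the same under any labeling.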
Populations, Samples, and Variables In Short
This lesson has provided a brief overview of some of the key ideas in statistics. As with any science, terms have special meaning, and a number of the common statistical terms have been introduced in this lesson. Both the ideas and terms will be encountered frequently throughout this text, helping you to become more comfortable with them.