Describing and Displaying Bivariate Data Study Guide
Introduction to Describing and Displaying Bivariate Data
Often, two or more numerical measurements are taken on each observational unit. For example, the weight and length of each fish taken from a lake may be recorded. In studies of wetlands, the phosphorus and nitrogen concentrations at several sites might be observed. In another study, the goal was to determine whether a person's ear grows throughout life. For each person in the sample, age and the length of the left ear were recorded. In these studies, we are interested not only in the values that each variable takes, but also in how the variables relate to each other. In this lesson, graphical displays for bivariate data and a measure of the relationship between two variables will be explored.
Suppose that more than one, say two, numerical values are recorded for each unit in the study. Sometimes, both of the variables are responses. At other times, we are interested in how the response variable, or dependent variable, relates to the explanatory variable, also called the independent or predictor variable. In the latter case, we may want to predict the value of the response variable for a specific value of the explanatory variable. More than one explanatory variable may exist for each response variable.
When working with univariate data, we saw that plots (pie charts, bar charts, dotplots, stem-and-leaf plots, histograms, and boxplots) aided our understanding of the data. Although we could construct these types of graphs for each variable and gain a better understanding of each variable, such graphs would not aid our understanding of how the two variables are related. A scatter plot is an effective graph for gaining insight into bivariate data. A scatter plot is a graph in which each observation (pair of numbers) is represented by a point in a rectangular coordinate system. The horizontal axis is identified with the x-axis, and the axis is scaled to cover the range of values of X. The vertical axis is identified with the y-axis, and the axis is scaled to cover the range of values of Y. If both an explanatory and a response variable exist, the x-axis is used for the explanatory variable, and the y-axis is used for the response variable. The point representing the observation (x,y) is placed at the intersection of the vertical line through x on the x-axis and the horizontal line through y on the y-axis. Figure 8.1 shows the point representing the observation (2.5,3) with the corresponding vertical and horizontal lines. These reference lines are not usually included in the plot; they are here only for illustration. All data points would be plotted using the same approach.
A group of students wanted to know whether there was a relationship between the height from which a ball was dropped and its rebound height. Using a basketball, they dropped the ball three times from each of 11 heights and measured how high it rebounded. Both the height from which the ball was dropped and the height of the rebound were measured, in inches, from the bottom of the ball. The data are given in Table 8.1.
- Specify whether drop height and rebound height are explanatory or response variables.
- Create a scatter plot of the data and describe the relationship between drop height and rebound height.
Drop height is the explanatory variable because this is the variable that is controlled during the study. Rebound height is the response variable because the rebound height was measured for a given drop height. Thus, drop height is on the x-axis and rebound height is on the y-axis. The data are plotted in Figure 8.2.
The ball did not always have the same rebound height when it was dropped repeatedly from a specific drop height; there was variability in the rebound heights for a given drop height. The rebound height tends to increase in a linear manner as the drop height increases, though the relationship is certainly not an exact one.
Pearson's Correlation Coefficient
One of the challenges in working with two or more variables is that they could have different units of measurement (inches, pounds, liters, etc.), means, standard deviations, or other characteristics. It is often helpful to have all variables on a common scale. Although there are many possible scales, we transform the original values of each variable so that the mean is zero and the standard deviation is one. z-scores are the transformed values of a random variable that have a mean of zero and a standard deviation of one; that is,
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}.$$
If all population values are known, the population mean and standard deviation are used to find the z-score; if sample values are available, the sample mean and sample standard deviation are used to find the z-scores.
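As a minimal sketch of the sample version of this transformation, the Python snippet below computes z-scores with the standard library. The function name `z_scores` and the list `lengths` are illustrative assumptions, and the numbers are invented; they are not data from this lesson.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return sample z-scores: (value - sample mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


# Hypothetical measurements, invented purely for illustration.
lengths = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_scores(lengths)

print([round(v, 3) for v in z])
# After the transformation the sample mean is (numerically) zero
# and the sample standard deviation is (numerically) one.
print(abs(mean(z)) < 1e-9, abs(stdev(z) - 1.0) < 1e-9)
```

If all population values were known, `statistics.pstdev` (which divides by n) would take the place of `stdev`.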
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a random sample of n (x, y) pairs. Suppose we replace each x-value by its z-score, $z_X$, by subtracting the sample mean, $\bar{x}$, and dividing by the sample standard deviation, $s_X$. (Note that the subscript on s indicates the variable, here X, for which s is the sample standard deviation. Subscripts are often used in this manner when more than one variable is of interest in order to avoid confusion.) Similarly, suppose that each y-value is replaced by its z-score, $z_Y$. Note that, if x (or y) is larger than the sample mean $\bar{x}$ (or $\bar{y}$), $z_X$ (or $z_Y$) is positive. Likewise, if x (or y) is smaller than the sample mean $\bar{x}$ (or $\bar{y}$), $z_X$ (or $z_Y$) is negative.
Consider the sample of (x, y) pairs displayed in the graph in Figure 8.3. It is clear that there is a strong positive relationship between X and Y. The dashed horizontal line through $\bar{y}$ and the dashed vertical line through $\bar{x}$ divide the graph into four quadrants, which are labeled I, II, III, and IV. In quadrant I, both x and y are above their respective sample means; thus, $z_X$ and $z_Y$ are positive, and $z_X z_Y$ is positive. For (x, y) in quadrant II, x is below its sample mean and y is above its sample mean; therefore, $z_X z_Y$ is negative. Notice that $z_X z_Y$ is positive in quadrant III because $z_X$ and $z_Y$ are both negative and the product of two negative numbers is a positive number. Finally, because x is above its mean and y is below its mean, $z_X z_Y$ is negative in quadrant IV. Notice that, for the rebounding ball example, almost all of the points are in quadrants I and III, so $\sum z_X z_Y$ would be positive. In contrast, if most of the points lie in quadrants II and IV, $\sum z_X z_Y$ would be negative.
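The quadrant argument can be checked numerically. The sketch below assigns each point to a quadrant relative to the two sample means and then sums the products of the deviations from the means. Because the sample standard deviations are positive, each product has the same sign as the corresponding product of z-scores. The helper `quadrant` and the data in `xs` and `ys` are hypothetical, invented for illustration only.

```python
from statistics import mean


def quadrant(x, y, x_bar, y_bar):
    """Label the quadrant of (x, y) relative to the lines through the sample means."""
    if x > x_bar and y > y_bar:
        return "I"
    if x < x_bar and y > y_bar:
        return "II"
    if x < x_bar and y < y_bar:
        return "III"
    if x > x_bar and y < y_bar:
        return "IV"
    return "on a dividing line"


# Hypothetical (x, y) pairs with a positive trend; not data from this lesson.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.5, 5.0]
x_bar, y_bar = mean(xs), mean(ys)

counts = {}
for x, y in zip(xs, ys):
    q = quadrant(x, y, x_bar, y_bar)
    counts[q] = counts.get(q, 0) + 1

# Each term has the same sign as the product of the two z-scores for that point.
total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

print(counts)
print(total > 0)  # most points fall in quadrants I and III, so the sum is positive
```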
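These ideas are the foundation for Pearson's correlation coefficient r, which provides a measure of the strength of the linear relationship between X and Y. Pearson's correlation coefficient is defined to be
$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{X_i} z_{Y_i} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_i - \bar{y}}{s_Y}\right).$$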
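A minimal Python sketch of this calculation, assuming the sample form of r (sample means, sample standard deviations, and division by n − 1), is shown below. The function name `pearson_r` and the two small data sets are hypothetical and are not taken from Table 8.1.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample Pearson correlation: sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    n = len(xs)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical paired observations, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 4))

# Points lying exactly on a line with positive slope give r = 1 (up to rounding error).
r_line = pearson_r(xs, [2 * x + 1 for x in xs])
print(round(r_line, 10))
```

The function raises an error when a variable has no spread, because a standard deviation of zero would make the z-scores undefined.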
The correlation coefficient has some important properties. First, the value of r is unitless; that is, it does not depend on the unit of measurement of either variable. X and Y can be measured in inches, meters, or light years, and the value of r would not change. Second, it does not matter which variable is labeled X and which is labeled Y; the value of r will be the same. Third, Pearson's r is always between –1 and +1, inclusive. A value of +1 or –1 occurs when an exact linear relationship exists between X and Y. If r = +1, the slope of the line is positive; if r = –1, the slope of the line is negative. The closer r is to +1 or –1, the stronger the linear relationship between X and Y is. Finally, it is important to realize that r measures only the linear relationship between X and Y. It is possible for X and Y to have a very strong relationship and for r to be near zero. In these cases, the strong relationship is not linear in nature. Some scatter plots with the associated r values are shown in Figure 8.4.
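These properties can be illustrated with a short Python sketch. It assumes the sample form of r (division by n − 1); the helper `pearson_r` and all of the numbers are hypothetical and were invented for illustration. The example checks that converting inches to centimeters leaves r unchanged, that swapping the variables leaves r unchanged, and that an exact but nonlinear relationship (y equal to the square of x, with x values symmetric about zero) gives r equal to zero.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample Pearson correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired measurements in inches.
x_in = [12.0, 24.0, 36.0, 48.0, 60.0]
y_in = [9.0, 17.5, 27.0, 34.5, 45.0]

# Property 1: r is unitless. Converting both variables to centimeters changes nothing.
x_cm = [2.54 * v for v in x_in]
y_cm = [2.54 * v for v in y_in]
r_inches = pearson_r(x_in, y_in)
r_cm = pearson_r(x_cm, y_cm)

# Property 2: swapping the roles of X and Y leaves r unchanged.
r_swapped = pearson_r(y_in, x_in)

# Property 4: an exact relationship that is not linear can still give r = 0.
x_sym = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_sq = [v ** 2 for v in x_sym]
r_curve = pearson_r(x_sym, y_sq)

print(abs(r_inches - r_cm) < 1e-9)       # same r in either unit
print(abs(r_inches - r_swapped) < 1e-9)  # same r with the labels swapped
print(-1.0 <= r_inches <= 1.0)           # property 3: r lies between -1 and +1
print(abs(r_curve) < 1e-9)               # strong relationship, but not a linear one
```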