Describing and Displaying Bivariate Data Study Guide
Introduction to Describing and Displaying Bivariate Data
Often, two or more numerical measurements are taken on each observational unit. For example, the weight and length of each fish taken from a lake may be recorded. In studies of wetlands, the phosphorus and nitrogen concentrations at several sites might be observed. In another study, the goal was to determine whether a person's ear grows throughout life. For each person in the sample, age and the length of the left ear were recorded. In these studies, we are interested not only in the values each variable takes, but also in how the variables relate to each other. In this lesson, graphical displays for bivariate data and a measure of the relationship between two variables will be explored.
Suppose that more than one, say two, numerical values are recorded for each unit in the study. Sometimes, both of the variables are responses. At other times, we are interested in how the response variable, or dependent variable, relates to the explanatory variable, or independent or predictor variable. In the latter case, we may want to predict the value of the response variable for a specific value of the explanatory variable. More than one explanatory variable may exist for each response variable.
When working with univariate data, we saw that plots (pie charts, bar charts, dotplots, stem-and-leaf plots, histograms, and boxplots) aided our understanding of the data. Although we could construct these types of graphs for each variable and gain a better understanding of each variable, such graphs would not aid our understanding of how the two variables are related. A scatter plot is an effective graph for gaining insight into bivariate data. A scatter plot is a graph in which each observation (pair of numbers) is represented by a point in a rectangular coordinate system. The horizontal axis is identified with the x-axis, and the axis is scaled to cover the range of values of X. The vertical axis is identified with the y-axis, and the axis is scaled to cover the range of values of Y. If both an explanatory and a response variable exist, the x-axis is used for the explanatory variable, and the y-axis is used for the response variable. The point representing the observation (x,y) is placed at the intersection of the vertical line through x on the x-axis and the horizontal line through y on the y-axis. Figure 8.1 shows the point representing the observation (2.5,3) with the corresponding vertical and horizontal lines. These reference lines are not usually included in the plot; they are here only for illustration. All data points would be plotted using the same approach.
A group of students wanted to know whether there was a relationship between the height from which a ball was dropped and its rebound height. Using a basketball, they dropped the ball from each of 11 heights three times and measured how high it rebounded. Both the height from which the ball was dropped and the height of the rebound were measured, in inches, from the bottom of the ball. The data are given in Table 8.1.
- Specify whether drop height and rebound height are explanatory or response variables.
- Create a scatter plot of the data and describe the relationship between drop height and rebound height.
Drop height is the explanatory variable because this is the variable that is controlled during the study. Rebound height is the response variable because the rebound height was measured for a given drop height. Thus, drop height is on the x-axis and rebound height is on the y-axis. The data are plotted in Figure 8.2.
The ball did not always have the same rebound height when it was dropped repeatedly from a specific drop height; there was variability in the rebound heights for a given drop height. The rebound height tends to increase in a linear manner as the drop height increases, though the relationship is certainly not an exact one.
Pearson's Correlation Coefficient
One of the challenges in working with two or more variables is that they can have different units of measurement (inches, pounds, liters, etc.), means, standard deviations, or other characteristics. It is often helpful to have all variables on a common scale. Although there are many possible scales, we transform the original values of each variable so that the mean is zero and the standard deviation is one. z-scores are the transformed values of a random variable that have a mean of zero and a standard deviation of one; that is,
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
If all population values are known, the population mean and standard deviation are used to find the z-score; if sample values are available, the sample mean and sample standard deviation are used to find the z-scores.
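As a minimal sketch of this standardization, the Python snippet below computes sample z-scores with the standard library. The function name `z_scores` and the five data values are illustrative assumptions, not data from this lesson.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return sample z-scores: (value - sample mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]

# Hypothetical measurements, used only for illustration.
lengths = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_scores(lengths)
print([round(v, 3) for v in z])  # [-1.265, -0.632, 0.0, 0.632, 1.265]
```

Whatever the original units, the transformed values have a mean of zero and a sample standard deviation of one, so they also sum to zero.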
Let (x1,y1), (x2,y2), . . . , (xn,yn) be a random sample of n (x, y) pairs. Suppose we replace each x-value by its z-score, zX, by subtracting the sample mean, $\bar{x}$, and dividing by the sample standard deviation, sX. (Note that the subscript on s indicates the variable, here X, for which s is the sample standard deviation. Subscripts are often used in this manner when more than one variable is of interest in order to avoid confusion.) Similarly, suppose that each y-value is replaced by its z-score, zY. Note that, if x (or y) is larger than the sample mean $\bar{x}$ (or $\bar{y}$), zX (or zY) is positive. Likewise, if x (or y) is smaller than the sample mean $\bar{x}$ (or $\bar{y}$), zX (or zY) is negative.
Consider the sample of (x,y) pairs displayed in the graph in Figure 8.3. It is clear that there is a strong positive relationship between X and Y. The dashed horizontal line through $\bar{y}$ and the dashed vertical line through $\bar{x}$ divide the graph into four quadrants, which are labeled I, II, III, and IV. In quadrant I, both x and y are above their respective sample means; thus, zX and zY are positive, and zXzY is positive. For (x,y) in quadrant II, x is below its sample mean and y is above its sample mean; therefore, zXzY is negative. Notice that zXzY is positive in quadrant III because zX and zY are both negative and the product of two negative numbers is a positive number. Finally, because x is above its mean and y is below its mean, zXzY is negative in quadrant IV. Notice that, for the rebounding ball example, almost all of the points are in quadrants I and III, so the sum of the products, $\sum z_X z_Y$, would be positive. In contrast, if most of the points lie in quadrants II and IV, $\sum z_X z_Y$ would be negative.
These ideas are the foundation for Pearson's correlation coefficient r, which provides a measure of the strength of the linear relationship between X and Y. Pearson's correlation coefficient is defined to be
$$r = \frac{1}{n-1}\sum z_X z_Y$$
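The idea of summing z-score products can be sketched in a few lines of Python. This sketch assumes the usual sample form of the definition, in which the sum of the products zXzY is divided by n − 1; the function name `pearson_r` and the five (x, y) pairs are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical pairs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(round(pearson_r(xs, ys), 3))  # 0.775
```

Most of these hypothetical points fall in quadrants I and III relative to the two sample means, so the products are mostly positive and r is positive.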
The correlation coefficient has some important properties. First, the value r is unitless; that is, it does not depend on the unit of measurement of either variable. X and Y can be measured in inches, meters, or light years, and the value of r would not change. Second, it does not matter which variable is labeled X and which is labeled Y; the value of r will be the same. Third, Pearson's r is always between –1 and +1. A value of +1 or –1 occurs when an exact linear relationship exists between X and Y. If r = 1, the slope of the line is positive; if r = –1, the slope of the line is negative. The closer r is to 1 or –1, the stronger the linear relationship between X and Y is. Finally, it is important to realize that r measures only the linear relationship between X and Y. It is possible for X and Y to have a very strong relationship and for r to be near zero. In these cases, the strong relationship is not linear in nature. Some scatter plots with the associated r values are shown in Figure 8.4.
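These properties can be checked numerically. The sketch below uses hypothetical heights and weights (not data from this lesson) and a compact helper `pearson_r` that assumes the z-score form of r with sample standard deviations. It compares r before and after a change of units, after swapping X and Y, and for an exact but nonlinear relationship.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-score products (sample standard deviations)."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
inches = [60.0, 62.0, 65.0, 68.0, 71.0]
pounds = [115.0, 130.0, 128.0, 155.0, 170.0]

r_original = pearson_r(inches, pounds)
# Changing units (inches to centimeters) should leave r unchanged.
r_rescaled = pearson_r([2.54 * v for v in inches], pounds)
# Swapping which variable is called X and which is called Y should leave r unchanged.
r_swapped = pearson_r(pounds, inches)

# An exact but nonlinear relationship (y = x squared on symmetric x-values)
# has no linear component, so r is zero up to rounding error.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
r_parabola = pearson_r(xs, [x ** 2 for x in xs])

print(r_original, r_rescaled, r_swapped)
print(abs(r_parabola) < 1e-9)
```

The first three values agree up to floating-point rounding. The last line illustrates that r near zero does not rule out a strong relationship, only a linear one.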
- Find the z-scores associated with the drop heights and the rebound heights of the dropped basketball.
- Find the Pearson's correlation coefficient for these data.
- Relate the correlation coefficient r to the graph.
- Many calculators have a built-in function that can be used to compute the correlation coefficient. We will not use that function here, but instead demonstrate a way to organize the computations required to find r when such a function key is not available. First, we need to find the sample mean and sample standard deviation for the drop height and for the rebound height. The sample mean for the drop height is 42 inches, and the sample standard deviation of drop height is 19.2678489 inches. The sample mean for the rebound height is 22.9090909 inches, and the sample standard deviation is 11.4390057 inches. Notice that we are carrying all of the decimal places at this point to reduce the effect of rounding error on our value for r. If we were to report these values, we would round to about one decimal place. Using the sample means and sample standard deviations, we find the z-scores for drop height and rebound height for each observation as well as the product of the two. These are presented in Table 8.3.
The values of zDrop and zRebound sum to zero because the transformation of the sample values was designed to give a mean of zero, making the total also zero.
Pearson's r is found using the total of the products of these two values; that is, with n = 33 observations,
$$r = \frac{1}{33 - 1}\sum z_{\text{Drop}}\, z_{\text{Rebound}}$$
- This high value of r, together with the scatter plot of the data, allows us to conclude that a line would provide a good model for these data, at least in the range of drop heights considered in this study. It is important to look at both the scatter plot and r, not just r by itself, before concluding that a relationship is linear, as the following example demonstrates.
Because we have a sample for the rebound heights, Pearson's r is an estimate of the population correlation coefficient rho, or ρ. The population correlation coefficient has the same basic properties as r. However, it is important to remember that ρ does not change; it is a population characteristic. In contrast, r is based on a sample. If another sample is drawn from the same population, the value of r is very likely to change. Each sample provides an estimate of rho. In the example of dropping a basketball, we have a sample of size n = 33. We can use the data to estimate the population correlation. If we were to conduct the study again, we would undoubtedly obtain a similar but different value of r, which would also be an estimate of the population correlation.
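A small simulation can illustrate how r varies from sample to sample while ρ stays fixed. The population used here (x uniform between 10 and 70, y equal to half of x plus normal noise) is an assumption made up for this sketch and is not the basketball data; only the sample size of 33 is borrowed from the example.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-score products (sample standard deviations)."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def draw_sample(n, rng):
    """Draw n (x, y) pairs from a hypothetical population: y = 0.5 * x + noise."""
    xs = [rng.uniform(10, 70) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 5) for x in xs]
    return xs, ys

rng = random.Random(1)  # fixed seed so the run is repeatable
r_values = [pearson_r(*draw_sample(33, rng)) for _ in range(5)]
print([round(r, 3) for r in r_values])
```

Each of the five values is an estimate of the same population correlation; they should be similar to one another but not identical.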
We have collected the following (x,y) pairs: (–8.7,–0.6), (–8.2,–1.3), (–6.1,–2.0), (–4.1,–4.0), (–1.6,–5.5), (–0.2,–6.0), (0.7,4.6), (1.4,4.2), (3.8,3.8), (6.5,3.1), (8.2,2.0), and (9.1,0.6).
- Construct a scatter plot of the data and discuss the relationship in X and Y based on the plot.
- Find Pearson's correlation coefficient r.
- Discuss the relationship in r and what was observed in the graph.
The scatter plot of the data is shown in Figure 8.5.
It appears that there are two groups of responses. The group in the upper right portion of the graph seems to have decreasing y-values as x increases. Similarly, the group in the lower left portion of the graph seems to have decreasing y-values as x increases.
To find Pearson's correlation coefficient, we organize the data in tabular form, compute zx, zy, and zx zy for each observation, and total the columns (see Table 8.4).
As expected, the sums of zx and zy are zero. The sum of the products zx zy is used to find Pearson's r:
$$r = \frac{1}{12 - 1}\sum z_x z_y \approx 0.477$$
This value of r would tend to lead us to believe that there is a moderately strong, positive, linear relationship between X and Y. This would be the wrong conclusion for these data. This is why it is so important to look at the scatter plot when interpreting r. If we consider the two groups separately, r for the group with negative x values is –0.992, and r for the group with positive x values is –0.941. Both of these suggest a strong, negative relationship between X and Y. When something like this happens, the researcher must try to determine what distinguishes the two groups. If the observations were taken from people, factors such as gender, age, or disease status might account for the two groups.
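The comparison between the overall r and the two within-group values can be reproduced from the twelve (x, y) pairs listed in this example. The sketch below is one way to organize the computation; the helper names `pearson_r` and `r_for` are assumptions, and the groups are split simply by the sign of x.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-score products (sample standard deviations)."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# The twelve pairs given in the example.
pairs = [(-8.7, -0.6), (-8.2, -1.3), (-6.1, -2.0), (-4.1, -4.0),
         (-1.6, -5.5), (-0.2, -6.0), (0.7, 4.6), (1.4, 4.2),
         (3.8, 3.8), (6.5, 3.1), (8.2, 2.0), (9.1, 0.6)]

def r_for(subset):
    """Compute r for a list of (x, y) pairs."""
    xs, ys = zip(*subset)
    return pearson_r(xs, ys)

r_all = r_for(pairs)                               # all twelve points
r_left = r_for([p for p in pairs if p[0] < 0])     # group with negative x
r_right = r_for([p for p in pairs if p[0] > 0])    # group with positive x
print(round(r_all, 3), round(r_left, 3), round(r_right, 3))
```

The pooled value is positive while both within-group values are strongly negative, matching the –0.992 and –0.941 quoted above. Pooling the two groups reverses the sign of the apparent relationship.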
Describing and Displaying Bivariate Data In Short
Bivariate data arise often in studies. The relationship between the two variables is often of great interest. Scatter plots are visual displays of the data that help us understand how the two variables might be related. Pearson's correlation coefficient r is a measure of the strength of the linear relationship between the variables. It is important to look at the scatter plot when interpreting the meaning of r.
Find practice problems and solutions for these concepts at Describing and Displaying Bivariate Data Practice Questions.