Correlation for AP Statistics (page 2)
Practice problems for these concepts can be found at:
- Two-Variable Data Analysis Multiple Choice Practice Problems for AP Statistics
- Two-Variable Data Analysis Free Response Practice Problems for AP Statistics
- Two-Variable Data Analysis Review Problems for AP Statistics
- Two-Variable Data Analysis Rapid Review for AP Statistics
In this lesson, we want to do some numerical analysis of the data in an attempt to understand the relationship between the variables better.
In AP Statistics, we are primarily interested in determining the extent to which two variables are linearly associated. Two variables are linearly related to the extent that their relationship can be modeled by a line. Sometimes variables are not linearly related but can be transformed in such a way that the transformed data are linear; we will look at this more closely later in this chapter. Sometimes the data are related but not linearly (e.g., the height of a thrown ball t seconds after it is released is a quadratic function of t).
The first statistic we have to determine a linear relationship is the Pearson product moment correlation, or more simply, the correlation coefficient, denoted by the letter r. The correlation coefficient is a measure of the strength of the linear relationship between two variables as well as an indicator of the direction of the linear relationship (whether the variables are positively or negatively associated).
If we have a sample of size n of paired data, say (x, y), and assuming that we have computed summary statistics for x and y (means x̄ and ȳ, standard deviations s_x and s_y), the correlation coefficient r is defined as follows:

r = (1/(n − 1)) Σ [(x_i − x̄)/s_x] · [(y_i − ȳ)/s_y]
Because the terms after the summation symbol are nothing more than the z-scores of the individual x and y values, an easy way to remember this definition is:

r = (1/(n − 1)) Σ z_x · z_y
Example: Earlier in the section, we saw some data for hours studied and the corresponding scores on an exam. It can be shown that, for these data, r = 0.864. This indicates a strong positive linear relationship between hours studied and exam score. That is, the more hours studied, the higher the exam score.
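Since the data table from earlier in the section is not reproduced here, the sketch below uses hypothetical hours/score pairs (not the book's data); the function itself follows the z-score definition of r directly:

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson r: sum of the products of paired z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)   # sample standard deviations (n - 1 in the denominator)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical hours-studied / exam-score pairs, for illustration only
hours  = [1, 2, 3, 4, 5, 6]
scores = [55, 60, 70, 72, 85, 88]
print(round(correlation(hours, scores), 3))   # strong positive value, close to 1
```

Note that points lying exactly on a line with positive slope give r = 1, and a negative slope gives r = −1, matching the properties listed below.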
The correlation coefficient r has a number of properties you should be familiar with:
- –1 ≤ r ≤ 1. If r = –1 or r = 1, the points all lie on a line.
- Although there are no hard and fast rules about how strong a correlation is based on its numerical value, the following guidelines might help you categorize r:
Remember that these are only very rough guidelines. A value of r = 0.2 might well indicate a significant linear relationship (that is, it's unlikely to have gotten 0.2 unless there really was a linear relationship), and an r of 0.8 might be reflective of a single influential point rather than an actual linear relationship between the variables.
- If r > 0, it indicates that the variables are positively associated. If r < 0, it indicates that the variables are negatively associated.
- If r = 0, it indicates that there is no linear association that would allow us to predict y from x. It does not mean that there is no relationship—just not a linear one.
- It does not matter which variable you call x and which variable you call y. r will be the same. In other words, r depends only on how the points are paired, not on the order within each pair.
- r does not depend on the units of measurement. In the previous example, convert "hours studied" to "minutes studied" and r would still equal 0.864.
- r is not resistant to extreme values because it is based on the mean. A single extreme value can have a powerful impact on r and may cause us to overinterpret the relationship. You must look at the scatterplot of the data as well as r.
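Two of these properties, symmetry in x and y and invariance under a change of units, can be checked numerically. The data below are hypothetical:

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson r computed from paired z-scores."""
    n = len(x)
    zx = [(v - mean(x)) / stdev(x) for v in x]
    zy = [(v - mean(y)) / stdev(y) for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

hours  = [1, 2, 3, 4, 5]           # hypothetical data, not the book's
scores = [52, 61, 70, 74, 83]

r = correlation(hours, scores)

# Symmetry: swapping which variable is x and which is y leaves r unchanged.
assert abs(correlation(scores, hours) - r) < 1e-12

# Unit invariance: converting hours to minutes (a linear rescaling) leaves r unchanged.
minutes = [h * 60 for h in hours]
assert abs(correlation(minutes, scores) - r) < 1e-12
```

Both facts follow from the definition: z-scores are unchanged by relabeling or by any positive linear change of units.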
Example: To illustrate that r is not resistant, consider the following two graphs. The graph on the left, with 12 points, has a marked negative linear association between x and y. The graph on the right has the same basic visual pattern but, as you can see, the addition of the one outlier has a dramatic effect on r—making what is generally a negative association between two variables appear to have a moderate, positive association.
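Since the graphs themselves are not reproduced here, the following sketch builds a comparable dataset: 12 points lying on a line with negative slope, plus one extreme point that flips the sign of r. The specific numbers are illustrative, not the book's:

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson r computed from paired z-scores."""
    n = len(x)
    return sum(((a - mean(x)) / stdev(x)) * ((b - mean(y)) / stdev(y))
               for a, b in zip(x, y)) / (n - 1)

# 12 points with a clear negative linear association
x = list(range(12))
y = [11 - v for v in x]
print(round(correlation(x, y), 3))          # -1.0 (the points lie exactly on a line)

# One extreme point in the upper right overwhelms the pattern
x_out = x + [50]
y_out = y + [50]
print(round(correlation(x_out, y_out), 3))  # now a moderately strong positive value
```

A single influential point shifted r from −1 to a positive value, which is exactly why a scatterplot must accompany any reported correlation.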
Example: The following computer output, again for the hours studied versus exam score data, indicates R-sq, which is the square of r. Accordingly, r = √(R-sq) ≈ 0.864. There is a lot of other stuff in the output that doesn't concern us just yet. We will learn about other important parts of the output as we proceed through the rest of this book. Note that we cannot determine the sign of r from R-sq; we need additional information.
("R-sq" is called the "coefficient of determination" and has a lot of meaning in its own right in regression. It can be shown that R-sq is actually the square of r. We will consider the coefficient of determination later in this chapter.)
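A quick numerical check of the sign issue: squaring discards the sign, so R-sq alone cannot tell you whether r is positive or negative.

```python
from math import sqrt

r = 0.864                 # the correlation reported for the hours/score data
r_sq = r ** 2             # the "R-sq" a regression printout would report (about 0.746)

# Taking the square root recovers only the magnitude of r:
# +0.864 and -0.864 have exactly the same square.
assert (-r) ** 2 == r_sq
assert abs(sqrt(r_sq) - abs(r)) < 1e-12
```

To recover the sign, you need something else from the output, such as the sign of the slope of the regression line.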
Correlation and Causation
Two variables, x and y, may have a strong correlation, but you need to take care not to interpret that as causation. That is, just because two things seem to go together does not mean that one caused the other; some third variable may be influencing them both. Seeing a fire truck at almost every fire doesn't mean that fire trucks cause fires.
Example: Consider the following dataset that shows the increase in the number of Methodist ministers and the increase in the amount of imported Cuban rum from 1860 to 1940.
For these data, it turns out that r = 0.999986.
Is the increase in number of ministers responsible for the increase in imported rum? Some cynics might want to believe so, but the real reason is that the population was increasing from 1860 to 1940, so the area needed more ministers, and more people drank more rum.
In this example, there was a lurking variable, increasing population—one we didn't consider when we did the correlation—that caused both of these variables to change the way they did. We will look more at lurking variables in the next chapter, but in the meantime remember, always remember, that correlation is not causation.
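The lurking-variable idea can be sketched numerically: if two series are each (roughly) linear functions of a third variable, their correlation will be nearly perfect even though neither drives the other. The numbers below are invented for illustration and are not the ministers/rum data:

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson r computed from paired z-scores."""
    n = len(x)
    return sum(((a - mean(x)) / stdev(x)) * ((b - mean(y)) / stdev(y))
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical lurking variable: a growing population
population = [10, 12, 15, 19, 24, 30]

# Both quantities track the population, not each other
ministers = [p * 0.02 for p in population]      # grows with population
rum       = [p * 1.5 + 3.0 for p in population] # also grows with population

# Because both are linear functions of the same lurking variable,
# r is essentially 1 even though neither causes the other.
print(round(correlation(ministers, rum), 6))
```

The near-perfect r here says nothing about ministers causing rum imports; it reflects only that both series rise with population.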