Introduction to Correlation Principles
Let's examine correlation a little more closely now. When two things are correlated, does one cause the other? Does a third phenomenon cause both? Is there any cause-and-effect relationship at all? People often conclude that there is a cause-and-effect relationship when they see a correlation. But this is not necessarily true.
Quantitative Versus Qualitative
Correlation (often symbolized by the italicized, lowercase letter r) can be numerically defined only between variables that can be quantified. Examples of quantitative variables include time, temperature, and average monthly rainfall.
It's possible to qualitatively express the correlation between two variables if one or both of them cannot be quantified. But it's not possible to quantitatively express correlation unless both variables and their relationship can be quantified. Even if it seems obvious that two variables are correlated, there is a big difference between saying that, for example, "rudeness and violence are strongly correlated" and "the correlation between rudeness and violence is +0.75." Violence can be quantified on the basis of crime statistics, but rudeness is much harder to express numerically.
Imagine that a massive social experiment is conducted over a period of years, and researchers conclude that people develop schizophrenia more often in some geographic regions than in others. Suppose, for example, that more people with this disorder live at high elevations in the mountains, where there is lots of snow and the weather is cool all year round, than at low elevations near tropical seas, where it rains often and the weather is warm all year. Both of these variables (schizophrenia and environment) are difficult or impossible to quantify. In particular, if you took 100 psychiatrists and asked them to diagnose a person who behaves strangely, you might end up with 40 diagnoses of "schizophrenia," 10 diagnoses of "paranoid psychosis," 15 diagnoses of "depression," 5 diagnoses of "bipolar disorder," 25 diagnoses of "normal but upset," and 5 verdicts of "not enough information to make a diagnosis." While the largest proportion (40%) of the doctors in this breakdown think the person has schizophrenia, that is not even a simple majority. Such a diagnosis is not absolute, as it would be with an unmistakable physical ailment such as a brain tumor.
Correlation Range
The first thing we should know about correlation, as shown or implied by a scatter plot, was suggested earlier in this book. But it's so important that it bears repetition. Correlation can be expressed as a numerical value r such that the following restriction holds:
–1 ≤ r ≤ +1
This means the mathematical correlation can be equal to anything between, and including, –1 and +1. Sometimes percentages are used instead, so the possible range of correlation values, r%, is as follows:
–100% ≤ r% ≤ +100%
A correlation value of r = –1 represents the strongest possible negative correlation; r = +1 represents the strongest possible positive correlation. Moderately strong positive correlation might be reflected by a figure of r = +0.7; weak negative correlation might show up as r% = –20%. A value of r = 0 or r% = 0% means there is no correlation at all. Interestingly, the absence of any correlation can be more difficult to prove than the existence of correlation, especially if the number of samples (or points in a scatter plot) is small.
It's impossible for anything to be correlated with anything else to an extent beyond the above limits. If you ever hear anyone talking about two phenomena being correlated by "a factor of –2" or "r = 150%," you know they're wrong. In addition to this, we need to be careful when we say that two effects are correlated to "twice the extent" of two other effects. If two phenomena are correlated by a factor of r = +0.75, and someone comes along and tells you that changing the temperature (or some other parameter) will "double the correlation," you know something is wrong because this suggests that the correlation could become r = +1.50, an impossibility.
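The limits above can be checked numerically. The following is a minimal sketch, assuming that r means the usual Pearson correlation coefficient (the text does not name the formula). The helper names `pearson_r` and `sample_r_values` are hypothetical, and the data are random numbers, not measurements from this book.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_r_values(trials=1000, points=20, seed=0):
    """Compute r for many scatter plots of unrelated random points."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        xs = [rng.uniform(-10, 10) for _ in range(points)]
        ys = [rng.uniform(-10, 10) for _ in range(points)]
        values.append(pearson_r(xs, ys))
    return values


r_values = sample_r_values()
# The same values expressed as percentages (r%).
r_percent = [100 * r for r in r_values]

print("smallest r:", min(r_values))
print("largest r:", max(r_values))
```

However the points are scattered, every value stays between –1 and +1, and every percentage stays between –100% and +100%.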
Correlation is Linear
There are plenty of computer programs that can calculate correlation numbers based on data input or scatter plots. In this book, we won't get into the actual formulas used to calculate correlation. The formulas are messy and tedious for any but the most oversimplified examples. At this introductory level, it's good enough for you to remember that correlation is a measure of the extent to which the points in a scatter plot are concentrated near the least-squares line.
The key word in correlation determination is the word "line." Correlation in a scatter plot is defined by the nearness of the points to a particular straight line determined from the points on the plot. If the points lie along a perfectly straight line (one that is neither horizontal nor vertical), then either r = –1 or r = +1. The value of r is positive if the values of both variables increase together. The value of r is negative if one value decreases as the other value increases.
Once in a while, you'll see a scatter plot in which all the points lie along a smooth curve, but that curve is not a straight line. This is a special sort of perfection in the relationship between the variables; it indicates that one is a mathematical function of the other. But points on a non-straight curve do not indicate a correlation of either –1 or +1. Figure 7-1A shows a scatter plot in which the correlation is +1.

Figure 7-1B shows a scatter plot in which the correlation is perfect in the sense that the points lie along a smooth curve, but in fact the correlation is much less than +1.
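A short numerical sketch can make the difference between a line and a curve concrete. It assumes that r is the usual Pearson coefficient. The data sets are hypothetical and are not the points plotted in Fig. 7-1, and `pearson_r` is an illustrative helper rather than a formula taken from this book.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Points on a rising straight line and on a falling straight line.
xs = list(range(11))
r_rising = pearson_r(xs, [2 * x + 1 for x in xs])
r_falling = pearson_r(xs, [-3 * x + 4 for x in xs])

# Points on a smooth curve: y is an exact function of x, but not a linear one.
r_half_parabola = pearson_r(xs, [x ** 2 for x in xs])

# The same curve sampled symmetrically about x = 0.
xs_sym = list(range(-5, 6))
r_full_parabola = pearson_r(xs_sym, [x ** 2 for x in xs_sym])

print(r_rising, r_falling, r_half_parabola, r_full_parabola)
```

The two straight lines give r = +1 and r = –1 (up to rounding). The half parabola gives a value below +1 even though every point lies exactly on the curve, and the symmetric parabola gives r = 0, because the rising and falling halves cancel.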

Correlation and Outliers
In some scatter plots, the points are concentrated near smooth curves or lines, although it is rare for any scatter plot to contain points as orderly as those shown in Fig. 7-1A or B. Once in a while, you'll see a scatter plot in which almost all of the points lie near a straight line, but there are a few points that are far away from the main group. Stray points of this sort are known as outliers. These points are, in some ways, like the outliers found in statistical distributions.


One or two "extreme outliers" can greatly affect the correlation between two variables. Consider the example of Fig. 7-2. This is a scatter plot in which all but two of the points are in exactly the same positions as they are in Fig. 7-1A. But the two outliers are both far from the least-squares line. These points happen to be at equal distances (indicated by d) from the line, so their net effects on the position of the line cancel each other. Thus, the least-squares line in Fig. 7-2 is in the same position as the least-squares line in Fig. 7-1A. But the correlation values are very different. In Fig. 7-1A, r = +1. In the situation shown by Fig. 7-2, r is much smaller than +1.
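The effect of two balanced outliers can be sketched numerically. The data below are hypothetical, not the points of Fig. 7-2. As a simplifying assumption, both outliers are placed at the same horizontal position (the mean of x), one above the line and one below it by the same vertical distance d. That is the simplest arrangement in which the cancellation is exact for an ordinary least-squares fit. The helpers `pearson_r` and `least_squares_line` are illustrative names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Eleven points lying exactly on the line y = x.
clean_x = list(range(11))
clean_y = list(range(11))

# Add two outliers at x = 5, a vertical distance d above and below the line.
d = 4
outlier_x = clean_x + [5, 5]
outlier_y = clean_y + [5 + d, 5 - d]

line_clean = least_squares_line(clean_x, clean_y)
line_outliers = least_squares_line(outlier_x, outlier_y)
r_clean = pearson_r(clean_x, clean_y)
r_outliers = pearson_r(outlier_x, outlier_y)

print("line without outliers:", line_clean, "r =", r_clean)
print("line with outliers:   ", line_outliers, "r =", r_outliers)
```

In this sketch the slope and intercept are the same with and without the outliers, but r drops from +1 to roughly 0.88.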

Correlation and Definition of Variables
Here's another important rule concerning correlation. It doesn't matter which variable is defined as dependent and which variable is defined as independent. If the definitions of the variables are interchanged, and nothing about the actual scenario changes, the correlation remains exactly the same.
Think back to the previous chapter, where we analyzed the correlation between average monthly temperatures and average monthly rainfall amounts for two cities. When we generated the scatter plots, we plotted temperature on the horizontal axis, and considered temperature to be an independent variable. However, we could just as well have plotted the rainfall amounts on the horizontal axis, and defined them as the independent variables. The resulting scatter plots would have looked different, but upon mathematical analysis, the correlation figures would have come out the same.
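This symmetry is easy to confirm with a small sketch. It assumes r is the usual Pearson coefficient. The monthly figures are invented for illustration and are not the Happyton or Blissville data, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly averages: temperature (degrees Celsius) and rainfall (cm).
temps_c = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]
rain_cm = [8.1, 7.4, 6.9, 5.2, 4.8, 3.1, 2.2, 2.9, 4.0, 5.5, 7.0, 7.8]

# Temperature on the horizontal axis, rainfall on the vertical axis.
r_temp_independent = pearson_r(temps_c, rain_cm)

# Axes interchanged: rainfall horizontal, temperature vertical.
r_rain_independent = pearson_r(rain_cm, temps_c)

print(r_temp_independent, r_rain_independent)
```

Both calls return the same number, because the calculation treats the two variables in exactly the same way.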
Sometimes a particular variable lends itself intuitively to the role of the independent variable. (Time is an excellent example of this, although there are some exceptions.) In the cases of Happyton and Blissville from the previous chapter, it doesn't matter much which variable is considered independent and which is considered dependent. In fact, these very labels can be misleading, because they suggest causation. Does the temperature change, over the course of the year, actually influence the rainfall in Happyton or Blissville? If so, the effects are opposite between the two cities. Or is it the other way around – rainfall amounts influence the temperature? Again, if that is true, the effects are opposite between the two cities. There is something a little weird about either assumption. Perhaps another factor, or even a combination of multiple factors, influences both the temperature and the rainfall in both towns.
Units (Usually) Don't Matter
Here's an interesting property of correlation. The units we choose don't matter, as long as they express the same phenomenon or characteristic. If the measurement unit of either variable is changed in size but not in essence, the appearance of a bar graph or scatter plot changes. The plot is "stretched" or "squashed" vertically or horizontally. But the correlation figure, r, between the two variables is unaffected.
Think back again to the last chapter, and the scatter plots of precipitation versus temperature for Happyton and Blissville. The precipitation amounts are indicated in centimeters per month, and the temperatures are shown in degrees Celsius. Suppose the precipitation amounts were expressed in inches per month instead. The graphs would look a little different, but upon analysis by a computer, the correlation figures would turn out the same. Suppose the temperatures were expressed in degrees Fahrenheit. Again, the graphs would look different, but r would not be affected. Even if the average monthly rainfall were plotted in miles per month and the temperatures in kelvins (where 0 K represents absolute zero, the coldest possible temperature), the value of r would be the same.
We must be careful when applying this rule. The sizes of the units can be changed, but the quantities or phenomena they represent must remain the same. Therefore, if we were to plot the average rainfall in inches, centimeters, or miles per week rather than per month, we could no longer be sure the correlation would remain the same. The scatter plots would no longer show the same functions. The variable on the vertical scale – rainfall averaged over weekly periods rather than over monthly periods – would no longer represent the same thing. This is a subtle distinction, but it makes a critical difference.
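The unit rule can be illustrated with a brief sketch that converts the same hypothetical monthly data into other units and recomputes r. The numbers are invented, not the Happyton or Blissville figures. The helper `pearson_r` is an illustrative name, and r is assumed to be the usual Pearson coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly averages in the original units.
temps_c = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]
rain_cm = [8.1, 7.4, 6.9, 5.2, 4.8, 3.1, 2.2, 2.9, 4.0, 5.5, 7.0, 7.8]

# The same quantities expressed in different units.
temps_f = [c * 9 / 5 + 32 for c in temps_c]   # degrees Fahrenheit
temps_k = [c + 273.15 for c in temps_c]       # kelvins
rain_in = [cm / 2.54 for cm in rain_cm]       # inches

r_metric = pearson_r(temps_c, rain_cm)
r_imperial = pearson_r(temps_f, rain_in)
r_kelvin = pearson_r(temps_k, rain_cm)

print(r_metric, r_imperial, r_kelvin)
```

All three values agree to within rounding error. Each conversion only rescales or shifts an axis; it does not change what is being measured. Switching from monthly to weekly averages would require new data, not a conversion, so this sketch says nothing about that case.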
Correlation Principles Practice Problems
Practice 1
Suppose the distances of the outliers from the least-squares line in Fig. 7-2 are cut in half (to d/2 rather than d), as shown in Fig. 7-3. What effect will this have on the correlation?


Fig. 7-3. Illustration for Practice 1 and 2.
Solution 1
It will increase the correlation, because the average distance of the points from the least-squares line will be smaller.
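Solution 1 can be checked with a rough numerical sketch. The data are hypothetical: eleven points on the line y = x, plus one upper-left and one lower-right outlier whose vertical offset from that line is d in the first case and d/2 in the second. These are not the actual points of Fig. 7-2 or Fig. 7-3, and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def with_outliers(offset):
    """Points on y = x plus outliers at x = 2 (above) and x = 8 (below)."""
    xs = list(range(11)) + [2, 8]
    ys = list(range(11)) + [2 + offset, 8 - offset]
    return xs, ys


d = 4
r_full = pearson_r(*with_outliers(d))       # outliers at distance d
r_half = pearson_r(*with_outliers(d / 2))   # outliers at distance d/2

print("r with offset d:  ", r_full)
print("r with offset d/2:", r_half)
```

With these made-up points, r rises from about 0.87 to about 0.97 when the outliers move halfway back toward the line.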
Practice 2
Suppose that one of the outliers is removed in the scenario of Fig. 7-3. Will this affect the position of the least-squares line?

Solution 2
Yes. If the upper-left outlier is removed, the position of the least-squares line will be displaced slightly downward and to the right; if the lower-right outlier is removed, the least-squares line will be displaced slightly upward and to the left.
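The shift described in Solution 2 can be seen in a small sketch. The data are hypothetical and are not the points of Fig. 7-3: eleven points on the line y = x, one upper-left outlier at (2, 4), and one lower-right outlier at (8, 6). The helper `least_squares_line` is an illustrative name for an ordinary least-squares fit.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def height_at(line, x):
    """Height of the fitted line at horizontal position x."""
    slope, intercept = line
    return slope * x + intercept


base = [(x, x) for x in range(11)]   # points on y = x
upper_left = (2, 4)
lower_right = (8, 6)


def fit(points):
    xs, ys = zip(*points)
    return least_squares_line(xs, ys)


line_both = fit(base + [upper_left, lower_right])
line_no_upper_left = fit(base + [lower_right])
line_no_lower_right = fit(base + [upper_left])

# Compare the height of each line at the middle of the plot, x = 5.
print("both outliers:      ", height_at(line_both, 5))
print("upper-left removed: ", height_at(line_no_upper_left, 5))
print("lower-right removed:", height_at(line_no_lower_right, 5))
```

In this sketch the line sits lower in the middle of the plot once the upper-left outlier is removed and higher once the lower-right outlier is removed, which agrees with the direction of displacement given in Solution 2.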
Practice problems for these concepts can be found at:
Correlation, Causation, Order, and Chaos Practice Test