Introduction to Regression—Paired Data
Regression is a way of defining the extent to which two variables are related. Regression can be used in an attempt to predict things, but this can be tricky. The existence of a correlation between variables does not always mean that there is a cause-and-effect link between them.
Paired Data
Imagine two cities, one named Happyton and the other named Blissville. These cities are located far apart on the continent. The prevailing winds and ocean currents produce greatly different temperature and rainfall patterns throughout the year in these two cities. Suppose we are about to move from Happyton to Blissville, and we've been told that Happyton "has soggy summers and dry winters," while in Blissville we should be ready to accept that "the summers are parched and the winters are washouts." We've also been told that the temperature difference between summer and winter is much smaller in Blissville than in Happyton.
We go to the Internet and begin to collect data about the two towns. We find a collection of tables showing the average monthly temperature in degrees Celsius (°C) and the average monthly rainfall in centimeters (cm) for many places throughout the world. Happyton and Blissville are among the cities shown in the tables. Table 6-1A shows the average monthly temperature and rainfall for Happyton as gathered over the past 100 years. Table 6-1B shows the average monthly temperature and rainfall for Blissville over the same period. The data we have found is called paired data, because it portrays two variable quantities, temperature and rainfall, side-by-side.
We can get an idea of the summer and winter weather in both towns by scrutinizing the tables. But we can get a clearer visual picture by making use of bar graphs.

Paired Bar Graphs
Let's graphically compare the average monthly temperature and the average monthly rainfall for Happyton. Figure 6-8A is a paired bar graph showing the average monthly temperature and rainfall there.

The graph is based on the data from Table 6-1A. The horizontal axis has 12 intervals, each one showing a month of the year. Time is the independent variable. The left-hand vertical scale portrays the average monthly temperatures, and the right-hand vertical scale portrays the average monthly rainfall amounts. Both of these are dependent variables, and are functions of the time of year. The average monthly temperatures are shown by the light gray bars, and the average monthly rainfall amounts are shown by the dark gray bars. It's easy to see from this data that the temperature and rainfall both follow annual patterns. In general, the warmer months are wetter than the cooler months in Happyton.

Now let's make a similar comparison for Blissville. Figure 6-8B is a paired bar graph showing the average monthly temperature and rainfall there, based on the data from Table 6-1B.

From this data, we can see that the temperature difference between winter and summer is less pronounced in Blissville than in Happyton. But that's not the main thing that stands out in this bar graph! Note that the rainfall, as a function of the time of year, is much different. The winters in Blissville, especially the months of January and February, are wet. The summers, particularly June, July, and August, get almost no rainfall. The contrast in general climate between Happyton and Blissville is striking. This information is, of course, contained in the tabular data, but it's easier to see by looking at the dual bar graphs.

Scatter Plots
When we examine Fig. 6-8A, it appears there is a relationship between temperature and rainfall for the town of Happyton. In general, as the temperature increases, so does the amount of rain. There is also evidently a relationship between temperature and rainfall in Blissville, but it goes in the opposite sense: as the temperature increases, the rainfall decreases. How strong are these relationships? We can draw scatter plots to find out.

In Fig. 6-9A, the average monthly rainfall is plotted as a function of the average monthly temperature for Happyton. One point is plotted for each month, based on the data from Table 6-1A. In this graph, the independent variable is the temperature, not the time of the year. There is a pattern to the arrangement of points. The correlation between temperature and rainfall is positive for Happyton. It is fairly strong, but not extremely so. If there were no correlation (that is, if the correlation were equal to 0), the points would be randomly scattered all over the graph. But if the correlation were perfect (either +1 or –1), all the points would lie along a straight line.


Figure 6-9B shows a plot of the average monthly rainfall as a function of the average monthly temperature for Blissville. One point is plotted for each month, based on the data from Table 6-1B. As in Fig. 6-9A, temperature is the independent variable. There is a pattern to the arrangement of points here, too. In this case the correlation is negative instead of positive. It is a fairly strong correlation, perhaps a little stronger than the positive correlation for Happyton, because the points seem more nearly lined up. But the correlation is far from perfect.
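The strength and sign of a correlation like those in Figs. 6-9A and B can be put into a single number. The sketch below computes the Pearson correlation coefficient, one common measure that ranges from –1 to +1. The function name `pearson_r` and all of the monthly values are hypothetical, chosen only to mimic a "wet summer" town and a "dry summer" town; they are not the figures from Tables 6-1A and B.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical monthly averages, January through December (illustration only)
temps_c = [2, 4, 8, 13, 18, 23, 26, 25, 20, 14, 8, 3]
wet_summer_cm = [1.5, 2.0, 2.4, 4.0, 5.1, 7.2, 8.5, 7.0, 6.4, 3.5, 2.8, 1.2]
dry_summer_cm = [9.0, 8.1, 6.5, 4.8, 3.0, 0.8, 0.3, 0.5, 2.2, 4.1, 6.9, 8.4]

r_pos = pearson_r(temps_c, wet_summer_cm)  # rainfall rises with temperature
r_neg = pearson_r(temps_c, dry_summer_cm)  # rainfall falls as temperature rises

print(f"wet-summer town: r = {r_pos:+.2f}")
print(f"dry-summer town: r = {r_neg:+.2f}")
```

A value near +1 or –1 means the points lie close to a straight line; a value near 0 means they show little or no linear pattern.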


Regression Curves
The technique of curve fitting, which we learned about in Chapter 1, can be used to illustrate the relationships among points in scatter plots such as those in Figs. 6-9A and B.


Examples, based on "intuitive guessing," are shown in Figs. 6-10A and B. Fig. 6-10A shows the same 12 points as those in Fig. 6-9A, representing the average monthly temperature and rainfall amounts for Happyton (without the labels for the months, to avoid cluttering things up). The dashed curve represents an approximation of a smooth function relating the two variables. In Fig. 6-10B, a similar curve-fitting exercise is done to approximate a function relating the average monthly temperature and rainfall for Blissville.
In our hypothetical scenarios, the data shown in Tables 6-1A and B, Figs. 6-8A and B, Figs. 6-9A and B, and Figs. 6-10A and B are all based on records gathered over 100 years. Suppose that we had access to records gathered over the past 1000 years instead! Further imagine that, instead of having data averaged by the month, we had data averaged by the week. In these cases we would get gigantic tables, and the bar graphs would be utterly impossible to read. But the scatter plots would tell a much more interesting story. Instead of 12 points, each graph would have 52 points, one for each week of the year. It is reasonable to suppose that the points would be much more closely aligned along smooth curves than they are in Figs. 6-9A and B or Figs. 6-10A and B.


Least-Squares Lines
As you might guess, there is computer software designed to find ideal curves for scatter plots such as those shown in Figs. 6-9A and B. Most high-end scientific graphics packages contain curve-fitting programs. They also contain programs that can find the best overall straight-line fit for the points in any scatter plot where correlation exists. Finding the ideal straight line is easier than finding the ideal smooth curve, although the result is usually less precise.
Examine Figs. 6-11A and B. These graphs represent the outputs of hypothetical computer programs designed to find the best straight-line approximations of the data from Figs. 6-9A and B, respectively. Suppose that the dashed lines in these graphs represent the best overall straight-line averages of the positions of the points. In that case, both lines obey a rule called the law of least squares.


Here's how a least-squares line is found. Suppose the distance between the dashed line and each of the 12 points in Fig. 6-11A is measured. (By convention, this distance is taken vertically, parallel to the rainfall axis, rather than perpendicular to the line.) This gives us a set of 12 distance numbers; call them d₁ through d₁₂. They should all be expressed in the same units, such as millimeters. Square these distance numbers, getting d₁² through d₁₂². Then add these squared numbers together, getting a final sum D. There is one particular straight line for the scatter plot in Fig. 6-11A (or for any scatter plot in which there is a correlation among the points) for which the value of D is minimum. That line is the least-squares line. We can do exactly the same thing for the points and the dashed line in Fig. 6-11B.
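In symbols, the sum just described can be written as shown here. The symbols x_i, y_i, a, and b do not appear in the text: x_i and y_i stand for the temperature and rainfall of the i-th point, and the candidate line is written as y = a + bx. Taking each distance as a vertical offset is an assumption that follows the usual least-squares convention.

```latex
D = \sum_{i=1}^{12} d_i^{2},
\qquad
d_i = y_i - (a + b\,x_i)
```

The least-squares line is the choice of a and b that makes D as small as possible.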
Any computer program designed to find the least-squares line for a scatter plot executes the aforementioned calculations and solves an optimization problem, quickly calculating the equation of, and displaying, the line that best portrays the overall relationship among the points. Unless the points are randomly scattered or are arranged in some bizarre coincidental fashion (such as a perfect circle, uniformly spaced all the way around), there is always one, and only one, least-squares line for any given scatter plot.
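As a rough illustration of what such a program might do, here is a minimal sketch that uses the standard closed-form solution for a straight-line fit, with distances measured vertically. Real graphics packages may use other numerical methods. The function names and the monthly values are hypothetical and are not taken from Table 6-1A.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical offsets."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal, so no slope can be computed")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the mean point
    return slope, intercept


def squared_offset_sum(xs, ys, slope, intercept):
    """Return D, the sum of squared vertical offsets from the given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical monthly averages (illustration only)
temps_c = [2, 4, 8, 13, 18, 23, 26, 25, 20, 14, 8, 3]
rain_cm = [1.5, 2.0, 2.4, 4.0, 5.1, 7.2, 8.5, 7.0, 6.4, 3.5, 2.8, 1.2]

slope, intercept = least_squares_line(temps_c, rain_cm)
d_best = squared_offset_sum(temps_c, rain_cm, slope, intercept)

# Any other line, such as a slightly tilted or shifted one, gives a larger D
d_tilted = squared_offset_sum(temps_c, rain_cm, slope + 0.05, intercept)
d_shifted = squared_offset_sum(temps_c, rain_cm, slope, intercept + 0.5)

print(f"rainfall = {slope:.3f} * temperature + {intercept:.3f}")
print(f"D (best) = {d_best:.3f}, tilted = {d_tilted:.3f}, shifted = {d_shifted:.3f}")
```

Comparing `d_best` with the values for the tilted and shifted lines shows the defining property: nudging the line in any direction increases D.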
Regression Practice Problems
Practice 1
Suppose the points in a scatter plot all lie exactly along a straight line so that the correlation is either +1 (as strong, positively, as possible) or –1 (as strong, negatively, as possible). Where is the least-squares line in this type of scenario?
Solution 1
If all the points in a scatter plot happen to be arranged in a straight line, then that line is the least-squares line.
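A quick numerical check can make this concrete. The sketch below uses four hypothetical points that lie exactly on the line y = 2x + 1 and computes the sum of squared vertical offsets for that line and for two other lines. The function name and the points are invented for illustration.

```python
def squared_offset_sum(points, slope, intercept):
    """Return the sum of squared vertical offsets of the points from a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Hypothetical points lying exactly on y = 2x + 1
points = [(0, 1), (1, 3), (2, 5), (3, 7)]

d_on_line = squared_offset_sum(points, 2, 1)    # the line through the points
d_tilted = squared_offset_sum(points, 2.5, 1)   # same intercept, steeper slope
d_shifted = squared_offset_sum(points, 2, 0)    # same slope, lower intercept

print(d_on_line, d_tilted, d_shifted)
```

Because a sum of squares can never be negative, a line that yields a sum of zero cannot be beaten, so the line through the points is the least-squares line.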
Practice 2
Imagine that the points in a scatter plot are all over the graph, so that the correlation is 0. Where is the least-squares line in this case?
Solution 2
When there is no correlation between two variables and the scatter plot shows this by the appearance of randomly placed points, then there is no least-squares line.
Practice 3
Imagine the temperature versus rainfall data for the hypothetical towns of Happyton and Blissville, discussed above, has been obtained on a daily basis rather than on a monthly basis. Also suppose that, instead of having been gathered over the past 100 years, the data has been gathered over the past 1,000,000 years. We should expect this would result in scatter plots with points that lie neatly along smooth lines or curves. We might also be tempted to use this data to express the present-day climates of the two towns. Why should we resist that temptation?
Solution 3
The "million-year data" literally contains too much information to be useful in the present time. The earth's overall climate, as well as the climate in any particular location, has gone through wild cycles over the past 1,000,000 years. Any climatologist, astronomer, or earth scientist can tell you that. There have been ice ages and warm interglacial periods; there have been wet and dry periods. While the 1,000,000-year data might be legitimate as it stands, it does not necessarily represent conditions this year, or last year, or over the past 100 years.
Statisticians must be careful not to analyze too much information in a single experiment. Otherwise, the results can be skewed, or can produce a valid answer to the wrong question. The gathering of data over a needlessly large region or an unnecessarily long period of time is a tactic sometimes used by people whose intent is to introduce bias while making it look as if they have done an exceptionally good job of data collection. Beware!
Practice problems for these concepts can be found at:
Hypotheses, Prediction, and Regression Practice Test