Regression Help (page 3)

By — McGraw-Hill Professional
Updated on Aug 26, 2011

Least-Squares Lines

As you might guess, there is computer software designed to find ideal curves for scatter plots such as those shown in Figs. 6-9A and B. Most high-end scientific graphics packages contain curve-fitting programs. They also contain programs that can find the best overall straight-line fit for the points in any scatter plot where correlation exists. Finding the ideal straight line is easier than finding the ideal smooth curve, although the result is usually less precise.

Examine Figs. 6-11A and B. These graphs represent the outputs of hypothetical computer programs designed to find the best straight-line approximations of the data from Figs. 6-9A and B, respectively. Suppose that the dashed lines in these graphs represent the best overall straight-line averages of the positions of the points. In that case, both lines obey a rule called the law of least squares.



Here's how a least-squares line is found. Suppose the distance between the dashed line and each of the 12 points in Fig. 6-11A is measured. This gives us a set of 12 distance numbers; call them d_1 through d_12. They should all be expressed in the same units, such as millimeters. Square these distance numbers, getting d_1^2 through d_12^2. Then add these squared numbers together, getting a final sum D. There is one particular straight line for the scatter plot in Fig. 6-11A (or for any scatter plot in which there is a correlation among the points) for which the value of D is minimum. That line is the least-squares line. We can do exactly the same thing for the points and the dashed line in Fig. 6-11B.
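As a rough illustration of the sum D described above, here is a minimal Python sketch. It assumes the distances are measured vertically (parallel to the y axis), which is the usual convention in ordinary least squares; the text itself does not say which direction to measure in. The function name, the four points, and the two candidate lines are all made up for illustration and are not taken from Fig. 6-11A.

```python
def sum_of_squares(points, slope, intercept):
    """Return D: the sum of squared vertical distances from each point
    to the line y = slope * x + intercept (vertical distance is an assumption)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Hypothetical data, not the 12 points from the figure
points = [(1, 2), (2, 3), (3, 5), (4, 4)]

# Two candidate lines to compare
d_a = sum_of_squares(points, slope=1.0, intercept=1.0)
d_b = sum_of_squares(points, slope=0.8, intercept=1.5)

print("D for y = 1.0x + 1.0:", d_a)
print("D for y = 0.8x + 1.5:", d_b)
```

For this made-up data, the second candidate gives the smaller D (about 1.8 versus 2), so of the two it is the better fit in the least-squares sense.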

Any computer program designed to find the least-squares line for a scatter plot executes the aforementioned calculations and solves an optimization problem, quickly calculating the equation of the line that best portrays the overall relationship among the points and then displaying that line. Unless the points are randomly scattered or are arranged in some bizarre coincidental fashion (such as a perfect circle, uniformly spaced all the way around), there is always one, and only one, least-squares line for any given scatter plot.
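One way a program could carry out this minimization is sketched below. It uses the standard closed-form solution from ordinary least squares, again assuming that distances are measured vertically. The function name least_squares_line and the sample points are hypothetical, and a real graphics package may well use a different method.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared
    vertical distances to the points (ordinary least squares)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Spread of x about its mean, and joint spread of x and y
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    if sxx == 0:
        # All points share one x value, so no finite slope exists
        raise ValueError("all x values are equal; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for illustration only
sample = [(1, 2), (2, 3), (3, 5), (4, 4)]
slope, intercept = least_squares_line(sample)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The closed form avoids any trial-and-error search: the slope is the ratio of the two spread sums, and the intercept is chosen so that the line passes through the point (mean of x, mean of y).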

Regression Practice Problems

Practice 1

Suppose the points in a scatter plot all lie exactly along a straight line so that the correlation is either +1 (as strong, positively, as possible) or –1 (as strong, negatively, as possible). Where is the least-squares line in this type of scenario?

Solution 1

If all the points in a scatter plot happen to be arranged in a straight line, then that line is the least-squares line.
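This answer can be checked numerically. The sketch below places four hypothetical points exactly on the line y = 2x + 1, fits a line with the ordinary least-squares formulas (vertical distances assumed), and confirms that the fitted line is the original one and that the sum of squared distances D is zero. The helper name fit_line is an illustrative choice.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical points lying exactly on y = 2x + 1
collinear = [(0, 1), (1, 3), (2, 5), (3, 7)]

m, b = fit_line(collinear)

# D: sum of squared vertical distances from the points to the fitted line
D = sum((y - (m * x + b)) ** 2 for x, y in collinear)

print("slope:", m, "intercept:", b, "D:", D)
```

Because every point sits on the line, each distance is zero, so D cannot be made any smaller and no other line can do better.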

Practice 2

Imagine that the points in a scatter plot are all over the graph, so that the correlation is 0. Where is the least-squares line in this case?

Solution 2

When there is no correlation between two variables and the scatter plot shows this by the appearance of randomly placed points, then there is no least-squares line.

Practice 3

Imagine the temperature versus rainfall data for the hypothetical towns of Happyton and Blissville, discussed above, has been obtained on a daily basis rather than on a monthly basis. Also suppose that, instead of having been gathered over the past 100 years, the data has been gathered over the past 1,000,000 years. We should expect this would result in scatter plots with points that lie neatly along smooth lines or curves. We might also be tempted to use this data to express the present-day climates of the two towns. Why should we resist that temptation?

Solution 3

The "million-year data" literally contains too much information to be useful in the present time. The earth's overall climate, as well as the climate in any particular location, has gone through wild cycles over the past 1,000,000 years. Any climatologist, astronomer, or earth scientist can tell you that. There have been ice ages and warm interglacial periods; there have been wet and dry periods. While the 1,000,000-year data might be legitimate as it stands, it does not necessarily represent conditions this year, or last year, or over the past 100 years.

Statisticians must be careful not to analyze too much information in a single experiment. Otherwise, the results can be skewed, or can produce a valid answer to the wrong question. The gathering of data over a needlessly large region or an unnecessarily long period of time is a tactic sometimes used by people whose intent is to introduce bias while making it look as if they have done an exceptionally good job of data collection. Beware!

Practice problems for these concepts can be found at:

Hypotheses, Prediction, and Regression Practice Test
