Coefficient of Determination for AP Statistics

By — McGraw-Hill Professional
Updated on Feb 3, 2011

In the absence of a better way to predict y-values from x-values, our best guess for any given x might well be $\bar{y}$, the mean value of y.

Example: Suppose you had access to the heights and weights of each of the students in your statistics class. You compute the average weight of all the students. You write the heights of each student on a slip of paper, put the slips in a hat, and then draw out one slip. You are asked to predict the weight of the student whose height is on the slip of paper you have drawn. What is your best guess as to the weight of the student?

Solution: In the absence of any known relationship between height and weight, your best guess would have to be the average weight of all the students. You know the weights vary about the average and that is about the best you could do.

If we guessed the weight of each student using the average, we would be wrong most of the time. If we squared each of those errors and added the squares together, we would have what is called the sum of squares total (SST). It is the total squared error of our guesses when our best guess is simply the mean of the weights of all students, and it represents the total variability of y.
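
As a rough illustration of the SST calculation described above, the following sketch uses a small set of made-up weights (the numbers are assumptions for illustration, not data from the text). It guesses the mean for every student, squares each error, and adds the squares.

```python
# Hypothetical weights (in pounds) for a small class; not data from the text.
weights = [120, 135, 150, 165, 180]

# The "best guess" for every student is the mean weight.
mean_weight = sum(weights) / len(weights)

# Error of each guess when every guess is the mean weight.
errors = [w - mean_weight for w in weights]

# SST: square each error, then add the squares together.
sst = sum(e ** 2 for e in errors)

print("mean weight:", mean_weight)
print("SST:", sst)
```

Note that the raw errors here sum to zero, which is why the errors are squared before they are added.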

Now suppose we have a least-squares regression line (LSRL) that we want to use as a model for predicting weight from height. It is the LSRL we discussed in detail earlier in this chapter, and our hope is that there will be less error in prediction than by using $\bar{y}$. We still have errors from the regression line (called residuals, remember?). We call the sum of the squares of those errors the sum of squared errors (SSE). So SST represents the total squared error from using $\bar{y}$ as the basis for predicting weight, and SSE represents the total squared error from using the LSRL. SST − SSE represents the benefit of using the regression line rather than $\bar{y}$ for prediction. That is, by using the LSRL rather than $\bar{y}$, we have explained a certain proportion of the total variability by regression.
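
The comparison between SST and SSE can be sketched in a few lines of Python. The heights and weights below are invented for illustration, and `least_squares_line` is a hypothetical helper written for this sketch using the usual least-squares formulas for slope and intercept; it is not taken from the text.

```python
# Hypothetical heights (inches) and weights (pounds); not data from the text.
heights = [60, 62, 64, 66, 68]
weights = [120, 140, 145, 165, 180]


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(heights, weights)
y_bar = sum(weights) / len(weights)

# SST: total squared error when every prediction is the mean weight.
sst = sum((y - y_bar) ** 2 for y in weights)

# SSE: total squared error (sum of squared residuals) from the line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))

# Reduction in squared error gained by using the line instead of the mean.
explained = sst - sse
print(f"SST = {sst}, SSE = {sse}, SST - SSE = {explained}")
print(f"Proportion explained = {explained / sst:.4f}")
```

With these made-up numbers, SST is 2150 and SSE is 47.5, so the line removes most of the squared error that remains when every guess is the mean.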

The proportion of the total variability in y that is explained by the regression of y on x is called the coefficient of determination. The coefficient of determination is symbolized by $r^2$. Based on the above discussion, we note that

$$r^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}}.$$
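
For reference, the quantities in this discussion can be written out in symbols. The notation here is an assumption added for clarity: $y_i$ is an observed y-value, $\hat{y}_i$ is the value the LSRL predicts for the same x, and n is the number of observations.

```latex
\begin{align*}
\text{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\text{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
r^2 &= \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
\end{align*}
```

The second form of the last line shows that the coefficient of determination is close to 1 when SSE is small relative to SST, and close to 0 when the line does little better than the mean.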

It can be shown algebraically, although it isn't easy to do so, that this $r^2$ is actually the square of the familiar r, the correlation coefficient. Many computer programs report only the value of $r^2$ (usually as "R-sq"), which means that we must take the square root of $r^2$ if we want to know r. Remember that r and b, the slope of the regression line, are either both positive or both negative, so you can check the sign of b to determine the sign of r if all you are given is $r^2$. The TI-83/84 calculator will report both r and $r^2$, as well as the regression coefficients a and b, when you do LinReg(a+bx).
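
A numerical check is not a proof, but it can make the claim above more concrete. The sketch below, again using invented heights and weights, computes the coefficient of determination from SST and SSE, computes r separately from the standard correlation formula, and compares the two. The helper `r_from_r_squared` is a hypothetical name for the "take the square root, then borrow the sign of b" step.

```python
import math

# Hypothetical heights (inches) and weights (pounds); not data from the text.
heights = [60, 62, 64, 66, 68]
weights = [120, 140, 145, 165, 180]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
sxx = sum((x - x_bar) ** 2 for x in heights)
syy = sum((y - y_bar) ** 2 for y in weights)   # same quantity as SST
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

b = sxy / sxx                      # slope of the LSRL
a = y_bar - b * x_bar              # intercept of the LSRL
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient

# Coefficient of determination from the sums of squares.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
r_squared = (syy - sse) / syy


def r_from_r_squared(r_sq, slope):
    """Recover r from r^2, taking the sign from the slope b."""
    return math.copysign(math.sqrt(r_sq), slope)


# Both comparisons should hold up to floating-point rounding.
print(math.isclose(r ** 2, r_squared))
print(math.isclose(r_from_r_squared(r_squared, b), r))
```

If the slope were negative, `r_from_r_squared` would return the negative square root, matching the rule that r and b share a sign.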
