Practice problems for these concepts can be found at:
Simply knowing about the center of a distribution doesn't tell you all you might want to know about the distribution. One group of 20 people earning $20,000 each will have the same mean and median as a group of 20 where 10 people earn $10,000 and 10 people earn $30,000. These two sets of 20 numbers differ not in terms of their center but in terms of their spread, or variability. Just as there were measures of center based on the mean and the median, we also have measures of spread based on the mean and the median.
Variance and Standard Deviation
One measure of spread based on the mean is the variance. By definition, the variance is the average squared deviation from the mean. That is, it is a measure of spread because the more distant a value is from the mean, the larger will be the square of the difference between it and the mean.
Symbolically, the variance is defined by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Note that we average by dividing by n - 1 rather than n as you might expect. This is because there are only n - 1 independent datapoints, not n, if you know $\bar{x}$. That is, if you know n - 1 of the values and you also know $\bar{x}$, then the nth datapoint is determined.
One problem using the variance as a measure of spread is that the units for the variance won't match the units of the original data because each difference is squared. For example, if you find the variance of a set of measurements made in inches, the variance will be in square inches. To correct this, we often take the square root of the variance as our measure of spread.
The square root of the variance is known as the standard deviation. Symbolically,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
As discussed earlier, it is common to leave off the indices and write:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}.$$
In practice, you will rarely have to do this calculation by hand because it is one of the values returned when you use your calculator to do 1-Var Stats on a list (it's the Sx near the bottom of the first screen).
The definition of standard deviation has three useful qualities when it comes to describing the spread of a distribution:
- It is independent of the mean. Because it depends on how far datapoints are from the mean, it doesn't matter where the mean is.
- It is sensitive to the spread. The greater the spread, the larger will be the standard deviation. For two datasets with the same mean, the one with the larger standard deviation has more variability.
- It is independent of n. Because we are averaging squared distances from the mean, the standard deviation will not get larger just because we add more terms.
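The first two qualities above can be checked numerically. The following is a minimal Python sketch, not part of the original text; the dataset and the variable names (`shifted`, `stretched`) are illustrative assumptions. It shifts a dataset by a constant, which moves the mean but should leave the standard deviation unchanged, and then doubles every distance from the mean, which should double the standard deviation.

```python
import math
import statistics

# Illustrative dataset (an assumption chosen for this sketch).
data = [3, 4, 6, 6, 7, 10]

# Shifting every value by the same amount moves the mean but not the spread.
shifted = [x + 100 for x in data]

# Doubling each distance from the mean doubles the spread.
mean = statistics.mean(data)
stretched = [mean + 2 * (x - mean) for x in data]

# statistics.stdev is the sample standard deviation (it divides by n - 1).
s_data = statistics.stdev(data)
s_shifted = statistics.stdev(shifted)
s_stretched = statistics.stdev(stretched)

print(math.isclose(s_data, s_shifted))        # same spread after the shift
print(math.isclose(s_stretched, 2 * s_data))  # twice the spread after stretching
```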
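example: Find the standard deviation of the following 6 numbers: 3, 4, 6, 6, 7, 10.
solution: The mean is $\bar{x} = (3 + 4 + 6 + 6 + 7 + 10)/6 = 6$, so

$$s = \sqrt{\frac{(3-6)^2 + (4-6)^2 + (6-6)^2 + (6-6)^2 + (7-6)^2 + (10-6)^2}{6 - 1}} = \sqrt{\frac{30}{5}} = \sqrt{6} \approx 2.449.$$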
Because it depends upon distances from the mean, it should be clear that extreme values will have a major impact on the numerical value of the standard deviation. Note also that, in practice, you will never have to do the calculation above by hand—you will rely on your calculator.
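If you want to see each step of the definition for the six numbers in the example, here is a short Python sketch (Python is an assumption of this illustration; the text itself relies on a TI-83/84). It computes the mean, the squared deviations, the variance with the n - 1 divisor, and the standard deviation, then cross-checks against the standard library's `statistics.stdev`.

```python
import math
import statistics

data = [3, 4, 6, 6, 7, 10]
n = len(data)

mean = sum(data) / n                            # 36 / 6 = 6.0
squared_devs = [(x - mean) ** 2 for x in data]  # 9, 4, 0, 0, 1, 16
variance = sum(squared_devs) / (n - 1)          # 30 / 5 = 6.0
std_dev = math.sqrt(variance)                   # square root of 6

print(variance, round(std_dev, 3))              # 6.0 2.449

# Cross-check against the library's sample standard deviation.
print(math.isclose(std_dev, statistics.stdev(data)))
```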
Interquartile Range
Although the standard deviation works well in situations where the mean works well (reasonably symmetric distributions), we need a measure of spread that works well when a mean-based measure is not appropriate. That measure is called the interquartile range.
Remember that the median of a distribution divides the distribution in two: it is the middle of the distribution. The medians of the upper and lower halves of the distribution, not including the median itself in either half, are called quartiles. The median of the lower half is called the lower quartile, or the first quartile (which is the 25th percentile; Q1 on the calculator). The median of the upper half is called the upper quartile, or the third quartile (which is the 75th percentile; Q3 on the calculator). The median itself can be thought of as the second quartile or Q2 (although we usually don't).
The interquartile range (IQR) is the difference between Q3 and Q1. That is, IQR = Q3 – Q1. When you do 1-Var Stats, the calculator will return Q1 and Q3 along with several other statistics. You have to compute the IQR from Q1 and Q3. Note that the IQR is the width of the interval that contains the middle 50% of the data.
example: Find Q1, Q3, and the IQR for the following dataset: 5, 5, 6, 7, 8, 9, 11, 13, 17.
solution: Because the data are in order, and there is an odd number of values (9), the median is 8. The bottom half of the data comprises 5, 5, 6, 7. The median of the bottom half is the average of 5 and 6, or 5.5, which is Q1. Similarly, Q3 is the median of the top half (9, 11, 13, 17), which is the mean of 11 and 13, or 12. The IQR = 12 - 5.5 = 6.5.
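The quartile rule used in this solution (medians of the lower and upper halves, with the overall median left out of both) can be sketched in Python as follows. The function name `quartiles` is a hypothetical helper written for this illustration. Note that library routines may use other quartile conventions and can give slightly different answers.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3), taking Q1 and Q3 as the medians of the
    lower and upper halves and leaving the overall median out of both.
    Assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle (skips the median when n is odd)
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [5, 5, 6, 7, 8, 9, 11, 13, 17]
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)   # 5.5 12.0 6.5
```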
example: Find the standard deviation and IQR for the number of home runs hit by Babe Ruth in his major league career. The number of home runs was: 0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6.
solution: We put these numbers into a TI-83/84 list and do 1-Var Stats on that list. The calculator returns Sx = 20.21, Q1 = 11, and Q3 = 47. Hence the IQR = Q3 – Q1 = 47 – 11 = 36.
The range of the distribution is the difference between the maximum and minimum scores in the distribution. For the home run data, the range equals 60 - 0 = 60. Although this is sometimes used as a measure of spread, it is not very useful because we are usually interested in how the data spread out from the center of the distribution, not in just how far it is from the minimum to the maximum values.
Outliers
We have a pretty good intuitive sense of what an outlier is: it's a value far removed from the others. There is no rigorous mathematical formula for determining whether or not something is an outlier, but there are a few conventions that people seem to agree on. Not surprisingly, some of them are based on the mean and some are based on the median!
A commonly agreed-upon way to think of outliers based on the mean is to consider how many standard deviations away from the mean a term is. Some texts define an outlier as a datapoint that is more than two or three standard deviations from the mean.
In a mound-shaped, symmetric distribution, such a value has only about a 5% chance (for two standard deviations) or a 0.3% chance (for three standard deviations) of being as far removed from the center of the distribution as it is. Think of it as a value that is way out in one of the tails of the distribution.
Most texts now use a median-based measure and define outliers in terms of how far a datapoint is above or below the quartiles in a distribution. To find if a distribution has any outliers, do the following (this is known as the "1.5 (IQR) rule"):
- Find the IQR.
- Multiply the IQR by 1.5.
- Find Q1 – 1.5(IQR) and Q3 + 1.5(IQR).
- Any value below Q1 – 1.5(IQR) or above Q3 + 1.5(IQR) is an outlier.
Some texts call an outlier defined as above a mild outlier. An extreme outlier would then be one that lies more than 3 IQRs beyond Q1 or Q3.
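The 1.5(IQR) rule and the mild/extreme distinction can be sketched in Python as follows. The helper names `quartiles` and `classify_outliers` and the two small datasets are hypothetical, made up for this illustration. Quartiles are taken as medians of the lower and upper halves, excluding the overall median.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves,
    excluding the overall median. Assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

def classify_outliers(values):
    """Split outliers into (mild, extreme) using the 1.5(IQR) and 3(IQR) cutoffs."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance beyond the nearer quartile; negative or zero if x is inside [Q1, Q3].
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

# Made-up data: Q1 = 6, Q3 = 13, IQR = 7, so the cutoffs are 23.5 and 34 on the high side.
print(classify_outliers([5, 5, 6, 7, 8, 9, 11, 13, 17, 30]))  # ([30], [])
print(classify_outliers([5, 5, 6, 7, 8, 9, 11, 13, 17, 40]))  # ([], [40])
```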
example: The following data represent the amount of money, in British pounds, spent weekly on tobacco for 11 regions in Britain: 4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56. Do any of the regions seem to be spending a lot more or less than the other regions? That is, are there any outliers in the data?
solution: Using a calculator, we find $\bar{x}$ = 3.62, Sx = s = 0.59, Q1 = 3.2, Q3 = 4.03.
- Using means: 3.62 ± 2(0.59) = (2.44, 4.8). There are no values in the dataset less than 2.44 or greater than 4.8, so there are no outliers by this method. We don't need to check ± 3s since there were no outliers using ± 2s.
- Using the 1.5(IQR) rule: Q1 - 1.5(IQR) = 3.2 - 1.5(4.03 - 3.2) = 1.96, Q3 + 1.5(IQR) = 4.03 + 1.5(4.03 - 3.2) = 5.28. Because there are no values in the data less than 1.96 or greater than 5.28, there are no outliers by this method either.
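Both checks on the tobacco data can be reproduced with a short Python sketch. The helper `quartiles` and the variable names are assumptions of this illustration, and the code works with unrounded values, so its cutoffs may differ from the hand-rounded ones in the last decimal place.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves,
    excluding the overall median. Assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

spending = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]

# Mean-based check: flag values more than two standard deviations from the mean.
mean = statistics.mean(spending)
s = statistics.stdev(spending)   # sample standard deviation (n - 1 divisor)
sd_outliers = [x for x in spending if abs(x - mean) > 2 * s]

# Median-based check: the 1.5(IQR) rule.
q1, q3 = quartiles(spending)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in spending if x < low_fence or x > high_fence]

print(round(mean, 2), round(s, 2), q1, q3)
print(sd_outliers, iqr_outliers)   # both lists are empty: no outliers by either method
```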
Outliers are important because they will often tell us that something unusual or unexpected is going on with the data that we need to know about. A manufacturing process that produces products so far out of spec that they are outliers often indicates that something is wrong with the process. Sometimes outliers are just a natural, but rare, variation. Often, however, an outlier can indicate that the process generating the data is out of control in some fashion.