Practice problems for these concepts can be found at:

We have been discussing characteristics of distributions (shape, center, spread) and of the individual terms (percentiles, *z*-scores) that make up those distributions. Certain distributions are of particular interest in statistics, especially those known to be symmetric and mound shaped. The following histogram represents the heights of 100 males whose average height is 70'' and whose standard deviation is 3''.

This is clearly approximately symmetric and mound shaped. We are going to model this with a curve that idealizes what we see in this sample of 100. That is, we will model this with a continuous curve that "describes" the shape of the distribution for very large samples. That curve is the graph of the **normal distribution**. A *normal curve*, when superimposed on the above histogram, looks like this:

The function that yields the *normal curve* is defined *completely* in terms of its mean and standard deviation. Although you are not required to know it, you might be interested to know that the function that defines the normal curve is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}.$$

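One consequence of this definition is that the total area under the curve, and above the *x*-axis, is 1 (for you calculus students, this is because $\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\,dx = 1$).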

This fact will be of great use to us later when we consider areas under the normal curve as probabilities.
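The claim that the total area is 1 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `normal_pdf` and `area_under_curve` are illustrative, and the mean 70 and standard deviation 3 are borrowed from the heights example. It approximates the area with a midpoint Riemann sum over eight standard deviations on each side of the mean, which is an assumption that the tails beyond that range are negligible.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, width=8.0, steps=20_000):
    """Midpoint Riemann sum of the density over [mu - width*sigma, mu + width*sigma]."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# Heights example: mean 70 inches, standard deviation 3 inches.
total_area = area_under_curve(70, 3)
print(f"{total_area:.6f}")  # should be very close to 1
```

Changing the mean or standard deviation moves or stretches the curve but should leave the total area essentially unchanged.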

### Empirical Rule

The **empirical rule**, or the **68-95-99.7 rule**, states that *approximately* 68% of the terms in a normal distribution are within one standard deviation of the mean, 95% are within two standard deviations of the mean, and 99.7% are within three standard deviations of the mean. The following three graphs illustrate the empirical rule.
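The percentages in the empirical rule can also be checked numerically. This sketch is an illustration rather than part of the original text: it uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later), and the helper name `proportion_within` is hypothetical. The mean 70 and standard deviation 3 come from the heights example, but any normal distribution should give the same proportions.

```python
from statistics import NormalDist

# Normal model for the heights example (mean 70, standard deviation 3).
heights = NormalDist(mu=70, sigma=3)

def proportion_within(dist, k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

for k in (1, 2, 3):
    print(k, round(proportion_within(heights, k), 4))
```

The three values round to about 0.68, 0.95, and 0.997, which is where the rule gets its name.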

### Standard Normal Distribution

Because we are dealing with a theoretical distribution, we will use *μ* and *σ*, rather than $\bar{x}$ and *s*, when referring to the normal curve. If *X* is a variable that has a normal distribution with mean *μ* and standard deviation *σ* (we say "*X* has *N*(*μ*,*σ*)"), there is a related distribution we obtain by **standardizing** the data in the distribution to produce the **standard normal distribution**. To do this, we convert the data to a set of *z*-scores, using the formula

$$z = \frac{x - \mu}{\sigma}.$$

The algebraic effect of this, as we saw earlier, is to produce a distribution of *z*-scores with mean 0 and standard deviation 1. Computing *z*-scores is just a linear transformation of the original data, which means that the transformed data will have the same shape as the original distribution. In this case, then, the distribution of *z*-scores is normal. We say *z* has *N*(0,1). This simplifies the defining density function to

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^{2}}{2}}.$$
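To see the effect of standardizing on an actual data set, here is a small sketch. The eight height values are made up purely for illustration and do not come from the text. The sketch treats the list as a complete population and uses the population standard deviation (`pstdev`).

```python
from statistics import mean, pstdev

# Hypothetical heights in inches, invented for this illustration only.
heights = [66, 68, 69, 70, 70, 71, 72, 74]

mu = mean(heights)       # mean of the data
sigma = pstdev(heights)  # population standard deviation of the data

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in heights]

print(mean(z_scores), pstdev(z_scores))  # approximately 0 and 1
```

Because the transformation is linear, the *z*-scores keep the same ordering and relative spacing as the original heights. Only the center and scale change.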

For the standardized normal curve, the *empirical rule* says that approximately 68% of the terms lie between *z* = -1 and *z* = 1, 95% between *z* = -2 and *z* = 2, and 99.7% between *z* = -3 and *z* = 3. (Trivia for calculus students: the points one standard deviation from the mean are *points of inflection* of the curve.)

Because many naturally occurring distributions are approximately normal (heights, SAT scores, for example), we are often interested in knowing what proportion of terms lie in a given interval under the normal curve. Problems of this sort can be solved either by use of a calculator or a table of Standard Normal Probabilities (Table A in this book). In a typical table, the marginal entries are z-scores, and the table entries are the areas under the curve to the left of a given z-score. All statistics texts have such tables.


**example:** What proportion of the area under a normal curve lies to the left of *z* = -1.37?

**solution:** There are two ways to do this problem, and you should be able to do it either way.

- The first way is to use the table of Standard Normal Probabilities. To read the table, move down the left column (titled "*z*") until you come to the row whose entry is -1.3. The third digit, the 0.07 part, is found by reading across the top row until you come to the column whose entry is 0.07. The entry at the intersection of the row containing -1.3 and the column containing 0.07 is the area under the curve to the left of *z* = -1.37. That value is 0.0853.
- The second way is to use your calculator. It is the more accurate and more efficient way. In the DISTR menu, the second entry is normalcdf (see the next Calculator Tip for a full explanation of the normalpdf and normalcdf functions). The calculator syntax for a standard normal distribution is normalcdf(lower bound, upper bound). In this example, the lower bound can be any large negative number, say -100. normalcdf(-100, -1.37) = 0.0853435081.
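For readers working on a computer rather than a calculator, the same left-tail area can be sketched in Python. This is an illustration and not part of the original text. It assumes `NormalDist` from the standard `statistics` module (Python 3.8 or later), whose `cdf` method returns the area to the left of a value, so no artificial lower bound such as -100 is needed.

```python
from statistics import NormalDist

Z = NormalDist()            # standard normal: mean 0, standard deviation 1
left_area = Z.cdf(-1.37)    # area under the curve to the left of z = -1.37

print(round(left_area, 4))  # 0.0853, matching the table and the calculator
```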


**example:** What proportion of the area under a normal curve lies between *z* = -1.2 and *z* = 0.58?

**solution:** (i) Reading from Table A, the area to the left of *z* = -1.2 is 0.1151, and the area to the left of *z* = 0.58 is 0.7190. The geometry of the situation tells us that the area between the two values is 0.7190 - 0.1151 = 0.6039.

(ii) Using the calculator, we have normalcdf(-1.2, 0.58) = 0.603973005. Round to 0.6040 (the difference from the answer in part (i) is caused by rounding).

**example:** In an earlier example, we saw that heights of men are approximately normally distributed with a mean of 70 and a standard deviation of 3. What proportion of men are more than 6' (72'') tall? Be sure to include a sketch of the situation.

**solution:**

(i) Another way to state this is to ask what proportion of terms in a normal distribution with mean 70 and standard deviation 3 are greater than 72. In order to use the table of Standard Normal Probabilities, we must first convert to *z*-scores. The *z*-score corresponding to a height of 72'' is

$$z = \frac{72 - 70}{3} = 0.67.$$

The area to the left of *z* = 0.67 is 0.7486. However, we want the area to the *right* of 0.67, and that is 1 - 0.7486 = 0.2514.

(ii) Using the calculator, we have normalcdf(0.67, 100) = 0.2514. We could also get the answer from the raw data: normalcdf(72, 1000, 70, 3) = 0.2525. The difference arises because *z* was rounded to 0.67 before the first calculation. (As explained in the last Calculator Tip, simply add the mean and standard deviation of a nonstandard normal curve to the list of parameters for normalcdf.)
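The interval and right-tail areas from the last two examples can be reproduced with a short sketch. This is an illustration under the same assumption as before (Python's standard `statistics.NormalDist` standing in for the calculator's normalcdf), and the variable names are hypothetical. It also shows the small discrepancy between using the rounded *z*-score 0.67 and using the raw height of 72.

```python
from statistics import NormalDist

Z = NormalDist()                     # standard normal
heights = NormalDist(mu=70, sigma=3) # model for men's heights, in inches

# Area between z = -1.2 and z = 0.58: subtract the two left-tail areas.
between = Z.cdf(0.58) - Z.cdf(-1.2)

# Proportion of men taller than 72 inches, computed two ways.
right_rounded_z = 1 - Z.cdf(0.67)  # using the z-score rounded to 0.67
right_raw = 1 - heights.cdf(72)    # using the raw height, no rounding of z

print(round(between, 4), round(right_rounded_z, 4), round(right_raw, 4))
```

The last two values differ in the third decimal place only because 2/3 was rounded to 0.67 before the table or calculator was consulted.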


**example:** For the population of men in the previous example, how tall must a man be to be in the top 10% of all men in terms of height?

**solution:** This type of problem has a standard approach. The idea is to express $z_x$ in two different ways (which are, of course, equal since they are different ways of writing the *z*-score for the same point): (i) as a numerical value obtained from Table A or from your calculator and (ii) in terms of the definition of a *z*-score.


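- (i) We are looking for the value of *x* in the drawing. Look at Table A to find the table entry nearest to 0.90 (because we know an area, we need to read the table from the inside out to the margins). It is 0.8997 and corresponds to a *z*-score of 1.28. Setting this equal to the definition of the *z*-score gives $z_x = 1.28 = \frac{x - 70}{3}$. Solving for *x*, we get *x* = 70 + 1.28(3) = 73.84.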

A man would have to be at least 73.84'' tall to be in the top 10% of all men.

- (ii) Using the calculator, the *z*-score corresponding to an area of 90% to the left of *x* is given by invNorm(0.90) ≈ 1.28. Otherwise, the solution is the same as is given in part (i). See the following Calculator Tip for a full explanation of the invNorm function.
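The "inside out" lookup can also be sketched in Python. This is an illustration, not part of the original text: it assumes the `inv_cdf` method of the standard `statistics.NormalDist` class plays the role of the calculator's invNorm, and the variable names are hypothetical.

```python
from statistics import NormalDist

# z-score with 90% of the area to its left (the analogue of invNorm(0.90)).
z_90 = NormalDist().inv_cdf(0.90)

# Height cutoff using the table value z = 1.28, as in the worked solution.
x_table = 70 + 1.28 * 3

# Height cutoff computed directly on the heights distribution, no rounding of z.
x_exact = NormalDist(mu=70, sigma=3).inv_cdf(0.90)

print(round(z_90, 4), round(x_table, 2), round(x_exact, 2))
```

The direct computation differs from 73.84 only in later decimal places, because the table value 1.28 is itself a rounded *z*-score.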

### Chebyshev's Rule

The empirical rule works fine as long as the distribution is approximately normal. But what do you do if the shape of the distribution is unknown or distinctly nonnormal (as, say, skewed strongly to the right)? Remember that the empirical rule told you that, in a normal distribution, approximately 68% of the data are within one standard deviation of the mean, approximately 95% are within two standard deviations, and approximately 99.7% are within three standard deviations. Chebyshev's rule isn't as strong as the empirical rule, but it does provide information about the percent of terms contained in an interval about the mean for any distribution.

Let *k* be a number of standard deviations. Then, according to Chebyshev's rule, for *k* > 1, *at least* $\left(1 - \frac{1}{k^{2}}\right)100\%$ of the data lie within *k* standard deviations of the mean. For example, if *k* = 2.5, then Chebyshev's rule says that at least $\left(1 - \frac{1}{2.5^{2}}\right)100\% = 84\%$ of the data lie within 2.5 standard deviations of the mean. If *k* = 3, note the difference between the empirical rule and Chebyshev's rule. The empirical rule says that *approximately* 99.7% of the data are within three standard deviations of the mean. Chebyshev's rule says that *at least* $\left(1 - \frac{1}{3^{2}}\right)100\% \approx 89\%$ of the data are within three standard deviations of the mean. This also illustrates what was said in the previous paragraph about the empirical rule being stronger than Chebyshev's. Note that, if *at least* $\left(1 - \frac{1}{k^{2}}\right)100\%$ of the data are *within k* standard deviations of the mean, it follows (algebraically) that *at most* $\left(\frac{1}{k^{2}}\right)100\%$ lie *more than k* standard deviations from the mean.
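A short sketch makes the comparison between the two rules concrete. It is an illustration only: the function names `chebyshev_at_least` and `normal_within` are hypothetical, and the normal percentages are computed with Python's standard `statistics.NormalDist` rather than read from a table.

```python
from statistics import NormalDist

def chebyshev_at_least(k):
    """Chebyshev lower bound: minimum percent of data within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is only informative for k > 1")
    return (1 - 1 / k**2) * 100

def normal_within(k):
    """Percent of a normal distribution within k standard deviations of the mean."""
    Z = NormalDist()
    return (Z.cdf(k) - Z.cdf(-k)) * 100

for k in (2, 2.5, 3):
    print(k, round(chebyshev_at_least(k), 1), round(normal_within(k), 1))
```

For every *k* the Chebyshev figure is smaller than the normal figure, as expected: Chebyshev's rule is a guarantee that holds for any distribution, while the normal percentages apply only when the distribution is approximately normal.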

Knowledge of Chebyshev's rule is not required on the AP Exam, but its use is certainly acceptable and is common enough that it will be recognized by AP readers.
