**A Pattern Emerges**

When working with large populations, and especially with continuous random variables, probabilities are defined differently than they are with small populations and discrete random variables. As the number of possible values of a random variable becomes larger and "approaches infinity," it's easier to think of the probability of an outcome within a range of values, rather than the probability of an outcome for a single value.

Imagine that some medical researchers want to find out how people's blood pressure levels compare. At first, a few dozen people are selected at random from the human population, and the numbers of people having each of 10 specific systolic pressure readings are plotted (Fig. 3-7A). The systolic pressure, which is the higher of the two numbers you get when you take your blood pressure, is the random variable. (In this example, exact numbers aren't shown either for blood pressure or for the number of people. This helps us keep in mind that this entire scenario is make-believe.)

There seems to be a pattern in Fig. 3-7A. This does not come as a surprise to our group of medical research scientists. They expect most people to have "middling" blood pressure, fewer people to have moderately low or high pressure, and only a small number of people to have extremely low or high blood pressure.

In the next phase of the experiment, hundreds of people are tested. Instead of only 10 different pressure levels, 20 discrete readings are specified for the random variable (Fig. 3-7B). A pattern is obvious. Confident that they're onto something significant, the researchers test thousands of people and plot the results at 40 different blood pressure levels. The resulting plot of frequency (number of people) versus the value of the random variable (blood pressure) shows that there is a highly defined pattern. The arrangement of points in Fig. 3-7C is so orderly that the researchers are confident that repeating the experiment with the same number of subjects (but not the same people) will produce essentially the same pattern.
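If you'd like to watch the pattern sharpen for yourself, here is a minimal Python sketch of the three phases. Everything numerical in it is an assumption made for illustration: the bell-shaped (normal) population, the hypothetical center of 120 and spread of 15, the 60-to-180 plotting range, and the helper name `relative_frequencies`. Equal-width bins stand in for the discrete readings of Fig. 3-7, and the sample sizes are only rough stand-ins for "a few dozen," "hundreds," and "thousands."

```python
import random

def relative_frequencies(readings, low, high, bins):
    """Sort readings into equal-width bins on [low, high) and return
    the fraction of the in-range readings that lands in each bin."""
    width = (high - low) / bins
    counts = [0] * bins
    for r in readings:
        if low <= r < high:
            # min() guards against a floating-point result equal to `bins`
            index = min(int((r - low) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]

rng = random.Random(0)        # fixed seed so each run repeats itself
MEAN, SPREAD = 120.0, 15.0    # hypothetical values, not recorded facts

# Three phases: more people, and more pressure levels, each time.
for people, levels in [(50, 10), (500, 20), (5000, 40)]:
    readings = [rng.gauss(MEAN, SPREAD) for _ in range(people)]
    freqs = relative_frequencies(readings, 60.0, 180.0, levels)
    print(f"{people} people, {levels} levels")
    for f in freqs:
        print("#" * round(f * 200))   # crude sideways bar chart
    print()
```

With only 50 people the bars tend to look ragged. With 5000 people and 40 levels, the outline should look much smoother, and changing the seed should alter the details but not the overall shape.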

**Expressing the Pattern**

Based on the data in Fig. 3-7C, the researchers can use curve fitting to derive a general rule for the way blood pressure is distributed. This is shown in Fig. 3-7D. A smooth curve like this is called a *probability density function*, or simply a *density function*. It no longer represents the blood pressure levels of individuals; it is only an expression of how blood pressure varies among the human population. On the vertical axis, the function value, *f*(*x*), is portrayed instead of the number of people. Does this remind you of a Cheshire cat that gradually dissolves away until only its smile remains?

As the number of possible values of the random variable increases without limit, the point-by-point plot blurs into a density function, which we call *f*(*x*). The blood pressure of any particular subject vanishes into insignificance. Instead, the researchers become concerned with the probability that any randomly chosen subject's blood pressure will fall within a given range of values.

**Area Under the Curve**

Figure 3-8 is an expanded view of the curve derived by "refining the points" of Fig. 3-7 to their limit. This density function, like all density functions, has a special property: if you calculate or measure the total area under the curve, it is always equal to 1. This rule holds true for the same reason that the relative frequency values of the outcomes for a discrete variable always add up to 1 (or 100%), as we learned in the last chapter.
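You can check the total-area property numerically. The sketch below assumes, purely for illustration, that the density is a bell-shaped (normal) curve with a hypothetical center of 120 and spread of 15; the text itself gives no formula for *f*(*x*). The helper names `density` and `area_under` are made up for this example, and the area is estimated with the trapezoidal rule.

```python
import math

def density(x, mean=120.0, spread=15.0):
    """Bell-shaped (normal) density. The shape and both parameters
    are assumptions chosen for illustration."""
    z = (x - mean) / spread
    return math.exp(-0.5 * z * z) / (spread * math.sqrt(2.0 * math.pi))

def area_under(f, left, right, steps=10_000):
    """Estimate the area under f between left and right
    using the trapezoidal rule with equal-width slices."""
    h = (right - left) / steps
    total = 0.5 * (f(left) + f(right))
    for i in range(1, steps):
        total += f(left + i * h)
    return total * h

# The range 0 to 240 covers practically the whole assumed curve.
total_area = area_under(density, 0.0, 240.0)
print(f"total area is about {total_area:.6f}")
```

The estimate should come out extremely close to 1. Any tiny shortfall comes from the thin tails of the curve that lie outside the range being measured.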

Consider two hypothetical systolic blood pressure values: say *a* and *b* as shown in Fig. 3-8. (Again, we refrain from giving specific numbers because this example is meant to be for illustration, not to represent any recorded fact.) The probability that a randomly chosen person will have a systolic blood pressure reading *k* that is between *a* and *b* can be written in any of four ways:

*P*(*a* < *k* < *b*)

*P*(*a* ≤ *k* < *b*)

*P*(*a* < *k* ≤ *b*)

*P*(*a* ≤ *k* ≤ *b*)

The first of these expressions includes neither *a* nor *b*, the second includes *a* but not *b*, the third includes *b* but not *a*, and the fourth includes both *a* and *b*. All four expressions are identical in the sense that they are all represented by the shaded portion of the area under the curve; including or excluding a single endpoint adds or removes a region of zero width, and therefore zero area. Because of this, an expression with less-than signs only is generally used when expressing continuous probability.

If the vertical lines *x* = *a* and *x* = *b* are moved around, the area of the shaded region gets larger or smaller. This area can never be less than 0 (when the two lines coincide) or greater than 1 (when the two lines are so far apart that they allow for the entire area under the curve).
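Here is a short sketch of how the shaded area behaves as the two lines move. It again assumes a normal curve with a hypothetical center of 120 and spread of 15, none of which comes from the text, and it uses `NormalDist` from Python's standard `statistics` module. The area between *a* and *b* is found as the difference of two cumulative values; `probability_between` is a made-up helper name.

```python
from statistics import NormalDist

# Hypothetical population: the center and spread are assumptions.
pressure = NormalDist(mu=120.0, sigma=15.0)

def probability_between(a, b, dist=pressure):
    """Area under the density between x = a and x = b, for a <= b.
    dist.cdf(x) is the area under the curve to the left of x."""
    return dist.cdf(b) - dist.cdf(a)

# A middling range of values: some fraction between 0 and 1.
print(probability_between(110.0, 130.0))

# The two lines coincide: the shaded region has no width, so no area.
print(probability_between(120.0, 120.0))

# The two lines are very far apart: nearly the whole area under the curve.
print(probability_between(0.0, 240.0))
```

Because a single point contributes no area, the same function serves for all four ways of writing the probability, whether or not the endpoints *a* and *b* are included.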


From Statistics Demystified: A Self-Teaching Guide. Copyright © 2004 by The McGraw-Hill Companies. All Rights Reserved.