**Percentiles in a Normal Distribution**

Percentiles divide a large data set into 100 intervals, each interval containing 1% of the elements in the set. There are 99 possible percentiles, not 100, because the percentiles represent the boundaries where the 100 intervals meet.

Imagine an experiment in which systolic blood pressure readings are taken for a large number of people. The systolic reading is the higher of the two numbers the instrument displays. So if your reading is 110/70, read ''110 over 70,'' the systolic pressure is 110. Suppose the results of this experiment are given to us in graphical form, and the curve looks like a continuous distribution because the population is huge. Suppose it happens to be a normal distribution: bell-shaped and symmetrical (Fig. 4-1).

Let's choose some pressure value on the horizontal axis, and extend a line *L* straight up from it. The percentile corresponding to this pressure is determined by finding the number *n* such that *n*% of the area under the curve falls to the left of line *L*. Then, *n* is rounded to the nearest whole number between, and including, 1 and 99 to get the percentile *p*. For example, suppose the region to the left of the line *L* represents 93.3% of the area under the curve. Then *n* = 93.3. Rounding to the nearest whole number between 1 and 99 gives *p* = 93. This means the blood pressure corresponding to the point where line *L* intersects the horizontal axis is in the 93rd percentile.
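As a rough sketch, this procedure can be written in Python with the standard library's `statistics.NormalDist`. The mean of 120 and standard deviation of 15 are assumptions chosen only so that the numbers reproduce the 93.3% example; they are not read from Fig. 4-1. The function name `percentile_of` is hypothetical, and resolving rounding ties upward is also an assumption, since the text does not say how ties are handled.

```python
import math
from statistics import NormalDist

# Hypothetical parameters, for illustration only (not taken from Fig. 4-1).
pressure_dist = NormalDist(mu=120, sigma=15)

def percentile_of(value, dist):
    """Return (n, p): the percentage of area left of value, and the percentile."""
    n = 100 * dist.cdf(value)      # area to the left of line L, in percent
    p = math.floor(n + 0.5)        # nearest whole number (ties up: assumption)
    p = min(99, max(1, p))         # keep p between 1 and 99 inclusive
    return n, p

n, p = percentile_of(142.5, pressure_dist)
print(f"n = {n:.1f}, p = {p}")     # n = 93.3, p = 93
```

With these assumed parameters, a reading of 142.5 sits 1.5 standard deviations above the mean, which is where about 93.3% of the area lies to the left.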

The location of any particular percentile point (boundary), say the *p*th, is found by locating the vertical line such that the percentage *n* of the area beneath the curve to its left is exactly equal to *p*, and then noting the point where this line crosses the horizontal axis. In Fig. 4-1, imagine that line *L* can be moved back and forth like a sliding door. When the number *n*, representing the percentage of the area beneath the curve to the left of *L*, is exactly equal to 93, the line crosses the horizontal axis at the 93rd percentile boundary point. Although it's tempting to think that there could be a ''0th percentile'' (*n* = 0) and a ''100th percentile'' (*n* = 100), neither of these ''percentiles'' represents a boundary where two intervals meet.
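Sliding the line until the area equals *p*% amounts to inverting the cumulative distribution. A minimal sketch, again assuming a hypothetical mean of 120 and standard deviation of 15 (the function name `percentile_boundary` is also an illustration, not part of the text):

```python
from statistics import NormalDist

# Hypothetical parameters, for illustration only (not taken from Fig. 4-1).
pressure_dist = NormalDist(mu=120, sigma=15)

def percentile_boundary(p, dist):
    """Return the value at which exactly p% of the area lies to the left."""
    if not 1 <= p <= 99:
        # There is no 0th or 100th percentile boundary.
        raise ValueError("p must be a whole number from 1 to 99")
    return dist.inv_cdf(p / 100)

boundary_93 = percentile_boundary(93, pressure_dist)
print(round(boundary_93, 1))
```

The range check mirrors the point made above: only 99 boundaries exist, so values of 0 and 100 are rejected.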

Note the difference between saying that a certain pressure ''is in'' the *p*th percentile, versus saying that a certain pressure ''is at'' the *p*th percentile. In the first case we're describing a data interval; in the second case we're talking about a boundary point between two intervals.

**Percentiles in Tabular Data**

Imagine that 1000 students take a 40-question test. There are 41 possible scores: 0 through 40. Suppose that every score is accounted for. There are some people who write perfect papers, and there are a few unfortunates who don't get any of the answers right. Table 4-1 shows the test results, with the scores in ascending order from 0 to 40 in the first column. For each possible score, the number of students getting that score (the absolute frequency) is shown in the second column. The third column shows the cumulative absolute frequency, expressed from lowest to highest scores.
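The relationship between the second and third columns can be sketched in a few lines of Python. The numbers below are a made-up 5-question quiz, not the data of Table 4-1, and the variable names are hypothetical.

```python
from itertools import accumulate

# Hypothetical results for a 5-question quiz (NOT the data in Table 4-1).
scores = [0, 1, 2, 3, 4, 5]
absolute_freq = [2, 5, 9, 14, 7, 3]

# Running total of the absolute frequencies, from lowest to highest score.
cumulative_freq = list(accumulate(absolute_freq))

print("score  abs_freq  cum_freq")
for score, freq, cum in zip(scores, absolute_freq, cumulative_freq):
    print(f"{score:>5}  {freq:>8}  {cum:>8}")
```

The last cumulative value always equals the total number of papers, which is a quick check that the table has been tallied correctly.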

Where do we put the 99 percentile points (boundaries) in this data set? How can we put 99 ''fault lines'' into a set that has only 41 possible values? The answer is, obviously, that we can't. What about grouping the students, then? A thousand people have taken the test. Why not break them up into 100 different groups with 99 different boundaries, and then call the 99 boundaries the ''percentile points,'' like this?

- The ''worst'' 10 papers, and the 1st percentile point at the top of that group.
- The ''2nd worst'' 10 papers, and the 2nd percentile point at the top of that group.
- The ''3rd worst'' 10 papers, and the 3rd percentile point at the top of that group.
- The ''*p*th worst'' 10 papers, and the *p*th percentile point at the top of that group.
- The ''*q*th best'' 10 papers, and the (100 – *q*)th percentile point at the bottom of that group.
- The ''3rd best'' 10 papers, and the 97th percentile point at the bottom of that group.
- The ''2nd best'' 10 papers, and the 98th percentile point at the bottom of that group.
- The ''best'' 10 papers, and the 99th percentile point at the bottom of that group.

This looks great at first, but there's a problem. When we check Table 4-1, we can see that 50 people have scored 31 on the test. That's five groups of 10 people, all with the same score. These scores are all ''equally good'' (or ''equally bad''). If we're going to say that any one of these papers is ''in the *p*th percentile,'' then clearly we must say that they are all ''in the *p*th percentile.'' We cannot arbitrarily take 10 papers with scores of 31 and put them in the *p*th percentile, then take 10 more papers with scores of 31 and put them in the (*p* + 1)st percentile, then take 10 more papers with scores of 31 and put them in the (*p* + 2)nd percentile, then take 10 more papers with scores of 31 and put them in the (*p* + 3)rd percentile, and then take 10 more papers with scores of 31 and put them in the (*p* + 4)th percentile. That would be unfair.

This business of percentiles is starting to get confusing and messy, isn't it? By now you must be wondering, ''Who invented this concept, anyway?'' That doesn't matter; the scheme is commonly used and we are stuck with it. What can we do to clear it all up and find a formula that makes sense in all possible scenarios?

**Percentile Points, Ranks, and Inversion**

**Percentile Points**

We get around the foregoing conundrum by defining a scheme for calculating the positions of the percentile points in a set of *ranked data elements*. A set of ranked data elements is a set arranged in a table from ''worst to best,'' such as that in Table 4-1. Once we have defined the percentile positioning scheme, we accept it as a convention, ending all confusion forever and ever.

So, suppose we are given the task of finding the position of the *p*th percentile in a set of *n* ranked data elements. First, multiply *p* by *n*, and then divide the product by 100. This gives us a number *i* called the index:

*i* = *pn*/100

Here are the rules:

- If *i* is not a whole number, then the location of the *p*th percentile point is *i* + 1.
- If *i* is a whole number, then the location of the *p*th percentile point is *i* + 0.5.
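The index and the two rules can be sketched as a small Python function. It applies the rules exactly as stated above; the function name `percentile_point_location` is an assumption made for illustration, and *p* and *n* are assumed to be whole numbers.

```python
def percentile_point_location(p, n):
    """Location of the p-th percentile point among n ranked data elements."""
    if not 1 <= p <= 99:
        raise ValueError("p must be a whole number from 1 to 99")
    product = p * n
    i = product / 100              # the index
    if product % 100 == 0:         # i is a whole number
        return i + 0.5
    return i + 1                   # i is not a whole number

print(percentile_point_location(56, 1000))   # 560.5
```

Testing `product % 100` on the integers avoids relying on floating-point comparison to decide whether *i* is whole.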

**Percentile Ranks**

If we want to find the percentile rank *p* for a given element or position *s* in a ranked data set, we use a different definition. We divide the number of elements less than *s* (call this number *t*) by the total number of elements *n*, and multiply this quantity by 100, getting a tentative percentile *p**:

*p** = 100 × *t* / *n*

Then we round *p** to the nearest whole number between, and including, 1 and 99 to get the percentile rank *p* for that element or position in the set.
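A minimal Python sketch of this definition follows. The name `percentile_rank` is hypothetical, and rounding ties upward is an assumption, since the text only says ''nearest whole number.''

```python
import math

def percentile_rank(t, n):
    """Percentile rank given t elements below the one of interest, out of n."""
    if n <= 0 or not 0 <= t <= n:
        raise ValueError("need n > 0 and 0 <= t <= n")
    tentative = 100 * t / n                 # the tentative percentile p*
    nearest = math.floor(tentative + 0.5)   # nearest whole number (ties up)
    return min(99, max(1, nearest))         # confine the result to 1..99

print(percentile_rank(800, 1000))   # 80
print(percentile_rank(0, 1000))     # 1
```

The final clamp is what pulls a tentative value of 0 up to the 1st percentile and a tentative value near 100 down to the 99th.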

Percentile ranks defined in this way are intervals whose centers are at the percentile boundaries as defined above. The 1st and 99th percentile ranks are often a little bit oversized according to this scheme, especially if the population is large. This is because the 1st and 99th percentile ranks encompass *outliers*, which are elements at the very extremes of a set or distribution.

**Percentile Inversion**

Once in a while you'll hear people use the term ''percentile'' in an inverted, or upside-down, sense. They'll talk about the ''first percentile'' when they really mean the 99th, the ''second percentile'' when they really mean the 98th, and so on. Beware of this! If you get done with a test and think you have done well, and then you're told that you're in the ''4th percentile,'' don't panic. Ask the teacher or test administrator, ''What does that mean, exactly? The top 4%? The top 3%? The top 3.5%? Or what?'' Don't be surprised if the teacher or test administrator is not certain.

**Percentile Practice Problems**

**Practice 1**

Where is the 56th percentile point in the data set shown by Table 4-1?

**Solution 1**

There are 1000 students (data elements), so *n* = 1000. We want to find the 56th percentile point, so *p* = 56. First, calculate the index:

*i* = (56 × 1000)/100 = 560

This is a whole number, so we must add 0.5 to it, getting *i* + 0.5 = 560.5. This means the 56th percentile is the boundary between the ''560th worst'' and ''561st worst'' test papers. To find out what score this represents, we must check the cumulative absolute frequencies in Table 4-1. The cumulative frequency corresponding to a score of 25 is 531 (that's less than 560.5); the cumulative frequency corresponding to a score of 26 is 565 (that's more than 560.5). The 56th percentile point thus lies between scores of 25 and 26.
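The table lookup in this solution can be sketched with Python's `bisect` module. Only the two rows quoted above (scores 25 and 26) are used, because the rest of Table 4-1 is not reproduced here; the function name `bracketing_scores` is hypothetical. The function simply finds the two adjacent rows whose cumulative frequencies lie on either side of the location.

```python
from bisect import bisect_left

def bracketing_scores(scores, cumulative_freq, location):
    """Return the adjacent scores whose cumulative frequencies bracket location."""
    idx = bisect_left(cumulative_freq, location)
    if idx == 0 or idx == len(scores):
        raise ValueError("location is not bracketed by the rows supplied")
    return scores[idx - 1], scores[idx]

# The two rows of Table 4-1 quoted in the solution.
scores = [25, 26]
cumulative_freq = [531, 565]

print(bracketing_scores(scores, cumulative_freq, 560.5))   # (25, 26)
```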

**Practice 2**

If you happen to be among the students taking this test and you score 33, what is your percentile rank?

**Solution 2**

Checking the table, you can see that 800 students have scores less than yours. (Not less than or equal to, but just less than!) This means, according to the second definition above, that *t* = 800. Then the tentative percentile *p** is:

*p** = 100 × 800/1000 = 80

This is a whole number, so rounding it to the nearest whole number between 1 and 99 gives us *p* = 80. You are in the 80th percentile.
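Counting the scores strictly below yours is the step most easily gotten wrong, so here is a short Python sketch of it. The frequency table is a made-up 5-question quiz, not Table 4-1, and the function name `percentile_rank_from_table` is hypothetical; ties in rounding are resolved upward as an assumption.

```python
import math

def percentile_rank_from_table(freq_by_score, score):
    """Percentile rank of a score, given a mapping of score -> frequency."""
    n = sum(freq_by_score.values())
    # Strictly less than: papers with the same score are NOT counted.
    t = sum(freq for s, freq in freq_by_score.items() if s < score)
    tentative = 100 * t / n
    return min(99, max(1, math.floor(tentative + 0.5)))

# Hypothetical results for a 5-question quiz (NOT the data in Table 4-1).
quiz = {0: 2, 1: 5, 2: 9, 3: 14, 4: 7, 5: 3}

print(percentile_rank_from_table(quiz, 4))   # 75
```

Here 30 of the 40 hypothetical papers score below 4, so the tentative percentile is 75 and no rounding or clamping is needed.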

**Practice 3**

If you happen to be among the students taking this test and you score 0, what is your percentile rank?

**Solution 3**

In this case, no one has a score lower than yours. This means, according to the second definition above, that *t* = 0. The tentative percentile *p** is:

*p** = 100 × 0/1000 = 0

Remember, we must round to the nearest whole number between 1 and 99 to get the actual percentile value. This is *p* = 1. Therefore, you rank in the 1st percentile.
