Histograms and Boxplots Study Guide
Introduction to Histograms and Boxplots
Numerical data may be discrete or continuous. In this lesson, we will discuss presenting information on the distributions of discrete and continuous random variables in tabular form. Then we will learn how to display numerical data using histograms and boxplots. Discrete data most frequently arise from counting. In these cases, each observation is a whole number; however, not all discrete data consist solely of whole numbers. In contrast, a continuous random variable can take on any value in one or more intervals on the number line. Because a discrete random variable may assume only a finite or countably infinite number of values, whereas a continuous random variable can assume any of an uncountably infinite number of values, we sometimes have to present data arising from these two types of random variables differently.
Tabular Displays of Discrete Distributions
Categorical data can be organized in tabular form, with the frequency and/or relative frequency presented for each category. We could not compute the cumulative relative frequency when working with categorical data because the categories had no natural ordering. If the number of possible values of a discrete random variable is finite, its population distribution can be displayed in a table, much like we did for the categorical random variable. For each possible value of the discrete random variable, the frequency or relative frequency is presented. Because for discrete numerical data the categories have an order (e.g., one is less than two, which is less than three), we may also want to record the cumulative relative frequencies in the table. The cumulative relative frequency of i is the number of observations with a value of i or less, divided by the total number of observations. If we have a sample from a discrete distribution and not the whole population, we can display the sample distribution in tabular form. For each observed value, we would have the frequency, relative frequency, and perhaps the cumulative relative frequency.
A researcher named John L. Hoogland studied the mating behavior of Gunnison's prairie dogs at Petrified Forest National Park in Arizona for seven years, from 1989 to 1995. In 1998, he published an article titled "Why do female Gunnison's prairie dogs copulate with more than one male?" in Animal Behavior (55:351–359). Each year, all adult and juvenile prairie dogs at the 14-hectare study site were captured and marked. The mating season begins in mid-March and ends in early April; however, a female usually accepts partners on only one day of the breeding season. The number of partners accepted by each female prairie dog during a breeding season was recorded. During this seven-year period, 87, 93, 61, 17, and 5 female prairie dogs accepted one, two, three, four, and five partners, respectively.
- What type of a study was this?
- Present the sample distribution in tabular form.
- All female prairie dogs that could be observed were observed in the study area, so there was no random selection of prairie dogs from some larger population. The number of partners could not be assigned at random, so there was no random assignment of treatments. Therefore, this is an observational study.
- The sample distribution is shown in Table 7.1.
From the table, we see that 87 of the 263 female prairie dogs accepted one partner, but only five accepted five partners. Further, 93 (or 35%) of the females accepted two partners. As in Lesson 3, we have a rounding error because the relative frequency column sums to 0.99 and not to 1.0. Again, we report the 0.99 and not 1.0. It is better to accurately report the relative frequencies than to force columns or rows in a table to total to an inaccurate value. From the cumulative relative frequency column, we see that 68% of the prairie dogs accepted one or two partners.
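The columns described above can be recomputed from the five counts given in the text. The following is a minimal sketch, not the guide's own method; the function name frequency_table is an illustrative choice. It accumulates the counts before dividing, so the cumulative column ends at exactly 1.00 even though the rounded relative frequencies sum to 0.99.

```python
# Counts reported in the text: females accepting 1, 2, 3, 4, and 5 partners.
counts = {1: 87, 2: 93, 3: 61, 4: 17, 5: 5}


def frequency_table(counts):
    """Return rows of (value, frequency, relative freq., cumulative relative freq.)."""
    total = sum(counts.values())
    rows = []
    running = 0  # number of observations with this value or less
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], counts[value] / total, running / total))
    return rows


table = frequency_table(counts)
print(f"{'Partners':>8} {'Freq':>5} {'Rel':>5} {'Cum':>5}")
for value, freq, rel, cum in table:
    print(f"{value:>8} {freq:>5} {rel:>5.2f} {cum:>5.2f}")
```

The second row shows a relative frequency of 0.35 and a cumulative relative frequency of 0.68, matching the figures quoted in the discussion.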
Grouped Discrete Data
When working with discrete data, each possible count is a natural category for a frequency distribution. Sometimes, numerous possible counts exist or a few large or small values are far away from most of the data. In these cases, a frequency distribution with a very long list of possible values does not aid understanding of the data. By grouping the observed values to form class intervals, or simply classes, greater insight into the data is often gained.
As an example, New Zealand's Tourism Research Council conducts an annual survey of international visitors. The length of stay of people traveling to New Zealand from other countries is collected. The results from the 2004 survey are displayed in Table 7.2 (Source: Tourism Research Council of New Zealand website at www.trcnz.govt.nz/Surveys/International+Visitor+Survey/Data+and+Analysis/).
Suppose we had listed each length of stay in days. We would have had more than 30 classes. By grouping, we have eight classes, and it is easy to see that 22% of the international visitors stayed less than five days and 16% stayed at least 30 days. However, we have lost some information. We do not know how many stayed for three days or how many stayed for more than 60 days.
Tabular Displays of Continuous Data
The difficulty in displaying numerical data in tables becomes more pronounced when working with continuous numerical data. Consider the heights of the orchestra members discussed in Lessons 3 and 4. There were 62 orchestra members who had 52 different heights! Forming class intervals allows us to display the frequencies within each class. The challenge is that no natural intervals exist, so we have to define our own. Because the shortest orchestra member was 53.5 inches and the tallest 83.8 inches, it seems natural to begin the classes at 53 inches and stop them at 85 inches. The question is, how wide should each interval be? If we form intervals of 2 inches beginning at 53 inches, we would have 16 classes: 53–55, 55–57, 57–59, 59–61, 61–63, 63–65, 65–67, 67–69, 69–71, 71–73, 73–75, 75–77, 77–79, 79–81, 81–83, 83–85. If we choose 4-inch intervals, we would have eight classes: 53–57, 57–61, 61–65, 65–69, 69–73, 73–77, 77–81, 81–85. For now, let's look at both of these possibilities and later decide which one would be best.
We have one more problem that must be addressed before we can complete the frequency table: What happens if an observation falls on the boundary? As an illustration, one orchestra member is 65.0 inches tall. Should that person be in the 63–65 or the 65–67 class interval? We will adopt the convention that the lower boundary but not the upper boundary is included in a class. For the 65.0-inch-tall orchestra member, this means that she will be counted in the 65–67 class interval.
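The boundary convention can be written as a small function. This is a sketch under the assumption that classes start at 53 inches and have equal width, as in the lists above; the name class_interval is illustrative.

```python
import math


def class_interval(x, start=53.0, width=2.0):
    """Return (lower, upper) for the class containing x.

    The lower boundary is included and the upper boundary is excluded,
    so a value sitting exactly on a boundary goes into the higher class.
    """
    index = math.floor((x - start) / width)
    lower = start + index * width
    return (lower, lower + width)


print(class_interval(65.0))             # (65.0, 67.0)
print(class_interval(65.0, width=4.0))  # (65.0, 69.0)
print(class_interval(64.9))             # (63.0, 65.0)
```

The 65.0-inch member lands in the 65–67 class, while a height just under 65 inches stays in 63–65.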
For 2-inch intervals, we would have the frequency table shown here in Table 7.3.
Using the 4-inch class intervals, we obtain the frequency distribution shown in Table 7.4.
Before deciding whether to use the 2-inch or 4-inch class intervals, we will learn how to construct histograms.
First, suppose we have ungrouped discrete data as in the number of partners for the Gunnison's prairie dog example. Constructing a histogram requires the following steps:
- On the horizontal axis, draw a scale, mark the possible values, and label the axis.
- On the vertical axis, draw a scale, mark it with either frequencies or relative frequencies, and label the scale.
- For each possible value, draw a rectangle centered at that value with a height determined by the corresponding frequency or relative frequency.
Construct the relative frequency histogram for the number of partners for Gunnison's prairie dogs.
The relative frequency histogram is displayed in Figure 7.1.
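When no plotting tool is at hand, the three steps can be approximated in plain text. The sketch below is an illustration only and not a reproduction of Figure 7.1: it turns the histogram on its side, printing one row per possible value with a bar whose length is proportional to the relative frequency. The function names and the scale of 100 characters per unit are arbitrary choices.

```python
# Counts reported in the text: females accepting 1, 2, 3, 4, and 5 partners.
counts = {1: 87, 2: 93, 3: 61, 4: 17, 5: 5}


def relative_frequencies(counts):
    """Map each possible value to its relative frequency, in increasing order."""
    total = sum(counts.values())
    return {value: freq / total for value, freq in sorted(counts.items())}


def text_histogram(rel_freqs, chars_per_unit=100):
    """Build one text row per value; bar length is proportional to relative frequency."""
    lines = []
    for value, rel in rel_freqs.items():
        bar = "#" * round(rel * chars_per_unit)
        lines.append(f"{value} | {bar} {rel:.2f}")
    return lines


histogram = text_histogram(relative_frequencies(counts))
for line in histogram:
    print(line)
```

The longest bar belongs to two partners, and the bars shrink quickly for four and five partners.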
When working with continuous data, class intervals must be formed, as in Tables 7.3 and 7.4, before a histogram can be constructed. Once this is done, the process of constructing a histogram is similar to that for discrete data. For the 2-inch intervals of orchestra members' heights presented in Table 7.3, the histogram is shown in Figure 7.2.
When looking at a histogram, you should look for a center value, the extent of spread or dispersion, the general shape, the location and number of peaks, and the presence of gaps and outliers. Here, perhaps the most notable feature is the three peaks at 60, 68, and 72 inches. These make determining the center, or typical, value a little difficult. The center of the data seems to be about 66 inches. This agrees well with the mean of 66.4 inches and the median of 66.85 inches found in Lesson 5. The spread seems to be from about 54 to 84 inches. An unusually small value and an unusually large value are separated from the rest of the observations by gaps.
In Figure 7.3, the class intervals are 4 inches wide (see Table 7.4). The center appears to be at about 67 inches, still consistent with the mean and median found in Lesson 5. However, there are only two peaks now, one at about 59 inches and the other at about 67 inches. The shortest person no longer seems to be an outlier, as no gap exists on the graph between that value and the next smallest value, but the tallest orchestra member continues to appear to be an unusual value.
If class intervals are made too small, the average number of observations in each interval is small and subject to quite a bit of variation. This results in fluctuations in the height of the bars that may simply reflect fluctuations in the data and not true distributional characteristics. In contrast, if the class intervals are too wide, important features of the data may be obscured. The histogram using 2-inch intervals appears to reflect the fluctuations in the data, whereas the one based on 4-inch intervals more clearly reflects the distributional characteristics. Thus, of the two, the 4-inch intervals are more appropriate. As a general rule, the number of classes is often set approximately to the square root of the number of data points. Using this rule, seven or eight classes would be about the right number, as there are 62 orchestra members. This agrees with our choice of 4-inch intervals.
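The square-root rule is easy to check numerically. The sketch below assumes the classes span 53 to 85 inches, as in the class lists given earlier; the function names are illustrative.

```python
import math


def suggested_classes(n):
    """Square-root rule of thumb: roughly sqrt(n) classes for n data points."""
    return math.sqrt(n)


def class_width(low, high, classes):
    """Width of each class when [low, high] is split into equal classes."""
    return (high - low) / classes


k = suggested_classes(62)
print(round(k, 2))              # 7.87, so seven or eight classes
print(class_width(53, 85, 8))   # 4.0 inches per class
print(class_width(53, 85, 16))  # 2.0 inches per class
```

Eight classes over this range give 4-inch intervals, whereas 2-inch intervals require 16 classes, about twice what the rule suggests.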
Another graph that is extremely useful is the boxplot. To create a boxplot, we need the five-number summary. The five-number summary consists of the first quartile, the median, the third quartile, the smallest observed value, and the largest observed value. The following steps lead to a boxplot:
- Draw a scale that extends below the smallest and above the largest values in the data set on either the horizontal or vertical axis.
- Draw parallel line segments at the first quartile, the median, and the third quartile. Connect the ends of the three parallel line segments to form a box.
- Extend "whiskers" from the center of the first quartile line segment and the center of the third quartile line segment to the smallest and largest observations, respectively, as long as these most extreme observations are within 1.5 IQR of the closest quartile. Otherwise, extend them to the smallest value within 1.5 IQR of the first quartile and to the largest value within 1.5 IQR of the third quartile.
- If there are any observations beyond 1.5 IQR of the nearest quartiles (so the whiskers do not extend to these values), mark these observations with an asterisk (*).
- The mean may also be marked using a circle, which allows an easy comparison of the relative sizes of the mean and median.
Any value that is more than 1.5 IQR units below Q1 or above Q3 is defined to be an outlier.
Create a boxplot for the orchestra members' heights introduced in Lesson 4. Identify any outliers that might be present.
Referring again to our work with the orchestra members' heights in Lessons 5 and 6, we have the following five-number summary:
- First quartile (Q1): 62.5
- Median (Q2): 66.85
- Third quartile (Q3): 69.5
- Smallest value: 53.5
- Largest value: 83.8
As determined in Lesson 6, the IQR for the orchestra members' heights is 7 inches; thus, 1.5 IQR = 1.5(7) = 10.5 inches. Q1 – 1.5 IQR = 62.5 – 10.5 = 52 inches. Because the shortest orchestra member is 53.5 inches tall, which is greater than the 52 inches just calculated, the lower whisker extends to 53.5 inches, the smallest observation. Now, Q3 + 1.5 IQR = 69.5 + 10.5 = 80 inches. The two tallest orchestra members are 76.7 and 83.8 inches tall. The member who is 76.7 inches tall is within 1.5 IQR of the third quartile, but the tallest member is not. Therefore, the upper whisker extends from Q3 to 76.7 inches (the largest observation within 1.5 IQR of Q3), and an asterisk is used to designate the 83.8 inches that is beyond the end of the whisker. Finally, the mean is denoted with a circle in the box.
Here, the orchestra member who is 83.8 inches tall is an outlier, but the shortest member who is 53.5 inches tall is not. Notice that the histogram based on 4-inch class intervals reflects which values are outliers better than the one based on 2-inch class intervals (see Figure 7.4).
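The whisker and outlier rules can be checked with a few lines of code. The full list of 62 heights is not reproduced in this lesson, so the sketch below applies the rules only to the three extreme heights the text reports; with the complete data set, the same functions would be applied to all observations. The function names are illustrative.

```python
def fences(q1, q3, k=1.5):
    """Return (lower, upper) limits located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


def outliers(values, q1, q3):
    """Values more than 1.5 IQR below Q1 or above Q3."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]


def whisker_ends(values, q1, q3):
    """Smallest and largest values that lie within 1.5 IQR of the quartiles."""
    low, high = fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    return (min(inside), max(inside))


q1, q3 = 62.5, 69.5
extremes = [53.5, 76.7, 83.8]  # only the extreme heights reported in the text

print(fences(q1, q3))                  # (52.0, 80.0)
print(outliers(extremes, q1, q3))      # [83.8]
print(whisker_ends(extremes, q1, q3))  # (53.5, 76.7)
```

Only the 83.8-inch height falls outside the limits, so it is the single value marked separately on the boxplot.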
Shape of a Distribution
Unlike dotplots and stem-and-leaf plots, histograms and boxplots may be used with very large data sets. All four, but especially the histograms and boxplots, provide a visual display of the shape of the distribution. Three specific shapes will be discussed most frequently: symmetric, right skewed, and left skewed. If a vertical line can be drawn through the center of a histogram such that the area to the left of the line is a mirror image of the area to the right, the distribution is symmetric (see Figure 7.5). For boxplots, a distribution is symmetric if the shape of the box and length of the whisker for observations that are smaller than the median are a mirror image of the shape of the box and length of the whisker for observations that are greater than the median. The most common continuous distribution that is unimodal and symmetric is the normal distribution.
If a distribution has one mode (is unimodal) and is not symmetric, the distribution is said to be skewed. Proceeding to the right of the mode in a unimodal distribution, we move to the upper, or right, tail of the distribution. Similarly, we move to the lower, or left, tail of the distribution as we proceed to the left of the mode in a unimodal distribution. If the upper tail stretches out farther than the lower tail, the distribution is right or positively skewed (see Figure 7.5). If the lower tail stretches out farther than the upper tail, the distribution is left or negatively skewed (see Figure 7.5). These shapes may be seen in both histograms and boxplots.
Histograms and Boxplots In Short
Numerical data can be summarized in tabular form. However, if the number of possible values of a discrete random variable is large or if the random variable is continuous, then possible values need to be grouped. Frequency or relative frequency histograms and boxplots provide visual summaries of the distributions.
Find practice problems and solutions for these concepts at Histograms and Boxplots Practice Exercises.