Histograms and Boxplots Study Guide
Introduction to Histograms and Boxplots
Numerical data may be discrete or continuous. In this lesson, we will discuss presenting information on the distributions of discrete and continuous random variables in tabular form. Then we will learn how to display numerical data using histograms and boxplots. Discrete data most frequently arise from counting. In these cases, each observation is a whole number; however, some discrete data do not consist entirely of whole numbers. In contrast, a continuous random variable can take on any value in one or more intervals on the number line. Because a discrete random variable may assume only a finite or countably infinite number of values, whereas a continuous random variable can assume any of an uncountably infinite number of values, we sometimes have to present data from these two types of random variables differently.
Tabular Displays of Discrete Distributions
Categorical data can be organized in tabular form. For each category, the frequency and/or relative frequency is presented. We cannot compute the cumulative relative frequency when working with categorical data because the categories have no natural ordering. If the number of possible values of a discrete random variable is finite, its population distribution can be displayed in a table, much as we did for a categorical random variable. For each possible value of the discrete random variable, the frequency or relative frequency is presented. Because the categories of discrete numerical data have an order (e.g., one is less than two, which is less than three), we may also want to record the cumulative relative frequencies in the table. The cumulative relative frequency of i is the number of observations with a value of i or less, divided by the total number of observations. If we have a sample from a discrete distribution and not the whole population, we can display the sample distribution in tabular form. For each observed value, we would have the frequency, relative frequency, and perhaps the cumulative relative frequency.
A researcher named John L. Hoogland studied the mating behavior of Gunnison's prairie dogs at Petrified Forest National Park in Arizona for seven years, from 1989 to 1995. In 1998, he wrote an article titled "Why do female Gunnison's prairie dogs copulate with more than one male?" in Animal Behavior (55:351–359). Each year, all adult and juvenile prairie dogs at the 14-hectare study site were captured and marked. Mating season begins in mid-March and ends in early April. However, a female usually accepts partners on only one day of the breeding season. The number of partners accepted by each female prairie dog during a breeding season was recorded. During this seven-year period, 87, 93, 61, 17, and 5 female prairie dogs accepted one, two, three, four, and five partners, respectively.
- What type of a study was this?
- Present the sample distribution in tabular form.
- All female prairie dogs that could be observed were observed in the study area, so there was no random selection of prairie dogs from some larger population. The number of partners could not be assigned at random, so there was no random assignment of treatments. Therefore, this is an observational study.
- The sample distribution is shown in Table 7.1.
From the table, we see that 87 of the 263 female prairie dogs accepted one partner, but only five accepted five partners. Further, 93 (or 35%) of the females accepted two partners. As in Lesson 3, we have a rounding error: the rounded values in the relative frequency column sum to 0.99 and not to 1.0. Again, we report the total as 1.0 and not 0.99. It is better to report the relative frequencies accurately than to force a column or row of a table to total to an inaccurate value. From the cumulative relative frequency column, we see that 68% of the female prairie dogs accepted one or two partners.
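The frequency, relative frequency, and cumulative relative frequency columns described above can be recomputed from the counts given in the study description. The following is a minimal Python sketch; the function name `frequency_table` and the printed layout are illustrative choices, not part of the original lesson. The cumulative column is built from running counts rather than from the rounded relative frequencies, which is why its last entry is exactly 1.0 even though the rounded relative frequencies sum to 0.99.

```python
# Counts from the text: number of partners -> number of females
counts = {1: 87, 2: 93, 3: 61, 4: 17, 5: 5}

def frequency_table(counts):
    """Return rows of (value, frequency, relative freq., cumulative relative freq.)."""
    total = sum(counts.values())
    rows = []
    running = 0  # number of observations with this value or less
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], counts[value] / total, running / total))
    return rows

rows = frequency_table(counts)
print("partners  freq  rel.freq  cum.rel.freq")
for value, freq, rel, cum in rows:
    print(f"{value:>8}  {freq:>4}  {rel:>8.2f}  {cum:>12.2f}")
```

The second row of the output corresponds to the 93 females (35%) who accepted two partners, with a cumulative relative frequency of 0.68.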
Grouped Discrete Data
When working with discrete data, each possible count is a natural category for a frequency distribution. Sometimes, numerous possible counts exist or a few large or small values are far away from most of the data. In these cases, a frequency distribution with a very long list of possible values does not aid understanding of the data. By grouping the observed values to form class intervals, or simply classes, greater insight into the data is often gained.
As an example, New Zealand's Tourism Research Council conducts an annual survey of international visitors. The survey records the length of stay of people traveling to New Zealand from other countries. The results from the 2004 survey are displayed in Table 7.2 (Source: Tourism Research Council of New Zealand website at www.trcnz.govt.nz/Surveys/International+Visitor+Survey/Data+and+Analysis/).
Suppose we had listed each length of stay in days. We would have had more than 30 classes. By grouping, we have eight classes, and it is easy to see that 22% of the international visitors stayed less than five days and 16% stayed at least 30 days. However, we have lost some information. We do not know how many stayed for three days or how many stayed for more than 60 days.
Tabular Displays of Continuous Data
The difficulty in displaying numerical data in tables becomes more pronounced when working with continuous numerical data. Consider the heights of the orchestra members discussed in Lessons 3 and 4. There were 62 orchestra members who had 52 different heights! Forming class intervals allows us to display the frequencies within each class. The challenge is that no natural intervals exist, so we have to define our own. Because the shortest orchestra member was 53.5 inches and the tallest 76.7 inches, it seems natural to begin the classes at 53 inches and stop them at 77 inches. The question is how wide each interval should be. If we form intervals of 2 inches beginning at 53 inches, we would have 16 classes: 53–55, 55–57, 57–59, 59–61, 61–63, 63–65, 65–67, 67–69, 69–71, 71–73, 73–75, 75–77, 77–79, 79–81, 81–83, 83–85. If we choose 4-inch intervals, we would have eight classes: 53–57, 57–61, 61–65, 65–69, 69–73, 73–77, 77–81, 81–85. For now, let's look at both of these possibilities and later decide which one would be best.
We have one more problem that must be addressed before we can complete the frequency table: What happens if an observation falls on the boundary? As an illustration, one orchestra member is 65.0 inches tall. Should that person be in the 63–65 or the 65–67 class interval? We will adopt the convention that the lower boundary but not the upper boundary is included in a class. For the 65.0-inch-tall orchestra member, this means that she will be counted in the 65–67 class interval.
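The boundary convention above (lower boundary included, upper boundary excluded) can be written as a small Python sketch. The function name `class_interval` and its default arguments are assumptions made for illustration; the starting value of 53 inches and the 2-inch and 4-inch widths come from the text.

```python
import math

def class_interval(height, start=53.0, width=2.0):
    """Return (lower, upper) for the class containing height.

    The lower boundary is included in the class; the upper boundary is not.
    """
    index = math.floor((height - start) / width)  # how many whole classes lie below
    lower = start + index * width
    return (lower, lower + width)

print(class_interval(65.0))             # boundary value with 2-inch classes
print(class_interval(64.9))             # just below the boundary
print(class_interval(65.0, width=4.0))  # same height with 4-inch classes
```

With 2-inch classes, a height of exactly 65.0 inches is assigned to the 65–67 class, while 64.9 inches falls in the 63–65 class.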
For 2-inch intervals, we would have the frequency table shown here in Table 7.3.
Using the 4-inch class intervals, we obtain the frequency distribution shown in Table 7.4.
Before deciding whether to use the 2-inch or 4-inch class intervals, we will learn how to construct histograms.
First, suppose we have ungrouped discrete data as in the number of partners for the Gunnison's prairie dog example. Constructing a histogram requires the following steps:
- On the horizontal axis, draw a scale, mark the possible values, and label the axis.
- On the vertical axis, draw a scale, mark it with either frequencies or relative frequencies, and label the scale.
- For each possible value, draw a rectangle centered at that value with a height determined by the corresponding frequency or relative frequency.
Construct the relative frequency histogram for the number of partners for Gunnison's prairie dogs.
The relative frequency histogram is displayed in Figure 7.1.
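As a rough stand-in for the figure, the three construction steps can be sketched in Python as a text histogram. This is only an illustration: the bars are drawn sideways (one row per possible value) instead of as vertical rectangles, and the function name `text_histogram` and the `scale` parameter are assumptions, not part of the lesson.

```python
# Counts from the text: number of partners -> number of females
counts = {1: 87, 2: 93, 3: 61, 4: 17, 5: 5}

def text_histogram(counts, scale=100):
    """Return one text bar per value; bar length is relative frequency * scale, rounded."""
    total = sum(counts.values())
    lines = []
    for value in sorted(counts):          # mark the possible values in order
        rel = counts[value] / total       # bar size is set by the relative frequency
        bar = "#" * round(rel * scale)
        lines.append(f"{value} | {bar} {rel:.2f}")
    return lines

for line in text_histogram(counts):
    print(line)
```

The longest bar belongs to two partners and the shortest to five partners, matching the relative frequencies of 0.35 and 0.02.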
When working with continuous data, class intervals must be formed, as in Tables 7.3 and 7.4, before a histogram can be constructed. Once this is done, the process of constructing a histogram is similar to that for discrete data. For the 2-inch intervals of orchestra members' heights presented in Table 7.3, the histogram is shown in Figure 7.2.
When looking at a histogram, you should look for a center value, the extent of spread or dispersion, the general shape, the location and number of peaks, and the presence of gaps and outliers. Here, perhaps the most notable feature is the three peaks at 60, 68, and 72 inches. These make determining the center, or typical, value a little difficult. The center of the data seems to be about 66 inches. This agrees well with the mean of 66.4 inches and the median of 66.85 inches found in Lesson 5. The spread seems to be from about 54 to 84 inches. An unusually small value and an unusually large value are separated from the rest of the observations by gaps.
In Figure 7.3, the class intervals are 4 inches wide (see Table 7.4). The center appears to be at about 67 inches, still consistent with the mean and median found in Lesson 5. However, there are only two peaks now, one at about 59 inches and the other at about 67 inches. The shortest person no longer seems to be an outlier because no gap exists on the graph between that value and the next smallest value, but the tallest orchestra member still appears to be an unusual value.
If class intervals are made too small, the average number of observations in each interval is small and subject to quite a bit of variation. This results in fluctuations in the height of the bars that may simply reflect fluctuations in the data and not true distributional characteristics. In contrast, if the class intervals are too wide, important features of the data may be obscured. The histogram using 2-inch intervals appears to reflect the fluctuations in the data, whereas the one based on 4-inch intervals more clearly reflects the distributional characteristics. Thus, 4-inch intervals are the more appropriate choice. As a general rule, the number of classes is often set approximately to the square root of the number of data points. Using this rule, seven or eight classes would be about the right number because there are 62 orchestra members. This agrees with our choice of 4-inch intervals.
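The square-root rule of thumb can be checked with a short Python calculation. The sketch below assumes the classes cover 53 to 85 inches, as in the class lists given earlier; the variable names are illustrative only, and the rule is a rough guide, not a fixed requirement.

```python
import math

n = 62               # number of orchestra members
low, high = 53, 85   # range covered by the class intervals listed in the text

suggested = math.sqrt(n)          # rule-of-thumb number of classes
num_classes = round(suggested)    # nearest whole number of classes
width = (high - low) / num_classes

print(f"sqrt({n}) = {suggested:.2f}, classes = {num_classes}, width = {width} inches")
```

The square root of 62 is between 7 and 8; rounding to eight classes over a 32-inch range gives the 4-inch width chosen above.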