Describing and Displaying Categorical Data Study Guide (page 3)
Introduction to Describing and Displaying Categorical Data
Population is the set of objects or individuals of interest, and that the sample is a subset of the population. When presented with a set of either population or sample values, we need to summarize them in some way if we are to gain insight into the information they provide. In this lesson, we will focus on the presentation of categorical data, both categorically and graphically.
Population Distribution versus Sample Distribution
Suppose that a high school orchestra has 62 members, 35 females and 27 males. The population of interest is the set of all orchestra members. Data are available on the categorical random variable, gender. The collection of genders for the population represents the population distribution of genders. In this example, the population is finite because there are a finite number of units (orchestra members) in the population.
Suppose 15 of the orchestra members are randomly selected. The genders of these 15 represent the sample distribution of gender. A summary measure of a population is called a parameter; a summary measure of a sample distribution, which is a function of sample values and has no unknown parameters, is called a statistic. Frequencies and relative frequencies are important summary measures for categorical variables such as the gender of the orchestra members.
As the sample size increases, the sample distribution tends to be more like the population distribution as long as the units for the sample have been drawn at random from the population. This is comforting in that, intuitively, we expect for the sample "be better" as the sample size increases.What we mean by "better" is not always clear. However, the fact that the sample distribution begins to tend toward the population distribution is one way in which we have done better. Other measures of "better" will be discussed in later lessons.
Frequency and Relative Frequency
The nature of categorical data leads to counts of the numbers falling within each category: the numbers of females and males; the numbers of red, yellow, or blue items; and the numbers ordering pizza, hamburger, or chicken. Notice, if we have two categories,we have two counts; three categories, three counts; and so on.
The number of times a category appears in a data set is called the frequency of that category. The relative frequency of a category is the proportion of times that category occurs in the data set; that is,
relative frequency = frequency/number of observations in the data set
These frequencies or relative frequencies are best organized in tabular form. The table should display all possible categories and the frequencies or relative frequencies. The frequency distribution (or relative frequency distribution) for categorical data is the categories with their associated frequencies (or relative frequencies). It is important to remember that if we have all population values, we can find the population frequencies and population relative frequencies. Otherwise, population values, then we can find the sample frequencies and sample relative frequencies. For the band members, we have all population values. The population frequency and relative frequency distributions for the genders of these band members are displayed in Table 3.1.
In the orchestra of 62 members, 36 play a string instrument, 12 play a woodwind, ten play brass, and four play percussion. Provide a tabular display of the frequency and relative frequency distributions for the type of instruments for this orchestra.
Note that, although the frequencies sum properly to the total number of orchestra members, the relative frequencies actually sum to 0.99, not to 1 as indicated. The reason for this is round-off error. When dividing the category frequency by the total number of members, the result was not always an exact two-decimal value.We rounded to two decimal places. It is better to use the 1.00 as the total rather than reﬂecting the rounding error in the total.
The orchestra is going to a special awards banquet during which dinner will be served. The hosts of the banquet need to know in advance whether the orchestra members prefer steak, ﬁsh, or pasta as their main dish. Thirty-two members choose steak, 12 choose ﬁsh, and 18 choose pasta. Provide a tabular display of the frequency and relative frequency distributions of the orchestra members' main dish choices.
Pie Charts - Visual Displays for Categorical Data
Pie charts and bar charts are common graphical approaches to displaying data. The frequencies, relative frequencies, or percentages can be presented graphically using pie charts or bar charts. For each category of chart, percent = relative frequency × 100%.
To make a pie chart, ﬁrst draw a circle to represent the entire data set. For each category, the "slice" size is the category's relative frequency times 360 (because there are 360 ° in a circle). Each slice should be labeled with the category name. The numerical value of the frequency, relative frequency, or percentage associated with each slice should also be shown on the graph. Percentages are presented most commonly in pie charts. As an example, consider the frequency distribution of the high school orchestra members' gender. Thirty-ﬁve or 56.5% were female, and 27 or 43.5% were male. For the pie chart, the slice for females is of the pie; the remaining is for the males. See Figure 3.1.
For the relative frequency distribution of the types of instruments played by the high school orchestra members discussed earlier in the lesson, create a pie chart.
First, the size of each pie must be found. For example, for the woodwinds, the slice of the pie is . After performing this calculation for each instrument type, the following pie chart can be created (see Figure 3.2).
Create a pie chart of the frequency and relative frequency distribution of the orchestra members' meal choice for the awards banquet.
See Figure 3.3 for an illustration of the orchestra members' meal choices.
Bar Charts - Visual Displays for Categorical Data
Bar charts may be used to display frequencies, relative frequencies, or percentages represented by each category in a data set. If there is only a response variable and frequencies are to be presented, a bar is used for each category and the height of the bar corresponds to the number of times that category occurs in the data set. For a relative frequency bar chart, a bar is used for each category, and the height of the bar corresponds to the proportion of times that response occurs in the data set. As an illustration, consider a relative frequency bar chart of the genders of the orchestra members.
Notice that in Figure 3.4, both the x- and y-axes are labeled. Categories are displayed on the x-axis, and an appropriate scale is used on the y-axis.
Create a frequency bar chart of the orchestra members' instruments.
See Figure 3.5 for a bar chart illustrating the instrument frequency.
Create a percent bar chart of the types of meals ordered for the awards banquet.
See Figure 3.6 for a bar chart of the percentages of meals ordered.
Visual Displays for Categorical Data with Both Response and Explanatory Variables
Suppose a data set has a response and an explanatory variable, both of which are categorical. Pie charts and bar charts can be used to give visual displays of how the response variable might depend on the explanatory variable. For pie charts, a separate pie is created for each category of the explanatory variable.
The instructor of the university class described in practice problem 1 was initially surprised by the results. Then she realized that whether the students used "pop," "soda," or "coke" in referring to a cola beverage probably depended on the region of the country in which the student grew up. She asked each student where he or she grew up and which term he or she used for a cola beverage. Of the students from the South, 25 used the term "coke" and four used "pop." Of the students from the Midwest, 17 used the term "pop" and 10 used "soda." Of the students from the Northeast, 21 used "soda" and six used "pop." No other region of the United States was represented in the class.
Present the data in tabular form showing both counts and relative frequencies for each response variable-explantory variable combination. Make side-by-side pie charts presenting the data.
The teacher has recorded the value of an explanatory variable (region) to help explain the response variable values (name used to refer to a cola beverage). It is important to notice that the relative frequencies of the response variable are computed for each region.As an illustration, the sum of the relative frequencies for the South will be one. The relative frequency of using the term "coke" for students from the South was . We do not divide by the total number in the class (83), only by the number from the South (29). See Table 3.4 and Figure 3.7 for illustrations of the data.
From the pie charts, it is easy to see that the terms used for cola beverages are quite different for various regions.Notice that these pie charts for the individual regions provide greater insight than the overall pie chart you created in practice problem 4.
In a manner similar to the pie charts, bar graphs can be used to compare the response variable for varying values of the explanatory variable. Suppose we want to graph the relative frequencies. First, form a group of bars for each category of the explanatory variable.Within each group of bars, one bar is drawn for each category of the response variable. The height of the bar depends on the relative frequency.
Create a bar chart that shows how the term used for a cola beverage changes with region of the country.
See Figure 3.8 for a bar chart showing cola terminology in different regions of the country.
Describing and Displaying Categorical Data In Short
In this lesson,we have discussed the difference in population and sample distribution. As the sample size increases, the sample distribution tends to be more like the population distribution as long as sample units are selected at random. When working with categorical variables, pie charts and bar charts give a nice visual display of the data. Sometimes, an explanatory variable helps explain differences in responses. In these cases, multiple pie charts or side-by-side bar charts can help us visualize the relationship of the explanatory variable to the response.
Find practice problems and solutions for these concepts at Describing and Displaying Categorical Data Practice Exercises.