Describing and Displaying Categorical Data Study Guide
Introduction to Describing and Displaying Categorical Data
The population is the set of objects or individuals of interest, and the sample is a subset of the population. When presented with a set of either population or sample values, we need to summarize them in some way if we are to gain insight into the information they provide. In this lesson, we will focus on the presentation of categorical data, both in tabular form and graphically.
Population Distribution versus Sample Distribution
Suppose that a high school orchestra has 62 members, 35 females and 27 males. The population of interest is the set of all orchestra members. Data are available on the categorical random variable, gender. The collection of genders for the population represents the population distribution of genders. In this example, the population is finite because there are a finite number of units (orchestra members) in the population.
Suppose 15 of the orchestra members are randomly selected. The genders of these 15 represent the sample distribution of gender. A summary measure of a population is called a parameter; a summary measure of a sample distribution, which is a function of sample values and has no unknown parameters, is called a statistic. Frequencies and relative frequencies are important summary measures for categorical variables such as the gender of the orchestra members.
As the sample size increases, the sample distribution tends to be more like the population distribution, as long as the units for the sample have been drawn at random from the population. This is comforting in that, intuitively, we expect the sample to "be better" as the sample size increases. What we mean by "better" is not always clear. However, the fact that the sample distribution tends toward the population distribution is one way in which we have done better. Other measures of "better" will be discussed in later lessons.
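The tendency described above can be seen in a small simulation. The following is a minimal sketch, assuming the orchestra from the example (35 females, 27 males) and random sampling without replacement; the function name, the seed, and the sample sizes are illustrative choices, not part of the lesson.

```python
import random

# Population from the text: 35 females and 27 males (62 orchestra members).
population = ["F"] * 35 + ["M"] * 27


def sample_proportion_female(n, rng):
    """Draw n members at random without replacement; return the share of females."""
    sample = rng.sample(population, n)
    return sample.count("F") / n


rng = random.Random(1)  # arbitrary seed, used only to make runs repeatable
population_proportion = population.count("F") / len(population)

# Sample sizes chosen for illustration; 15 matches the sample in the text.
for n in (5, 15, 30, 62):
    p_hat = sample_proportion_female(n, rng)
    print(f"n = {n:2d}  sample proportion female = {p_hat:.3f}  "
          f"(population: {population_proportion:.3f})")
```

When n = 62 the sample is the entire population, so the sample proportion equals the population proportion exactly. For smaller n the sample proportion varies from one random draw to the next, and it typically lands closer to the population value as n grows.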
Frequency and Relative Frequency
The nature of categorical data leads to counts of the numbers falling within each category: the numbers of females and males; the numbers of red, yellow, or blue items; and the numbers ordering pizza, hamburger, or chicken. Notice that if we have two categories, we have two counts; three categories, three counts; and so on.
The number of times a category appears in a data set is called the frequency of that category. The relative frequency of a category is the proportion of times that category occurs in the data set; that is,
relative frequency = frequency/number of observations in the data set
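As a quick illustration of the formula, the sketch below counts categories and divides each count by the number of observations. The list of lunch orders is hypothetical data invented for this example; only the pizza, hamburger, and chicken categories come from the lesson.

```python
from collections import Counter

# Hypothetical lunch orders, invented purely for illustration.
orders = ["pizza", "hamburger", "pizza", "chicken", "pizza",
          "hamburger", "chicken", "pizza", "hamburger", "pizza"]

# Frequency: the number of times each category appears in the data set.
frequency = Counter(orders)

# Relative frequency: frequency / number of observations in the data set.
n = len(orders)
relative_frequency = {category: count / n
                      for category, count in frequency.items()}

for category in frequency:
    print(category, frequency[category], relative_frequency[category])
```

Three categories give three counts, and the relative frequencies add up to 1 because every observation falls in exactly one category.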
These frequencies or relative frequencies are best organized in tabular form. The table should display all possible categories and the frequencies or relative frequencies. The frequency distribution (or relative frequency distribution) for categorical data is the set of categories with their associated frequencies (or relative frequencies). It is important to remember that if we have all population values, we can find the population frequencies and population relative frequencies. If we have only sample values, then we can find the sample frequencies and sample relative frequencies. For the orchestra members, we have all population values. The population frequency and relative frequency distributions for the genders of these orchestra members are displayed in Table 3.1.
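A table of this kind can be built directly from the counts given earlier (35 females and 27 males among 62 members). The sketch below is one possible layout computed from those counts, not a reproduction of Table 3.1; the function name and the column formatting are assumptions made for illustration.

```python
# Population counts stated in the text.
counts = {"Female": 35, "Male": 27}


def frequency_table(counts):
    """Return one (category, frequency, relative frequency) row per category."""
    total = sum(counts.values())
    return [(category, freq, freq / total) for category, freq in counts.items()]


rows = frequency_table(counts)

# Print the rows in a simple plain-text layout with a totals line.
print(f"{'Gender':<8}{'Frequency':>10}{'Relative frequency':>20}")
for category, freq, rel in rows:
    print(f"{category:<8}{freq:>10}{rel:>20.3f}")
print(f"{'Total':<8}{sum(counts.values()):>10}{sum(r[2] for r in rows):>20.3f}")
```

Because the counts cover every member of the population, the values in the rows are population frequencies and population relative frequencies rather than sample statistics.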