Analyzing Categorical Data Study Guide

Updated on Oct 5, 2011

Introduction to Analyzing Categorical Data

In recent lessons, the responses have been numerical. When working with two treatments or populations, the treatments or populations may be thought of as categorical explanatory variables. In this lesson, we will discuss what to do if the responses, and not just the explanatory variables, are categorical.

Univariate Categorical Data

Univariate categorical data may arise in a variety of settings. In a study of sea turtles, the number of males and females hatched during a season may be recorded. Here two categories exist, male and female, and the data consist of the number of observations falling within each category. A car arriving at an intersection may continue forward, turn left, turn right, or reverse direction (make a U-turn), resulting in four categories of a univariate response.

The proportion p_i of the population within category i, i = 1, 2, . . . , r, tends to be of primary interest in such studies. As an example, we might hypothesize that the proportions of male and female sea turtles hatched each season are the same (0.50 each). For the car's movement at a particular intersection, one hypothesis would be that 50% continue forward, 30% turn right, 15% turn left, and 5% make a U-turn. Hypothesis tests are conducted to determine whether or not the observed data are compatible with these hypotheses.


A large grocery store wants to decrease the time needed to check out a customer. One aspect of this is the method by which the customer pays. Payment by cash, check, debit card, or credit card is accepted. The store manager believes that 35% of the transactions are by cash, 5% are by check, 50% are by debit card, and 10% are by credit card. Here the percentages are referring to the number of transactions and not the amount of a transaction. She plans a study to determine whether or not these percentages are correct.

  1. Give the response variable and identify the parameters associated with its categories.
  2. Set the hypotheses to be tested.


  1. The response variable of interest is a customer's method of payment.
  2. We may define the following:

    p1 = proportion of cash transactions

    p2 = proportion of check transactions

    p3 = proportion of debit card transactions

    p4 = proportion of credit card transactions

It is not important whether we use p1 to denote the proportion of cash transactions or the proportion of some other type of transaction. However, clearly stating what p1 represents is important. The set of hypotheses of interest is

H0: p1 = 0.35, p2 = 0.05, p3 = 0.50, and p4 = 0.10

Ha: not H0 (at least one proportion differs from its hypothesized value)

Goodness-of-Fit Tests and the χ²-Distribution

To test null hypotheses of the type just described, we select a random sample of size n from the population of interest and classify each observation into one of the r categories. Let n_i be the number of observations in category i, i = 1, 2, . . . , r. The observed proportion in each category is $\hat{p}_i = n_i / n$; notice that $\sum_{i=1}^{r} n_i = n$. Under the null hypothesis, we expect $e_i = n p_{i0}$ observations to be in category i, where $p_{i0}$ is the hypothesized proportion for category i. If the observed counts n_i and expected counts e_i are equal, the data are fully compatible with the null hypothesis. However, because we expect variability in the sample, we also expect the observed and expected counts to differ to some extent even if the null hypothesis is true. How far apart can they be before we doubt the null hypothesis? The statistical methods that help us make that decision are called goodness-of-fit tests. A goodness-of-fit test assesses whether the observed counts in each category are compatible with the hypothesized proportions.
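As a minimal sketch of the expected-count step, the following applies e_i = n × (hypothesized proportion) to the grocery store hypotheses. The sample size n = 200 and the variable names are assumptions made only for illustration; they do not come from the study described above.

```python
# Hypothesized proportions from the grocery store example (H0).
hypothesized = {"cash": 0.35, "check": 0.05, "debit": 0.50, "credit": 0.10}

# Hypothetical sample size, chosen only for illustration.
n = 200

# Expected count in each category under H0: e_i = n * p_i0.
expected = {method: n * p for method, p in hypothesized.items()}

# Rule of thumb: every expected count should be at least 5.
all_large_enough = all(e >= 5 for e in expected.values())

print(expected)
print(all_large_enough)
```

With a much smaller sample, the expected count for checks (5% of transactions) would be the first to drop below five, so that category determines how large a sample is needed.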

The test statistic for a goodness-of-fit test is $\chi^2 = \sum_{i=1}^{r} (n_i - e_i)^2 / e_i$. Again, notice that if n_i and e_i are equal for i = 1, 2, . . . , r, then χ² = 0. As the observed and expected counts move farther apart, χ² gets larger. We reject the null hypothesis when χ² gets too large. If the null hypothesis is true and the expected counts in each category are not too small, the distribution of χ² is approximately a χ²-distribution (pronounced chi-squared distribution). Two conditions must be satisfied for this test to be appropriate. First, the sample must be randomly selected from the population of interest. Second, the expected count in each cell must not be too small. In general, each expected count should be at least five.
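The statistic can be computed by hand or with a few lines of code. The sketch below sums (observed − expected)² / expected over the categories. The observed counts are hypothetical, invented only to illustrate the arithmetic, and the function name `chi_square_statistic` is likewise an assumption.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed counts for cash, check, debit, credit.
observed = [62, 14, 98, 26]
proportions = [0.35, 0.05, 0.50, 0.10]  # hypothesized under H0

n = sum(observed)
expected = [n * p for p in proportions]

x2 = chi_square_statistic(observed, expected)
print(round(x2, 3))
```

Each term is divided by its expected count, so a gap of four transactions counts for more in the check category (expected 10) than in the debit category (expected 100).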

The χ²-distribution places probability only on nonnegative values. The distribution is skewed to the right. The amount of skew depends on its single parameter, the degrees of freedom. Figure 20.1 illustrates how the shape of the distribution changes with 4, 6, and 10 degrees of freedom. Because we reject the null hypothesis if the test statistic χ² gets too large, the p-value is the area under the curve to the right of the test statistic. Although calculators and computers can be used to find these areas precisely, we will use Table 20.1 to help us approximate them.
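For readers who want to check a table lookup, here is one way to compute the right-tail area with only the Python standard library. It is a sketch, not the method used in this lesson: it relies on the finite-series formulas for the chi-squared tail that hold when the degrees of freedom are a positive integer, and the function name `chi_square_right_tail` is hypothetical.

```python
import math

def chi_square_right_tail(x, df):
    """Area to the right of x under a chi-squared curve (integer df >= 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail at sqrt(x) plus a finite series.
    root = math.sqrt(x)
    normal_tail = math.erfc(root / math.sqrt(2))
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term = root  # first series term: sqrt(x) / 1
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)  # next term: previous * x / (2r - 1)
        total += term
    return normal_tail + 2 * normal_pdf * total

# Same test statistic, different degrees of freedom (as in Figure 20.1).
for df in (4, 6, 10):
    print(df, chi_square_right_tail(9.0, df))
```

For a fixed value of the test statistic, the right-tail area grows as the degrees of freedom increase, which matches the way the curves in a plot of 4, 6, and 10 degrees of freedom shift to the right.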

Figure 20.1

The degrees of freedom associated with a goodness-of-fit test are the number of categories minus 1 minus the number of parameters estimated. When no parameters are estimated, the degrees of freedom are r – 1, or the number of categories minus one. As in earlier tests of hypotheses, we reject the null hypothesis if the p-value is small enough. Typically, a p-value less than 0.05 is considered significant, and a p-value less than 0.01 is considered highly significant.
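The degrees-of-freedom rule and the usual significance labels can be written as two small helper functions. This is only a sketch; the function names are hypothetical, and the 0.05 and 0.01 cutoffs are the conventions stated above, not fixed rules.

```python
def goodness_of_fit_df(num_categories, num_estimated_params=0):
    """Degrees of freedom: categories - 1 - parameters estimated."""
    df = num_categories - 1 - num_estimated_params
    if df < 1:
        raise ValueError("need at least one degree of freedom")
    return df

def describe_p_value(p_value):
    """Label a p-value using the 0.05 and 0.01 conventions."""
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "significant"
    return "not significant"

# Grocery store example: four payment methods, no parameters estimated.
print(goodness_of_fit_df(4))
print(describe_p_value(0.03))
```

In the grocery store example all four proportions are fully specified by the null hypothesis, so no parameters are estimated and the test has 4 − 1 = 3 degrees of freedom.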

This type of hypothesis test differs from the others we have considered. In the earlier tests, the research hypothesis was the alternative hypothesis. Here that is not possible, because we want to show that the null hypothesis is true. However, at best, our conclusion will be that the data are consistent or compatible with the null hypothesis; failing to reject it does not prove it. If we reject the null hypothesis, we have strong evidence that it is not true.

We should make one final note. If only two categories exist, the χ²-test is an alternative to a two-sided test for a population proportion. The two often provide similar but not exactly the same results. If a one-sided test of a proportion is needed, the χ²-test as presented here cannot be used.
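One way to see how closely the two tests are related is to compare the statistics directly. The sketch below uses hypothetical sea turtle counts and assumes the z statistic is computed with the hypothesized proportion in its standard error. Under that assumption, z squared equals the goodness-of-fit statistic. Other versions of the proportion test (for example, exact binomial methods) can give somewhat different results.

```python
import math

# Hypothetical hatchling counts: 58 males and 42 females.
males, females = 58, 42
n = males + females
p0 = 0.50  # hypothesized proportion of males

# Goodness-of-fit statistic with two categories.
observed = [males, females]
expected = [n * p0, n * (1 - p0)]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# z statistic for a proportion, using the null standard error (assumption).
p_hat = males / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

print(x2, z ** 2)
```

Because the goodness-of-fit statistic squares the deviations, it cannot tell whether males were more or less common than hypothesized, which is why it corresponds only to the two-sided test.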

Figure 20.2

Table 20.1 Selected right tail areas
