Introduction to Analyzing Categorical Data
In recent lessons, the responses have been numerical. When working with two treatments or populations, the treatments or populations may be thought of as categorical explanatory variables. In this lesson, we will discuss what to do if the responses, and not just the explanatory variables, are categorical.
Univariate Categorical Data
Univariate categorical data may arise in a variety of settings. In a study of sea turtles, the number of males and females hatched during a season may be recorded. Here two categories exist, male and female, and the data consist of the number of observations falling within each category. A car arriving at an intersection may continue forward, turn left, turn right, or reverse direction (make a U-turn), resulting in four categories of a univariate response.
The proportion p_{i} of the population within category i, i = 1, 2, . . . , r, tends to be of primary interest in such studies. As an example, we might hypothesize that the proportions of male and female sea turtles hatched each season are the same (0.50 each). For the car's movement at a particular intersection, one hypothesis would be that 50% continue forward, 30% turn right, 15% turn left, and 5% make a U-turn. Hypothesis tests are conducted to determine whether or not the observed data are compatible with these hypotheses.
Example
A large grocery store wants to decrease the time needed to check out a customer. One aspect of this is the method by which the customer pays. Payment by cash, check, debit card, or credit card is accepted. The store manager believes that 35% of the transactions are by cash, 5% are by check, 50% are by debit card, and 10% are by credit card. Here the percentages are referring to the number of transactions and not the amount of a transaction. She plans a study to determine whether or not these percentages are correct.
 Give the response variable and identify the parameters associated with its categories.
 Set the hypotheses to be tested.
Solution
 The response variable of interest is a customer's method of payment.
 We may define the following:
p_{1} = proportion of cash transactions
p_{2} = proportion of check transactions
p_{3} = proportion of debit card transactions
p_{4} = proportion of credit card transactions
It is not important whether we use p_{1} to denote the proportion of cash transactions or the proportion of some other type of transaction. However, clearly stating what p_{1} represents is important. The set of hypotheses of interest is
H_{0}: p_{1} = 0.35, p_{2} = 0.05, p_{3} = 0.50, and p_{4} = 0.10
H_{a}: not H_{0}
Goodness-of-Fit Tests and the χ^{2} Distribution
To test null hypotheses of the type just described, we select a random sample from the population of interest and classify each observation into one of the r categories. Let n_{i} be the number of observations in category i, i = 1, 2, . . . , r. The observed proportion in each category is n_{i}/n, where n is the sample size; notice that n_{1} + n_{2} + . . . + n_{r} = n. Under the null hypothesis, we expect e_{i} = n p_{i} observations to be in category i, where p_{i} is the hypothesized proportion for that category. If the observed n_{i} and expected e_{i} counts are equal, the data are fully compatible with the null hypothesis. However, because we expect variability in the sample, we also expect the observed and expected counts to differ to some extent even if the null hypothesis is true. How far apart can they be before we doubt the null hypothesis? The statistical methods that help us make that decision are called goodness-of-fit tests. A goodness-of-fit test assesses whether the observed counts in each category are compatible with the hypothesized proportions.
The test statistic for a goodness-of-fit test is χ^{2} = Σ (n_{i} – e_{i})^{2} / e_{i}, where the sum is taken over the categories i = 1, 2, . . . , r. Again, notice that if n_{i} and e_{i} are equal for i = 1, 2, . . . , r, then χ^{2} = 0. As the observed and expected counts move farther apart, χ^{2} gets larger. We want to reject the null hypothesis when χ^{2} gets too large. If the null hypothesis is true and the expected counts in each category are not too small, the distribution of the test statistic is approximately a χ^{2} distribution (pronounced chi-squared distribution). Two conditions must be satisfied for this test to be appropriate. First, the sample must be randomly selected from the population of interest. Second, the expected counts in each cell must not be too small. In general, the expected counts should be at least five.
The χ^{2} distribution places all of its area on nonnegative values and is skewed to the right. The amount of skew depends on its parameter, the degrees of freedom. Figure 20.1 illustrates how the shape of the distribution changes with the degrees of freedom, showing the curves for 4, 6, and 10 degrees of freedom. Because we want to reject the null hypothesis if the test statistic χ^{2} gets too large, the p-value is the area under the curve to the right of the test statistic. Although calculators and computers can be used to find these areas precisely, we will use Table 20.1 to help us approximate them.
The degrees of freedom associated with a goodness-of-fit test are the number of categories minus 1 minus the number of parameters estimated. When no parameters are estimated, the degrees of freedom are r – 1, or the number of categories minus one. As in earlier tests of hypotheses, we reject the null hypothesis if the p-value gets too small. Typically, a p-value less than 0.05 is considered significant, and one less than 0.01 is considered highly significant.
This type of hypothesis test differs from the others we have considered. We are unable to have the research hypothesis as the alternative hypothesis. Here, we want to show that the null hypothesis is true. However, at best, our conclusion will be that the data are consistent or compatible with the null hypothesis. If we reject the null hypothesis, we have strong evidence that it is not true.
We should make one final note. If only two categories exist, the χ^{2} test is an alternative to a two-sided test for a population proportion. The two often provide similar but not exactly the same results. If a one-sided test of a proportion is needed, the χ^{2} test as presented here cannot be used.
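The close relationship between the two tests can be made concrete with a short derivation. The sketch below assumes two categories with hypothesized proportions p_0 and 1 – p_0, an observed count n_1 in the first category out of n observations, and a z statistic whose standard error uses p_0. These symbols are introduced here for illustration only; the lesson does not restate which form of the proportion test it has in mind.

```latex
\begin{aligned}
\chi^{2} &= \frac{(n_1 - n p_0)^2}{n p_0}
          + \frac{\bigl((n - n_1) - n(1 - p_0)\bigr)^2}{n(1 - p_0)} \\
         &= (n_1 - n p_0)^2 \left[\frac{1}{n p_0} + \frac{1}{n(1 - p_0)}\right] \\
         &= \frac{(n_1 - n p_0)^2}{n p_0 (1 - p_0)}
          = \left(\frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)/n}}\right)^{2}
          = z^{2}, \qquad \hat{p} = n_1 / n .
\end{aligned}
```

Under that assumption the two statistics carry the same information. Squaring also discards the sign of the difference, which is one way to see why a one-sided alternative cannot be tested this way. Proportion tests that use a different standard error or an exact binomial calculation would agree only approximately.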
Example
Find the p-values associated with the following values of the test statistic, χ^{2}:
 χ^{2} = 3.841, df = 1
 χ^{2} = 3.841, df = 2
 χ^{2} = 3.841, df = 5
 χ^{2} = 50, df = 20
Solution
 We look across the row with 1 degree of freedom for the values closest to 3.841. The value 3.841 is in the 0.05 column. Thus, the p-value is 0.05.
 This time, we look across the row for 2 degrees of freedom. The values 3.794 and 4.605 are in the 0.15 and 0.10 columns, respectively. This means that the p-value is between 0.10 and 0.15.
 Looking across the row for 5 degrees of freedom, the smallest value is 6.064 in the 0.30 column. Therefore, the p-value is greater than 0.30.
 In the row for 20 degrees of freedom, the largest value is 47.498 in the 0.0005 column. Thus, the p-value is less than 0.0005.
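The table lookups above can be checked by computing the right-tail areas directly. The following is a minimal sketch using only the Python standard library; the function name chi2_upper_tail is our own, and it relies on a standard recurrence for the upper incomplete gamma function rather than on anything given in the lesson.

```python
import math


def chi2_upper_tail(x, df):
    """Area to the right of x under a chi-squared curve with df degrees of freedom.

    Assumes x >= 0 and df is a positive integer. Starts from a closed form
    (df = 2 for even df, df = 1 for odd df) and adds one term for every
    additional two degrees of freedom.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-z)              # exact tail area for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(z))   # exact tail area for df = 1
    while a < df / 2.0:
        area += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return area


# The four cases from the example above.
for stat, df in [(3.841, 1), (3.841, 2), (3.841, 5), (50.0, 20)]:
    print(f"chi-squared = {stat}, df = {df}: p-value = {chi2_upper_tail(stat, df):.4f}")
```

The printed areas should fall in the same ranges that the table gives: about 0.05, between 0.10 and 0.15, above 0.30, and below 0.0005.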
Example
The store manager randomly selected 200 transactions from those made during the past week and recorded the method by which each customer paid. Of the 200 purchases, 62 were paid by cash, 8 were paid by check, 112 were paid by debit card, and 18 were paid by credit card.
 Find the expected counts within each payment category.
 Determine the value of the test statistic.
 Find the pvalue associated with the test statistic.
 Determine whether or not to reject the null hypothesis and state the conclusion in the context of the problem.
Solution

The observed and expected counts are shown in Table 20.2.
 Organizing the information as in the table in problem 1 generally simplifies computations. Recall that the test statistic is χ^{2} = Σ (n_{i} – e_{i})^{2} / e_{i}. Each term in this summation is found in the (n_{i} – e_{i})^{2} / e_{i} column. The column total is χ^{2} = 2.9543.
 The degrees of freedom associated with the test are 4 – 1 = 3 because there are four categories and no parameters were estimated. The smallest value in the row associated with 3 degrees of freedom is 3.665, which is in the column labeled 0.30. Thus, the p-value is greater than 0.30.
 The p-value is greater than 0.30 and would also be larger than any commonly used significance level. Therefore, we would not reject the null hypothesis that the percentages of cash, check, debit card, and credit card transactions are as the store manager has hypothesized.
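The arithmetic in this solution can be reproduced in a few lines. The sketch below uses the observed counts and hypothesized proportions from the example; the variable names are our own, and the value 3.665 is the table entry quoted above for 3 degrees of freedom and the 0.30 column.

```python
# Observed counts and hypothesized proportions from the grocery store example.
observed = {"cash": 62, "check": 8, "debit card": 112, "credit card": 18}
hypothesized = {"cash": 0.35, "check": 0.05, "debit card": 0.50, "credit card": 0.10}

n = sum(observed.values())                       # total number of transactions

# Expected count in each category under the null hypothesis: e_i = n * p_i.
expected = {k: n * p for k, p in hypothesized.items()}

# One term (n_i - e_i)^2 / e_i per category; their sum is the test statistic.
terms = {k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed}
chi_sq = sum(terms.values())
df = len(observed) - 1                           # no parameters were estimated

for k in observed:
    print(f"{k:12s} observed = {observed[k]:4d}  expected = {expected[k]:6.1f}  term = {terms[k]:.4f}")
print(f"chi-squared = {chi_sq:.4f} with {df} degrees of freedom")

# Table value quoted in the solution for df = 3, 0.30 column.
print("p-value greater than 0.30:", chi_sq < 3.665)
```

Every expected count here is at least five, so the condition for using the χ^{2} approximation is met before the statistic is compared with the table.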
Tests of Homogeneity
Suppose we want to know whether the proportions within each of the r categories of a response variable are the same for each of c populations. To investigate this question, independent samples are taken from each of the c populations. The data can be arranged in an r × c table, called a contingency table. The general form of a contingency table is shown here in Table 20.3.
The first subscript on a count in Table 20.3 corresponds to the row and the second to the column. Row, column, and overall totals are also in the table. A period in place of a subscript shows that the counts were summed over that subscript. For example, n_{1.} is the total of the first row. The total of the second column is n_{.2}, and n_{..} is the overall total.
The null hypothesis is that the proportion in each category is the same for all populations, i.e., H_{0}: p_{1j} = p_{1}, p_{2j} = p_{2}, . . . , p_{rj} = p_{r}, for j = 1, 2, . . . , c. The alternative hypothesis is H_{a}: not H_{0}. To estimate the proportion in the ith category common to all populations, we use the sample proportion in the ith category across all populations, n_{i.}/n_{..}. The expected count in the ith category from the jth population is the size of the sample from that population times the sample proportion in the ith category; that is, e_{ij} = n_{.j}(n_{i.}/n_{..}) = n_{i.}n_{.j}/n_{..}, or the row total times the column total divided by the overall total. The test statistic is χ^{2} = Σ_{i} Σ_{j} (n_{ij} – e_{ij})^{2} / e_{ij}, where the sum is taken over all r × c cells of the table. If each population was sampled randomly and all expected counts are large enough (≥ 5), the test statistic has an approximate χ^{2} distribution with (r – 1)(c – 1) degrees of freedom provided the null hypothesis is true. The p-value is the probability that a randomly selected value from the χ^{2} distribution exceeds the test statistic. As with other hypothesis tests, we reject the null hypothesis if the p-value is too small.
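The rule "row total times column total divided by overall total" is easy to express in code. The following is a minimal sketch; the function name homogeneity_test and the 3 × 2 table of counts are hypothetical and chosen only to illustrate the calculation, not taken from any table in this lesson.

```python
def homogeneity_test(table):
    """Return (expected counts, chi-squared statistic, degrees of freedom).

    `table` is a list of rows; rows are response categories and columns are
    the populations that were sampled.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected count: row total * column total / overall total.
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]

    # Sum (observed - expected)^2 / expected over every cell.
    chi_sq = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi_sq, df


# Hypothetical counts: 3 response categories (rows), 2 populations (columns).
counts = [[30, 20],
          [50, 40],
          [20, 40]]

expected, chi_sq, df = homogeneity_test(counts)
print("expected counts:", expected)
print(f"chi-squared = {chi_sq:.4f} with {df} degrees of freedom")
print("all expected counts at least 5:", min(min(row) for row in expected) >= 5)
```

Because the expected counts depend only on the row and column totals, the same calculation applies whether the columns were sampled separately or the whole table came from a single sample.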
Example
A researcher wanted to determine whether the age distribution is the same for renters and for home owners in a large city. He selected a random sample of size 100 from all renters and recorded their ages in categories. Similarly, he selected a random sample of size 100 from all home owners and recorded their ages in categories. The data are presented in Table 20.4.
 State the null and alternative hypotheses of interest to this researcher.
 Find the expected counts.
 Verify that the conditions for the test are satisfied and, if so, find the value of the test statistic.
 Find the pvalue.
 Decide whether or not to reject the null hypothesis.
 State the conclusions in the context of the problem.
Solution
 H_{0}: p_{1j} = p_{1}, p_{2j} = p_{2}, ... , p_{rj} = p_{r}, j = R, H, where R stands for renters and H represents home owners. The alternative hypothesis is H_{a}: not H_{0}.

The expected counts are in Table 20.5.
Notice that the expected counts do not have to be, and are often not, whole numbers. They should not be rounded to whole numbers.
 A random sample was selected from each population. The smallest expected count was 12, which is greater than the minimum of 5 needed for the test statistic to have an approximate χ^{2} distribution. Thus, the conditions for the test are satisfied. The value of the test statistic is
 The degrees of freedom associated with the test statistic are (r – 1)(c – 1) = 4 × 1 = 4. The largest number in the row with 4 degrees of freedom is 19.997, which is in the 0.0005 column. Therefore, the p-value is less than 0.0005.
 Because the p-value is so very small, the evidence against the null hypothesis is very strong. Thus, we reject it in favor of the alternative hypothesis.
 Evidence exists that the proportions in each age category are not the same for home owners and renters.
Tests of Independence
Sometimes, data on two categorical variables can be collected in one sample. For example, instead of sampling renters and home owners separately in the previous example, we could have taken one sample and asked each study participant whether he or she is a renter or a home owner and which age category he or she is in. The data could have been presented in a table as in the example. The only difference is the manner in which the data were collected. The null hypothesis is that the two variables are independent of one another, and the alternative is that they are not. The expected counts are then computed as in the tests for homogeneity. The conditions that must be satisfied are that the sample was randomly selected and that the expected counts are large enough, at least five in each cell. If these are satisfied, the test statistic, χ^{2} = Σ_{i} Σ_{j} (n_{ij} – e_{ij})^{2} / e_{ij}, where n_{ij} and e_{ij} are the observed and expected counts in row i and column j, has an approximate χ^{2} distribution with (r – 1)(c – 1) degrees of freedom if the null hypothesis is true. The p-value is the probability that a randomly selected observation from the χ^{2} distribution is greater than the test statistic. If this value is less than the specified significance level, the null hypothesis is rejected; otherwise, the null hypothesis is not rejected.
Example
A company wanted to assess the success of its television advertising campaign for a new product. They hired a pollster to find out whether those who saw the ad were more likely to have purchased the new product than those who had not. The pollster took a sample of 250 adults in the viewing area where the ad aired. Each study participant was asked whether he or she had seen the ad and whether he or she had purchased the new product. The results are presented in Table 20.7.
 State the null and alternative hypotheses of interest to the company.
 Find the expected counts.
 Verify the conditions for the test and, if satisfied, find the test statistic.
 Find the pvalue.
 Decide whether or not to reject the null hypothesis.
 State your conclusion. Be sure it is in the context of the problem.
Solution
 H_{0}: Having viewed the ad is independent of whether or not a person purchased the product.
H_{a}: Having viewed the ad is not independent of whether or not a person purchased the product.

The expected counts are in Table 20.8.
 The sample was randomly selected from the population that had the opportunity to see the ad, and all expected counts exceed five. Thus the conditions for the test are satisfied. The test statistic is then
 If the null hypothesis is true, the test statistic has an approximate χ^{2} distribution with (r – 1)(c – 1) = 1 × 1 = 1 degree of freedom. The smallest value in the row of the χ^{2} table corresponding to one degree of freedom is 1.074 in the 0.30 column. Thus, the p-value is greater than 0.30.
 The p-value is large, indicating that data such as what was observed are not at all unusual if the null hypothesis is true. Therefore, we would not reject the null hypothesis.
 There is not sufficient evidence to reject the hypothesis that seeing the ad is independent of whether or not the new product was purchased. This would be frustrating information for the company's management: the data provide no significant evidence that people who saw the television ad were more likely to purchase the product. The company may be looking for a new advertising firm!
Analyzing Categorical Data In Short
Categorical data lead to counts within each category. χ^{2} tests are suitable for testing hypotheses about these data. When working with univariate categorical data, one can test whether the population proportions in each category are some set of specified values. If the same univariate categorical variable is observed in independent samples from two or more populations, one can test whether the proportions in each category are the same for all populations. If two different categorical variables are observed in one sample, the test concerns whether or not the two variables are independent.
From Statistics Success in 20 Minutes A Day. Copyright © 2006 by LearningExpress, LLC. All Rights Reserved.