Analyzing Categorical Data Study Guide
Introduction to Analyzing Categorical Data
In recent lessons, the responses have been numerical. When working with two treatments or populations, the treatments or populations may be thought of as categorical explanatory variables. In this lesson, we will discuss what to do if the responses, and not just the explanatory variables, are categorical.
Univariate Categorical Data
Univariate categorical data may arise in a variety of settings. In a study of sea turtles, the number of males and females hatched during a season may be recorded. Here two categories exist, male and female, and the data consist of the number of observations falling within each category. A car arriving at an intersection may continue forward, turn left, turn right, or reverse direction (do a U-turn), resulting in four categories of a univariate response.
The proportion $p_i$ of the population within category $i$, $i = 1, 2, \ldots, r$, tends to be of primary interest in such studies. As an example, we might hypothesize that the proportions of male and female sea turtles hatched each season are the same (0.50 each). For the car's movement at a particular intersection, one hypothesis would be that 50% continue forward, 30% turn right, 15% turn left, and 5% do a U-turn. Hypothesis tests are conducted to determine whether or not the observed data are compatible with these hypotheses.
A large grocery store wants to decrease the time needed to check out a customer. One aspect of this is the method by which the customer pays. Payment by cash, check, debit card, or credit card is accepted. The store manager believes that 35% of the transactions are by cash, 5% are by check, 50% are by debit card, and 10% are by credit card. These percentages refer to the number of transactions, not the amount of a transaction. She plans a study to determine whether or not these percentages are correct.
- Give the response variable and identify the parameters associated with its categories.
- Set the hypotheses to be tested.
- The response variable of interest is a customer's method of payment.
- We may define the following:
p1 = proportion of cash transactions
p2 = proportion of check transactions
p3 = proportion of debit card transactions
p4 = proportion of credit card transactions
It is not important whether we use p1 to denote the proportion of cash transactions or the proportion of some other type of transaction. However, clearly stating what p1 represents is important. The set of hypotheses of interest is
H0: p1 = 0.35, p2 = 0.05, p3 = 0.50, and p4 = 0.10
Ha: not H0
Goodness-of-Fit Tests and the $\chi^2$-Distribution
To test null hypotheses of the type just described, we select a random sample of size $n$ from the population of interest and classify each observation into one of the $r$ categories. Let $n_i$ be the number of observations in category $i$, $i = 1, 2, \ldots, r$. The observed proportion in each category is $\hat{p}_i = n_i/n$; notice that $\sum_{i=1}^{r} \hat{p}_i = 1$. Under the null hypothesis, we expect $e_i = n p_i$ observations to be in category $i$. If the observed counts $n_i$ and expected counts $e_i$ are equal, the data are fully compatible with the null hypothesis. However, because we expect variability in the sample, we also expect the observed and expected counts to differ to some extent even if the null hypothesis is true. How far apart can they be before we doubt the null hypothesis? The statistical methods that help us make that decision are called goodness-of-fit tests. A goodness-of-fit test assesses whether the observed counts in each category are compatible with the hypothesized proportions.
The test statistic for a goodness-of-fit test is $\chi^2 = \sum_{i=1}^{r} (n_i - e_i)^2 / e_i$. Again, notice that if $n_i$ and $e_i$ are equal for $i = 1, 2, \ldots, r$, then $\chi^2 = 0$. As the observed and expected counts move farther apart, $\chi^2$ gets larger. We want to reject the null hypothesis when $\chi^2$ gets too large. If the null hypothesis is true and the expected counts in each category are not too small, the distribution of the test statistic is approximately a $\chi^2$-distribution (pronounced chi-squared distribution). Two conditions must be satisfied for this test to be appropriate. First, the sample must be randomly selected from the population of interest. Second, the expected counts in each cell must not be too small. In general, the expected counts should be at least five.
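The computation just described can be sketched in a few lines of Python. This is only an illustration: the function names `expected_counts` and `gof_statistic` are made up for this sketch, and the sea turtle counts are hypothetical numbers, not data from this guide.

```python
def expected_counts(n, proportions):
    """Expected count e_i = n * p_i for each category under H0."""
    return [n * p for p in proportions]


def gof_statistic(observed, proportions):
    """Goodness-of-fit statistic: sum over categories of (n_i - e_i)^2 / e_i."""
    n = sum(observed)
    expected = expected_counts(n, proportions)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data (not from the text): 100 hatchlings, 56 male and 44 female.
observed = [56, 44]
null_proportions = [0.50, 0.50]

expected = expected_counts(sum(observed), null_proportions)
# Second condition of the test: every expected count should be at least five.
counts_large_enough = all(e >= 5 for e in expected)
statistic = gof_statistic(observed, null_proportions)

print(expected)             # [50.0, 50.0]
print(counts_large_enough)  # True
print(round(statistic, 2))  # 1.44
```

Each category contributes (56 - 50)^2 / 50 = (44 - 50)^2 / 50 = 0.72, so the statistic is 1.44. Checking the expected counts before computing the statistic mirrors the second condition stated above.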
The $\chi^2$-distribution places probability only on nonnegative values. The distribution is skewed to the right, and the amount of skew depends on its single parameter, the degrees of freedom. Figure 20.1 illustrates how the shape of the distribution changes with the degrees of freedom, showing the curves for 4, 6, and 10 degrees of freedom. Because we want to reject the null hypothesis if the test statistic $\chi^2$ gets too large, the p-value is the area under the curve to the right of the test statistic. Although calculators and computers can be used to find these areas precisely, we will use Table 20.1 to help us approximate them.
The degrees of freedom associated with a goodness-of-fit test are the number of categories minus 1 minus the number of parameters estimated. When no parameters are estimated, the degrees of freedom are r – 1, or the number of categories minus one. As in earlier tests of hypotheses, we reject the null hypothesis if the p-value gets too small. Typically, a value less than 0.05 is significant, and a value less than 0.01 is highly significant.
This type of hypothesis test differs from the others we have considered. We are unable to have the research hypothesis as the alternative hypothesis. Here, we want to show that the null hypothesis is true. However, at best, our conclusion will be that the data are consistent or compatible with the null hypothesis. If we reject the null hypothesis, we have strong evidence that it is not true.
We should make one final note. If only two categories exist, the $\chi^2$-test is an alternative to a two-sided test for a population proportion. The two often provide similar but not exactly the same results. If a one-sided test of a proportion is needed, the $\chi^2$-test as presented here cannot be used.
Find the p-values associated with the following values of the test statistic, $\chi^2$:
- $\chi^2$ = 3.841, df = 1
- $\chi^2$ = 3.841, df = 2
- $\chi^2$ = 3.841, df = 5
- $\chi^2$ = 50, df = 20
- We look across the row with 1 degree of freedom for the values closest to 3.841. The value 3.841 is in the 0.05 column. Thus, the p-value is 0.05.
- This time, we look across the row for 2 degrees of freedom. The values 3.794 and 4.605 are in the 0.15 and 0.10 columns, respectively. This means that the p-value is between 0.10 and 0.15.
- Looking across the row for 5 degrees of freedom, the smallest value is 6.064 in the 0.30 column. Therefore, the p-value is greater than 0.30.
- In the row for 20 degrees of freedom, the largest value is 47.498 in the 0.0005 column. Thus, the p-value is less than 0.0005.
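The table lookups above only bracket each p-value. As a cross-check, the area to the right of the test statistic can be computed directly. The sketch below uses only Python's standard library and the standard finite-sum expressions for the upper tail of a chi-squared distribution with whole-number degrees of freedom. The function name `chi2_upper_tail` is an assumption of this sketch and is not taken from the guide or from any library.

```python
import math


def chi2_upper_tail(x, df):
    """Area to the right of x under a chi-squared curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus
    # sqrt(2x/pi) * exp(-x/2) * sum_{k=1}^{(df-1)/2} x^(k-1) / (1*3*...*(2k-1))
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


# The four cases from the exercise above.
for x, df in [(3.841, 1), (3.841, 2), (3.841, 5), (50, 20)]:
    print(f"chi-squared = {x}, df = {df}: p-value = {chi2_upper_tail(x, df):.4f}")
```

The computed areas should agree with the table-based answers: very close to 0.05 for 1 degree of freedom, between 0.10 and 0.15 for 2, above 0.30 for 5, and below 0.0005 for 20.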
The store manager randomly selected 200 transactions from those made during the past week and recorded the method by which each customer paid. Of the 200 purchases, 62 were paid by cash, 8 were paid by check, 112 were paid by debit card, and 18 were paid by credit card.
- Find the expected counts within each payment category.
- Determine the value of the test statistic.
- Find the p-value associated with the test statistic.
- Determine whether or not to reject the null hypothesis and state the conclusion in the context of the problem.
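The observed and expected counts are shown in Table 20.2. Each expected count is the sample size times the hypothesized proportion: 200(0.35) = 70 cash, 200(0.05) = 10 check, 200(0.50) = 100 debit card, and 200(0.10) = 20 credit card transactions.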
- Organizing the information as in the table in problem 1 generally simplifies computations. Recall the test statistic is $\chi^2 = \sum_{i=1}^{r} (n_i - e_i)^2 / e_i$. Each term in this summation is found in the $(n_i - e_i)^2 / e_i$ column. The column total is $\chi^2$ = 2.9543.
- The degrees of freedom associated with the test are 4 – 1 = 3 because there are four categories and no parameters were estimated. The smallest value in the row associated with 3 degrees of freedom is 3.665, which is in the column labeled 0.30. Thus, the p-value is greater than 0.30.
- The p-value is greater than 0.30 and would also be larger than any commonly used significance level. Therefore, we would not reject the null hypothesis that the percentages of cash, check, debit card, and credit card transactions are as the store manager has hypothesized.
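The four steps of this example can be reproduced with a short script. The observed counts and hypothesized proportions come from the problem statement. The variable names are illustrative, and the p-value line uses a closed-form expression for the upper-tail area that applies only when there are 3 degrees of freedom, so it should not be reused for other tests.

```python
import math

categories = ["cash", "check", "debit card", "credit card"]
observed = [62, 8, 112, 18]                  # counts among the 200 sampled transactions
null_proportions = [0.35, 0.05, 0.50, 0.10]  # the manager's hypothesized proportions

n = sum(observed)
expected = [n * p for p in null_proportions]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)
df = len(categories) - 1  # four categories, no parameters estimated

# Upper-tail area of the chi-squared distribution; this formula holds only for df = 3.
p_value = (math.erfc(math.sqrt(statistic / 2))
           + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))

for name, o, e, c in zip(categories, observed, expected, contributions):
    print(f"{name:12s} observed={o:4d} expected={e:6.1f} contribution={c:.4f}")
print(f"chi-squared = {statistic:.4f}, df = {df}, p-value = {p_value:.3f}")
```

The statistic matches the column total of 2.9543, and the computed p-value is near 0.40, consistent with the table-based conclusion that it exceeds 0.30.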
Tests of Homogeneity
Suppose we want to know whether the proportions within each of the r categories of a response variable are the same for each of c populations. To investigate this question, independent samples are taken from each of the c populations. The data can be arranged in an r × c table, called a contingency table. The general form of a contingency table is shown here in Table 20.3.
The first subscript on a count in Table 20.3 corresponds to the row and the second to the column. Row, column, and overall totals are also in the table. A period in a subscript marks the index that was summed over. For example, $n_{1\cdot}$ is the total of the first row, $n_{\cdot 2}$ is the total of the second column, and $n_{\cdot\cdot}$ is the overall total.
The null hypothesis is that the proportion in each category is the same for all populations, i.e., $H_0$: $p_{1j} = p_1, p_{2j} = p_2, \ldots, p_{rj} = p_r$ for $j = 1, 2, \ldots, c$. The alternative hypothesis is $H_a$: not $H_0$. To estimate the proportion in the $i$th category common to all populations, we use the sample proportion in the $i$th category across all populations, $\hat{p}_i = n_{i\cdot} / n_{\cdot\cdot}$. The expected count in the $i$th category from the $j$th population is the size of the sample from that population times the sample proportion in the $i$th category; that is, $e_{ij} = n_{\cdot j}\,\hat{p}_i = n_{i\cdot}\,n_{\cdot j} / n_{\cdot\cdot}$, or the row total times the column total divided by the overall total. The test statistic is $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (n_{ij} - e_{ij})^2 / e_{ij}$. If each population was sampled randomly and all expected counts are large enough (≥ 5), the test statistic has an approximate $\chi^2$-distribution with (r – 1)(c – 1) degrees of freedom provided the null hypothesis is true. The p-value is the probability that a randomly selected value from the $\chi^2$-distribution exceeds the test statistic. As with other hypothesis tests, we reject the null hypothesis if the p-value is too small.
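The expected counts, test statistic, and degrees of freedom for a test of homogeneity can be sketched as follows. The function name `homogeneity_test` is made up for this illustration, and the counts in `table` are hypothetical; they are not the values of Table 20.4. The last line uses the fact that, with exactly 2 degrees of freedom, the upper-tail area of the chi-squared distribution is exp(-x/2).

```python
import math


def homogeneity_test(table):
    """table[i][j] = count in response category i (row) for population j (column)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # e_ij = (row total) * (column total) / (overall total)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (n_ij - e_ij) ** 2 / e_ij
        for obs_row, exp_row in zip(table, expected)
        for n_ij, e_ij in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df


# Hypothetical counts, NOT the values of Table 20.4: three response
# categories (rows) and two populations sampled with 100 each (columns).
table = [
    [30, 20],
    [50, 40],
    [20, 40],
]
expected, statistic, df = homogeneity_test(table)
counts_large_enough = all(e >= 5 for row in expected for e in row)

print(expected)               # [[25.0, 25.0], [45.0, 45.0], [30.0, 30.0]]
print(round(statistic, 4), df)
# Upper-tail area; this shortcut is valid only because df = 2 here.
print(math.exp(-statistic / 2))
```

For these made-up counts the statistic is 88/9, about 9.78, with (3 – 1)(2 – 1) = 2 degrees of freedom, so the p-value is well below 0.05 and the null hypothesis of equal proportions would be rejected.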
A researcher wanted to determine whether the age distribution is the same for renters and for home owners in a large city. He selected a random sample of size 100 from all renters and recorded their ages in categories. Similarly, he selected a random sample of size 100 from all home owners and recorded their ages in categories. The data are presented in Table 20.4.
- State the null and alternative hypotheses of interest to this researcher.
- Find the expected counts.
- Verify that the conditions for the test are satisfied and, if so, find the value of the test statistic.
- Find the p-value.
- Decide whether or not to reject the null hypothesis.
- State the conclusions in the context of the problem.