Analyzing Categorical Data Study Guide (page 2)

Updated on Oct 5, 2011


Find the p-values associated with the following values of the test statistic, χ²:

  1. χ² = 3.841, df = 1
  2. χ² = 3.841, df = 2
  3. χ² = 3.841, df = 5
  4. χ² = 50, df = 20


  1. We look across the row with 1 degree of freedom for the values closest to 3.841. The value 3.841 is in the 0.05 column. Thus, the p-value is 0.05.
  2. This time, we look across the row for 2 degrees of freedom. The values 3.794 and 4.605 are in the 0.15 and 0.10 columns, respectively. This means that the p-value is between 0.10 and 0.15.
  3. Looking across the row for 5 degrees of freedom, the smallest value is 6.064 in the 0.30 column. Because 3.841 is less than 6.064, the p-value is greater than 0.30.
  4. In the row for 20 degrees of freedom, the largest value is 47.498 in the 0.0005 column. Because 50 is greater than 47.498, the p-value is less than 0.0005.
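
The table lookups above only bracket each p-value. As a cross-check, the following is a minimal sketch that computes the upper-tail probability directly, using the standard closed-form series for integer degrees of freedom. The function name `chi2_sf` is an illustrative choice, not something from this guide; where SciPy is installed, `scipy.stats.chi2.sf(x, df)` is the usual way to get the same quantity.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: a normal-tail part plus a finite series in sqrt(x)
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0  # first series term: sqrt(x) / 1
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)  # next term: multiply by x / (next odd number)
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

# The four cases from the problem above
for stat, df in [(3.841, 1), (3.841, 2), (3.841, 5), (50, 20)]:
    print(f"chi-square = {stat}, df = {df}: p-value = {chi2_sf(stat, df):.5f}")
```

Each computed p-value should fall inside the range obtained from the table: about 0.05, between 0.10 and 0.15, greater than 0.30, and less than 0.0005, respectively.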


The store manager randomly selected 200 transactions from those made during the past week and recorded the method by which each customer paid. Of the 200 purchases, 62 were paid by cash, 8 by check, 112 by debit card, and 18 by credit card.

  1. Find the expected counts within each payment category.
  2. Determine the value of the test statistic.
  3. Find the p-value associated with the test statistic.
  4. Determine whether or not to reject the null hypothesis and state the conclusion in the context of the problem.


  1. The observed and expected counts are shown in Table 20.2.

    Table 20.2 Observed and expected counts

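  2. Organizing the information as in the table in problem 1 generally simplifies computations. Recall that the test statistic is $\chi^2 = \sum_{i=1}^{k} (n_i - E_i)^2 / E_i$, where $n_i$ is the observed count and $E_i$ the expected count in the ith of the $k$ categories. Each term in this summation is found in a column of the table. The column total is χ² = 2.9543.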
  3. The degrees of freedom associated with the test are 4 – 1 = 3 because there are four categories and no parameters were estimated. The smallest value in the row associated with 3 degrees of freedom is 3.665, which is in the column labeled 0.30. Thus, the p-value is greater than 0.30.
  4. The p-value is greater than 0.30 and would also be larger than any commonly used significance level. Therefore, we would not reject the null hypothesis that the percentages of cash, check, debit card, and credit card transactions are as the store manager has hypothesized.
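
The four steps above can be sketched in a few lines of code. Note that the manager's hypothesized percentages are not restated in this section; the proportions below (35% cash, 5% check, 50% debit card, 10% credit card) are an assumption, chosen because they reproduce the reported test statistic of 2.9543. The critical value 3.665 is the table entry quoted in the solution for 3 degrees of freedom and the 0.30 column.

```python
# Observed counts from the problem statement
observed = {"cash": 62, "check": 8, "debit card": 112, "credit card": 18}

# ASSUMPTION: hypothesized proportions are not given in this section;
# these values are inferred because they reproduce chi-square = 2.9543.
hypothesized = {"cash": 0.35, "check": 0.05, "debit card": 0.50, "credit card": 0.10}

n = sum(observed.values())  # 200 transactions

# Step 1: expected count = sample size times hypothesized proportion
expected = {k: n * p for k, p in hypothesized.items()}

# Step 2: one (observed - expected)^2 / expected term per category, then sum
terms = {k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed}
chi_sq = sum(terms.values())

# Step 3: degrees of freedom = categories - 1 (no parameters estimated)
df = len(observed) - 1

for k in observed:
    print(f"{k:12s} observed={observed[k]:4d} expected={expected[k]:6.1f} term={terms[k]:.4f}")
print(f"chi-square = {chi_sq:.4f}, df = {df}")

# Step 4: compare with the table value for df = 3 in the 0.30 column
table_value_030 = 3.665
print("p-value > 0.30" if chi_sq < table_value_030 else "p-value <= 0.30")
```

Keeping the per-category terms in their own dictionary mirrors the tabular layout recommended in the solution, so each term can be checked by hand.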

Tests of Homogeneity

Suppose we want to know whether the proportions within each of the r categories of a response variable are the same for each of c populations. To investigate this question, independent samples are taken from each of the c populations. The data can be arranged in an r × c table, called a contingency table, with one row per category and one column per population. The general form of a contingency table is shown here in Table 20.3.

Table 20.3 Contingency table

The first subscript on a count in Table 20.3 corresponds to the row and the second to the column. Row, column, and overall totals are also in the table. A period in place of a subscript shows which index was summed over. For example, $n_{1\cdot}$ is the total of the first row, $n_{\cdot 2}$ is the total of the second column, and $n_{\cdot\cdot}$ is the overall total.

The null hypothesis is that the proportion in each category is the same for all populations, i.e., $H_0: p_{1j} = p_1,\ p_{2j} = p_2,\ \ldots,\ p_{rj} = p_r$ for $j = 1, 2, \ldots, c$. The alternative hypothesis is $H_a$: not $H_0$. To estimate the proportion in the ith category common to all populations, we use the sample proportion in the ith category across all populations, $\hat{p}_i = n_{i\cdot} / n_{\cdot\cdot}$, where $n_{i\cdot}$ is the ith row total and $n_{\cdot\cdot}$ is the overall total. The expected count in the ith category from the jth population is the size of the sample from that population times the sample proportion in the ith category; that is, $E_{ij} = n_{\cdot j}\,\hat{p}_i = n_{i\cdot}\,n_{\cdot j} / n_{\cdot\cdot}$, or the row total times the column total divided by the overall total. The test statistic is $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (n_{ij} - E_{ij})^2 / E_{ij}$. If each population was sampled randomly and all expected counts are large enough (≥ 5), the test statistic has an approximate χ² distribution with (r – 1)(c – 1) degrees of freedom, provided the null hypothesis is true. The p-value is the probability that a randomly selected value from that χ² distribution exceeds the test statistic. As with other hypothesis tests, we reject the null hypothesis if the p-value is too small.
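
The expected counts and test statistic described above can be sketched as follows. The function name `homogeneity_statistic` and the 3 × 2 table of counts are hypothetical, made up purely for illustration; they are not the data from any table in this guide. Rows are response categories and columns are populations.

```python
def homogeneity_statistic(table):
    """Return (expected counts, chi-square statistic, df) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]        # one total per category
    col_totals = [sum(col) for col in zip(*table)]  # one total per population (sample size)
    n = sum(row_totals)                             # overall total

    # Expected count = row total * column total / overall total
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell
    chi_sq = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return expected, chi_sq, df

# HYPOTHETICAL data: 3 categories (rows) sampled from 2 populations (columns)
counts = [[30, 20],
          [50, 40],
          [20, 40]]

expected, chi_sq, df = homogeneity_statistic(counts)

# Condition check: every expected count should be at least 5
conditions_ok = all(e >= 5 for row in expected for e in row)
print("expected counts:", expected)
print(f"chi-square = {chi_sq:.4f}, df = {df}, conditions satisfied: {conditions_ok}")
```

The p-value is then found by looking up the statistic in the χ² table row for the computed degrees of freedom, as in the earlier problems.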


A researcher wanted to determine whether the age distribution is the same for renters and for home owners in a large city. He selected a random sample of size 100 from all renters and recorded their ages in categories. Similarly, he selected a random sample of size 100 from all home owners and recorded their ages in categories. The data are presented in Table 20.4.

  1. State the null and alternative hypotheses of interest to this researcher.
  2. Find the expected counts.
  3. Verify that the conditions for the test are satisfied and, if so, find the value of the test statistic.
  4. Find the p-value.
  5. Decide whether or not to reject the null hypothesis.
  6. State the conclusions in the context of the problem.

Table 20.4 Age of renters and home owners
