The Matched-Pairs Design for Comparing Two Treatment Means Study Guide
Introduction
Thus far, we have only attempted to set confidence intervals on proportions or means based on a sample from a single treatment or population. Now we want to conduct studies that will allow us to compare the means of two treatments. First, we will think about how best to design a study. In this lesson, after introducing the basic ideas behind matched pairs and two-group designs, we will focus on the analysis of data from the paired design. In the next lesson, we will consider the two-group design.
Two-Group versus Matched-Pairs Design
Suppose we are going to conduct a study to compare two methods of production, a standard method and a new method, that cause children's dress shoes to shine. Fifty children have been randomly selected to participate in the study. Each child will be given a new pair of dress shoes that shine. But first we need to decide how to assign the treatments (or production methods) to the children's shoes. One approach is to randomly select 25 (half) of the children and give them shoes made using the standard production process; the other half will receive shoes that were made using the new production process. Thus, each child would have a pair of shoes made by one of the two processes. A second approach is to have one shoe of each pair made with the standard process and the other shoe with the new process. Whether the right or left shoe is made with the first process would be randomly determined. In this second approach, each child would wear a dress shoe made using each process.
Regardless of which approach of assigning treatments is used, the children will wear the shoes whenever they wear dress shoes for six months. At the end of the six months, an evaluator who does not know which shoe received which treatment will score the shine quality of each shoe.
Which method of assigning treatments is better? In this case, having each child wear shoes made by both processes is better. Children differ in their activities while wearing dress shoes. Some may wear them only for special occasions, and their shoes will continue to shine no matter what process was used. Other children run and play in their dress shoes. Their shoes are less likely to continue to shine so the process could make a big difference. By having each child wear a shoe made from each process, both processes are subjected to the same environment (level of play). The difference in shine after six months is due more to the differences in the processes and not to differences in the children. This is an example of a paired experiment.
The other design in which half of the children wore dress shoes made by the standard process and half by the new process is a two-group design. Although this is a reasonable design, it is not the best for this study. The differences we observe in the shine of the shoes after six months are not due only to differences in processes, but also due to differences in children. This would lead to more variation in the estimated mean differences, making it more difficult to determine which, if either, shine process is better.
In the planning stages of a study, it is always important to consider the best way to randomize treatments to the study units. Pairs should be formed if, by pairing, we can eliminate some of the variability in the response that would otherwise be present. In the blinking study presented first in Lesson 4, for each study participant, the number of blinks in a two-minute time period was measured during normal conversation and while playing a video game. Those participants who tended to blink less than average during normal conversation also tended to blink less than average while playing a video game. Similarly, those who tended to blink more than average during normal conversation tended to blink more than average while playing a video game. By recording the difference in the number of blinks under each treatment for each person, we could eliminate the differences among people, allowing us to more accurately measure the differences between treatments, that is, between normal conversation and video playing.
Sometimes, it is not reasonable for both treatments to be applied to the same person. In this case, we may want to pair by some factor that will help explain the variability in the response. For example, suppose we want to compare two treatments for cholesterol. We could pair patients by their initial cholesterol levels. Those with the highest cholesterol level would be in the first pair. Those with the next highest cholesterol level would be in the next pair, and so forth. Then, within each pair, one of the patients would be randomly assigned to the first treatment, and the other would get the second treatment.
Whether or not to use pairing is an important consideration. Matched pairs should be formed only if the researcher believes that pairing explains a significant portion of the variability in the response variable, allowing differences between the treatments to be detected more readily. As an illustration, suppose we decided to pair patients in the cholesterol study on the basis of the length of their feet. The two with the longest feet would be in the first pair, the two with the next longest feet would be in the second pair, and so on. We have no reason to believe that foot length is in any way related to cholesterol level. Pairing in such a situation provides no benefit and is less effective for assessing whether or not the treatment means differ than the two-group design.
A researcher wants to compare the quality of cooking roasts using two methods—open pan and bag. Four ovens are available for the study. Eight roasts of equal quality have been allocated for the study.
- Describe how to conduct the study using a matched-pairs design.
- Describe how to conduct the study using a two-group design.
- Which of the two designs would you use for this study? Explain.
- For a matched-pairs design, two roasts would be cooked in each oven, one in an open pan and the other in a bag. The location of each roast within the oven would be randomly determined.
- For a two-group design, we randomly select two of the ovens to cook a roast using the open-pan method; the other two ovens would each be used to cook a roast in a bag.
- The matched-pairs design would be the best for this study. Ovens often vary in their ability to hold temperature at a specified level. By having both treatments in each oven, differences between ovens can be accounted for in the analysis. As described, we have used half as many roasts in the two-group design. We could put two roasts in each oven and cook using the same method. This gives us information on the differences within an oven and allows us to more precisely estimate the quality of the roasts cooked in a specific oven. Cooking two roasts in the same oven does not double the number of experimental units in the study. An oven would be the experimental unit because the cooking methods were randomly assigned to the ovens.
Once we have decided to conduct an experiment using matched pairs, how do we actually go about conducting the study? First, the study units need to be obtained. As we learned in Lesson 2, if the study units are randomly selected from some population, conclusions can be made for that population at the end of the study; otherwise, conclusions apply only to the units in the study. In the shoe-shine study, children were randomly selected. The group from which these children were randomly selected is the population for which inferences can be made.
Next, the study units need to be paired. Individuals could be matched according to a characteristic that could explain some of the difference in the response variable. In the cholesterol study, individuals were matched by initial cholesterol level. Sometimes, both treatments can be sequentially applied to the same individual. This form of matched pairs is often very strong, but may require more time than is available for the study.
Once the pairs are formed, one treatment is randomly assigned to one unit in the pair; the other unit receives the second treatment. Notice that a separate randomization is used for each pair. For the shoe-shine study, it would not be sufficient to flip a coin and randomly assign the first treatment to all right shoes and the other treatment to all left shoes. Children are right or left-footed just as they are right- or left-handed. It is possible that one shoe, say the right shoe, tends to get the most wear because most children are right footed. If this is the case, then the treatment assigned to all right shoes would be at a disadvantage in the study. To avoid this and other biases of which we may not even be aware, we randomly assign treatments within each pair.
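The per-pair randomization described above can be sketched in a few lines of Python. This is a hypothetical illustration for the shoe-shine study (the pair count and labels are made up); the key point is that a separate coin flip decides the assignment within each pair, rather than one flip for the whole study.

```python
import random

# A separate randomization for each pair, as described for the shoe-shine
# study. The five children here are illustrative only.
random.seed(18)  # fixed seed only so the sketch is reproducible

assignments = {}
for child in [f"child {i}" for i in range(1, 6)]:
    standard_shoe = random.choice(["right", "left"])  # coin flip per pair
    new_shoe = "left" if standard_shoe == "right" else "right"
    assignments[child] = {"standard": standard_shoe, "new": new_shoe}

for child, shoes in assignments.items():
    print(f"{child}: standard process -> {shoes['standard']} shoe, "
          f"new process -> {shoes['new']} shoe")
```

Because each pair gets its own flip, a systematic disadvantage (such as most children being right-footed) cannot fall entirely on one treatment.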
Once the study is complete, we record the response variable for each unit. Let X1i be the observed response from the first treatment in pair i, i = 1, 2, . . ., n, where there are n pairs. Similarly, let X2i be the observed response from the second treatment in pair i, i = 1, 2, . . ., n. Then Di = X1i – X2i, i = 1, 2, . . ., n, is the observed difference in the two treatments for the ith pair. There is a conceptual population of Di's consisting of the differences in all possible pairs that could have been used in this study. This population has mean μD and standard deviation σD.
The sample mean difference in the two treatments, D̄ = (D1 + D2 + ⋯ + Dn)/n, is an estimate of the difference in the treatment means, μ1 – μ2 = μD, the mean of the population of paired treatment differences. The sample variance of the pairwise differences provides an estimate of σD2 and is sD2 = Σ(Di – D̄)2/(n – 1). The sample standard deviation is sD = √sD2. Notice that D̄ and sD are, respectively, the sample mean and sample standard deviation of the differences. This would lead us to speculate that the standard error of D̄ is sD/√n. This is, in fact, the case! The analysis of a paired study is based on these quantities. We will consider this further in the next two lessons.
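These paired-sample quantities are easy to compute directly. The sketch below uses made-up responses for five hypothetical pairs (the values are illustrative, not from any study in this lesson) and computes D̄, sD2, sD, and the standard error sD/√n:

```python
import math

# Hypothetical paired responses for 5 pairs (illustrative values only)
x1 = [12.1, 10.4, 11.8, 13.0, 12.5]  # treatment 1, pairs 1..5
x2 = [11.5, 10.1, 11.9, 12.2, 12.0]  # treatment 2, same pairs

d = [a - b for a, b in zip(x1, x2)]  # pairwise differences Di
n = len(d)
d_bar = sum(d) / n                                    # sample mean difference
s2_d = sum((di - d_bar) ** 2 for di in d) / (n - 1)   # sample variance
s_d = math.sqrt(s2_d)                                 # sample standard deviation
se = s_d / math.sqrt(n)                               # standard error of d_bar
print(f"mean diff = {d_bar:.4f}, sd = {s_d:.4f}, se = {se:.4f}")
```

Note that after the differences are formed, the computation is identical to the one-sample case; that is exactly why the paired analysis reuses one-sample methods.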
An athletic shoe company believes that they have developed a shoe that will help short-distance runners lower their times in races. They recruited 24 runners. Each runner was given a new pair of the athletic shoes. The runners were encouraged to use these shoes and their favorite pair of running shoes equally in practice for two weeks. After two weeks, the runners ran two 100-meter dashes with five hours between races. For each runner, a coin was flipped. If the coin landed heads up, the runner wore his or her favorite running shoes in the first race; otherwise, he or she wore his or her newly developed shoes. In the second race, each runner wore the pair of shoes that was not used in the first race. The times for the runners are given in Table 18.1.
- Explain why this study has a matched-pairs design. Include a clear statement describing what constitutes a pair.
- Find the difference in observations from each pair.
- Estimate the mean and standard deviation of the differences in time to run a 100-meter dash when wearing the favorite running shoes compared to the new running shoes.
- Find the standard error of the estimated mean of the differences in time to run a 100-meter dash when wearing the favorite running shoes compared to the new running shoes.
- Is the assumption reasonable that the differences are normally distributed?
- To which population may inference be drawn from this study?
- The two treatments are the favorite running shoes and the newly developed running shoes. Each treatment is applied to a runner. Thus, the favorite running shoes and the newly developed running shoes are paired by runner. A pair consists of the running times for the two treatments from a single runner. The order in which the shoes were used was randomized for each runner, a critical step in conducting the study.
- The differences in the two treatments are computed for each runner (see Table 18.2).
The estimated mean difference in the running times using the favorite shoes versus using the newly developed shoes is
D̄ = (–0.06 + 0.64 + ⋯ + 0.34)/24 = 0.2050.
The estimated variance of these differences is sD2 ≈ 0.0941, and the estimated standard deviation is sD = 0.3068.
- The standard error of the estimated mean of the differences in the two treatments is sD/√n = 0.3068/√24 ≈ 0.0626.
- Because the sample size is small, it is difficult to determine whether or not the observed differences are normally distributed. Although formal tests exist for determining normality, we will not study them here. Instead, we will rely on examining graphs to determine whether there are indications that the data may not be normal. Figures 18.1, 18.2, and 18.3 show a histogram, a dotplot, and a boxplot, respectively. The histogram looks fairly symmetric and unimodal. With only 24 observations, the shape of a population is often not fully captured in a histogram of the data. The dotplot appears to be centered at about 0.20. The values range from –0.39 to 0.84 with a higher concentration of dots in the center. From the boxplot, the data appear to be fairly symmetric without any outliers. In summary, we do not see any indication of skewness, outliers, or other features that would cause us to think that the assumption of normality is unreasonable.
- Because the runners were recruited and not randomly selected from some population, the population to which inference may be drawn is the runners in this study.
Confidence Intervals on the Difference in Two Treatment Means
If the goal is to provide an interval of reasonable values for the difference in two treatment means based on a matched-pairs design, we want to set a confidence interval on the mean difference. The conditions for inference are, first, that the differences (Di's) are a random sample from a population of differences. Second, the Di's are either normally distributed or the sample size is large enough (at least 30) to assume that the average of the Di's is approximately normally distributed by the Central Limit Theorem.
The methods for statistical inference using the Di's are computationally identical to those for one population; the interpretation is all that differs. Recall that D̄ is an unbiased estimate of the difference in the treatment means, μ1 – μ2 = μD. The standard error of this estimate is sD/√n, where sD2 = Σ(Di – D̄)2/(n – 1). Using the form point estimate ± multiplier × standard error, a 100(1 – α)% confidence interval on μD = μ1 – μ2 has the form D̄ ± t*(sD/√n), where t* with (n – 1) degrees of freedom is the proper tabulated value to give 100(1 – α)% confidence.
Look again at the study comparing running shoes earlier in this lesson. Find a 90% confidence interval on the mean difference in the race times for a 100-meter dash using the athlete's favorite running shoes and the new running shoes.
The 24 runners were recruited, not randomly selected from all runners, so our inference will be restricted to these 24 runners (review the table in Lesson 2). We may believe that these runners are representative of all runners and thus attempt to broaden our scope of inference, but we need to be very careful in doing so. If we think these runners may differ from a broader population of runners, the 24 runners must be taken as the population of interest. The random assignment of treatment order is necessary for the first condition of inference to be satisfied. Note: Unless treatments are randomly assigned within a pair, we do not have a random sample of differences, which is the first condition. Although we have not formally tested whether or not the population of differences has a normal distribution, the graphs constructed in the previous example suggest that it is not an unreasonable assumption, so we will assume that the differences in running times are at least approximately normally distributed.
Based on n = 24 runners, we found D̄ = 0.2050 and sD = 0.3068. For a 90% confidence interval, α = 0.10, so we want to put 5% of the probability in each tail. Looking in the t-table in Lesson 12, at the intersection of the row corresponding to 23 degrees of freedom and the column showing 0.05 in the upper tail, we have t* = 1.714. The confidence interval on μF – μN is 0.2050 ± 1.714(0.3068/√24), or 0.2050 ± 0.1073. Therefore, we estimate that, on average, the new running shoes allow runners to complete the race in 0.21 fewer seconds compared to their favorite running shoes, and we are 90% confident that this estimate is within 0.11 seconds of the true mean difference in the times to run the 100-meter dash using the new running shoes and the runners' favorite shoes.
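The interval arithmetic above can be checked with a short Python sketch. The inputs are the summary values reported in this lesson (n = 24, D̄ = 0.2050, sD = 0.3068) and the t-table value for 23 degrees of freedom; only the arithmetic is automated, the table lookup is not.

```python
import math

# Summary values from the running-shoe example
n = 24
d_bar = 0.2050   # sample mean difference (favorite - new), seconds
s_d = 0.3068     # sample standard deviation of the differences
t_star = 1.714   # t-table value: 23 df, 0.05 in the upper tail (90% CI)

se = s_d / math.sqrt(n)   # standard error of the mean difference
margin = t_star * se      # margin of error
lower, upper = d_bar - margin, d_bar + margin
print(f"90% CI for the mean difference: ({lower:.4f}, {upper:.4f})")
```

Since the entire interval lies above zero, the data are consistent with the new shoes producing faster mean times.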
Hypothesis Tests Concerning the Difference in Two Treatment Means
Tests of hypotheses concerning the difference in two treatment means are based on the same philosophy as the hypothesis tests discussed earlier in this book. Five steps are followed to conduct a hypothesis test.
Step 1: Specifying the Hypotheses
For a matched-pairs design, the null hypothesis, Ho, is that the difference in the treatment means is do; that is, Ho: μ1 – μ2 = do. Although it is common for do = 0 (implying that the treatment means are equal), this is not necessary; do can be any value. The alternative is that the difference is less than, greater than, or not equal to do.
Step 2: Verify Necessary Conditions for a Test and, if Satisfied, Construct the Test Statistic
The conditions for testing hypotheses about the difference in two treatment means are the same as the conditions for constructing confidence intervals. First, the differences (Di's) must be a random sample from some population of differences. Second, we must satisfy the condition of normality; that is, the Di's are normally distributed, or the sample size is sufficient that the sample mean difference is approximately normal by the Central Limit Theorem.
The test statistic has the now familiar form

t = (point estimate – hypothesized value)/(standard error of the estimate).

For the paired design, this becomes

t = (D̄ – do)/(sD/√n).

We know the distribution of this test statistic is at least approximately a t-distribution with (n – 1) degrees of freedom if the null hypothesis is true.
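As a numeric illustration, the paired t statistic can be computed for the running-shoe study with a few lines of Python. The values of D̄ and sD come from earlier in this lesson; because the standard library has no t-distribution, the comparison against tabulated cutoffs stands in for reading the t-table.

```python
import math

# Running-shoe study values; test H0: mu_D = 0 against Ha: mu_D > 0
n, d_bar, s_d = 24, 0.2050, 0.3068
d0 = 0.0  # hypothesized difference under H0

se = s_d / math.sqrt(n)        # standard error of the mean difference
t_stat = (d_bar - d0) / se     # paired t statistic, n - 1 = 23 df
print(f"t = {t_stat:.3f} with {n - 1} degrees of freedom")
# The statistic exceeds even the 23-df cutoff t_0.005 = 2.807, so the
# one-sided p-value is below 0.005.
```

In practice a statistics package would report the exact p-value; the point here is only that the computation reduces to one-sample arithmetic on the differences.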
Step 3: Find the p-Value Associated with the Test Statistic
If the null hypothesis about the difference in two treatment means is true, the test statistic has either an exact or an approximate t-distribution. The distribution of the test statistic when the null hypothesis is true is called the null distribution. If the null hypothesis is not true, the test statistic is not distributed according to the null distribution and is more likely to assume a value that is "unusual" for a random observation from that distribution. The p-value is the probability of observing a value as extreme as or more extreme than the test statistic for a random observation from the null distribution.
How do we measure how unusual a test statistic is? It depends on the alternative hypothesis. These are summarized in Table 18.4.
Step 4: Decide Whether or Not to Reject the Null Hypothesis
Before beginning the study, the significance level of the test is set. The significance level is the largest acceptable probability of a type I error. If the p-value is less than the significance level, the null hypothesis is rejected; otherwise, the null is not rejected.
Step 5: State Conclusions in the Context of the Study
Statistical tests of hypotheses are conducted to determine whether or not sufficient evidence exists to reject the null hypothesis in favor of the alternative hypothesis.
For the running shoe study presented in the previous lesson, the company wants to be able to claim that the new shoe reduces the mean of the 100-meter race times for runners. Is there statistical evidence to support this claim?
Step 1: Specifying the Hypotheses
Let the subscripts F and N represent the runners' favorite shoes and the newly developed shoes, respectively. The company wants to know whether the mean of the race times is lower for the new shoes. (Learning that the mean is greater would certainly not be a strong promotional point.) Thus, the hypotheses of interest are Ho: μF – μN = 0 versus Ha: μF – μN > 0. Notice we could have written these as Ho: μF = μN and Ha: μF > μN. The two sets of hypotheses are equivalent.