Introduction to The Matched-Pairs Design for Comparing Two Treatment Means Study Guide
Thus far, we have only attempted to set confidence intervals on proportions or means based on a sample from a single treatment or population. Now we want to conduct studies that will allow us to compare the means of two treatments. First, we will think about how best to design a study. In this lesson, after introducing the basic ideas behind matched-pairs and two-group designs, we will focus on the analysis of data from the paired design. In the next lesson, we will consider the two-group design.
Two-Group versus Matched-Pairs Design
Suppose we are going to conduct a study to compare two methods of production, a standard method and a new method, that cause children's dress shoes to shine. Fifty children have been randomly selected to participate in the study. Each child will be given a new pair of dress shoes that shine. But first we need to decide how to assign the treatments (or production methods) to the children's shoes. One approach is to randomly select 25 (half) of the children and give them shoes made using the standard production process; the other half will receive shoes that were made using the new production process. Thus, each child would have a pair of shoes made by one of the two processes. A second approach is to have one shoe of each pair made with the standard process and the other shoe with the new process. Whether the right or left shoe is made with the first process would be randomly determined. In this second approach, each child would wear a dress shoe made using each process.
Regardless of which approach of assigning treatments is used, the children will wear the shoes whenever they wear dress shoes for six months. At the end of the six months, an evaluator who does not know which shoe received which treatment will score the shine quality of each shoe.
Which method of assigning treatments is better? In this case, having each child wear shoes made by both processes is better. Children differ in their activities while wearing dress shoes. Some may wear them only for special occasions, and their shoes will continue to shine no matter what process was used. Other children run and play in their dress shoes. Their shoes are less likely to continue to shine so the process could make a big difference. By having each child wear a shoe made from each process, both processes are subjected to the same environment (level of play). The difference in shine after six months is due more to the differences in the processes and not to differences in the children. This is an example of a paired experiment.
The other design, in which half of the children wore dress shoes made by the standard process and half by the new process, is a two-group design. Although this is a reasonable design, it is not the best for this study. The differences we observe in the shine of the shoes after six months are not due only to differences in processes, but also to differences in children. This would lead to more variation in the estimated mean differences, making it more difficult to determine which, if either, shine process is better.
In the planning stages of a study, it is always important to consider the best way to randomize treatments to the study units. Pairs should be formed if, by pairing, we can eliminate some of the variability in the response that would otherwise be present. In the blinking study presented first in Lesson 4, for each study participant, the number of blinks in a two-minute time period was measured during normal conversation and while playing a video game. Those participants who tended to blink less than average during normal conversation also tended to blink less than average while playing a video game. Similarly, those who tended to blink more than average during normal conversation tended to blink more than average while playing a video game. By recording the difference in the number of blinks under each treatment for each person, we could eliminate the differences among people, allowing us to more accurately measure the differences between treatments, that is, between normal conversation and video game playing.
Sometimes, it is not reasonable for both treatments to be applied to the same person. In this case, we may want to pair by some factor that will help explain the variability in the response. For example, suppose we want to compare two treatments for cholesterol. We could pair patients by their initial cholesterol levels. Those with the highest cholesterol level would be in the first pair. Those with the next highest cholesterol level would be in the next pair, and so forth. Then, within each pair, one of the patients would be randomly assigned to the first treatment, and the other would get the second treatment.
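The pairing procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient IDs, the cholesterol values, and the helper name pair_and_assign are hypothetical, and an even number of patients is assumed.

```python
import random

def pair_and_assign(levels, seed=None):
    """Pair units by sorted covariate value, then randomize treatments within pairs.

    levels: dict mapping a unit ID to its initial cholesterol level (hypothetical data).
    Returns a list of pairs, each a dict {unit_id: treatment}.
    """
    if len(levels) % 2 != 0:
        raise ValueError("an even number of units is needed to form pairs")
    rng = random.Random(seed)
    # Sort from highest to lowest so that neighbors have similar levels.
    ordered = sorted(levels, key=levels.get, reverse=True)
    pairs = []
    for k in range(0, len(ordered), 2):
        first, second = ordered[k], ordered[k + 1]
        # A separate randomization is carried out for every pair.
        if rng.random() < 0.5:
            first, second = second, first
        pairs.append({first: "treatment 1", second: "treatment 2"})
    return pairs

# Hypothetical initial cholesterol levels (mg/dL) for six patients.
patients = {"A": 262, "B": 231, "C": 248, "D": 215, "E": 270, "F": 226}
for pair in pair_and_assign(patients, seed=1):
    print(pair)
```

With these made-up values, the pairs are always (E, A), (C, B), and (F, D); only the treatment labels within each pair depend on the random draws.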
Whether or not to use pairing is an important consideration. Matched pairs should be formed only if the researcher believes that a substantial part of the variability in the response variable can be explained by the pairing, allowing differences in the treatments to be detected more readily. As an illustration, suppose we decided to pair patients in the cholesterol study on the basis of the length of their feet. The two with the longest feet would be in the first pair, the two with the next longest feet would be in the second pair, and so on. We have no reason to believe that foot length is in any way related to cholesterol level. Pairing in such a situation provides no benefit and is less effective than the two-group design for assessing whether or not the treatment means are different.
Example
A researcher wants to compare the quality of cooking roasts using two methods—open pan and bag. Four ovens are available for the study. Eight roasts of equal quality have been allocated for the study.
 Describe how to conduct the study using a matched-pairs design.
 Describe how to conduct the study using a two-group design.
 Which of the two designs would you use for this study? Explain.
Solution
 For a matched-pairs design, two roasts would be cooked in each oven, one in an open pan and the other in a bag. The location of each roast within the oven would be randomly determined.
 For a two-group design, we randomly select two of the ovens to cook a roast using the open-pan method; the other two ovens would each be used to cook a roast in a bag.
 The matched-pairs design would be the best for this study. Ovens often vary in their ability to hold temperature at a specified level. By having both treatments in each oven, differences between ovens can be accounted for in the analysis. As described, we have used half as many roasts in the two-group design. We could put two roasts in each oven and cook using the same method. This gives us information on the differences within an oven and allows us to more precisely estimate the quality of the roasts cooked in a specific oven. Cooking two roasts in the same oven does not double the number of experimental units in the study. An oven would be the experimental unit because the cooking methods were randomly assigned to the ovens.
Matched-Pairs Design
Once we have decided to conduct an experiment using matched pairs, how do we actually go about conducting the study? First, the study units need to be obtained. As we learned in Lesson 2, if the study units are randomly selected from some population, conclusions can be made for that population at the end of the study; otherwise, conclusions apply only to the units in the study. In the shoeshine study, children were randomly selected. The group from which these children were randomly selected is the population for which inferences can be made.
Next, the study units need to be paired. Individuals could be matched according to a characteristic that could explain some of the difference in the response variable. In the cholesterol study, individuals were matched by initial cholesterol level. Sometimes, both treatments can be sequentially applied to the same individual. This form of matched pairs is often very strong, but may require more time than is available for the study.
Once the pairs are formed, one treatment is randomly assigned to one unit in the pair; the other unit receives the second treatment. Notice that a separate randomization is used for each pair. For the shoeshine study, it would not be sufficient to flip a coin and randomly assign the first treatment to all right shoes and the other treatment to all left shoes. Children are right- or left-footed just as they are right- or left-handed. It is possible that one shoe, say the right shoe, tends to get the most wear because most children are right-footed. If this is the case, then the treatment assigned to all right shoes would be at a disadvantage in the study. To avoid this and other biases of which we may not even be aware, we randomly assign treatments within each pair.
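As a rough sketch of what a separate randomization for each pair might look like, the following Python snippet simulates one coin flip per child in the shoeshine study. The function name assign_shoes, the child numbering, and the seed are hypothetical choices made only for this illustration.

```python
import random

def assign_shoes(child_ids, seed=None):
    """Simulate a separate coin flip for each child to decide which shoe gets which process."""
    rng = random.Random(seed)
    assignment = {}
    for child in child_ids:
        # One independent flip per child (per pair of shoes).
        if rng.random() < 0.5:
            assignment[child] = {"left": "standard", "right": "new"}
        else:
            assignment[child] = {"left": "new", "right": "standard"}
    return assignment

# Fifty children, numbered 1 to 50 for illustration.
plan = assign_shoes(range(1, 51), seed=2006)
right_new = sum(1 for shoes in plan.values() if shoes["right"] == "new")
print(f"{right_new} of {len(plan)} children wear the new-process shoe on the right foot")
```

Because every child gets an independent flip, neither process is systematically tied to the right or the left shoe.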
Once the study is complete, we record the response variable for each unit. Let X_{1i} be the observed response from the first treatment in pair i, i = 1, 2, . . . , n, where there are n pairs. Similarly, let X_{2i} be the observed response from the second treatment in pair i, i = 1, 2, . . . , n. Then D_{i} = X_{1i} – X_{2i}, i = 1, 2, . . . , n, is the observed difference in the two treatments for the ith pair. There is a conceptual population of D_{i}'s made up of the differences in all possible pairs that could have been used in this study. This population has mean μ_{D} and standard deviation σ_{D}.
The sample mean difference in the two treatments, \bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_{i}, is an estimate of the difference in the treatment means, μ_{1} – μ_{2} = μ_{D}, the mean of the population of paired treatment differences. The sample variance of the pairwise differences provides an estimate of σ_{D}^{2} and is s_{D}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(D_{i} - \bar{D})^{2}. The sample standard deviation is s_{D} = \sqrt{s_{D}^{2}}. Notice that \bar{D} and s_{D} are, respectively, the sample mean and sample standard deviation of the differences. This would lead us to speculate that the standard error of \bar{D} is s_{D}/\sqrt{n}. This is, in fact, the case! The analysis of a paired study is based on these quantities. We will consider this further in the next two lessons.
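To make these quantities concrete, here is a minimal Python sketch that computes the differences, their sample mean, their sample standard deviation (with the n – 1 divisor), and the standard error of the mean difference. The two lists of responses are hypothetical numbers invented for the illustration, and paired_summary is not a name from the original text.

```python
import math
import statistics

def paired_summary(x1, x2):
    """Return the differences, their mean, standard deviation, and standard error."""
    if len(x1) != len(x2):
        raise ValueError("each pair needs one response per treatment")
    diffs = [a - b for a, b in zip(x1, x2)]   # D_i = X_1i - X_2i
    n = len(diffs)
    d_bar = statistics.mean(diffs)            # sample mean difference
    s_d = statistics.stdev(diffs)             # uses the (n - 1) divisor
    se = s_d / math.sqrt(n)                   # standard error of the mean difference
    return diffs, d_bar, s_d, se

# Hypothetical responses for five pairs.
x1 = [12.1, 11.8, 12.6, 11.9, 12.4]
x2 = [11.9, 11.9, 12.2, 11.6, 12.0]
diffs, d_bar, s_d, se = paired_summary(x1, x2)
print(f"mean difference = {d_bar:.4f}, s_D = {s_d:.4f}, standard error = {se:.4f}")
```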
Example
An athletic shoe company believes that they have developed a shoe that will help short-distance runners lower their times in races. They recruited 24 runners. Each runner was given a new pair of the athletic shoes. The runners were encouraged to use these shoes and their favorite pair of running shoes equally in practice for two weeks. After two weeks, the runners ran two 100-meter dashes with five hours between races. For each runner, a coin was flipped. If the coin landed heads up, the runner wore his or her favorite running shoes in the first race; otherwise, he or she wore his or her newly developed shoes. In the second race, each runner wore the pair of shoes that was not used in the first race. The times for the runners are given in Table 18.1.
 Explain why this study has a matched-pairs design. Include a clear statement describing what constitutes a pair.
 Find the difference in observations from each pair.
 Estimate the mean and standard deviation of the differences in time to run a 100-meter dash when wearing the favorite running shoes compared to the new running shoes.
 Find the standard error of the estimated mean of the differences in time to run a 100-meter dash when wearing the favorite running shoes compared to the new running shoes.
 Is the assumption reasonable that the differences are normally distributed?
 To which population may inference be drawn from this study?
Solution
 The two treatments are the favorite running shoes and the newly developed running shoes. Each treatment is applied to a runner. Thus, the favorite running shoes and the newly developed running shoes are paired by runner. A pair consists of the running times for the two treatments from a single runner. The order in which the shoes were used was randomized for each runner, a critical step in conducting the study.
 The differences in the two treatments are computed for each runner (see Table 18.2).

The estimated mean difference in the running times using the favorite shoes versus using the newly developed shoes is
\bar{D} = \frac{1}{24}(–0.06 + 0.64 + … + 0.34) = 0.2050.
The estimated variance of these differences is s_{D}^{2} ≈ 0.0941, and the estimated standard deviation is s_{D} = 0.3068.
 The standard error of the estimated mean difference in the two treatments is s_{D}/\sqrt{n} = 0.3068/\sqrt{24} ≈ 0.0626.
 Because the sample size is small, it is difficult to determine whether or not the observed differences are normally distributed. Although formal tests exist for determining normality, we will not study them here. Instead, we will rely on examining graphs to determine whether there are indications that the data may not be normal. Figures 18.1, 18.2, and 18.3 show a histogram, a dotplot, and a boxplot, respectively. The histogram looks fairly symmetric and unimodal. With only 24 observations, the shape of a population is often not fully captured in a histogram of the data. The dotplot appears to be centered at about 0.20. The values range from –0.39 to 0.84 with a higher concentration of dots in the center. From the boxplot, the data appear to be fairly symmetric without any outliers. In summary, we do not see any indication of skewness, outliers, or other features that would cause us to think that the assumption of normality is unreasonable.
 Because the runners were recruited and not randomly selected from some population, the population to which inference may be drawn is the runners in this study.
Confidence Intervals on the Difference in Two Treatment Means
If the goal is to provide an interval of reasonable values for the difference in two treatment means based on a matched-pairs design, we want to set a confidence interval on that mean difference. Two conditions must hold for inference. First, the differences (D_{i}'s) are a random sample from a population of differences. Second, the D_{i}'s are either normally distributed or the sample size is large enough (at least 30) to assume that the average of the D_{i}'s is approximately normally distributed by the Central Limit Theorem.
The methods for statistical inference using the D_{i}'s are computationally identical to those for one population; the interpretation is all that differs. Recall that the sample mean difference, \bar{D}, is an unbiased estimate of the difference in the treatment means, μ_{1} – μ_{2} = μ_{D}. The standard error of this estimate is s_{D}/\sqrt{n}, where s_{D} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(D_{i} - \bar{D})^{2}}. Using the form point estimate ± multiplier × standard error, a 100(1 – α)% confidence interval on μ_{D} = μ_{1} – μ_{2} has the form \bar{D} ± t* \frac{s_{D}}{\sqrt{n}}, where t* with (n – 1) degrees of freedom is the proper tabulated value to give 100(1 – α)% confidence.
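The interval formula can be written as a short Python function. This is a sketch, not a full implementation: the multiplier t* is assumed to be looked up in a t-table and passed in by the user, the function name paired_ci is made up, and the summary values in the usage lines are hypothetical.

```python
import math

def paired_ci(d_bar, s_d, n, t_star):
    """Confidence interval for the mean difference: d_bar +/- t_star * s_d / sqrt(n).

    t_star is the tabulated t-value with (n - 1) degrees of freedom for the
    desired confidence level; it is supplied by the caller.
    """
    if n < 2:
        raise ValueError("at least two pairs are needed")
    margin = t_star * s_d / math.sqrt(n)
    return d_bar - margin, d_bar + margin

# Hypothetical summary values for n = 5 pairs; 2.132 is the tabulated t-value
# for 4 degrees of freedom with 0.05 in the upper tail (90% confidence).
low, high = paired_ci(d_bar=0.24, s_d=0.2074, n=5, t_star=2.132)
print(f"90% confidence interval: ({low:.4f}, {high:.4f})")
```

Passing t* in as an argument keeps the sketch close to the table-based approach of this lesson; a statistics library could supply the multiplier instead.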
Example
Look again at the study comparing running shoes earlier in this lesson. Find a 90% confidence interval on the mean difference in the race times for a 100-meter dash using the athlete's favorite running shoes and the new running shoes.
Solution
The 24 runners were recruited, not randomly selected from all runners, so our inference will be restricted to these 24 runners (review the table in Lesson 2). We may believe that these runners are representative of all runners and thus attempt to broaden our scope of inference, but we need to be very careful in doing so. If we think these runners may differ from a broader population of runners, the 24 runners must be taken as the population of interest. The random assignment of treatment order is necessary for the first condition of inference to be satisfied. Note: Unless treatments are randomly assigned within a pair, we do not have a random sample of differences, which is the first condition. Although we have not formally tested whether or not the population of differences has a normal distribution, the graphs constructed in the previous example suggest that it is not an unreasonable assumption, so we will assume that the differences in running times are at least approximately normally distributed.
Based on n = 24 runners, we found \bar{D} = 0.2050 and s_{D} = 0.3068. For a 90% confidence interval, α = 0.10, so we want to put 5% of the probability in each tail. Looking in the t-table in Lesson 12, at the intersection of the row corresponding to 23 degrees of freedom and the column showing 0.05 in the upper tail, we have t* = 1.714. The confidence interval on μ_{F} – μ_{N} is 0.2050 ± 1.714(0.3068/\sqrt{24}), or 0.2050 ± 0.1073. Therefore, we estimate that, on average, the new running shoes allow runners to complete the race in 0.21 fewer seconds compared to their favorite running shoes, and we are 90% confident that this estimate is within 0.11 seconds of the true mean difference in the times to run the 100-meter dash using the new running shoes and the runners' favorite shoes.
Hypothesis Tests Concerning the Difference in Two Treatment Means
Tests of hypotheses concerning the difference in two treatment means are based on the same philosophy as the hypothesis tests discussed earlier in this book. Five steps are followed to conduct a hypothesis test.
Step 1: Specifying the Hypotheses
For a matched-pairs design, the null hypothesis, H_{o}, is that the difference in the treatment means is d_{o}; that is, H_{o}: μ_{1} – μ_{2} = d_{o}. Although it is common for d_{o} = 0 (implying that the treatment means are equal), this is not necessary; d_{o} can be any value. The alternative is that the difference is less than, greater than, or not equal to d_{o}.
Step 2: Verify Necessary Conditions for a Test and, if Satisfied, Construct the Test Statistic
The conditions for testing hypotheses about the difference in two treatment means are the same as the conditions for constructing confidence intervals. First, the differences (D_{i}'s) must be a random sample from some population of differences. Second, we must satisfy the condition of normality. That is, the D_{i}'s are normally distributed or the sample size is sufficient that the sample mean difference is approximately normal by the Central Limit Theorem.
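The test statistic has the now familiar form
t = (estimate – hypothesized value) / (standard error of the estimate).
For the paired design, this becomes
t_{T} = \frac{\bar{D} - d_{o}}{s_{D}/\sqrt{n}}.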
We know the distribution of this test statistic is at least approximately a t-distribution with (n – 1) degrees of freedom if the null hypothesis is true.
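A small Python sketch of the paired test statistic follows. It assumes the differences have already been computed; the function name paired_t_statistic and the example differences are hypothetical.

```python
import math
import statistics

def paired_t_statistic(diffs, d0=0.0):
    """Return (t, degrees of freedom) for testing that the mean difference equals d0."""
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # s_D / sqrt(n)
    if se == 0:
        raise ValueError("all differences are identical; the statistic is undefined")
    return (d_bar - d0) / se, n - 1

# Hypothetical differences from five pairs, testing d0 = 0.
t_stat, df = paired_t_statistic([0.2, -0.1, 0.4, 0.3, 0.4])
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```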
Step 3: Find the p-Value Associated with the Test Statistic
If the null hypothesis about the difference in two treatment means is true, the test statistic has either an exact or an approximate t-distribution. The distribution of the test statistic when the null hypothesis is true is called the null distribution. If the null hypothesis is not true, the test statistic is not distributed according to the null distribution and is more likely to assume a value that is "unusual" for a random observation from that distribution. The p-value is the probability of observing a value as extreme as or more extreme than the test statistic in a random draw from the null distribution, here the t-distribution with (n – 1) degrees of freedom.
How do we measure how unusual a test statistic is? It depends on the alternative hypothesis. These are summarized in Table 18.4.
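Under the usual convention, which is assumed here to match the table, the p-value is the lower-tail probability when the alternative is "less than," the upper-tail probability when it is "greater than," and twice the tail probability beyond the absolute value of the test statistic when it is "not equal to." The Python sketch below illustrates this using only the standard library: it approximates t-distribution tail areas by numerically integrating the density with Simpson's rule. The function names are hypothetical, the approach is an illustrative substitute for a t-table or statistics package, and it is not suited to very large degrees of freedom, where math.gamma overflows.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) with Simpson's rule on [0, |t|] and symmetry (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3            # P(0 < T < |t|)
    upper = 0.5 - area              # P(T > |t|)
    return upper if t >= 0 else 1 - upper

def p_value(t_stat, df, alternative):
    """p-value for alternative 'less', 'greater', or 'two-sided'."""
    if alternative == "greater":
        return t_upper_tail(t_stat, df)
    if alternative == "less":
        return 1 - t_upper_tail(t_stat, df)
    if alternative == "two-sided":
        return 2 * t_upper_tail(abs(t_stat), df)
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# Hypothetical test statistic with 10 degrees of freedom.
for alt in ("less", "greater", "two-sided"):
    print(alt, round(p_value(2.1, 10, alt), 4))
```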
Step 4: Decide Whether or Not to Reject the Null Hypothesis
Before beginning the study, the significance level of the test is set. The significance level is the largest acceptable probability of a Type I error. If the p-value is less than the significance level, the null hypothesis is rejected; otherwise, the null is not rejected.
Step 5: State Conclusions in the Context of the Study
Statistical tests of hypotheses are conducted to determine whether or not sufficient evidence exists to reject the null hypothesis in favor of the alternative hypothesis.
Example:
For the running shoe study presented earlier in this lesson, the company wants to be able to claim that the new shoe reduces the mean of the 100-meter race times for runners. Is there statistical evidence to support this claim?
Solution:
Step 1: Specifying the Hypotheses
Let the subscripts F and N represent the runners' favorite shoes and the newly developed shoes, respectively. The company wants to know whether the mean of the race times is lower for the new shoes. (Learning that the mean is greater would certainly not be a strong promotional point.) Thus, the hypotheses of interest are H_{o}: μ_{F} – μ_{N} = 0 versus H_{a}: μ_{F} – μ_{N} > 0. Notice we could have written these as H_{o}: μ_{F} = μ_{N} and H_{a}: μ_{F} > μ_{N}. The two sets of hypotheses are equivalent.
Step 2: Verify Necessary Conditions for a Test and, if Satisfied, Construct the Test Statistic
Although the runners were recruited, the order in which the treatments (newly developed shoes and favorite shoes) were observed was randomly determined. Thus, the observed differences are a random sample of all possible differences for these 24 runners, and the first condition for inference is satisfied. Often, race times are normally distributed, so the differences in race times under two treatments would be normally distributed. The sample size is not sufficient to test the assumption of normality rigorously. However, from inspection of the graphs and summary statistics earlier in this lesson, it is not unreasonable to assume that the differences in run times using the favorite and new shoes are normally distributed. Thus, the conditions for inference are assumed to be satisfied.
Because the study has a paired design, the test statistic has the now familiar form
t_{T} = \frac{\bar{D} - d_{o}}{s_{D}/\sqrt{n}} = \frac{0.2050 - 0}{0.3068/\sqrt{24}} = 3.27.
Step 3: Find the p-Value Associated with the Test Statistic
If the null hypothesis is true, the test statistic has a t-distribution with (n – 1) = 23 degrees of freedom. Given the alternative hypothesis, we want to reject the null hypothesis if t_{T} gets too large. Thus, p = P(t > 3.27). In the t-table on the line for 23 degrees of freedom, 3.27 lies between 2.807 and 3.485, corresponding to upper tail probabilities of 0.005 and 0.001, respectively; thus, 0.001 < p < 0.005.
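The table lookup just described can be mimicked in Python. The sketch below recomputes the test statistic from the summary values reported in this lesson and brackets the upper-tail p-value between neighboring entries of the t-table row for 23 degrees of freedom. The function name bracket_upper_tail is hypothetical, and the row contains only a handful of standard tabulated values rather than a complete table.

```python
import math

def bracket_upper_tail(t_stat, table_row):
    """Bracket P(T > t_stat) using (upper-tail probability, critical value) table entries."""
    if t_stat < 0:
        raise ValueError("this sketch handles only non-negative test statistics")
    low_p, high_p = 0.0, 0.5
    for prob, crit in sorted(table_row, key=lambda entry: entry[1]):
        if t_stat >= crit:
            high_p = prob   # the statistic is at or beyond this critical value
        else:
            low_p = prob    # first critical value that exceeds the statistic
            break
    return low_p, high_p

# Selected t-table entries for 23 degrees of freedom.
row_23_df = [(0.10, 1.319), (0.05, 1.714), (0.025, 2.069),
             (0.01, 2.500), (0.005, 2.807), (0.001, 3.485)]

# Summary values reported for the running shoe study.
t_stat = (0.2050 - 0) / (0.3068 / math.sqrt(24))
low_p, high_p = bracket_upper_tail(t_stat, row_23_df)
print(f"t = {t_stat:.2f}; {low_p} < p < {high_p}")
```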
Step 4: Decide Whether or Not to Reject the Null Hypothesis
The p-value observed in this study indicates that a test statistic of this magnitude is very unusual if the null hypothesis is true. Therefore, we reject the null hypothesis and decide in favor of the alternative.
Step 5: State Conclusions in the Context of the Study
The mean time for a 100-meter race was significantly less when runners wore the newly developed shoes compared to their favorite running shoes.
The Matched-Pairs Design for Comparing Two Treatment Means In Short
Designs comparing two treatments or populations have been discussed. The matched-pairs design allows one to account for known or suspected sources of variability in the design. The two-group design is useful when a reasonable basis for pairing is not available or feasible. For the paired design, confidence intervals and hypothesis tests on the difference in the treatment means were described.
Find practice problems and solutions for these concepts at The Matched-Pairs Design for Comparing Two Treatment Means Practice Exercises.
From Statistics Success in 20 Minutes A Day. Copyright © 2006 by LearningExpress, LLC. All Rights Reserved.