The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR...

25
Statistical Science 2009, Vol. 24, No. 1, 29–53 DOI: 10.1214/08-STS274 © Institute of Mathematical Statistics, 2009 The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation Kosuke Imai, Gary King and Clayton Nall Abstract. A basic feature of many field experiments is that investigators are only able to randomize clusters of individuals—such as households, com- munities, firms, medical practices, schools or classrooms—even when the individual is the unit of interest. To recoup the resulting efficiency loss, some studies pair similar clusters and randomize treatment within pairs. However, many other studies avoid pairing, in part because of claims in the litera- ture, echoed by clinical trials standards organizations, that this matched-pair, cluster-randomization design has serious problems. We argue that all such claims are unfounded. We also prove that the estimator recommended for this design in the literature is unbiased only in situations when matching is unnecessary; its standard error is also invalid. To overcome this problem without modeling assumptions, we develop a simple design-based estimator with much improved statistical properties. We also propose a model-based approach that includes some of the benefits of our design-based estimator as well as the estimator in the literature. Our methods also address individual- level noncompliance, which is common in applications but not allowed for in most existing methods. We show that from the perspective of bias, efficiency, power, robustness or research costs, and in large or small samples, pairing should be used in cluster-randomized experiments whenever feasible; failing to do so is equivalent to discarding a considerable fraction of one’s data. We develop these techniques in the context of a randomized evaluation we are conducting of the Mexican Universal Health Insurance Program. Key words and phrases: Causal inference, community intervention trials, field experiments, group-randomized trials, place-randomized trials, health policy, matched-pair design, noncompliance, power. Kosuke Imai is Assistant Professor, Department of Politics, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]; URL: http://imai.princeton. edu). Gary King is Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., Cambridge, Massachusetts 02138, USA (e-mail: King@Harvard. edu; URL: http://GKing.Harvard.edu). Clayton Nall is Ph.D. Candidate, Department of Government, Institute for Quantitative Social Science, Harvard University, 1737 1. INTRODUCTION For political, ethical or administrative reasons, re- searchers conducting field experiments are often un- able to randomize treatment assignment to individuals and so instead randomize treatments to clusters of indi- viduals (Murray, 1998; Donner and Klar, 2000a; Rau- denbush, Martinez and Spybrook, 2007). For example, 19 (68%) of the 28 field experiments we found pub- Cambridge St., Cambridge, Massachusetts 02138, USA (e-mail: [email protected]. edu). 29

Transcript of The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR...

Page 1: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

Statistical Science2009, Vol. 24, No. 1, 29–53DOI: 10.1214/08-STS274© Institute of Mathematical Statistics, 2009

The Essential Role of Pair Matching inCluster-Randomized Experiments, withApplication to the Mexican UniversalHealth Insurance EvaluationKosuke Imai, Gary King and Clayton Nall

Abstract. A basic feature of many field experiments is that investigators areonly able to randomize clusters of individuals—such as households, com-munities, firms, medical practices, schools or classrooms—even when theindividual is the unit of interest. To recoup the resulting efficiency loss, somestudies pair similar clusters and randomize treatment within pairs. However,many other studies avoid pairing, in part because of claims in the litera-ture, echoed by clinical trials standards organizations, that this matched-pair,cluster-randomization design has serious problems. We argue that all suchclaims are unfounded. We also prove that the estimator recommended forthis design in the literature is unbiased only in situations when matchingis unnecessary; its standard error is also invalid. To overcome this problemwithout modeling assumptions, we develop a simple design-based estimatorwith much improved statistical properties. We also propose a model-basedapproach that includes some of the benefits of our design-based estimator aswell as the estimator in the literature. Our methods also address individual-level noncompliance, which is common in applications but not allowed for inmost existing methods. We show that from the perspective of bias, efficiency,power, robustness or research costs, and in large or small samples, pairingshould be used in cluster-randomized experiments whenever feasible; failingto do so is equivalent to discarding a considerable fraction of one’s data. Wedevelop these techniques in the context of a randomized evaluation we areconducting of the Mexican Universal Health Insurance Program.

Key words and phrases: Causal inference, community intervention trials,field experiments, group-randomized trials, place-randomized trials, healthpolicy, matched-pair design, noncompliance, power.

Kosuke Imai is Assistant Professor, Department of Politics,Princeton University, Princeton, New Jersey 08544, USA(e-mail: [email protected]; URL: http://imai.princeton.edu). Gary King is Albert J. Weatherhead III UniversityProfessor, Institute for Quantitative Social Science,Harvard University, 1737 Cambridge St., Cambridge,Massachusetts 02138, USA (e-mail: [email protected]; URL: http://GKing.Harvard.edu). Clayton Nall isPh.D. Candidate, Department of Government, Institute forQuantitative Social Science, Harvard University, 1737

1. INTRODUCTION

For political, ethical or administrative reasons, re-searchers conducting field experiments are often un-able to randomize treatment assignment to individualsand so instead randomize treatments to clusters of indi-viduals (Murray, 1998; Donner and Klar, 2000a; Rau-denbush, Martinez and Spybrook, 2007). For example,19 (68%) of the 28 field experiments we found pub-

Cambridge St., Cambridge, Massachusetts 02138, USA(e-mail: [email protected]. edu).

29

Page 2: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

30 K. IMAI, G. KING AND C. NALL

lished in major political science journals since 2000randomized households, precincts, city-blocks or vil-lages even though individual voters were the inferen-tial target (e.g., Arceneaux, 2005); in public health andmedicine, where “the number of trials reporting a clus-ter design has risen exponentially since 1997” (Camp-bell, 2004), randomization occurs at the level of healthclinics, physicians or other administrative and geo-graphical units even though individuals are the unitsof interest (e.g., Sommer et al., 1986; Varnell et al.,2004); and numerous education researchers random-ize schools, classrooms or teachers instead of students(e.g., Angrist and Lavy, 2002).

Since efficiency drops when randomizing clusters ofindividuals instead of individuals themselves (Corn-field, 1978), many scholars attempt to recoup some ofthis lost efficiency by pairing clusters, based on thesimilarity of available background characteristics, be-fore randomly assigning one cluster within each pairto receive the treatment assignment (e.g., Ball and Bo-gatz, 1972; Gail et al., 1992; Hill, Rubin and Thomas,1999). Since matching prior to random treatment as-signment can greatly improve the efficiency of causaleffect estimation (Bloom, 1978; Greevy et al., 2004),and matching in pairs can be substantially more ef-ficient than matching in larger blocks, matched-pair,cluster-randomization (MPCR) would appear to be anattractive design for field experiments (Imai, King andStuart, 2008). [See also Moulton (2004).] The design isespecially useful for public policy experiments since,when used properly, it can be robust to interventionsby politicians and others that have ruined many policyevaluations, such as when office-holders arrange pro-gram benefits for constituents who live in control groupclusters (King et al., 2007).

Unfortunately, despite its apparent benefits and com-mon usage, this experimental design has an uncertainscientific status. Researchers in this area and formalstatements from clinical trial standards organizations(e.g., Donner and Klar, 2004; Feng et al., 2001; Med-ical Research Council, 2002) claim that certain “ana-lytic limitations” make MPCR, or at least the existingmethods available to analyze data from this design, in-appropriate. These claimed limitations include “the re-striction of prediction models to cluster-level baselinerisk factors (e.g., cluster size), the inability to test forhomogeneity of . . . [causal effects across clusters], anddifficulties in estimating the intracluster correlation co-efficient, a measure of similarity among cluster mem-bers” (Klar and Donner, 1997, page 1754). In addition,

in a widely cited article, Martin et al. (1993) claim thatin small samples, pairing can reduce statistical power.

We show that each of the claims regarding analyti-cal limitations of MPCR is incorrect. We also demon-strate that the power calculations leading Martin et al.(1993) to recommend against MPCR in small samplesis dependent on an assumption of equal cluster sizesthat vitiates one major advantage of pair matching; weshow in real data that the assumption does not applyand without it pair matching on cluster sizes and pre-treatment variables that affect the outcome improvesboth efficiency and power a great deal, even in samplesas small as three pairs. In fact, because the efficiencygain of MPCR depends on the correlation of clustermeans weighted by cluster size, the advantage can bemuch larger than the unweighted correlations that havebeen studied seem to indicate, even when cluster sizesare independent of the outcome.

Finally, there exists no published formal evaluationof the statistical properties of the estimator for MPCRdata most commonly recommended in the methodolog-ical literature. By defining the quantities of interestseparately from the methods used to estimate them,and identifying a model that gives rise to the mostcommonly used estimator, we show that this approachdepends on assumptions, such as the homogeneity oftreatment effects across all clusters, that apply bestwhen matching is not needed to begin with. The com-monly used variance estimator is also biased. We thenoffer new simple design-based estimators and theirvariances. We also propose an alternative model-basedapproach that includes the benefits of our design-basedestimator, which has little or no bias, and the estima-tor in the literature, which under certain circumstanceshas lower variance. Finally, we extend our methods tosituations with individual-level noncompliance, whichis a basic feature of many MPCR experiments but forwhich most prior methods have not been adapted. Withthe results and new methods offered here, ambiguityabout what to do in cluster randomized experimentsvanishes: pair matching should be used whenever fea-sible.

2. EVALUATION OF THE MEXICAN UNIVERSALHEALTH INSURANCE PROGRAM

As a running example of MPCR, we introduce arandomized evaluation we are conducting of SeguroPopular de Salud (SPS) in Mexico. A major domesticinitiative of the Vicente Fox presidency, the programseeks “to provide social protection in health to the 50

Page 3: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31

million uninsured Mexicans” (Frenk et al., 2003, page1667), constituting about half the population (King etal., 2007). The government intends to spend an addi-tional one percent of GDP on health compared to 2002once the program is fully introduced.

SPS permitted a cluster randomized (CR) study to bebuilt into the program rollout. Under national legisla-tion, Mexican states must apply to the federal govern-ment for funds both to publicize the program and fundits operations. The federal government approves theserequests only when local health clinics are brought upto federal standards. When an area is approved to be-gin program enrollment, families who affiliate are ex-pected to receive free preventative and regular medicalcare, pharmaceuticals and medical procedures. How-ever, because local health clinics and hospitals maytake years to meet federal standards, and also becauseof budget restrictions, a staged rollout was necessaryand also allowed us the chance to run this randomizedstudy. Finally, since SPS allows individuals to decidefor themselves whether to enroll (if necessary, by trav-eling from unenrolled to enrolled areas), it was possi-ble to adopt a clustered encouragement design (Fran-gakis et al., 2002), thereby permitting estimation ofindividual-level program effects. (We focus on the ITTeffect until Section 6.)

The MPCR design was implemented in geographicareas created for the project which we call “health clus-ters,” defined as the geographic catchment area of a lo-cal hospital or clinic. The country is tiled by 12,824such clusters, and negotiations with the Mexican gov-ernment produced more than 100 for which randomassignment was acceptable. The chosen clusters werepaired based on census demographics, poverty, educa-tion, and health infrastructure. Within each pair, one“treatment” cluster was randomly chosen for early pro-gram rollout, receiving funds to upgrade their healthclinics and encourage individual enrollment. The “con-trol” cluster in each pair had its rollout set for some fu-ture time. (Individuals could still obtain SPS benefitsby traveling to SPS-approved clusters, but did not re-ceive encouragement or resources to do so.) For designdetails, see King et al. (2007).

The primary outcome of interest at this stage was thelevel of out-of-pocket health expenditures, while sec-ondary outcomes of interest included medical utiliza-tion, health self-assessment and self-reported healthbehaviors. Outcomes were measured in a baseline andfollowup panel survey of more than 32,000 house-holds. Our examples draw upon 67 of these variablesmeasured at the 10-month followup.

3. MATCHED-PAIR, CLUSTER-RANDOMIZEDEXPERIMENTS

We now introduce MPCR experiments, including thetheories of inference commonly applied (Section 3.1),the formal definitions, notation and assumptions usedin (Section 3.2), and the quantities of interest typicallysought (Section 3.3).

3.1 Theories of Inference

We describe the model-based and permutation-basedtheories of statistical inference that have been appliedto MPCR data and then the design-based theory fromwhich our work is derived.

First, model-based inference applied to MPCR typ-ically uses generalized mixed-effects models, general-ized estimating equations or multi-level models (Fenget al., 2001). Most of these work only if the modelingassumptions are correct; they also rely on asymptoticapproximations. Model-based and model-assisted ap-proaches have proved to be powerful in other areas, es-pecially in survey research and missing data where it isoften necessary, but they violate the purpose and spiritof experimental work which goes to great lengths andexpense to avoid these types of assumptions.

Fisher’s (1935) permutation-based theory of infer-ence, which constructs exact nonparametric hypothe-sis tests based only on the random treatment assign-ment, has also been applied to MPCR. Although per-mutation inference in principle requires no models orapproximations, in practice applications typically haverequired additional assumptions such as constant treat-ment effects across clusters or some kind of (e.g.,Monte Carlo, large sample) approximations. The exist-ing applications include Gail et al. (1996) and Braunand Feng (2001), which combine permutation infer-ence with parametric modeling, and Small, Ten Haveand Rosenbaum (2008) which considers quantile ef-fects using different and more modest assumptions.

In contrast, we use Neyman’s (1923) theory of in-ference, which is well known but has not before beenattempted for MPCR. Like Fisher’s permutation-basedtheory, Neyman’s approach is also design (or “ran-domization”) based and nonparametric, but it naturallyavoids the constant treatment effect assumption andcan provide valid inferences about both sample andpopulation average treatment effects without modelingassumptions (Rubin, 1991). The estimators we deriveare also simple to understand and easier to compute(requiring only weighted means and no numerical op-timization, or simulation).

Page 4: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

32 K. IMAI, G. KING AND C. NALL

3.2 Formal Design Definition, Notation andAssumptions

Consider a MPCR experiment where 2m clustersare paired, based on a known function of the clustercharacteristics, prior to the randomization of a binarytreatment. We assume the j th cluster in the kth paircontains njk units, where j = 1,2 and k = 1, . . . ,m,and thus the total number of units is equal to n =∑m

k=1(n1k + n2k).Under MPCR, simple randomization of an indicator

variable, Zk for k = 1,2, . . . ,m, is conducted indepen-dently across the m pairs. For a pair with Zk = 1, thefirst cluster within pair k is treated (in our case, as-signed encouragement to affiliate with SPS), and thesecond cluster is assigned control. In contrast, for a pairwith Zk = 0, the first cluster is the control whereas thesecond is treated. Thus, using Tjk for the treatment in-dicator for the j th cluster in the kth pair, then T1k = Zk

and T2k = 1−Zk . In the context of the SPS evaluation,we consider an intention-to-treat (ITT) analysis to es-timate the causal effects of encouragement to affiliatewith the program (see Section 6 on the estimation ofcausal effects of the actual affiliation).

We denote Yijk(Tjk) as the potential outcomes underthe treatment (Tjk = 1) and control (Tjk = 0) condi-tions for the ith unit in the j th cluster of the kth pair(Holland, 1986; Maldonado and Greenland, 2002). Theobserved outcome variable is Yijk = TjkYijk(1) + (1 −Tjk)Yijk(0). Finally, the order of clusters within eachpair is randomized so that the population distributionof (Yi1k(1), Yi1k(0)) equals (Yi2k(1), Yi2k(0)) (thoughthis equality may not hold in sample).

A defining feature of CR experiments is that the po-tential outcomes for the ith unit in the j th cluster ofthe kth pair are a function of the cluster-level random-ized treatment variable, Tjk , rather than its unit-leveltreatment counterpart. Similarly, the unit-level causaleffect, Yijk(1) − Yijk(0), is the difference between twounit-level potential outcomes that are the functions ofthe cluster-level treatment variable. Thus, in CR exper-iments, the usual assumption of no interference (Cox,1958; Rubin, 1990) applies only at the cluster level.Moreover, in MPCR, assuming no interference onlybetween pairs of clusters is sufficient. This advantageof MPCR designs can be substantial if contagion or so-cial influence is present at the individual level, where,for example, individuals may affect the behavior ofneighbors or friends, but such interference does not ex-ist across clusters or pairs of clusters. Thus, we onlyassume:

ASSUMPTION 1 (No interference between matched-pairs). Let Yijk(T) be the potential outcomes for theith unit in the j th cluster of the kth matched-pair whereT is a (m × 2) matrix whose (j, k) element is Tjk . Weassume that if Tjk = T ′

jk , then Yijk(T) = Yijk(T′).

The assumption allows us to write Yijk(Tjk) ratherthan Yijk(T). Since T1k = Zk and T2k = 1 − Zk ,Yijk(Tjk) only depends on Zk . Given that the as-sumption of no interference among individuals is of-ten highly unrealistic (Sobel, 2006), MPCR offers anattractive alternative. In the Mexico experiment, As-sumption 1 is reasonable because most of the clustersin our experiment are noncontiguous and the traveltimes between them are substantial. However, espe-cially in small villages, individual-level no interferenceassumptions would have been implausible.

Finally, we formalize the cluster-level randomizedtreatment assignment as follows.

ASSUMPTION 2 (Cluster randomization undermatched-pair design). The potential outcomes areindependent of the randomization indicator variable:(Yijk(1), Yijk(0)) ⊥⊥ Zk , for all i, j and k. Also, Zk isindependent across matched-pairs, and Pr(Zk) = 1/2for all k.

The assumption also implies (Yijk(1), Yijk(0)) ⊥⊥Tjk since Tjk is a function of Zk .

3.3 Quantities of Interest

We now offer the definitions of the causal effects ofinterest under MPCR (or CR in general) which havenot been formally defined in the literature. At leasttwo types of each of four distinct quantities may be ofinterest in these experiments. We begin with the fourquantities, which define the target population, and thendiscuss the two types, which clarify the role of interfer-ence. Section 6 introduces additional quantities of in-terest when individual-level noncompliance exists. (Allthe quantities below are based on causal effects de-fined as grouped individual-level phenomena; we dis-cuss cluster-level causal quantities in Section 4.6.)

3.3.1 Target population quantities. Table 1 offersan overview of the four target population causal effects.All four quantities represent the causal treatment effect(the potential outcome under treatment minus the po-tential outcome under control) averaged over differentsets of units.

First is the sample average treatment effect (SATEor ψS ), which is an average over the set of all units in

Page 5: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 33

TABLE 1Quantities of Interest: For each causal effect, this table lists

whether clusters and units within clusters are treated as observedand fixed or instead as a sample from a larger population. The

resulting inferential target is also given

Unitswithin

Quantities Clusters clusters Inferential target

ψS SATE Observed Observed Observed sampleψC CATE Observed Sampled Population within

observed clustersψU UATE Sampled Observed Observable units

within the populationof clusters

ψP PATE Sampled Sampled Population

the observed sample (which we denote as S):

ψS ≡ ES(Y(1) − Y(0)

)(1)

= 1

n

m∑k=1

2∑j=1

njk∑i=1

(Yijk(1) − Yijk(0)

),

where the sums go over pairs, the two clusters withineach pair and the units within each cluster.

The second quantity treats observed clusters as fixed(and not necessarily representative of some population)and the units within clusters as randomly sampled fromthe finite population of units within each cluster. Thisgives the cluster average treatment effect (CATE orψC):

ψC ≡ EC(Y(1) − (0)

)(2)

= 1

N

m∑k=1

2∑j=1

Njk∑i=1

(Yijk(1) − Yijk(0)

),

where the expectation is taken over the set C whichcontains all observed units within the sample clusters,Njk is the known (and finite) population cluster size,and N ≡ ∑m

k=1(N1k + N2k). Throughout, we assumesimple random sampling within each cluster for sim-plicity, but other random sampling procedures can eas-ily be accommodated via unit-level weights. Thus, theonly difference between SATE and CATE is whethereach unit within clusters is treated as fixed or randomlydrawn based on a known sampling mechanism.

A third quantity treats the clusters as randomly sam-pled from a larger population, but the units within thesampled clusters are treated as fixed. The inferentialtarget is the set U, which includes all units in the pop-ulation of clusters that would be observed if its cluster

were in the observed sample. This is what we call theunit average treatment effect (UATE or ψU ) and is de-fined as ψU ≡ EU(Y (1) − Y(0)).

The final quantity of interest is the population aver-age treatment effect (PATE or ψP ), which is defined asψP ≡ EP (Y (1)−Y(0)), where the expectation is takenover the entire population P —that is, the population ofunits within the population of clusters. For simplicitythroughout, we assume an infinite population of clus-ters, but this is easily extended to finite populations atsome cost in additional notation.

Researchers should design their experiments to makeinferences to their desired quantity of interest, thoughin practice they may choose to estimate other quanti-ties of interest when they face design limitations. Inthe SPS evaluation, for example, we would like to in-fer PATE for all of Mexico, but our health clusters werenot (and due to political and administrative constraintscould not be) randomly selected. This means that, likemost medical experiments, any method applied to ourdata to estimate PATE will be dependent on assump-tions about the selection process. An alternative ap-proach would be to try to estimate one of the otherquantities. CATE or SATE are straightforward possi-bilities, and CATE is probably most apt in this case,since individuals within clusters were randomly se-lected, and both quantities condition on the clusters weobserve. Of course, even when inferences are made torestricted populations, readers may still extrapolate toa different population of interest, and so the researcherneeds to decide on the appropriate presentation strat-egy. From a public policy perspective, UATE may bea reasonable target quantity, where we try to infer tothe individuals who would be sampled in all the healthclusters in Mexico that are similar to our observed clus-ters, and from which our clusters could plausibly havebeen randomly drawn.

3.3.2 Interference. Inference in CR experimentsmay be affected by three different types of interference,each of which may require different assumptions. First,when interference exists among individuals within acluster, the potential outcomes of one person (or unit)within a cluster may be different depending on otherunits’ treatment assignment. This type of interferenceis expected and no assumptions are required for thefour causal quantities of interest. In CR experiments,within-cluster interference is part of the outcome, andresearchers can estimate the causal effects of cluster-level treatment on unit-level outcome. Understandingthe effect of individuals independent of and isolated

Page 6: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

34 K. IMAI, G. KING AND C. NALL

from other individuals in the same cluster is best left tostudies where individual randomization is possible.

Second, interference between clusters in differentpairs may affect outcomes. Assumption 1 requires theabsence of such interference between clusters in dif-ferent pairs. We continue to maintain this assumption,as Sobel (2006) demonstrates that without it even thedefinition of a causal effect is complicated (see alsoRosenbaum, 2007).

Third, interference between treatment and controlclusters in the same pair requires us to redefine causaleffects to account for interference. For example, if onecluster is assigned SPS, individuals in the other (con-trol) cluster within the pair may become envious ordepressed as a consequence. This type of interferencewithin a pair can be dealt with in two ways. In the first,which we call no-interference, we define the causal ef-fect (SATE, CATE, UATE or PATE) so that the treat-ment in one cluster has no effect on the potential out-comes of units in the control cluster. In the second,which we call the with-interference, the causal effectis defined so that it includes interference between clus-ters within pairs as well as interference between unitswithin each cluster. (For our Mexico experiment, wedo not expect much direct interference within or acrosspairs, although nearby clusters outside our experimentmight exert some influence over those we observe, inwhich case the definition of UATE or PATE mightchange).

Estimating the no-interference version of SATE,CATE, UATE or PATE in the presence of inter-ference is feasible only with assumption-laden esti-mators. In contrast, estimating the with-interferenceversion is easier since it accepts whatever level of non-interference one’s data happens to present. Of course,having a quantity that is easy to estimate is not a satis-factory substitute for having an estimate of the quantityof interest. The best way to avoid this problem is to usethese facts to design better experiments. For example,we can select noncontiguous clusters to pair, and pairsthat are not contiguous to other pairs. Following ruleslike this whenever feasible reduces the difference be-tween the no-interference and with-interference quan-tities.

4. ESTIMATORS

We now define our estimators and derive their sta-tistical properties. Our strategy throughout is to makeas few assumptions as feasible beyond the experimen-tal design. We also briefly discuss an approach that hasbeen offered in the literature. Since our approach has

little or no bias, and the existing estimator is biased butmay have low variance in some circumstances, we alsooffer a model-based method that combines some of thebenefits of both approaches.

4.1 Definitions

The point estimators for the with-interference ver-sion of the four quantities of interest are each weightedaverages of within-pair mean differences between thetreated and control clusters, but with different weights.We thus define

ψ̂(wk)

≡ 1∑mk=1 wk

·m∑

k=1

wk

{Zk

(∑n1k

i=1 Yi1k

n1k

−∑n2k

i=1 Yi2k

n2k

)(3)

+ (1 − Zk)

·(∑n2k

i=1 Yi2k

n2k

−∑n1k

i=1 Yi1k

n1k

)},

where the weight for the kth pair of clusters, denotedby wk , defines a specific estimator.

The estimator most commonly recommended in themethodological literature is based on a weight usingthe harmonic mean of sample cluster sizes, which canbe written as ψ̂(n1kn2k/(n1k + n2k)) (see, e.g., Don-ner, 1987; Donner and Donald, 1987; Donner and Klar,1993; Hayes and Bennett, 1999; Bloom, 2006; Rau-denbush, 1997; Turner, White and Croudace, 2007).This estimator, and its variance estimator, are in gen-eral biased (see Appendix A.4), but may have low vari-ance in some situations, an issue we return to in Sec-tion 4.5.

As shown in Table 2, ψ̂(n1k + n2k) is our point es-timator for both SATE and UATE, whereas ψ̂(N1k +N2k) applies to both CATE and PATE. This is intuitive,as SATE and UATE are based on those units (whichwould be) sampled in a cluster whereas CATE andPATE are based on the population of units within clus-ters. Our estimator for SATE and UATE differs fromthe existing estimator based on harmonic mean weightsunless the sample cluster sizes within each matchedpair are equal (n1k = n2k for all k = 1, . . . ,m), whichrarely occurs at least in field experiments.

Table 2 also summarizes the variances and their esti-mators. Under our design-based inference, UATE andPATE have identifiable variances, the exact expres-sion for which we give below. SATE and CATE haveunidentifiable variances, and so we offer their upperbound, leading to a conservative confidence interval.

Page 7: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 35

TABLE 2Point estimators and variances for the four causal quantities of interest. “Identified” refers to design-based identification of estimated

causal effects without modeling assumptions

SATE CATE UATE PATE

Point estimator ψ̂(n1k + n2k) ψ̂(N1k + N2k) ψ̂(n1k + n2k) ψ̂(N1k + N2k)

Variance Vara(ψ̂) Varau(ψ̂) Varap(ψ̂) Varaup(ψ̂)

Identified no no YES YES

Our variance estimators differ from the existing estima-tor even when sample cluster sizes are matched exactly.Our variance estimator is approximately unbiased forany weights. Estimates from UATE and PATE (orequivalently SATE and CATE) will differ depending onhow sample and population sizes vary across clusters.

4.2 Bias

We first focus on SATE. This allows us, follow-ing Neyman (1923), to use the randomized treatmentassignment mechanism as the sole basis for statisticalinference (see also Imai, 2008). Here, the potential out-comes are assumed fixed, but possibly unknown, quan-tities. We begin by rewriting ψ̂(n1k + n2k) using po-tential outcome notation:

ψ̂(n1k + n2k)

= 1

n

m∑k=1

(n1k + n2k)

·{Zk

(∑n1k

i=1 Yi1k(1)

n1k

−∑n2k

i=1 Yi2k(0)

n2k

)+ (1 − Zk)

·(∑n2k

i=1 Yi2k(1)

n2k

−∑n1k

i=1 Yi1k(0)

n1k

)}.

Then, taking the expectation with respect to Zk yields

Ea{ψ̂(n1k + n2k)} − ψS

= 1

n

m∑k=1

2∑j=1

{(n1k + n2k

2− njk

)(4)

·njk∑i=1

Yijk(1) − Yijk(0)

njk

},

where the expectation is taken with respect to the ran-domization of treatment assignment which we indicateby the subscript “a.”

Although the bias does not generally equal zero,either of two common conditions can eliminate it.

These two conditions motivate our choice of weights(wk = n1k + n2k). First, when cluster sizes are equalwithin each matched-pair (i.e., n1k = n2k for all k),the bias is always zero. This implies that researchersmay wish to form pairs of clusters, at least partially,based on their sample cluster size if SATE is the esti-mand. Second, ψ̂(n1k +n2k) is also unbiased if match-ing is effective, so that the within-cluster SATEs areidentical for each matched-pair (i.e.,

∑n1k

i=1(Yi1k(1) −Yi1k(0))/n1k = ∑n2k

i=1(Yi2k(1)−Yi2k(0))/n2k for all k).In contrast, bias may remain if cluster sizes are poorlymatched and within each pair cluster sizes are stronglyassociated with the cluster-specific SATEs. However,the bounds on the bias can be found by applyingthe Cauchy–Schwarz inequality to equation (4) andthey can be consistently estimated from the observeddata. In sum, roughly speaking, if cluster sizes andimportant confounders are matched well so that pre-randomization matching accomplishes the purpose forwhich it was designed, this estimator will be approxi-mately unbiased.

A similar bias expression can be derived for ourCATE estimator, ψ̂(N1k + N2k), where the weightsare now based on the arithmetic mean of the popula-tion cluster sizes rather than their sample counterparts.A calculation analogous to the one above yields the fol-lowing bias expression:

Eau(ψ̂(N1k + N2k)

) − ψC

= 1

N

m∑k=1

2∑j=1

{(N1k + N2k

2− Njk

)(5)

· Eu

(Yijk(1) − Yijk(0)

)},

where subscript “au” means that the expectation istaken with respect to random treatment assignment andthe simple random sampling of units within each clus-ter. The conditions under which this bias disappears areanalogous to the ones for SATE: If matching is effec-tive so that the cluster-specific average causal effects,

Page 8: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

36 K. IMAI, G. KING AND C. NALL

that is, Eu[Yijk(1) − Yijk(0)], are constant across clus-ters within each pair, then the bias is zero. The biasalso vanishes if the population cluster sizes are identi-cal within each pair, that is, N1k = N2k for all k. Again,the bounds on the bias can be obtained in the mannersimilar to the case of SATE above.

Finally, the bias for UATE and PATE can be ob-tained by taking the expectation of the bias for SATEand CATE, respectively, with the expectation definedwith based on random sampling of cluster pairs. If thewithin-cluster sample (population) average treatmenteffects are uncorrelated with cluster sizes within eachmatched-pair, then the bias for the estimation of UATE(PATE) is zero, regardless of whether one can matchexactly on cluster sizes. In general, however, clustersizes may be correlated with the size of average treat-ment effects. In such cases, the matching strategies toreduce the bias for the estimation of SATE (CATE) alsowork for the estimation of UATE (PATE). That is, pairsof clusters should be constructed such that within eachpair, cluster sizes and important pre-treatment covari-ates are similar. (We also derived an unbiased estimatorand its variance, but we do not present it here becausethey are not invariant to a constant shift of the outcomevariable when cluster sizes vary within each pair.)

4.3 Variance

In a critical comment about Klar and Donner (1997),Thompson (1998) shows how to obtain valid varianceestimates under the linear mixed effects model and the“common effect assumption.” In their reply, Klar andDonner (1998) criticize the common effect assumptionand, as a result, maintain their claim of analytical diffi-culties with MPCRs. We show here how to obtain validvariance estimates without the common treatment ef-fect assumption or other modeling assumptions.

Rather than focusing on each of our proposed esti-mators, ψ̂(n1k +n2k) and ψ̂(N1k +N2k), separately weconsider the variance of the general estimator, ψ̂(wk)

in equation (3), so that the analytical results we developapply to any choice of weights including the harmonicmean weights. For notational simplicity, we use nor-malized weights, that is, w̃k ≡ nwk/

∑mk=1 wk (so that

the weights sum up to n as in our estimator of SATEand UATE), and consider the variances of ψ̂(w̃k). First,we use potential outcomes notation and write ψ̂(w̃k) =∑m

k=1 w̃k{ZkDk(1) + (1 − Zk)Dk(0)}/n. Then, ourvariance estimator is

σ̂ (w̃k)

≡ m

(m − 1)n2

·m∑

k=1

[w̃k

{Zk

(∑n1k

i=1 Yi1k

n1k

−∑n2k

i=1 Yi2k

n2k

)+ (1 − Zk)(6)

·(∑n2k

i=1 Yi2k

n2k

−∑n1k

i=1 Yi1k

n1k

)}

− nψ̂(w̃k)

m

]2

.

SATE. We first consider the variance of ψ̂(w̃k) forSATE. Taking the expectation of ψ̂(w̃k) with respectto Zk , the true variance of ψ̂(w̃k) is given by

Vara(ψ̂(w̃k)) = 1

4n2

m∑k=1

w̃2k

(Dk(1) − Dk(0)

)2.(7)

This variance is not identified since we do not jointlyobserve Dk(1) and Dk(0) for each k. Thus, we identifyan upper bound of this variance, making no additionalassumptions, and estimate it from the observed data.

The next proposition establishes that the true vari-ance, Vara(ψ̂(w̃k)), is not identifiable, and shows thatour proposed variance estimator, σ̂ (w̃k), is conserva-tive.

PROPOSITION 1 (SATE variance identification).Suppose that SATE is the estimand. Then, the true vari-ance of ψ̂(w̃k) is not identifiable. The bias of σ̂ (w̃k) isgiven by

Ea(σ̂ (w̃k)) − Vara(ψ̂(w̃k))

= m

4n2 var{w̃k

(Dk(1) + Dk(0)

)},

where var(·) represents the sample variance with de-nominator m − 1.

See Appendix A.1 for a proof. This proposition im-plies that on average σ̂ (w̃k) overestimates the truevariance Vara(ψ̂(w̃k)) unless the sample variance ofweighted within-cluster SATEs across pairs is zero.For example, if SATE is constant across pairs, andthe cluster sizes are equal, σ̂ (w̃k) estimates the truevariance without bias. However, such a scenario ishighly unlikely under MPCR, and thus σ̂ (w̃k) shouldbe seen as a conservative estimator of the variance. Itis also possible to obtain a less conservative varianceestimate than σ̂ (w̃k). For example, researchers mayuse a consistent estimator of {(∑m

k=1 w̃2kDk(1)2)1/2 +

(∑m

k=1 w̃2kDk(1)2)−1/2}2/4n2, which is obtained by

applying the Cauchy–Schwarz inequality to equa-tion (7). Another approach to obtain a tighter boundwould be to apply the covariance inequality to the biasexpression given in Proposition 1.

Page 9: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 37

CATE. Next, we study variance for CATE, ψ̂(w̃k),which we write as

Varau(ψ̂(w̃k))

= Eu{Vara(ψ̂(w̃k))}+ Varu{Ea(ψ̂(w̃k))}(8)

= 1

4n2

m∑k=1

w̃2k

{Eu

(Dk(1) − Dk(0)

)2

+ Varu(Dk(1) + Dk(0)

)},

where the second equality holds because sampling ofunits is independent within clusters. Similar to theSATE variance, this is not identified since we do notjointly observe Dk(1) and Dk(0) for each k. The nextproposition shows that σ̂ (w̃k) is again conservative.

PROPOSITION 2 (CATE variance identification).Suppose that CATE is the estimand. The true varianceof ψ̂(w̃k), Varau(ψ̂(w̃k)), is not identifiable. The biasof σ̂ (w̃k) is given by

Ea(σ̂ (w̃k)) − Vara(ψ̂(w̃k))

= m

4n2 var{w̃kEu

(Dk(1) + Dk(0)

)}.

See Appendix A.2 for a proof. The proposition im-plies that our proposed variance estimator, σ̂ (w̃k), is anupper bound of the true variance. As in the case of theSATE, this upper bound can be improved. For example,rewrite the variance in equation (8) as

Varau(ψ̂(w̃k))

= 1

2n2

m∑k=1

w̃2k

[Varu(Dk(1)) + Varu(Dk(0))(9)

+ 1

2

{Eu

(Dk(1) − Dk(0)

)}2].

Then, apply the Cauchy–Schwarz inequality to thethird term in the bracket of equation (9). Alternatively,applying the covariance inequality to the bias expres-sion in Proposition 2 yields a tighter bound.

UATE and PATE. Unlike in the case of the SATEand the CATE, the variance of ψ̂ is identified and canestimated approximately without bias when UATE orPATE is the estimand. We establish this result as thefollowing proposition:

PROPOSITION 3. Conditional on w̄ = ∑mk=1 wk/

m, the variances of ψ̂(w̃k) for estimating the UATE and

PATE are given by:

Varap(ψ̂(w̃k)) = 1

mw̄2 Varp(wkDk),

Varapu(ψ̂(w̃k)) = 1

mw̄2 [Ep{w2k Varu(Dk)}

+ Varp{w̃kEu(Dk)}],respectively, where Dk = ZkDk(1) + (1 − Zk)Dk(0)

and “p” represents the expectation with respect tosimple random sampling of matched-pairs of clusters.Conditional on w̄, both variances can be estimated byσ̂ (w̃k) without bias under their corresponding sam-pling schemes.

See Appendix A.3 for a proof. The propositionshows that when estimating PATE, the variance ofψ̂(w̃k) is proportional to the sum of two elements: themean of within-cluster variances and the variance ofwithin-cluster means. If all units are included in eachcluster, then the first term will be zero because thewithin-cluster means are observed without samplinguncertainty, that is, Varu(Dk) = 0 for all k. In eithercase, however, our proposed variance estimator σ̂ (w̃k)

is unbiased, conditional on the mean weight, w̄.

Inference. Given our proposed estimators and vari-ances, we make statistical inferences by assuming thatψ̂(w̃k) is approximately unbiased. We consider threesituations:

1. Many pairs. When the number of pairs is large (re-gardless of the number of units within each clus-ter), no additional assumption is necessary due tothe central limit theorem. For PATE and UATE, thelevel α confidence intervals are given by [ψ̂(w̃k) −zα/2

√σ̂ (w̃k), ψ̂(w̃k) + zα/2

√σ̂ (w̃k)] where zα/2

represents the critical value of two-sided level α

normal test. For the SATE and CATE, the confi-dence level of this interval will be greater than orequal to α.

2. Few pairs, many units. For CATE (and PATE),the central limit theorem implies that Dk followsthe normal distribution. Since the weights are as-sumed to be fixed for CATE, w̃kDk is also nor-mally distributed. For the other three quantities,we assume w̃kDk is normally distributed. In ei-ther case, the level α confidence intervals aregiven by [ψ̂(w̃k) − tm−1,α/2

√σ̂ (w̃k), ψ̂(w̃k) +

tm−1,α/2√

σ̂ (w̃k)], where tm−1,α/2 represents thecritical value of the one-sample, two-sided level α t-test with (m− 1) degrees of freedom. For the SATEand CATE, the confidence level of this interval willbe greater than or equal to α.

Page 10: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

38 K. IMAI, G. KING AND C. NALL

3. Few pairs, few units. When little information isavailable, a distributional assumption is required forthe inferences about all four quantities. We mayassume w̃kDk follows the normal distribution asabove and construct the confidence intervals andconduct hypothesis tests based on t-distribution.

Finally, although it was once thought that the needfor, and inability to estimate, the intracluster corre-lation coefficient (ICC) was a major disadvantage ofMPCR designs (Campbell, Mollison and Grimshaw,2001; Klar and Donner, 1997; Donner, 1998), esti-mates of the ICC are in fact not needed for our esti-mators or their variances. Below, we also show that ef-ficiency analysis, power comparisons and sample sizecalculations can also be conducted without the ICC es-timation.

4.4 Performance in Practice

We now study how our estimator and the harmonicmean estimator work in practice. The results here alsomotivate a combined approach to estimation we offerin Section 4.5.

Confidence interval coverage. To construct realis-tic simulations, we begin with the observed cluster-specific mean for two out-of-pocket health expendi-tures from the SPS evaluation data (measured in pesos)and use this to set the potential outcomes’ true popu-lation for the simulation. Finally, we generate the out-come variables via independent normal draws for unitswithin clusters using a set of heterogeneous variances.Thus, the existing harmonic mean estimator’s meanand variance constancy assumptions are violated, asis common in real data, although its normality and in-dependence assumptions are maintained. (Replicationdata are available in Imai, King and Nall, 2009.)

We study the properties of the proposed and existingvariance estimators with PATE or UATE as the esti-mand. (As shown in Proposition 2, the CATE varianceis not identified and the expectation of our variance es-timator equals a upper bound.) We generate a popula-tion of clusters by bootstrapping the observed pairs ofSPS clusters along with their observed means and a setof heterogeneous variances. We then compute cover-age probabilities under both estimators where the arith-metic and harmonic mean weights are used for the pro-posed and existing estimators, respectively. We drawfrom the discrete empirical distribution, which is farfrom a Gaussian distribution, yielding a hard case forboth estimators. The left panels of Figure 1 summarize

the results. As expected due to the central limit theo-rem, both sets of our 90% confidence intervals (soliddisks) approach their corresponding nominal cover-age probabilities as the number of pairs increase. Incontrast, the confidence intervals based on the har-monic mean variance estimator (open diamonds) arebiased—too wide in the top graph and too narrow inthe bottom—and the magnitude of bias does not de-crease even as the number of pairs grows.

Standard error comparisons. We begin by comput-ing the standard error (the square root of the estimatedvariance) based on the general variance formula pro-posed in Donner (1987), Donner and Donald (1987)and Donner and Klar (1993), as well as the one basedon our approximately unbiased alternative. For compa-rability, we use the arithmetic mean weights for bothstandard error calculations. We make these computa-tions for a large number of outcome variables fromthe SPS evaluation survey conducted 10 months afterrandomization. The outcome variables include somewhich were binary (e.g., did the respondent suffercatastrophic medical expenditures? Does our bloodtest indicate that the respondent has high cholesterol?Has the respondent been diagnosed with asthma?) andothers denominated in Mexican pesos (e.g., out-of-pocket expenditures for health care, for drugs, etc.).We then divided this standard error by our alternativefor each variable. The top right graph in Figure 1 givesa smoothed histogram of these ratios (plotted on thelog scale but labeled in original units, with 1 the pointof equality). In these real data, the biased standard er-rors range from about two times too small to two timestoo large. Note that the central tendency of this his-togram has no particular meaning, as it is constructedfrom whatever questions happened to be asked on thesurvey. The key point is that in real data the deviationfrom the approximately unbiased estimator for any onesuch standard error can be large in either direction.

Bias-variance tradeoff. Using data from an expendi-ture outcome in the SPS sample, we simulate an in-stance in which the variance of the existing estimatoroutperforms our estimator. To distinguish between theharmonic and arithmetic mean, we begin by setting allwithin-pair cluster sizes equal to the size of the treat-ment cluster in the SPS evaluation. Then, keeping thetotal pair size constant, we increase the difference inwithin pair cluster size such that the added difference incluster sizes is proportional to the within-pair treatmenteffect. This leaves the average treatment effect constantwhile demonstrating differences in the two weightingschemes. The bottom right graph in Figure 1 presents

Page 11: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 39

FIG. 1. Inference Accuracy. Simulations in the left panels demonstrate how our estimator’s coverage is approximately correct and in-creasingly so for larger sample sizes, while the existing estimator can yield confidence intervals that are either too large (top left) or toosmall (bottom left). The top right panel uses real data to give the ratios of the harmonic mean standard error to our approximately unbiasedalternative on the horizontal axis (on the log scale, but labeled as ratios). The bottom right figure gives squared bias, MSE, and variancecomparisons as a function of the average cluster size ratio; a vertical line marks the observed heterogeneity in the SPS data.

the absolute difference between the two estimators inmean square error, squared bias and variance, with theobserved SPS value marked with a vertical line.

The overall picture from these results indicates thatthe arithmetic estimator would be preferred because ithas lower mean square error than the harmonic meanestimator. However, at the expense of introducing biaswhen treatment effect is both variable across pairs andcorrelated with the cluster size, the harmonic mean es-timator can have substantially lower variance. Theseresults suggest the possibility of an improved estimatorbased on the combination of both approaches, a subjectto which we now turn.

4.5 An Encompassing, Model-Based Approach

The standard harmonic mean estimator is unbiasedwhen applied to data where the between-cluster ho-mogeneity assumption holds. In this situation, the har-monic mean weights also have the attractive propertyof downweighting observations with worse matchesand larger variances, thereby reducing variance. If thehomogeneity assumption is violated, however, then onecannot afford to downweight pairs, no matter how

badly matched or imprecisely measured, because doingso could result in arbitrarily large biases. In contrast,the arithmetic mean estimator avoids bias by makingno assumptions about the nature of how treatment ef-fects vary over the pairs. However, a consequence of itimposing no structure on treatment effect heterogene-ity is that mismatched pairs are not downweighted andso some inefficiency may result if in fact the treatmenteffects are similar across pairs.

We now combine the insights of these two ap-proaches and propose a single encompassing modelthat provides some of the advantages of each, at thecost of somewhat more stringent assumptions thanwith our design-based approach. Consider data withm∗ groups of clusters, where the homogeneity assump-tion holds within each group. Assume that the clusterswithin any one pair are never split between groups.Let g(k) = l denote the group to which pair k belongs,l = 1,2, . . . ,m∗ with m∗ ≤ m. Then, make the model-

ing assumption that Yijk(t)i.i.d.∼ N (μ

(g(k))t , σ̃ (g(k))) for

t = 0,1 where μ(l)t and σ̃ (l) are not necessarily equal

to μ(l′)t and σ̃ (l′) for l �= l′. Under this model, CATE

Page 12: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

40 K. IMAI, G. KING AND C. NALL

equals,

ψC = 1

N

m∑k=1

(N1k + N2k)(μ

(g(k))1 − μ

(g(k))0

).(10)

When the group membership is known ex ante, anunbiased and efficient estimator of CATE is given byreplacing μ

(l)k ≡ μ

(l)1 −μ

(l)0 with its harmonic mean es-

timate

μ̂(l)k =

m∑k=1

1{g(k) = l}wkDk

/[m∑

k′=1

1{g(k′) = l}wk′

],

where wk = n1kn2k/(n1k + n2k) and 1{·} representsthe indicator function. Thus, this mixture model esti-mator is an arithmetic mean of within (homogeneous)group harmonic mean estimators. A special case is theharmonic mean estimator in the literature, where thehomogeneity assumption is made across all clusters,that is, m∗ = 1. When every pair belongs to a differ-ent group, that is, m∗ = m, this estimator approximatesour proposed design-based estimator. In many applica-tions, the group membership as well as the number ofgroups may be unknown. In this case, CATE may beestimated via standard methods for fitting finite mix-ture models (e.g., McLaughlan and Peel, 2000).

4.6 Cluster-Level Quantities of Interest

The eight quantities of interest defined in Sec-tion 3.3—SATE, CATE, PATE and UATE, both withand without interference—are all defined as aggrega-tions of unit-level causal effects. For some purposes,however, analogous quantities of interest can be de-fined at the cluster level. For example, quantities ofinterest in the SPS evaluation include the health clinic-level variables. Some of these effects, such as the sup-ply of drugs and doctors, are defined and measured atthe health clinic, and so are effectively unit-level vari-ables amenable to cluster-level analyses.

However, for other variables, individual-level sur-vey responses are required to measure the aggregatevariables. Examples include the success health clinicsin our experiment have in protecting privacy, reducewaiting times, etc. If these latter variables are used tojudge the causal effect of SPS on the clinics, we have aCR experiment, but a quantity of interest at the clusterlevel. In this situation, our estimator is a special caseof equation (3), with a constant weight, ψ̂(1). Sim-ilarly, the variance of this estimator is a special caseof our general formulation in equation (6), σ̂ (1). Thisestimator for aggregate quantities is unbiased and in-variant for all quantities of interest. In the case of unit-level variables amenable to cluster-level analysis (such

as collected via survey), there will likely be samplingerror and so may result in a larger variance.

5. COMPARING MATCHED-PAIR AND OTHERDESIGNS

We now study the relative efficiency and powerof the MPCR and unmatched cluster randomization(UMCR) designs, and give sample size calculations forMPCR. We also briefly compare MPCR with the strat-ified design and discuss the consequences of loss ofclusters under each.

5.1 Unmatched Cluster Randomized Design

The UMCR design is defined as follows. Consider arandom sample of 2m clusters from a population. Weobserve a total of nj units within the j th cluster in thesample, and use n to denote the total number of unitsin the sample, n = ∑2m

j=1 nj . Under this design, m ran-domly selected clusters are assigned to the treatmentgroup with equal probability while the remaining m

clusters are assigned to the control group.We construct an estimator analogous to that pro-

posed for the UMCR as

τ̂ (w̃j ) ≡ 2

n

2m∑j=1

nj∑i=1

w̃j

nj

{ZjYij − (1 − Zj)Yij }(11)

= 2

n

2m∑j=1

nj∑i=1

w̃j

nj

{ZjYij (1) − (1 − Zj)Yij (0)},

where Zj is the randomized binary treatment variable,Yij (t) is the potential outcome for the ith unit in the j thcluster under the treatment value t for t = 0,1, and w̃j

is the known normalized weight with∑2m

j=1 w̃j = n.For SATE and UATE, we use w̃j = nj . For CATE andPATE, we use w̃j ∝ Nj where Nj is the populationsize of the j th cluster. Analysis similar to the one inSection 4.2 shows that this estimator is unbiased for allfour quantities in UMCR experiments.

The commonly used estimator in the literature forthis design takes a form slightly different from equa-tion (11): κ̂ ≡ ∑2m

j=1 Zj

∑nj

i=1 Yij /∑2m

j=1 Zjnj +∑2mj=1(1 − Zj)

∑nj

i=1 Yij /(n − ∑2mj=1 Zjnj ). This es-

timator is applicable to SATE and UATE but notCATE and PATE because it ignores cluster populationweights. The estimator is also biased for SATE andUATE, and the magnitude of bias can be derived usingthe Taylor series. Without modeling assumptions, theexact variance calculation is difficult within the design-based framework because the denominator as well as

Page 13: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 41

the numerator is a function of the randomized treat-ment variable. In addition, the usual approximate vari-ance calculations for such a ratio estimator yield eitherthe same variance as τ̂ (nj ) or the variance estimatorthat is not invariant to a constant shift. Thus, for thesake of simplicity, we focus on τ̂ (w̃j ) in this sectionalthough κ̂ and its approximate variance estimator mayperform reasonably well in practice.

For the rest of this section, we assume that the esti-mand is UATE. However, the same calculations applywhen the estimand is PATE since the variance estima-tor is the same for both. For SATE and CATE, we caninterpret these results as conservative estimates of effi-ciency, power and sample sizes.

5.2 Efficiency

When the estimand is UATE, the variance of τ̂ (w̃j )

is approximately (conditional on n = ∑2mj=1 w̃j ) equal

to Varac(τ̂ (w̃j )) = 4mn2 {Varc(w̃jYj (1)) + Varc(w̃j ·

Yj (0))}, where Yj (t) ≡ ∑nj

i=1 Yij (t)/nj for t = 0,1,and the subscript “c” represents the simple randomsampling of clusters. To facilitate comparison, as-sume that under MPCR one is able to match on clus-ter sizes so that n1k = n2k for all k. Proposition 3implies that under the same condition the varianceof ψ̂(w̃k) can be approximated by Varap(ψ̂(w̃k)) =mVarp{w̃k(Yjk(1) − Yj ′k(0))}/n2, where Yjk(t) ≡∑njk

i=1 Yijk(t)/njk and j �= j ′. Since the assumptionof n1k = n2k means w̃jk = 2w̃j , we haveVarp(w̃kYjk(t)) = 4 Varc(w̃jYj (t)) for t = 0,1. Thus,the relative efficiency of MPCR over UMCR is

Varac(τ̂ (w̃j ))

Varap(ψ̂(w̃k))

≈{

1 − 2 Covp(w̃kYjk(1), w̃kYj ′k(0))∑1t=0 Varp(w̃kYjk(t))

}−1

.

This implies that the relative efficiency of MPCR de-pends on the correlation of the observed within-paircluster mean outcomes weighted by cluster sizes. Ifmatching induces a positive correlation, as is its pur-pose and will normally occur in practice, then MPCRis more efficient. (In the worst case scenario wherematching is implemented in a manner opposite to theway it was designed, and thus induces a negative cor-relation, MPCR can be less efficient.) Under MPCR,we can estimate Covp(w̃kYjk(1), w̃kYj ′k(0)) withoutbias using the sample covariance between w̃kYjk(1)

and w̃kYj ′k(0), which are jointly observed for each k.And thus, under MPCR, the variance one would obtainunder UMCR can also be estimated without bias (see

also Imai, 2008). This is another advantage of MPCRsince the converse is not true. (If cluster sizes are equal,one can also estimate the ICC nonparametrically andseparately for the treated and control groups—there isno reason to assume the ICC is the same for two po-tential outcomes as done in the literature. Note that theICC is not required for efficiency, power or sample sizecalculations.)

Empirical evidence. Although the MPCR designhave other advantages in public policy evaluations(King et al., 2007), their advantage in statistical effi-ciency can be considerable. We estimate the efficiencyof MPCR as used in the SPS evaluation over the effi-ciency that our experiment would have achieved, if wehad used complete randomization without matching.Figure 2 plots the relative efficiency of our estimatorfor MPCR over UMCR for UATE and for PATE. We dothis for our 14 outcome variables denominated in pe-sos. For UATE, the estimator based on the MPCR is be-tween 1.13 and 2.92 times more efficient, which meansthat our standard errors would have been as much as√

2.92 = 1.7 times larger if we had neglected to pairclusters first. The result is even more dramatic for esti-mating PATE, for which the MPCR design for differentvariables is between 1.8 and 38.3 times more efficient.In this situation, our standard errors would have been asmuch as six times larger if we had neglected to matchfirst.

5.3 Power

We now use the variance results in Section 4.3 tocalculate statistical power, that is, the probability of

FIG. 2. Relative efficiency of matched-pair over unmatched clus-ter randomized designs in the SPS evaluation.

Page 14: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

42 K. IMAI, G. KING AND C. NALL

rejecting the null if it is indeed false, for UATE andPATE, which also represent the minimum power forSATE and CATE, respectively.

5.3.1 Power calculations under the matched-pairdesign. We begin with power calculation for UATEgiven a null hypothesis of H0 :ψU = 0, the alterna-tive hypothesis of HA :ψU = ψ , and the level α t-test.In this setting, Proposition 3 implies the power func-

tion, 1 + Tm−1(−tm−1,α/2 | nψ/√

mVarp{w̃kDk) −Tm−1(tm−1,α/2 | nψ/

√mVarp{w̃kDk}), where Tm−1(· |

ζ ) is the distribution function of the noncentral t dis-tribution with (m − 1) degrees of freedom and thenoncentrality parameter ζ , and w̃k = n1k + n2k . ForUATE, we sample cluster pairs but not units withineach cluster. Thus, a simpler expression for the powerfunction results if we assume equal cluster sizes. Inthat case, a researcher may reparameterize the powerfunction by normalizing ψ in terms of the standard de-viation of within-pair mean differences, that is, dU ≡ψ/

√Var(Dk). Then, we write the power function as

1 + Tm−1(−tm−1,α/2 | dU

√m

)(12)

− Tm−1(tm−1,α/2 | dU

√m

).

Next, for PATE, we sample units within each clusteras well as pairs of clusters. The null hypothesis is givenby H0 :ψP = 0 and the alternative is Ha :ψP = ψ .Again, for simplicity, we assume n̄ = njk = n/(2m)

for all j and k. Then, Proposition 3 implies the powerfunction is of the same form as equation (12) exceptthat the noncentrality parameter is given by

ψ√

m/√√√√√ 2∑

j=1

Ep{w̃2k Varu(Yijk)}

n̄+ Varp(w̃kEu(Dk)),

where w̃k ∝ N1k + N2k . Similar to UATE, if popula-tion clusters sizes are equal, we obtain a simpler powerfunction

1 + Tm−1

(−tm−1,α/2

∣∣∣ dP

√m√

1 + π/n̄

)(13)

− Tm−1

(tm−1,α/2

∣∣∣ dP

√m√

1 + π/n̄

),

where, for UATE, ψ is normalized by the standarddeviation of the within-pair mean difference, dP ≡ψ/

√Varp{Eu(Dk)}, and π is the ratio of the mean

variances of the potential outcomes and the varianceof within-pair differences-in-means by the mean vari-ances of the potential outcomes, π ≡ ∑2

j=1 Ep{Varu ·(Yijk)}/Varp(Eu(Dk)).

5.3.2 Sample size calculations. We use the aboveresults to estimate the sample size required to achievea given precision in a future experiment under MPCR.Suppose an investigator wishes to specify the desireddegree of precision in terms of Type I and Type II errorrates in hypothesis testing, denoted by α and β , respec-tively. In particular, the goal is to calculate the sam-ple size required to achieve a given degree of power,1 − β , against a particular alternative (Snedecor andCochran, 1989, Section 6.14), using the power func-tions just derived. For example, for UATE under equalcluster sizes, and using equation (12), the desired num-ber of cluster pairs is the smallest value of m suchthat 1 + Tm−1(−tm−1,α/2 | dU

√m) − Tm−1(tm−1,α/2 |

dU

√m) ≥ 1 − β where dU ≡ ψ/

√Var(Dk), α, and β

are specified by the researcher. Similarly, for PATE,equation (13) is used to determine the number of pairsand units within each cluster.

Empirical evidence. To illustrate, we use SPS eval-uation data on the annualized out-of-pocket health careexpenditure that a household spent in the most recentmonth. Using estimates of π and Varp{Eu(Dk)} fromthe SPS data and equation (13), we calculate the mini-mal absolute effect size for PATE that can be detectedusing a two-sided t-test with size 0.95 and power 0.8,for any given cluster size and number of cluster pairs.Since the household is the unit of interest in this ex-ample, our population count involves the number ofhouseholds per cluster, instead of the number of indi-viduals.

In the left panel of Figure 3, horizontal axis is thenumber of pairs and the vertical axis indicates the num-ber of units within each cluster. The contour lines rep-resent the minimum detectable size in pesos. The graphshows that MPCR with 30 pairs and 100 units withineach cluster can detect the true absolute effect size ofapproximately 450 pesos with the given precision. Thefigure displays the obvious result that experiments withmore pairs or clusters, can detect smaller sized effects(contour lines are labeled with smaller numbers as wemove to the top right of the figure). More importantly,the nearly vertical contour lines (above 50 or so unitswithin each cluster) indicates that adding more pairs ofclusters adds more statistical power than adding moreunits within each pair. However, adding one more pairmeans that many more units will be added, and in somesituations sampling units within new clusters is moreexpensive than within existing clusters. As such, theexact tradeoff depends on the specifics of each appli-cation, and it would be incorrect to conclude that moreclusters always dominates more units. (We discuss theright panel of the figure next.)

Page 15: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 43

FIG. 3. Sample size calculations for PATE under MPCR. The left panel plots the smallest detectable absolute effect size of SPS on annual-ized out-of-pocket expenditures (in pesos) using a 0.05 level two-sided test with power 0.95, with π estimated from SPS data. The horizontaland vertical axes plot m and n̄, respectively. The right panel compares correlations with and without population weights between treatmentand control group cluster-specific means in SPS data. All but one variable has higher correlation when incorporating weights, as seen by adot below the 45◦ line. The graph also presents “break-even” correlations (indicated by dashed and dotted lines with and without weights,respectively), which are the smallest possible correlations matching must induce in order for MPCR to detect smaller effect size than theUMCR, given fixed power (0.8) and size (0.95). The graph suggests, when weights are appropriately taken into consideration, that MPCRshould be preferred (for all but possibly one variable) even when the number of pairs is as small as three.

5.3.3 Power comparison. Although MPCR is typ-ically more efficient than UMCR regardless of sam-ple size, Martin et al. [(1993), page 330] point outthat when the number of pairs is small (fewer thanabout 10), “the matched design will probably have lesspower than the unmatched design” due to the loss ofdegrees of freedom. Here, we show that this conclu-sion critically hinges on Martin et al.’s assumption ofequal cluster population sizes as well as their partic-ular assumed parametric model relating the matchingand outcome variables. Modeling assumptions are al-ways worrisome, but the equal cluster size assumptionis especially problematic because varying cluster sizesis in fact a fundamental feature of numerous CR exper-iments.

When cluster sizes are unequal, the efficiency gainof matching in CR trials depends on the correlations ofweighted cluster means between the treatment and con-trol clusters across pairs (with weights based on sam-ple or population cluster sizes depending on the quan-tity of interest), not the unweighted correlations usedin Martin et al.’s calculations. Since population clustersizes are typically observed prior to the treatment ran-domization, researchers can incorporate this variableinto their matching procedure. As a result, correlationsof weighted outcomes (constructed from clusters withmatched weights) will usually be substantially higher

than those of unweighted outcomes; this is true evenwhen cluster sizes are independent of outcomes. Thus,in CR trials with unequal cluster sizes, the efficiencygain due to pre-randomization matching is likely to beconsiderably greater than the equal cluster size caseconsidered by Martin et al. (1993). Any power compar-ison must take this factor into consideration, and alongwith the bias reduction, this is another reason to incor-porate cluster sizes into one’s matching procedure.

Empirical evidence. The right panel of Figure 3 il-lustrates the argument above using the SPS evalua-tion data, by calculating the across-pair correlations be-tween treatment and control cluster means of 67 out-come variables (ranging from health related variablesto household health care expenditure variables), bothwith and without weights. We use population clustersizes as weights, which were observed prior to therandomization of the treatment and incorporated intothe matching procedure used (King et al., 2007). Thegraph shows that all but one variable has consider-ably higher correlations when weights are incorporated(which does not make the equal cluster size assump-tion) than when they are ignored (which assumes con-stant cluster size); this can be seen by all but one of thedots falling below the solid 45◦ line. In fact, the me-dian of the correlations is more than three times largerwith (0.68) than without (0.20) weights. In their con-clusion, Martin et al. [(1993), page 336] recommend

Page 16: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

44 K. IMAI, G. KING AND C. NALL

that if the number of pairs is 10 or fewer, then match-ing should be used only if researchers are confident thatthe correlation due to matching is at least 0.2. Indeed,all variables in SPS meet this criteria if the weights areappropriately taken into account, the minimum corre-lation with weights being 0.22. (If the correlations arecalculated incorrectly as they did without weights, thenonly about half of the variables meet their criteria.)

To illustrate the above result in terms of powerand sample size calculations, the graph also presentsthe “break-even” matching correlations (indicated bydashed and dotted lines for correlations with and with-out weights, respectively) that are used by Martin etal. (1993), Section 7. As in the original article, we setthe power and size of the test to be 0.8 and 0.95, re-spectively, and derive the smallest correlation matchingmust induce in order for the matched-pair design to beable to detect smaller effect sizes than the UMCR de-sign. The result indicates that even with as few as threepairs, more than 85% of the variables had a correlationhigher than the break-even point, which is 0.56. Withfive pairs, all but one variable exceeds the threshold.

In contrast, if one ignores the weights, by incorrectlyassuming that the clusters are equally sized, as in Mar-tin et al., then only 4% and 34% of the variables havethe correlations higher than the break-even correlationsof three and five pairs, respectively. Martin et al. (1993)described the correlation of 0.25 as “difficult to achieveby matching” (page 335). However, as the data fromSPS evaluation show, since one can match on clustersizes, the level of weighted correlations is much higherwhen cluster sizes are different.

For another example, Donner and Klar [(2000b),page 37] give the unweighted correlations from sevendifferent studies, only one of which is negative (0.49,0.41, 0.13, 0.63, −0.32, 0.94 and 0.21). The correctweighted correlations are not reported, but in all caseswould be higher, and in all likelihood all seven wouldbe positive.

Thus, by dropping the assumption that all clustersare equally sized we have shown here that, for practi-cal purposes, the matched pair design may well havemore statistical power than the UMCR design, evenfor small samples. Of course, if one has fewer thanthree matched pairs, it’s probably time to stop worryingabout the properties of statistical estimators and head tothe field to gather more data.

5.4 Lost Clusters, Stratified Designs and CausalHeterogeneity

We now clarify four additional issues about MPCRthat have arisen. First, some recommend a stratified de-

sign, where units are matched in blocks of larger thantwo. However, a stratified design is merely a UMCRdesign operating within each stratum. If all units withina stratum have identical values on important back-ground covariates, then it is effectively equivalent toMPCR. But if any heterogeneity on these covariatesor cluster sizes remain within strata, then the stratifieddesign may leave some efficiency on the table. Thus,when feasible, switching from a stratified to an MPCRdesign has the potential to greatly increase efficiencyand power.

Second, Donner and Klar [(2000a), page 40] explainthat a “disadvantage of the matched-pair design is thatthe loss to follow-up of a single cluster in a pair impliesthat both clusters in that pair must effectively be dis-carded from the trial, at least with respect to testing theeffect of the intervention. This problem. . . clearly doesnot arise if there is some replication of clusters withineach combination of intervention and stratum.” Indeed,the loss of a single cluster from a stratum with morethan two clusters may make it possible to estimate thecausal effect within that stratum, but the missingnessprocess must be ascertained or assumed and some typeof imputation strategy (or other procedure; e.g., Wei,1982) must be used, risking model dependence. Theseare issues for both MPCR and stratified designs. Al-ternatively, if a cluster is lost in an MPCR study, thendropping the other member of the pair makes it pos-sible to retain the benefits of randomization for SATEor CATE defined in the remaining pairs—without los-ing other observations, without imputation and possi-ble model dependence, and regardless of the missingdata mechanism (King et al., 2007). In contrast, theloss of a cluster in a UMCR design turns an experi-mental study into an observational study requiring theaddition of ignorability assumptions which experimen-talists normally try to avoid. The loss of a single clusterwithin a stratum larger than two units means that morethan one cluster will need to be dropped in order to re-tain the benefits of randomization, which may lead tounnecessary efficiency losses.

Third is the claim that MPCR restricts “predictionmodels to cluster-level baseline risk factors (for exam-ple, cluster size)” (Donner and Klar, 2004). This sen-tence has been widely interpreted to mean that pre-diction models under MPCR cannot include baselinerisk factors, but Donner and Klar clearly intended itto indicate (and confirmed to us that they meant) themore straightforward point that cluster-level fixed ef-fects cannot be included in regression models underMPCR. Of course, results can be analyzed within stratadefined by any individual or cluster level variable, so

Page 17: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 45

long as it is pre-treatment. For example, the bottom tworows of Table 3 repeat the same analysis as the top tworows but only for male-headed households, a variablemeasured only at the unit-level and used to separatethe sample at that level. (The results for each quantityof interest in this case appear only slightly larger thanfor the entire sample.) Regression models with fixedeffects for clusters are unidentified under MPCR, al-though substituting in random effects is unproblematic,at least for identification purposes.

Finally, Donner and Klar (2004) explain that MPCRis to be faulted because of its “inability to test for ho-mogeneity” of causal effects within a pair. And hy-pothesis tests cannot be conducted for the differencebetween two pairs. However, the causal effect is easyto measure without bias or model dependence underMPCR (but not under UMCR) at the pair level with-out bias merely by taking the difference in means be-tween the two clusters. This may be a noisy estimateif matching quality is poor, but it serves as a usefulunbiased dependent variable for subsequent analyses.We can see how it varies as a function of any vari-able measured at the unit level and then aggregated tothe cluster-pair level, or measured directly at the aggre-gate level from existing data, such as from census data.Even hypothesis tests are possible if we pool pairs. Forexample, since the point of SPS was to help poor fam-ilies, we could examine whether the causal effect ofrolling out SPS on various outcome variables increasesas the wealth of an area drops. This can be done by asimple plot of the pair-level causal effect by wealth, orfitting a regression model.

6. METHODS FOR UNIT-LEVEL NONCOMPLIANCE

CR trials typically have imperfect treatment compli-ance at the unit level. Some individuals in treatmentclusters refuse treatment while others in the controlcluster receive the treatment. Since most CR social ex-periments, including the SPS evaluation, allow non-compliance, analyses, in addition to ITT estimates,may account for noncompliance and estimate the effectof the program only for individuals who would adhereto the experimental protocol. Thus, we now extend ourapproach to CR trials under the MPCR encouragementdesign, where the encouragement to receive a treat-ment, rather than the receipt of the treatment itself, israndomized at the cluster-level.

Angrist, Imbens and Rubin (1996) show how an in-strumental variable method can be used to analyzeunit-randomized experiments with noncompliance un-der individually randomized designs. We extend their

approach to MPCR experiments with unit-level non-compliance. To complement the parametric Bayesianapproach to this problem (under the unmatched clus-ter randomized design) by Frangakis, Rubin and Zhou(2002), we consider a design-based analysis applyingthe approach introduced in Section 4.

6.1 Causal Quantities of Interest

We consider the two types of causal quantities ofinterest under MPCR encouragement designs—theintention-to-treat (ITT) effect and the complier aver-age causal effect (CACE) (Angrist, Imbens and Rubin,1996). The ITT effect is the average causal effect of en-couragement (rather than treatment) and is equivalentto the various versions of the average treatment effectin Section 3.3 (i.e., SATE, CATE, UATE and PATE,with or without interference).

In contrast, the CACE estimand is the average treat-ment effect (for SATE, CATE, UATE or PATE, with orwithout interference) among compliers only. Compli-ers are neither those merely observed to affiliate amongthose in encouragement clusters nor those observed notto affiliate in clusters not encouraged since the formerincludes always-takers and the latter includes never-takers. Note that always-takers (never-takers) are thosewho always (never) take the treatment regardless ofwhether or not they are encouraged. In addition, thesegroups are defined as a consequence of the treatment.Compliers are those who would affiliate only if theywere encouraged and would not affiliate only if theywere not encouraged, and so this group is defined priorto the encouragement but its members are not com-pletely observed. We propose a method that can beused to estimate CACE.

6.2 Design and Notation

The setup is the same as Section 3.2 except thatTjk represents whether the units in the j th cluster inthe kth pair are encouraged to receive the treatmentrather than whether it received the treatment. Recallthat T1k = Zk and T2k = 1 − Zk . Now, let Rijk(Tjk)

be the potential treatment receipt indicator variablesfor the ith unit in the j th cluster of the kth pair un-der the encouragement (Tjk = 1) and control (Tjk = 0)conditions. The observed treatment variable is, then,Rijk ≡ TjkRijk(1) + (1 − Tjk)Rijk(0). Similar to thepotential outcomes, these potential treatment variablesdepend on cluster-level encouragement variable ratherthan the unit-level encouragement variable, requiringa different interpretation of the resulting causal ef-fects. Finally, we write the potential outcomes as func-tions of (cluster-level) randomized encouragement and

Page 18: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

46 K. IMAI, G. KING AND C. NALL

actual receipt of treatment (at the unit-level),that is,Yijk(Rijk, Tjk). This formulation makes the followingassumption, which an extension of Assumption 1:

ASSUMPTION 3 (No interference between units).Let Rijk(T) be the potential outcomes for the ith unitin the j th cluster of the kth matched-pair where T isan (m × 2) matrix whose (j, k) element is Tjk . Fur-thermore, let Yijk(R,T) be the potential outcomes forthe ith unit in the j th cluster of the kth matched-pairwhere R is an (njk × m × 2) ragged array whose(i, j, k) element is Rijk . Then:

1. If Tjk = T ′jk , then Rijk(T) = Rijk(T′).

2. If Tjk = T ′jk and Rijk = R′

ijk , then Yijk(R,T) =Yijk(R′,T′).

In other words, this assumption requires that one per-son’s decision to affiliate has no effect on any otherperson’s outcomes within the same cluster; as such,the requirements are more demanding than for the ITTeffects above. This assumption might be violated forcertain health outcomes in the SPS evaluation: if allof one’s neighbors affiliate with SPS, the health carethey receive may reduce the prevalence of infectiousdiseases and so might thereby improve that person’shealth outcomes (an example of “herd immunity”).Relaxing this assumption thus remains an importantmethodological issue that seems worthy of future re-search.

The no interference assumption allows us to writeRijk(T) = Rijk(Tjk) and Yijk(R,T) = Yijk(Rijk, Tjk).Since T1k = Zk and T2k = 1 − Zk , both Rijk(Tjk) andYijk(Tjk) depend on Zk alone.

Extending the framework of Angrist, Imbens andRubin (1996) to CR trials, we make an exclusion re-striction so that cluster-level encouragement affects theunit-level outcome only through the unit-level receiptof the treatment:

ASSUMPTION 4 (Exclusion restriction). Yijk(r,

0) = Yijk(r,1) for r = 0,1 and all i, j , and k.

These assumptions together simplify the problem byenabling us to write the potential outcomes as func-tions of Tjk (or Zk) alone, that is, Yijk(Rjk, Tjk) =Yijk(Tjk).

Finally, following Angrist, Imbens and Rubin(1996), we call the units with Rijk(Tjk) = Tjk com-pliers (and denote them by Cijk = c), those withRijk(Tjk) = 1 always-takers (Cijk = a), those withRijk(Tjk) = 0 never-takers (Cijk = n), and the unitswith Rijk(Tjk) = 1 − Tjk defiers (Cijk = d). The

monotonicity assumption excludes the existence of de-fiers.

ASSUMPTION 5 (Monotonicity). There exists nodefier. That is, Rijk(1) ≥ Rijk(0) holds for all i, j, k.

In our Mexico evaluation, never-takers are those whowould not affiliate with SPS regardless of whether thegovernment encourages them to do so or not. SinceSPS was designed for the poor, many wealthy citizenswith their own preexisting health care arrangementsmay be never-takers. We expected a substantial propor-tion of the population to qualify as never-takers, andin fact estimate them at 56%. Always-takers are thosewho would affiliate with SPS regardless of assignment.These are more uncommon, and would likely be thepoor without access to health care who neverthelesshave the information and financial resources neces-sary to travel to the place to sign up for SPS and totravel back to receive care. (The estimated proportionof always-takers is only 7%.) The last type is defiers, orpeople who would affiliate with SPS if not encouragedto do so but would not affiliate if encouraged. Assum-ing the absence of defiers seems reasonable.

6.3 Estimation

If we assume sampling of both pairs of clusters andunits within each cluster, then the ITT causal effect canbe defined as ψP . Thus, ψ̂(N1k + N2k) can be usedto estimate this ITT effect, and the approximately un-biased estimation of its variance is possible using theresults given in Section 4.3.

Next, we consider population CACE. Under the as-sumption of simple random sampling of both clus-ters and units within each cluster, this estimand is de-fined as γ ≡ EP (Y (1) − Y(0)|C = c) = EP (Y (1) −Y(0))/EP (R(1) − R(0)), where the equality followsfrom the direct application Angrist, Imbens and Rubin(1996) to CR trials under the assumptions stated above.If we only assume simple random sampling of clustersas in UATE, then the expectation in γ is taken withrespect to the set U rather than P .

Thus, the instrumental variable estimator based onthe general weighted estimator in equation (3) isγ̂ (wk) ≡ ψ̂(wk)/τ̂ (wk), where τ̂ (wk) is the estimatorof the ITT effect on the receipt of the treatment

τ̂ (wk) ≡ 1∑mk=1 wk

·m∑

k=1

wk

{Zk

(∑n1k

i=1 Ri1k

n1k

−∑n2k

i=1 Ri2k

n2k

)

Page 19: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 47

+ (1 − Zk)

·(∑n2k

i=1 Ri2k

n2k

−∑n1k

i=1 Ri1k

n1k

)}.

When matching is effective or when cluster sizes areequal within each matched-pair, this estimator is con-sistent and approximately unbiased. Using a Taylor se-ries expansion, the variance of this estimator can beapproximated by

Varapu(γ̂ (wk))

≈ 1

{Eapu(τ̂ (wk))}4

· [{Eapu(τ̂ (wk))}2 Varapu(ψ̂(wk))(14)

+ {Eapu(ψ̂(wk))}2 Varapu(τ̂ (wk))

− 2Eapu(ψ̂(wk))Eapu(τ̂ (wk))

· Covapu(ψ̂(wk), τ̂ (wk))],where if simple random sampling of pairs of clustersalone is assumed, then the subscript “apu” (for assign-ment, pairs, and units) is replaced with “ap.” Further-more, the argument given in Section 4.3 implies, forexample, that the variance of γ̂ (w̃k) for estimating thesample CACE is on average less than the variance forthe population CACE given in equation (14).

Finally, Proposition 3 shows how to estimateVarapu(ψ̂(wk)), Varapu(τ̂ (wk)) (or Varap(ψ̂(wk)) andVarap(τ̂ (wk))) approximately without bias. Thus, weonly need an estimate of the covariance between ofψ̂(wk) and τ̂ (wk) from the observed data. Using thenormalized weights w̃k , Appendix A.5 proves that thefollowing estimator is approximately unbiased for bothCovapu(ψ̂(wk), τ̂ (wk)) and Covpu(ψ̂(wk), τ̂ (wk)) un-der their respective sampling assumptions:

ν̂(w̃k)

≡ m

(m − 1)n2

·m∑

k=1

[w̃k

{Zk

(∑n1k

i=1 Yi1k

n1k

−∑n2k

i=1 Yi2k

n2k

)+ (1 − Zk)

·(∑n2k

i=1 Yi2k

n2k

−∑n1k

i=1 Yi1k

n1k

)}

− nψ̂(w̃k)

m

]

·[w̃k

{Zk

(∑n1k

i=1 Ri1k

n1k

−∑n2k

i=1 Ri2k

n2k

)

+ (1 − Zk)

(∑n2k

i=1 Ri2k

n2k

−∑n1k

i=1 Ri1k

n1k

)}

− nτ̂ (w̃k)

m

].

7. SEGURO POPULAR EVALUATION

We now estimate the causal effect of SPS on theprobability of a household suffering catastrophic healthexpenditures (out-of-pocket health care expenditurestotaling more than 30% of a household’s annual post-subsistence or disposable income). As nearly 10% ofhouseholds suffer catastrophic health expenditures in ayear, it is easy to see why this would be a major pri-ority. We estimate all four target population quantitiesof interest (SATE, CATE, UATE, and PATE) both forthe intention to treat (ITT) effect of encouragement toaffiliate an the average causal effect among compliers(CACE). Although in most applications, substantiveinterest would narrow this list to one or a few of thesequantities, for our methodological purposes we presentall eight estimates (and standard errors) in Table 3.

A table like this will always have some of the samefeatures, no matter what variable is analyzed. Recall,for example, that point estimates of SATE and UATEare the same, as they are for CATE and PATE. In addi-tion, standard errors of UATE and PATE are the upperbounds of the standard errors for SATE and CATE, re-spectively. CACE estimates of course are never smallerthan those for ITT.

For the specific estimates, consider first the two toplines of Table 3 corresponding to all households. Forthese data, the CACE estimates are about 2.7 timeslarger than that for ITT. The large difference is be-cause of all those who had preexisting health care andso were largely never-takers. Overall, these results in-dicate that SPS was clearly successful in reducing themost devastating type of medical expenditures. Thedifferences among the columns indicate that the aver-age causal effect of encouragement to affiliate to SPS(the ITT effect) is somewhat larger in the population ofindividuals represented by our sample (−0.023) thanamong the individuals we directly observe (−0.014).The same is also true among compliers, but at a higherlevel (−0.038 vs. −0.064).

Substantively, these numbers are quite large. Sincethose who suffer from catastrophic health expendituresare mostly the poor without access to health insurance,they are likely to be disproportionately representedamong compliers as compared to the wealthy with pre-

Page 20: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

48 K. IMAI, G. KING AND C. NALL

TABLE 3Estimates of eight causal effect of SPS on the probability of catastrophic health expenditures for all households and male-headed

households (standard errors in parentheses)

SATE CATE UATE PATE

All ITT −0.014 (≤ 0.007) −0.023 (≤ 0.015) −0.014 (0.007) −0.023 (0.015)CACE −0.038 (≤ 0.018) −0.064 (≤ 0.024) −0.038 (0.018) −0.064 (0.024)

Male-headed ITT −0.016 (≤ 0.008) −0.025 (≤ 0.018) −0.016 (0.008) −0.025 (0.018)CACE −0.042 (≤ 0.020) −0.070 (≤ 0.031) −0.042 (0.020) −0.070 (0.031)

existing health care arrangements. As such, this analy-sis indicates that the causal effect of rolling out thepolicy reduces by about 23% the proportion of thosewho experience catastrophic expenditures (i.e., −0.023of the 10% with catastrophic expenditures). (Detailedanalyses of these and other substantive results from theSPS evaluation appear in King et al., 2009.)

8. CONCLUDING REMARKS

The methods developed here are designed for re-searchers lucky enough to be able to randomize treat-ment assignment, but stuck because of political orother constraints with having to randomize clustersof individuals rather than the individuals themselves.Field experiments in particular frequently require clus-ter randomization. Individual-level randomization wasimpossible in our evaluation of the Mexican SPS pro-gram; in fact, negotiations with the Mexican govern-ment began with the presumption that no type of ran-domization would be politically feasible, but it eventu-ally concluded by allowing cluster-level randomizationto be implemented.

When clusters of individuals are randomized ratherthan the individuals themselves, the best practiceshould involve three steps. First, researchers shouldchoose their causal quantity of interest, as definedin Section 3.3. They should then identify availablepre-treatment covariates likely to affect the outcomevariable, and, if possible, pair clusters based on thesimilarity of these covariates and cluster sizes; thisstep is severely underutilized and, when feasible, willtranslate into considerable research resources savedand numerous observations gained. Finally, researchersshould randomly choose one treated and one controlcluster within each pair. Claims in the literature aboutproblems with matched-pair cluster randomization de-signs are misguided: clusters should be paired prior torandomization when considered from the perspectiveof efficiency, power, bias or robustness.

Of course, administrative, political, ethical and otherissues will sometimes constrain the ability of re-searchers to pair clusters prior to randomization. Withthe results and new estimators offered here, the effortin the design of cluster-randomized experiments cannow shift from debates about when pairing is useful topractical discussions of how best to marshal creativearguments and procedures to ensure that clusters canmore often be paired prior to randomization.

Cornfield [(1978), pages 101–102] concludes hisnow classic study by writing that “Randomizationby cluster accompanied by an analysis appropriateto randomization by individual is an exercise in self-deception, . . . and should be avoided,” and an enor-mous literature has grown in many fields echoing thiswarning. We can now add that randomization by clusterwithout prior construction of matched pairs, when pair-ing is feasible, is an exercise in self-destruction. Failingto match can greatly reduce efficiency, power and ro-bustness, and is equivalent to discarding a large portionof experimental data or wasting grant money and inves-tigator effort. This result should affect practice, espe-cially in literatures like political science where exper-imental analyses routinely use cluster-randomizationbut examples of matched-pair designs have almostnever been used, as well as community consensusrecommendations for best practices in the conductand analysis of cluster-randomized experiments, whichclosely follow prior methodological literature. Theseinclude the extension to the “CONSORT” agreementamong the major biomedical journals (Campbell, El-bourne and Altman, 2004), the Cochrane Collabora-tion requirements for reviewing research (Higgins andGreen, 2006, Section 8.11.2), the prominent MedicalResearch Council (2002) guidelines, and the educa-tion research What Works Clearinghouse (2006). Eachwould seem to require crucial modifications in light ofthe results given here.

Page 21: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 49

APPENDIX A: MATHEMATICAL APPENDIX

A.1 Proof of Proposition 1

This proof uses a strategy similar to that of Proposi-tion 1 of Imai (2008). First, rewrite σ̂ (w̃k) as

(m − 1)n2

mσ̂(w̃k)

=m∑

k=1

[w̃k{ZkDk(1) + (1 − Zk)Dk(0)}

− 1

m

m∑k′=1

w̃k′ {Zk′Dk′(1)

+ (1 − Zk′)Dk′(0)}]2

= m − 1

m

m∑k=1

w̃2k{ZkDk(1)2 + (1 − Zk)Dk(0)2}

− 1

m

m∑k=1

∑k′ �=k

w̃kw̃k′ {ZkZk′Dk(1)Dk′(1)

+ Zk(1 − Zk′)Dk(1)Dk′(0)

+ (1 − Zk)Zk′Dk(0)Dk′(1)

+ (1 − Zk)(1 − Zk′)Dk(0)

· Dk′(0)}.Assumption 2 implies Ea(Zk) = 1/2 and Ea(ZkZk′) =1/4 for k �= k′. Thus, taking expectations over Zk andrearranging, gives

Ea(σ̂ (w̃k))

= 1

2n2

{m∑

k=1

w̃2k

(Dk(1)2 + Dk(0)2)

− 1

2(m − 1)

m∑k=1

∑k′ �=k

w̃kw̃k′(15)

· (Dk(1) + Dk(0)

)· (

Dk′(1) + Dk′(0))}

.

Finally, we compare this with the true variance expres-sion in (7): Ea(σ̂ (w̃k)) − Vara(ψ̂(w̃k)), which equals

1

4n2

{m∑

k=1

w̃2k{Dk(1) + Dk(0)}2

− 1

m − 1

m∑k=1

∑k′ �=k

w̃kw̃k′(Dk(1) + Dk(0)

)

· (Dk′(1) + Dk′(0)

)}

= m

4n2 var{w̃k

(Dk(1) + Dk(0)

)}.

This bias term is not identifiable because Dk(1) andDk(0) are not jointly observed for any k, implying thatthe variance is not identifiable either.

A.2 Proof of Proposition 2

Applying the law of iterated expectations to equa-tion (15), we have

Eau(σ̂ (w̃k))

= 1

2n2

[m∑

k=1

w̃2kEu{Dk(1)2 + Dk(0)2}

− 1

2(m − 1)(16)

·m∑

k=1

∑k′ �=k

w̃kw̃k′Eu

(Dk(1) + Dk(0)

)

· Eu

(Dk′(1) + Dk′(0)

)],

where the equality follows from the assumption thatsampling of units is independent across clusters. To-gether with the definition of Varau(ψ̂(w̃k)) givenabove, we have

Eau(σ̂ (w̃k)) − Varau(ψ̂(w̃k))

= 1

4n2

[m∑

k=1

w̃2k

{Eu

(Dk(1)2 + Dk(0)2)

− Varu(Dk(1)) − Varu(Dk(0))

+ 2Eu(Dk(1))Eu(Dk(0))}

− 1

m − 1

·m∑

k=1

∑k′ �=k

w̃kw̃k′Eu

(Dk(1) + Dk(0)

)

· Eu

(Dk′(1) + Dk′(0)

)]

= m

4n2 var{w̃kEu

(Dk(1) + Dk(0)

)}.

Since we do not observe Dk(1) and Dk(0) jointly, thisvariance is not identified.

Page 22: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

50 K. IMAI, G. KING AND C. NALL

A.3 Proof of Proposition 3

Since UATE is a special case of PATE where allunits within each cluster are observed (njk = Njk),we first derive the variance of ψ̂(w̃k) for PATE. LetD̃k(t) = wkDk(t), μ̃k(t) = Eu(D̃k(t)), and η̃k(t) =Varu(D̃k(t)) for t = 0,1. Then, randomizing the or-der of clusters within each pair implies Ec(η̃k) =Ec(η̃k(1)) = Ec(η̃k(0)) and Varc(μ̃k) = Varc(μ̃k(1)) =Varc(μ̃k(0)). Then, the variance is

Varapu(ψ̂(w̃k))

= 1

2mw̄2 Ep

[Varu(D̃k(1)) + Varu(D̃k(0))

+ 1

2

{Eu

(D̃k(1) − D̃k(0)

)}2]

+ 1

4mw̄2 Varp(Eu

(D̃k(1) + D̃k(0)

))= 1

mw̄2 {Ep(η̃k) + Varp(μ̃k)}.

When the estimand is UATE, η̃k = 0 for all k sincewithin-cluster means are observed without samplingvariability. Thus, Varapu(ψ̂(w̃k)) = Varp(μ̃k)/(mw̄2).

Next, we show that σ̂ (w̃k) is approximately unbiasedby applying the law of iterated expectations to Equa-tion (16):

Eapu(σ̂ (w̃k))

= 1

2mw̄2

[Ep

{w2

kEu

(Dk(1)2 + Dk(0)2)}

− 1

2

[Ep

{wkEu

(Dk(1) + Dk(0)

)}]2]

= 1

mw̄2 [Ep{Varu(D̃k)} + Varp(μ̃k)],

where Eu(D2k ) = Eu(Dk(0)2) = Eu(Dk(1)2) holds be-

cause the order of clusters within each pair is random-ized. For PATE, Varu(Dk) = 0 for all k since within-cluster means are observed without sampling uncer-tainty. Thus, Eap(σ̂ (w̃k)) = Varp(μ̃k)/(mw̄2).

A.4 Properties of the Harmonic Mean Estimatorand Standard Error

Modeling assumptions. The harmonic mean esti-mator, with weights based on the harmonic mean ofsample cluster sizes wk = n1kn2k/(n1k + n2k), stemsfrom the weighted one-sample t-test for the differ-

ence in means: Dkindep.∼ N (μ, (wk/

∑mk′=1 wk′)−1σ)

for k = 1,2, . . . ,m where wk is the known harmonicmean weight. In our context, Dk is the observedwithin-pair mean difference, that is, Dk ≡ ZkDk(1) +(1 − Zk)Dk(0) where Dk(1) ≡ ∑n1k

i=1 Yi1k(1)/n1k −∑n2k

i=1 Yi2k(0)/n2k and Dk(0) ≡ ∑n2k

i=1 Yi2k(1)/n2k −∑n1k

i=1 Yi1k(0)/n1k . It is well known that under thismodel,

∑mk=1 wkDk/

∑mk′=1 wk′ is the uniformly mini-

mum variance unbiased estimator.Although the derivation of this model is not dis-

cussed in the cluster randomization literature, a modelcommonly used in the statistics literature for other pur-poses gives rise to these weights (see e.g., Kalton,

1968): Yijk(t)i.i.d.∼ N (μt , σ̃ ) for t = 0,1 where σ̃ =

σ∑m

k=1 wk and∑m

k=1 wk is a known constant since wk

is assumed fixed. The normality assumption is not nec-essary for some inferential purposes, but this modeldoes require (1) independent and identical distributionsacross units within each cluster as well as (2) acrossclusters and pairs (which of course implies constantmeans and variances within and across clusters andpairs) and (3) equal variances for the two potential out-comes. In sum, the model assumes homogeneity withinand across matched-pairs. [Although we focus on thet-test here, for binary outcomes the suggested approachin the literature is also based on a homogeneity as-sumption where the odds ratio is assumed constantacross clusters; see, e.g., Donner and Donald (1987);Donner and Hauck, (1989).]

Bias conditions. The harmonic mean weight differsfrom our proposed weight in three ways. First, it givesmore weight to pairs with well-matched cluster sizesthan to pairs whose cluster sizes are unbalanced. Thatis, if we assume the sum n1k + n2k is fixed, the har-monic mean is the largest when n1k = n2k and becomessmaller as n1k − n2k increases. Second, and most im-portantly, this weight does not remove the bias whenwithin-cluster average treatment effects are identicalwithin pairs (so long as heterogeneity across matched-pairs remains), meaning that bias may remain evenwhen matching is effective. (The direction of the biasdepends on the data.) One condition under which it isunbiased is with exact matching on sample cluster sizes(i.e., n1k = n2k for all k), in which case this estimatorcoincides with our proposed estimator. Finally, sincethe weight is based on sample cluster sizes, this esti-mator is not valid for estimating CATE or PATE. Whenits assumptions hold, the harmonic mean estimator isuniformly minimum variance unbiased, and is clearlyuseful in those circumstances.

Page 23: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 51

Bias in the variance estimator. We show here thatthe variance estimator proposed in the literature (see,e.g., Donner, 1987; Donner and Donald, 1987; Don-ner and Klar, 1993) may be biased regardless of choiceof weights and the direction of bias is indeterminate.A condition under which this variance estimator is un-biased (and approximately equal to ours) is when m islarge and the weights are identical across pairs, whichis uncommon in practice. We first write this estimatorusing our notation:

δ̂(w̃k) ≡∑m

k=1 w̃2k

n3

·m∑

k=1

w̃k

{Zk

(∑n1k

i=1 Yi1k

n1k

−∑n2k

i=1 Yi2k

n2k

)+ (1 − Zk)(17)

·(∑n2k

i=1 Yi2k

n2k

−∑n1k

i=1 Yi1k

n1k

)

− ψ̂(w̃k)

}2

.

Next, we rewrite δ̂(w̃k) as

n3∑mk=1 w̃2

k

δ̂(w̃k)

=m∑

k=1

w̃k

[ZkDk(1) + (1 − Zk)Dk(0)

− 1

n

m∑k′=1

w̃k′ {Zk′Dk′(1)

+ (1 − Zk′)Dk′(0)}]2

=m∑

k=1

w̃k

[ZkDk(1)2 + (1 − Zk)Dk(0)2

− 2

n

m∑k′=1

w̃k′ {ZkDk(1)

+ (1 − Zk)Dk(0)}· {Zk′Dk′(1)

+ (1 − Zk′)Dk′(0)}

+ 1

n2

m∑k′=1

m∑k′′=1

w̃2k′w̃2

k′′

· {Zk′Dk′(1)

+ (1 − Zk′)Dk′(0)}

· {Zk′′Dk′′(1)

+ (1 − Zk′′)Dk′′(0)}].

Taking the expectation with respect to Zk , Ea(δ̂(w̃k)),gives∑m

k=1 w̃2k

2n3

m∑k=1

{(1 − w̃k

n

)w̃k

(Dk(1)2 + Dk(0)2)

− 1

2n

m∑k=1

∑k′ �=k

w̃kw̃k′(Dk(1) + Dk(0)

)

· (Dk′(1) + Dk′(0)

)}.

Comparing this expression with Ea(σ̂ (w̃k)) in equa-tion (15) shows a difference which remains even aftertaking the expectation with respect to simple randomsampling of pairs of clusters or units within clusters.Since σ̂ (w̃k) is an approximately unbiased estimate ofthe variance for UATE and PATE, δ̂(w̃k) may be bi-ased.

A.5 Covariance Estimation

This Appendix derives unbiased estimates ofCovauc(ψ̂(w̃k), τ̂ (w̃k)) and Covac(ψ̂(w̃k), τ̂ (w̃k)) us-ing the proofs in Propositions 1–3. First, we derivethe true covariance between ψ̂(w̃k) and τ̂ (w̃k). De-fine Gk(1) = ∑n1k

i=1 Ri1k(1)/n1k − ∑n2k

i=1 Ri2k(0)/n2k

and Gk(0) = ∑n2k

i=1 Ri2k(1)/n2k − ∑n1k

i=1 Ri1k(0)/n1k .Taking the expectation of with respect to Zk yields:Cova(ψ̂(w̃k), τ̂ (w̃k)) = 1

n2

∑mk=1 w̃2

k(Dk(1) − Dk(0)) ·(Gk(1) − Gk(0)). Then, we have

Covap(ψ̂(w̃k), τ̂ (w̃k))

= Ep{Cova(ψ̂(w̃k), τ̂ (w̃k))}+ Covp{Ea(ψ̂(w̃k)),Ea(τ̂ (w̃k))}

= 1

mw̄2 Covp(D̃k, G̃k),

where G̃k(t) = wkGk(t) for t = 0,1, and the lastequality follows from the fact that Ep(D̃k) =Ep(D̃k(t)), Ep(G̃k) = Ep(G̃k(t)) and Ep(D̃kG̃k) =Ep(D̃k(t)G̃k(t)) for t = 0,1. Similarly,

Covau(ψ̂(w̃k), τ̂ (w̃k))

= Eu{Cova(ψ̂(w̃k), τ̂ (w̃k))}+ Covu{Ea(ψ̂(w̃k)),Ea(τ̂ (w̃k))}

Page 24: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

52 K. IMAI, G. KING AND C. NALL

= 1

2w̄2

m∑k=1

[Covu(D̃k(1), G̃(1))

+ Covu(D̃k(0), G̃k(0))

+ 1

2{Eu(D̃k(1)) − Eu(D̃k(0))}

· {Eu(G̃k(1)) − Eu(G̃k(0))}].

And thus,Covapu(ψ̂(w̃k), τ̂ (w̃k))

= 1

mw̄2

[Covu(D̃k, G̃k)

+ 1

4Ep{Eu(D̃k(1)) − Eu(D̃k(0))}

· {Eu(G̃k(1)) − Eu(G̃k(0))}+ 1

4Covp{Eu(D̃k(1) + D̃k(0)),

Eu(G̃k(1) + G̃k(0))}]

= 1

mw̄2

[Ep{Covu(D̃k, G̃k)}

+ Covp{Eu(D̃k),Eu(G̃k)}].

Then, calculations analogous to the ones above showsthat Eap(ν̂(w̃k)) = Covap(ψ̂(w̃k), τ̂ (w̃k)) andEapu(ν̂(w̃k)) = Covapu(ψ̂(w̃k), τ̂ (w̃k)).

ACKNOWLEDGMENTS

Our proposed methods can be implemented us-ing an R package, experiment, which is available fordownload at the Comprehensive R Archive Network(http://cran.r-project.org). We thank Kevin Arceneaux,Jake Bowers, Paula Diehr, Ben Hansen, Jennifer Hill,and Dylan Small for helpful comments, and NeilKlar and Allan Donner for detailed suggestions andgracious extended conversations. For research sup-port, our thanks go to the National Institute of Pub-lic Health of Mexico, the Mexican Ministry of Health,the National Science Foundation (SES-0550873, SES-0752050), the Princeton University Committee on Re-search in the Humanities and Social Sciences and theInstitute for Quantitative Social Science at Harvard.

REFERENCES

ANGRIST, J. and LAVY, V. (2002). The effect of high school ma-triculation awards: Evidence from randomized trials. WorkingPaper 9389, National Bureau of Economic Research, Washing-ton, DC.

ANGRIST, J. D., IMBENS, G. W. and RUBIN, D. B. (1996). Identi-fication of causal effects using instrumental variables (with dis-cussion). J. Amer. Statist. Assoc. 91 444–455.

ARCENEAUX, K. (2005). Using cluster randomized field experi-ments to study voting behavior. The Annals of the AmericanAcademy of Political and Social Science 601 169–179.

BALL, S. and BOGATZ, G. A. (1972). Reading with television: Anevaluation of the electric company. Technical Report PR-72-2,Educational Testing Service, Princeton, NJ.

BLOOM, H. S. (2006). The core analytics of randomized experi-ments for social research. Technical report, MDRC.

BOX, G. E., HUNGER, W. G. and HUNTER, J. S. (1978). Statisticsfor Experimenters. Wiley, New York. MR0483116

BRAUN, T. M. and FENG, Z. (2001). Optimal permutation tests forthe analysis of group randomized trials. J. Amer. Statist. Assoc.96 1424–1432. MR1946587

CAMPBELL, M., ELBOURNE, D. and ALTMAN, D. (2004). CON-SORT statement: Extension to cluster randomised trials. BMJ328 702–708.

CAMPBELL, M., MOLLISON, J. and GRIMSHAW, J. (2001). Clustertrials in implementation research: Estimation of intracluster cor-relation coefficients and sample size. Statist. Med. 20 391–399.

CAMPBELL, M. J. (2004). Editorial: Extending consort toinclude cluster trials. BMJ 328 654–655. Available athttp://www.bmj.com/cgi/content/full/328/7441/654%20?q=y.

CORNFIELD, J. (1978). Randomization by group: A formalanalysis. American Journal of Epidemiology 108 100–102.

COX, D. R. (1958). Planning of Experiments. Wiley, New York.MR0095561

DONNER, A. (1987). Statistical methodology for paired clusterdesigns. American Journal of Epidemiology 126 972–979.

DONNER, A. (1998). Some aspects of the design and analysis ofcluster randomization trials. Appl. Statist. 47 95–113.

DONNER, A. and DONALD, A. (1987). Analysis of data arisingfrom a stratified design with the cluster as unit of randomiza-tion. Statist. Med. 6 43–52.

DONNER, A. and HAUCK, W. (1989). Estimation of a commonodds ration in paired-cluster randomization designs. Statist.Med. 8 599–607.

DONNER, A. and KLAR, N. (1993). Confidence interval construc-tion for effect measures arising from cluster randomizationtrials. Journal of Clinical Epidemiology 46 123–131.

DONNER, A. and KLAR, N. (2000a). Design and Analysis ofCluster Randomization Trials in Health Research. OxfordUniv. Press, New York.

DONNER, A. and KLAR, N. (2000b). Design and Analysis of Clus-ter Randomization Trials in Health Research. Arnold, London.

DONNER, A. and KLAR, N. (2004). Pitfalls of and controversiesin cluster randomization trials. American Journal of PublicHealth 94 416–422.

FENG, Z., DIEHR, P., PETERSON, A. and MCLERRAN, D. (2001).Selected statistical issues in group randomized trials. AnnualReview of Public Health 22 167–187.

FISHER, R. A. (1935). The Design of Experiments. Oliver andBoyd, London.

FRANGAKIS, C. E., RUBIN, D. B. and ZHOU, X.-H. (2002).Clustered encouragement designs with individual noncompli-ance: Bayesian inference with randomization, and applicationto advance directive forms (with discussion). Biostatistics 3147–164.

FRENK, J., SEPÚLVEDA, J., GÓMEZ-DANTÉS, O. and KNAUL,F. (2003). Evidence-based health policy: Three generations of

Page 25: The Essential Role of Pair Matching in Cluster-Randomized ... · THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 31 million uninsured Mexicans” (Frenk et al.,

THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 53

reform in Mexico. The Lancet 362 1667–1671.GAIL, M. H., BYAR, D. P., PECHACEK, T. F. and CORLE, D. K.

(1992). Aspects of statistical design for the community inter-vention trial for smoking cessation (COMMIT). ControlledClinical Trials 13 16–21.

GAIL, M. H., MARK, S. D., CARROLL, R. J., GREEN, S. B. andPEE, D. (1996). On design considerations and randomization-based inference for community intervention trials. Statist. Med.15 1069–1992.

GREEVY, R., LU, B., SILBER, J. H. and ROSENBAUM, P.(2004). Optimal multivariate matching before randomization.Biostatistics 5 263–275.

HAYES, R. and BENNETT, S. (1999). Simple sample size calcu-lation for cluster-randomized trials. International Journal ofEpidemiology 28 319–326.

HIGGINS, J. and GREEN, S., eds. (2006). Cochrane Handbook forSystematic Review of Interventions 4.2.5 [Updated September2006]. Wiley, Chichester, UK.

HILL, J. L., RUBIN, D. B. and THOMAS, N. (1999). The design ofthe New York school choice scholarship program evaluation. InResearch Designs: Inspired by the Work of Donald Campbell(L. Bickman, ed.) 155–180. Sage, Thousand Oaks.

HOLLAND, P. W. (1986). Statistics and causal inference. J. Amer.Statist. Assoc. 81 945–960. MR0867618

IMAI, K. (2008). Variance identification and efficiency analysis inrandomized experiments under the matched-pair design. Statist.Med. 27 4857–4873.

IMAI, K., KING, G. and STUART, E. A. (2008). Misunderstandingsamong experimentalists and observationalists about causalinference. J. Roy. Statist. Soc., Ser. A 171 481–502.

IMAI, K., KING, G. and NALL, C. (2009). Replicationdata for: The essential role of pair matching in cluster-randomized experiments, with application to the Mexicanuniversal health insurance evaluation hdl:1902.1/11047UNF:3:jeUN9XODtYUp2iUbe8gWZQ== Murray ResearchArchive [Distributor].

KALTON, G. (1968). Standardization: A technique to control forextraneous variables. Appl. Statist. 17 118–136. MR0234599

KING, G., GAKIDOU, E., RAVISHANKAR, N., MOORE, R. T.,LAKIN, J., VARGAS, M., TÉLLEZ-ROJO, M. M., ÁVILA,J. E. H., ÁVILA, M. H. and LLAMAS, H. H. (2007). A ‘politi-cally robust’ experimental design for public policy evaluation,with application to the Mexican universal health insurance pro-gram. Journal of Policy Analysis and Management 26 479–506.Available at http://gking.harvard.edu/files/abs/spd-abs.shtml.

KING, G., GAKIDOU, E., IMAI, K., LAKIN, J., MOORE, R.T., RAVISHANKAR, N., VARGAS, M., TÈLLEZ-ROJO,M. M., ÁVILA, J. E. H., ÁVILA, M. H. and LLA-MAS, H. H. (2009). Public policy for the poor? A ran-domised assessment of the Mexican universal health in-surance programme. The Lancet. To appear. Available athttp://gking.harvard.edu/files/abs/spi-abs.shtml.

KLAR, N. and DONNER, A. (1997). The merits of matching incommunity intervention trials: A cautionary tale. Statist. Med.16 1753–1764.

KLAR, N. and DONNER, A. (1998). Author’s reply. Statist. Med.17 2151–2152.

MALDONADO, G. and GREENLAND, S. (2002). Estimating causaleffects. International Journal of Epidemiology 31 422–429.

MARTIN, D. C., DIEHR, P., PERRIN, E. B. and KOEPSELL, T. D.(1993). The effect of matching on the power of randomizedcommunity intervention studies. Statist. Med. 12 329–338.

MCLAUGHLAN, G. and PEEL, D. (2000). Finite Mixture Models.Wiley, New York. MR1789474

MEDICAL RESEARCH COUNCIL (2002). Cluster randomizedtrials: Methodological and ethical considerations. Technical re-port, MRC Clinical Trials Series. Available at http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002406.

MOULTON, L. (2004). Covariate-based constrained randomizationof group-randomized trials. Clinical Trials 1 297.

MURRAY, D. M. (1998). Design and Analysis of CommunityTrials. Oxford Univ. Press, Oxford.

NEYMAN, J. (1923). On the application of probability theory toagricultural experiments: Essay on principles, section 9. Statist.Sci. 5 465–480. (Translated in 1990.) MR1092986

RAUDENBUSH, S. W. (1997). Statistical analysis and optimaldesign for cluster-randomized trials. Psychological Methods 2173–185.

RAUDENBUSH, S. W., MARTINEZ, A. and SPYBROOK, J. (2007).Strategies for improving precision in group-randomized exper-iments. Educational Evaluation and Policy Analysis 29 5–29.

ROSENBAUM, P. R. (2007). Interference between units in ran-domized experiments. J. Amer. Statist. Assoc. 102 191–200.MR2345537

RUBIN, D. B. (1990). Comments on “On the application of prob-ability theory to agricultural experiments. Essay on principles.Section 9” by J. Splawa-Neyman translated from the Polishand edited by D. M. Dabrowska and T. P. Speed. Statist. Sci. 5472–480. MR1092987

RUBIN, D. B. (1991). Practical implications of modes of statisticalinference for causal effects and the critical role of the assign-ment mechanism. Biometrics 47 1213–1234. MR1157660

SMALL, D., TEN HAVE, T. and ROSENBAUM, P. (2008). Random-ization inference in a group-randomized trial of treatments fordepression: Covariate adjustment, noncompliance and quantileeffects. J. Amer. Statist. Assoc. 103 271–279.

SNEDECOR, G. W. and COCHRAN, W. G. (1989). Statistical Meth-ods, 8th ed. Iowa State Univ. Press, Ames, IA. MR1017246

SOBEL, M. E. (2006). What do randomized studies of housing mo-bility demonstrate?: Causal inference in the face of interference.J. Amer. Statist. Assoc. 101 1398–1407. MR2307573

SOMMER, A., DJUNAEDI, E., LOEDEN, A. A., TARWOTJO, I. J.,WEST, K. P. and TILDEN, R. (1986). Impact of vitamin Asupplementation on childhood mortality, a randomized clinicaltrial. Lancet 1 1169–1173.

THOMPSON, S. G. (1998). Letter to the editor: The merits ofmatching in community intervention trials: A cautionary taleby N. Klar and A. Donner. Statist. Med. 17 2149–2151.

TURNER, R. M., WHITE, I. R. and CROUDACE, T. (2007).Analysis of cluster-randomized cross-over data. Statist. Med.26 274–289. MR2312715

VARNELL, S., MURRAY, D., JANEGA, J. and BLITSTEIN, J.(2004). Design and analysis of group-randomized trials: A re-view of recent practices. American Journal of Public Health 93393–399.

WEI, L. J. (1982). Interval estimation of location difference withincomplete data. Biometrika 69 249–251. MR0655691

WHAT WORKS CLEARINGHOUSE (2006). Evidence standardsfor reviewing studies. Technical report, Institute for Educa-tional Sciences. Available at http://www.whatworks.ed.gov/reviewprocess/standards.html.